Bash Scripting
Scripts
Scripts are files that hold a series of commands that are to be interpreted and executed.
Starting a Script
Because scripts can be written in any language, the computer needs to know which interpreter to use to interpret the commands. The hashbang, #!
(or shebang), followed by the location, tells the computer where to find the interpreter for the specific language that is being used.
These are usually found in specific locations, but can vary slightly. The options below are worth trying.
#!/bin/bash
#!/usr/bin/env bash
Using a Script
Using any script is similar to running the commands we’ve learned.
bash script.sh
./script.sh
– need to have permissions to run it this way
Putting this into practice – Processig RNA-seq Data
Let’s put this into action with an example RNA-seq workflow.
RNA-sequencing
RNA-sequencing (RNA-seq) is a method that allows for the sequencing of complex mixtures of RNA. Different analyses can be peformed with these data, including assessing
- gene expression (changes over time, or differences between groups/treatments)
- transcript splicing
- post-transcriptional modifications
- gene fusions
- and more
General Analysis Steps
While there are variations to RNA-seq analyses, generally, the following steps are performed
- Demultiplex reads
- Initial read quality check
- Filter and trim sequencing reads
- Normalize sequencing reads
- De novo assembly of transcripts (if a reference genome is not available)
- Map (align) sequencing reads to reference genome or transcriptome
- Annotate transcripts assembled or to which reads have been mapped
- Count mapped reads to estimate transcript abundance
- Perform statistical analysis to identify differential expression (or differential splicing) among samples or treatments
- Perform multivariate statistical analysis/visualization to assess transcriptome-wide differences among samples
Tools/Software
There are MANY tools that can be used at each of these steps, each with their own pros and cons. For this lesson, we will use the following, stopping after quantification:
Step | Tool |
---|---|
Demultiplex | Performed already (usually using bcl2fastq ) |
Initial read QC | FastQC |
Filter and trim reads | TrimGalore! |
Alignment | STAR |
Abundance quantification | Subread |
Sample Datasets
Our dataset here sequencing data files (fastq
files) from 6 samples, that are divided into 2 groups (1 and 2).
Sample Name | Group |
---|---|
sample_01 | 1 |
sample_02 | 1 |
sample_03 | 1 |
sample_04 | 2 |
sample_05 | 2 |
sample_06 | 2 |
Analysis
We’ll implement this analysis using bash scripts.
I will walk through the initial step here, then we will work more hands-on in each step, building the scripts together.
Initial Read QC
One of the first steps will be to check the quality of the sequencing reads with FastQC
. The options we will be using are shown below.
Option | Description |
---|---|
-o |
output directory |
-noextract |
do not unzip (extract) the final output |
-f |
the format of the input file is fastq |
-t |
use this number of threads for processing |
[input files] |
input file(s) |
The following is how this command could be run at the command line for sample_01
.
Note: The \
at the end of the line allows me to break up the command onto different lines, but the command in still interpreted as a single command.
fastqc \
-o ~/Desktop/rnaseq \
--noextract \
-f fastq \
-t 4 \
sample_01_1.fastq
How would we put this into a bash script?
- put the command in a plain text file, saved with a suffix
.sh
- specify the interpreter with the hashbang
- add in the code
- generalize if possibe, with variables, etc.
Try to create a bash script for this first step.
Read QC Trimming
A common next step, should the FastQC
analysis show a high-level of adapter content, or poor quality sequencing, would be to perform adapter trimming and read filtering or quality trimming. We’re going to pretend we need to trim and will be using TrimGalore!
.
Option | Description |
---|---|
-q |
quality value threshold for base calls, cuts bases with score below value |
--length |
remove reads that become shorter than this length |
--basename |
the basename of the sample/file |
-o |
output directory |
--paired |
if paired-end reads, the paired read files |
How would we put this into a bash script?
Aligning Reads
The next step is to align filtered reads. We will use STAR
for this, with the following options.
Note: The trimming tool will output a specific file format of the sample name, followed by _val_
, then the read, then end in .fq
. This must be taken into account for the next step.
Option | Description |
---|---|
-genomeDir |
path to the genome directory, with the index files |
--readFilesIn |
the path to the reads files |
--sjdbGTFfile |
the path to the GTF file associated with the genome file |
-outFileNamePrefix |
output prefix, with directory in front |
--runThreadN |
number of threads to use, if possible |
--outSAMattributes |
alignment attributes to report |
--outSAMtype |
the format to output (SAM/BAM, etc) and whether to sort |
How would we put this into a bash script?
Quantifying Expression
In the next step, we use featureCounts
from Subread
to quantify the expression of genes.
Option | Description |
---|---|
-O |
assign reads to their overlapping features |
-p |
count as fragments, with paired end reads |
-B |
count only reads where both pairs align |
-C |
do not count if reads map to different chromosomes or different strands |
-T |
threads to use |
-s |
strand-specific read counting |
-a |
annotation file (GTF, etc.) |
-o |
output file name |
[file] |
the format to output (SAM/BAM, etc) and whether to sort |
How would we put this into a bash script?
These scripts used here, can be improved upon, by combining them, generalizing more, adding in extra sanity checks, and taking input from the commandline, etc.
I encourage you to explore more, build and use these skills.