The Unix Shell
Scripts are files that hold a series of commands that are to be interpreted and executed.
Because scripts can be written in any language, the computer needs to know which interpreter to use to interpret the commands. The hashbang, #!
(or shebang), followed by the location, tells the computer where to find the interpreter for the specific language that is being used.
These are usually found in specific locations, but can vary slightly. The options below are worth trying.
#!/bin/bash
#!/usr/bin/env bash
Using any script is similar to running the commands we’ve learned.
bash script.sh
./script.sh
– need to have permissions to run it this wayLet’s put this into action with an example RNA-seq workflow.
RNA-sequencing (RNA-seq) is a method that allows for the sequencing of complex mixtures of RNA. Different analyses can be peformed with these data, including assessing
While there are variations to RNA-seq analyses, generally, the following steps are performed
1. Demultiplex reads
2. Initial read quality check
3. Filter and trim sequencing reads
4. Normalize sequencing reads
5. _De novo_ assembly of transcripts (if a reference genome is not available)
6. Map (align) sequencing reads to reference genome or transcriptome
7. Annotate transcripts assembled or to which reads have been mapped
8. Count mapped reads to estimate transcript abundance
9. Perform statistical analysis to identify differential expression (or differential splicing) among samples or treatments
10. Perform multivariate statistical analysis/visualization to assess transcriptome-wide differences among samples
There are MANY tools that can be used at each of these steps, each with their own pros and cons. For this lesson, we will use the following, stopping after quantification:
Step | Tool |
---|---|
Demultiplex | Performed already (usually using bcl2fastq ) |
Initial read QC | FastQC |
Filter and trim reads | TrimGalore! |
Alignment | STAR |
Abundance quantification | Subread |
Our dataset here sequencing data files (fastq
files) from 6 samples, that are divided into 2 groups (1 and 2).
Sample Name | Group |
---|---|
sample_01 | 1 |
sample_02 | 1 |
sample_03 | 1 |
sample_04 | 2 |
sample_05 | 2 |
sample_06 | 2 |
We’ll implement this analysis using bash scripts.
I will walk through the initial step here, then we will work more hands-on in each step, building the scripts together.
One of the first steps will be to check the quality of the sequencing reads with FastQC
. The options we will be using are shown below.
Option | Description |
---|---|
-o |
output directory |
-noextract |
do not unzip (extract) the final output |
-f |
the format of the input file is fastq |
-t |
use this number of threads for processing |
[input files] |
input file(s) |
The following is how this command could be run at the command line for sample_01
.
Note: The \
at the end of the line allows me to break up the command onto different lines, but the command in still interpreted as a single command.
How would we put this into a bash script?
.sh
Try to create a bash script for this first step.
A common next step, should the FastQC
analysis show a high-level of adapter content, or poor quality sequencing, would be to perform adapter trimming and read filtering or quality trimming. We’re going to pretend we need to trim and will be using TrimGalore!
.
Option | Description |
---|---|
-q |
quality value threshold for base calls, cuts bases with score below value |
--length |
remove reads that become shorter than this length |
--basename |
the basename of the sample/file |
-o |
output directory |
--paired |
if paired-end reads, the paired read files |
How would we put this into a bash script?
The next step is to align filtered reads. We will use STAR
for this, with the following options.
Option | Description |
---|---|
-genomeDir |
path to the genome directory, with the index files |
--readFilesIn |
the path to the reads files |
--sjdbGTFfile |
the path to the GTF file associated with the genome file |
-outFileNamePrefix |
output prefix, with directory in front |
--runThreadN |
number of threads to use, if possible |
--outSAMattributes |
alignment attributes to report |
--outSAMtype |
the format to output (SAM/BAM, etc) and whether to sort |
How would we put this into a bash script?
In the next step, we use featureCounts
from Subread
to quantify the expression of genes.
Option | Description |
---|---|
-O |
assign reads to their overlapping features |
-p |
count as fragments, with paired end reads |
-B |
count only reads where both pairs align |
-C |
do not count if reads map to different chromosomes or different strands |
-T |
threads to use |
-s |
strand-specific read counting |
-a |
annotation file (GTF, etc.) |
-o |
output file name |
[file] |
the format to output (SAM/BAM, etc) and whether to sort |
How would we put this into a bash script?
These scripts used here, can be improved upon, by combining them, generalizing more, adding in extra sanity checks, and taking input from the commandline, etc.
I encourage you to explore more, build and use these skills.