The Unix Shell
Values can be temporarily stored in items called variables.
This is very useful in looping and scripting, particularly when we may not know or be able to keep track of values.
Interestingly, we use different syntax when assigning/unsetting and using variables.
setting variables – use variable=value
using variables – use $variable
unsetting variables – use unset variable
#create a variable named file_type and assign it a value of fastq
file_type="fastq"
#call the file_type variable, print it to the screen
echo "the value after setting:" $file_type
#unset (or remove) the variable assignment
unset file_type
#check for the value of file_type
echo "the value after unsetting:" $file_type
the value after setting: fastq
the value after unsetting:
Q&A: Which of the following correctly assigns the value of fastq to a variable named file_suffix?
fastq=$file_suffix
fastq = $file_suffix
fastq=file_suffix
file_suffix=fastq
file_suffix=$fastq
Q&A: Which of the following correctly assigns the value of trt to a variable named var1?
var1=${trt}
var1 =trt
var1=trt
var1=$trt
var1="trt"
Q&A: How can I save the value of the directory that I am in, as a variable named start_dir?
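One possible approach, offered as a sketch rather than the official solution: command substitution, `$(...)`, captures the output of `pwd` so it can be assigned like any other value. The `cd /` step is only there to illustrate returning to the saved directory.

```bash
# Save the current working directory in a variable
start_dir=$(pwd)
echo "Started in: $start_dir"

# Move somewhere else, then return using the saved value
cd /
cd "$start_dir"
pwd
```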
Q&A: What would the value of out_var=$"(ls)" be?
Q&A: What would the final output be after running the following in a terminal?
Let’s see how we can use variables, combined with previous commands/methods in a quick analysis.
From the example.gtf file (downloaded and used in the previous lesson), which chromosome has the highest number of genes? What about exons?
The initial structure of the file is shown below.
# cd to ~/Desktop/shell-lesson-data
cd ~/Desktop/shell-lesson-data
# view first few lines of the file
head example.gtf
#!genome-build CNA3
#!genome-version CNA3
#!genome-date 2015-11
#!genome-build-accession GCA_000149245.3
#!genebuild-last-updated 2015-11
1 ena gene 100 5645 . - . gene_id "CNAG_04548"; gene_source "ena"; gene_biotype "protein_coding";
1 ena transcript 100 5645 . - . gene_id "CNAG_04548"; transcript_id "AFR92135"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
1 ena exon 5494 5645 . - . gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AFR92135-1";
1 ena CDS 5494 5645 . - 0 gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; protein_id "AFR92135"; protein_version "1";
1 ena start_codon 5643 5645 . - 0 gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
From last time: we need to remove the leading comment lines of the file to make it easier to work with, using grep -v '^#'. Then we can cut the fields that we need (chromosome and feature type), count each combination with sort | uniq -c, and keep only the gene totals with grep 'gene'.
# cd to directory
cd ~/Desktop/shell-lesson-data
# pull out gene biotype totals
cat example.gtf | grep -v '^#' | cut -f1,3 | sort | uniq -c | grep 'gene'
1033 1 gene
474 10 gene
663 11 gene
326 12 gene
322 13 gene
417 14 gene
706 2 gene
725 3 gene
503 4 gene
812 5 gene
640 6 gene
641 7 gene
639 8 gene
554 9 gene
42 Mt gene
We’re not quite there yet. Let’s capture the output as variables, named chr_n_gene and chr_n_exon, to use later.
Note: We’re introducing awk here, a language that is quite useful in parsing text, to print out the second column, $2.
# cd to directory
cd ~/Desktop/shell-lesson-data
# pull out gene biotype totals
# grab the first line
# use awk to print the 2nd column
biotype_gene="gene"
biotype_exon="exon"
chr_n_gene=$(cat example.gtf | grep -v '^#' | cut -f1,3 | sort | uniq -c | grep "$biotype_gene" | head -n 1 | awk '{print $2;}')
chr_n_exon=$(cat example.gtf | grep -v '^#' | cut -f1,3 | sort | uniq -c | grep "$biotype_exon" | head -n 1 | awk '{print $2;}')
echo "The chromosome with the most $biotype_gene is: $chr_n_gene"
echo "The chromosome with the most $biotype_exon is: $chr_n_exon"
The chromosome with the most gene is: 1
The chromosome with the most exon is: 1
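One caveat: `head -n 1` takes the first line in the order produced by `sort`, which is ordered by chromosome name, not by count. In the gene totals above, chromosome 1 also has the largest count (1033), so the answer is the same. To rank by count explicitly, a numeric reverse sort on the first column can be added before `head`. The sketch below uses a small made-up data set (the `features` variable is hypothetical, standing in for the output of `grep -v '^#' example.gtf | cut -f1,3`) so that the difference is visible.

```bash
# Hypothetical chromosome/feature pairs: chr 1 has 1 gene, chr 2 has 3, chr 10 has 2
features=$(printf '1\tgene\n2\tgene\n2\tgene\n2\tgene\n10\tgene\n10\tgene\n')

# Count each pair, then sort numerically on the count column, largest first
top_chr=$(echo "$features" | sort | uniq -c | grep "gene" | sort -k1,1nr | head -n 1 | awk '{print $2}')

echo "The chromosome with the most gene is: $top_chr"
```

Here the result is chromosome 2, whereas `head -n 1` without the extra sort would pick the first chromosome listed.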
The ability to capture values and reuse them in later commands really pays off once we get into loops.
Loops allow us to perform a command (or set of commands) on each item in a list.
Bash for loops follow a specific syntax.
Key components of the syntax
for, in, do, done – tell bash when portions of the loop are coming
item – a variable that holds the value of an item from the list for an iteration of the loop
list – a set of items (list or array) to iterate over
commands – the command(s) performed with each item in the list or array
Let’s work through an example from ~/Desktop/shell-lesson-data/exercise-data/creatures, printing out the first two lines of each file.
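A minimal sketch of such a loop is shown here. The first few lines only build a scratch directory with stand-in files (their contents are made up for illustration; the real files live in exercise-data/creatures). The loop itself is the last four lines.

```bash
# Setup: scratch directory with three small stand-in files (hypothetical contents)
workdir=$(mktemp -d)
cd "$workdir"
printf 'COMMON NAME: basilisk\nline 2\nline 3\n' > basilisk.dat
printf 'COMMON NAME: minotaur\nline 2\nline 3\n' > minotaur.dat
printf 'COMMON NAME: unicorn\nline 2\nline 3\n' > unicorn.dat

# The loop: print the first two lines of each file in the list
for filename in basilisk.dat minotaur.dat unicorn.dat
do
    head -n 2 "$filename"
done
```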
Walking through the 4 lines, line-by-line.
for tells the computer we are entering a loop.
filename is created, which is initially empty.
in tells the computer to create an empty list.
basilisk.dat, minotaur.dat, and unicorn.dat are added to the list.
do tells the computer to listen for the following commands to perform on each item in the list.
The commands in the loop body refer to the current item as $filename.
In the example above, there are 3 iterations of the loop. Notice how the value of filename changes with each iteration.
| Iteration | filename | list |
|---|---|---|
| 1 | basilisk.dat | basilisk.dat minotaur.dat unicorn.dat |
| 2 | minotaur.dat | basilisk.dat minotaur.dat unicorn.dat |
| 3 | unicorn.dat | basilisk.dat minotaur.dat unicorn.dat |
A while loop is another useful type of loop in bash and follows a specific syntax.
Key components of the syntax
while, do, done – tell bash when portions of the loop are coming
condition – a condition to be met for the loop to continue (“while true”)
commands – the command(s) performed on each pass while the condition holds
Let’s see an example where we print out numbers less than or equal to 7 (-le).
Note: We can increment num by 1 each time by reassigning the value of num, num=$(($num+1)).
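A loop matching this description might look like the following sketch. The starting value of 1 is an assumption, chosen to agree with the output listed next.

```bash
# Start counting at 1 (assumed starting value)
num=1

# Keep looping while num is less than or equal to 7
while [ "$num" -le 7 ]
do
    echo "$num is less than or equal to 7."
    # Increment num by 1 so the condition eventually becomes false
    num=$(($num+1))
done
```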
1 is less than or equal to 7.
2 is less than or equal to 7.
3 is less than or equal to 7.
4 is less than or equal to 7.
5 is less than or equal to 7.
6 is less than or equal to 7.
7 is less than or equal to 7.
Returning to our earlier gtf example, we can now identify the chromosomes with the most of several biotypes with a loop.
# cd to directory
cd ~/Desktop/shell-lesson-data
for bt in gene exon transcript CDS start_codon
do
chr_n=$(cat example.gtf | grep -v '^#' | cut -f1,3 | sort | uniq -c | grep "$bt" | head -n 1 | awk '{print $2;}')
echo "The chromosome with the most $bt is: $chr_n"
done
The chromosome with the most gene is: 1
The chromosome with the most exon is: 1
The chromosome with the most transcript is: 1
The chromosome with the most CDS is: 1
The chromosome with the most start_codon is: 1
We can take this further and capture all of the types of biotypes as an array to pass to the loop as a variable.
Note: An item at position x in an array can be accessed via ${array[x]}. In a loop, we use ${array[@]} to expand to every item in the array.
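As a small illustration of array syntax, here is a sketch using a made-up array named `creatures` (not part of the lesson data). Bash arrays are created with parentheses, and positions start at 0.

```bash
# Create an array with three items
creatures=(basilisk minotaur unicorn)

# Access a single item by position (the first item is at position 0)
echo "First item: ${creatures[0]}"

# Count the items in the array
echo "Number of items: ${#creatures[@]}"

# Loop over every item; the quotes keep each item intact
for c in "${creatures[@]}"
do
    echo "creature: $c"
done
```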
# cd to directory
cd ~/Desktop/shell-lesson-data
# capture the types of biotypes as an array
# the outer parentheses turn the command output into a true bash array
btype_array=($(cat example.gtf | grep -v '^#' | cut -f3 | sort | uniq))
for bt in "${btype_array[@]}"
do
chr_n=$(cat example.gtf | grep -v '^#' | cut -f1,3 | sort | uniq -c | grep "$bt" | head -n 1 | awk '{print $2;}')
echo "The chromosome with the most $bt is: $chr_n"
done
The chromosome with the most CDS is: 1
The chromosome with the most exon is: 1
The chromosome with the most five_prime_utr is: 1
The chromosome with the most gene is: 1
The chromosome with the most start_codon is: 1
The chromosome with the most stop_codon is: 1
The chromosome with the most three_prime_utr is: 1
The chromosome with the most transcript is: 1
Q&A: Write a loop that would print out the months of the year. Create an array that holds the months.
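One possible solution, given as a sketch (the names `months` and `month` are arbitrary choices):

```bash
# Array holding the twelve months
months=(January February March April May June July August September October November December)

# Print each month on its own line
for month in "${months[@]}"
do
    echo "$month"
done
```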
Q&A: Look at the following code and output.
What would be the output of the following code?
Hopefully you’ve seen how helpful variables and loops can be. Next, we’ll put things together with bash scripts.