The Unix Shell
Values can be temporarily stored in items called variables.
This is very useful in looping and scripting, particularly when we may not know or be able to keep track of values.
Interestingly, we use different syntax when assigning/unsetting and using variables.
setting variables – use variable=value
using variables – use $variable
unsetting variables – use unset variable
#create a variable named file_type and assign it a value of fastq
file_type="fastq"
#call the file_type variable, print it to the screen
echo "the value after setting:" $file_type
#unset (or remove) the variable assignment
unset file_type
#check for the value of file_type
echo "the value after unsetting:" $file_type
the value after setting: fastq
the value after unsetting:
Tips and Tricks with Variables
= – variable = value will not do what we want. With spaces around the =, bash tries to run a command named variable.
command "$variable" – quoting prevents issues when variable values have spaces, etc.
$() – variable=$(command x) stores the output of command x as variable.
${variable} – "${file_type}1" from above would be fastq1.
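The tips above can be tried out with a short sketch. The values, the sample_name variable, and the count_args helper are made up for illustration; they are not part of the lesson data.

```shell
# Hypothetical values chosen for illustration only.
file_type="fastq"
sample_name="my sample"          # a value that contains a space

# Helper (an assumption for this sketch): prints how many arguments it received.
count_args() { echo "$#"; }

# Quoted, the value stays one argument; unquoted, bash splits it at the space.
quoted_count=$(count_args "$sample_name")
unquoted_count=$(count_args $sample_name)
echo "quoted: $quoted_count, unquoted: $unquoted_count"

# $() captures the output of a command as a value.
upper_type=$(echo "$file_type" | tr 'a-z' 'A-Z')
echo "$upper_type"

# ${} marks where the variable name ends, so the 1 is appended as text.
# Without the braces, $file_type1 would look up a variable named file_type1.
suffixed="${file_type}1"
echo "$suffixed"
```

In bash, the quoted call should report one argument and the unquoted call two, which is the kind of surprise that quoting protects against.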
Q&A: Which of the following correctly assigns the value of fastq to a variable named file_suffix?
fastq=$file_suffix
fastq = $file_suffix
fastq=file_suffix
file_suffix=fastq
file_suffix=$fastq
Answer
fastq=$file_suffix – No. Refers to a variable that doesn’t exist and wrong order.
fastq = $file_suffix – No. The added space tries to call a command named fastq. Also, this is the wrong order.
fastq=file_suffix – No. This is the wrong order and would create a variable called fastq.
file_suffix=fastq – Yes.
file_suffix=$fastq – No. Refers to a variable that doesn’t exist.
Q&A: Which of the following correctly assigns the value of trt to a variable named var1?
var1=${trt}
var1 =trt
var1=trt
var1=$trt
var1="trt"
Answer
var1=${trt} – No. Refers to a variable that doesn’t exist.
var1 =trt – No. The added space tries to call a command named var1.
var1=trt – Yes.
var1=$trt – No. Refers to a variable that doesn’t exist.
var1="trt" – Yes.
Q&A: How can I save the value of the directory that I am in, as a variable named start_dir?
Answer
start_dir="$(pwd)" and start_dir=$(pwd) both work.
Q&A: What would the value of out_var=$"(ls)" be?
Answer
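(ls). Why not the command output? In bash, command substitution is written $(ls), with the $ directly before the parentheses. Here the $ comes before the quotes, so "(ls)" is treated as an ordinary quoted string and no command is run.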
Q&A: What would the final output be after running the following in a terminal?
Answer
.txt
The value of name1 begins with $file1, which is not a variable name, so it has no value. The only value assigned comes from ext1, referenced as ${ext1}.
Q&A: What would the final output be after running the following in a terminal?
Answer
sampleXext1
The value of name1 begins with $base1, which is correctly referenced and holds the value of "sampleX". This is followed by $"ext1". There is no variable named "ext1"; the variable is actually named ext1, which would be referenced by $ext1, ${ext1}, "$ext1", or "${ext1}". By having $ before the quotes, we’re really just adding in a string value at the end.
Q&A: What would the final output be after running the following in a terminal?
Answer
sampleX.txt
The value of name1 begins with $base1, which is correctly referenced and holds the value of "sampleX". This is followed by "${ext1}", which is correctly referenced and holds the value of .txt.
Q&A: What would the final output be after running the following in a terminal?
Answer
sampleX.txt
The value of name1 begins with $base1, which is correctly referenced and holds the value of "sampleX". This is followed by "${ext1}", which is correctly referenced and holds the value of .txt. To unset, we need to pass the variable name (name1), not a reference to the variable ($name1).
Q&A: What would the final output be after running the following in a terminal?
Answer
.otherstuff
The value of name1 is unset right before we reference it, so it holds no value. On the last line, we print $name1, followed by the string ".otherstuff".
Let’s see how we can use variables, combined with previous commands/methods in a quick analysis.
From the example.gtf file (downloaded and used in the previous lesson), which chromosome has the highest number of genes? What about exons?
The initial structure is below.
# cd to ~/Desktop/shell-lesson-data
cd ~/Desktop/shell-lesson-data
# view first few lines of the file
head example.gtf
#!genome-build CNA3
#!genome-version CNA3
#!genome-date 2015-11
#!genome-build-accession GCA_000149245.3
#!genebuild-last-updated 2015-11
1 ena gene 100 5645 . - . gene_id "CNAG_04548"; gene_source "ena"; gene_biotype "protein_coding";
1 ena transcript 100 5645 . - . gene_id "CNAG_04548"; transcript_id "AFR92135"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
1 ena exon 5494 5645 . - . gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AFR92135-1";
1 ena CDS 5494 5645 . - 0 gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; protein_id "AFR92135"; protein_version "1";
1 ena start_codon 5643 5645 . - 0 gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
From last time – we need to remove the leading lines of the file to make it easier to work with, using grep -v '^#', then we can cut the fields that we need, sort and count the total genes with sort | uniq -c | grep 'gene'.
# cd to directory
cd ~/Desktop/shell-lesson-data
# pull out gene biotype totals
cat example.gtf | grep -v '^#' | cut -f1,3 | sort | uniq -c | grep 'gene'
1033 1 gene
474 10 gene
663 11 gene
326 12 gene
322 13 gene
417 14 gene
706 2 gene
725 3 gene
503 4 gene
812 5 gene
640 6 gene
641 7 gene
639 8 gene
554 9 gene
42 Mt gene
We’re not quite there yet. Let’s capture the output as variables, named chr_n_gene and chr_n_exon, to use later.
Note: We’re introducing awk here, a language that is quite useful in parsing text, to print out the second column $2.
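As a small illustration of that last step, the sketch below runs awk on three of the count lines shown earlier. The variable names (counts, top_line, top_chr) are made up for this sketch, and the sort -k1,1nr step is an assumption added here so that the line with the largest count comes first before head picks it.

```shell
# Three of the "count chromosome biotype" lines from the output above,
# deliberately listed with the largest count last.
counts="    474 10 gene
    706 2 gene
   1033 1 gene"

# Order by the first column (the count), numerically, largest first,
# then keep only the top line.
top_line=$(echo "$counts" | sort -k1,1nr | head -n 1)

# awk splits each line on whitespace; $2 is the second column (the chromosome).
top_chr=$(echo "$top_line" | awk '{print $2;}')

echo "Top line: $top_line"
echo "Chromosome: $top_chr"
```

Note that $2 inside the single-quoted awk program is awk's own column reference, not a shell variable.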
# cd to directory
cd ~/Desktop/shell-lesson-data
# pull out gene biotype totals
# sort by count (largest first) so the top chromosome comes first
# grab the first line
# use awk to print the 2nd column
biotype_gene="gene"
biotype_exon="exon"
chr_n_gene=$(cat example.gtf | grep -v '^#' | cut -f1,3 | sort | uniq -c | grep "$biotype_gene" | sort -k1,1nr | head -n 1 | awk '{print $2;}')
chr_n_exon=$(cat example.gtf | grep -v '^#' | cut -f1,3 | sort | uniq -c | grep "$biotype_exon" | sort -k1,1nr | head -n 1 | awk '{print $2;}')
echo "The chromosome with the most $biotype_gene is: $chr_n_gene"
echo "The chromosome with the most $biotype_exon is: $chr_n_exon"
The chromosome with the most gene is: 1
The chromosome with the most exon is: 1
The ability to capture values and use them in further commands becomes really evident when we get into loops.
Loops allow us to perform a command (or set of commands) on each item in a list.
Bash for loops follow a specific syntax.
Figure 1: The syntax of a bash for loop.
Key components of the syntax
for, in, do, done – tell bash when portions of the loop are coming
item – a variable that holds the value of an item from the list for an iteration of the loop
list – a set of items (list or array) to iterate over
commands – the command(s) performed with each item in the list or array
Let’s work through an example from ~/Desktop/shell-lesson-data/exercise-data/creatures, printing out the first two lines of each file.
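A sketch of such a loop is given here, assuming the three file names that appear in the iteration table. So that it runs anywhere, it first writes small stand-in files into a temporary directory; the workdir variable and the stand-in contents are assumptions, and in the lesson you would cd into the creatures directory instead.

```shell
# Stand-in files (an assumption) so the sketch runs without the lesson data.
workdir=$(mktemp -d)
cd "$workdir"
for f in basilisk.dat minotaur.dat unicorn.dat
do
    printf 'line 1 of %s\nline 2 of %s\nline 3 of %s\n' "$f" "$f" "$f" > "$f"
done

# The four-line loop: print the first two lines of each file.
for filename in basilisk.dat minotaur.dat unicorn.dat
do
    head -n 2 "$filename"
done
```

After the loop finishes, filename still holds the last item it was given.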
Walking through the 4 lines, line-by-line.
for tells the computer we are entering a loop.
filename is created, which is initially empty.
in tells the computer to create an empty list.
basilisk.dat, minotaur.dat, and unicorn.dat are added to the list.
do tells the computer to listen for the following commands to perform on each item in the list.
Inside the loop, the current item is referenced as $filename.
In the example above, there are 3 iterations of the loop. Notice how the value of filename changes with each iteration.
Iteration | filename | list |
---|---|---|
1 | basilisk.dat | basilisk.dat minotaur.dat unicorn.dat |
2 | minotaur.dat | basilisk.dat minotaur.dat unicorn.dat |
3 | unicorn.dat | basilisk.dat minotaur.dat unicorn.dat |
Note
The variable could be named anything – in the example above, we can say for x in basilisk.dat minotaur.dat unicorn.dat instead, as long as the commands inside the loop then use $x.
A while loop is another useful type of loop in bash and follows a specific syntax.
Figure 2: The syntax of a bash while loop.
Key components of the syntax
while, do, done – tell bash when portions of the loop are coming
condition – a condition to be met for the loop to continue (“while true”)
commands – the command(s) performed on each iteration while the condition is true
Let’s see an example where we print out numbers less than or equal to 7 (-le).
Note: We can increment num by 1 each time by reassigning the value of num, num=$(($num+1)).
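A minimal sketch of such a loop, assuming num starts at 1 (the starting value is inferred from the output that follows, not stated in the text):

```shell
# Starting value (an assumption based on the printed output).
num=1

# Keep looping while num is less than or equal to 7.
while [ "$num" -le 7 ]
do
    echo "$num is less than or equal to 7."
    # Reassign num to its current value plus 1.
    num=$(($num+1))
done
```

The loop stops once num reaches 8, because the condition is then false. Without the increment line, the condition would never change and the loop would run forever.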
1 is less than or equal to 7.
2 is less than or equal to 7.
3 is less than or equal to 7.
4 is less than or equal to 7.
5 is less than or equal to 7.
6 is less than or equal to 7.
7 is less than or equal to 7.
Returning to our earlier gtf example, we can now identify the chromosomes with the most of several biotypes with a loop.
# cd to directory
cd ~/Desktop/shell-lesson-data
for bt in gene exon transcript CDS start_codon
do
# count per chromosome, sort by count (largest first), keep the top chromosome
chr_n=$(cat example.gtf | grep -v '^#' | cut -f1,3 | sort | uniq -c | grep "$bt" | sort -k1,1nr | head -n 1 | awk '{print $2;}')
echo "The chromosome with the most $bt is: $chr_n"
done
The chromosome with the most gene is: 1
The chromosome with the most exon is: 1
The chromosome with the most transcript is: 1
The chromosome with the most CDS is: 1
The chromosome with the most start_codon is: 1
We can take this further and capture all of the types of biotypes as an array to pass to the loop as a variable.
Note: An item at position x in an array can be accessed via ${array[x]}. In a loop, we use ${array[@]} to access all of the items, one per iteration.
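A short bash sketch of array access, using a made-up array named creatures (not part of the lesson data). Positions start at 0.

```shell
# A hypothetical array with three items; parentheses create the array.
creatures=(basilisk minotaur unicorn)

# A single item by position (the first item is at position 0).
first_item="${creatures[0]}"
echo "$first_item"

# The number of items in the array.
n_items="${#creatures[@]}"
echo "$n_items"

# All items, one per iteration of the loop.
for c in "${creatures[@]}"
do
    echo "item: $c"
done
```

Quoting "${creatures[@]}" keeps each item intact even if an item contains spaces.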
# cd to directory
cd ~/Desktop/shell-lesson-data
# capture the types of biotypes as an array (the outer parentheses make it an array)
btype_array=($(cat example.gtf | grep -v '^#' | cut -f3 | sort | uniq))
for bt in "${btype_array[@]}"
do
# count per chromosome, sort by count (largest first), keep the top chromosome
chr_n=$(cat example.gtf | grep -v '^#' | cut -f1,3 | sort | uniq -c | grep "$bt" | sort -k1,1nr | head -n 1 | awk '{print $2;}')
echo "The chromosome with the most $bt is: $chr_n"
done
The chromosome with the most CDS is: 1
The chromosome with the most exon is: 1
The chromosome with the most five_prime_utr is: 1
The chromosome with the most gene is: 1
The chromosome with the most start_codon is: 1
The chromosome with the most stop_codon is: 1
The chromosome with the most three_prime_utr is: 1
The chromosome with the most transcript is: 1
Q&A: Write a loop that would print out the months of the year. Create an array that holds the months.
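One possible answer, as a sketch; the names months and month are arbitrary choices, and any valid variable names would work.

```shell
# An array that holds the months of the year.
months=(January February March April May June July August September October November December)

# Print each month, one per line.
for month in "${months[@]}"
do
    echo "$month"
done
```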
Q&A: Look at the following code and output.
What would be the output of the following code?
Answer
cubane.pdb. The list that is iterated over is any file that starts with c.
Hopefully you’ve seen how helpful variables and loops can be. Next, we’ll put things together with bash scripts.