The Unix Shell
Wildcards (special characters) can be used in several ways:
Standard wildcards are used for globbing files – pulling together files to perform an action on them.
A table of commonly used wildcards.
Wildcard | Represents |
---|---|
* | 0 or more characters |
? | Any single character |
[] | Any one of the characters within the brackets, such as [abc] or a range like [0-9] |
{} | Any term within the braces (comma separated list) |
[!] | Any one character that is not listed within the brackets |
\ | “Escapes” the following character, to treat it as a non-special character |
Q&A: How can we list all files in shell-lesson-data/north-pacific-gyre that end with .txt?
Q&A: List the files in shell-lesson-data/north-pacific-gyre that do not end with .txt.
Q&A: List the files in shell-lesson-data/north-pacific-gyre whose last two positions before the suffix are a digit lower than 5 followed by any character other than Z.
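As a minimal sketch, the three questions above can be tried on a throwaway directory. The file names below are made up to imitate the lesson data (an assumption, not the actual contents of north-pacific-gyre). The second pattern is only an approximation: it lists names whose last character is not t, which is enough here but would also hide any other file ending in t.

```shell
# Build a small scratch directory with hypothetical file names
demo_dir=$(mktemp -d)
cd "$demo_dir"
touch NENE01729A.txt NENE01736A.txt NENE01751B.txt NENE01812A.txt NENE01843Z.txt
touch goodiff.sh goostats.sh

# Files that end with .txt
txt_files=$(ls *.txt)
echo "$txt_files"

# Files that do not end with .txt (approximation: last character is not t)
other_files=$(ls *[!t])
echo "$other_files"

# A digit lower than 5, then any character except Z, right before the suffix
low_files=$(ls *[0-4][!Z].txt)
echo "$low_files"
```

In the lesson itself, the same patterns would be run from inside shell-lesson-data/north-pacific-gyre.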
Regular expressions (regex)
A complex form of pattern matching that combines wildcards to create powerful patterns for text matching and manipulation in files.
To efficiently represent a pattern, we need a language that specifies the atoms to match, where they occur, how many times they repeat, and how they are grouped.
Character classes are used to represent atoms.
Character class – Example | Matches |
---|---|
Non-special characters – a | a matches a |
Dot – . | . matches any single character |
Range – [a-z] | [a-z] matches any lowercase letter from a through z |
Character set – [abc] | [abc] matches a , b , or c |
Character set – [[:alnum:]] | [[:alnum:]] matches any alpha-numeric character |
Character set – [[:lower:]] | [[:lower:]] matches any lowercase letter |
Character set – [[:space:]] | [[:space:]] matches any whitespace |
Character set – [[:digit:]] | [[:digit:]] matches any digit |
Negated character set – [^abc] | [^abc] matches any character except a , b , or c |
Whitespace – \s | \s matches any whitespace character |
Non-whitespace – \S | \S matches any non-whitespace character |
Word – \w | \w matches any word character (alpha-numeric or underscore) |
Non-word – \W | \W matches any non-word character |
Digit – \d | \d matches any digit |
Non-digit – \D | \D matches any non-digit character |
The backslash shorthands are not part of POSIX regular expressions. In particular, \d and \D need a Perl-compatible engine, so [[:digit:]] or [0-9] is the portable way to match a digit.
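A minimal sketch of two character classes in action, using grep -E (grep is covered later in this lesson) with the -o flag, which prints only the matched text. The sample strings are made up for illustration.

```shell
# Hypothetical sample text, used only for illustration
sample='Unicorn 1738-11-24'

# [[:digit:]] matches a single digit; the + quantifier extends the match to a run of digits
digits=$(echo "$sample" | grep -oE '[[:digit:]]+')
echo "$digits"

# [^abc] matches any single character except a, b, or c
no_abc=$(echo 'abcdef' | grep -oE '[^abc]+')
echo "$no_abc"
```

Each run of digits is printed on its own line, and the second command prints only the part of abcdef that contains none of a, b, or c.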
Anchors are used to specify the location of characters or set of characters – so the pattern will only match if the position also matches.
Anchor | Example(s) |
---|---|
Start of line/string – ^ | ^a matches the a in apple , but not sandal |
End of line/string – $ | a$ matches the a in spa , but not space |
Quantifiers are used to specify the number of times preceding characters or sets of characters are repeated.
Quantifier | Example(s) |
---|---|
0 or 1 time – ? | re?d matches rd , red , NOT reed , read |
0 or more times – * | re*d matches rd , red , reed , NOT read |
1 or more times – + | re+d matches red , reed , NOT rd , read |
Specified number of times – {} | re{1}d matches red , NOT rd , reed , read |
Range of times – {1,3} | re{1,3}d matches red , reed , NOT rd , read |
Or – | | re(e|a)d matches reed , read , NOT rd , red |
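The anchor and quantifier examples above can be checked with a short sketch. It uses grep -E (covered later in this lesson) with the -x flag, which requires the pattern to match the whole line, so each test word either matches completely or not at all.

```shell
# The four test words from the quantifier table, one per line
words=$(printf '%s\n' rd red reed read)

zero_or_one=$(echo "$words" | grep -Ex 're?d')     # rd, red
zero_or_more=$(echo "$words" | grep -Ex 're*d')    # rd, red, reed
one_or_more=$(echo "$words" | grep -Ex 're+d')     # red, reed
either=$(echo "$words" | grep -Ex 're(e|a)d')      # reed, read

# Anchor: ^a only matches an a at the start of the line
starts_with_a=$(printf '%s\n' apple sandal | grep -E '^a')   # apple

echo "$zero_or_one"
echo "$zero_or_more"
echo "$one_or_more"
echo "$either"
echo "$starts_with_a"
```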
Matched atoms can be grouped together and referenced later, perhaps to keep or replace.
Grouping/Reference | Example(s) |
---|---|
Capture the group – () | (re)d groups re together |
Reference the group – \1 | \1 references the first group captured |
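A minimal sketch of capturing and referencing a group. Two assumptions are worth flagging: back-references inside grep -E are an extension offered by common grep implementations rather than part of POSIX extended regular expressions, and sed is a separate tool that this lesson does not otherwise cover.

```shell
# (e)\1 captures one e and then requires the same text again, so it finds "ee"
doubled=$(printf '%s\n' rd red reed read | grep -E '(e)\1')
echo "$doubled"

# Keep the captured group and replace the rest: (re)d becomes re + ad
replaced=$(echo 'red' | sed -E 's/(re)d/\1ad/')
echo "$replaced"
```

The first command prints reed, the only word with a doubled e, and the second rewrites red as read while reusing the captured re.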
Learning regex takes time and practice, practice, practice!
Q&A: Which expression will select only “the” in the following?
“The great thing about learning is that the experience itself teaches you something, though it may not be the thing you wanted to learn.”
the
(T|t)e
[Tt]he
*he
Q&A: Which expression will select all of the following?
foxes boxes loxes
.oxes
[fbl]oxes
(f|b|l)oxes
*oxes
Q&A: Which expression will select all of the following?
nd ned need
ne+d
ne?d
ne*d
ne.d
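One way to check candidate answers is to count how many of the target words each expression matches in full. The sketch below does this for the last question with grep -E; the -x flag requires a whole-line match and -c prints the number of matching lines.

```shell
# The three target words, one per line
targets=$(printf '%s\n' nd ned need)

plus_count=$(echo "$targets" | grep -cEx 'ne+d')
optional_count=$(echo "$targets" | grep -cEx 'ne?d')
star_count=$(echo "$targets" | grep -cEx 'ne*d')
dot_count=$(echo "$targets" | grep -cEx 'ne.d')

echo "ne+d matches $plus_count of 3"
echo "ne?d matches $optional_count of 3"
echo "ne*d matches $star_count of 3"
echo "ne.d matches $dot_count of 3"
```

Only an expression that reports 3 of 3 selects all of the words.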
For more complex practice, I recommend RegexOne.
grep
Regular expressions are most effective when used with specific commands. One that we’ll learn about is called grep.
grep – search a regular expression and print
Searches for a pattern within a file and returns the lines containing the pattern.
Command | Options/Flags | Arguments |
---|---|---|
grep | flags | pattern /path/to/file |
Let’s try this on a few files in our shell-lesson-data/exercise-data/creatures directory.
If we take a look at the top 5 lines of each file (head command), we see:
# cd into the directory
cd ~/Desktop/shell-lesson-data/exercise-data/creatures
# print the first 5 lines of each file
head -n 5 *
==> basilisk.dat <==
COMMON NAME: basilisk
CLASSIFICATION: basiliscus vulgaris
UPDATED: 1745-05-02
CCCCAACGAG
GAAACAGATC
==> minotaur.dat <==
COMMON NAME: minotaur
CLASSIFICATION: bos hominus
UPDATED: 1765-02-17
CCCGAAGGAC
CGACATCTCT
==> unicorn.dat <==
COMMON NAME: unicorn
CLASSIFICATION: equus monoceros
UPDATED: 1738-11-24
AGCCGGGTCG
CTTTACCTTA
Using grep, let’s pull out the common names line of all of the creatures.
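A minimal sketch of that search. So that it runs anywhere, it first recreates two of the files in a scratch directory using only lines shown in the head output above; in the lesson directory, the grep line alone is enough.

```shell
# Recreate shortened copies of two creature files (first lines only)
demo_dir=$(mktemp -d)
cd "$demo_dir"
printf 'COMMON NAME: basilisk\nCLASSIFICATION: basiliscus vulgaris\n' > basilisk.dat
printf 'COMMON NAME: unicorn\nCLASSIFICATION: equus monoceros\n' > unicorn.dat

# Print every line containing COMMON NAME, from all files ending in .dat
common_names=$(grep 'COMMON NAME' *.dat)
echo "$common_names"
```

When grep searches more than one file, it prefixes each matching line with the name of the file it came from.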
Now we will check how many times CCC is seen in each creature’s genomic sequence.
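A minimal sketch using the -c flag. Note that -c counts the lines that contain CCC, not every individual occurrence. The scratch files below hold only the two sequence lines shown in the head output above, so the counts are smaller than they would be for the full lesson files.

```shell
# Recreate shortened copies of the creature files (first two sequence lines only)
demo_dir=$(mktemp -d)
cd "$demo_dir"
printf 'CCCCAACGAG\nGAAACAGATC\n' > basilisk.dat
printf 'CCCGAAGGAC\nCGACATCTCT\n' > minotaur.dat
printf 'AGCCGGGTCG\nCTTTACCTTA\n' > unicorn.dat

# Count the lines containing CCC in each file
ccc_counts=$(grep -c 'CCC' *.dat)
echo "$ccc_counts"
```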
What if we want just the first line following the common name unicorn?
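A minimal sketch using the -A flag (lines of context after a match), which is available in GNU and BSD grep. The scratch file holds only the first lines of unicorn.dat shown in the head output above.

```shell
# Recreate the first three lines of unicorn.dat
demo_dir=$(mktemp -d)
cd "$demo_dir"
printf 'COMMON NAME: unicorn\nCLASSIFICATION: equus monoceros\nUPDATED: 1738-11-24\n' > unicorn.dat

# Print the matching line plus the one line that follows it
after_name=$(grep -A 1 'COMMON NAME: unicorn' unicorn.dat)
echo "$after_name"
```

The output contains the matching line itself followed by the classification line.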
What if we wanted anything updated in the 1740s? We need to use the -E option to use the extended regular expressions we covered earlier.
# cd into the directory
cd ~/Desktop/shell-lesson-data/exercise-data/creatures
# grep dates from the 1740s in all files ending in .dat
# grep -E does not support the \d shorthand, so digits are written as [0-9]
grep -E '174[0-9]-[0-9]{2}-[0-9]{2}' *.dat
basilisk.dat:UPDATED: 1745-05-02
As we can see, grep and pattern matching are useful, but they become even more powerful if we combine them with filtering.
In Unix, we can filter data in many ways. Here we’ll cover a few lightweight but useful commands to do so.
cut – filtering data from each line
Filters data (“cuts”) based upon a separator.
Command | Options/Flags | Arguments |
---|---|---|
cut | flags | file/input |
Let’s take a look at the animals.csv file in shell-lesson-data/exercise-data/animal-counts, using head -n 3 to look at the first 3 lines.
Now let’s keep only the animals and counts – fields 2 and 3 if we consider the comma as the field separator.
We use the -f flag to set the columns to keep and the -d flag to tell the command to use a comma as a field separator.
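A minimal sketch of both steps. The scratch copy of animals.csv below contains four rows taken from the sorted listing later in this lesson; their order in the real file is an assumption.

```shell
# Recreate a shortened animals.csv in a scratch directory
demo_dir=$(mktemp -d)
cd "$demo_dir"
printf '%s\n' \
    '2012-11-05,deer,5' \
    '2012-11-05,rabbit,22' \
    '2012-11-05,raccoon,7' \
    '2012-11-06,rabbit,19' > animals.csv

# Look at the first 3 lines
first_three=$(head -n 3 animals.csv)
echo "$first_three"

# Keep fields 2 and 3 (-f), using a comma as the field separator (-d)
animals_and_counts=$(cut -d , -f 2,3 animals.csv)
echo "$animals_and_counts"
```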
uniq – report or filter out repeated lines
Filters out repeated ADJACENT lines, but also allows for counting them, or ignoring a specific number of them.
Command | Options/Flags | Arguments |
---|---|---|
uniq | flags | input/file output/file |
Let’s use uniq to count the number of unique lines.
If a file looked like this:
Then uniq, with counts (using the -c flag) would output
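As a minimal sketch of that behaviour, assume a hypothetical file sightings.txt (not part of the lesson data) in which deer appears twice in a row and then once more after rabbit.

```shell
# Hypothetical input file: deer, deer, rabbit, deer
demo_dir=$(mktemp -d)
cd "$demo_dir"
printf '%s\n' deer deer rabbit deer > sightings.txt

# Count runs of identical adjacent lines
counts=$(uniq -c sightings.txt)
echo "$counts"
```

The output has three rows: 2 deer, 1 rabbit, 1 deer. The last deer is counted separately because it is not adjacent to the first two, which is why input is usually sorted before it is passed to uniq.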
sort – order lines of a file
Sorts a file or input in a highly customizable way.
Command | Options/Flags | Arguments |
---|---|---|
sort | flags | file/input |
Let’s sort the animals file by the second field (the -k flag), using the comma as the field separator (-t flag).
# cd to directory
cd ~/Desktop/shell-lesson-data/exercise-data/animal-counts
# sort by the second field (-k), using comma as the field separator (-t)
# -k 2 makes the sort key run from field 2 to the end of the line
sort -t , -k 2 animals.csv
2012-11-07,bear,1
2012-11-06,deer,2
2012-11-05,deer,5
2012-11-06,fox,4
2012-11-07,rabbit,16
2012-11-06,rabbit,19
2012-11-05,rabbit,22
2012-11-05,raccoon,7
We revisit grep here to highlight the ability to not only return matching lines, but also to return the lines that do not match, using the -v flag.
Command | Options/Flags | Arguments |
---|---|---|
grep | flags | pattern /path/to/file |
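A minimal sketch of the -v flag, using a scratch copy of three rows from animals.csv: every line that does not contain rabbit is returned.

```shell
# Recreate a few rows of animals.csv in a scratch directory
demo_dir=$(mktemp -d)
cd "$demo_dir"
printf '%s\n' '2012-11-05,deer,5' '2012-11-05,rabbit,22' '2012-11-06,fox,4' > animals.csv

# Return only the lines that do NOT match the pattern
not_rabbit=$(grep -v 'rabbit' animals.csv)
echo "$not_rabbit"
```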
Pipes (|) are used to quickly connect Unix commands by “piping” the output of one command to the input of another.
command1 | command2 | command3
Piping or chaining together commands in this way allows us to make even greater use of the commands we just learned about. :)
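A minimal sketch of a three-command pipeline that counts how many rows mention each animal. The scratch copy of animals.csv is rebuilt from the sorted listing shown earlier; the row order in the real file is an assumption, and it does not affect the result because the data is sorted before counting.

```shell
# Recreate animals.csv in a scratch directory
demo_dir=$(mktemp -d)
cd "$demo_dir"
printf '%s\n' \
    '2012-11-05,deer,5' '2012-11-05,rabbit,22' '2012-11-05,raccoon,7' \
    '2012-11-06,rabbit,19' '2012-11-06,deer,2' '2012-11-06,fox,4' \
    '2012-11-07,rabbit,16' '2012-11-07,bear,1' > animals.csv

# cut keeps the animal name, sort groups identical names, uniq -c counts each group
species_counts=$(cut -d , -f 2 animals.csv | sort | uniq -c)
echo "$species_counts"
```

Sorting first matters because uniq only merges adjacent lines.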
We’ll illustrate this by working with a gtf file.
We will download the file with wget or curl.
In the terminal, type which wget and which curl. A command that is installed prints its path.
You can download the file as below, depending on the command you want to use.
# cd to ~/Desktop/shell-lesson-data
cd ~/Desktop/shell-lesson-data
# download with wget (capital -O sets the name of the saved file, here example.gtf.gz)
wget -O example.gtf.gz ftp://ftp.ensemblgenomes.org/pub/release-39/fungi/gtf/fungi_basidiomycota1_collection/cryptococcus_neoformans_var_grubii_h99/Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf.gz
# cd to ~/Desktop/shell-lesson-data
cd ~/Desktop/shell-lesson-data
# download with curl (lowercase -o sets the name of the saved file, here example.gtf.gz)
curl -o example.gtf.gz ftp://ftp.ensemblgenomes.org/pub/release-39/fungi/gtf/fungi_basidiomycota1_collection/cryptococcus_neoformans_var_grubii_h99/Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf.gz
You can unzip the file as below, depending on the command you want to use.
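A minimal sketch of the unzip step, assuming the two commands meant here are gunzip and gzip -d (both are standard ways to decompress a .gz file). To keep it runnable without the download, a one-line stand-in for example.gtf.gz is created first; that stand-in and its header line are made up.

```shell
# Create a tiny stand-in for the downloaded file (hypothetical content)
demo_dir=$(mktemp -d)
cd "$demo_dir"
printf '#!genome-build demo\n' > example.gtf
gzip example.gtf            # leaves only example.gtf.gz

# Option 1: gunzip replaces example.gtf.gz with example.gtf
gunzip example.gtf.gz

# Option 2 (equivalent): gzip -d example.gtf.gz

# Look at the start of the unzipped file
first_line=$(head -n 1 example.gtf)
echo "$first_line"
```

With the real download, head example.gtf shows the header and the first records of the file.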
#!genome-build CNA3
#!genome-version CNA3
#!genome-date 2015-11
#!genome-build-accession GCA_000149245.3
#!genebuild-last-updated 2015-11
1 ena gene 100 5645 . - . gene_id "CNAG_04548"; gene_source "ena"; gene_biotype "protein_coding";
1 ena transcript 100 5645 . - . gene_id "CNAG_04548"; transcript_id "AFR92135"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
1 ena exon 5494 5645 . - . gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AFR92135-1";
1 ena CDS 5494 5645 . - 0 gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; protein_id "AFR92135"; protein_version "1";
1 ena start_codon 5643 5645 . - 0 gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
GTF files contain the following information, as tab-separated columns (fields): seqname, source, feature, start, end, score, strand, frame, and attribute.
Using the commands we’ve learned thus far, let’s explore the example.gtf file to answer the following:
How many chromosomes does the organism have?
# cd to directory
cd ~/Desktop/shell-lesson-data
# print the file to the screen to pipe it into grep
# remove the lines with #! because they'll get in the way
# cut to keep the first column (chromosomes)
# sort the chromosomes numerically, removing duplicates
cat example.gtf | grep -v '^#' | cut -f1 | sort -nu
Mt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
This organism has 14 + Mt chromosomes.
How many genes does the organism have?
# cd to directory
cd ~/Desktop/shell-lesson-data
# print the file to the screen to pipe it into grep
# remove the lines with #! because they'll get in the way
# cut to keep the third column (feature type)
# sort values
# keep unique values, specifying counts of each unique value
cat example.gtf | grep -v '^#' | cut -f3 | sort | uniq -c
49063 CDS
52036 exon
6923 five_prime_utr
8497 gene
7860 start_codon
3167 stop_codon
7034 three_prime_utr
9348 transcript
This organism has 8497 genes.
Which chromosome has the most genes?
# cd to directory
cd ~/Desktop/shell-lesson-data
# print the file to the screen to pipe it into grep
# remove the lines with #! because they'll get in the way
# cut to keep the first and third columns (chromosome, feature type)
# sort values
# keep unique values, specifying counts of each unique value to get totals of feature types by chromosome
# pull out gene totals
cat example.gtf | grep -v '^#' | cut -f1,3 | sort | uniq -c | grep 'gene'
1033 1 gene
474 10 gene
663 11 gene
326 12 gene
322 13 gene
417 14 gene
706 2 gene
725 3 gene
503 4 gene
812 5 gene
640 6 gene
641 7 gene
639 8 gene
554 9 gene
42 Mt gene
Chromosome 1 has the most genes, 1033. Mt has the fewest, 42.
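Rather than reading the largest count off the list by eye, the pipeline can be extended with a numeric reverse sort and head. The sketch below runs on a tiny made-up file, mini.gtf, that keeps only the first three GTF columns; the file name and its contents are assumptions for illustration.

```shell
# Hypothetical miniature GTF-like file: header line plus tab-separated records
demo_dir=$(mktemp -d)
cd "$demo_dir"
printf '#!genome-build demo\n' > mini.gtf
printf '1\tena\tgene\n1\tena\texon\n1\tena\tgene\n' >> mini.gtf
printf '2\tena\tgene\nMt\tena\tgene\n' >> mini.gtf

# Same steps as above, then sort by the count (field 1) in reverse
# numeric order and keep only the first line
top_chromosome=$(grep -v '^#' mini.gtf | cut -f1,3 | sort | uniq -c \
    | grep 'gene' | sort -k1,1nr | head -n 1)
echo "$top_chromosome"
```

Here the top line reports 2 genes on chromosome 1. Applied to example.gtf, the same two extra steps would put the chromosome with the most genes on the first line.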
As we can see, piping commands together allows us to easily perform analyses as a set of commands. In the next lesson, we’ll learn about how we can use loops and scripts to do this even more efficiently.