Patterns, Filters, and Pipes
Pattern Matching
Wildcards (special characters) can be used in several ways:
- Standard wildcards (globbing) – matching to work on a set of files
- Regular expressions – matching to work within files
Standard Expansion Patterns
Standard wildcards are used for globbing files – pulling together files to perform an action on them.
| Wildcard | Represents |
|---|---|
| * | 0 or more characters |
| ? | Any single character |
| [] | Any one of the characters within the brackets (a set such as [abc] or a range such as [0-9]) |
| {} | Any term within the braces (comma separated list) |
| [!] | Any one character except (negate) those within the brackets |
| \ | “Escapes” the following character, to treat it as a non-special character |
Let’s try some examples in our shell-lesson-data directory.
Q&A: How can we list all files in shell-lesson-data/north-pacific-gyre that end with .txt?
Q&A: List the files in shell-lesson-data/north-pacific-gyre that do not end with .txt.
Q&A: List the files in shell-lesson-data/north-pacific-gyre whose last two characters before the suffix are a digit lower than 5 followed by any character other than Z.
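One way to approach the three questions above is sketched below. To keep it self-contained, it works in a scratch directory with made-up file names rather than the real north-pacific-gyre data, so the names and variables are assumptions for illustration only. Standard wildcards have no simple way to say “does not end with .txt”, so the second answer falls back on grep, which is covered later in this lesson.

```shell
# Scratch directory with made-up file names, so no real data is touched
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
touch NENE01729A.txt NENE01729B.txt NENE01812A.txt NENE01843B.txt NENE02040Z.txt goostats.sh

# 1. Every file that ends with .txt
txt_files=$(echo *.txt)

# 2. Every file that does not end with .txt (filtered by grep, not by a wildcard)
non_txt_files=$(ls | grep -v '\.txt$')

# 3. A digit from 0 to 4, then any one character except Z, just before .txt
low_digit_files=$(echo *[0-4][!Z].txt)

echo "$txt_files"
echo "$non_txt_files"
echo "$low_digit_files"
```

In the real directory the same patterns can be passed straight to ls, for example ls *[0-4][!Z].txt.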
Regular Expressions (regex)
Regular expressions are a more complex form of pattern matching that combines “wildcards” into powerful patterns for matching and manipulating text within files.
They are used with grep to search for text – which we’ll explain in a bit.
- regex symbols are interpreted by the commands that use them, such as grep
What makes a pattern?
To efficiently represent a pattern, we need to develop a language that specifies
- atom – the actual character that we want to match
- positions – the location of this atom
- number of times – how many times we see the atom
- groups – groups of matched atoms or non-matched
Representing Atoms
Character classes are used to represent atoms.
| Character class – Example | Matches |
|---|---|
| Non-special characters – a | a matches a |
| Dot – . | . matches any single character |
| Range – [a-z] | [a-z] matches any letter from a through z |
| Character set – [abc] | [abc] matches a, b, or c |
| Character set – [[:alnum:]] | [[:alnum:]] matches any alpha-numeric character |
| Character set – [[:lower:]] | [[:lower:]] matches any lowercase letter |
| Character set – [[:space:]] | [[:space:]] matches any whitespace |
| Character set – [[:digit:]] | [[:digit:]] matches any digit |
| Negated character set – [^abc] | [^abc] matches any character except a, b, or c |
| Whitespace – \s | \s matches any whitespace character |
| Non-whitespace – \S | \S matches any non-whitespace character |
| Word – \w | \w matches any word character (alpha-numeric or underscore) |
| Non-word – \W | \W matches any non-word character |
| Digit – \d | \d matches any digit (Perl-style shorthand; not supported by every grep) |
| Non-digit – \D | \D matches any non-digit (Perl-style shorthand; not supported by every grep) |
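To see a few of these character classes in action, here is a small sketch that runs grep over four made-up sample lines (the sample text and variable names are assumptions for illustration). It sticks to the dot, bracket expressions, and POSIX classes, since those behave the same across common grep versions.

```shell
# Four made-up sample lines: cat, Cat, c4t, and "c t" (with a space)
sample=$(printf 'cat\nCat\nc4t\nc t\n')

# Dot: c, then any single character, then t (counts the matching lines)
dot_count=$(printf '%s\n' "$sample" | grep -c 'c.t')

# Character set: c, then a or 4, then t
set_hits=$(printf '%s\n' "$sample" | grep 'c[a4]t')

# POSIX class: lines that contain a digit
digit_hits=$(printf '%s\n' "$sample" | grep '[[:digit:]]')

# Negated set: c, then any character that is not a letter, then t
negated_hits=$(printf '%s\n' "$sample" | grep 'c[^[:alpha:]]t')

echo "$dot_count"
echo "$set_hits"
echo "$digit_hits"
echo "$negated_hits"
```

Note that Cat is never matched here, because every pattern starts with a lowercase c.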
Positions
Anchors are used to specify the location of characters or set of characters – so the pattern will only match if the position also matches.
| Anchor | Example(s) | 
|---|---|
| Start of line/string – ^ | ^a matches the a in apple, but not in sandal |
| End of line/string – $ | a$ matches the a in spa, but not in space |
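The anchor examples in the table can be checked with grep. This is a minimal sketch using the four words from the table as input lines; the variable names are only for illustration.

```shell
# The four example words from the table, one per line
words=$(printf 'apple\nsandal\nspa\nspace\n')

# ^a : the line must start with a
starts_with_a=$(printf '%s\n' "$words" | grep '^a')

# a$ : the line must end with a
ends_with_a=$(printf '%s\n' "$words" | grep 'a$')

# No anchor: count every line that contains an a anywhere
contains_a=$(printf '%s\n' "$words" | grep -c 'a')

echo "$starts_with_a"
echo "$ends_with_a"
echo "$contains_a"
```

Without an anchor all four words match, which is why the position matters.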
Number of times
Quantifiers are used to specify the number of times the preceding character or set of characters is repeated.
| Quantifier | Example(s) |
|---|---|
| 0 or 1 time – ? | re?d matches rd, red, NOT reed, read |
| 0 or more times – * | re*d matches rd, red, reed, NOT read |
| 1 or more times – + | re+d matches red, reed, NOT rd, read |
| Specified number of times – {} | re{1}d matches red, NOT rd, reed, read |
| Range of times – {1,3} | re{1,3}d matches red, reed, NOT rd, read |
| Or – \| | re(e\|a)d matches reed, read, NOT rd, red |
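The quantifier table can be reproduced with grep -E (extended regular expressions, introduced later in this lesson). The sketch below uses the four example words as input lines and wraps each pattern in ^ and $ so that the whole line has to match; the variable names are assumptions for illustration.

```shell
# The four example words from the table, one per line
words=$(printf 'rd\nred\nreed\nread\n')

# ? : zero or one e
zero_or_one=$(printf '%s\n' "$words" | grep -E '^re?d$')

# * : zero or more e
zero_or_more=$(printf '%s\n' "$words" | grep -E '^re*d$')

# + : one or more e
one_or_more=$(printf '%s\n' "$words" | grep -E '^re+d$')

# {1,3} : between one and three e
one_to_three=$(printf '%s\n' "$words" | grep -E '^re{1,3}d$')

# (e|a) : re, then either e or a, then d
either=$(printf '%s\n' "$words" | grep -E '^re(e|a)d$')

echo "$zero_or_one"
echo "$zero_or_more"
echo "$one_or_more"
echo "$one_to_three"
echo "$either"
```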
Groups and Reference
Matched atoms can be grouped together and referenced later.
| Grouping/Reference | Example(s) | 
|---|---|
| Capture the group – () | (re)d groups re together |
| Reference the group – \1 | \1 references the first group captured |
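As a small illustration of capturing and referencing, the sketch below looks for text that repeats itself. The sample words are made up, and it assumes a grep that accepts back-references together with -E (GNU and BSD grep do; strictly speaking this is an extension to the standard).

```shell
# Made-up sample words, one per line
words=$(printf 'mama\npapa\ndata\n')

# (..) captures any two characters; \1 requires the same two characters again
doubled=$(printf '%s\n' "$words" | grep -E '^(..)\1$')

# (re)d\1 : re, then d, then whatever the first group captured (re again)
table_example=$(printf 'redre\nred\n' | grep -E '(re)d\1')

echo "$doubled"
echo "$table_example"
```

data is not matched because its first two characters (da) are not repeated as the last two (ta).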
Practicing with Regex
Learning regex takes time and practice!
Question 1:
Which expression will select only “the” in the following?
“The great thing about learning is that the experience itself teaches you something, though it may not be the thing you wanted to learn.”
- the
- (T|t)e
- [Tt]he
- *he
Question 2:
Which expression will select all of the following?
foxes boxes loxes
- .oxes
- [fbl]oxes
- (f|b|l)oxes
- *oxes
Question 3:
Which expression will select all of the following?
nd ned need
- ne+d
- ne?d
- ne*d
- ne.d
For more practice, I recommend RegexOne
Using regex with grep
Regular expressions are most effective when used with specific commands.
grep – globally search a regular expression and print
Searches for a pattern within a file and returns the line containing the pattern.
By default, grep returns the line containing the pattern and is case-sensitive.
A few of the useful options are below:
- use -i to perform case-insensitive matching
- use -v to return the non-matching lines
- use -w to match whole words only
- use -A to return the given number of lines after each matching line (-A 1 returns one line)
- use -B to return the given number of lines before each matching line
- use -E to use extended regular expressions
- use -c to return the number of lines that match
- use -n to output the line number of each match
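Before moving to the lesson files, here is a minimal sketch of several of these flags on a temporary file. The file content copies the first five basilisk.dat lines shown later in this section; the temporary file and variable names are assumptions for illustration.

```shell
# Temporary file holding five lines modeled on basilisk.dat
demo_file=$(mktemp)
printf 'COMMON NAME: basilisk\nCLASSIFICATION: basiliscus vulgaris\nUPDATED: 1745-05-02\nCCCCAACGAG\nGAAACAGATC\n' > "$demo_file"

# -i : lowercase pattern still matches the uppercase text
insensitive=$(grep -i 'common' "$demo_file")

# -c : number of lines containing basilis (basilisk and basiliscus)
partial_count=$(grep -c 'basilis' "$demo_file")

# -w : basilisk must stand alone as a whole word
whole_word_count=$(grep -w -c 'basilisk' "$demo_file")

# -n : prefix the matching line with its line number
numbered=$(grep -n 'UPDATED' "$demo_file")

# -v : lines without a colon, which leaves the sequence lines
sequence_lines=$(grep -v ':' "$demo_file")

# -A 1 : the matching line plus one line after it
with_next_line=$(grep -A 1 'COMMON' "$demo_file")

echo "$insensitive"
echo "$partial_count $whole_word_count"
echo "$numbered"
echo "$sequence_lines"
echo "$with_next_line"
```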
| Command | Options/Flags | Arguments | 
|---|---|---|
| grep | flags | pattern path/to/file |
Let’s try this on a few files in our shell-lesson-data/exercise-data/creatures directory.
If we take a look at the top 5 lines of each file (head command) we see:
# cd into the directory
cd ~/Desktop/shell-lesson-data/exercise-data/creatures
# print the first 5 lines of each file
head -n 5 *
==> basilisk.dat <==
COMMON NAME: basilisk
CLASSIFICATION: basiliscus vulgaris
UPDATED: 1745-05-02
CCCCAACGAG
GAAACAGATC
==> minotaur.dat <==
COMMON NAME: minotaur
CLASSIFICATION: bos hominus
UPDATED: 1765-02-17
CCCGAAGGAC
CGACATCTCT
==> unicorn.dat <==
COMMON NAME: unicorn
CLASSIFICATION: equus monoceros
UPDATED: 1738-11-24
AGCCGGGTCG
CTTTACCTTA
Using grep, let’s pull out the common name line of all of the creatures.
# cd into the directory
cd ~/Desktop/shell-lesson-data/exercise-data/creatures
# grep COMMON NAME from all files ending in .dat 
grep 'COMMON NAME' *.dat
basilisk.dat:COMMON NAME: basilisk
minotaur.dat:COMMON NAME: minotaur
unicorn.dat:COMMON NAME: unicorn
Using grep, let’s check how many lines of each creature’s genomic sequence contain CCC.
# cd into the directory
cd ~/Desktop/shell-lesson-data/exercise-data/creatures
# count the lines containing CCC (-c) in all files ending in .dat
grep -c 'CCC' *.dat
basilisk.dat:22
minotaur.dat:18
unicorn.dat:22
What if we want just the first line following the common name unicorn?
# cd into the directory
cd ~/Desktop/shell-lesson-data/exercise-data/creatures
# grep unicorn from all files ending in .dat, adding 1 line after each match (-A 1)
grep -A 1 'unicorn' *.dat
unicorn.dat:COMMON NAME: unicorn
unicorn.dat-CLASSIFICATION: equus monoceros
What if we wanted anything updated in the 1740s? We need the -E option to use the extended regular expressions we covered earlier. The digit shorthand \d is not supported by every grep, so we use the bracket expression [0-9] instead.
# cd into the directory
cd ~/Desktop/shell-lesson-data/exercise-data/creatures
# grep dates from the 1740s in all files ending in .dat
grep -E '174[0-9]-[0-9]{2}-[0-9]{2}' *.dat
basilisk.dat:UPDATED: 1745-05-02
As we can see, grep and pattern matching are useful, but they become even more powerful if we combine them with filtering.
Filtering
In Unix, we can filter data in many ways. Here we’ll cover a few lightweight but useful commands for doing so.
cut – filtering data from each line, cutting columns/fields out
Filter data (“cut”) based upon a separator.
| Command | Options/Flags | Arguments | 
|---|---|---|
| cut | flags | file/input | 
The cut command separates fields by tabs by default.
Some useful flags are below:
- use -d to set the delimiter between fields to another character
- use -f to list the fields to keep (can be a list or a range: -f 2,3 keeps fields 2 and 3; -f 3-5 keeps fields 3 to 5)
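The two flags can be tried on a single line before touching a file. This is a minimal sketch; the comma separated row is the first line of animals.csv shown below, while the tab separated text and variable names are made up for illustration.

```shell
# One comma separated row, fed to cut through a pipe
row='2012-11-05,deer,5'

# A single field
animal=$(printf '%s\n' "$row" | cut -d ',' -f 2)

# A list of fields
animal_and_count=$(printf '%s\n' "$row" | cut -d ',' -f 2,3)

# A range of fields
date_to_animal=$(printf '%s\n' "$row" | cut -d ',' -f 1-2)

# No -d flag: fields are separated by tabs
second_tab_field=$(printf 'a\tb\tc\n' | cut -f 2)

echo "$animal"
echo "$animal_and_count"
echo "$date_to_animal"
echo "$second_tab_field"
```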
Let’s take a look at the animals.csv file in shell-lesson-data/exercise-data/animal-counts.
# cd to directory
cd ~/Desktop/shell-lesson-data/exercise-data/animal-counts
# look at file
head -n 3 animals.csv
2012-11-05,deer,5
2012-11-05,rabbit,22
2012-11-05,raccoon,7
Let’s keep only the animals and counts – fields 2 and 3 if we consider the comma as the field separator.
# cd to directory
cd ~/Desktop/shell-lesson-data/exercise-data/animal-counts
# cut to keep the second and third fields (-f), using comma as the field separator (-d)
cut -f2,3 -d ',' animals.csv
deer,5
rabbit,22
raccoon,7
rabbit,19
deer,2
fox,4
rabbit,16
bear,1
uniq – report or filter out repeated lines
Filters out repeated ADJACENT lines, but also allows for counting them, or ignoring a specific number of them.
| Command | Options/Flags | Arguments | 
|---|---|---|
| uniq | flags | input/file output/file |
The uniq command is case sensitive by default and removes duplicated lines only when they are adjacent. Thus, sorting the input first is recommended.
Some useful flags are below:
- use -c to add the count of the number of times the line occurs
- use -i to ignore case
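To show why adjacency matters, here is a short sketch with a made-up, unsorted list (the data and variable names are assumptions for illustration). The exact padding in front of the counts printed by -c can differ between systems.

```shell
# Made-up list in which the duplicates are not next to each other
animals=$(printf 'deer\nrabbit\ndeer\nrabbit\n')

# uniq alone removes nothing, because no two identical lines are adjacent
without_sort=$(printf '%s\n' "$animals" | uniq)

# Sorting first puts the duplicates side by side, so uniq can collapse them
with_sort=$(printf '%s\n' "$animals" | sort | uniq)

# -c adds how many times each line occurred
counts=$(printf '%s\n' "$animals" | sort | uniq -c)

# -i treats rabbit and Rabbit as the same line and keeps the first one
ignoring_case=$(printf 'rabbit\nRabbit\n' | uniq -i)

echo "$without_sort"
echo "$with_sort"
echo "$counts"
echo "$ignoring_case"
```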
If a file looked like this:
$ cat animals.txt
bear
deer
deer
fox
rabbit
rabbit
rabbit
raccoon
Then uniq, with counts, would output
uniq -c animals.txt
   1 bear
   2 deer
   1 fox
   3 rabbit
   1 raccoon
sort – order lines of a file
Sorts a file or input in a highly customizable way.
| Command | Options/Flags | Arguments | 
|---|---|---|
| sort | flags | file/input | 
The sort command is case sensitive by default and sorts lexicographically.
Some useful flags are below:
- use -t to specify the field separator
- use -f to ignore case
- use -c to check if a file is sorted
- use -k to specify a field to sort on
- use -u to keep unique lines
- use -n to perform a numeric sort
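The difference between the default and a numeric sort is easiest to see on the counts column. The sketch below uses the first three rows of animals.csv as inline input; the variable names are assumptions for illustration. Writing the key as 3,3 limits the sort key to field 3 only.

```shell
# First three rows of animals.csv, supplied inline
rows=$(printf '2012-11-05,deer,5\n2012-11-05,rabbit,22\n2012-11-05,raccoon,7\n')

# Default comparison on field 3: character by character, so 22 sorts before 5
lexical=$(printf '%s\n' "$rows" | sort -t , -k 3,3)

# -n compares field 3 as numbers, so 5 < 7 < 22
numeric=$(printf '%s\n' "$rows" | sort -t , -k 3,3 -n)

# -u keeps one copy of each line
unique=$(printf 'fox\ndeer\nfox\n' | sort -u)

echo "$lexical"
echo "$numeric"
echo "$unique"
```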
Let’s sort the animals file by the second field (-k), using the comma as the field separator (-t).
# cd to directory
cd ~/Desktop/shell-lesson-data/exercise-data/animal-counts
# sort on the second field (-k), using comma as the field separator (-t)
sort -t , -k 2 animals.csv
2012-11-07,bear,1
2012-11-06,deer,2
2012-11-05,deer,5
2012-11-06,fox,4
2012-11-07,rabbit,16
2012-11-06,rabbit,19
2012-11-05,rabbit,22
2012-11-05,raccoon,7
grep – globally search a regular expression and print
Returns filtered lines; with -v it returns the lines that do not match.
| Command | Options/Flags | Arguments |
|---|---|---|
| grep | flags | pattern path/to/file |
# cd to directory
cd ~/Desktop/shell-lesson-data/exercise-data/animal-counts
# give me the lines that do not have an animal that ends in r
grep -Ev ',\w+r,' animals.csv
2012-11-05,rabbit,22
2012-11-05,raccoon,7
2012-11-06,rabbit,19
2012-11-06,fox,4
2012-11-07,rabbit,16
Pipes
Pipes (|) are used to quickly connect unix commands by “piping” output of one command to the input of another.
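As a small concrete case, the sketch below chains the filters from the previous section on a few animals.csv rows supplied inline (the variable names are assumptions for illustration). Each command reads what the command before it printed.

```shell
# Five rows of animals.csv, supplied inline
rows=$(printf '2012-11-05,deer,5\n2012-11-05,rabbit,22\n2012-11-06,rabbit,19\n2012-11-06,deer,2\n2012-11-06,fox,4\n')

# Keep the animal column, then sort it and drop the duplicates
species=$(printf '%s\n' "$rows" | cut -d , -f 2 | sort -u)

# Keep the animal column, then count the lines that say rabbit
rabbit_sightings=$(printf '%s\n' "$rows" | cut -d , -f 2 | grep -c 'rabbit')

echo "$species"
echo "$rabbit_sightings"
```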
command1 | command2 | command3
Download the gtf
We will download the file with wget or curl.
In the terminal, type which wget, and which curl.
- If you see a path returned when you type one of those commands and press enter, then you have that command.
#check for wget
which wget
#check for curl
which curl
/Users/csifuentes/miniconda3/bin/wget
/usr/bin/curl
You can download the file as below, depending on the command you want to use.
# cd to ~/Desktop/shell-lesson-data
cd ~/Desktop/shell-lesson-data
# wget file (capital -O names the downloaded gzipped file example.gtf.gz)
wget -O example.gtf.gz ftp://ftp.ensemblgenomes.org/pub/release-39/fungi/gtf/fungi_basidiomycota1_collection/cryptococcus_neoformans_var_grubii_h99/Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf.gz
# cd to ~/Desktop/shell-lesson-data
cd ~/Desktop/shell-lesson-data
# curl file (lowercase -o names the downloaded gzipped file example.gtf.gz)
curl -o example.gtf.gz ftp://ftp.ensemblgenomes.org/pub/release-39/fungi/gtf/fungi_basidiomycota1_collection/cryptococcus_neoformans_var_grubii_h99/Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf.gz
You can unzip the file and view its first lines as below.
# cd to ~/Desktop/shell-lesson-data
cd ~/Desktop/shell-lesson-data
# gunzip file
gunzip example.gtf.gz
# cd to ~/Desktop/shell-lesson-data
cd ~/Desktop/shell-lesson-data
# view it
head example.gtf
#!genome-build CNA3
#!genome-version CNA3
#!genome-date 2015-11
#!genome-build-accession GCA_000149245.3
#!genebuild-last-updated 2015-11
1   ena gene    100 5645    .   -   .   gene_id "CNAG_04548"; gene_source "ena"; gene_biotype "protein_coding";
1   ena transcript  100 5645    .   -   .   gene_id "CNAG_04548"; transcript_id "AFR92135"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
1   ena exon    5494    5645    .   -   .   gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AFR92135-1";
1   ena CDS 5494    5645    .   -   0   gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; protein_id "AFR92135"; protein_version "1";
1   ena start_codon 5643    5645    .   -   0   gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
GTF files contain the following information, as columns (fields):
- chromosome name
- annotation source
- feature-type
- genomic start
- genomic end
- score
- strand
- genomic phase
- additional information (gene_id, etc.)
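Because the columns are separated by tabs, cut can pick them out by number without a -d flag. The sketch below rebuilds the first gene line from the head output above with explicit tabs (assuming the file is tab separated, as GTF files normally are); the variable names are for illustration only.

```shell
# The first gene line from the file, rebuilt with explicit tab characters
gtf_line=$(printf '1\tena\tgene\t100\t5645\t.\t-\t.\tgene_id "CNAG_04548"; gene_source "ena"; gene_biotype "protein_coding";')

# Field 3 is the feature type
feature=$(printf '%s\n' "$gtf_line" | cut -f 3)

# Fields 1, 4 and 5 are the chromosome name, genomic start and genomic end
location=$(printf '%s\n' "$gtf_line" | cut -f 1,4,5)

# Field 7 is the strand
strand=$(printf '%s\n' "$gtf_line" | cut -f 7)

echo "$feature"
echo "$location"
echo "$strand"
```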
Analysis
Using the commands we’ve learned thus far, let’s explore the example.gtf file to answer the following:
- How many chromosomes does the organism have?
- How many unique gene ids does the organism have?
- Which chromosome has the most genes?
How many chromosomes does the organism have?
# cd to directory
cd ~/Desktop/shell-lesson-data
# print the file to the screen to pipe it into grep
# remove the lines with #! because they'll get in the way
# cut to keep the first column (chromosomes)
# sort the chromosomes numerically, removing duplicates (a non-numeric name such as Mt sorts as 0)
cat example.gtf | grep -v '^#' | cut -f1 | sort -nu
Mt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
This organism has 14 chromosomes plus Mt.
How many genes does the organism have?
# cd to directory
cd ~/Desktop/shell-lesson-data
# print the file to the screen to pipe it into grep
# remove the lines with #! because they'll get in the way
# cut to keep the third column (feature type)
# sort values
# keep unique values, specifying counts of each unique value
cat example.gtf | grep -v '^#' | cut -f3 | sort | uniq -c
49063 CDS
52036 exon
6923 five_prime_utr
8497 gene
7860 start_codon
3167 stop_codon
7034 three_prime_utr
9348 transcript
This organism has 8497 genes.
Which chromosome has the most genes?
# cd to directory
cd ~/Desktop/shell-lesson-data
# print the file to the screen to pipe it into grep
# remove the lines with #! because they'll get in the way
# cut to keep the first and third columns (chromosome, feature type)
# sort values
# keep unique values, specifying counts of each unique value to get totals of feature types by chromosome
# pull out gene totals
cat example.gtf | grep -v '^#' | cut -f1,3 | sort | uniq -c | grep 'gene'
1033 1  gene
 474 10 gene
 663 11 gene
 326 12 gene
 322 13 gene
 417 14 gene
 706 2  gene
 725 3  gene
 503 4  gene
 812 5  gene
 640 6  gene
 641 7  gene
 639 8  gene
 554 9  gene
  42 Mt gene
Chromosome 1 has the most genes, 1033. Mt has the fewest, 42.
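Rather than scanning the counts by eye, two more pipe stages can rank them. The sketch below is one possible extension, run on a tiny made-up table (chromosome, source, feature type) instead of example.gtf; the data and variable names are assumptions for illustration. It adds sort -n -r (numeric, reversed so the largest count comes first) and head -n 1 to keep only the top line.

```shell
# Tiny made-up stand-in for the GTF: chromosome, source, feature type (tab separated)
rows=$(printf '1\tena\tgene\n1\tena\texon\n1\tena\tgene\n2\tena\tgene\nMt\tena\tgene\n1\tena\tgene\n2\tena\texon\n')

# Same steps as above, then rank by count and keep the first line
top_chromosome=$(printf '%s\n' "$rows" | cut -f 1,3 | sort | uniq -c | grep 'gene' | sort -n -r | head -n 1)

echo "$top_chromosome"
```

Here chromosome 1 comes out on top with 3 genes. On the real file, the same two extra stages could be appended to the command above.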
As we can see, piping commands together allows us to easily perform analyses as a set of commands. In the next lesson, we’ll learn about how we can use loops and scripts to do this even more efficiently.