Patterns, Filters, and Pipes

The Unix Shell

Christopher Sifuentes

Pattern Matching

Wildcards (special characters) can be used in several ways:

Standard wildcards (globbing) – matching to work on a set of files
Regular expressions – matching to work within files

Standard Expansion Patterns

Standard wildcards are used for globbing files – pulling together files to perform an action on them.

Standard Expansion Patterns

A table of commonly used wildcards.

Wildcard	Represents	Example
`*`	0 or more characters	`a*e` would match `ae`, `a103e`, `apple`
`?`	Any single character	`a?e` would match `a1e`, `ape`, `are`
`[]`	Any one of the characters within the brackets (listed without separators; a comma inside the brackets is matched literally)	`m[a3n]s` would match `mas`, `m3s`, `mns`; `[1-3]a` would match `1a`, `2a`, `3a`
`{}`	Any term within the braces (comma-separated list, no spaces)	`ls *{.doc,.pdf}` would list all files ending in `.doc` or `.pdf`
`[!]`	Any one character except (negate) those within the brackets	`ls *[!AB].txt` would match `123.txt`, `ZNEBF.txt`, `C.txt`
`\`	“Escapes” the following character, to treat it as a non-special character	`ls *\?*` would match names containing a literal `?`, such as `what?.txt`, NOT `what1.txt`
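The wildcards in the table can be tried safely in a scratch directory. The sketch below is only an illustration: the directory and file names are hypothetical, and `LC_ALL=C` is set so the expansion order is predictable. Brace expansion is left out because it is a feature of shells such as bash rather than of every `sh`.

```shell
# Use a fixed locale so glob results come back in a predictable order.
export LC_ALL=C

# Build a throwaway directory with a few hypothetical file names.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
touch ae apple a1e ape mas m3s mns A.txt C.txt

# echo prints whatever the shell expands each pattern to.
star_matches=$(echo a*e)         # * : zero or more characters
question_matches=$(echo a?e)     # ? : exactly one character
bracket_matches=$(echo m[a3n]s)  # [] : one character from the set
negated_matches=$(echo [!AB].txt) # [!] : one character NOT in the set

echo "a*e      -> $star_matches"
echo "a?e      -> $question_matches"
echo "m[a3n]s  -> $bracket_matches"
echo "[!AB].txt -> $negated_matches"
```

Using `echo` instead of `ls` makes it clear that the shell, not the command, performs the expansion.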

Standard Expansion Patterns

Q&A: How can we list all files in shell-lesson-data/north-pacific-gyre that end with .txt?

Answer

# change into directory 
cd ~/Desktop/shell-lesson-data/north-pacific-gyre

# use * wildcard to list all ending in .txt
ls *.txt

NENE01729A.txt
NENE01729B.txt
NENE01736A.txt
NENE01751A.txt
NENE01751B.txt
NENE01812A.txt
NENE01843A.txt
NENE01843B.txt
NENE01971Z.txt
NENE01978A.txt
NENE01978B.txt
NENE02018B.txt
NENE02040A.txt
NENE02040B.txt
NENE02040Z.txt
NENE02043A.txt
NENE02043B.txt
this.file.txt
this.txt

Standard Expansion Patterns

Q&A: List the files in shell-lesson-data/north-pacific-gyre that do not end with .txt.

Answer

# change into directory 
cd ~/Desktop/shell-lesson-data/north-pacific-gyre

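# [!.txt] matches ONE character that is not '.', 't', or 'x',
# so this lists names whose last character is none of those.
# It works here only because the other files end in .sh.
ls *[!.txt]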

goodiff.sh
goostats.sh
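Because `[!.txt]` tests a single character rather than the whole suffix, it can miss files. The sketch below, using hypothetical file names in a scratch directory, contrasts it with filtering the listing on the full `.txt` suffix (this borrows `grep -v`, which is introduced later in the lesson).

```shell
export LC_ALL=C

demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
# report.tx does not end in .txt, but its last character is 'x'.
touch data.txt goostats.sh report.tx

# Character-set negation: last character must not be '.', 't', or 'x'.
bracket_result=$(echo *[!.txt])

# Suffix filter: drop names that end in the literal string ".txt".
suffix_result=$(ls | grep -v '\.txt$' | paste -s -d ' ' -)

echo "bracket: $bracket_result"
echo "suffix:  $suffix_result"
```

The bracket version silently skips `report.tx`, while the suffix filter keeps it.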

Standard Expansion Patterns

Q&A: List the files in shell-lesson-data/north-pacific-gyre whose last two characters before the suffix are a digit lower than 5 followed by any character other than Z.

Answer

# change into directory 
cd ~/Desktop/shell-lesson-data/north-pacific-gyre

# [0-4] matches one digit below 5; [!Z] matches one character that is not Z
ls *[0-4][!Z].*

NENE01751A.txt
NENE01751B.txt
NENE01812A.txt
NENE01843A.txt
NENE01843B.txt
NENE02040A.txt
NENE02040B.txt
NENE02043A.txt
NENE02043B.txt

Regular Expressions (`regex`)

A complex form of pattern matching that combines wildcards to create powerful patterns for text matching and manipulation in files.

Tip

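Regular expressions are used with commands such as grep to search for text, which we’ll explain in a bit.

regex symbols are interpreted by the command itself (for example grep), not expanded by the shell, so the pattern should be quoted.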

What makes a pattern?

To efficiently represent a pattern, we need to develop a language that specifies

atom – the actual character that we want to match
positions – the location of this atom
number of times – how many times we see the atom
groups – sets of atoms matched together, so they can be captured and referenced later

Representing Atoms

Character classes are used to represent atoms.

Character class – Example	Matches
Non-special characters – `a`	`a` matches `a`
Dot – `.`	`.` matches any single character
Range – `[a-z]`	`[a-z]` matches any letter from `a` through `z`
Character set – `[abc]`	`[abc]` matches `a`, `b`, or `c`
Character set – `[[:alnum:]]`	`[[:alnum:]]` matches any alpha-numeric character
Character set – `[[:lower:]]`	`[[:lower:]]` matches any lowercase letter
Character set – `[[:space:]]`	`[[:space:]]` matches any whitespace
Character set – `[[:digit:]]`	`[[:digit:]]` matches any digit
Negated character set – `[^abc]`	`[^abc]` matches anything *except* `a`, `b`, or `c`
Whitespace – `\s`	`\s` matches any whitespace character
Non-whitespace – `\S`	`\S` matches any non-whitespace character
Word – `\w`	`\w` matches any single word character (alpha-numeric or underscore)
Non-word – `\W`	`\W` matches any single character that is *not* a word character
Digit – `\d`	`\d` matches any digit (Perl-style; not every grep supports it, so `[0-9]` or `[[:digit:]]` is the portable choice)
Non-digit – `\D`	`\D` matches any character that is *not* a digit (Perl-style, same caveat as `\d`)
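A few of these character classes can be checked against a short list of words. This is a minimal sketch with made-up input; it sticks to the POSIX classes and bracket sets, and `LC_ALL=C` is an assumption made so that `[a-z]` means exactly the lowercase ASCII letters.

```shell
export LC_ALL=C

# Hypothetical sample input, one entry per line.
words=$(printf '%s\n' 'apple' 'Banana' 'route 66' 'x-ray' '1745')

# Lines containing at least one digit.
with_digit=$(printf '%s\n' "$words" | grep '[[:digit:]]')

# Lines containing at least one whitespace character.
with_space=$(printf '%s\n' "$words" | grep '[[:space:]]')

# Dot: r, then any single character, then y.
dot_match=$(printf '%s\n' "$words" | grep 'r.y')

# Negated set: a character that is not a lowercase letter, digit, or space.
with_other=$(printf '%s\n' "$words" | grep '[^a-z0-9 ]')

printf 'digit:\n%s\nspace:\n%s\ndot:\n%s\nother:\n%s\n' \
  "$with_digit" "$with_space" "$dot_match" "$with_other"
```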

Positions

Anchors are used to specify the location of characters or set of characters – so the pattern will only match if the position also matches.

Anchor	Example(s)
Start of line/string – `^`	`^a` matches the `a` in `apple`, but not `sandal`
End of line/string – `$`	`a$` matches the `a` in `spa`, but not `space`
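The anchor examples from the table can be reproduced with grep on a four-word list. The input is illustrative only.

```shell
# The four words used in the anchor table, one per line.
anchor_words=$(printf '%s\n' 'apple' 'sandal' 'spa' 'space')

# ^a : the line must start with a.
starts_with_a=$(printf '%s\n' "$anchor_words" | grep '^a')

# a$ : the line must end with a.
ends_with_a=$(printf '%s\n' "$anchor_words" | grep 'a$')

# Both anchors together: the whole line must be exactly "spa".
exactly_spa=$(printf '%s\n' "$anchor_words" | grep '^spa$')

echo "starts with a: $starts_with_a"
echo "ends with a:   $ends_with_a"
echo "exactly spa:   $exactly_spa"
```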

Number of times

Quantifiers are used to specify the number of times the preceding character or set of characters is repeated.

Quantifier	Example(s)
0 or 1 time – `?`	`re?d` matches `rd`, `red`, NOT `reed`, `read`
0 or more times – `*`	`re*d` matches `rd`, `red`, `reed`, NOT `read`
1 or more times – `+`	`re+d` matches `red`, `reed`, NOT `rd`, `read`
Specified number of times – `{n}`	`re{1}d` matches `red`, NOT `rd`, `reed`, `read`
Range of times – `{1,3}`	`re{1,3}d` matches `red`, `reed`, NOT `rd`, `read`
Or (alternation rather than a quantifier) – `|`	`re(e|a)d` matches `reed`, `read`, NOT `rd`, `red`
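Each row of the quantifier table can be verified with `grep -E`. In this sketch the patterns are wrapped in `^` and `$` (an addition for the demonstration) so that each must match the whole word rather than part of it.

```shell
# The four candidate words from the table, one per line.
candidates=$(printf '%s\n' 'rd' 'red' 'reed' 'read')

match() {
  # Print the candidates matched by the extended regex given as $1.
  printf '%s\n' "$candidates" | grep -E "$1" | paste -s -d ' ' -
}

optional=$(match '^re?d$')        # e zero or one time
zero_or_more=$(match '^re*d$')    # e zero or more times
one_or_more=$(match '^re+d$')     # e one or more times
exactly_one=$(match '^re{1}d$')   # e exactly once
one_to_three=$(match '^re{1,3}d$') # e one to three times
alternation=$(match '^re(e|a)d$') # e or a after "re"

echo "?: $optional | *: $zero_or_more | +: $one_or_more"
echo "{1}: $exactly_one | {1,3}: $one_to_three | (e|a): $alternation"
```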

Groups and Reference

Matched atoms can be grouped together and referenced later, perhaps to keep or replace.

Grouping/Reference	Example(s)
Capture the group – `()`	`(re)d` groups `re` together
Reference the group – `\1`	`\1` references the first group captured
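As a rough sketch of capturing and referencing, the example below uses a back-reference inside a `grep -E` pattern and inside a `sed -E` replacement. `sed` is not covered in this lesson and is brought in here only as an assumption for illustration; back-references in extended patterns are supported by the common GNU and BSD tools but are not guaranteed everywhere.

```shell
# \1 in the pattern: any two characters immediately repeated (e.g. "anan").
doubled=$(printf '%s\n' 'banana' 'apple' 'mississippi' | grep -E '(..)\1')

# \1 in the replacement: keep the captured "re", replace the d with "ad".
replaced=$(printf '%s\n' 'red' | sed -E 's/(re)d/\1ad/')

# Two groups: swap the animal and the count around the comma.
swapped=$(printf '%s\n' 'deer,5' | sed -E 's/([a-z]+),([0-9]+)/\2,\1/')

echo "doubled:  $doubled"
echo "replaced: $replaced"
echo "swapped:  $swapped"
```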

Practicing with Regex

Learning regex takes time and practice, practice, practice!

Practicing with Regex

Q&A:

Which expression will select only `the` in the following?

“The great thing about learning is that the experience itself teaches you something, though it may not be the thing you wanted to learn.”

the
(T|t)e
[Tt]he
*he

Answer

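`the` – Yes. This matches only the lowercase `the`.
`(T|t)e` – No. This matches `Te` or `te` (as in `teaches`), not `the`.
`[Tt]he` – No. This will also match `The`.
`*he` – No. `*` is a quantifier and has nothing before it to repeat.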

Practicing with Regex

Q&A:

Which expression will select all of the following?

foxes boxes loxes

.oxes
[fbl]oxes
(f|b|l)oxes
*oxes

Answer

`.oxes` – Yes. `.` will match any single character in the first position.
`[fbl]oxes` – Yes. Uses character set matching.
`(f|b|l)oxes` – Yes. Uses or matching.
`*oxes` – No. `*` is a quantifier and has nothing before it to repeat.

Practicing with Regex

Q&A:

Which expression will select all of the following?

nd ned need

ne+d
ne?d
ne*d
ne.d

Answer

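`ne+d` – No. `+` matches `e` 1 or more times, so `nd` is missed.
`ne?d` – No. `?` matches `e` 0 or 1 times, so `need` is missed.
`ne*d` – Yes. `*` matches `e` 0 or more times.
`ne.d` – No. `.` matches exactly one character of any kind, so only `need` is matched.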

Practicing with Regex

For more complex practice, I recommend RegexOne.

Using regex with `grep`

Regular expressions are most effective when used with specific commands. One that we’ll learn about is called grep.

`grep` – search a regular expression and print

Searches for a pattern within a file and returns each line containing the pattern.

Command	Options/Flags	Arguments
`grep`	`flags`	`pattern` `/path/to/file`

`grep` – search a regular expression and print

Tip

By default, grep returns each line containing the pattern and is case-sensitive.

A few of the useful options are below:

use -i to perform case-insensitive matching
use -v to return the non-matching lines
use -w to match the pattern only as a whole word
use -A N to also return the N lines after each matching line
use -B N to also return the N lines before each matching line
use -E to use extended regular expressions
use -c to return the number of matching lines
use -n to prefix each matching line with its line number
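The options above can be compared side by side on a tiny input. The four lines below are invented for the demonstration; note how `-w` rejects `concatenate` even though it contains `cat`, and how `-c` counts lines rather than individual occurrences.

```shell
# Hypothetical four-line input.
notes=$(printf '%s\n' 'The cat sat' 'concatenate' 'a dog' 'the end')

ignore_case=$(printf '%s\n' "$notes" | grep -i 'the')   # The cat sat / the end
inverted=$(printf '%s\n' "$notes" | grep -v 'cat')      # lines without "cat"
whole_word=$(printf '%s\n' "$notes" | grep -w 'cat')    # "cat" as a word only
line_count=$(printf '%s\n' "$notes" | grep -c 'cat')    # number of matching lines
numbered=$(printf '%s\n' "$notes" | grep -n 'dog')      # line number prefix
with_after=$(printf '%s\n' "$notes" | grep -A 1 'concatenate') # match + 1 line

printf '%s\n' "-i:" "$ignore_case" "-v:" "$inverted" "-w:" "$whole_word" \
  "-c: $line_count" "-n: $numbered" "-A 1:" "$with_after"
```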

`grep` – search a regular expression and print

Let’s try this on a few files in our shell-lesson-data/exercise-data/creatures directory.

`grep` – search a regular expression and print

If we take a look at the top 5 lines of each file (head command) we see:

# cd into the directory
cd ~/Desktop/shell-lesson-data/exercise-data/creatures

# print the first 5 lines of each file
head -n 5 *

==> basilisk.dat <==
COMMON NAME: basilisk
CLASSIFICATION: basiliscus vulgaris
UPDATED: 1745-05-02
CCCCAACGAG
GAAACAGATC

==> minotaur.dat <==
COMMON NAME: minotaur
CLASSIFICATION: bos hominus
UPDATED: 1765-02-17
CCCGAAGGAC
CGACATCTCT

==> unicorn.dat <==
COMMON NAME: unicorn
CLASSIFICATION: equus monoceros
UPDATED: 1738-11-24
AGCCGGGTCG
CTTTACCTTA

`grep` – search a regular expression and print

Using grep, let’s pull out the common names line of all of the creatures.

# cd into the directory
cd ~/Desktop/shell-lesson-data/exercise-data/creatures

# grep COMMON NAME from all files ending in .dat 
grep 'COMMON NAME' *.dat

basilisk.dat:COMMON NAME: basilisk
minotaur.dat:COMMON NAME: minotaur
unicorn.dat:COMMON NAME: unicorn

`grep` – search a regular expression and print

Now we will count how many lines of each creature’s genomic sequence contain CCC. Note that -c counts matching lines, not individual occurrences.

# cd into the directory
cd ~/Desktop/shell-lesson-data/exercise-data/creatures

# count the lines containing CCC in all files ending in .dat
grep -c 'CCC' *.dat

basilisk.dat:22
minotaur.dat:18
unicorn.dat:22

`grep` – search a regular expression and print

What if we want the line that follows the common name unicorn? The -A 1 option returns each matching line plus the one line after it.

# cd into the directory
cd ~/Desktop/shell-lesson-data/exercise-data/creatures

# grep unicorn, plus one line after the match, from all files ending in .dat
grep -A 1 'unicorn' *.dat

unicorn.dat:COMMON NAME: unicorn
unicorn.dat-CLASSIFICATION: equus monoceros

`grep` – search a regular expression and print

What if we wanted anything updated in the 1740s? We need the -E option to use the extended regular expressions we covered earlier. Because `\d` is not supported by every version of grep, the pattern below uses `[0-9]` for digits.

# cd into the directory
cd ~/Desktop/shell-lesson-data/exercise-data/creatures

# grep dates in the 1740s from all files ending in .dat
grep -E '174[0-9]-[0-9]{2}-[0-9]{2}' *.dat

basilisk.dat:UPDATED: 1745-05-02

As we can see, grep and pattern matching are useful, but they become even more powerful if we combine them with filtering.

Filtering

In Unix, we can filter data in many ways. Here we’ll cover a few lightweight but useful commands to do so.

`cut` – filtering data from each line

Filters data (“cuts”) based upon a separator.

Command	Options/Flags	Arguments
`cut`	`flags`	`file/input`

Tip

The cut command separates fields by tabs by default.

Some useful flags are below:

use -d to set the delimiter between fields to another character
use -f to list the fields to keep (can be a list or a range: -f 2,3 keeps fields 2 and 3; -f 3-5 keeps fields 3 through 5)
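A quick way to see both flags, and the tab default, is to feed `cut` a single line. The record below mirrors the animals.csv format used in this lesson; the tab-separated line is made up.

```shell
record='2012-11-05,deer,5'

# -d sets the delimiter to a comma, -f picks the field(s) to keep.
animal=$(printf '%s\n' "$record" | cut -d ',' -f 2)
animal_and_count=$(printf '%s\n' "$record" | cut -d ',' -f 2,3)
date_to_animal=$(printf '%s\n' "$record" | cut -d ',' -f 1-2)

# Without -d, fields are separated by tabs.
tab_fields=$(printf 'a\tb\tc\n' | cut -f 1,3)

echo "$animal"
echo "$animal_and_count"
echo "$date_to_animal"
echo "$tab_fields"
```

Kept fields are printed with the same delimiter that separated them in the input.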

`cut` – filtering data from each line

Let’s take a look at the animals.csv file in shell-lesson-data/exercise-data/animal-counts, using head -n 3 to look at the first 3 lines.

# cd to directory
cd ~/Desktop/shell-lesson-data/exercise-data/animal-counts

# look at file
head -n 3 animals.csv

2012-11-05,deer,5
2012-11-05,rabbit,22
2012-11-05,raccoon,7

`cut` – filtering data from each line

Now let’s keep only the animals and counts – fields 2 and 3 if we consider the comma as the field separator.

We use the -f flag to set the columns to keep and the -d flag to tell the command to use a comma as a field separator.

# cd to directory
cd ~/Desktop/shell-lesson-data/exercise-data/animal-counts

# cut to keep the second and third fields (-f), using comma as a field separator (-d)
cut -f2,3 -d ',' animals.csv

deer,5
rabbit,22
raccoon,7
rabbit,19
deer,2
fox,4
rabbit,16
bear,1

`uniq` – report or filter out repeated lines

Filters out repeated ADJACENT lines, but also allows for counting them, or ignoring a specific number of them.

Command	Options/Flags	Arguments
`uniq`	`flags`	`input/file` `output/file`

Tip

The uniq command is case sensitive by default and only collapses duplicate lines that are adjacent. Thus, sorting the input first is recommended.

Some useful flags are below:

use -c to add the count of the number of times the line occurs
use -i to ignore case

`uniq` – report or filter out repeated lines

Let’s use uniq to count how many times each unique line occurs.

If a file looked like this:

cat animals.txt
bear
deer
deer
fox
rabbit
rabbit
rabbit
raccoon

Then uniq, with counts (using the -c flag) would output

uniq -c animals.txt
  1 bear
  2 deer
  1 fox
  3 rabbit
  1 raccoon
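The file above is already sorted, which is why the counts come out right. The sketch below, with a hypothetical unsorted list, shows what happens otherwise. The `sed` step only strips the leading padding that `uniq -c` adds, since its width differs between systems.

```shell
export LC_ALL=C

# Hypothetical unsorted sightings: duplicates are not all adjacent.
sightings=$(printf '%s\n' 'deer' 'rabbit' 'deer' 'rabbit' 'rabbit')

# uniq alone only merges neighbouring duplicates.
unsorted_counts=$(printf '%s\n' "$sightings" | uniq -c | sed 's/^ *//')

# Sorting first brings identical lines together.
sorted_counts=$(printf '%s\n' "$sightings" | sort | uniq -c | sed 's/^ *//')

printf 'without sort:\n%s\nwith sort:\n%s\n' "$unsorted_counts" "$sorted_counts"
```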

`sort` – order lines of a file

Sorts a file or input in a highly customizable way.

Command	Options/Flags	Arguments
`sort`	`flags`	`file/input`

Tip

The sort command is case sensitive by default and sorts lexicographically.

Some useful flags are below:

use -t to specify the field separator
use -f to ignore case
use -c to check if a file is sorted
use -k to specify a field to sort on
use -u to keep unique lines
use -n to perform a numeric sort

`sort` – order lines of a file

Let’s sort the animals file by the second field (the -k flag), using the comma as the field separator (-t flag).

# cd to directory
cd ~/Desktop/shell-lesson-data/exercise-data/animal-counts

# sort starting at the second field (-k), using comma as a field separator (-t)
# note: -k 2 uses everything from field 2 to the end of the line as the sort key
sort -t , -k 2 animals.csv

2012-11-07,bear,1
2012-11-06,deer,2
2012-11-05,deer,5
2012-11-06,fox,4
2012-11-07,rabbit,16
2012-11-06,rabbit,19
2012-11-05,rabbit,22
2012-11-05,raccoon,7
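In the output above the rabbit counts happen to come out in order, but only because the key runs to the end of the line and is compared as text. A sketch of a stricter variant, assuming made-up rows in the same format, limits the first key to field 2 and adds a numeric key on field 3.

```shell
export LC_ALL=C

# Hypothetical rows in the animals.csv layout: date,animal,count.
counts=$(printf '%s\n' \
  '2012-11-05,deer,5' \
  '2012-11-06,deer,2' \
  '2012-11-05,rabbit,22' \
  '2012-11-07,bear,1' \
  '2012-11-06,rabbit,9')

# Key from field 2 to end of line, compared as text: "22" sorts before "9".
by_text_key=$(printf '%s\n' "$counts" | sort -t , -k 2)

# Key 1: field 2 only. Key 2: field 3 only, compared numerically.
by_animal_then_count=$(printf '%s\n' "$counts" | sort -t , -k 2,2 -k 3,3n)

printf 'text key:\n%s\n\nanimal, then numeric count:\n%s\n' \
  "$by_text_key" "$by_animal_then_count"
```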

`grep` – search a regular expression and print

We revisit grep here to highlight that it can not only return matching lines, but also return only the non-matching lines using the -v flag.

Command	Options/Flags	Arguments
`grep`	`flags`	`pattern` `/path/to/file`

# cd to directory
cd ~/Desktop/shell-lesson-data/exercise-data/animal-counts

# give me the lines that do not have an animal that ends in r
grep -Ev ',\w+r,' animals.csv

2012-11-05,rabbit,22
2012-11-05,raccoon,7
2012-11-06,rabbit,19
2012-11-06,fox,4
2012-11-07,rabbit,16

Pipes

Pipes (|) are used to quickly connect Unix commands by “piping” the output of one command to the input of another.

command1 | command2 | command3

Piping or chaining together commands in this way allows us to make even greater use of the commands we just learned about. :)
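As a small warm-up before the larger example, the sketch below chains `cut`, `sort`, and `wc` to list and count the distinct animals in a few rows. The rows are illustrative and follow the animals.csv layout from earlier.

```shell
export LC_ALL=C

# Hypothetical rows: date,animal,count.
animals=$(printf '%s\n' \
  '2012-11-05,deer,5' \
  '2012-11-05,rabbit,22' \
  '2012-11-06,rabbit,19' \
  '2012-11-06,deer,2' \
  '2012-11-06,fox,4')

# cut keeps the animal column; sort -u orders it and drops duplicates.
species=$(printf '%s\n' "$animals" | cut -d ',' -f 2 | sort -u)

# Add one more stage to count the distinct animals (tr removes padding).
species_count=$(printf '%s\n' "$animals" | cut -d ',' -f 2 | sort -u | wc -l | tr -d ' ')

printf '%s\n' "$species"
echo "distinct animals: $species_count"
```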

We’ll illustrate this by working with a GTF file.

Pipes

We will download the file with wget or curl.

In the terminal, type which wget, and which curl.

If you see a path returned when you type one of those commands and press enter, then you have that command.

#check for wget
which wget

#check for curl
which curl

/Users/csifuentes/miniconda3/bin/wget
/usr/bin/curl

Pipes

You can download the file with either command, as below.

wget

# cd to ~/Desktop/shell-lesson-data
cd ~/Desktop/shell-lesson-data

# wget file (using capital -O as the flag to save the download as example.gtf.gz)
wget -O example.gtf.gz ftp://ftp.ensemblgenomes.org/pub/release-39/fungi/gtf/fungi_basidiomycota1_collection/cryptococcus_neoformans_var_grubii_h99/Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf.gz

curl

# cd to ~/Desktop/shell-lesson-data
cd ~/Desktop/shell-lesson-data

# curl file (using lowercase -o as the flag to save the download as example.gtf.gz)
curl -o example.gtf.gz ftp://ftp.ensemblgenomes.org/pub/release-39/fungi/gtf/fungi_basidiomycota1_collection/cryptococcus_neoformans_var_grubii_h99/Cryptococcus_neoformans_var_grubii_h99.CNA3.39.gtf.gz

Pipes

You can unzip the file with gunzip, as below, and then view the first lines with head.

# cd to ~/Desktop/shell-lesson-data
cd ~/Desktop/shell-lesson-data

# gunzip file
gunzip example.gtf.gz

# cd to ~/Desktop/shell-lesson-data
cd ~/Desktop/shell-lesson-data

# view it
head example.gtf

#!genome-build CNA3
#!genome-version CNA3
#!genome-date 2015-11
#!genome-build-accession GCA_000149245.3
#!genebuild-last-updated 2015-11
1   ena gene    100 5645    .   -   .   gene_id "CNAG_04548"; gene_source "ena"; gene_biotype "protein_coding";
1   ena transcript  100 5645    .   -   .   gene_id "CNAG_04548"; transcript_id "AFR92135"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";
1   ena exon    5494    5645    .   -   .   gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; exon_id "AFR92135-1";
1   ena CDS 5494    5645    .   -   0   gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding"; protein_id "AFR92135"; protein_version "1";
1   ena start_codon 5643    5645    .   -   0   gene_id "CNAG_04548"; transcript_id "AFR92135"; exon_number "1"; gene_source "ena"; gene_biotype "protein_coding"; transcript_source "ena"; transcript_biotype "protein_coding";

Pipes

GTF files contain the following information, as columns (fields)

chromosome name
annotation source
feature-type
genomic start
genomic end
score
strand
genomic phase
additional information (gene_id, etc.)

Pipes

Using the commands we’ve learned thus far, let’s explore the example.gtf file to answer the following:

How many chromosomes does the organism have?
How many unique gene ids does the organism have?
Which chromosome has the most genes?

Pipes

How many chromosomes does the organism have?

# cd to directory
cd ~/Desktop/shell-lesson-data

# print the file to the screen to pipe it into grep
# remove the lines with #! because they'll get in the way
# cut to keep the first column (chromosomes)
# sort the chromosomes numerically, removing duplicates
cat example.gtf | grep -v '^#' | cut -f1 | sort -nu

This organism has 14 + Mt chromosomes.

Pipes

How many genes does the organism have?

# cd to directory
cd ~/Desktop/shell-lesson-data

# print the file to the screen to pipe it into grep
# remove the lines with #! because they'll get in the way
# cut to keep the third column (feature type)
# sort values
# keep unique values, specifying counts of each unique value
cat example.gtf | grep -v '^#' | cut -f3 | sort | uniq -c

49063 CDS
52036 exon
6923 five_prime_utr
8497 gene
7860 start_codon
3167 stop_codon
7034 three_prime_utr
9348 transcript

This organism has 8497 genes.

Pipes

Which chromosome has the most genes?

# cd to directory
cd ~/Desktop/shell-lesson-data

# print the file to the screen to pipe it into grep
# remove the lines with #! because they'll get in the way
# cut to keep the first and third columns (chromosome, feature type)
# sort values
# keep unique values, specifying counts of each unique value to get totals of feature types by chromosome
# pull out gene totals
cat example.gtf | grep -v '^#' | cut -f1,3 | sort | uniq -c | grep 'gene'

1033 1  gene
 474 10 gene
 663 11 gene
 326 12 gene
 322 13 gene
 417 14 gene
 706 2  gene
 725 3  gene
 503 4  gene
 812 5  gene
 640 6  gene
 641 7  gene
 639 8  gene
 554 9  gene
  42 Mt gene

Chromosome 1 has the most genes, 1033. Mt has the fewest, 42.
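Instead of reading the largest count off the list by eye, two more pipeline stages can rank the chromosomes and keep the top one. This is a sketch on a tiny, made-up GTF-like input (three tab-separated columns only), not on the real example.gtf.

```shell
export LC_ALL=C

# Hypothetical mini input: one header line, then chromosome, source, feature.
mini_gtf=$(
  printf '#!genome-build CNA3\n'
  printf '%s\t%s\t%s\n' \
    1 ena gene  1 ena exon  1 ena gene \
    2 ena gene  Mt ena gene
)

# drop header lines, keep chromosome and feature, keep only gene rows,
# count genes per chromosome, sort counts from high to low, keep the top row
top_chromosome=$(printf '%s\n' "$mini_gtf" \
  | grep -v '^#' \
  | cut -f1,3 \
  | grep 'gene$' \
  | cut -f1 \
  | sort \
  | uniq -c \
  | sort -k1,1nr \
  | head -n 1 \
  | sed 's/^ *//')

# Prints "<count> <chromosome>" for the chromosome with the most genes.
echo "$top_chromosome"
```

The final `sed` only trims the padding that `uniq -c` puts in front of each count.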

Pipes

As we can see, piping commands together allows us to easily perform analyses as a set of commands. In the next lesson, we’ll learn about how we can use loops and scripts to do this even more efficiently.
