Introduction to the Command Line for Genomics

1 - Check your location

pwd

2 - Change to the directory which contains the files.

cd ~/Desktop/data

3 - List the files to see the content of the directory

ls -l

4 - What is the structure of the `apple.genome` file?

Looking at the apple.genome file we see that there is genomic information in multi-fasta format.

more apple.genome

5 - How many chromosomes are there in the example genome?

The command grep allows us to search a file line by line looking for a specific pattern.

In the apple.genome file, each new chromosome is designated by a “>” symbol followed by an identifier then the sequence. We can use grep to search the file line by line for the pattern “>” to determine the number of chromosomes

grep ">" apple.genome

6 - How do we count the output of `grep`?

What if this file had 100 chromosomes, 1000 chromosomes, it is helpful to get a count of the lines that grep found. there are 2 options for this:

grep -c ">" apple.genome

grep ">" apple.genome | wc -l

7 - How many genes are there?

In the file apple.genes each line represents a different transcript variant, the columns for apple.genes are:

“geneID”, “transcriptID”, “chrom”, “strand”, “transcriptStart”, “transcriptEnd”, “transcriptStructure”

Look at column #1 (field 1; “geneID”) since counting the unique ‘geneID’ will give us the number of distinct genes. To analyze column #1 cut it, this trims the file to a one column file, then we will sort using -u flag to only sort distinct (unique) names, finally we will pipe that data to wc with -l flag to count the number of lines. This will give us the number of unique genes in apple.genes.

cut -f1 apple.genes | sort -u | wc -l

8 - How many transcripts in the apple genome?

Look at the number of distinct names in column#2 (field2;“transcriptID”). Following the previous approach:

cut -f2 apple.genes | sort -u | wc -l

9- Which genes have single variants?

Look at column#1 because the number of splice variants for a gene is given by the number of instances of a geneID in column#1. ex: if a gene has a single variant then there is one corresponding geneID, if it has multiple variants the geneID appears multiple times in column#1. To determine the number of single variants extract only the genes that have exactly one line. This can be done using uniq a command which reports the lines that are repeated in a file, the -c flag gives a count of the number of times the line occurred in the input.

cut -f1 apple.genes | sort | uniq -c | grep -c "1 "

10- Which genes have multiple variants?

To answer this slightly change up the approach from #9 above by counting the number of genes with more than one line. grep -v allows us to find all lines not matching the pattern given

cut -f1 apple.genes | sort | uniq -c | grep -v "1 " | wc -l

11 - How many genes are on the +/- strand?

Now that we know about the genes and their variants, we want to find out which strand of the DNA has more genes on it. Column#4 has the strand information but merely looking at column#4 alone will not give us the answer since some genes have multiple transcript variants all of which are located on the same strand so we have to remove redundancy. After cutting and sorting we will then cut again just the strand column and count the number of entities on the + or -.

cut -f1,4 apple.genes | sort -u | cut -f2 | sort | uniq -c

12 - How many genes are on each chromosome?

Similar to the prior question we will cut columns to isolate (subset) geneID’s and chrom

cut -f1,3 apple.genes | sort -u | cut -f2 | sort | uniq -c

13 - How many transcripts are on each chromosome?

To answer this we will approach the question in a similar way to the prior questions but since we are looking to find transcripts per chromosome we will need to examine column#2 and column#3.

cut -f2,3 apple.genes | sort -u | cut -f2 | sort | uniq -c

14 - Which genes are common between experimental conditions, and which are specific to either one or another experimental condition.

To execute on this task we will learn about a new command not covered yet. comm The comm utility reads file1 and file2, which should be sorted lexically,and produces three text columns as output: lines only in file1; lines only in file2; and lines in both files. First we will find which genes are in common between condition A and condition B. GeneID’s are found in column#1 of the apple.condition* files. We need to isolate the unique geneID’s of each condition to compare between them. The best way to approach this is sending the output of the sort to a new file. After doing this for each condition we can then try out the new command comm. To limit what output is displayed from comm use option (-1, -2, -3)

cut -f1 apple.conditionA | sort -u > apple.condA.sorted.genes
cut -f1 apple.conditionB | sort -u > apple.condB.sorted.genes
cut -f1 apple.conditionC | sort -u > apple.condC.sorted.genes
comm -1 -2 apple.condA.sorted.genes apple.condB.sorted.genes | wc -l

Using the options with comm and the special purpose files we created, find out how many genes correspond to condition A or condition B.

Condition A:

comm -3 -2 apple.condA.sorted.genes apple.condB.sorted.genes | wc -l

Condition B:

comm -3 -1 apple.condA.sorted.genes apple.condB.sorted.genes | wc -l

How many genes are in common in all 3 conditions? we have to think of an approach that doesnt use comm as looking at all 3 conditions doesnt allow for a pairwise comparison. To do this we can create a list of genes for each of the 3 conditions. To combine the gene lists that came from the sorted files we would use the command cat which combines files together end to end “concatenating” them. The rest is determining the counts for each gene, and then determining the number of entries with that appear “3” times.

cat apple.cond{A,B,C}.sorted.genes | sort | uniq -c | grep -c " 3 "