Working with Files and Directories II

Objectives
  • View, search within, copy, move, and rename files. Create new directories.
  • Make a file read only.
Questions
  • How can I view and search file contents?
  • How can I create, copy and delete files and directories?
  • How can I control who has permission to modify a file?

Details on the FASTQ format

Although it looks complicated (and it is), it’s easy to understand the FASTQ format with a little decoding. Some rules about the format include:

Line  Description
1     Always begins with ‘@’, followed by information about the read
2     The actual DNA sequence
3     Always begins with ‘+’, sometimes followed by the same information as in line 1
4     A string of characters representing the quality scores; must have the same number of characters as line 2

We can view the first complete read in one of the files in our dataset by using head to look at the first four lines.

$ head -n 4 SRR098026.fastq
@SRR098026.1 HWUSI-EAS1599_1:2:1:0:968 length=35
NNNNNNNNNNNNNNNNCNNNNNNNNNNNNNNNNNN
+SRR098026.1 HWUSI-EAS1599_1:2:1:0:968 length=35
!!!!!!!!!!!!!!!!#!!!!!!!!!!!!!!!!!!

All but one of the nucleotides in this read are unknown (N). This is a pretty bad read!

Line 4 shows the quality for each nucleotide in the read. Quality is interpreted as the probability of an incorrect base call (e.g. 1 in 10) or, equivalently, the base call accuracy (e.g. 90%). To make it possible to line up each individual nucleotide with its quality score, the numerical score is converted into a code where each individual character represents the numerical quality score for an individual nucleotide. For example, in the line above, the quality score line is:

!!!!!!!!!!!!!!!!#!!!!!!!!!!!!!!!!!!

The # character and each of the ! characters represent the encoded quality for an individual nucleotide. The numerical value assigned to each of these characters depends on the sequencing platform that generated the reads. The sequencing machine used to generate our data uses the standard Sanger quality PHRED score encoding (the encoding used by Illumina version 1.8 onwards). Each character is assigned a quality score between 0 and 42 as shown in the chart below.

Quality encoding: !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJK
                  |         |         |         |         |
Quality score:    0........10........20........30........40..                          

Each quality score represents the probability that the corresponding nucleotide call is incorrect. The quality score is logarithmic, so a quality score of 10 reflects a base call accuracy of 90%, while a quality score of 20 reflects a base call accuracy of 99%. These probability values are produced by the base calling algorithm and depend on how much signal was captured for the base incorporation.
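
The logarithmic relationship can be written as P = 10^(-Q/10), where Q is the quality score and P is the probability that the base call is wrong. The following is a minimal sketch of that conversion, assuming awk is available on your system (awk is not otherwise used in this lesson). The function name phred_to_error is a hypothetical helper, not part of any tool.

```shell
# Hypothetical helper: convert a Phred quality score (Q) into the
# probability (P) that the base call is wrong, using P = 10^(-Q/10).
phred_to_error() {
    awk -v q="$1" 'BEGIN { printf "%g\n", 10 ^ (-q / 10) }'
}

phred_to_error 10   # error probability for Q10 (90% accuracy)
phred_to_error 20   # error probability for Q20 (99% accuracy)
phred_to_error 30   # error probability for Q30 (99.9% accuracy)
```

Each additional 10 points of quality score divides the error probability by 10.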

Looking back at our read:

@SRR098026.1 HWUSI-EAS1599_1:2:1:0:968 length=35
NNNNNNNNNNNNNNNNCNNNNNNNNNNNNNNNNNN
+SRR098026.1 HWUSI-EAS1599_1:2:1:0:968 length=35
!!!!!!!!!!!!!!!!#!!!!!!!!!!!!!!!!!!

we can now see that the quality of each of the Ns is 0 and the quality of the only nucleotide call (C) is also very poor (# = a quality score of 2). This is indeed a very bad read.
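
If you would rather not count positions in the chart, the score can also be worked out from the character itself. In this encoding the first character of the chart, !, has character code 33 and stands for a score of 0, so the score is the character code minus 33. The sketch below assumes that offset of 33; decode_quality is a hypothetical helper name.

```shell
# Hypothetical helper: decode one quality character (offset-33 encoding).
# A leading quote in a printf argument yields the character's numeric code.
decode_quality() {
    code=$(printf '%d' "'$1")
    echo $((code - 33))
}

decode_quality '!'   # the quality character of each N in our read
decode_quality '#'   # the quality character of the single C in our read
```

The two calls print the scores for the N positions and for the C, matching the values read off the chart.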

Creating, moving, copying, and removing

Now we can move around in the file structure, look at files, and search files. But what if we want to copy files or move them around or get rid of them? Most of the time, you can do these sorts of file manipulations without the command line, but there will be some cases (like when you’re working with a remote computer like we are for this lesson) where it will be impossible. You’ll also find that you may be working with hundreds of files and want to do similar manipulations to all of those files. In cases like this, it’s much faster to do these operations at the command line.

Copying Files

When working with computational data, it’s important to keep a safe copy of that data that can’t be accidentally overwritten or deleted. For this lesson, our raw data is our FASTQ files. We don’t want to accidentally change the original files, so we’ll make a copy of them and change the file permissions so that we can read from, but not write to, the files.

First, let’s make a copy of one of our FASTQ files using the cp command.

Navigate to the shell_data/untrimmed_fastq directory and enter:

$ cp SRR098026.fastq SRR098026-copy.fastq
$ ls
SRR097977.fastq  SRR098026-copy.fastq  SRR098026.fastq

We now have two copies of the SRR098026.fastq file, one of them named SRR098026-copy.fastq. We’ll move this file to a new directory called backup where we’ll store our backup data files.
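
One optional way to confirm that a copy really matches the original is the cmp command, which is not covered in this lesson. The sketch below works in a scratch directory on a small stand-in file (demo.fastq is a hypothetical name) so that it does not touch the real dataset.

```shell
# Work in a scratch directory so the real data is untouched.
scratch=$(mktemp -d)
cd "$scratch"

# Stand-in for a FASTQ file: the single read shown earlier in this lesson.
printf '%s\n' '@SRR098026.1 HWUSI-EAS1599_1:2:1:0:968 length=35' \
    'NNNNNNNNNNNNNNNNCNNNNNNNNNNNNNNNNNN' \
    '+SRR098026.1 HWUSI-EAS1599_1:2:1:0:968 length=35' \
    '!!!!!!!!!!!!!!!!#!!!!!!!!!!!!!!!!!!' > demo.fastq

cp demo.fastq demo-copy.fastq

# cmp exits with status 0 (and prints nothing) when the files are identical.
if cmp demo.fastq demo-copy.fastq; then
    echo "copy matches original"
fi
```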

Creating Directories

The mkdir command is used to make a directory. Enter mkdir followed by a space, then the directory name you want to create:

$ mkdir backup
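
Running mkdir on a directory that already exists reports an error. As a small aside, the -p option makes mkdir succeed in that case and also creates any missing parent directories. The sketch below runs in a scratch directory; the nested path archive/2024/raw is a hypothetical example.

```shell
# Work in a scratch directory so nothing in shell_data is changed.
scratch=$(mktemp -d)
cd "$scratch"

mkdir backup
# Plain mkdir would fail here because backup already exists;
# with -p the command succeeds and leaves the directory as it is.
mkdir -p backup
# -p also creates missing parent directories in one step.
mkdir -p archive/2024/raw
ls
```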

Moving / Renaming

We can now move our backup file to this directory. We can move files around using the command mv:

$ mv SRR098026-copy.fastq backup
$ ls backup
SRR098026-copy.fastq

The mv command is also how you rename files. Let’s rename this file to make it clear that this is a backup:

$ cd backup
$ mv SRR098026-copy.fastq SRR098026-backup.fastq
$ ls
SRR098026-backup.fastq
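
The move and the rename shown above can also be combined into a single mv command by giving a destination that includes the new file name. This is a minimal sketch using an empty stand-in file in a scratch directory rather than the real data.

```shell
# Work in a scratch directory so the real data is untouched.
scratch=$(mktemp -d)
cd "$scratch"
touch SRR098026-copy.fastq      # empty stand-in for the copied file
mkdir backup

# A destination that includes a new file name moves and renames at once.
mv SRR098026-copy.fastq backup/SRR098026-backup.fastq
ls backup
```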

File Permissions

We’ve now made a backup copy of our file, but just because we have two copies, it doesn’t make us safe. We can still accidentally delete or overwrite both copies. To make sure we can’t accidentally mess up this backup file, we’re going to change the permissions on the file so that we’re only allowed to read (i.e. view) the file, not write to it (i.e. make new changes).

View the current permissions on a file using the -l (long) flag for the ls command:

$ ls -l
-rw-r--r-- 1 <username> 43332 <last modified date time> SRR098026-backup.fastq

The first part of the output for the -l flag gives you information about the file’s current permissions. There are ten slots in the permissions list. The first character in this list is related to file type, not permissions, so we’ll ignore it for now. The next three characters relate to the permissions that the file owner has, the next three relate to the permissions for group members, and the final three characters specify what other users outside of your group can do with the file. We’re going to concentrate on the three positions that deal with your permissions (as the file owner).
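
To make the ten slots concrete, the sketch below splits an example permission string into its four parts. It assumes the cut command, which is not otherwise used in this lesson, and the variable names are illustrative only.

```shell
# Example permission string as printed in the first column of ls -l.
perms='-rw-r--r--'

file_type=$(printf '%s\n' "$perms" | cut -c1)    # - means a regular file
owner=$(printf '%s\n' "$perms" | cut -c2-4)      # permissions for the file owner
group=$(printf '%s\n' "$perms" | cut -c5-7)      # permissions for group members
others=$(printf '%s\n' "$perms" | cut -c8-10)    # permissions for all other users

echo "type=$file_type owner=$owner group=$group others=$others"
```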

Permissions breakdown

Here the three positions that relate to the file owner are rw-. The r means that you have permission to read the file, and the w means that you have permission to write to (i.e. make changes to) the file. The third position is a -, indicating that you don’t have the permission encoded by that position. This is where the x, or executable, permission is stored; it controls your ability to run files that are programs or to cd into a directory.

Our goal for now is to change permissions on this file so that you no longer have w or write permissions. We can do this using the chmod (change mode) command and subtracting (-) the write permission -w.

$ chmod -w SRR098026-backup.fastq
$ ls -l 
-r--r--r-- 1 <username> 43332 <last modified date time> SRR098026-backup.fastq
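
chmod also accepts a letter in front of the operator to say whose permissions should change: u for the owner, g for the group, o for others, and a for all. The sketch below removes and then restores the owner's write permission on a hypothetical stand-in file. It first sets the permissions to rw-r--r-- with the numeric form chmod 644, which this lesson does not cover, so that the starting point is the same on any system.

```shell
# Work in a scratch directory so the real data is untouched.
scratch=$(mktemp -d)
cd "$scratch"
touch demo-backup.fastq          # hypothetical stand-in file
chmod 644 demo-backup.fastq      # start from rw-r--r-- regardless of defaults

chmod u-w demo-backup.fastq      # remove write permission for the owner only
ls -l demo-backup.fastq | cut -c1-10

chmod u+w demo-backup.fastq      # give write permission back to the owner
ls -l demo-backup.fastq | cut -c1-10
```

The + operator is the counterpart of -, so a write-protected backup can be made writable again when you really do need to change it.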

Removing

To prove to ourselves that we no longer have the ability to modify this file, try deleting it with the rm command:

$ rm SRR098026-backup.fastq

You’ll be asked if you want to override your file permissions:

rm: remove write-protected regular file ‘SRR098026-backup.fastq’?

You should enter n for no, and the file will not be deleted. If you enter y, the file will be deleted. This gives us an extra measure of security, as there is one more step between us and deleting our data files.

Important: The rm command permanently removes the file. Be careful with this command. It doesn’t just nicely put the files in the Trash. They’re really gone.
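
If you would like rm to ask before every deletion, not only for write-protected files, the -i (interactive) option does that. The sketch below uses a hypothetical stand-in file in a scratch directory and answers n through a pipe so that it can run unattended; at a real prompt you would type the answer yourself.

```shell
# Work in a scratch directory so the real data is untouched.
scratch=$(mktemp -d)
cd "$scratch"
touch demo.fastq                 # hypothetical stand-in file

# -i makes rm ask for confirmation before removing each file.
# Answering n leaves the file in place.
printf 'n\n' | rm -i demo.fastq

ls
```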

By default, rm will not delete directories. You can tell rm to delete a directory using the -r (recursive) option. Let’s delete the backup directory we just made.

Enter the following command:

$ cd ..
$ rm -r backup

This will delete not only the directory, but all files within the directory. If you have write-protected files in the directory, you will be asked whether you want to override your permission settings.
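
When you only expect a directory to be empty, a more cautious alternative to rm -r is the rmdir command, which refuses to remove a directory that still contains files. The sketch below contrasts the two in a scratch directory; the directory and file names are hypothetical.

```shell
# Work in a scratch directory so the real data is untouched.
scratch=$(mktemp -d)
cd "$scratch"
mkdir empty_dir full_dir
touch full_dir/demo.fastq

rmdir empty_dir                  # succeeds: the directory is empty
rmdir full_dir 2>/dev/null || echo "full_dir is not empty, so rmdir left it alone"
rm -r full_dir                   # removes the directory and everything in it
```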

Exercise

Starting in the shell_data/untrimmed_fastq/ directory, do the following:

  1. Make sure that you have deleted your backup directory and all files it contains.
  2. Create a backup of each of your FASTQ files using cp. (Note: You’ll need to do this individually for each of the two FASTQ files. We haven’t learned yet how to do this with a wildcard.)
  3. Use a wildcard to move all of your backup files to a new backup directory.
  4. Change the permissions on all of your backup files to be write-protected.

Solution

  1. rm -r backup
  2. cp SRR098026.fastq SRR098026-backup.fastq and cp SRR097977.fastq SRR097977-backup.fastq
  3. mkdir backup and mv *-backup.fastq backup
  4. chmod -w backup/*-backup.fastq

It’s always a good idea to check your work with ls -l backup. You should see something like:

-r--r--r-- 1 <username> 47552 <modified date time> SRR097977-backup.fastq
-r--r--r-- 1 <username> 43332 <modified date time> SRR098026-backup.fastq

Key points
  • The commands cp, mv, and mkdir are useful for manipulating existing files and creating new directories.
  • You can view file permissions using ls -l and change permissions using chmod.