Introducing the Shell

Learning Objectives

  • Describe key reasons for learning the shell.
  • Navigate your file system using the Bash command line.
  • Access and read help files for bash programs and use help files to identify useful command options.
  • Demonstrate the use of tab completion, and explain its advantages.
  • Translate an absolute path into a relative path and vice versa.
  • Learn to use the nano text editor.
  • Understand how to move, create, and delete files.
  • Write a basic shell script.
  • Use the bash command to execute a shell script.
  • Use chmod to make a script an executable program.

Questions to be answered in this lesson

  • What is a command shell and why would I use one?
  • How can I move around on my computer’s file system?
  • How can I see what files and directories I have?
  • How can I specify the location of a file or directory on my computer?

What is a shell?

A shell is a program that provides a command-line interface (CLI), allowing you to control your computer by typing commands instead of using graphical user interfaces (GUIs) with a mouse or touchscreen. Think of it as conversing directly with your computer in its native language, granting you more power and flexibility—no point-and-click needed!

There are various types of shells, each with its own flair. On macOS and Linux systems, you’ll often encounter Bash (Bourne Again Shell) and Zsh (Z Shell). For Windows users, there’s the Command Prompt and PowerShell, which is like Command Prompt’s more sophisticated sibling. Mastering the shell can help you work more efficiently, automate tasks, and unlock your computer’s full potential!

For this workshop, we will be using Bash in an Ubuntu virtual machine specifically set up for this class. The commands we explore will also work in zsh (available in the Mac Terminal) with minimal adjustments. While PowerShell (the Windows shell) has a different syntax, many of the fundamental commands and concepts are similar and can be adapted with little or no modification.
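
If you are ever unsure which shell you are currently running, one quick way to check (this works in both Bash and zsh) is to ask the shell to print its own name and your default login shell:

$ echo $0          # the shell you are running right now (e.g. bash or -bash)
$ echo $SHELL      # your default login shell (e.g. /bin/bash)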

Why should I learn the Shell?

There are many reasons why learning the shell will benefit you:

  • Automate repetitive tasks. If you often need to perform the same task on a large number of files, the shell lets you automate it.

  • Make your work less error-prone. When humans do the same thing a hundred times (or even ten times), they’re likely to make a mistake. Your computer can do the same thing a thousand times with no mistakes.

  • Make your work reproducible. When using the command line, your computer keeps a record of every command that you’ve run, which you can use to re-do your work when you need to. It also gives you a way to communicate unambiguously what you’ve done, so that others can inspect or apply your process to new data.

  • Save on computing capacity. Some tasks require large amounts of computing power and can’t realistically be run on your own machine. These tasks are best performed using remote computers or cloud computing, which can only be accessed through a shell.

  • Take advantage of command-line tools. Many bioinformatics tools can only be used through the shell, and many offer features and parameter options that are not available in the GUI. BLAST is an example: many of its advanced functions are only accessible to users who know how to use a shell.

The Shell Window

As you open your Terminal Window, you may see something like the following:

ComputerUserName-####ABC MINGW64 ~
$

The dollar sign is a prompt, which shows us that the shell is waiting for input; your shell may use a different character as a prompt and may add information before the prompt. When typing commands, either from these lessons or from other sources, do not type the prompt, only the commands that follow it.

The prompt may look different if you are using a Linux or Mac computer.

Mastering Absolute vs. Relative Paths with the cd Command

Navigating the file system efficiently is crucial for managing files and directories. The cd (change directory) command is a fundamental tool that allows you to move between directories using absolute and relative paths. Mastering these path types will enhance your command-line proficiency.

Understanding Path Types

Absolute Path

  • Definition: An absolute path specifies the complete address of a directory from the root of the file system.
  • Starts With: A forward slash /.
  • Example: /home/user/shell_data/untrimmed_fastq

Relative Path

  • Definition: A relative path specifies the location of a directory relative to your current directory.
  • Does Not Start With: A forward slash /.
  • Example: untrimmed_fastq (if you are already in /home/user/shell_data)

Directory Hierarchy Overview

Visualizing the directory structure helps in understanding absolute and relative paths.

/
├── home
│   └── user
│       ├── shell_data
│       │   ├── sra_metadata
│       │   └── untrimmed_fastq
│       └── Documents
└── etc

  • Root Directory (/): The top-level directory.
  • Subdirectories: Branch out from the root or other directories.

Using the cd Command

Navigate to the home directory, then enter the pwd command.

Command:

$ cd 
$ pwd  

Output (this is an example; yours will differ):

/usr/home/⟨username⟩

Here pwd displays the full name of your home directory. The very top of the hierarchy is a directory called / which is usually referred to as the root directory.

Now let’s navigate directly to the .hidden folder using the full (absolute) path.

$ cd /usr/home/⟨username⟩/shell_data/.hidden

This jumps forward multiple levels to the .hidden directory. Now go back to the home directory.

$ cd 

You can also navigate to the .hidden directory using a relative path (relative to our working directory):

$ cd shell_data/.hidden

Both commands navigate to the .hidden directory, but one uses an absolute path (/usr/home/⟨username⟩/shell_data/.hidden, equivalent to ~/shell_data/.hidden), while the other uses a relative path from the home directory (shell_data/.hidden). Absolute paths always begin with a / or ~, providing the complete address regardless of your location. In contrast, relative paths are shorter and depend on your current directory, making them quicker to type when navigating nearby folders.

Think of a relative path as receiving local directions, effective when you’re already nearby, whereas an absolute path is like GPS coordinates that work from anywhere. Depending on where you are in the directory hierarchy, you can choose the most convenient path type: use absolute paths when starting from the home directory and relative paths when working within a specific folder. As you become more familiar with your directory structure, navigating between directories using both path types will become easier and more intuitive.
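
To convince yourself that both routes end up in the same place, you can run pwd after each one (this follows the example paths shown above; the ⟨username⟩ part will differ on your system):

$ cd /usr/home/⟨username⟩/shell_data/.hidden
$ pwd
/usr/home/⟨username⟩/shell_data/.hidden
$ cd
$ cd shell_data/.hidden
$ pwd
/usr/home/⟨username⟩/shell_data/.hidden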

Summary

  • Absolute Paths provide the full directory address from the root (or home directory using ~), ensuring you can navigate to any directory from anywhere.
  • Relative Paths are based on your current location, allowing quicker navigation within nearby directories.
  • Mastering both path types with the cd command enhances your ability to efficiently manage and navigate the file system.
  • The commands cd and cd ~ are very useful for quickly navigating back to your home directory.

By practicing these commands and understanding when to use each path type, you’ll become proficient in navigating and managing directories in any operating system.

Writing and Reading Files

Now that we know how to move around and look at things, let’s learn how to read, write, and handle files! We’ll start by moving back to our home directory and creating a scratch directory:

$ cd ~
$ mkdir shell_test
$ cd shell_test

Creating and Editing Text Files

When working in a shell, we will frequently need to create or edit text files. A text file is one of the simplest computer file formats: just a sequence of lines of text.

What if we want to make a file? There are a few ways of doing this, the easiest of which is simply using a text editor. For this lesson, we are going to use nano, since it’s more intuitive than many other terminal text editors.

To create or edit a file, type nano <filename> in the terminal, where <filename> is the name of the file. If the file does not already exist, it will be created. Let’s make a new file now, type whatever you want in it, and save it.

$ nano draft.txt

Nano in action

Nano defines a number of shortcut keys (prefixed by the Control or Ctrl key) to perform actions such as saving the file or exiting the editor. Here are the shortcut keys for a few common actions:

  • Ctrl+O — save the file (under the current name or a new one).

  • Ctrl+X — exit the editor. If you have not saved your file upon exiting, nano will ask you if you want to save.

  • Ctrl+K — cut (“kill”) a text line. This command deletes a line and saves it on a clipboard. If repeated multiple times without any interruption (key typing or cursor movement), it will cut a chunk of text lines.

  • Ctrl+U — paste the cut text line (or lines). This command can be repeated to paste the same text elsewhere.

Examining Files

Next, let’s explore how to view the contents of files.

One way to examine a file is by using the cat command, which prints the entire contents of the file to the terminal.

Navigate to the ~/shell_data/untrimmed_fastq directory and enter the following command:

$ cat SRR098026.fastq

This will print out all of the contents of the SRR098026.fastq to the screen.

Try it Yourself!

  1. Print out the contents of the ~/shell_data/untrimmed_fastq/SRR097977.fastq file. What is the last line of the file?

  2. From your home directory, and without changing directories, use one short command to print the contents of all of the files in the ~/shell_data/untrimmed_fastq directory.

  1. The last line of the file is C:CCC::CCCCCCCC<8?6A:C28C<608'&&&,'$.
  2. cat ~/shell_data/untrimmed_fastq/*

While cat is a fantastic tool for displaying the contents of a file, it can become cumbersome when dealing with very large files. In such cases, the less command is more efficient and user-friendly. less opens the file in read-only mode and allows you to navigate through its contents interactively.

Enter the following command:

$ less SRR097977.fastq

Some navigation commands in less:

key     action
Space   go forward one screen
b       go backward one screen
g       go to the beginning of the file
G       go to the end of the file
q       quit

less also gives you a way of searching through files. Use the “/” key to begin a search. Enter the word you would like to search for and press enter. The screen will jump to the next location where that word is found.

Shortcut:

If you hit “/” and then “enter”, less will repeat the previous search. less searches from the current location and works its way forward. Scroll up a couple of lines in your terminal to verify that you are at the beginning of the file. Note that if you are at the end of the file and search for the sequence “CAA”, less will not find it. You either need to go back to the beginning of the file (by typing g) and search again using /, or you can use ? to search backwards in the same way you used / previously.

For instance, let’s search forward for the sequence TTTTT in our file. You can see that we go right to that sequence, what it looks like, and where it is in the file. If you continue to type / and hit return, you will move forward to the next instance of this sequence motif. If you instead type ? and hit return, you will search backwards and move up the file to previous examples of this motif.

Try it Yourself!

What are the next three nucleotides (characters) after the first instance of the sequence quoted above?

CAC

Remember, the man program actually uses less internally and therefore uses the same commands, so you can search documentation using “/” as well!

Viewing Parts of Files with head and tail

Another method to inspect files is to view only a portion of their contents, which is especially useful when you want to see the beginning or the end of a file or examine its formatting without loading the entire file. The head and tail commands facilitate this by allowing you to display the first few lines or the last few lines of a file, respectively. These commands are invaluable for quickly assessing large files, checking recent entries in log files, or verifying the structure of a document without the need to scroll through the entire content.

Command:

$ head SRR098026.fastq

Output:

@SRR098026.1 HWUSI-EAS1599_1:2:1:0:968 length=35
NNNNNNNNNNNNNNNNCNNNNNNNNNNNNNNNNNN
+SRR098026.1 HWUSI-EAS1599_1:2:1:0:968 length=35
!!!!!!!!!!!!!!!!#!!!!!!!!!!!!!!!!!!
@SRR098026.2 HWUSI-EAS1599_1:2:1:0:312 length=35
NNNNNNNNNNNNNNNNANNNNNNNNNNNNNNNNNN
+SRR098026.2 HWUSI-EAS1599_1:2:1:0:312 length=35
!!!!!!!!!!!!!!!!#!!!!!!!!!!!!!!!!!!
@SRR098026.3 HWUSI-EAS1599_1:2:1:0:570 length=35
NNNNNNNNNNNNNNNNANNNNNNNNNNNNNNNNNN

Command:

$ tail SRR098026.fastq

Output:

+SRR098026.247 HWUSI-EAS1599_1:2:1:2:1311 length=35
#!##!#################!!!!!!!######
@SRR098026.248 HWUSI-EAS1599_1:2:1:2:118 length=35
GNTGNGGTCATCATACGCGCCCNNNNNNNGGCATG
+SRR098026.248 HWUSI-EAS1599_1:2:1:2:118 length=35
B!;?!A=5922:##########!!!!!!!######
@SRR098026.249 HWUSI-EAS1599_1:2:1:2:1057 length=35
CNCTNTATGCGTACGGCAGTGANNNNNNNGGAGAT
+SRR098026.249 HWUSI-EAS1599_1:2:1:2:1057 length=35
A!@B!BBB@ABAB#########!!!!!!!######

The -n option to either of these commands can be used to print the first or last n lines of a file.

Command:

$ head -n 1 SRR098026.fastq

Output:

@SRR098026.1 HWUSI-EAS1599_1:2:1:0:968 length=35

Command:

$ tail -n 1 SRR098026.fastq

Output:

A!@B!BBB@ABAB#########!!!!!!!######

Summary

In this section, you learned how to view the contents of files using several Bash commands. You can display the entire file with cat, navigate through large files interactively with less, view the beginning of a file using head, and examine the end of a file with tail. These tools offer flexible options for accessing and inspecting file data efficiently.

Creating, moving, copying, and removing

Now we can move around in the file structure, look at files, and search files. But what if we want to copy files or move them around or get rid of them? Most of the time, you can do these sorts of file manipulations without the command line, but there will be some cases (like when you’re working with a remote computer) where it will be impossible. You’ll also find that you may be working with hundreds of files and want to do similar manipulations to all of those files. In cases like this, it’s much faster to do these operations at the command line.

Copying Files

First, let’s make a copy of one of our FASTQ files using the cp command.

Navigate to the shell_data/untrimmed_fastq directory and enter:

$ cp SRR098026.fastq SRR098026-copy.fastq
$ ls
SRR097977.fastq  SRR098026-copy.fastq  SRR098026.fastq

We now have two copies of the SRR098026.fastq file, one of them named SRR098026-copy.fastq. We’ll move this file to a new directory called backup where we’ll store our backup data files.

Creating Directories

The mkdir command is used to make a directory. Enter mkdir followed by a space, then the directory name you want to create:

$ mkdir backup

Moving / Renaming

We can now move our backup file to this directory. We can move files around using the command mv:

$ mv SRR098026-copy.fastq backup
$ ls backup
SRR098026-copy.fastq

The mv command is also how you rename files. Let’s rename this file to make it clear that this is a backup:

$ cd backup
$ mv SRR098026-copy.fastq SRR098026-backup.fastq
$ ls
SRR098026-backup.fastq

File Permissions

We’ve now made a backup copy of our file, but just because we have two copies, it doesn’t make us safe. We can still accidentally delete or overwrite both copies. To make sure we can’t accidentally mess up this backup file, we’re going to change the permissions on the file so that we’re only allowed to read (i.e. view) the file, not write to it (i.e. make new changes).

View the current permissions on a file using the -l (long) flag for the ls command:

$ ls -l
-rw-r--r-- 1 <username> 43332 <last modified date time> SRR098026-backup.fastq

The first part of the output for the -l flag gives you information about the file’s current permissions. There are ten slots in the permissions list. The first character in this list is related to file type, not permissions, so we’ll ignore it for now. The next three characters relate to the permissions that the file owner has, the next three relate to the permissions for group members, and the final three characters specify what other users outside of your group can do with the file. We’re going to concentrate on the three positions that deal with your permissions (as the file owner).

Permissions breakdown

Here the three positions that relate to the file owner are rw-. The r means that you have permission to read the file, the w indicates that you have permission to write to (i.e. make changes to) the file, and the third position is a -, indicating that you don’t have the permission encoded by that slot (this is where the x, or executable, permission would go; it controls your ability to run files that are programs or to cd into a directory).

Our goal for now is to change permissions on this file so that you no longer have w or write permissions. We can do this using the chmod (change mode) command and subtracting (-) the write permission -w.

$ chmod -w SRR098026-backup.fastq
$ ls -l 
-r--r--r-- 1 <username> 43332 <last modified date time> SRR098026-backup.fastq
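
As an aside, chmod can also address the owner (u), group (g), and others (o) separately, which maps directly onto the three sets of positions described above. These examples are not needed for this lesson and are easily reversed:

$ chmod u+w SRR098026-backup.fastq    # give the owner's write permission back
$ chmod u-w SRR098026-backup.fastq    # remove it again, for the owner only this time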

Removing

To prove to ourselves that you no longer have the ability to modify this file, try deleting it with the rm command:

$ rm SRR098026-backup.fastq

You’ll be asked if you want to override your file permissions:

rm: remove write-protected regular file 'SRR098026-backup.fastq'?

You should enter n (for no), and the file will not be deleted. If you enter y, you will delete the file. This gives us an extra measure of security, as there is one more step between us and deleting our data files.

Important: The rm command permanently removes the file. Be careful with this command. It doesn’t just nicely put the files in the Trash. They’re really gone.

By default, rm will not delete directories. You can tell rm to delete a directory using the -r (recursive) option. Let’s delete the backup directory we just made.

Enter the following command:

$ cd ..
$ rm -r backup

This will delete not only the directory, but all files within the directory. If you have write-protected files in the directory, you will be asked whether you want to override your permission settings.
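
If you would rather be prompted before every single deletion, not just for write-protected files, rm’s -i flag can be combined with -r. For example, we could have run this instead (the exact prompt wording varies between systems):

$ rm -ri backup    # asks for confirmation before entering the directory and before removing each file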

Try it Yourself!

Starting in the shell_data/untrimmed_fastq/ directory, do the following:

  1. Make sure that you have deleted your backup directory and all files it contains.
  2. Create a backup of each of your FASTQ files using cp. (Note: You’ll need to do this individually for each of the two FASTQ files. We haven’t learned yet how to do this with a wildcard.)
  3. Use a wildcard to move all of your backup files to a new backup directory.
  4. Change the permissions on all of your backup files to be write-protected.
  1. rm -r backup
  2. cp SRR098026.fastq SRR098026-backup.fastq and cp SRR097977.fastq SRR097977-backup.fastq
  3. mkdir backup and mv *-backup.fastq backup
  4. chmod -w backup/*-backup.fastq
    It’s always a good idea to check your work with ls -l backup. You should see something like:
-r--r--r-- 1 <username> 47552 <modified date time> SRR097977-backup.fastq
-r--r--r-- 1 <username> 43332 <modified date time> SRR098026-backup.fastq

Summary

In this section, you learned essential commands for managing your file system efficiently. The cp, mv, and mkdir commands allow you to copy and move existing files as well as create new directories, enabling effective organization and manipulation of your files. Additionally, you can view file permissions using ls -l, which provides detailed information about each file’s access rights, and modify these permissions with chmod to control who can read, write, or execute your files.

Downloading and Extracting Files

When you’re working in the shell, especially on remote computers, it’s often not possible to use a web browser to download files from the internet. Fortunately, there are commands we can use to do just that! Also, since bioinformatics files tend to be huge, they are typically compressed so that they travel faster over the internet, which means we also need tools to extract compressed files. Let’s check out these tools.

Let’s grab and unpack a set of demo files for use later. To do this, we’ll use wget (wget <link> downloads the file at that link).

$ cd
$ cd shell_data
$ wget http://www.hpc-carpentry.org/hpc-shell/files/bash-lesson.tar.gz
$ ls

You’ll commonly encounter .tar.gz archives while working in UNIX. To extract the files from a .tar.gz file, we run the command tar -xvf filename.tar.gz:

$ tar -xvf bash-lesson.tar.gz
dmel-all-r6.19.gtf
dmel_unique_protein_isoforms_fb_2016_01.tsv
gene_association.fb
SRR307023_1.fastq
SRR307023_2.fastq
SRR307024_1.fastq
SRR307024_2.fastq
SRR307025_1.fastq
SRR307025_2.fastq
SRR307026_1.fastq
SRR307026_2.fastq
SRR307027_1.fastq
SRR307027_2.fastq
SRR307028_1.fastq
SRR307028_2.fastq
SRR307029_1.fastq
SRR307029_2.fastq
SRR307030_1.fastq
SRR307030_2.fastq

Unzipping files

We just unzipped a .tar.gz file for this example. What if we run into other file formats that we need to unzip? Just use the handy reference below (there’s a short example after the list):

  • gunzip extracts the contents of .gz files
  • unzip extracts the contents of .zip files
  • tar -xvf extracts the contents of .tar.gz and .tar.bz2 files
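
As a quick illustration of the first two (the filenames here are made up for the example; substitute your own):

$ gunzip reads.fastq.gz    # replaces reads.fastq.gz with the uncompressed reads.fastq
$ unzip archive.zip        # extracts the archive's contents into the current directory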

Summary

In this section we explored how to download files from the internet using the wget command and extract compressed .tar.gz archives with the tar command. By navigating to the appropriate directory, downloading the necessary demo files, and unpacking them, you can efficiently manage and access large datasets. These command-line tools are essential for handling file transfers and extractions, especially when working in environments where graphical interfaces are unavailable or when dealing with extensive data sets in UNIX-based systems.

Wildcards and Pipes

Now that we know some of the basic UNIX commands, we are going to explore some more advanced features. The first of these features is the wildcard *. In our examples before, we’ve done things to files one at a time and otherwise had to specify things explicitly. The * character lets us speed things up and do things across multiple files.

Ever wanted to move, delete, or just do “something” to all files of a certain type in a directory? * lets you do that, by standing in for any number of characters (including none) in a piece of text. So *.txt matches every .txt file in a directory, for instance, and * by itself matches all files. Let’s use our example data to see what I mean.

Wildcards

Navigate to your shell_data directory:


$ cd shell_data

We are interested in looking at the FASTQ files in this directory. We can list all files with the .fastq extension using the command:

$ ls *.fastq

Output:

SRR307023_1.fastq SRR307025_1.fastq SRR307027_1.fastq SRR307029_1.fastq
SRR307023_2.fastq SRR307025_2.fastq SRR307027_2.fastq SRR307029_2.fastq
SRR307024_1.fastq SRR307026_1.fastq SRR307028_1.fastq SRR307030_1.fastq
SRR307024_2.fastq SRR307026_2.fastq SRR307028_2.fastq SRR307030_2.fastq

The * character is a special type of character called a wildcard, which can be used to represent any number of any type of character. Thus, *.fastq matches every file that ends with .fastq.
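
The wildcard does not have to sit at the start of a name. For example, to list only the first read of each pair, or every file belonging to a single run, you could try:

$ ls *_1.fastq       # all files ending in _1.fastq
$ ls SRR307023*      # every file whose name starts with SRR307023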

Try it Yourself!

Now let’s try cleaning up our working directory a bit. Create a folder called “fastq” and move all of our .fastq files there in one mv command.

mkdir fastq
mv *.fastq fastq/

Redirecting output

Each of the commands we’ve used so far does only a very small amount of work. However, we can chain these small UNIX commands together to perform otherwise complicated actions!

For our first foray into redirecting output, we are going to use the > operator to write output to a file. When using >, the output of the command on the left of the > is written to the file you name on the right. The actual syntax looks like command > filename.

Let’s try using >. echo simply prints back, or echoes whatever you type after it.

Command:

$ echo "this is a test"
$ echo "this is a test" > test.txt
$ cat test.txt

Output:

this is a test

this is a test

There are 3 input/output streams for every UNIX program you will run: stdin, stdout, and stderr.

Let’s dissect these three streams of input/output in the command we just ran: echo "this is a test" > test.txt

  • stdin is the input to a program. In the command we just ran, echo did not read anything from stdin; it took the text "this is a test" as an argument. (Many commands, such as cat, read from stdin when you don’t give them a filename.)
  • stdout contains the actual, expected output. In this case, > redirected stdout to the file test.txt.
  • stderr typically contains error messages and other information that doesn’t quite fit into the category of “output”. If we want to redirect both stdout and stderr to the same file we would use &> instead of >. (We can redirect just stderr using 2>; see the example after this list.)
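
For example, asking ls for a file that doesn’t exist (idontexist is just a made-up name) produces an error message on stderr, which 2> captures in a file; the exact wording of the message depends on your system:

$ ls idontexist 2> errors.txt
$ cat errors.txt
ls: cannot access 'idontexist': No such file or directory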

Now we will add another line of text to our file, but how do we do that without overwriting our original file? For this there is a different redirection operator, >>, which appends to the end of a file instead of overwriting it.

Command:

$ echo "adding more text to the end of the file" >> test.txt
$ cat test.txt

Output:

this is a test
adding more text to the end of the file

Try it Yourself!

Let me introduce you to a new command that tells us how long a file is: wc. wc -l file tells us the length of a file in lines.

Command:

$ wc -l dmel-all-r6.19.gtf

Output:

542048 dmel-all-r6.19.gtf

Interesting, there are over 540000 lines in our dmel-all-r6.19.gtf file.

What if we wanted to run wc -l on every .fastq file and save that to a file fastq_length.txt?

This is where * comes in really handy! *.fastq matches every file ending in .fastq, and > allows us to send the result to a file.

Command:

$ cd
$ cd shell_data
$ wc -l fastq/*.fastq > fastq_length.txt

Notice how there was no output to the console that time. Let’s check that the output went to the file as we specified.

$ cat fastq_length.txt

Chaining commands together

We now know how to redirect stdout and stderr to files. We can actually take this a step further and redirect output (stdout) from one command to serve as the input (stdin) for the next. To do this, we use the | (pipe) operator.

grep is an extremely useful command. It finds things for us within files. Basic usage (there are a lot of options for more clever things, see the man page) uses the syntax grep whatToFind fileToSearch. Let’s use grep to find all of the entries pertaining to the Act5C gene in Drosophila melanogaster.

Command:

$ grep Act5C dmel-all-r6.19.gtf

The output is nearly unintelligible since there is so much of it. Let’s send the output of that grep command to head so we can just take a peek at the first line. The | operator lets us send output from one command to the next:

Command:

$ grep Act5C dmel-all-r6.19.gtf | head -n 1

Output:

X   FlyBase gene    5900861 5905399 .   +   .   gene_id "FBgn0000042"; gene_symbol "Act5C";

Nice work, we sent the output of grep to head. Let’s try counting the number of entries for Act5C with wc -l. We can do the same trick to send grep’s output to wc -l:

Command:

$ grep Act5C dmel-all-r6.19.gtf | wc -l

Output:

46

Note that this is just the same as redirecting output to a file, then reading the number of lines from that file.
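
To see that equivalence for yourself, here is the two-step version using an intermediate file (act5c_entries.txt is just a name chosen for this example):

$ grep Act5C dmel-all-r6.19.gtf > act5c_entries.txt
$ wc -l act5c_entries.txt
46 act5c_entries.txt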

Summary

The “Wildcards and Pipes” section expands on basic UNIX commands by introducing the * wildcard for handling multiple files simultaneously, such as *.txt for all text files. It explains output redirection using > to write command output to files and >> to append without overwriting, while detailing the three I/O streams: stdin, stdout, and stderr. The section also covers piping with the | operator, allowing the output of one command to serve as the input for another, enabling more complex operations.

Scripts, variables, and loops

We now know a lot of UNIX commands! Wouldn’t it be great if we could save certain commands so that we could run them later or not have to type them out again? As it turns out, this is straightforward to do. A “shell script” is essentially a text file containing a list of UNIX commands to be executed in a sequential manner. These shell scripts can be run whenever we want, and are a great way to automate our work.

Writing scripts

So how do we write a shell script, exactly? It turns out we can do this with a text editor. Start editing a file called “demo.sh” (to recap, we can do this with nano demo.sh). The “.sh” is the standard file extension for shell scripts that most people use (you may also see “.bash” used).

Our shell script will have two parts:

  • On the very first line, add #!/bin/bash. The #! (pronounced “hash-bang” or “shebang”) tells our computer what program to run our script with. In this case, we are telling it to run our script with our command-line shell (the shell we’ve been working in so far). If we wanted our script to be run with something else, like Perl, we could add #!/usr/bin/perl instead.
  • Now, anywhere below the first line, add echo "Our script worked!". When our script runs, echo will happily print out Our script worked!.

#!/bin/bash

echo "Our script worked!"

Ready to run our program? Let’s try running it:

Command:

$ demo.sh 

Error:

bash: demo.sh: command not found...

Strangely enough, Bash can’t find our script. As it turns out, Bash will only look in certain directories for scripts to run. To run anything else, we need to tell Bash exactly where to look by giving the path to the file, including the filename. We could do this one of two ways: either with our absolute path ~/demo.sh, or with the relative path ./demo.sh.
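
Those “certain directories” are listed in an environment variable called PATH; you can take a look at it with echo (the exact list differs from machine to machine):

$ echo $PATH    # a colon-separated list of directories Bash searches for commands

Since the directory containing our script is usually not on that list, we point Bash at the script explicitly: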

Command:

$ ./demo.sh

Error:

bash: ./demo.sh: Permission denied

There’s one last thing we need to do. Before the file can be run, it needs “permission” to run. Let’s look at our file’s permissions with ls -l and then update them:

Command:

$ ls -l

So how do we change permissions? As I mentioned earlier, we need permission to execute our script. Changing permissions is done with chmod. To add executable permissions for all users we could use this:

Command:

$ chmod +x demo.sh
$ ls -l

Now that we have executable permissions for that file, we can run it.

Command:

$ ./demo.sh

Output:

Our script worked!

Fantastic, we’ve written our first program! Before we go any further, let’s learn how to take notes inside our program using comments. A comment is indicated by the # character, followed by whatever we want. Comments do not get run. Let’s try out some comments in the console, then add one to our script!

# This won't show anything.

Now let’s try adding this to our script with nano. Edit your script to look something like this:

Our script:

#!/bin/bash

# This is a comment... they are nice for making notes!
echo "Our script worked!"

When we run our script, the output should be unchanged from before!

Shell variables

Another important concept that we’ll need to cover is shell variables. Variables are a great way of saving information under a name you can access later. In programming languages like Python and R, variables can store pretty much anything you can think of. In the shell, they usually just store text. The best way to understand how they work is to see them in action.

To set a variable, simply type a name containing only letters, numbers, and underscores, followed immediately by an = (with no spaces) and whatever you want to put in the variable. Shell variable names are often uppercase by convention (but do not have to be).

Command:

$ VAR="This is our variable"

To use a variable, prefix its name with a $ sign. Note that if we want to simply check what a variable is, we should use echo (or else the shell will try to run the contents of a variable).

Command:

$ echo $VAR

Output:

This is our variable
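
A small aside (not required for this lesson): it’s good practice to put double quotes around a variable when you use it, and curly braces make it unambiguous where the variable name ends, which matters if you put text right after it. Without the braces below, the shell would look for a variable named VAR_2 instead:

$ echo "${VAR}_2"
This is our variable_2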

Let’s try setting a variable in our script and then recalling its value as part of a command. We’re going to make it so our script runs wc -l on whichever file we specify with FILE.

Command:

nano demo.sh

Our script:

#!/bin/bash

# set our variable to the name of our GTF file
FILE=dmel-all-r6.19.gtf

# call wc -l on our file
wc -l $FILE

Command:

$ ./demo.sh

Output:

542048 dmel-all-r6.19.gtf

Loops

To end our lesson on scripts, we are going to learn how to write a for-loop to execute a lot of commands at once. This will let us do the same string of commands on every file in a directory (or other stuff of that nature).

for-loops generally have the following syntax:

Command:

#!/bin/bash

for VAR in first second third
do
    echo $VAR
done

When a for-loop gets run, the loop will run once for each item following the word in. In each iteration, the variable $VAR is set to a particular value for that iteration. In this case it will be set to first during the first iteration, second on the second, and so on. During each iteration, the code between do and done is performed.

Let’s run the script we just wrote (I saved mine as loop.sh).

Command:

$ chmod +x loop.sh
$ ./loop.sh

Output:

first
second
third

What if we wanted to loop over the contents of a shell variable, such as a list of every file in the current directory? Shell variables work perfectly well in for-loops. In this example, we’ll save the result of ls in a variable and loop over each file:

Command:

#!/bin/bash

FILES=$(ls)
for VAR in $FILES
do
        echo $VAR
done

Command:

$ ./loop.sh

There’s a shortcut to run on all files of a particular type, say all .gz files:

Command:

#!/bin/bash

for VAR in *.gz
do
    echo $VAR
done

Try it Yourself!

Writing our own scripts and loops is a very rewarding activity! Let’s cd to our fastq directory from earlier and write a loop to print the name and top 4 lines of every FASTQ file in that directory.

Is there a way to only run the loop on fastq files ending in _1.fastq?

Create the following script in a file called head_all.sh

Command:

#!/bin/bash

for FILE in *.fastq
do
   echo $FILE
   head -n 4 $FILE
done

The “for” line could be modified to be for FILE in *_1.fastq to achieve the second aim.
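
Written out in full, that modified script would look like this:

#!/bin/bash

# only process the first read of each pair
for FILE in *_1.fastq
do
   echo $FILE
   head -n 4 $FILE
done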

Documenting your activity on the project

When carrying out wet-lab analyses, most scientists work from a written protocol and keep a hard copy of written notes in their lab notebook, including anything they did differently from the written protocol. This detailed record-keeping process is just as important when doing computational analyses. Luckily, it’s even easier to record the steps you’ve carried out computationally than it is when working at the bench.

The history command is a convenient way to document all the commands you have used while analyzing and manipulating your project files. Let’s document the work we have done on our project so far.

View the commands that you have used so far during this session using history:

$ history

The history likely contains many more commands than you have used for the current project. Let’s view the last several commands that focus on just what we need for this project.

View the last n lines of your history (where n is roughly the number of recent lines you think are relevant). For our example, we will use the last 7:

$ history | tail -n 7

Try it Yourself!

Using your knowledge of the shell, use the append redirect >> to create a file called my_project_log_XXXX-XX-XX.md (use the four-digit year, two-digit month, and two-digit day, e.g. my_project_log_2024-10-17.md).

$ history | tail -n 7 >> my_project_log_2024-10-17.md

Note that we used the last 7 lines as an example; the number of lines you need may vary.

You may have noticed that your history contains the history command itself. To remove this redundancy from our log, let’s use the nano text editor to fix the file:

Command:

$ nano my_project_log_2024-10-17.md

(Remember to replace the 2024-10-17 with the actual date.)

From the nano screen, you can use your cursor to navigate, type, and delete any redundant lines.

Add a date line and comment to the line where you have created the directory. Recall that any text on a line after a # is ignored by bash when evaluating the text as code. For example:

Command:

# 2024-10-17 
# Created sample directories for the Unix workshop

Next, remove any lines of the history that are not relevant by navigating to those lines and using your delete key. Save your file and close nano.

Your file should look something like this:

# 2024-10-17 
# Created sample directories for the Unix workshop

mkdir my_project
mkdir my_project/docs
mkdir my_project/data
mkdir my_project/results

If you keep this file up to date, you can use it to re-do your work on your project if something happens to your results files. To demonstrate how this works, first delete your my_project directory and all of its subdirectories. Look at your directory contents to verify the directory is gone.

$ rm -r my_project
$ ls
shell_data  my_project_log_2024-10-17.md

Then run your workshop log file as a bash script. You should see the my_project directory and all of its subdirectories reappear.

$ bash my_project_log_2024-10-17.md
$ ls
shell_data  my_project
my_project_log_2024-10-17.md

Congratulations! You’ve finished your introduction to using the shell. You now know how to navigate your file system, create, copy, move, and remove files and directories, and automate repetitive tasks using scripts and wildcards. With this solid foundation, you’re ready to move on to apply all of these new skills to carrying out more sophisticated bioinformatics analysis work. Don’t worry if everything doesn’t feel perfectly comfortable yet. Practice makes perfect! Get in the habit of applying your new skills to your real work whenever you can.
