Accessing and manipulating biological sequences

Learning Objectives

  • Be able to name some main NCBI databases and their purposes
  • Become familiar with the rentrez core functions and how they are related.
  • Understand how to programmatically access NCBI data.
  • Use R to analyze and manipulate a biological sequence

NCBI data and the rentrez package

The National Center for Biotechnology Information (NCBI), a division of the National Library of Medicine, maintains several databases that provide access to a wide range of biological data and tools, including databases for DNA and protein sequences (GenBank, RefSeq), scientific literature (PubMed, PMC), and resources for analyzing genetic and genomic information (BLAST, Gene, SRA toolkit). It serves as a central hub for researchers around the world to find, share, and analyze biological data. NCBI is a major partner in the International Nucleotide Sequence Database Collaboration (INSDC) along with EMBL-EBI in Europe and DDBJ in Japan.

Entrez is the name of the integrated search system across all NCBI databases. Programmatic access to the Entrez system is provided by the Entrez Programming Utilities (E-utilities). The E-utilities can be used on the command line, but in this class, we will be using the rentrez package, an R interface to the program.

First, let’s see what databases we can access with rentrez

entrez_dbs()
 [1] "pubmed"          "protein"         "nuccore"         "ipg"            
 [5] "nucleotide"      "structure"       "genome"          "annotinfo"      
 [9] "assembly"        "bioproject"      "biosample"       "blastdbinfo"    
[13] "books"           "cdd"             "clinvar"         "gap"            
[17] "gapplus"         "grasp"           "dbvar"           "gene"           
[21] "gds"             "geoprofiles"     "medgen"          "mesh"           
[25] "nlmcatalog"      "omim"            "orgtrack"        "pmc"            
[29] "proteinclusters" "pcassay"         "protfam"         "pccompound"     
[33] "pcsubstance"     "seqannot"        "snp"             "sra"            
[37] "taxonomy"        "biocollections"  "gtr"            

We can also see what fields are available to search in a database

entrez_db_searchable(db="gene")
Searchable fields for database 'gene'
  ALL    All terms from all searchable fields 
  UID    Unique number assigned to a gene record 
  FILT   Limits the records 
  TITL   gene or protein name 
  WORD   Free text associated with record 
  ORGN   scientific and common names of organism 
  MDAT   The last date on which the record was updated 
  CHR    Chromosome number or numbers; also 'mitochondrial', 'unknown' properties 
  MV     Chromosomal map location as displayed in MapViewer 
  GENE   Symbol or symbols of the gene 
  ECNO   EC number for enzyme or CAS registry number 
  MIM    MIM number from OMIM 
  DIS    Name(s) of diseases associated with this gene. When available, OMIM name will be used 
  ACCN   Nucleotide or protein accession(s) associated with this gene 
  UGEN   UniGene cluster number for this gene 
  PROP   Properties of Gene record 
  CDAT   The date on which this record first appeared 
  NCAC   nucleotide accessions of sequences 
  NUID   nucleotide uids of sequences 
  PACC   protein accessions 
  PUID   protein uids 
  PMID   PubMed ids of accessions linked to the record 
  TID    taxonomy id 
  GO     Gene Ontology 
  DOM    Domain Name 
  DDAT   The date on which the record was discontinued 
  CPOS   Chromosome base position 
  GFN    Gene full name 
  PFN    Protein full name 
  GL     Gene length 
  XC     Exon count 
  GRP    Relationships for this gene 
  PREF   Preferred symbol of the gene 
  AACC   Assembly accession 
  ASM    Assembly name 
  EXPR   Gene expression 

The functions of rentrez are designed to work together. The main ones we will be using in this session are:

Function Purpose
entrez_search() Search a specific NCBI database and retrieve UIDs (unique identifiers).
entrez_summary() Get summary metadata for UIDs returned by a search.
extract_from_esummary() Extract specific fields (e.g., title, publication date) from esummary results.
entrez_fetch() Download full records (e.g., FASTA, GenBank, XML).
entrez_link() Find related data across NCBI databases.

You can check out the rentrez documentation for more information on these and other functions.

Finding and fetching a sequence

Let’s start by searching for the record in the Gene database for human gene TP53, an important gene in the study of cancer. The entrez_search() function takes the following arguments:

  • db, name of the database to search for (required)
  • term, the search term, you can also use MeSH terms to perform your search (required)
  • retmax, the default of retrievable ids is 20, this argument can be used to change that number
  • retmode, to select the format of your output (XML or JSON), by default will be XML
  • use_history, to store a history of searches in NCBI’s server

The term argument can take a simple or more complicated search string linked together with boolean terms AND, OR, and NOT. The name of the field like [gene name] or [orgn] comes after the search term for that field. Note that the [orgn] field can accept scientific or common names or organisms.

search_tp53_hs <- entrez_search(db="gene", 
                                term ="TP53[gene] AND human[orgn]")
Lists

Let’s take a closer look at the results we get from our search. The search_tp53_hs object is a structure called a list. Recall that a vector is a one-dimensional data structure in R where every element must be of the same type. Lists are one-dimensional data structures where elements can be anything, even data frames or other lists. Lists are a common way to store hierarchical data, like XML or JSON data formats. You’ll find that it is very common for data to be stored as a list in bioinformatics workflows.

Homogeneous Heterogeneous
1D Vector
All elements of the same type
List
Can contain any type of R object
2D Matrix
All elements of the same type
Data Frame
Each column can differ in type

To access items in a list, you can use $ or [[]], similar to accessing vectors in a dataframe. There are five elements in our search object:

Element Description
count Total number of records in the database that match the search query.
retmax The number of IDs returned in the ids vector (default is 20, can be changed).
ids A character vector of NCBI record IDs that matched the search.
QueryTranslation Shows how NCBI interpreted and expanded your search query using synonyms etc.
file A temporary file storing your search results (used internally by rentrez).

We can look closer at the ids element:

search_tp53_hs$ids
[1] "7157"

or

search_tp53_hs[[1]]
[1] "7157"

The ids will likely be the most important part of the search object. You can use those ids with entrez_summary(), entrez_link() and entrez_fetch().

Let’s use the id to see if there are records for human TP53 in the nucleotide database. While Gene contains metadata about genes, nucleotide contains the actual sequences. There are three main arguments you need to supply to the link function:

  • dbfrom character name of database from which the id(s) originate
  • ids vector with unique ID(s) for records in database db
  • db character name of database in which to search for links. Use “all” to search for links in all databases.

So we are going to search from gene with the id we retrieved from our search, and look in “nuccore”, the name for nucleotide in the Entrez system.

link_tp53_hs <- entrez_link(dbfrom="gene", 
                            id=search_tp53_hs$ids, 
                            db="nuccore")

Let’s take a closer look at the links element of our link object

link_tp53_hs$links
elink result with information from 4 databases:
[1] gene_nuccore            gene_nuccore_mgc        gene_nuccore_refseqgene
[4] gene_nuccore_refseqrna 

This is saying there are related records for this gene in the general nucleotide collection, as well as in three subsets: Mammalian Gene Collection, RefSeqGene and RefSeqRNA. We will be especially interested in the RefSeqGene sequence. RefSeq, short for Reference Sequence, is NCBI’s curated collection of consensus sequences. These are likely to be higher quality than other sequences, so it is a good idea to use them for your analysis when possible.

We can access the RefSeqGene id:

link_tp53_hs$links$gene_nuccore_refseqgene
[1] "383209646"

Now we have the id for the sequence we want to look at, we use entrez_fetch(), which requires three arguments

  • db character name of the database to use
  • id vector of unique ID(s) for records in database db
  • rettype - character name of the format in which to return record, such as fasta or xml. We will ask for our record in fasta format, a common plain text way of representing sequence data.
fetch_tp53_hs <- entrez_fetch(db="nuccore",
                              id=link_tp53_hs$links$gene_nuccore_refseqgene, 
                              rettype = "fasta")

Now let’s take a look at our sequence in the console. You can see it’s pretty simple, just a header line starting with > and the sequence of bases.

When we print it to the console as usual, we see a lot of \n characters which is the newline symbol.

fetch_tp53_hs

We can use the cat function to print it out a little more nicely.

cat(fetch_tp53_hs)

We can also save this to a file

writeLines(fetch_tp53_hs, "data/tp53_human.fasta")

Try it Yourself!

CHALLENGE
  1. Look up which fields are searchable in the Protein database.
  2. Search Gene for the BRCA1 gene in humans. What id is returned?
  3. Use entrez_link() to find all databases that have a link to this Gene record. How many databases are linked? How many related records are there in Pubmed?
  1. entrez_db_searchable("protein")
  2. brca1_search <- entrez_search(db = "gene", term = "BRCA1[gene] AND human[orgn]")
    brca1_search$ids
  3. brca1_link <- entrez_link(dbfrom = "gene", id = test_search$ids, db = "all")
    brca1_link$links
    brca1_links$links$gene_pubmed

Manipulating sequences with Biostrings

Now we will use the Biostrings package from Bioconductor to read in our fasta file and save it as the specialized class XStringSet which essentially serves as a container sets of biological sequence data. The X can be replaced by the appropriate molecule: DNA, RNA, or AA for proteins. So here we are saving our object as a DNAStringSet

fasta <- readDNAStringSet("data/tp53_human.fasta")

XStringSets are made up of XString objects. That is, each sequence in a XStringSet is a XString object. Even though our fasta file only contains one sequence, we still save it as an DNAStringSet. We can specifically pull out our sequence into a DNAStringSet object.

dna_tp53_human <- fasta[[1]]

As you work more with bioinformatics packages, you’ll find that many of them create these kinds of specialized classes, and you’ll be able to put together workflows of packages designed to work with each other’s data structures.

Now we have our DNAString object, we can use BioStrings to get summary statistics and manipulate it to other objects.

First we can use alphabetFrequency to get the frequence of bases in our sequence.

alphabetFrequency(dna_tp53_human)
   A    C    G    T    M    R    W    S    Y    K    V    H    D    B    N    - 
8647 7986 8058 8081    0    0    0    0    0    0    0    0    0    0    0    0 
   +    . 
   0    0 

In addition to the four bases ACGT, it also contains several other codes that indicate ambiguous bases. This is a high quality consensus RefSeq sequence though, so we do not have any ambiguous bases.

We can specify we are only interested in ACGT counts by specifying the argument baseOnly = TRUE

alphabetFrequency(dna_tp53_human, baseOnly = TRUE)
    A     C     G     T other 
 8647  7986  8058  8081     0 

To look for a specific base or base, we can use letterFrequency instead of alphabetFrequency:

gc_content <- letterFrequency(dna_tp53_human, 
                              letters = c("G", "C"), 
                              as.prob=TRUE)
sum(gc_content)
[1] 0.4895643

We can also use Biostrings to generate the complement of the DNA strand, the reverse, and the reverse complement.

tp53_comp <- complement(dna_tp53_human)
tp53_rev <- reverse(dna_tp53_human)
tp53_rev_comp <- reverseComplement(dna_tp53_human)

Let’s compare:

dna_tp53_human
tp53_comp
tp53_rev
tp53_rev_comp
32772-letter DNAString object
seq: CTCCTTGGTTCAAGTAATTCTCCTGCCTCAGACTCC...AAAAAAAAGATACTACAAAGTCAAGAGACAAACAAT
32772-letter DNAString object
seq: GAGGAACCAAGTTCATTAAGAGGACGGAGTCTGAGG...TTTTTTTTCTATGATGTTTCAGTTCTCTGTTTGTTA
32772-letter DNAString object
seq: TAACAAACAGAGAACTGAAACATCATAGAAAAAAAA...CCTCAGACTCCGTCCTCTTAATGAACTTGGTTCCTC
32772-letter DNAString object
seq: ATTGTTTGTCTCTTGACTTTGTAGTATCTTTTTTTT...GGAGTCTGAGGCAGGAGAATTACTTGAACCAAGGAG