R Community of Practice Week 5

Accessing NCBI data with the Rentrez package

Introduction

This week you will learn about Rentrez, an R package that lets you interact with the NCBI API. With Rentrez, you do not need any additional program or terminal to access NCBI data. As an R practitioner, you can request data from multiple databases in the same RStudio session.

For the last few weeks, you have been working with data that you downloaded and accessed locally in RStudio. Now we are shifting gears to work with data that is not stored on our computers but is requested dynamically from an API.

The Data

For this lesson, we are going to use Rentrez to get semi-structured data directly from the PubMed database. Semi-structured data is usually represented in XML, JSON, and similar formats. Because of its nested nature, semi-structured data can sometimes be hard to tabulate.

Our data is located in the National Center for Biotechnology Information (NCBI) databases.

The NCBI databases use their own metadata schema. The Entrez documentation and tutorials explain that schema and how to access the information.

We will complete the following tasks:

  1. Install and Load Rentrez.
  2. Perform simple and boolean searches.
  3. Get esummaries (partial information of an article) for a list of ids.
  4. Extract information from the esummaries.

Install and Load Rentrez

Now, remember that you need to install any new package that you want to use in RStudio. Also, once you have the package you need to load it.

#install.packages('rentrez')
library(rentrez)

Now that you have installed the package and loaded it into RStudio, let’s take a brief look at the documentation available for this resource.

Let’s visit the Rentrez Documentation

This PDF provides details about each function in the Rentrez package. For each function, you will find a description, usage examples, arguments, and the type of the return value. Remember that you can always use help() or ? in R to look up a function’s description, for example ?entrez_search.

Rentrez Functions

Helper Functions

Helper functions in Rentrez allow you to get acquainted with the NCBI databases and their searchable fields. They also let you check when a database was last updated.

entrez_dbs()

This function provides a list of NCBI databases that you can use to perform searches. Let’s see how this function works.

entrez_dbs()
 [1] "pubmed"          "protein"         "nuccore"         "ipg"            
 [5] "nucleotide"      "structure"       "genome"          "annotinfo"      
 [9] "assembly"        "bioproject"      "biosample"       "blastdbinfo"    
[13] "books"           "cdd"             "clinvar"         "gap"            
[17] "gapplus"         "grasp"           "dbvar"           "gene"           
[21] "gds"             "geoprofiles"     "homologene"      "medgen"         
[25] "mesh"            "nlmcatalog"      "omim"            "orgtrack"       
[29] "pmc"             "popset"          "proteinclusters" "pcassay"        
[33] "protfam"         "pccompound"      "pcsubstance"     "seqannot"       
[37] "snp"             "sra"             "taxonomy"        "biocollections" 
[41] "gtr"            

entrez_db_summary()

How do we know that the database is up to date? We can use the entrez_db_summary() function to see the latest updates of the selected NCBI database.

entrez_db_summary("pubmed")
 DbName: pubmed
 MenuName: PubMed
 Description: PubMed bibliographic record
 DbBuild: Build-2023.07.11.20.04
 Count: 35933451
 LastUpdate: 2023/07/11 20:04 

entrez_db_searchable()

How can I build PubMed queries? What search fields does the database have? We can use the entrez_db_searchable() function to see the searchable fields and their descriptions.

entrez_db_searchable("pubmed")
Searchable fields for database 'pubmed'
  ALL    All terms from all searchable fields 
  UID    Unique number assigned to publication 
  FILT   Limits the records 
  TITL   Words in title of publication 
  MESH   Medical Subject Headings assigned to publication 
  MAJR   MeSH terms of major importance to publication 
  JOUR   Journal abbreviation of publication 
  AFFL   Author's institutional affiliation and address 
  ECNO   EC number for enzyme or CAS registry number 
  SUBS   CAS chemical name or MEDLINE Substance Name 
  PDAT   Date of publication 
  EDAT   Date publication first accessible through Entrez 
  VOL    Volume number of publication 
  PAGE   Page number(s) of publication 
  PTYP   Type of publication (e.g., review) 
  LANG   Language of publication 
  ISS    Issue number of publication 
  SUBH   Additional specificity for MeSH term 
  SI     Cross-reference from publication to other databases 
  MHDA   Date publication was indexed with MeSH terms 
  TIAB   Free text associated with Abstract/Title 
  OTRM   Other terms associated with publication 
  COLN   Corporate Author of publication 
  CNTY   Country of publication 
  PAPX   MeSH pharmacological action pre-explosions 
  GRNT   NIH Grant Numbers 
  MDAT   Date of last modification 
  CDAT   Date of completion 
  PID    Publisher ID 
  FAUT   First Author of publication 
  FULL   Full Author Name(s) of publication 
  FINV   Full name of investigator 
  TT     Words in transliterated title of publication 
  LAUT   Last Author of publication 
  PPDT   Date of print publication 
  EPDT   Date of Electronic publication 
  LID    ELocation ID 
  CRDT   Date publication first accessible through Entrez 
  BOOK   ID of the book that contains the document 
  ED     Section's Editor 
  ISBN   ISBN 
  PUBN   Publisher's name 
  AUCL   Author Cluster ID 
  EID    Extended PMID 
  DSO    Additional text from the summary 
  AUID   Author Identifier 
  PS     Personal Name as Subject 
  COIS   Conflict of Interest Statements 
  WORD   Free text associated with publication 
  P1DAT      Date publication first accessible through Solr 

Let’s strategize our search! Now that we know the search fields you can use to build a query, let’s perform our first search.

Performing searches with Rentrez

The functions listed under this category allow you to use a search term or query to retrieve a list of article/object ids. This list of ids will later allow you to retrieve partial or full summary records of that article/object.
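
The search step itself is not walked through in this section, so here is a minimal sketch of how a search that produces an object like pcos_pm might look. build_query() is a hypothetical helper written for this example (it is not part of Rentrez), and the search terms are assumptions chosen for illustration. The TIAB field tag comes from the entrez_db_searchable() listing above.

```r
# Hypothetical helper (not part of rentrez): tag each term with a search
# field and join the tagged terms with a boolean operator.
build_query <- function(terms, field, operator = "AND") {
  tagged <- paste0(terms, "[", field, "]")
  paste(tagged, collapse = paste0(" ", operator, " "))
}

# Illustrative terms; swap in your own.
pcos_query <- build_query(c("PCOS", "insulin resistance"), field = "TIAB")
print(pcos_query)
# [1] "PCOS[TIAB] AND insulin resistance[TIAB]"

# The live search needs rentrez and an internet connection, so it only
# runs in an interactive session such as RStudio.
if (interactive()) {
  pcos_pm <- rentrez::entrez_search(db = "pubmed", term = pcos_query, retmax = 20)
  print(pcos_pm$count)  # total number of matching records in PubMed
  print(pcos_pm$ids)    # up to retmax PubMed IDs, as a character vector
}
```

Passing operator = "OR" to the helper broadens the search instead of narrowing it. The ids element of the result is what entrez_summary() expects in its id argument.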

entrez_summary()

Now let’s retrieve partial information about the records we collected in one of our searches. This function takes a vector of unique IDs or just one id.

But, first, let’s learn the syntax of the entrez_summary() function.

entrez_summary(db="<database name>", id=< >)

The entrez_summary() function takes 7 arguments, but you don’t need to use them all. The required ones are db and either id or web_history.

  • List of arguments and their usage:
    • db, name of the database to search
    • id, unique ID(s) for records in database
    • web_history, stored article/object ids in NCBI server

EXAMPLES:

#Rentrez summary function to get publication information
#In this command we are giving a list of ids by using pcos_pm$ids. The $ operator extracts a subset of a data object in R.  
summary_pcos_pm <- entrez_summary(db="pubmed", id=pcos_pm$ids)

What does the entrez_summary() function RETURN?

This function returns an esummary object for a single record, or a list of esummary objects (an esummary_list) for multiple records. Each esummary behaves like a list, so you can use $ to access any item.

For the entrez_summary() function, Rentrez has created a function that allows you to extract information from each metadata field avoiding the challenges that come with navigating an XML document.

extract_from_esummary()

This function extracts elements from a single esummary record or from a list of esummary records, so you don’t have to navigate the underlying XML yourself.

But first, let’s learn the syntax of the extract_from_esummary() function.

extract_from_esummary(esummaries, elements, simplify = TRUE)

The extract_from_esummary() function takes 3 arguments, but you don’t need to use them all. The required ones are esummaries and elements.

  • List of arguments and their usage:
    • esummaries, either an esummary or an esummary_list (as returned by entrez_summary)
    • elements, the name(s) of the element(s) to extract from each record
    • simplify, if possible return a vector

EXAMPLE:

# If you have been performing searches in PubMed, you can use the following coding lines as a guide.

uids <- extract_from_esummary(summary_pcos_pm,"uid")
authors <- extract_from_esummary(summary_pcos_pm,"authors")["name",]

What does the extract_from_esummary() function RETURN?

This function returns a list or vector containing information on the requested item.
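
When several elements are requested at once, the result is arranged with one row per element and one column per record. The sketch below turns that shape into a data frame with one row per record. esummary_matrix_to_df() is a hypothetical helper (not part of Rentrez), the "PCOS" search is an assumed stand-in for pcos_pm, and the helper assumes every requested element holds a single value per record (so it would not suit "authors").

```r
# Hypothetical helper: convert the elements-by-records result of
# extract_from_esummary() into a data frame with one row per record.
# Assumes each requested element is a single value per record.
esummary_matrix_to_df <- function(m) {
  cols <- lapply(rownames(m), function(el) unlist(m[el, ], use.names = FALSE))
  names(cols) <- rownames(m)
  as.data.frame(cols, stringsAsFactors = FALSE)
}

if (interactive()) {  # needs rentrez and an internet connection
  library(rentrez)
  pcos_pm <- entrez_search(db = "pubmed", term = "PCOS", retmax = 5)  # assumed search
  summary_pcos_pm <- entrez_summary(db = "pubmed", id = pcos_pm$ids)
  fields <- extract_from_esummary(summary_pcos_pm,
                                  c("uid", "pubdate", "source", "title"))
  records <- esummary_matrix_to_df(fields)
  print(head(records))
}
```

A data frame like records is convenient for the plotting and counting steps later in this lesson, because each column can be passed straight to ggplot() or table().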

entrez_fetch()

Another way to get records from the NCBI API is by using the entrez_fetch() function. In some cases, entrez_fetch() will retrieve the complete bibliographic record of the article or data object. Another difference between entrez_summary() and entrez_fetch() is the way each function arranges the data in RStudio. entrez_summary() returns list-like esummary objects, while entrez_fetch() returns the record in the format requested with rettype (here XML), as a character string by default.

EXAMPLE:

fetch <- entrez_fetch(db= "pubmed", id = pcos_pm$ids, rettype = "xml")
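
To work with the fetched XML rather than just store it, one option is to ask entrez_fetch() for a parsed document with parsed = TRUE and then query it with XPath from the XML package. This is a minimal sketch: article_titles() is a hypothetical helper, the "PCOS" search is an assumed stand-in for pcos_pm, and the //ArticleTitle path assumes the usual layout of PubMed XML records.

```r
# Hypothetical helper: pull the text of every <ArticleTitle> node out of a
# parsed PubMed XML document using an XPath query.
article_titles <- function(doc) {
  XML::xpathSApply(doc, "//ArticleTitle", XML::xmlValue)
}

if (interactive()) {  # needs rentrez, XML, and an internet connection
  library(rentrez)
  pcos_pm <- entrez_search(db = "pubmed", term = "PCOS", retmax = 3)  # assumed search
  # parsed = TRUE returns a parsed XML document instead of a character string
  fetch_xml <- entrez_fetch(db = "pubmed", id = pcos_pm$ids,
                            rettype = "xml", parsed = TRUE)
  print(article_titles(fetch_xml))
}
```

The same pattern works for other nodes by changing the XPath expression, at the cost of having to know the XML layout, which is the work that extract_from_esummary() saves you for summaries.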

Contrasting NCBI Databases

entrez_global_query()

The entrez_global_query() function allows you to search a term on all NCBI Entrez Databases.

The syntax for this function is

entrez_global_query(term= "search term", config = NULL, ...)

As you can see, the function takes config = NULL as an argument, but we are not going to use it in this tutorial.

The config argument passes configuration options for the underlying HTTP request to the httr package. In the NCBI API context, if you are performing big data requests, the recommendation is to use web_history so that you do not saturate their servers.

Let’s test this function and see what it returns.

EXAMPLE:

global_query <- "PCOS"
all_databases <- entrez_global_query(global_query)
# Let's view the results
all_databases
         pubmed             pmc            mesh           books    pubmedhealth 
          15288           18164               1             474              NA 
           omim      ncbisearch         nuccore          nucgss          nucest 
             22               0          104889               0               0 
        protein          genome       structure        taxonomy             snp 
          12312              72               4               0               0 
          dbvar            gene             sra      biosystems         unigene 
              0             617            3186              NA               0 
            cdd           clone          popset     geoprofiles             gds 
              3               0               3          373883             961 
     homologene      pccompound     pcsubstance         pcassay      nlmcatalog 
              0               0               5              63              33 
          probe             gap proteinclusters      bioproject       biosample 
              0              53               0             199            4180 
 biocollections 
              0 

What does the entrez_global_query() function RETURN?

This function returns a named vector with a count for each database, that is, the total number of records matching your search term in each one.
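
Because the result is a named vector, ordinary vector tools apply. The sketch below works on a handful of counts copied from the output above, so it runs without a connection; in a live session you would use all_databases directly. The variable names counts, hits, and ranked are just for this example.

```r
# A few counts copied from the output above; in a live session you would
# use the all_databases object directly.
counts <- c(pubmed = 15288, pmc = 18164, mesh = 1, books = 474,
            pubmedhealth = NA, omim = 22, ncbisearch = 0)

print(counts[["pubmed"]])                     # look up one database by name

hits <- counts[!is.na(counts) & counts > 0]   # drop NA and empty databases
ranked <- sort(hits, decreasing = TRUE)       # most records first
print(names(ranked))
# [1] "pmc"    "pubmed" "books"  "omim"   "mesh"
```

Ranking the databases this way is a quick check on which one is worth searching in detail for your term.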

Another function that allows you to contrast searches across the NCBI databases is the entrez_link() function.

Exercise: Let’s practice!

Now let’s perform our own searches!

We are going to use the Rentrez functions to extract data from an NCBI database.

Our objective is to:

  1. Build your query: Identify your search terms, an appropriate database, and search fields, and perform the search using Rentrez.

  2. Get article/object summaries.

  3. Select the values that you would like to save (for example author, title, source).

  4. Create a graph that represents the results of your search.

STEP #1

Build your query: Identify your search terms, an appropriate database, and search fields, and perform the search using Rentrez.

# I'm interested in seeing the publications of Dr. Kristen Stafford, Associate Professor of Epidemiology and Deputy Director - Center for International Health, Education, and Biosecurity

search <- entrez_search(db="pubmed", term = "Kristen A. Stafford[AUTH]",retmax=100)

STEP #2

Get article summaries

summary <- entrez_summary(db="pubmed", id=search$ids)

STEP #3

Select the values that you would like to save (for example author, title, source)

# I want to know how many publications she has and in what journal she publishes the most. 
library(tidyverse)
pubdate <- extract_from_esummary(summary,"pubdate") %>%
  substr(start= 1, stop=4)
source <- extract_from_esummary(summary,"source")
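
The two questions in the comment above can also be answered with counts before plotting. This is a minimal base R sketch on made-up placeholder vectors (journals and years stand in for the source and pubdate objects; the values are not real search results).

```r
# Placeholder vectors standing in for the `source` and `pubdate` objects
# created above (made-up values, not real search results).
journals <- c("Journal A", "Journal B", "Journal A", "Journal C", "Journal A")
years    <- c("2019", "2020", "2020", "2021", "2021")

n_publications <- length(journals)            # total number of publications
journal_counts <- sort(table(journals), decreasing = TRUE)
top_journal <- names(journal_counts)[1]       # journal with the most publications

print(n_publications)
print(top_journal)
print(table(years))                           # publications per year
```

With real results, replacing journals with source and years with pubdate gives the same numbers that the bar charts in the next step display.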

STEP #4

Create a graph that represents the results of your search.

ggplot(mapping = aes(x=source)) + 
  geom_bar() +
  theme(axis.text.x = element_text(angle = 90)) # Rotate the journal names

ggplot(mapping = aes(x=pubdate)) + 
  geom_bar() +
  theme(axis.text.x = element_text(angle = 90)) 

Learning +1

Rentrez web_history

To perform big data requests, Rentrez offers the use_history boolean argument. When you enable history in a search, the returned object includes a web_history object that can be used to paginate across results. Without the web_history object, results are limited to the first N results, where N is set with retmax.

When you enable use_history in a search, the search object still includes only a small sample of ids. However, by using the web_history object in combination with the retmax and retstart arguments, you can obtain far more records than you would from a search without web_history.

Currently, pagination is limited to 10,000 records without an API key.
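
Pagination comes down to computing a starting offset for each page. The sketch below does only that arithmetic, so it runs without a connection. page_starts() is a hypothetical helper, and it assumes retstart is zero-based (the first record is at offset 0), as described in NCBI's E-utilities documentation.

```r
# Hypothetical helper: compute the retstart offset for each page.
# Assumes retstart is zero-based, so the first page starts at 0.
page_starts <- function(total, page_size, max_limit = 10000) {
  n <- min(total, max_limit)          # never page past the API limit
  if (n <= 0) return(integer(0))      # nothing to fetch
  seq(0, n - 1, by = page_size)
}

offsets <- page_starts(total = 250, page_size = 100)
print(offsets)  # three pages, starting at offsets 0, 100, and 200
```

Each offset is then passed as retstart, with retmax set to the page size, together with the web_history object from a search run with use_history = TRUE.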

Example

Scenario

I’m interested in gathering a list of genes related to PCOS publications within the years 2021 to 2024.

Obtain linked gene IDs from the “gene” NCBI database

# The loop below needs a search run with use_history = TRUE; the term is an
# assumed query for the scenario above (PCOS publications, 2021 to 2024)
search <- entrez_search(db="pubmed", term="PCOS AND 2021:2024[PDAT]", use_history=TRUE)

all_pubmed_gene_link <- c() # Empty vector to accumulate results
page_size <- 100 # retmax and page_size need to be the same number
max_limit <- 1000 # Can't be higher than 10000 due to API limits
max_results <- min(search$count,max_limit) # Set max result number for efficiency

# retstart is zero-based in the NCBI API, so the first page starts at 0
for(page_start in seq(0,max_results-1,page_size)){
  links <- entrez_link(
    dbfrom="pubmed", db="gene",
    web_history = search$web_history, 
    by_id = FALSE, # Return a unified list of genes, instead of genes per PubMed publication
    retmax = page_size, 
    retstart = page_start # page_start controls pagination
  )
  # Add results to the vector
  all_pubmed_gene_link <- c(
    all_pubmed_gene_link, 
    links$links$pubmed_gene
  ) 
}

Why did we use a for-loop?

In this instance a for-loop was needed to query the API iteratively, one page at a time. If API queries are performed in bulk, there is a higher risk of hitting the API’s rate limit. Rentrez offers internal functionality that, in theory, should protect the code from getting blocked by the API.

Display first results

head(all_pubmed_gene_link)
[1] "14910" "7157"  "1956"  "7124"  "3569"  "4524"