3. Identifier mapping with BridgeDb

Recipe Overview
Reading Time: 30 minutes
Executable Code: No
Recipe Type: Hands-on
Maturity Level & Indicator: DSM-4-C4

3.1. Main Objectives

The main purpose of this recipe is to:

Provide practical examples of how to map identifiers for genes, proteins, metabolites and pathways between resources using a purpose-built tool, BridgeDb. Hands-on guidance is provided for two of BridgeDb's interfaces: the R package and the web services (called here from Python).


3.2. Graphical Overview

This recipe will cover the highlighted topics


3.3. Requirements

This recipe has the following requirements:


3.4. Tools

The table below lists the software that is used to execute the examples in this recipe.

| Software | Description | Version | Biotools record |
| --- | --- | --- | --- |
| BridgeDb webservices | BridgeDb is a framework to map identifiers between various databases. It includes a Java library that provides an API to work with identifier-identifier mapping databases and resources. | 0.9.0 | https://bio.tools/bridgedb |
| Python | An interpreted, high-level and general-purpose programming language. | 3.8.5 | |
| pandas | pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language. | 1.1.3 | |
| R | R is a programming language and free software environment for statistical computing and graphics supported by the R Foundation for Statistical Computing. | 4.0.3 | https://bio.tools/r |
| tidyverse | tidyverse is an opinionated collection of R packages designed for data science. | 1.3.0 | |
| BridgeDbR | An R package for BridgeDb | 2.0.0 | |


3.5. Identifier mapping with BridgeDb

Interlinking data from different sources is an essential step for data reusability and interoperability. This step requires dedicated tools. With the present recipe, we show how to use BridgeDb to carry out this process.

BridgeDb is an open source tool dedicated to performing identifier mapping [2]. BridgeDb offers three different interfaces:

  • Java API

  • R package

  • REST Web-services

📖 In the context of this recipe, we distinguish between two types of identifiers:

  • Local identifiers which refer to identifiers that are minted within an organization or database and thus internally defined (i.e. local to said organization).

  • Global identifiers which refer to identifiers that are globally unique and uniquely point to an entity, as available from BridgeDb’s data sources file

We focus here on two distinct cases, depending on whether the incoming data already uses global identifiers or relies only on local identifiers.

In this recipe, we will cover how BridgeDb’s R package and webservices can be used to map between resource identifiers.

3.5.1. Mapping a global identifier to other global identifiers

In this case, the input data is a list of elements with an identifier that is part of BridgeDb’s data sources. In our example, we will use a list of Homo sapiens HUGO Gene Nomenclature Committee (HGNC) gene identifiers stored in a TSV file. The objective is to map these to other available gene identifiers.

3.5.1.1. BridgeDb via Webservices using Python

❗ For this tutorial Python v3.8.5, pandas v1.1.3, and BridgeDb Webservices v0.9.0 were used.

One of the biggest benefits of the BridgeDb webservices is that they can be accessed from most programming languages. Python has become one of the leading programming languages in data science and predictive modelling. Although there is no dedicated BridgeDb Python library, we show here how to use the BridgeDb webservices to perform example mappings.

First, the required libraries need to be installed with several ‘pip install’ commands (in the terminal):

## Install pandas:
pip install wheel
pip install pandas
## Install API calls library
pip install requests

Second, load the required libraries in your Python code:

## Load pandas to work with tabular data:
import pandas
## Check the installed version of pandas:
print(pandas.__version__)
## Load requests to send GET and POST requests:
import requests

Next, we define strings containing the URL of the webservice and the specific method we want to call; in our case, a batch cross-reference (xrefsBatch). The query must specify the organism and the source data source. Optionally, we can also specify a target data source if we only want to map to one specific resource, e.g. Ensembl.

url = "https://webservice.bridgedb.org/"
batch_request = url+"{org}/xrefsBatch/{source}{}"
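To make the template above concrete, here is a minimal sketch of how it expands for a given organism, source and optional target. The helper name build_batch_url is hypothetical (it is not part of BridgeDb), and the organism name is percent-encoded explicitly because it contains a space.

```python
from urllib.parse import quote

BASE_URL = "https://webservice.bridgedb.org/"
BATCH_TEMPLATE = BASE_URL + "{org}/xrefsBatch/{source}{}"


def build_batch_url(org, source, target=None):
    """Return the xrefsBatch URL; `target` is an optional data source code."""
    # An empty suffix means "map to every available data source".
    suffix = "?dataSource=" + quote(target) if target else ""
    return BATCH_TEMPLATE.format(suffix, org=quote(org), source=quote(source))


print(build_batch_url("Homo sapiens", "H", "En"))
print(build_batch_url("Homo sapiens", "H"))
```

Keeping the optional part as a single positional placeholder mirrors the template used in this recipe: passing an empty string simply leaves the query string off.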

If the aim is to map only to a specific target data source, then one can check whether the mapping is supported by invoking the following webservice call:

mapping_available = "{org}/isMappingSupported/{source}/{target}"
query = url+mapping_available.format(org='Homo sapiens', source='H', target='En')
requests.get(query).text

This will return True if the mapping between the given source and target is supported for the given organism or False otherwise.
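Note that requests.get(query).text is a string, and in Python any non-empty string (including "false") is truthy. A small sketch of converting the answer into a real boolean follows; the helper name parse_supported is hypothetical, and it assumes the response body is just the word true or false, possibly with different capitalisation or surrounding whitespace.

```python
def parse_supported(text):
    """Convert the service's textual answer into a Python bool."""
    answer = text.strip().lower()
    if answer not in ("true", "false"):
        # Fail loudly rather than guess when the body is something else.
        raise ValueError(f"Unexpected response: {text!r}")
    return answer == "true"


print(parse_supported("true\n"))
```

With this in place, a pipeline could skip the batch request when parse_supported(requests.get(query).text) is False.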

We then load our data into a pandas dataframe and call the requests library using our query.

query = batch_request.format('?dataSource=En', org='Homo sapiens', source='H')
response = requests.post(query, data=data.to_csv(index=False, header=False))
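The snippet above relies on a data variable that holds the identifiers. As a minimal sketch of what that could look like, the example below reads a headerless, single-column TSV into a dataframe and builds the request body from it. The in-memory text stands in for the real file, and the column name identifier is an assumption made for illustration.

```python
import io

import pandas

# Stand-in for the TSV file: one HGNC identifier per line, no header.
tsv_content = "A1BG\nA1CF\nA2MP1\n"

data = pandas.read_csv(
    io.StringIO(tsv_content), sep="\t", header=None, names=["identifier"]
)

# The POST body is the identifier column, one value per line,
# written without the index and without a header row.
body = data.to_csv(index=False, header=False)
print(body)
```

For a file on disk, the io.StringIO object would be replaced by the file path.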

The webservice response is now stored in the response variable. We can then simply pass this variable to the to_df method provided in the bridgedb_script.py module (see Code). This method will extract the response in text form and turn it into a pandas Dataframe with conveniently named columns and structured data.

The output table will contain the:

  • Original identifier

  • Data source that the identifier is part of

  • Mapped identifier

  • Data source for the mapped identifier
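The to_df helper itself lives in bridgedb_script.py and is not reproduced here. As a rough illustration of the parsing step, the sketch below assumes that each line of the batch response has the form identifier, TAB, source name, TAB, then a comma-separated list of code:identifier pairs. That layout, the function name parse_batch_response and the sample text are assumptions for illustration, not the module's actual code.

```python
def parse_batch_response(text):
    """Flatten an assumed xrefsBatch response into
    (original, source, mapping, target) rows."""
    rows = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip blank or incomplete lines
        original, source, xrefs = fields[0], fields[1], fields[2]
        for xref in xrefs.split(","):
            # Split on the first colon only, so identifiers such as
            # "GO:0072562" keep their own prefix.
            target, sep, mapping = xref.partition(":")
            if sep:  # entries without "code:identifier" shape are ignored
                rows.append((original, source, mapping, target))
    return rows


sample = (
    "A1BG\tHGNC\tEn:ENSG00000121410,T:GO:0072562\n"
    "A1CF\tHGNC\tEn:ENSG00000148584\n"
)
for row in parse_batch_response(sample):
    print(row)
```

Passing the rows to pandas.DataFrame(rows, columns=['original', 'source', 'mapping', 'target']) would give a dataframe with the four columns listed above.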

In our case the output of to_df is:

| original | source | mapping | target |
| --- | --- | --- | --- |
| A1BG | HGNC | ENSG00000121410 | En |
| A1CF | HGNC | ENSG00000148584 | En |
| A2MP1 | HGNC | ENSG00000256069 | En |

If we do not specify a target data source (by passing an empty string as the parameter), we get all available mappings for the given identifiers. In our case (first 10 rows):

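| original | source | mapping | target |
| --- | --- | --- | --- |
| A1BG | HGNC | uc002qsd.5 | Uc |
| A1BG | HGNC | 8039748 | X |
| A1BG | HGNC | GO:0072562 | T |
| A1BG | HGNC | uc061drj.1 | Uc |
| A1BG | HGNC | ILMN_2055271 | Il |
| A1BG | HGNC | Hs.529161 | U |
| A1BG | HGNC | GO:0070062 | T |
| A1BG | HGNC | GO:0002576 | T |
| A1BG | HGNC | uc061drt.1 | Uc |
| A1BG | HGNC | 51020_at | X |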

As this shows, calling the BridgeDb webservice from Python takes only a few lines of code and can be integrated into an annotation pipeline.

3.5.1.2. BridgeDb via the dedicated R package

Note

For this tutorial R v4.0.3, tidyverse v1.3.0, and BridgeDbR v2.0.0 were used.

After having loaded the required R libraries, we read the data and create a new column to include the source of the identifier.

data_df <- read_tsv(filepath, col_names=c('identifier'))
data_df$source = 'H'

We then load the data for the organism we are mapping from.

 location <- getDatabase('Homo sapiens')
 mapper <- loadDatabase(location)

And use the library’s dedicated function to map the identifiers:

mapping = maps(mapper, data_df, target='En')

This will return:

| identifier | source | target | mapping |
| --- | --- | --- | --- |
| A1BG | H | En | ENSG00000121410 |
| A1CF | H | En | ENSG00000148584 |
| A2MP1 | H | En | ENSG00000256069 |

As in the Python example, we can obtain all possible mappings simply by not specifying the target. This results in (first 10 rows):

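| identifier | source | target | mapping |
| --- | --- | --- | --- |
| A1BG | H | Uc | uc002qsd.5 |
| A1BG | H | X | 8039748 |
| A1BG | H | T | GO:0072562 |
| A1BG | H | Uc | uc061drj.1 |
| A1BG | H | Il | ILMN_2055271 |
| A1BG | H | U | Hs.529161 |
| A1BG | H | T | GO:0070062 |
| A1BG | H | T | GO:0002576 |
| A1BG | H | Uc | uc061drt.1 |
| A1BG | H | X | 51020_at |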

Warning

An error message indicating “Error in download.file” may be thrown. This may be caused by the timeout option being set to too small a value. To remedy the issue, try increasing the timeout by calling options(timeout=300).

3.5.2. Mapping a local identifier to a different global identifier

Note

In this section, we assume that we already have an equivalence file containing the mapping of a local identifier to one of the global identifiers. In our case, this will be contained in a TSV where we map our local gene identifier to HGNC. One may consult the list of other potential data formats in the Interlinking data from different sources recipe. The mapping should be one-to-one for this recipe.

The TSV mapping file looks as follows:

| local | source |
| --- | --- |
| aa11 | A1BG |
| bb34 | A1CF |
| eg93 | A2MP1 |

You may notice that the source identifiers correspond to those used in the previous example.

This is how the mapping will work

3.5.2.1. Webservices in Python

As before, we define variables holding the webservice URL and the method that we will use, in this instance xrefsBatch. We then pass the source column to the POST request as follows:

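org, source = 'Homo sapiens', 'H'  # same organism and source code as before
# case2 is the equivalence table; its 'source' column holds the HGNC identifiers
source_data = case2.source.to_csv(index=False, header=False)
query = batch_request.format('', org=org, source=source)
response2 = requests.post(query, data=source_data)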

Note that we did not pass a target data source here; this could be done as described before. We then call to_df again and, as expected, obtain the same dataframe as before (stored here as mappings). To see the equivalences with our local identifiers, we can simply join the dataframes, as follows:

local_mapping = mappings.join(case2.set_index('source'), on='original')

which will return the following table (first 10 rows):

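| original | source | mapping | target | local |
| --- | --- | --- | --- | --- |
| A1BG | HGNC | uc002qsd.5 | Uc | aa11 |
| A1BG | HGNC | 8039748 | X | aa11 |
| A1BG | HGNC | GO:0072562 | T | aa11 |
| A1BG | HGNC | uc061drj.1 | Uc | aa11 |
| A1BG | HGNC | ILMN_2055271 | Il | aa11 |
| A1BG | HGNC | Hs.529161 | U | aa11 |
| A1BG | HGNC | GO:0070062 | T | aa11 |
| A1BG | HGNC | GO:0002576 | T | aa11 |
| A1BG | HGNC | uc061drt.1 | Uc | aa11 |
| A1BG | HGNC | 51020_at | X | aa11 |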
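The join step can be tried on a small, self-contained example. The two dataframes below are hand-built stand-ins for mappings and case2, with values copied from the tables in this recipe, so no webservice call is needed.

```python
import pandas

# Stand-in for the output of to_df: one row per mapping.
mappings = pandas.DataFrame(
    [
        ("A1BG", "HGNC", "ENSG00000121410", "En"),
        ("A1BG", "HGNC", "uc002qsd.5", "Uc"),
        ("A1CF", "HGNC", "ENSG00000148584", "En"),
    ],
    columns=["original", "source", "mapping", "target"],
)

# Stand-in for the equivalence file: local identifier -> HGNC identifier.
case2 = pandas.DataFrame(
    {"local": ["aa11", "bb34", "eg93"], "source": ["A1BG", "A1CF", "A2MP1"]}
)

# Index case2 by its HGNC column, then look up each 'original' value in it.
local_mapping = mappings.join(case2.set_index("source"), on="original")
print(local_mapping)
```

Because join performs a left join by default, every row of mappings is kept and gains its local identifier, while identifiers that have no mapping (A2MP1 in this stand-in) do not appear in the result.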

Had we specified Ensembl (En) as the target, we would instead get:

| original | source | mapping | target | local |
| --- | --- | --- | --- | --- |
| A1BG | HGNC | ENSG00000121410 | En | aa11 |
| A1CF | HGNC | ENSG00000148584 | En | bb34 |
| A2MP1 | HGNC | ENSG00000256069 | En | eg93 |

Here, we see a one-to-one relation between the identifiers in HGNC and En while the relation between HGNC and UCSC Genome Browser (Uc) or Gene Ontology (T) is one-to-many. Depending on the identifiers and resources, the relation could also be many-to-many as shown below.

Note

This many-to-many relationship stems from different scientific lenses in the data sources. You can read more about these in [1]. The core idea is that, depending on the domain or application of the data, we can consider different entities as unique. While certain proteins could be considered “equal” from a biological perspective, they may require differentiation when using a chemical lens. This is what then leads to many-to-many relationships.
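One way to check which kind of relation a mapping result contains is to count, for each target data source, how many mapped identifiers each source identifier has, and how many source identifiers share each mapped identifier. The sketch below does this on plain tuples; the function name relation_per_target is hypothetical, and the sample rows are taken from the tables above.

```python
from collections import defaultdict


def relation_per_target(rows):
    """Classify the source-to-target relation for each target data source.

    `rows` is an iterable of (original, mapping, target) tuples.
    """
    forward = defaultdict(set)   # (target, original) -> mapped identifiers
    backward = defaultdict(set)  # (target, mapping) -> original identifiers
    for original, mapping, target in rows:
        forward[(target, original)].add(mapping)
        backward[(target, mapping)].add(original)

    relations = {}
    for target in {t for t, _ in forward}:
        # "many" on the left: several originals share one mapped identifier.
        left = any(len(v) > 1 for (t, _), v in backward.items() if t == target)
        # "many" on the right: one original has several mapped identifiers.
        right = any(len(v) > 1 for (t, _), v in forward.items() if t == target)
        relations[target] = (
            f"{'many' if left else 'one'}-to-{'many' if right else 'one'}"
        )
    return relations


rows = [
    ("A1BG", "ENSG00000121410", "En"),
    ("A1CF", "ENSG00000148584", "En"),
    ("A1BG", "uc002qsd.5", "Uc"),
    ("A1BG", "uc061drj.1", "Uc"),
    ("A1BG", "GO:0072562", "T"),
    ("A1BG", "GO:0070062", "T"),
]
print(relation_per_target(rows))
```

The classification only reflects the rows it is given, so a relation reported as one-to-one on a small sample may turn out to be one-to-many on the full dataset.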

3.5.2.2. R Package

Here, we will follow the same steps as in the previous case. The only difference is that we need to specify the columns/fields to use when loading the data:

data_df <- read_tsv(filepath, col_names=c('local', 'identifier'))

Then, after computing the mapping, we can join it with the local identifiers:

right_join(data_df, mapping)

Assuming we did not specify the target data source we obtain the following table (first 10 rows):

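| local | identifier | source | target | mapping |
| --- | --- | --- | --- | --- |
| aa11 | A1BG | H | Uc | uc002qsd.5 |
| aa11 | A1BG | H | X | 8039748 |
| aa11 | A1BG | H | T | GO:0072562 |
| aa11 | A1BG | H | Uc | uc061drj.1 |
| aa11 | A1BG | H | Il | ILMN_2055271 |
| aa11 | A1BG | H | U | Hs.529161 |
| aa11 | A1BG | H | T | GO:0070062 |
| aa11 | A1BG | H | T | GO:0002576 |
| aa11 | A1BG | H | Uc | uc061drt.1 |
| aa11 | A1BG | H | X | 51020_at |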

In case we did specify the target data source we would get:

| local | identifier | source | target | mapping |
| --- | --- | --- | --- | --- |
| aa11 | A1BG | HGNC | En | ENSG00000121410 |
| bb34 | A1CF | HGNC | En | ENSG00000148584 |
| eg93 | A2MP1 | HGNC | En | ENSG00000256069 |


3.6. Provenance

BridgeDb provides provenance information through:

  • A call to the /properties/ method of the Webservice

  • getProperties() in BridgeDbR (passing the mapper as a parameter)

This returns the following information for each of the data sources for a given organism:

  • Data source name

  • Build date

  • Series

  • Data type

  • Data source version

  • Schema version
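As a rough sketch of how the webservice route could be used from Python, the example below builds the URL of the properties call and splits a response into key/value pairs. Several things here are assumptions rather than documented behaviour: the exact path (organism followed by /properties), the tab-separated key/value layout of the response, and the helper names. The sample text is invented purely to exercise the parser.

```python
from urllib.parse import quote

BASE_URL = "https://webservice.bridgedb.org/"


def properties_url(org):
    """Build the (assumed) URL of the properties call for one organism."""
    return BASE_URL + quote(org) + "/properties"


def parse_properties(text):
    """Split an assumed 'key<TAB>value' response into (key, value) pairs."""
    pairs = []
    for line in text.splitlines():
        key, sep, value = line.partition("\t")
        if sep:  # ignore lines without a tab separator
            pairs.append((key.strip(), value.strip()))
    return pairs


# Invented sample text, only to exercise the parser.
sample = "EXAMPLE_KEY\texample value\nOTHER_KEY\t42\n"
print(properties_url("Homo sapiens"))
print(parse_properties(sample))
```

A list of pairs is used instead of a dictionary so that keys repeated across several data sources are not silently overwritten. In practice the text would come from requests.get(properties_url(org)).text.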

Improvements on provenance are under way (see here).


3.7. Code

You can find ready-made methods to map using R and Python for the given use cases here. These assume the data has the structure described in this recipe.


3.8. Conclusion

We showed how to use BridgeDb’s webservices and R package to map identifiers from different data sources using a minimal dataset. BridgeDb provides handy functionality to make ‘omics’ data more interoperable and reusable. As with all annotation services, it is important to bear in mind the version of the service being used as well as the data on which the service invocation has been performed. These are aspects of information provenance which we plan to provide in the future.

3.9. References

3.10. Authors