Which ontology or terminology to use?


Recipe metadata

identifier: RX.x

version: v0.1

Difficulty level:

Reading Time: 15 minutes

Recipe Type: Guidance

Executable Code: No

Intended Audience: Principal Investigators, Data Managers, Terminology Managers, Data Scientists, Ontologists


Main Objectives

The main purpose of this recipe is to provide guidance on how to select the most suitable semantic artefacts for a specific research context in general and, for IMI projects, for their main themes: risk assessment, clinical trials, drug discovery and fundamental research.

Graphical Overview of the FAIRification Recipe Objectives

graph TD
    I1(fa:fa-university what is the context?):::box -->|framework| M1(fa:fa-cube clinical trial context):::box
    I1 -->|framework| M2(fa:fa-cube observational patient outcome):::box
    I1 -->|framework| M3(fa:fa-cube basic research):::box
    M1 -->|consider| R1(fa:fa-cubes CDISC Vocabulary):::box
    M2 -->|consider| R2(fa:fa-cubes OHDSI Athena terminologies):::box
    M3 -->|consider| R3(fa:fa-cubes OBO Foundry resources):::box
    I2{fa:fa-university is public archive deposition required?}:::box -->|No| R3
    I2 -->|Yes| M4(consult FAIRsharing):::box
    M4 -->|EBI resources| M5(EFO):::box
    linkStyle 0,1,2,3,4,5,6,7,8 stroke:#2a9fc9,stroke-width:1px,color:#2a9fc9,font-family:avenir;
    classDef box font-family:avenir,font-size:14px,fill:#2a9fc9,stroke:#222,color:#fff,stroke-width:1px

Capability & Maturity Table

| Capability | Initial Maturity Level | Final Maturity Level |
|---|---|---|
| Interoperability | minimal | repeatable |

Context is everything

The domain of operation largely dictates which semantic framework makes most sense to select. In some fields, data standardization has advanced to the point where adopting a complete stack of standards, both syntactic and semantic, is a sound decision.

Two examples of such situations follow:

Clinical Trial Data

Operating in the field of clinical trials means that datasets are generated during interventional studies: researchers influence and control the predictor variables, usually different intensity levels of therapeutic agents, in order to gain insights into benefits for patient outcomes. In this context, regulatory requirements mean that data must be recorded in standard forms to allow review and appraisal by US FDA reviewers. As a result, the CDISC standards are the de facto standard in this area, and they mandate the use of semantic resources such as:

| Semantic Resource | Domain | License | Format | Service |
|---|---|---|---|---|
| CDISC vocabulary | clinical trial data | | | EVS |
| NCI Thesaurus | biomedicine | | | EVS, Bioportal, OLS |
| SNOMED-CT | pathology | | | EVS, Bioportal(§) |
| UMLS | pathology | | | EVS, Bioportal(§) |
| LOINC | laboratory tests | | | |
| RxNORM | drugs | | | Bioportal |
| GUDID | instruments | | | FDA |

All are available from the NCI Enterprise Vocabulary Services (EVS) system.

:bomb: Some resources are only available under restrictive licences that prevent derivative work, which may limit access and use. Furthermore, some licences are expensive.

Observational Health Data:

This context refers to data collected during observational studies which, in contrast to interventional studies, draw inferences from a sample to a population where the independent variable is not under the control of the researcher because of ethical concerns or logistical constraints [1]. This is typically the case in epidemiological work or in exposure follow-up studies carried out for risk assessment and the evaluation of clinical outcomes. Observational health data can also include electronic health records (EHR) or administrative insurance claims, and allow research aimed at acquiring real-world evidence from large corpora of data. In this specific context, one model and its associated set of standards have been particularly successful: the Observational Health Data Sciences and Informatics (OHDSI) open-science community has structured several hundred million patient records using the Observational Medical Outcomes Partnership (OMOP) model. Therefore, building a FAIRification process around the standards stack produced by the OHDSI community should be considered when operating in such a data context.

| Semantic Resource | Domain | License | Format | Service |
|---|---|---|---|---|
| CDISC vocabulary | clinical trial data | | | EVS |
| NCI Thesaurus | biomedicine | | | EVS, Bioportal, OLS |
| SNOMED-CT | pathology | | | EVS, Bioportal(§) |
| UMLS | pathology | | | EVS, Bioportal(§) |
| LOINC | laboratory tests | | | |
| RxNORM | drugs | | | Bioportal |

For a more detailed view of, and a deep dive into, the OHDSI and OMOP semantic support, read the chapter dedicated to controlled terminology in the Book of OHDSI.

Basic research context:

This refers to datasets and research output being generated using model organisms and cellular systems in the context of basic, fundamental research. In this arena, the regulatory pressure is much less present but this does not rule out data management good practice and proper archival requirements. As a consequence of fewer constraints, researchers are often confronted with a sea of options. This section aims to provide some guidance when tasked with deciding on which semantic resource to use.

:bell: An important consideration to bear in mind when selecting semantic resources is whether data archival in public repositories will be required. For instance, submitting to the NCBI Gene Expression Omnibus archive places no requirement on the choice of semantic resource, whereas for deposition to EMBL-EBI ArrayExpress, selecting a resource such as the Experimental Factor Ontology (EFO) could ease deposition.

:bell: The FAIRsharing registry is an ELIXIR resource which provides invaluable content, as the catalogue offers an overview of the various semantic artefacts used by public data repositories.
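
The context-driven choices described in this section can be summarised as a small lookup. The following is only a sketch: the function name `suggest_resources` and the wording of the returned suggestions are assumptions made for illustration, and the mapping simply transcribes the three contexts named in this recipe.

```python
from typing import List

# Research context -> family of semantic resources to consider first,
# transcribed from the contexts discussed in this recipe.
CONTEXT_TO_RESOURCES = {
    "clinical trial": "CDISC Vocabulary",
    "observational patient outcome": "OHDSI Athena terminologies",
    "basic research": "OBO Foundry resources",
}


def suggest_resources(context: str, public_deposition_required: bool = False) -> List[str]:
    """Return starting points for a research context (illustrative helper, not an official tool)."""
    key = context.strip().lower()
    if key not in CONTEXT_TO_RESOURCES:
        raise ValueError(f"unknown context: {context!r}")
    suggestions = [CONTEXT_TO_RESOURCES[key]]
    if public_deposition_required:
        # The target repository may expect specific resources,
        # e.g. EFO when depositing to EMBL-EBI ArrayExpress.
        suggestions.append("consult FAIRsharing for the target repository")
    return suggestions
```

Such a lookup is only a first filter; the selection criteria described later in this recipe still need to be applied to each candidate resource.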

Selecting Terminologies

Use Cases and Iterative Approach

  1. The use and implementation of common terminologies will enable the normalization/harmonization of variable labels (data labels) and allowed values (data terms) when querying the eTRIKS database. Using common terminologies in the curation workflow will ensure consistency of annotation across all studies.
  2. The clusters of dependent annotations (related data labels) also follow the eTRIKS Minimal Information Guidelines (MIGs), a set of core descriptors ensuring that a consistent breadth and depth of information is reported. Continuous feedback will be sought from WP2 and 4 and relevant users. The iterations will feed back into both the MIGs and the terminology selections.
  3. As part of this iterative process, the eTRIKS use cases and query cases will be documented in order to evaluate, revise and refine the set of terminologies and, where relevant, the associated selection criteria.
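
To make the idea of harmonizing variable labels and allowed values concrete, here is a minimal sketch. The lookup tables, the `harmonize` function and the input records are hypothetical and are not part of the eTRIKS tooling; the NCBITaxon identifiers shown are the ones commonly used for human and mouse.

```python
# Hypothetical lookup: free-text values found in study files, mapped to a
# (preferred label, term identifier) pair from the chosen terminology.
SPECIES_TERMS = {
    "human": ("Homo sapiens", "NCBITaxon:9606"),
    "homo sapiens": ("Homo sapiens", "NCBITaxon:9606"),
    "mouse": ("Mus musculus", "NCBITaxon:10090"),
    "mus musculus": ("Mus musculus", "NCBITaxon:10090"),
}

# Hypothetical lookup: variable labels seen across studies -> one common label.
LABEL_SYNONYMS = {"species": "organism", "organism": "organism", "taxon": "organism"}


def harmonize(record: dict) -> dict:
    """Normalize variable labels and, for the organism field, its value."""
    out = {}
    for raw_label, raw_value in record.items():
        label = LABEL_SYNONYMS.get(raw_label.strip().lower(), raw_label)
        if label == "organism":
            term = SPECIES_TERMS.get(str(raw_value).strip().lower())
            if term:
                out[label] = {"label": term[0], "id": term[1]}
            else:
                # Unmapped values are kept verbatim, without an identifier,
                # so that a curator can review them.
                out[label] = {"label": raw_value, "id": None}
        else:
            out[label] = raw_value
    return out
```

Two studies recording `Species: Human` and `taxon: homo sapiens` would then both yield the same `organism` annotation, which is what makes cross-study queries possible.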

Selection Criteria

A set of widely accepted criteria for selecting terminologies (or other reporting standards) does not exist. However, the initial work by the Clinical and Translational Science Awards' (CTSA) Omics Data Standards Working Group and BioSharing (http://jamia.bmj.com/content/early/2013/10/03/amiajnl-2013-002066.long) has been used as a starting point to define the eTRIKS criteria for excluding and/or including a terminology resource.

  • Exclusion criteria:

    • :x: absent licence or terms of use (indicator of usability)
    • :x: restrictive licences or terms of use with restrictions on redistribution and reuse
    • :x: absence of term definitions
    • :x: absence of sufficient class metadata (indicator of quality)
    • :x: absence of sustainability indicators (absence of funding records)
  • Inclusion criteria:

    • :heavy_check_mark: scope and coverage meet the requirements of the concept identified
    • :heavy_check_mark: unique URI, textual definition and IDs for each term
    • :heavy_check_mark: resource releases are versioned
    • :heavy_check_mark: size of resource (indicator of coverage)
    • :heavy_check_mark: number of classes and subclasses (indicator of depth)
    • :heavy_check_mark: number of terms having definitions and synonyms (indicator of richness)
    • :heavy_check_mark: presence of a help desk and contact point (indicator of community support)
    • :heavy_check_mark: presence of term submission tracker / issue tracker (indicator of resource agility and capability to grow upon request)
    • :heavy_check_mark: potential integrative nature of the resource (as indicator of translational application potential)
    • :heavy_check_mark: licensing information available (as indicator of freedom to use)
    • :heavy_check_mark: use of a top-level ontology (as indicator of a resource built for generic use)
    • :heavy_check_mark: pragmatism (as indicator of actual, current real life practice)
    • :heavy_check_mark: possibility of collaborating: the resource accepts complaints/remarks that aim to fix or improve the terminology, while the resource organisation commits to fixing or improving the terminology within a short time frame (one month after receipt?)
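
The exclusion and inclusion criteria above lend themselves to a simple screening routine. The sketch below is an assumption about how a curator might record and apply them: the `Terminology` class, its field names and the `screen` function are hypothetical, and only a subset of the inclusion criteria (those that reduce to yes/no answers) is covered.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Terminology:
    """Facts a curator records about a candidate resource (hypothetical field names)."""
    name: str
    has_licence: bool
    licence_allows_reuse: bool
    has_term_definitions: bool
    has_class_metadata: bool
    has_sustainability_evidence: bool
    versioned_releases: bool = False
    has_help_desk: bool = False
    has_issue_tracker: bool = False
    uses_top_level_ontology: bool = False


# Each exclusion criterion maps to a test that is True when the criterion is triggered.
EXCLUSION_CHECKS = {
    "absent licence or terms of use": lambda t: not t.has_licence,
    "restrictive licence or terms of use": lambda t: t.has_licence and not t.licence_allows_reuse,
    "absence of term definitions": lambda t: not t.has_term_definitions,
    "absence of sufficient class metadata": lambda t: not t.has_class_metadata,
    "absence of sustainability indicators": lambda t: not t.has_sustainability_evidence,
}

# Subset of the inclusion criteria that can be answered with yes/no.
INCLUSION_FIELDS = (
    "versioned_releases",
    "has_help_desk",
    "has_issue_tracker",
    "uses_top_level_ontology",
)


def screen(t: Terminology) -> Tuple[List[str], int]:
    """Return (exclusion reasons triggered, number of inclusion criteria met)."""
    reasons = [reason for reason, triggered in EXCLUSION_CHECKS.items() if triggered(t)]
    met = sum(1 for name in INCLUSION_FIELDS if getattr(t, name))
    return reasons, met
```

A resource with a non-empty list of reasons would be set aside; the remaining candidates can be ranked by the count of inclusion criteria met, with quantitative indicators such as size and depth compared separately.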

Set of Core Terminologies

The terminologies have been organized by theme and scope. When possible, sections are organized by granularity level, progressing from the macroscopic scale (organism) to the microscopic scale (tissue, cells) and the molecular scale (macromolecules, proteins, small molecules, xenobiotic chemicals). Domains also cover processes or actions and their participants or agents, and can also be organized from general (disease) to specialized (infectious disease).

Organism, Organism Parts and Developmental Stages

The resources listed here focus on providing structured vocabularies to describe taxonomic and anatomical information. The table below also shows, for each resource, its file location, top-level ontology, licence and issue tracker.

| Scope | Name | File location | Top-Level Ontology | Licence | Issue Tracker URI | Comment |
|---|---|---|---|---|---|---|
| Organism | NCBITaxonomy | http://purl.obolibrary.org/obo/ncbitaxon.owl | none specified | This ontology is made available via the UMLS. Users of all UMLS ontologies must abide by the terms of the UMLS license, available at https://uts.nlm.nih.gov/license.html | | |
| Vertebrate Anatomy | UBERON | http://purl.obolibrary.org/obo/uberon/ext.owl, http://purl.obolibrary.org/obo/uberon/ext.obo | BFO | CC-by 3.0 Unported Licence | https://github.com/obophenotype/uberon/issues | Integrative resource engineered to go across species |
| Mouse Anatomy | MA | | | | | |
| Strain | Rat Strain Ontology | http://data.bioontology.org/ontologies/RS/submissions/46/download?apikey=4ea81d74-8960-4525-810b-fa1baab576ff | | | | |

In research, many different model organisms are used (e.g. dogs, monkeys) and specialized resources may be available for them. Use the selection criteria introduced earlier to gauge their value in the data management workflow and their impact on data integration tasks.
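
Several of the resources listed above are served from `purl.obolibrary.org`, and OBO Foundry term identifiers follow the same base address with the pattern `PREFIX_LOCALID`. The sketch below converts between the compact form (for instance `UBERON:0002107`) and the full address; the helper names are assumptions made for illustration.

```python
OBO_BASE = "http://purl.obolibrary.org/obo/"


def curie_to_purl(curie: str) -> str:
    """Expand a compact identifier such as 'UBERON:0002107' into an OBO PURL."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or not prefix or not local_id:
        raise ValueError(f"not a compact identifier: {curie!r}")
    return f"{OBO_BASE}{prefix}_{local_id}"


def purl_to_curie(purl: str) -> str:
    """Contract an OBO PURL back into its compact form."""
    if not purl.startswith(OBO_BASE):
        raise ValueError(f"not an OBO PURL: {purl!r}")
    # Split on the last underscore: the local identifier follows it.
    prefix, sep, local_id = purl[len(OBO_BASE):].rpartition("_")
    if not sep or not prefix or not local_id:
        raise ValueError(f"unexpected PURL shape: {purl!r}")
    return f"{prefix}:{local_id}"
```

Storing the full address alongside the label in annotated datasets keeps each term resolvable regardless of which browser or lookup service is used later.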

Diseases and Phenotype

Biology is a complex field, and the observable manifestations of biological processes in living organisms vary depending on genetic background and environmental factors. When correlating genetic features with observable (phenotypic) ones, biologists rely heavily on such variables in the quest for disease biomarkers, which could be used to identify possible therapeutic targets. The main challenge is to ensure efficient, machine-actionable descriptions of these observable features.

| Scope | Name | File location | Top-Level Ontology | Licence | Issue Tracker URI |
|---|---|---|---|---|---|
| Pathology/Disease (generic) | SNOMED-CT | | | http://www.ihtsdo.org/licensing/ | |
| | NCI thesaurus | http://evs.nci.nih.gov/ftp1/NCI_Thesaurus | | http://evs.nci.nih.gov/ftp1/NCI_Thesaurus/ThesaurusTermsofUse.htm | |
| | ICD-10 | login required [http://apps.who.int/classifications/apps/icd/ClassificationDownloadNR/login.aspx?ReturnUrl=%2fclassifications%2fapps%2ficd%2fClassificationDownload%2fdefault.aspx] | | http://www.who.int/about/copyright/en/ | |
| | UMLS | http://www.nlm.nih.gov/databases/umls.html | | | |
| | Disease Ontology | http://purl.obolibrary.org/obo/doid.owl | BFO | CC-by 3.0 Unported Licence | http://sourceforge.net/p/diseaseontology/feature-requests/ |
| | Infectious Disease Ontology | https://code.google.com/p/infectious-disease-ontology/source/browse/trunk/src/ontology/ido-core/ido-main.owl | BFO | most probably: CC-by 3.0 Unported Licence | https://code.google.com/p/infectious-disease-ontology/issues/list |
| Phenotype | Human Phenotype Ontology | http://compbio.charite.de/hudson/job/hpo/lastStableBuild/ | BFO | most probably: CC-by 3.0 Unported Licence | http://sourceforge.net/p/obo/human-phenotype-requests/ |
| Mouse Phenotype | MPO | | | | |
| | PATO | | BFO | | http://sourceforge.net/p/obo/phenotypic-quality-pato-requests/ |
| | MedDRA | | | This ontology is freely accessible on this site for academic and other non-commercial uses. Users anticipating any commercial use of MedDRA must contact the MSSO to obtain a license. | https://mssotools.com/webcr/ (login required) |

Pathology and Disease Specific Resources

| Scope | Name | File location | Top-Level Ontology | Licence | Issue Tracker URI |
|---|---|---|---|---|---|
| Influenza | FLU | | BFO | | |
| Malaria | IDOMAL | | BFO | | |
| Dengue Fever | IDODEN | | BFO | | |
| Alzheimer Disease | ADO | https://www.scai.fraunhofer.de/content/dam/scai/de/downloads/bioinformatik/ontologies/ADO/ADO.zip | BFO | | |
| Immune disorder | | | | | |
| Rare disorder | ORDO | | | | |

Cellular entities

Continuing our review of semantic resources by granularity level, this section details a number of reference resources which provide coverage for describing cell types, cell lines and cellular phenotypes.

| Scope | Name | File location | Top-Level Ontology | Licence | Issue Tracker URI |
|---|---|---|---|---|---|
| Cell | CL | http://purl.obolibrary.org/obo/cl.owl, http://purl.obolibrary.org/obo/cl.obo | BFO | most probably: CC-by 3.0 Unported Licence | https://code.google.com/p/cell-ontology/issues/list |
| Cell Lines | Cellosaurus | ftp://ftp.expasy.org/databases/cellosaurus/cellosaurus.obo, ftp://ftp.expasy.org/databases/cellosaurus | | https://creativecommons.org/licenses/by/4.0/ | |
| Cell Lines | CLO | http://clo-ontology.googlecode.com/svn/trunk/src/ontology/clo.owl | BFO | most probably: CC-by 3.0 Unported Licence | https://code.google.com/p/clo-ontology/issues/list |
| Cell Molecular Phenotype Ontology | CMPO | https://github.com/EBISPOT/CMPO/tree/master/release | BFO | | |

Molecular Entities

This section highlights the major and most widely used OBO Foundry resources for molecules of biological relevance, as well as for molecular structures, biological processes and cellular components.

| Scope | Name | File location | Top-Level Ontology | Licence | Issue Tracker URI |
|---|---|---|---|---|---|
| Chemicals and Small Molecules | CHEBI | http://ftp.ebi.ac.uk/chebi.owl, http://ftp.ebi.ac.uk/chebi.obo | BFO | most probably: CC-by 3.0 Unported Licence | http://sourceforge.net/p/chebi/annotation-issues/ |
| Gene Function, Molecular Component, Biological Process | GO | http://purl.obolibrary.org/obo/go.obo, http://purl.obolibrary.org/obo/go.owl | BFO | CC-by 4.0 Unported License | http://sourceforge.net/p/geneontology/ontology-requests/ |
| Protein/peptide | PRO | http://ftp.pir.georgetown.edu/pro.obo | BFO | CC-by 3.0 Unported Licence | |

Besides these open ontologies, in clinically relevant work where drug formulations need to be recorded and described, the following resource is relevant.

| Scope | Name | File location | Top-Level Ontology | Licence | Issue Tracker URI |
|---|---|---|---|---|---|
| Drug | National Drug File | | | https://uts.nlm.nih.gov/license.html | |

Assays and Technologies

The resources listed in this section provide key descriptors that bridge data acquisition procedures (as used in clinical settings and wet-lab work) with instruments, units of measurement, endpoints and, sometimes, the biological processes or molecular entities of biological significance. Some of the resources are specialized semantic artefacts developed to support the standardized reporting of data modalities.

| Scope | Name | File location | Top-Level Ontology | Licence | Issue Tracker URI |
|---|---|---|---|---|---|
| Radiology | RADLex | | | | |
| Medical Imaging | DICOM | | | | |

| Scope | Name | File location | Top-Level Ontology | Licence | Issue Tracker URI |
|---|---|---|---|---|---|
| Sample Processing/Reagents/Instruments, Assay Definition | OBI | http://svn.code.sf.net/p/obi/code/releases/2014-03-29/obi.owl | BFO | CC-by 3.0 Unported Licence | http://sourceforge.net/p/obi/obi-terms/ |
| Biological screening assays and their results including high-throughput screening (HTS) | BAO | http://www.bioassayontology.org/bao/bao_complete_bfo_dev.owl | BFO | CC-by 3.0 Unported Licence | |
| Mass Spectrometry (instrument/acquisition parameter/spectrum related information) | PSI-MS | http://psidev.cvs.sourceforge.net/viewvc/psidev/psi/psi-ms/mzML/controlledVocabulary/psi-ms.obo (No OWL file) | none specified | CC-by 3.0 Unported Licence | https://lists.sourceforge.net/lists/listinfo/psidev-vocab |
| NMR Spectroscopy (instrument/acquisition parameter/spectrum related information) | NMR-CV | http://nmrml.org/cv/v1.0.rc1/nmrCV.owl | BFO | Creative Commons Public Domain Mark 1.0 | https://github.com/nmrML/nmrML/issues?state=open |
| Laboratory test | LOINC | LOINC and RELMA Complete Download File (All Formats Included) | none specified | https://uts.nlm.nih.gov/license.html | |

Note on LOINC: wait for Bron's feedback regarding CDISC lab test descriptors to handle/avoid overlap with LOINC coverage.

Finally, a resource exists that describes statistical measures, statistical tests or methods as well as statistically relevant graphical representations. It may be used for reporting results and annotating experimental results.

| Scope | Name | File location | Top-Level Ontology | Licence | Issue Tracker URI |
|---|---|---|---|---|---|
| Experimental Design, Statistical Methods and Statistical Measures | STATO | https://raw.githubusercontent.com/ISA-tools/stato/dev/src/ontology/stato.owl | BFO | CC-by 3.0 Unported Licence | https://github.com/ISA-tools/stato/issues?state=open |

Relations

Also known as OWL properties, relations may be overlooked by data scientists who are not knowledge engineers or ontologists, but they are essential components: when correctly crafted, with a proper understanding of the logical constraints available in a semantic language such as OWL, they are exploited by automatic reasoners to carry out the following key tasks:

  • ontology logical consistency checks
  • automatic classification and inference tasks
  • entailments, i.e. detection of logical consequences resulting from the axioms

This is particularly important when processing billions of facts expressed as RDF statements.
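
As a toy illustration of entailment, the sketch below derives the statements implied by a relation assumed to be transitive. It is not how an OWL reasoner is implemented: the triples, the relation name `part_of` and the function `infer_transitive` are illustrative assumptions, and the naive loop would not scale to billions of facts.

```python
def infer_transitive(facts: set, relation: str) -> set:
    """Return the facts plus every triple entailed by the transitivity of `relation`."""
    inferred = set(facts)
    changed = True
    while changed:
        changed = False
        # Snapshot the current edges for this relation before adding new ones.
        edges = [(s, o) for s, p, o in inferred if p == relation]
        for s1, o1 in edges:
            for s2, o2 in edges:
                if o1 == s2 and (s1, relation, o2) not in inferred:
                    inferred.add((s1, relation, o2))
                    changed = True
    return inferred


# Asserted statements (subject, relation, object); the entity names are illustrative.
facts = {
    ("nucleus", "part_of", "cell"),
    ("cell", "part_of", "tissue"),
    ("tissue", "part_of", "organ"),
}
closure = infer_transitive(facts, "part_of")
```

Here `closure` contains the three asserted statements plus three inferred ones, including `("nucleus", "part_of", "organ")`, which was never stated explicitly. Declaring the logical characteristics of a relation once, in the ontology, is what lets a reasoner derive such consequences consistently across a whole dataset.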

One also needs to understand the limitations in expressivity of the current semantic web languages and their associated axioms, as well as the computational constraints associated with inference. For a more in-depth review of such topics, the reader is invited to consult the following work [ref].

In the field of Biology and Biomedicine, the OBO Foundry coordinates the development of interoperable ontologies. At the core of this interoperation lies the Relation Ontology (RO):

| Scope | File | Relation Ontology | Variant | License |
|---|---|---|---|---|
| relations | ro.owl | Relation Ontology | Canonical edition | https://creativecommons.org/publicdomain/zero/1.0/ |
| relations | ro.obo | Relation Ontology in obo format | Has imports merged in | https://creativecommons.org/publicdomain/zero/1.0/ |
| relations | ro/core.owl | RO Core relations | Minimal subset intended to work with BFO-classes | https://creativecommons.org/publicdomain/zero/1.0/ |
| relations | ro/ro-base.owl | RO base ontology | Axioms defined within RO and to be used in imports for other ontologies | https://creativecommons.org/publicdomain/zero/1.0/ |
| relations | ro/subsets/ro-interaction.owl | Interaction relations | | https://creativecommons.org/publicdomain/zero/1.0/ |
| relations | ro/subsets/ro-eco.owl | Ecology subset | For use in ecology and environmental science | https://creativecommons.org/publicdomain/zero/1.0/ |
| relations | ro/subsets/ro-neuro.owl | Neuroscience subset | For use in neuroscience | https://creativecommons.org/publicdomain/zero/1.0/ |

As knowledge graphs and property graphs gain importance, we can expect the range and depth of relations to mature and expand as more expressivity is needed and as reasoner technology progresses to fully exploit their benefits. This also has to be placed in the context of advances in text mining and machine learning, where unsupervised methods are starting to demonstrate strong potential for detecting relations between entities.

rdf
B cell, CD19-positive
equivalentClass :
    lymphocyte of B lineage, CD19-positive 
    and ( has plasma membrane part some CD19 molecule) 
    and ( in taxon some Mammalia) 
    and ( capable of some B cell mediated immunity)

Conclusions:

Selecting semantic resources depends on many different factors. However, the most important factor remains the context of the data and associated landscape of data standards as well as the ultimate integration goal, which will dictate the final choice.

The selection process remains guided by the need to maximize the potential for data integration with datasets of similar nature and similar value. It also requires a good understanding of the technical and sometimes legal implications these choices will have.


References:

Smith, B., Ceusters, W., Klagges, B. et al. Relations in biomedical ontologies. Genome Biol 6, R46 (2005). https://doi.org/10.1186/gb-2005-6-5-r46

Rocca-Serra P, Bratfalean D, Richard F, Marshall C, Romacker M, Auffray C, … on behalf of the eTRIKS consortium. (2016, April 25). eTRIKS Standards Starter Pack Release 1.1 April 2016. Zenodo. http://doi.org/10.5281/zenodo.50825

Malone J, Stevens R, Jupp S, Hancocks T, Parkinson H, Brooksbank C. Ten Simple Rules for Selecting a Bio-ontology. PLoS Comput Biol. 2016;12(2):e1004743. Published 2016 Feb 11. http://doi.org/10.1371/journal.pcbi.1004743

Bairoch A. The Cellosaurus, a cell line knowledge resource. J. Biomol. Tech. (2018) 29:25-38. http://doi.org/10.7171/jbt.18-2902-002.

Sansone, S.-A., McQuilton, P., Rocca-Serra, P., Gonzalez-Beltran, A., Izzo, M., Lister, A.L. and Thurston, M. (2019) FAIRsharing as a community approach to standards, repositories and policies. Nature biotechnology, 37, 358: http://doi.org/10.1038/s41587-019-0080-8.

Hripcsak, G., Ryan, P. B., Duke, J. D., Shah, N. H., Park, R. W., Huser, V., Suchard, M. A., Schuemie, M. J., DeFalco, F. J., Perotte, A., Banda, J. M., Reich, C. G., Schilling, L. M., Matheny, M. E., Meeker, D., Pratt, N., & Madigan, D. (2016). Characterizing treatment pathways at scale using the OHDSI network. Proceedings of the National Academy of Sciences of the United States of America, 113(27), 7329–7336. https://doi.org/10.1073/pnas.1510502113

Hripcsak, George et al. “Observational Health Data Sciences and Informatics (OHDSI): Opportunities for Observational Researchers.” Studies in health technology and informatics vol. 216 (2015): 574-8.


Authors:

| Name | Affiliation | ORCID | CRediT role |
|---|---|---|---|
| Philippe Rocca-Serra | University of Oxford, Data Readiness Group | 0000-0001-9853-5668 | Writing - Original Draft |
| Susanna-Assunta Sansone | University of Oxford, Data Readiness Group | | Writing - Review & Editing, Funding acquisition |

License:

This page is released under the Creative Commons 4.0 BY license.