
ND4BB dataset FAIRification recipe

Version: 2 (this recipe is extracted from Version 1)

Ingredients

  • Raw Data: AMR Compounds Database

  • Metadata Model

  • Vocabularies and Terminologies

  • Data Format:

    • Excel spreadsheet (schema: PropertyGroup – Property – Value)
  • Tools and Software:

    • Data curation tools: Excel, Java
    • FAIRification pipeline tools: KNIME workflow
    • Ontology recommender: ZOOMA, NCBO
    • FAIR assessment: RDA indicator V0.03

Objectives

The current AMR dataset is stored on a local web page at UNICA. We make the AMR data more accessible by extracting it to a public repository in a machine-readable format, and we also aim at a general improvement of its FAIR parameters.

Step by Step Process

Dataset description

The AMR database consists of several nested static HTML pages. The information is well structured, the results are mainly quantitative numeric data, and a complete set of data is available for all compounds. Thus, it should be easily linkable to other public sources (e.g. PubChem), and a machine-readable data set should be easy to create.

To get a good understanding of the AMR dataset, the AMR metadata shall be extracted. The AMR metadata includes three types of metadata: structural, administrative, and descriptive metadata. The structural metadata describes the structure of the dataset, for example, column names and/or IDs. The administrative metadata contains the author, organization, and other provenance information. The descriptive metadata includes the procedures, usually protocols, used in generating the experimental results. The descriptive metadata is always stored in a free-text format without data structure.

Figure 1 shows a simplified schematic workflow of FAIRification, which includes the extraction, transformation, annotation, licensing, and identifier-assignment processes. Due to time constraints, we focus here on the extraction and annotation of the structural metadata; the administrative and descriptive metadata will be added in the future.


Figure 1: Schematic workflow of the general FAIRification pipeline. Some steps need to be repeated (yellow arrows).

Data extraction

Data are extracted using a KNIME workflow, which visualizes the data extraction steps, handles complex data extraction workflows, and can be easily reproduced.

Figure 2 is a snapshot of the ND4BB website, which is structured into a central part (the blue section) with data and two side columns with additional information. Here, we focus on data extraction from the central part. The central part of the home page consists of a single table with compound class names as table data configured as heading level 3 (\<h3>, shown in the red box in Figure 3) and compounds as an unordered list (\<ul>, shown in the yellow box in Figure 3).


Figure 2: Snapshot of the AMR compound database home page. The blue area lists all compound data to be extracted.


Figure 3: Snapshot of the AMR compound database home page source code. The red box shows the compound class header. The yellow box lists one compound.

We first identified all web pages that contain the project data. The home page (Figure 2) lists the compound names, the compound classes, and links to the compound subpages. This information was extracted using the XPath nodes in the workflow in Figure 4.
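
For readers without KNIME access, the minimal Python sketch below illustrates the same XPath-based extraction. The URL is a placeholder and the XPath expressions are assumptions based on the page layout described above, not part of the original workflow.

```python
# Hedged Python equivalent of the KNIME XPath extraction in Figure 4.
# The URL is a placeholder and the XPath expressions are assumptions based
# on the described layout (classes as <h3> inside table cells, compounds
# as <ul> lists).
import requests
from lxml import html

BASE_URL = "https://example.org/amr-compounds/"  # placeholder for the UNICA page

page = html.fromstring(requests.get(BASE_URL, timeout=30).content)

records = []
for heading in page.xpath("//td/h3"):  # compound class headers (red box, Figure 3)
    amr_class = heading.text_content().strip()
    # compounds listed in the first <ul> after the header (yellow box, Figure 3)
    for link in heading.xpath("./following::ul[1]/li/a"):
        records.append({
            "AMRclass": amr_class,
            "compound": link.text_content().strip(),
            "subpage": link.get("href"),  # relative link to the compound subpage
        })
```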

Discrepancies in the data structure were found during extraction. In the compound class extraction, unlike the usual compound class structure, which is listed as a table and separated by …, the classes “Oxazolidinones” and “Tetracyclines” use a different data structure. Therefore, the extracted XML document was updated before further nodes were applied to it. In the subpage link extraction, the compounds amikacin and ampicillin have multiple subpages for differently charged molecules. The green boxes in Figure 4 highlight the discrepancies we found in the original dataset.


Figure 4: Workflow to extract antimicrobial classes and compounds with their corresponding subpages

Links to all content in the subpages are also extracted. Figure 5 is an example of the subpage of one compound, which consists of a table section with the compound name, 2D and 3D images of the compound structure, two tables with links to related files and properties, and one table with links to external sources (see the green boxes in Figure 5). External links were excluded from the current data extraction.

The complete workflow to extract the data from the compound/charge webpages is provided in Supplementary Figure 1.


Figure 5: Example of one ND4BB raw data subpage

Data transform

The data were extracted following the schema PropertyGroup – Property – Value to facilitate future data annotation, where PropertyGroup is the heading of the table, Property is the type of property, and Value is the corresponding value of the property, which is not part of the annotation process. If the property is an image, the PropertyGroup is “Image” and the Property is “2D/3D image” (see the red box in Figure 6). For each property, the corresponding values in a controlled vocabulary list are collected into a spreadsheet. Missing values were also fixed in this transformation.
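
As an illustration of this transformation, the hedged sketch below flattens a per-subpage table structure into PropertyGroup – Property – Value rows. The input dictionary and example values are assumptions standing in for the extracted subpage tables, not the actual KNIME output.

```python
# Hedged sketch of the transformation into PropertyGroup - Property - Value
# rows. The input dictionary and example values are assumptions, not the
# actual KNIME output.
import csv

tables = {
    "Physicochemical properties": {"Molecular weight": "585.6"},  # example value
    "Image": {"2D/3D image": "amikacin_2d.png"},  # images use this fixed pair
}

with open("extracted_data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["PropertyGroup", "Property", "Value"])
    for group, properties in tables.items():
        for prop, value in properties.items():
            writer.writerow([group, prop, value])
```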

One limitation of this schema is that Excel does not explicitly describe the relations between the entities (e.g. PropertyGroup and Property). Therefore, predicates between concepts cannot be expressed (e.g. Property hasA PropertyGroup).


Figure 6: Example data set for +3 charged Amikacin

Extract and annotate structural metadata

To prepare for the ontology annotation, we first generated lists of the different types of attributes, including “AMRclass”, “AMR compound”, “PropertyGroup”, etc. In each spreadsheet, the values are listed as separate rows for ontology annotation.

The strings went through additional parsing to improve mapping confidence. Duplicated or missing attributes were removed. Stemming and lemmatization were applied to map each keyword to its root form, avoiding mismatches caused by spelling and form variations.
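
A minimal sketch of this normalization step is shown below, assuming Python with NLTK (and the 'wordnet' corpus downloaded); the recipe does not specify its implementation, so the code is illustrative only.

```python
# Hedged sketch of the stemming/lemmatization step, assuming NLTK with the
# 'wordnet' corpus downloaded. The recipe's own implementation may differ.
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def normalize(term: str) -> set[str]:
    """Return spelling/form variants of a term to query alongside the original."""
    words = term.replace("-", " ").lower().split()
    return {
        term,                                              # original form
        " ".join(stemmer.stem(w) for w in words),          # root forms
        " ".join(lemmatizer.lemmatize(w) for w in words),  # dictionary forms
    }

print(normalize("beta-lactamase inhibitors"))
# e.g. {'beta-lactamase inhibitors', 'beta lactamas inhibitor', 'beta lactamase inhibitor'}
```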

All the strings were sent through the ZOOMA/NCBO APIs to search for ontology annotations. The ontology annotation results are listed here (ZOOMA, NCBO). The ontology annotations were ranked and selected based on their confidence. For strings without a proper ontology mapping, the original values were kept. The ontology annotation preparation workflow is here.
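
For illustration, the sketch below queries the public ZOOMA v2 annotation service for a single term; the endpoint and response fields reflect the EBI documentation at the time of writing and should be verified before reuse. NCBO's Annotator is analogous but requires an API key.

```python
# Hedged sketch: querying the public ZOOMA v2 annotation service for one
# term. Endpoint and response fields follow the EBI documentation at the
# time of writing; verify before reuse.
import requests

ZOOMA_URL = "https://www.ebi.ac.uk/spot/zooma/v2/api/services/annotate"

def annotate(term: str) -> list[tuple[str, list[str]]]:
    """Return (confidence, ontology class IRIs) pairs for a query string."""
    response = requests.get(ZOOMA_URL, params={"propertyValue": term}, timeout=30)
    response.raise_for_status()
    # ZOOMA grades each prediction HIGH/GOOD/MEDIUM/LOW; 'semanticTags'
    # holds the IRIs of the suggested ontology classes.
    return [(p["confidence"], p["semanticTags"]) for p in response.json()]

for confidence, tags in annotate("beta-lactamase inhibitors"):
    print(confidence, tags)
```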


One difference between these two ontology mappers is that they process special characters (e.g. “-”, “_”, “#”) and spaces differently. For example, among “beta-lactamase inhibitors”, “beta lactamase inhibitors” and “betalactamase inhibitors”, NCBO found an ontology annotation only for the first item, while ZOOMA returned ontology annotation results for all three descriptions. Another example is “Aminonucleosides”, “Aminonucleoside” and “Amino nucleoside”: while the NCBO Recommender found no result, ZOOMA found at least one result for the terms “Aminonucleoside” and “Amino nucleoside”. This demonstrates the necessity of running stemming or lemmatization prior to the ontology mapping service.

Provenance metadata about the ontology annotation pipeline implementation is stored here, in the same file.

Results

Both generated files, ‘ExtractedMetadata_20190124_ZOOMA_0329.xlsx’ and ‘ExtractedMetadata_20190124_NCBOREC_0347.xlsx’, show nearly the same number of annotated terms, although the number of ontologies searched by the NCBO Recommender (313) was much higher than the number searched by the ZOOMA service (11). In only a few cases (e.g. BAL29880 and MBX2319) did the NCBO Recommender return results where ZOOMA did not find a corresponding ontology term.

The proposed workflow is insufficient to extract adequate and consistent semantic annotations for the structural metadata. In addition, the retrieved links do not reflect the version of the ontology that was used.

FAIR assessment

The FAIRness of the ND4BB was also assessed based on the RDA indicators (v0.03). A few indicators are not applicable to the ND4BB dataset because of data-type limitations, and some indicators are too ambiguous to allow an objective assessment. We therefore had different data curators evaluate the FAIRness separately, then compared and discussed the conflicting assessments. Overall, the FAIRness score against the RDA FAIR indicators is 36%, with a mandatory-indicator score of 47% and a recommended-indicator score of 32%.
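
The sketch below illustrates how such percentage scores can be aggregated from per-indicator pass/fail judgments; the indicator IDs and results are placeholders, not the actual ND4BB evaluation.

```python
# Hedged sketch: aggregating RDA indicator assessments into percentage
# scores like those reported above. Indicator IDs and pass/fail values
# are placeholders, not the actual ND4BB evaluation.
assessments = [
    # (indicator ID, priority, passed)
    ("RDA-F1-01M", "mandatory", True),
    ("RDA-A1-02M", "mandatory", False),
    ("RDA-I2-01M", "recommended", False),
    ("RDA-R1-01M", "recommended", True),
    # ... one entry per applicable indicator
]

def score(items) -> float:
    """Percentage of passed indicators among the given items."""
    return 100 * sum(passed for _, _, passed in items) / len(items)

print(f"overall: {score(assessments):.0f}%")
for priority in ("mandatory", "recommended"):
    subset = [a for a in assessments if a[1] == priority]
    print(f"{priority}: {score(subset):.0f}%")
```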

Future plan

  1. Extract administrative metadata, provenance information, e.g. owner, date of creation
  2. Add license to data set
  3. Store data (=experimental results) together with administrative, structural, and descriptive metadata in a repository
  4. Add PID to data set (=digital object)
  5. Add metadata together with PID to a public catalog
  6. Add metrics according to CMMI and add to the public catalog
  7. Add checksums for all files for QC and integrity checks
  8. Expand the ontology annotation to all terms

Summary

The AMR dataset was provided as a first example as it was immediately available. A generic FAIRification workflow was also provided. We reviewed the workflow and derived general principles for the cookbook. However, as the principles we learnt show, the lack of context for the data and of goals for the FAIRification process made the actual act of FAIRification of limited value.

As a result of our work on the AMR dataset, we identified useful general principles, including the need for a license, the availability of the data, the importance of context (e.g. which ontologies to map to), and other details included in our report.

We also identified key FAIRification steps in the proposed process, some of which are not obvious (e.g. capturing the modifications made to ingest data). On this basis we started to sketch a generic workflow.

Overall this dataset has been very useful to start our overall process and team activities.

FAIRification process summary table

| Defined FAIRification step | How it is implemented | Tools/Process | Pros | Cons | Comments/Questions |
|---|---|---|---|---|---|
| Define use case or describe scientific question | missing | | | | |
| Fill out ELSI questionnaire (https://drive.google.com/drive/u/0/folders/1iOShHkInNUuFoRYADwXKxS1_-UAIRmRK) | missing | | | | |
| Select/define target repository or schema | missing | | | | Any restrictions due to the selected repository/schema, e.g. missing fields? |
| Extract the information (data and metadata) from the original source(s) | Process: extract data. Source: website | KNIME workflow | Provides a repeatable process; handles complex data extraction flows; easy to explain | Creating a KNIME workflow requires a certain level of expertise and/or training; workflows are customized to the data source (web page), so each variation at the source requires an additional branch | Do we include the ETL processes in the FAIRification cookbook? Having a reusable ETL process might help continuous FAIRification of the resource |
| Transform the extracted data into a common schema | Process: transform the extracted data into a schema and, if needed, fill in missing values. Output: Excel | KNIME/Excel | KNIME workflows can be tailored to fix systematic missing values (e.g. 2D image) | Excel does not explicitly describe the relations between entities (e.g. PropertyGroup and Property), so predicates between concepts cannot be expressed (e.g. Property hasA PropertyGroup) | |
| Extract administrative metadata | | | | | |
| Extract structural metadata and add semantic annotations based on publicly available ontologies | Four sub-procedures: 1) enhancement; 2) annotation via vocabulary services (ZOOMA, NCBO); 3) assessing the relevance of suggested vocabularies; 4) merging | | | | PRS: PIDs are missing for the searches over ZOOMA and BioPortal |
| Substep 1: enhancement | Process: generate alternative syntax of the data to increase annotation performance | KNIME | Variations in the spelling of concepts (e.g. adding special characters) can be generated with workflows | | Heuristic approach (these methods can be provided as tips in the cookbook) |
| Substep 2a: annotation via vocabulary services: ZOOMA | Process: search for related ontology terms. Output: Excel | ZOOMA recommender service | Easy to use | Performance relies on the recommender service | |
| Substep 2b: annotation via vocabulary services: NCBO | Process: search for related ontology terms. Output: Excel | NCBO API | Easy to use; NCBO has a large collection of ontologies; the KNIME workflow restricts the search to a curated list of ontologies (Search Ontologies.txt) | Performance relies on the recommender service; creating a curated list of ontologies requires domain expertise and knowledge of existing ontologies | Success of the annotation depends on the vocabulary services. The performance of recommender services can be tested before the annotation process; if needed, performance-improvement methods can be defined (e.g. restricted search) |
| Substep 3: assessing the relevance of suggested vocabularies | Process: assess the relevance of the recommended ontology terms | Java functions compareTo() and contains() | Readily available; compareTo() returns exact matches; contains() returns positive when the query term is contained in the recommended term (might require expert review) | Recommender systems might suggest a term with a different spelling; such terms will not be considered a match by the functions and will need expert review | Automated rating of the recommended terminologies is based on syntactic matches; expert review is required |
| Substep 4: merging annotations | Process: select the results from multiple annotation flows | Manual | Data can be annotated with multiple ontologies | If one matching ontology term has to be selected for each entity, no guidance exists | Comment by Andrea S: distinguish between suggestive choices vs objective ones. (Yes, but that will not e.g. make a protein a gene, or a liver a hepatocyte.) So we need mappings, and the subjective part translates into context and using lenses |
| Extract descriptive metadata and add semantic annotations based on publicly available ontologies | | | | | |
| Add license to data set | | | | | Is this defined in the recipe? |
| Store data (=experimental results) together with administrative, structural, and descriptive metadata in a repository | | | | | |
| Add PID to data set (=digital object) | | | | | Is this defined in the recipe? PIDs are missing for the searches over ZOOMA and BioPortal |
| Add metadata together with PID to a public catalog | | | | | |
| Add metrics according to CMMI and add to the public catalog | | | | | |
If, in your opinion, any FAIRification step is missing, please add it here:
| Suggested FAIRification step | Why it is needed | Place in the workflow | Suggested tools/processes |
|---|---|---|---|
| Assessment of the current FAIR level | Subjective: is having data in HTML tables 1 or 2 stars? Out of the recipe, but present in the document. Is this worthwhile as a generic first step? What would we use this assessment for? | At the beginning | |
| Data fixing | | At the beginning | Manual? Issue detection as a by-product of workflow definition? |
| Schema/data model definition | The target data model can be defined explicitly and annotated with a set of vocabularies | Before the data extraction | |
| Add a manifest describing the content of the archive | Eases diving into the dataset for humans, who can obtain basic information about each of the elements or bundles in a dataset | | NA |
| Add checksums for all files for QC and integrity checks | Allows integrity checks and thus enables reuse | | Various options, e.g. Python hashlib (see the sketch below this table) |
| Executing the workflow fails owing to the use of localized file references in KNIME workflows (which seems unavoidable) | Avoid hard links to local storage | | KNIME workflows |
| Absence of ontology term identifiers for all the terms retrieved from BioPortal/ZOOMA (only the strings are apparently retrieved) | CURIEs should be included | | KNIME workflow output should always contain the class label + class identifier |
| Selected ontologies ought to be identified more completely | The abbreviation (e.g. CHEBI) is reported but … | | |
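
As suggested in the checksum row above, here is a minimal sketch using Python's standard hashlib; file names and paths are examples.

```python
# Minimal sketch of the suggested checksum step, using Python's standard
# hashlib as the table row proposes. File names and paths are examples.
import hashlib
from pathlib import Path

def sha256sum(path: Path) -> str:
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):  # stream large files
            digest.update(chunk)
    return digest.hexdigest()

# Write one "<hash>  <filename>" line per file, compatible with `sha256sum -c`
with open("CHECKSUMS.sha256", "w") as out:
    for path in sorted(Path(".").glob("*.xlsx")):
        out.write(f"{sha256sum(path)}  {path.name}\n")
```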

Supplementary information


Supplementary Figure 1: Workflow to extract the data from a compound/charge webpage


Authors:


License:

This page is released under the Creative Commons 4.0 BY license.