Creating a community annotation compliant ISA descriptor: Dataset Maturity Level 2



Recipe Overview
Reading Time
15 minutes
Executable Code
Yes
Difficulty
Enhanced Data Maturity with ISA - ISA-JSON and ontology markup
FAIRPlus logo
Recipe Type
Hands-on
Maturity Level & Indicator
DSM-1-C1
hover me Tooltip text

Abstract:

With the recipe, we will continue using the ISA model to capture study metadata and dataset description, but we will be moving up on the scale of data set maturity by showing how to implement four important aspects:

  • community agreed experimental design descriptors

  • community agreed annotation levels for description of biological materials

  • extensive semantic annotation and reduction of free text annotation

  • machine-readable and more interoperable format: serialization to ISA-JSON

Let’s get started! Loading the ISA-API

from isatools.model import (
    Comment,
    Investigation,
    Study,
    StudyFactor,
    FactorValue,
    OntologyAnnotation,
    Characteristic,
    OntologySource,
    Material,
    Sample,
    Source,
    Protocol,
    ProtocolParameter,
    ParameterValue,
    Process,
    Publication,
    Person,
    Assay,
    DataFile,
    plink
)

import os

Creating an ISA investigation

investigation = Investigation()

Declaring Semantic Resources used to annotated ISA objects

ncbitaxon = OntologySource(name="NCIBTaxon",
                           description="NCBI Taxonomy")
efo   = OntologySource(name="EFO",
                       description="Experimental Factor Ontology")
obi   = OntologySource(name='OBI',
                       description="Ontology for Biomedical Investigations")
chebi = OntologySource(name="CHEBI",
                       description="Chemical Entity of Biological Interest")
pato  = OntologySource(name='PATO',
                       description="Phenotype and Trait Ontology")

investigation.ontology_source_references = [chebi,efo,obi,pato,ncbitaxon]

Building and Annotation ISA objects to a Community Compliant level

The annotation requirements we are dealing with are:

study = Study(filename="s_BII-S-3-synthesis.txt")
study.identifier = "BII-S-3-synth"
study.title = "Metagenomes and Metatranscriptomes of phytoplankton blooms from an ocean acidification mesocosm experiment"
study.description = "Sequencing the metatranscriptome can provide information about the response of organisms to \
varying environmental conditions. We present a methodology for obtaining random whole-community mRNA from a complex \
microbial assemblage using Pyrosequencing. The metatranscriptome had, with minimum contamination by ribosomal RNA,\
significant coverage of abundant transcripts, and included significantly more potentially novel proteins than in the\
metagenome. This experiment is part of a much larger experiment. We have produced 4 454 metatranscriptomic datasets \
and 6 454 metagenomic datasets. These were derived from 4 samples."
study.submission_date = "2008-08-15"
study.public_release_date = "2008-08-15"

These NCBI SRA related ISA Comments fields are required and must be present for the ISA SRAconverter to be invoked later

src_comment_sra1 = Comment(name="SRA Broker Name", value="OXFORD")
src_comment_sra2 = Comment(name="SRA Center Name", value="OXFORD")
src_comment_sra3 = Comment(name="SRA Center Project Name", value="OXFORD")
src_comment_sra4 = Comment(name="SRA Lab Name", value="Oxford e-Research Centre")
src_comment_sra5 = Comment(name="SRA Submission Action", value="ADD")
study.comments.append(src_comment_sra1)
study.comments.append(src_comment_sra2)
study.comments.append(src_comment_sra3)
study.comments.append(src_comment_sra4)
study.comments.append(src_comment_sra5)

These ISA Comments are optional and may be used to report funding information

src_comment_st1 = Comment(name="Study Funding Agency", value="")
src_comment_st2 = Comment(name="Study Grant Number", value="")
study.comments.append(src_comment_st1)
study.comments.append(src_comment_st2)

Declaring all the protocols used in the ISA study.

tip

Note also the declaration of ISA Protocol Parameters when needed.

study.protocols = [ 
    Protocol(name="environmental material collection - standard procedure 1",
             description="Waters samples were prefiltered through a 1.6 um GF/A glass fibre filter to reduce Eukaryotic \
             contamination. Filtrate was then collected on a 0.2 um Sterivex (millipore) filter which was frozen in \
             liquid nitrogen until nucelic acid extraction. CO2 bubbled through 11000 L mesocosm to simulate ocean \
             acidification predicted conditions. Then phosphate and nitrate were added to induce a phytoplankton bloom.",
             protocol_type=OntologyAnnotation(term="sample collection"),
             parameters=[
              ProtocolParameter(parameter_name=OntologyAnnotation(term="filter pore size"))
             ]
            ),
    Protocol(name="aliquoting-0",
             description="aliquoting",
             protocol_type=OntologyAnnotation(term="sample collection")),
    Protocol(name="nucleic acid extraction",
             description="Total nucleic acid extraction was done as quickly as possible using the method of \
             Neufeld et al, 2007.",
             protocol_type=OntologyAnnotation(term="nucleic acid extraction")
    ),
    Protocol(name="mRNA extraction - standard procedure 3",
             description="RNA MinElute + substrative Hybridization + MEGAclear For transcriptomics, total RNA\
             was separated from the columns using the RNA MinElute clean-up kit (Qiagen) and checked for integrity of\
             rRNA using an Agilent bioanalyser (RNA nano6000 chip). High integrity rRNA is essential for subtractive\
             hybridization. Samples were treated with Turbo DNA-free enzyme (Ambion) to remove contaminating DNA.\
             The rRNA was removed from mRNA by subtractive hybridization (Microbe Express Kit, Ambion), and absence\
             of rRNA and DNA contamination was confirmed using the Agilent bioanalyser. The mRNA was further\
             purified with the MEGAclearTM kit (Ambion). Reverse transcription of mRNA was performed using\
             the SuperScript III enzyme (Invitrogen) with random hexamer primers (Promega).\
             The cDNA was treated with RiboShredderTM RNase Blend (Epicentre) to remove trace RNA contaminants. \
             To improve the yield of cDNA, samples were subjected to random amplification using the \
             GenomiPhi V2 method (GE Healthcare). GenomiPhi technology produces branched DNA molecules \
             that are recalcitrant to the pyrosequencing methodology. Therefore amplified samples were treated with\
             S1 nuclease using the method of Zhang et al.2006.",
             protocol_type=OntologyAnnotation(term="labeling") #nucleic acid extraction
    ),
    Protocol(name="genomic DNA extraction - standard procedure 4",
             description="superscript+random hexamer primer",
             protocol_type=OntologyAnnotation(term="nucleic acid extraction")
    ),
    Protocol(name="reverse transcription - standard procedure 5",
             description="",
             protocol_type=OntologyAnnotation(term="reverse transcription"),
    ),
     Protocol(name="library construction",
              description="",
              protocol_type=OntologyAnnotation(term="library construction"),
              parameters=[ProtocolParameter(parameter_name=OntologyAnnotation(term="library strategy")),
                          ProtocolParameter(parameter_name=OntologyAnnotation(term="library layout")),
                          ProtocolParameter(parameter_name=OntologyAnnotation(term="library selection"))
            ]
    ),   

    Protocol(name="nucleic acid sequencing", #pyrosequencing - standard procedure 6",
             description="1. Sample Input and Fragmentation: The Genome Sequencer FLX System supports the sequencing of \
             samples from a wide variety of starting materials including genomic DNA, PCR products, BACs, and cDNA. \
             Samples such as genomic DNA and BACs are fractionated into small, 300- to 800-base pair fragments. \
             For smaller samples, such as small non-coding RNA or PCR amplicons, fragmentation is not required. \
             Instead, short PCR products amplified using Genome Sequencer fusion primers can be used for immobilization\
             onto DNA capture beads as shown below.",
             protocol_type=OntologyAnnotation(term="nucleic acid sequencing"),
             parameters=[ProtocolParameter(parameter_name=OntologyAnnotation(term="sequencing instrument"))
            ]
    ),
    Protocol(name="sequence analysis - standard procedure 7",
             description="",
             protocol_type=OntologyAnnotation(term="data transformation")
    )
]

Adding a Study Design descriptor to the ISA Study object

intervention_design = OntologyAnnotation(term_source=obi)
intervention_design.term = "intervention design"
intervention_design.term_accession = "http://purl.obolibrary.org/obo/OBI_0000115"

study.design_descriptors.append(intervention_design)

Declaring the ISA Study Factors, i.e. The Independent Variables of the Experiment

study.factors = [
    StudyFactor(name="compound",
                factor_type=OntologyAnnotation(term="chemical substance",
                                               term_accession="http://purl.obolibrary.org/obo/CHEBI_59999",
                                               term_source=chebi)),
    StudyFactor(name="dose",
                factor_type=OntologyAnnotation(term="dose",
                                               term_accession="http://www.ebi.ac.uk/efo/EFO_0000428",
                                               term_source=efo)),
    StudyFactor(name="collection time",
                factor_type=OntologyAnnotation(term="time",
                                               term_accession="http://purl.obolibrary.org/obo/PATO_0000165",
                                               term_source=pato))
]

Declaring the ISA FactorValues i.e. The factor levels to each of the Indepedent Variables or ISA Study Factor.

fv1 = FactorValue(factor_name=study.factors[0], value=OntologyAnnotation(term="carbon dioxide"))
fv2 = FactorValue(factor_name=study.factors[1], value=OntologyAnnotation(term="high"))
fv3 = FactorValue(factor_name=study.factors[1], value=OntologyAnnotation(term="normal"))
fv4 = FactorValue(factor_name=study.factors[2], value="may 13th, 2006")
fv5 = FactorValue(factor_name=study.factors[2], value="may 19th, 2006")

Adding the publications associated to the study

study.publications = [
    Publication(doi="10.1371/journal.pone.0003042",
                pubmed_id="18725995",
                title="Detection of large numbers of novel sequences in the metatranscriptomes \
                      of complex marine microbial communities.",
                status=OntologyAnnotation(term="indexed in PubMed"),
                author_list="Gilbert JA, Field D, Huang Y, Edwards R, Li W, Gilna P, Joint I."),
    Publication(doi="10.1111/j.1462-2920.2008.01745.x",
                title="Potential for phosphonoacetate utilization by marine bacteria in temperate coastal waters",
                pubmed_id="18783384",
                status=OntologyAnnotation(term="indexed in PubMed"),
                author_list="Gilbert JA, Thomas S, Cooley NA, Kulakova A, Field D, Booth T, McGrath JW, Quinn JP, Joint I.")
]

Adding the authors of the study

study.contacts = [
    Person(first_name="Jack",
           last_name="Gilbert",
           affiliation="Plymouth Marine Laboratory",
           email="jagi@pml.ac.uk",
           address="Prospect Place, Plymouth, United Kingdom",
           comments=[Comment(name="Study Person REF",
                             value="")],
           roles=[OntologyAnnotation(term="principal investigator role"),
                   OntologyAnnotation(term="SRA Inform On Status"),
                   OntologyAnnotation(term="SRA Inform On Error")]
    ),
    Person(first_name="Dawn",
           last_name="Field",
           affiliation="NERC Centre for Ecology and Hydrology",
           address="CEH Oxford, Oxford, United Kingdom",
           comments=[Comment(name="Study Person REF",
                             value="")],
           roles=[OntologyAnnotation(term="principal investigator role")]
    )
]

Now, creating ISA Source and Sample objects and building what is the BII-S-3 Study table

study.sources = [Source(name="GSM255770"), Source(name="GSM255771"),Source(name="GSM255772"),Source(name="GSM255773")]
study.samples = [Sample(name="GSM255770"), Sample(name="GSM255771"), Sample(name="GSM255772"), Sample(name="GSM255773")]

# Note how the treatment groups are defined as sets of factor values attached to the ISA.Sample object
study.samples[0].factor_values=[fv1,fv2,fv4]
study.samples[1].factor_values=[fv1,fv3,fv4]
study.samples[2].factor_values=[fv1,fv2,fv5]
study.samples[3].factor_values=[fv1,fv3,fv5]

characteristic_organism = Characteristic(category=OntologyAnnotation(term="Organism"),
                                         value=OntologyAnnotation(term="marine metagenome",
                                                                  term_source=ncbitaxon,
                                                                  term_accession="http://purl.obolibrary.org/obo/NCBITaxon_408172"))

Now creating a Process showing a Protocol Application using Source as input and producing Sample as output.

for i in range(len(study.sources)):
    study.sources[i].characteristics.append(characteristic_organism)
    study.process_sequence.append(Process(executes_protocol=study.protocols[0],
                                         inputs=[study.sources[i]],
                                         outputs=[study.samples[i]])
                                 )

Now appending the ISA Study object to the ISA Investigation object

investigation.studies = [study]

Now, creating the ISA objects to represent assays and acquired raw data Starting by declaring the 2 types of assays used in BII-S-3:

assay = Assay(filename="a_gilbert-assay-Gx.txt")

assay.measurement_type = OntologyAnnotation(term="metagenome sequencing",
                                            term_accession="http://purl.obolibrary.org/obo/OBI_0002623",
                                            term_source=obi)

assay.technology_type = OntologyAnnotation(term="nucleotide sequencing",
                                           term_accession="http://purl.obolibrary.org/obo/OBI_0000626",
                                           term_source=obi)


assay_tx = Assay(filename="a_gilbert-assay-Tx.txt")

assay_tx.measurement_type = OntologyAnnotation(term="transcription profiling",
                                               term_accession="http://purl.obolibrary.org/obo/OBI_0000424",
                                               term_source=obi)
assay_tx.technology_type = OntologyAnnotation(term="nucleotide sequencing",
                                              term_accession="http://purl.obolibrary.org/obo/OBI_0000626",
                                              term_source=obi)

Warning

make sure the ISA plink function connects the first and last protocols found in a chain of protocols.

for i, sample in enumerate(study.samples):
    
#     # create an aliquoting process which creates new samples from existing samples created in a study.processSequence
#     aliquoting_process = Process(executes_protocol=study.protocols[1])
#     aliquot = Sample(name="aliquot-{}".format(i), derives_from=sample)
#     aliquoting_process.inputs.append(sample)
#     aliquoting_process.outputs.append(aliquot)

    # create an extraction process that executes the extraction protocol

    extraction_process = Process(executes_protocol=study.protocols[0])

    # extraction process takes as input a sample, and produces an extract material as output
    
    char_ext = Characteristic(category=OntologyAnnotation(term="Material Type"),
                              value=OntologyAnnotation(term="pellet"))
    
    char_ext1 = Characteristic(category=OntologyAnnotation(term="quantity"),
                               value=OntologyAnnotation(term="loads"))

    # extraction_process.inputs.append(aliquot)
    extraction_process.inputs.append(sample)
    material = Material(name="extract-{}".format(i))
    material.type = "Extract Name"
    material.characteristics.append(char_ext)
    material.characteristics.append(char_ext1)
    extraction_process.outputs.append(material)

    # create a sequencing process that executes the sequencing protocol

    sequencing_process = Process(executes_protocol=study.protocols[7])
    sequencing_process.name = "assay-name-{}".format(i)
    sequencing_process.inputs.append(extraction_process.outputs[0])
    # sequencing_process.inputs.append(material)

    # Sequencing process usually has an output data file

    datafile = DataFile(filename="sequenced-data-{}".format(i), label="Raw Data File")
    data_comment = Comment(name="data_comment",value="data_value")
    datafile.comments.append(data_comment)
    sequencing_process.outputs.append(datafile)

    # Ensure Processes are linked forward and backward. plink(from_process, to_process) is a function to set
    # these links for you. It is found in the isatools.model package

    # plink(aliquoting_process, sequencing_process)
    plink(extraction_process, sequencing_process)
    # make sure the extract, data file, and the processes are attached to the assay


    assay.samples.append(sample)
    # assay.samples.append(aliquot)
    assay.other_material.append(material)
    assay.data_files.append(datafile)
    
    assay.process_sequence.append(extraction_process)
    assay.process_sequence.append(sequencing_process)

    # create an extraction process that executes the extraction protocol

    extraction_process_tx = Process(executes_protocol=study.protocols[2])

    # extraction process takes as input a sample, and produces an extract material as output

    extraction_process_tx.inputs.append(sample)
    # extraction_process_tx.outputs.append(sample)
    # extraction_process_tx.outputs=[]
    # material_tx = Material(name="extract-{}".format(i))
    # material_tx.type = "Extract Name"
    # extraction_process_tx.outputs.append(material_tx)

    # labeling process takes as input an extract, and produces a labeled extract material as output
    labeling_process_tx = Process(executes_protocol=study.protocols[3])
    # labeling_process_tx.inputs.append(sample)
    # labeling_process_tx.inputs=[]
    material_tx = Material(name="le-{}".format(i))
    material_tx.type = "Labeled Extract Name"
    labeling_process_tx.outputs.append(material_tx)
    
    
    # create a sequencing process that executes the sequencing protocol
    # seq_pv1 = ParameterValue(category="library layout", value="SINGLE")
    seq_pv2 = ParameterValue(category=ProtocolParameter(parameter_name=OntologyAnnotation(term="library layout")),
                             value=OntologyAnnotation(term="SINGLE"))
    sequencing_process_tx = Process(executes_protocol=study.protocols[6], parameter_values=[seq_pv2])
    sequencing_process_tx.name = "assay-name-tx-{}".format(i)
    sequencing_process_tx.inputs.append(labeling_process_tx.outputs[0])

    # Sequencing process usually has an output data file
    
    datafile_tx = DataFile(filename="sequenced-data-tx-{}".format(i), label="Raw Data File")
    sequencing_process_tx.outputs.append(datafile_tx)

    # Ensure Processes are linked forward and backward. plink(from_process, to_process) is a function to set
    # these links for you. It is found in the isatools.model package

   
    # plink(extraction_process_tx, labeling_process_tx)
    # plink(labeling_process_tx, sequencing_process_tx)
    plink(extraction_process_tx, sequencing_process_tx)
    # make sure the extract, data file, and the processes are attached to the assay


    assay_tx.samples.append(sample)
    assay_tx.other_material.append(material_tx)
    assay_tx.data_files.append(datafile_tx)
    
    assay_tx.process_sequence.append(extraction_process_tx)
    assay_tx.process_sequence.append(labeling_process_tx) 
    assay_tx.process_sequence.append(sequencing_process_tx)    

Adding both assays to the ISA Study object

study.assays.append(assay)
study.assays.append(assay_tx)
# assay_g = investigation.studies[0].assays[1]
# assay_t = investigation.studies[0].assays[0]

# assay_t.samples=investigation.studies[0].samples


# extract1 = Material(name="GSM255770.e1")
# extract1.type = "Extract Name"
# extract2 = Material(name="GSM255771.e1")
# extract2.type = "Extract Name"
# extract3 = Material(name="GSM255772.e1")
# extract3.type = "Extract Name"
# extract4 = Material(name="GSM255773.e1")
# extract4.type = "Extract Name"
# extract5 = Material(name="GSM255774.e1")
# extract5.type = "Extract Name"

# assay_t.other_material.append(extract1)
# assay_t.other_material.append(extract2)
# assay_t.other_material.append(extract3)
# assay_t.other_material.append(extract4)
# assay_t.other_material.append(extract5)


# for i in range(len(study.samples)):

#     assay_t.process_sequence.append(Process(
#             executes_protocol=study.protocols[1],
#                 inputs=study.samples[i],
#                 outputs=assay_t.other_material[i]
#         ))
    
#     data=DataFile(filename="sequenced-data-{}".format(i), label="Raw Data File")  
    
#     assay_t.process_sequence.append(Process(
#         executes_protocol=study.protocols[3],
#                 inputs=assay_t.other_material[i]
#     ))
    
#     plink(assay_t.process_sequence[0], assay_t.process_sequence[1])
#     assay_t.process_sequence[-1].outputs.append(data)
#     assay_t.data_files.append(data)

Writing the ISA object representation to file with the ISA-API dump function

from isatools.isatab import dump

# note the use of the flag for explicit serialization on factor values on assay tables
dump(investigation, "./output/BII-S-3-synth/", write_factor_values_in_assay_table=True)

Reading the ISA document from disk back in, loading it into memory and writing to disk again to check that the ISA-API load function works nominally

from isatools.isatab import load
with open(os.path.join("./output/BII-S-3-synth/","i_investigation.txt")) as bii3stest:
    roundtrip = load(bii3stest)
from isatools.convert import isatab2json
from isatools import isajson
import json
isa_json = isatab2json.convert('./output/BII-S-3-synth/', validate_first=False, use_new_parser=True)

isa_j = json.dumps(
            isa_json, cls=isajson.ISAJSONEncoder, sort_keys=True, indent=4, separators=(',', ': ')
        )

et Voilà!

Authors