
Creating a FAIR Frictionless Data Package of Rose Scent profiles from a publication's supplementary data.

*Philippe Rocca-Serra (philippe.rocca-serra[at]oerc.ox.ac.uk), University of Oxford e-Research Centre*

Background:

Experimental results such as the metabolite profiling data published in [1,2] can be straightforwardly reported using OKFN Data Packages. Such components can be easily parsed as data frames and exploited for data visualization using libraries implementing graphical-grammar concepts. Here, we show how to use a set of Python libraries to create a Tabular Data Package from an Excel file, annotate it with ontologies (ChEBI, PO, NCBITaxon) and validate the result against the JSON definition of the data table. A few lines of code suffice to structure information around the key study design descriptors: the independent variables and their levels are clearly and unambiguously declared in the Tabular Data Package itself.

  1. Let's begin by importing the Python packages that allow easy access to and use of data formatted as a JSON Data Package
import os
import libchebipy
import re
import pandas as pd
from datapackage import Package
from goodtables import validate
  1. Reading the data:

We now simply read in the Excel file corresponding to the Nature Genetics Supplementary Table from the Zenodo archive. Should the remote read fail in your environment, download the file locally and use the commented-out line instead.

(DOI: https://doi.org/10.5281/zenodo.2598799)

#df = pd.read_excel('Supplementary Data 3.xlsx', sheet_name='Feuil1')
df = pd.read_excel('https://zenodo.org/api/files/91a610cb-8f1f-4ec5-9818-767a75a7a820/Supplementary%20Data%203.xlsx', sheet_name='Feuil1')
df.head(25)
  1. Following a manual inspection of the Excel source to locate the start row of the data, we use the Pandas take() function to first extract the row of headers (hence axis set to 0)
header_treatment = df.take([13], axis=0)
  1. We then extract all the columns of interest (same take() function, with axis set to 1)
data_full = df.take([3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19], axis=1)
# We now trim by removing the first 16 rows (0-15), which contain no information
data_slice = data_full.take([16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38,
                             39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61,
                             62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76], axis=0)
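To make the two take() calls above easier to follow, here is a minimal sketch on a toy DataFrame (the column names and values are invented for illustration): axis=0 selects rows by position, axis=1 selects columns by position.

```python
import pandas as pd

# Toy frame standing in for the raw spreadsheet (values are invented).
toy = pd.DataFrame({'A': [10, 20, 30], 'B': [40, 50, 60], 'C': [70, 80, 90]})

rows = toy.take([1], axis=0)       # axis=0: the row at position 1
cols = toy.take([0, 2], axis=1)    # axis=1: the columns at positions 0 and 2

print(list(cols.columns))  # ['A', 'C']
print(rows['B'].tolist())  # [50]
```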
  1. We now rename the automatically generated DataFrame column headers to something more meaningful
data_slice.rename(columns={"Unnamed: 3": "chemical_name",
                           "Unnamed: 4": "sample_mean_1",
                           "Unnamed: 5": "sem_1",
                           "Unnamed: 6": "sample_mean_2",
                           "Unnamed: 7": "sem_2",
                           "Unnamed: 8": "sample_mean_3",
                           "Unnamed: 9": "sem_3",
                           "Unnamed: 10": "sample_mean_4",
                           "Unnamed: 11": "sem_4",
                           "Unnamed: 12": "sample_mean_5",
                           "Unnamed: 13": "sem_5",
                           "Unnamed: 14": "sample_mean_6",
                           "Unnamed: 15": "sem_6",
                           "Unnamed: 16": "sample_mean_7",
                           "Unnamed: 17": "sem_7",
                           "Unnamed: 18": "sample_mean_8",
                           "Unnamed: 19": "sem_8"}, inplace=True)
  1. Inserting two new fields as placeholders for chemical information descriptors, and reinitializing the DataFrame index so that row numbering starts at 0, not 16
data_slice.insert(loc=1, column='inchi', value='')
data_slice.insert(loc=2, column='chebi_identifier', value='')
data_slice = data_slice.reset_index(drop=True)
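The same two operations can be checked on a toy frame (names and the non-zero starting index are invented): insert() places a column at an explicit position, and reset_index(drop=True) renumbers the rows from 0.

```python
import pandas as pd

# Toy frame whose index deliberately does not start at 0.
toy = pd.DataFrame({'chemical_name': ['geraniol', 'nerol']}, index=[16, 17])

toy.insert(loc=1, column='inchi', value='')  # new empty column at position 1
toy = toy.reset_index(drop=True)             # renumber rows from 0

print(list(toy.columns))  # ['chemical_name', 'inchi']
print(list(toy.index))    # [0, 1]
```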
  1. Using libChEBI to retrieve ChEBI identifiers and InChIs from a chemical name. Note: in this call, we retrieve values only when an exact match on the chemical name is found in ChEBI; the libChEBI API does not allow easy searching on synonyms, so we fail to retrieve all relevant information. This is merely to showcase how to use libChEBI.
for i in range(0, 60):
    hit = libchebipy.search(data_slice.loc[i, 'chemical_name'], True)
    if len(hit) > 0:
        print("HIT: ", data_slice.loc[i, 'chemical_name'], ":", hit[0].get_inchi(), "|", hit[0].get_id())
        data_slice.loc[i, 'inchi'] = hit[0].get_inchi()
        data_slice.loc[i, 'chebi_identifier'] = hit[0].get_id()
    else:
        print("Nothing found: ", data_slice.loc[i, 'chemical_name'])
        data_slice.loc[i, 'inchi'] = ''
        data_slice.loc[i, 'chebi_identifier'] = ''
  1. The following steps perform the table transformation from a 'wide' layout to a 'long' one. The 'long' layout is the one relied on by Frictionless Tabular Data Packages and consumed by the R ggplot2 and Python plotnine libraries. Step 1: prepare the stubnames by picking out all the repeated measurement variables and stripping their numeric suffixes, i.e. obtain the distinct 'dimensions' measured for each condition.
feature_models = [col for col in data_slice.columns if re.match("(sample_mean|sem)_[0-9]", col) is not None]
features = list(set([re.sub("_[0-9]", "", feature_model) for feature_model in feature_models]))
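The two list comprehensions can be verified on a hand-written column list; this sketch uses invented column names that follow the same sample_mean_N/sem_N pattern as the renamed headers:

```python
import re

# Invented column names mimicking the renamed spreadsheet headers.
columns = ['chemical_name', 'inchi',
           'sample_mean_1', 'sem_1', 'sample_mean_2', 'sem_2']

# Keep only the repeated measurement columns...
feature_models = [col for col in columns
                  if re.match(r"(sample_mean|sem)_[0-9]", col) is not None]
# ...then strip the numeric suffix to obtain the stubnames.
features = sorted(set(re.sub(r"_[0-9]", "", fm) for fm in feature_models))

print(feature_models)  # ['sample_mean_1', 'sem_1', 'sample_mean_2', 'sem_2']
print(features)        # ['sample_mean', 'sem']
```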
  1. Step 2: invoke the Pandas pd.wide_to_long() function to carry out the table transformation. See the Pandas documentation for more information: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.wide_to_long.html and the excellent blog: https://medium.com/@wangyuw/data-reshaping-with-pandas-explained-80b2f51f88d2
long_df = pd.wide_to_long(data_slice, i=['chemical_name'], j='treatment', stubnames=features, sep="_")
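The same call can be illustrated on a tiny self-contained example (chemical names and numbers are invented): two compounds measured under two conditions go from one row per compound to one row per compound-and-condition pair.

```python
import pandas as pd

# Toy 'wide' table: one row per compound, one column pair per condition.
wide = pd.DataFrame({
    'chemical_name': ['geraniol', 'nerol'],
    'sample_mean_1': [1.0, 2.0], 'sem_1': [0.1, 0.2],
    'sample_mean_2': [3.0, 4.0], 'sem_2': [0.3, 0.4],
})

# 'long' table: one row per (compound, condition); the suffix becomes
# the new 'treatment' variable.
long = pd.wide_to_long(wide, stubnames=['sample_mean', 'sem'],
                       i=['chemical_name'], j='treatment', sep='_')
print(long.shape)  # (4, 2)
```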
  1. pd.wide_to_long() returns a DataFrame indexed by the i and j variables ('chemical_name' and 'treatment'), which causes a mismatch in field positions downstream. We work around this by writing the DataFrame to a temporary file and reading it back in, which flattens the index; not ideal, but it does the trick.
long_df.to_csv("long.txt", sep='\t', encoding='utf-8')
long_df_from_file = pd.read_csv("long.txt", sep="\t")
long_df_from_file.head()

try:
    os.remove("long.txt")
except IOError as e:
    print(e)
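Assuming the issue is the MultiIndex that wide_to_long() places on its result, an alternative that avoids the file round-trip is reset_index(), which turns the index levels back into ordinary columns. A toy sketch with invented values:

```python
import pandas as pd

toy_wide = pd.DataFrame({'chemical_name': ['geraniol'],
                         'sample_mean_1': [1.0], 'sample_mean_2': [2.0]})
toy_long = pd.wide_to_long(toy_wide, stubnames=['sample_mean'],
                           i=['chemical_name'], j='treatment', sep='_')

# reset_index() converts the (chemical_name, treatment) MultiIndex back
# into plain columns, matching what the file round-trip produces.
flat = toy_long.reset_index()
print(list(flat.columns))  # ['chemical_name', 'treatment', 'sample_mean']
```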
  1. Insert a new field 'unit' into the DataFrame at position 3, setting its value to empty.
long_df_from_file.insert(loc=3, column='unit', value='')
  1. Adding new fields for each of the independent variables and their associated URIs, copying values from the 'treatment' field
long_df_from_file['var1_levels'] = long_df_from_file['treatment']
long_df_from_file['var1_uri'] = long_df_from_file['treatment']
long_df_from_file['var2_levels'] = long_df_from_file['treatment']
long_df_from_file['var2_uri'] = long_df_from_file['treatment']
# adding a new field for 'sample size' and setting the value to n=3
long_df_from_file['sample_size'] = 3
  1. Marking up all factor values with ontology terms and their resolvable URIs. This requires a manual mapping; better ways could be devised.
long_df_from_file.loc[long_df_from_file['treatment'] == 1, 'treatment'] = 'R. chinensis \'Old Blush\' sepals'
long_df_from_file.loc[long_df_from_file['var1_levels'] == 1, 'var1_levels'] = 'R. chinensis \'Old Blush\''
long_df_from_file.loc[long_df_from_file['var1_uri'] == 1, 'var1_uri'] = 'http://purl.obolibrary.org/obo/NCBITaxon_74649'
long_df_from_file.loc[long_df_from_file['var2_levels'] == 1, 'var2_levels'] = 'sepals'
long_df_from_file.loc[long_df_from_file['var2_uri'] == 1, 'var2_uri'] = 'http://purl.obolibrary.org/obo/PO_0009031'

long_df_from_file.loc[long_df_from_file['treatment'] == 2, 'treatment'] = 'R. chinensis \'Old Blush\' stamens'
long_df_from_file.loc[long_df_from_file['var1_levels'] == 2, 'var1_levels'] = 'R. chinensis \'Old Blush\''
long_df_from_file.loc[long_df_from_file['var1_uri'] == 2, 'var1_uri'] = 'http://purl.obolibrary.org/obo/NCBITaxon_74649'
long_df_from_file.loc[long_df_from_file['var2_levels'] == 2, 'var2_levels'] = 'stamens'
long_df_from_file.loc[long_df_from_file['var2_uri'] == 2, 'var2_uri'] = 'http://purl.obolibrary.org/obo/PO_0009029'

long_df_from_file.loc[long_df_from_file['treatment'] == 3, 'treatment'] = 'R. chinensis \'Old Blush\' petals'
long_df_from_file.loc[long_df_from_file['var1_levels'] == 3, 'var1_levels'] = 'R. chinensis \'Old Blush\''
long_df_from_file.loc[long_df_from_file['var1_uri'] == 3, 'var1_uri'] = 'http://purl.obolibrary.org/obo/NCBITaxon_74649'
long_df_from_file.loc[long_df_from_file['var2_levels'] == 3, 'var2_levels'] = 'petals'
long_df_from_file.loc[long_df_from_file['var2_uri'] == 3, 'var2_uri'] = 'http://purl.obolibrary.org/obo/PO_0009032'

long_df_from_file.loc[long_df_from_file['treatment'] == 4, 'treatment'] = 'R. gigantea petals'
long_df_from_file.loc[long_df_from_file['var1_levels'] == 4, 'var1_levels'] = 'R. gigantea'
long_df_from_file.loc[long_df_from_file['var1_uri'] == 4, 'var1_uri'] = 'http://purl.obolibrary.org/obo/NCBITaxon_74650'
long_df_from_file.loc[long_df_from_file['var2_levels'] == 4, 'var2_levels'] = 'petals'
long_df_from_file.loc[long_df_from_file['var2_uri'] == 4, 'var2_uri'] = 'http://purl.obolibrary.org/obo/PO_0009032'

long_df_from_file.loc[long_df_from_file['treatment'] == 5, 'treatment'] = 'R. Damascena petals'
long_df_from_file.loc[long_df_from_file['var1_levels'] == 5, 'var1_levels'] = 'R. Damascena'
long_df_from_file.loc[long_df_from_file['var1_uri'] == 5, 'var1_uri'] = 'http://purl.obolibrary.org/obo/NCBITaxon_3765'
long_df_from_file.loc[long_df_from_file['var2_levels'] == 5, 'var2_levels'] = 'petals'
long_df_from_file.loc[long_df_from_file['var2_uri'] == 5, 'var2_uri'] = 'http://purl.obolibrary.org/obo/PO_0009032'

long_df_from_file.loc[long_df_from_file['treatment'] == 6, 'treatment'] = 'R. Gallica petals'
long_df_from_file.loc[long_df_from_file['var1_levels'] == 6, 'var1_levels'] = 'R. Gallica'
long_df_from_file.loc[long_df_from_file['var1_uri'] == 6, 'var1_uri'] = 'http://purl.obolibrary.org/obo/NCBITaxon_74632'
long_df_from_file.loc[long_df_from_file['var2_levels'] == 6, 'var2_levels'] = 'petals'
long_df_from_file.loc[long_df_from_file['var2_uri'] == 6, 'var2_uri'] = 'http://purl.obolibrary.org/obo/PO_0009032'

long_df_from_file.loc[long_df_from_file['treatment'] == 7, 'treatment'] = 'R. moschata petals'
long_df_from_file.loc[long_df_from_file['var1_levels'] == 7, 'var1_levels'] = 'R. moschata'
long_df_from_file.loc[long_df_from_file['var1_uri'] == 7, 'var1_uri'] = 'http://purl.obolibrary.org/obo/NCBITaxon_74646'
long_df_from_file.loc[long_df_from_file['var2_levels'] == 7, 'var2_levels'] = 'petals'
long_df_from_file.loc[long_df_from_file['var2_uri'] == 7, 'var2_uri'] = 'http://purl.obolibrary.org/obo/PO_0009032'

long_df_from_file.loc[long_df_from_file['treatment'] == 8, 'treatment'] = 'R. wichurana petals'
long_df_from_file.loc[long_df_from_file['var1_levels'] == 8, 'var1_levels'] = 'R. wichurana'
long_df_from_file.loc[long_df_from_file['var1_uri'] == 8, 'var1_uri'] = 'http://purl.obolibrary.org/obo/NCBITaxon_2094184'
long_df_from_file.loc[long_df_from_file['var2_levels'] == 8, 'var2_levels'] = 'petals'
long_df_from_file.loc[long_df_from_file['var2_uri'] == 8, 'var2_uri'] = 'http://purl.obolibrary.org/obo/PO_0009032'
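One such better way is to drive the repetitive .loc assignments from a lookup table. This sketch shows the idea on two of the treatment codes, using a toy DataFrame; the mapping values are copied from the assignments above.

```python
import pandas as pd

# Code -> (treatment, species, species URI, organ, organ URI);
# values copied from two of the assignments above.
TREATMENTS = {
    1: ("R. chinensis 'Old Blush' sepals", "R. chinensis 'Old Blush'",
        'http://purl.obolibrary.org/obo/NCBITaxon_74649',
        'sepals', 'http://purl.obolibrary.org/obo/PO_0009031'),
    4: ('R. gigantea petals', 'R. gigantea',
        'http://purl.obolibrary.org/obo/NCBITaxon_74650',
        'petals', 'http://purl.obolibrary.org/obo/PO_0009032'),
}

df = pd.DataFrame({'treatment': [1, 4, 1]})
cols = ['treatment', 'var1_levels', 'var1_uri', 'var2_levels', 'var2_uri']

# Expand each numeric code into its five annotation columns in one pass.
expanded = pd.DataFrame(df['treatment'].map(TREATMENTS).tolist(), columns=cols)
df = pd.concat([df.drop(columns='treatment'), expanded], axis=1)
```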
  1. Dealing with missing values: setting empty sample_mean and sem values to zero to enable calculation, using the Pandas fillna() function.
long_df_from_file['sample_mean'] = long_df_from_file['sample_mean'].fillna(0)
long_df_from_file['sem'] = long_df_from_file['sem'].fillna(0)
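fillna() replaces only the missing entries and leaves the rest of the column untouched, as a toy example with an invented gap shows:

```python
import pandas as pd

s = pd.Series([1.5, None, 2.0])  # None becomes NaN in a float Series
filled = s.fillna(0)             # only the gap is replaced

print(filled.tolist())  # [1.5, 0.0, 2.0]
```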
  1. Reordering the columns in the DataFrame to match the Frictionless Tabular Data Package layout. This is done very easily in Pandas by passing the desired column order as a list:
long_df_from_file = long_df_from_file[['chemical_name', 'inchi', 'chebi_identifier', 'var1_levels', 'var1_uri',
                                       'var2_levels', 'var2_uri', 'treatment', 'sample_size', 'sample_mean',
                                       'unit', 'sem']]
long_df_from_file.head()
  1. We are now ready to write the file to disk as a UTF-8 encoded, comma-delimited file with double-quoted values; we also drop the DataFrame index from the output.
try:
    HOME = os.getcwd()
    output_dir = os.path.join(HOME, '../data/processed/denovo')
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
    os.chdir(output_dir)
    # write the CSV whether or not the directory already existed
    long_df_from_file.to_csv("rose-aroma-naturegenetics2018-treatment-group-mean-sem-report-table-example.csv",
                             quoting=1,
                             doublequote=True, sep=',',
                             encoding='utf-8', index=False)
except IOError as e:
    print(e)
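quoting=1 corresponds to csv.QUOTE_ALL, so every value in the output is double-quoted. A small sketch with invented values, writing to an in-memory buffer instead of disk, shows the effect:

```python
import csv
import io

import pandas as pd

toy = pd.DataFrame({'chemical_name': ['geraniol'], 'sample_mean': [1.5]})

buffer = io.StringIO()
# csv.QUOTE_ALL == 1, the value used in the recipe above.
toy.to_csv(buffer, quoting=csv.QUOTE_ALL, doublequote=True,
           sep=',', index=False)

print(buffer.getvalue())  # every field, header included, is quoted
```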
  1. The final step is to validate the output against the JSON Data Package specification, which is stored in the JSON Tabular Data Package definition folder.
os.chdir('./../../../')
LOCAL = os.getcwd()
print("moving to directory: ", os.getcwd())
package_definition = os.path.join(LOCAL,'./rose-metabo-JSON-DP-validated/rose-aroma-naturegenetics2018-treatment-group-mean-sem-report-datapackage.json')
file_to_test = os.path.join(LOCAL,'../data/processed/denovo/rose-aroma-naturegenetics2018-treatment-group-mean-sem-report-table-example.csv')

print ("JSON data package definition:", package_definition)
print("csv file to evaluate:", file_to_test)
try:
    pack = Package(package_definition)
    print("valid package definition:", pack.valid)
    for e in pack.errors:
        print(e)

    report = validate(file_to_test)
    if report['valid']:
        print("Success! \n")
        print("'" + file_to_test + "'" + " is a valid Frictionless Tabular Data Package\n" + "It complies with the 'rose-aroma-naturegenetics2018-treatment-group-mean-sem-report-datapackage.json' definition\n")
    else:
        print("hmmm, something went wrong. Please see the validation report to trace the fault")

except IOError as e:
    print(e)
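Independently of goodtables, a quick standard-library sanity check can confirm that the CSV header matches the field order declared in the data package descriptor. A minimal sketch, assuming a descriptor shaped like a Tabular Data Package definition (the descriptor fragment and CSV content here are invented to match the columns written above):

```python
import csv
import io
import json

# Stripped-down descriptor fragment in Tabular Data Package style
# (invented for illustration).
descriptor = json.loads("""{
  "resources": [{
    "schema": {"fields": [{"name": "chemical_name"},
                          {"name": "sample_mean"},
                          {"name": "sem"}]}
  }]
}""")

# Invented CSV content standing in for the file written above.
csv_text = '"chemical_name","sample_mean","sem"\n"geraniol","1.5","0.1"\n'

header = next(csv.reader(io.StringIO(csv_text)))
expected = [f['name'] for f in descriptor['resources'][0]['schema']['fields']]

print(header == expected)  # True: header order matches the schema
```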
  1. This concludes this notebook, which showed you how to convert a metabolite profiling dataset from a publication's supplementary material into a FAIR Tabular Data Package. The other notebooks will show you how to visualize and plot the dataset, and also how to convert it to a semantic graph as a Linked Data representation, query it, and plot from it.