Exploring and comparing Rose Scent profiles stored as tabular data packages with plotnine, a python port of R ggplot2

*Philippe Rocca-Serra (philippe.rocca-serra[at]oerc.ox.ac.uk), University of Oxford e-Research Centre

Background:

Experimental results such as the metabolite profiling data published in [1,2] can be reported in a straightforward way using OKFN Data Packages. Such packages are easily parsed into data frames and used for data visualization with libraries that implement grammar-of-graphics concepts. Here, we show how to use plotnine, a Python equivalent of ggplot2, the rich R graphics library (https://ggplot2.tidyverse.org/). A few lines of code are enough to query the datasets and rapidly explore the information. Most importantly, this rapid exploration is possible because the independent variables and their levels have been clearly and unambiguously declared in the Tabular Data Package itself.
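To make the idea concrete, here is a minimal sketch of how a Tabular Data Package descriptor and its CSV resource can be read with pandas alone. The package name, file names, field names and values are all invented for illustration and are not taken from the rose datasets; the real packages declare many more fields.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical, minimal Tabular Data Package written to a temporary folder:
# one CSV resource plus a datapackage.json descriptor declaring its fields.
package_dir = Path(tempfile.mkdtemp())

(package_dir / "profile.csv").write_text(
    "chemical_name,treatment,sample_mean\n"
    "compound_a,control,1.2\n"
    "compound_a,treated,2.4\n"
    "compound_b,control,0.8\n"
)

descriptor = {
    "name": "example-scent-profile",
    "resources": [
        {
            "name": "profile",
            "path": "profile.csv",
            "profile": "tabular-data-resource",
            "schema": {
                "fields": [
                    {"name": "chemical_name", "type": "string"},
                    {"name": "treatment", "type": "string"},
                    {"name": "sample_mean", "type": "number"},
                ]
            },
        }
    ],
}
(package_dir / "datapackage.json").write_text(json.dumps(descriptor, indent=2))

# Load the descriptor, then read the resource it points to as a data frame.
loaded = json.loads((package_dir / "datapackage.json").read_text())
resource = loaded["resources"][0]
declared_fields = [field["name"] for field in resource["schema"]["fields"]]
table = pd.read_csv(package_dir / resource["path"])

# The CSV header should match the fields declared in the schema.
assert list(table.columns) == declared_fields

# The levels of an independent variable can be read straight from the table.
treatment_levels = sorted(table["treatment"].unique())
print(treatment_levels)
```

Because the descriptor names every field, the column holding an independent variable (here the made-up `treatment`) can be identified without guessing from the data.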

  1. Let's begin by importing the Python packages used to load and plot data formatted as a JSON Data Package
import pandas as pd
import numpy as np
from plotnine import *
  1. Reading the data:

We now simply read in the comma-separated file associated with the tabular data package (a "long" table):

data = pd.read_csv("../data/processed/rose-data/rose-aroma-naturegenetics2018-treatment-group-mean-sem-report-table-example.csv")

Alternatively, one could read the relevant data file from the corresponding Zenodo dataset:

#data = pd.read_csv("https://zenodo.org/api/files/ba3fbc84-14af-4858-a9ed-e6cfe8d4efd2/rose-aroma-naturegenetics2018-treatment-group-mean-sem-report-table-example.csv") 
data.head()
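Since the table can come either from a local copy or from a remote archive, one possible way to make the loading step robust is to try several sources in order. The sketch below is only an illustration: `read_first_available` is a hypothetical helper, not a pandas function, and the demonstration uses a throwaway CSV with invented values rather than the rose data.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_first_available(sources):
    """Return (data frame, source) for the first source pandas can read.

    `sources` is an ordered list of local paths or URLs. This helper is
    hypothetical and written for this example only.
    """
    errors = []
    for source in sources:
        try:
            return pd.read_csv(source), source
        except (OSError, ValueError) as exc:
            # Missing files and unreachable URLs raise OSError subclasses;
            # empty or malformed files raise ValueError subclasses.
            errors.append(f"{source}: {exc}")
    raise FileNotFoundError("No readable source:\n" + "\n".join(errors))


# Demonstration with a temporary file standing in for the real table.
tmp_dir = Path(tempfile.mkdtemp())
local_copy = tmp_dir / "example.csv"
local_copy.write_text("chemical_name,sample_mean\ncompound_a,1.0\n")

data, used = read_first_available([tmp_dir / "missing.csv", local_copy])
print(used)
```

In this notebook, the list of sources would hold the local path first and the Zenodo URL second, so that `data` is defined whenever at least one of them is reachable.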
  1. Plotting the data: We then generate a bar plot using the Python plotnine library, which offers functionality similar to that of the R ggplot2 package.
# width = figure_size[0]
# height = figure_size[0] * aspect_ratio
gray = '#666666'
orange = '#FF8000'
blue = '#3333FF'

p1 = (ggplot(data)
 + aes('chemical_name', 'sample_mean',fill='factor(treatment)')
 + geom_col()
 
 + theme(axis_text_x=element_text(rotation=90, hjust=1, fontsize=6, color=blue))
 + theme(axis_text_y=element_text(rotation=90, hjust=2, fontsize=6, color=orange))
 + scale_y_continuous(expand = (0,0))   
 + facet_wrap('~treatment', dir='v',ncol=1)
 + theme(figure_size = (8, 16))      
)

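# Optional variant with a blue panel background. It is assigned to a name
# because a bare expression in the middle of a cell is discarded.
p1_blue = p1 + theme(panel_background=element_rect(fill=blue))

# Display the plot (the last expression of the cell is rendered).
p1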
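With many compounds on the x axis, the bars can be easier to read when they are sorted by abundance. The following is a sketch of one way to do this with pandas before plotting; it relies on the discrete axis following the category order of a categorical column. The column names match those used in the plot, but the table and its values are made up.

```python
import pandas as pd

# Made-up long table using the column names from the plot above.
data_example = pd.DataFrame(
    {
        "chemical_name": ["compound_a", "compound_b", "compound_c"] * 2,
        "treatment": ["t1"] * 3 + ["t2"] * 3,
        "sample_mean": [0.5, 3.0, 1.5, 0.7, 2.0, 1.9],
    }
)

# Rank compounds by their mean across treatments, largest first.
order = (
    data_example.groupby("chemical_name")["sample_mean"]
    .mean()
    .sort_values(ascending=False)
    .index.tolist()
)

# An ordered categorical records the desired order of the x axis.
data_example["chemical_name"] = pd.Categorical(
    data_example["chemical_name"], categories=order, ordered=True
)
print(order)
```

Passing a table prepared this way to `ggplot` should keep the same bar order in every facet, which makes panels easier to compare.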
  1. Let's now compare the dataset generated in 2015 and the dataset generated in 2018.

Both datasets have been generated by the same team, on the same genotype (Rosa chinensis 'Old Blush') and organism part ('sepals'). Both datasets are held in a Tabular Data Package with the same structure. To perform the comparison, we have simply created another tabular data package, which retains the exact same structure and holds only the measurements for the relevant conditions extracted from each dataset (the function used to create this file is omitted).
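The function that built the integrated file is not shown in this notebook. As a rough idea of what such a step could look like, and not a reconstruction of the original code, the sketch below filters two long tables on a shared condition, tags each with its publication year and stacks them. The `integrate` helper, the `organism_part` column name, the 'petals' level and all values are assumptions made for the example.

```python
import pandas as pd

# Made-up stand-ins for the two long tables; all values are invented.
science2015 = pd.DataFrame(
    {
        "chemical_name": ["compound_a", "compound_b"],
        "organism_part": ["sepals", "sepals"],
        "normalized_to_total_sum_concentration": [0.6, 0.4],
    }
)
ng2018 = pd.DataFrame(
    {
        "chemical_name": ["compound_a", "compound_b", "compound_a"],
        "organism_part": ["sepals", "sepals", "petals"],
        "normalized_to_total_sum_concentration": [0.55, 0.45, 0.9],
    }
)


def integrate(tables_by_year, organism_part):
    """Stack the rows matching `organism_part`, tagged by publication year.

    Hypothetical helper: it assumes every table shares the same columns.
    """
    frames = []
    for year, table in tables_by_year.items():
        subset = table[table["organism_part"] == organism_part].copy()
        subset["publication_year"] = year
        frames.append(subset)
    return pd.concat(frames, ignore_index=True)


integrated = integrate({2015: science2015, 2018: ng2018}, "sepals")
print(integrated)
```

Because both source packages share one structure, the stacked table keeps that structure and only gains the `publication_year` column used for faceting.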

ng2018sc2015 = pd.read_csv("../data/processed/rose-data/rose_aroma_compound_science2015_vs_NG2018_data_integration.csv")
# ng2018sc2015 = pd.read_csv("https://zenodo.org/api/files/268f29fc-8ead-4049-bb86-181b72073682/rose_aroma_compound_science2015_vs_NG2018_data_integration.csv")
  1. We generate another barplot, which shows the concentration of the chemicals targeted by the GC-MS profiling assay.
(ggplot(ng2018sc2015)
 + aes('chemical_name', 'normalized_to_total_sum_concentration',fill='factor(publication_year)')
 + geom_col()
 + facet_wrap('~publication_year', dir='h', ncol=1)
 + theme(axis_text_x=element_text(rotation=90, hjust=1, fontsize=6))

)

What do we see? The figure shows how consistent the chemical profile of the scent is between the two studies, with prevalent compounds such as X, Y, and Z showing roughly similar relative amounts within and across studies.
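The visual impression of consistency can also be checked numerically. The sketch below pivots a long table to one column per study, then compares the two profiles by rank correlation and by absolute difference. It uses the column names of the integrated table, but the compounds and values are invented, so the numbers say nothing about the real data.

```python
import pandas as pd

# Made-up long table with the column names of the integrated file.
long_table = pd.DataFrame(
    {
        "chemical_name": ["compound_a", "compound_b", "compound_c"] * 2,
        "publication_year": [2015] * 3 + [2018] * 3,
        "normalized_to_total_sum_concentration": [
            0.50, 0.30, 0.20,
            0.45, 0.35, 0.20,
        ],
    }
)

# One row per compound, one column per study.
wide = long_table.pivot(
    index="chemical_name",
    columns="publication_year",
    values="normalized_to_total_sum_concentration",
)

# Pearson correlation of the within-study ranks, i.e. a Spearman-style
# rank correlation between the two profiles.
rank_corr = wide.rank().corr().loc[2015, 2018]

# Per-compound absolute difference between the two studies.
wide["abs_difference"] = (wide[2015] - wide[2018]).abs()

print(rank_corr)
print(wide)
```

A rank correlation close to 1 indicates that the same compounds dominate the profile in both studies, while the difference column points to the compounds whose relative amount changed most.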