5. Provenance information



Recipe Overview

  • Reading Time: 20 minutes
  • Executable Code: No
  • Recipe Type: Hands-on
  • Audience: Principal Investigator, Data Manager, Data Scientist

5.1. Main Objectives

In any data integration task, and particularly in the pharmaceutical domain, establishing trust in data sources is essential. New datasets and information sources go through many checks to confirm that they meet a given level of quality. One of these checks concerns the origin of the information, in other words, its provenance. Provenance describes how the data was produced and identifies the agents involved (humans, software, workflows), so that traceability and accountability can be established. Audit trails, versioning, and authorship are essential here: if a distortion is detected in a downstream analysis, they make it possible to trace it back to possible sources of error.

5.2. Introduction

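Data provenance is closely related to data lineage: a record of where data originates and how it is transformed over time. See https://en.wikipedia.org/wiki/Data_lineage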

5.3. PROV vocabulary

The W3C has published PROV, a family of specifications for expressing provenance information, which includes an RDF vocabulary (PROV-O). Provenance records can be written in any RDF serialization, including JSON, since JSON-LD is an official serialization of RDF.
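As an illustration, the sketch below builds a small provenance record in JSON-LD using terms from the PROV namespace (`prov:Entity`, `prov:Activity`, `prov:Agent`, `prov:wasGeneratedBy`, and so on). The `ex:` identifiers, the dataset names, and the `lineage` helper are hypothetical and only serve the example; this is a minimal sketch rather than a recommended modelling pattern.

```python
import json

# Hypothetical identifiers under an example namespace; only the prov: terms
# come from the W3C PROV vocabulary.
doc = {
    "@context": {
        "prov": "http://www.w3.org/ns/prov#",
        "ex": "http://example.org/",
    },
    "@graph": [
        {"@id": "ex:curator", "@type": "prov:Agent"},
        {"@id": "ex:raw-dataset", "@type": "prov:Entity"},
        {
            "@id": "ex:normalisation-run",
            "@type": "prov:Activity",
            "prov:used": {"@id": "ex:raw-dataset"},
            "prov:wasAssociatedWith": {"@id": "ex:curator"},
        },
        {
            "@id": "ex:clean-dataset",
            "@type": "prov:Entity",
            "prov:wasDerivedFrom": {"@id": "ex:raw-dataset"},
            "prov:wasGeneratedBy": {"@id": "ex:normalisation-run"},
            "prov:wasAttributedTo": {"@id": "ex:curator"},
        },
    ],
}


def lineage(graph, entity_id):
    """Follow prov:wasDerivedFrom links from an entity back to its sources."""
    by_id = {node["@id"]: node for node in graph}
    chain = [entity_id]
    while "prov:wasDerivedFrom" in by_id.get(chain[-1], {}):
        parent = by_id[chain[-1]]["prov:wasDerivedFrom"]["@id"]
        if parent in chain:  # guard against circular derivations
            break
        chain.append(parent)
    return chain


print(json.dumps(doc, indent=2))
print(lineage(doc["@graph"], "ex:clean-dataset"))
```

Walking the `prov:wasDerivedFrom` links is the kind of trace-back that provenance is meant to support: starting from a suspect dataset, one can list the sources it was derived from.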

5.4. CamFlow

CamFlow is a Linux Security Module (LSM) designed to capture data provenance for the purpose of system audit [1].

CamFlow supports two output formats.

  • W3C PROV-JSON format

Example of a fifo node in W3C PROV format:

"ABAAAAAAACAe9wIAAAAAAE7aeaI+200UAAAAAAAAAAA=": {
    "cf:id": "194334",
    "prov:type": "fifo",
    "cf:boot_id": 2725894734,
    "cf:machine_id": 340646718,
    "cf:version": 0,
    "cf:date": "2017:01:03T16:43:30",
    "cf:jiffies": "4297436711",
    "cf:uid": 1000,
    "cf:gid": 1000,
    "cf:mode": "0x1180",
    "cf:secctx": "unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023",
    "cf:ino": 51964,
    "cf:uuid": "32b7218a-01a0-c7c9-17b1-666f200b8912",
    "prov:label": "[fifo] 0"
}
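
A few fields in such a record need light parsing before they can be used in an audit. The sketch below covers two of them, copying the values from the record above. It assumes that `cf:mode` holds the Linux inode mode (`st_mode`) in hexadecimal; that reading is an assumption, although it agrees with the `fifo` type of this record.

```python
import stat
from datetime import datetime

# Fields copied from the fifo node record above.
node = {
    "prov:type": "fifo",
    "cf:date": "2017:01:03T16:43:30",
    "cf:mode": "0x1180",
}

# cf:date separates year, month and day with colons, so it is not ISO 8601
# and needs an explicit format string.
captured = datetime.strptime(node["cf:date"], "%Y:%m:%dT%H:%M:%S")

# Assumption: cf:mode is the hexadecimal Linux inode mode (st_mode).
mode = int(node["cf:mode"], 16)
is_fifo = stat.S_ISFIFO(mode)
permissions = stat.filemode(mode)

print(captured.isoformat(), is_fifo, permissions)
```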

Example of a write edge in W3C PROV format:

"QAAAAAAAQIANAAAAAAAAAE7aeaI+200UAAAAAAAAAAA=": {
    "cf:id": "13",
    "prov:type": "write",
    "cf:boot_id": 2725894734,
    "cf:machine_id": 340646718,
    "cf:date": "2017:01:03T16:43:30",
    "cf:jiffies": "4297436711",
    "prov:label": "write",
    "cf:allowed": "true",
    "prov:activity": "AQAAAAAAAEAf9wIAAAAAAE7aeaI+200UAQAAAAAAAAA=",
    "prov:entity": "ABAAAAAAACAe9wIAAAAAAE7aeaI+200UAQAAAAAAAAA=",
    "cf:offset": "0"
}
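
The record keys and the `prov:activity` / `prov:entity` references look like base64-encoded binary identifiers. The sketch below decodes them under an assumed layout: type, id, boot id, machine id and version as little-endian integers, followed by padding. This layout is inferred from the sample values on this page, not taken from a CamFlow specification, and the `Identifier` and `decode_identifier` names are hypothetical.

```python
import base64
import struct
from collections import namedtuple

# Assumed layout (inferred from the samples, not from a specification):
# type (u64), id (u64), boot_id (u32), machine_id (u32), version (u32),
# then 4 padding bytes, all little-endian.
Identifier = namedtuple("Identifier", "type id boot_id machine_id version")
LAYOUT = struct.Struct("<QQIII4x")


def decode_identifier(key):
    """Decode a base64 record key into its assumed numeric fields."""
    return Identifier(*LAYOUT.unpack(base64.b64decode(key)))


# Key of the fifo node record and the references held by the write edge.
node_key = "ABAAAAAAACAe9wIAAAAAAE7aeaI+200UAAAAAAAAAAA="
edge = {
    "prov:type": "write",
    "prov:activity": "AQAAAAAAAEAf9wIAAAAAAE7aeaI+200UAQAAAAAAAAA=",
    "prov:entity": "ABAAAAAAACAe9wIAAAAAAE7aeaI+200UAQAAAAAAAAA=",
}

node_id = decode_identifier(node_key)
source = decode_identifier(edge["prov:activity"])
target = decode_identifier(edge["prov:entity"])

print(node_id)
print(edge["prov:type"], source.id, "->", target.id, "version", target.version)
```

Under this assumed layout, the decoded id, boot id and machine id of the node key match the `cf:id`, `cf:boot_id` and `cf:machine_id` fields of the fifo record, and the entity referenced by the write edge has the same id as that fifo with a different version number.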

5.5. Conclusion

5.6. References

[1] Thomas F. J.-M. Pasquier, Xueyuan Han, Mark Goldstein, Thomas Moyer, David M. Eyers, Margo I. Seltzer, and Jean Bacon. Practical whole-system provenance capture. CoRR, 2017. URL: http://arxiv.org/abs/1711.05296, arXiv:1711.05296.


5.7. Authors

Name | ORCID | Affiliation | Type | ELIXIR Node | Contribution
 | | University of Oxford | | | Writing - Original Draft


5.8. License

This page is released under the Creative Commons Attribution 4.0 (CC BY 4.0) license.