60 minutes (upon recipe completion)
The main purpose of this recipe is:
To detail the key elements for the creation of a data catalogue to enable data findability in an organisation.
We will cover the following points:
- metadata model selection
- annotation with controlled vocabularies
- data loading
- data indexing
- facet oriented searching
- minting of stable, persistent and resolvable identifiers
[Figure: workflow for building a data catalogue. Curation track: Building Data Catalogue → define curation policies → Curation Policies → select data model → DATS → select controlled vocabularies → key facets #1 to #n (CV1 to CVn). Data track: identify sources → Data Source #1 to #n → ETL → instance files → data persistence → document oriented database → build search index → Search Engine → ontology tree search and synonym space search → Query Expansion.]
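The data track of the workflow (persist records, build a search index, search by facet, expand queries with synonyms) can be illustrated with a minimal in-memory sketch. Everything here is an assumption made for illustration: the record fields, the `SYNONYMS` table standing in for a controlled vocabulary lookup, and the helper names are hypothetical, and a real deployment would use a document database and a search engine rather than Python dictionaries.

```python
from collections import Counter, defaultdict

# Hypothetical metadata records; field names are illustrative, not the DATS schema.
RECORDS = [
    {"id": "ds-1", "title": "Liver NMR imaging study",
     "organism": "Homo sapiens", "data_type": "NMR imaging"},
    {"id": "ds-2", "title": "Hepatic tissue confocal microscopy",
     "organism": "Mus musculus", "data_type": "confocal microscopy"},
    {"id": "ds-3", "title": "Kidney sequencing run",
     "organism": "Homo sapiens", "data_type": "nucleic acid sequencing"},
]

# Hypothetical synonym table standing in for a controlled vocabulary lookup.
SYNONYMS = {"liver": {"hepatic"}, "hepatic": {"liver"}}

# Fields exposed as search facets (an assumption for this sketch).
FACETS = ("organism", "data_type")


def build_index(records):
    """Build an inverted index mapping each title token to record ids."""
    index = defaultdict(set)
    for record in records:
        for token in record["title"].lower().split():
            index[token].add(record["id"])
    return index


def expand(term):
    """Query expansion: the term itself plus any known synonyms."""
    term = term.lower()
    return {term} | SYNONYMS.get(term, set())


def search(index, records, term, **facet_filters):
    """Return records matching the expanded term and every facet filter."""
    hits = set()
    for token in expand(term):
        hits |= index.get(token, set())
    by_id = {record["id"]: record for record in records}
    return [
        by_id[rid]
        for rid in sorted(hits)
        if all(by_id[rid].get(f) == v for f, v in facet_filters.items())
    ]


def facet_counts(results):
    """Count the values of each facet across a result list."""
    return {f: Counter(r[f] for r in results) for f in FACETS}


index = build_index(RECORDS)
print([r["id"] for r in search(index, RECORDS, "liver")])
# ['ds-1', 'ds-2']
print([r["id"] for r in search(index, RECORDS, "liver", organism="Homo sapiens")])
# ['ds-1']
```

The synonym expansion is what lets a query for "liver" also retrieve the record titled "Hepatic tissue confocal microscopy"; the facet filter then narrows the hits on annotated fields.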
| Capability | Initial Maturity Level | Final Maturity Level |
|---|---|---|
For Data Scientists, it is essential to be able to discover datasets of potential relevance in the context of data integration and [...]
For Database Managers, a lightweight solution is needed: shallow indexing that supports fast ingest without intensive curation, yet offers good potential for data discovery. The work should rely on approved data standards.
For lab scientists, the key is to keep the burden minimal when depositing a dataset in an institutional archive or simply registering a dataset in the Data Catalogue.
A Data Catalogue is a resource meant to allow fast identification of datasets. In keeping with the familiar notion of a catalogue (be it that of an exhibition or that of brand products), a data catalogue should be understood as a compendium of short descriptive metadata elements about an actual set of data. A Data Index or Data Catalogue does not store the datasets themselves but provides information about where the datasets can be obtained. Data Catalogues are therefore often used to index the content of Data Repositories and Data Archives, which host the actual datasets and are often (but not always) organized around specific data types or data production modalities (e.g. NMR imaging, confocal microscopy imaging, nucleic acid sequence archives and so on).
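As a rough illustration of the point that a catalogue holds short descriptive metadata plus a pointer to where the data live, the sketch below models a catalogue entry and mints a stable identifier for it. The `RESOLVER_BASE` URL, the field names and the use of a name-based UUID are all assumptions chosen for the example; they are not a scheme prescribed by this recipe.

```python
import uuid
from dataclasses import dataclass

# Hypothetical resolver prefix; a real catalogue would use its own domain.
RESOLVER_BASE = "https://example.org/catalogue/"


@dataclass(frozen=True)
class CatalogueEntry:
    """Short descriptive metadata; the dataset itself is NOT stored here."""
    identifier: str   # stable identifier minted by the catalogue
    title: str
    description: str
    repository: str   # name of the repository or archive hosting the data
    access_url: str   # where the dataset can actually be obtained


def mint_identifier(repository, local_accession):
    """Mint a deterministic identifier from the repository name and accession.

    uuid5 is name-based, so the same inputs always yield the same identifier,
    which keeps it stable across re-ingestion (an assumption of this sketch).
    """
    name = f"{repository}:{local_accession}"
    return RESOLVER_BASE + str(uuid.uuid5(uuid.NAMESPACE_URL, name))


def register(catalogue, title, description, repository, local_accession, access_url):
    """Register a dataset description in the catalogue and return the entry."""
    identifier = mint_identifier(repository, local_accession)
    entry = CatalogueEntry(identifier, title, description, repository, access_url)
    catalogue[identifier] = entry
    return entry


def resolve(catalogue, identifier):
    """Resolve a catalogue identifier to the location of the dataset."""
    return catalogue[identifier].access_url


catalogue = {}
entry = register(
    catalogue,
    title="Example confocal microscopy dataset",
    description="Illustrative record only.",
    repository="example-image-archive",
    local_accession="IMG-0001",
    access_url="https://example.org/archive/IMG-0001",
)
print(resolve(catalogue, entry.identifier))
# https://example.org/archive/IMG-0001
```

Deriving the identifier from the repository name and local accession means that registering the same dataset twice yields the same identifier rather than a duplicate entry.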
Data Catalogues have been identified as critical infrastructure, and a number of models therefore exist to support their implementation.
DCAT: In the world of semantic web technologies, the W3C DCAT specifications (v1 and the newly released version 2) provide a vocabulary to express data catalogue metadata in RDF.
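To give a feel for what catalogue metadata expressed with DCAT might look like, here is a minimal sketch that builds a JSON-LD document (one RDF serialization) using only the standard library. The DCAT and Dublin Core terms used (`dcat:Catalog`, `dcat:Dataset`, `dcat:dataset`, `dcat:distribution`, `dcat:Distribution`, `dcat:accessURL`, `dct:title`, `dct:description`) come from those vocabularies; the helper functions, identifiers and URLs are hypothetical.

```python
import json

# Namespace prefixes for the DCAT and Dublin Core Terms vocabularies.
CONTEXT = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
}


def dataset_to_dcat(identifier, title, description, access_url):
    """Describe one dataset and where it can be accessed."""
    return {
        "@id": identifier,
        "@type": "dcat:Dataset",
        "dct:title": title,
        "dct:description": description,
        "dcat:distribution": {
            "@type": "dcat:Distribution",
            "dcat:accessURL": {"@id": access_url},
        },
    }


def catalogue_to_dcat(catalogue_id, title, datasets):
    """Wrap dataset descriptions in a dcat:Catalog document."""
    return {
        "@context": CONTEXT,
        "@id": catalogue_id,
        "@type": "dcat:Catalog",
        "dct:title": title,
        "dcat:dataset": datasets,
    }


# Hypothetical identifiers and URLs, for illustration only.
dataset = dataset_to_dcat(
    "https://example.org/catalogue/ds-1",
    "Example dataset",
    "Illustrative record only.",
    "https://example.org/archive/ds-1",
)
document = catalogue_to_dcat(
    "https://example.org/catalogue", "Example data catalogue", [dataset]
)
print(json.dumps(document, indent=2))
```

A dedicated RDF library could be used instead to emit Turtle or RDF/XML; plain JSON-LD is used here only to keep the sketch free of dependencies.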
The vocabulary developed by the consortium of search engines has defined a metadata profile for
A number of Data Indexes and Data Catalogues are populated either by harvesting dataset metadata from primary Data Repositories or by harvesting JSON-LD files served by these same pages for rapid, shallow indexing. The former method is often richer but requires more
TO BE AUGMENTED.
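The shallow indexing route mentioned above, harvesting JSON-LD served by dataset landing pages, could be sketched as follows. This is only an illustration under stated assumptions: it presumes the metadata is embedded in `<script type="application/ld+json">` elements and typed as a schema.org `Dataset`; the class name, function name and sample page are hypothetical, and fetching pages over the network is left out.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed content of every application/ld+json script element."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.documents.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks rather than abort the harvest


def harvest_datasets(html):
    """Return the JSON-LD objects in a page whose @type is 'Dataset'."""
    parser = JsonLdExtractor()
    parser.feed(html)
    found = []
    for document in parser.documents:
        # A script element may hold a single object or a list of objects.
        items = document if isinstance(document, list) else [document]
        found.extend(
            item for item in items
            if isinstance(item, dict) and item.get("@type") == "Dataset"
        )
    return found


# Hypothetical landing page, for illustration only.
SAMPLE_PAGE = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Dataset",
 "name": "Example dataset", "url": "https://example.org/datasets/1"}
</script>
<script type="text/javascript">var x = 1;</script>
</head><body>Landing page</body></html>
"""

for record in harvest_datasets(SAMPLE_PAGE):
    print(record["name"], record["url"])
# Example dataset https://example.org/datasets/1
```

The harvested dictionaries can then be mapped onto whatever metadata model the catalogue uses before being persisted and indexed.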
import dats
import json
import pandas as pd
...
| Name | Affiliation | ORCID | CRediT role |
|---|---|---|---|
| Philippe Rocca-Serra | University of Oxford, Data Readiness Group | 0000-0001-9853-5668 | Writing - Original Draft |
This page is released under the Creative Commons 4.0 BY license.