Level 2 aims to enhance the usability of a project’s or study’s structured data, which is often represented by multiple related datasets. Different projects usually have their own data model and collect different subsets of clinical, molecular, imaging or other data. The FAIRplus-DSM model distinguishes between structured data, unstructured data, and data objects in a project. Structured data include subject-based clinical data, sample-based assay data and other data associated with the data schema. Indicators at this level therefore treat the Dataset as the FAIR Data Object and add requirements on the structural metadata of the Dataset, namely the Dataset Fields and the corresponding Dataset Field Values.
This level of maturity aims to increase the FAIRness level of structured data by focusing on Dataset-level structural metadata and Project-level contextual metadata.
This level of maturity is aimed at data hosted within project-based data repositories, general purpose data repositories or data catalogues.
In terms of hosting, level 2 compliant datasets would be hosted in project-specific or institutional data repositories that provide all the accessibility and storage capabilities required for sharing and reusing data among the project’s data users.
Example
To comply with level 2 maturity requirements, a dataset needs to conform to a locally defined domain model, such as a project data dictionary, or to a standard generic domain model such as W3C’s DCAT or Bioschemas. This allows data values to be mapped uniformly using standardised terms for both variables and values, where possible.
Where appropriate, datasets should also conform to “Tidy Data Principles”, i.e. each column and field should represent a single variable. In the example shown below, the initial data encodes both the measured variable (e.g. sysbp - systolic blood pressure) and the visit during which the measurement was taken (e.g. sc - screening). After transformation, the measured variables have been reduced from 6 to 3 columns (one for each variable) and a further column added to represent the visit. This makes querying and filtering the data much easier.
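The reshaping described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's own tooling: only the `sysbp` variable and the `sc` visit code come from the text, while the other column names (`diabp`, `hr`, `fu`, `subject`) and the `to_tidy` helper are assumptions made for the example.

```python
# Hypothetical wide-format input: each measurement column encodes
# "<variable>_<visit>". Only "sysbp" and "sc" appear in the text above;
# the remaining names are assumptions for this sketch.
wide_rows = [
    {"subject": "S01", "sysbp_sc": 120, "sysbp_fu": 118,
     "diabp_sc": 80, "diabp_fu": 79, "hr_sc": 64, "hr_fu": 66},
]


def to_tidy(rows, id_field="subject", sep="_"):
    """Turn '<variable>_<visit>' columns into one column per variable
    plus an explicit 'visit' column (one output row per subject and visit)."""
    tidy = {}
    for row in rows:
        for column, value in row.items():
            if column == id_field:
                continue
            # Split on the last separator: "sysbp_sc" -> ("sysbp", "sc")
            variable, visit = column.rsplit(sep, 1)
            key = (row[id_field], visit)
            record = tidy.setdefault(
                key, {id_field: row[id_field], "visit": visit}
            )
            record[variable] = value
    return list(tidy.values())


tidy_rows = to_tidy(wide_rows)
for record in tidy_rows:
    print(record)
```

The six measurement columns become three variable columns plus a `visit` column, so a query such as "all screening measurements" becomes a simple filter on `visit`.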
A locally defined Domain Model contains concepts that describe the overall project/study design, the relationships between the Datasets, the key entities reported within the Datasets and the relationships between them.
Maturity Level
2
Category
Content and Context
Granularity Level
Dataset
Description
This is a metadata-related requirement focusing on context and domain description. This is an entry level requirement to describe the ‘Domain’ of the data, which at this level might not be fully represented by either the hosting environment or an adopted standard (level 3). The Metadata Record should include information that can help a researcher understand the data context, especially in relation to the overall project or study design that this dataset belongs to as well as the entities that are represented by the dataset content.
Related DSM Indicator
Related FAIR Principle
R1. (Meta)data are richly described with a plurality of accurate and relevant attributes
This is a data-related requirement that is a prerequisite for the ‘accuracy’ of metadata that FAIR principle R1 refers to. This requirement is borrowed from one of the key Tidy Data Principles, which states that each column/field should be a single variable. This prevents the often-seen scenario in structured data whereby a single column header carries values for more than one variable. For example, in the columns ‘temperature_screening’ and ‘temperature_followup’, each column implicitly carries a value for a visit variable and a value for an observation (temperature in this case). This indicator therefore requires the data manager to split such columns into two fields, one per variable: a field for temperature and a field for visit. This is a prerequisite for DSM-2-C5 and DSM-2-C6, since each Dataset Field is expected to control its terms and create a local dictionary. Unless individual concepts are reported per Dataset Field, it will not be possible to find suitable terms that can later be standardised for level 3.
Related DSM Indicator
DSM-2-C5, DSM-2-C6
Related FAIR Principle
R1. (Meta)data are richly described with a plurality of accurate and relevant attributes
This is a data-related requirement that focuses on the consistency of a Dataset’s textual content. It also relates to the ‘accuracy’ and overall consistency of data content within and across multiple related project datasets. Level 2 content standardisation does not require compliance with standard terminologies or ontologies. However, to achieve Level 2 content standardisation, textual values in text-based Dataset Fields are expected to be consistently reported using locally defined terms. These local terms are defined in a local Data Dictionary, which should also be reported as part of the content-related metadata (DSM-2-C6).
Related DSM Indicator
DSM-2-C6
Related FAIR Principle
F2. Data are described with rich metadata, R1. (Meta)data are richly described with a plurality of accurate and relevant attributes
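As an illustration of the content-consistency requirement above, the following sketch checks text-based fields against locally defined terms. The dictionary contents, field names and the `find_inconsistent_values` helper are hypothetical and stand in for whatever a project's own Data Dictionary defines.

```python
# Hypothetical local data dictionary: field name -> permitted local terms.
# These terms are assumptions for the sketch, not part of the DSM model.
LOCAL_TERMS = {
    "visit": {"screening", "followup"},
    "sex": {"female", "male", "unknown"},
}


def find_inconsistent_values(rows, local_terms):
    """Return (row_index, field, value) for every value that is not
    one of the locally defined terms for its field."""
    problems = []
    for index, row in enumerate(rows):
        for field, allowed in local_terms.items():
            if field in row and row[field] not in allowed:
                problems.append((index, field, row[field]))
    return problems


rows = [
    {"visit": "screening", "sex": "female"},
    {"visit": "Screening visit", "sex": "F"},  # free-text variants
]
problems = find_inconsistent_values(rows, LOCAL_TERMS)
for problem in problems:
    print(problem)
```

A check of this kind does not require any external ontology; it only asks that the same local term is used every time the same concept is reported.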
Dataset Descriptor includes [Field-level Metadata](https://fairplus.github.io/Data-Maturity/docs/Glossary/#field-level-metadata) as prescribed by the adopted Dataset Model
Maturity Level
2
Category
Content and Context
Granularity Level
Dataset Field
Description
This is a metadata-related requirement focusing on the Dataset’s structure. It requires structural metadata to be included in the Dataset’s Metadata Record, irrespective of how this information is represented (DSM-2-R1). Dataset Field metadata include ‘field name’, ‘description’ and ‘data type’.
Related DSM Indicator
DSM-2-R1
Related FAIR Principle
F2. Data are described with rich metadata, R1. (Meta)data are richly described with a plurality of accurate and relevant attributes
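A Dataset Descriptor carrying the field-level metadata named above (‘field name’, ‘description’, ‘data type’) could look like the sketch below. The key names, the `vital_signs` dataset and the `missing_field_metadata` helper are assumptions chosen for illustration; the actual layout is whatever the adopted Dataset Model prescribes.

```python
import json

# Assumed key names for the three field-level attributes named in the text.
REQUIRED_KEYS = ("name", "description", "data_type")

# Hypothetical descriptor for an illustrative dataset.
dataset_descriptor = {
    "dataset": "vital_signs",
    "fields": [
        {"name": "subject", "description": "Study subject identifier",
         "data_type": "string"},
        {"name": "visit", "description": "Visit at which the measurement was taken",
         "data_type": "string"},
        {"name": "sysbp", "description": "Systolic blood pressure",
         "data_type": "integer"},
    ],
}


def missing_field_metadata(descriptor):
    """List (field name, missing key) pairs for fields whose
    field-level metadata is incomplete."""
    return [
        (field.get("name", "<unnamed>"), key)
        for field in descriptor["fields"]
        for key in REQUIRED_KEYS
        if not field.get(key)
    ]


print(json.dumps(dataset_descriptor, indent=2))
print(missing_field_metadata(dataset_descriptor))
```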
This is a metadata-related requirement. This indicator is related to DSM-2-C5, which requires that textual data values used within and across related datasets be consistently reported using locally defined terms or values. Where numeric values are used instead of textual values, a data dictionary is needed to map these values and allow users to interpret the data. Therefore, a data dictionary that associates each Dataset Field with its list of permissible terms or values and their meanings should also be made available. This could either be represented inside the Metadata Record itself if the metadata schema allows (DSM-2-R3), or otherwise represented separately.
Related DSM Indicator
DSM-2-R3
Related FAIR Principle
F2. Data are described with rich metadata, R1. (Meta)data are richly described with a plurality of accurate and relevant attributes
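Where a Dataset reports numeric codes rather than text, the data dictionary described above is what lets a user interpret them. A minimal sketch, assuming a hypothetical `sex` field with made-up codes and a hypothetical `decode` helper:

```python
# Hypothetical data dictionary: field -> {coded value: meaning}.
# The field and its codes are assumptions for this sketch.
DATA_DICTIONARY = {
    "sex": {1: "female", 2: "male", 9: "unknown"},
}


def decode(row, dictionary):
    """Replace coded values with their meanings for every field that
    has an entry in the dictionary; other fields are left unchanged."""
    decoded = {}
    for field, value in row.items():
        codes = dictionary.get(field)
        decoded[field] = codes.get(value, value) if codes else value
    return decoded


print(decode({"subject": "S01", "sex": 2}, DATA_DICTIONARY))
```

In this sketch a code that is absent from the dictionary is passed through untouched; a stricter implementation might reject it instead.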
Data hosting environment stores data in accordance with a locally defined Domain Model for persistence purposes
Maturity Level
2
Category
Hosting Environment
Capability
Storage
Description
This is a data-storage-related requirement. To provide a basic level of contextual browsing and searching, the data hosting environment/resource should offer a common data model, even if it is only a locally defined or project-specific one, against which all hosted datasets can be navigated and explored.
Related DSM Indicator
Related FAIR Principle
R1. (Meta)data are richly described with a plurality of accurate and relevant attributes
Metadata hosting environment provides programmatic access and retrieval (API) for the Dataset Descriptor
Maturity Level
2
Category
Hosting Environment
Capability
Metadata Retrieval
Description
This is a metadata-retrieval-related requirement. The metadata hosting environment (which could be the same as or different from the data hosting environment) should offer the capability to retrieve the Dataset’s Metadata Record using API technologies such as REST, RPC or GraphQL.
Related DSM Indicator
Related FAIR Principle
A1. (Meta)data are retrievable by their identifier using a standardised communications protocol
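For a REST-style interface, programmatic retrieval of a Metadata Record might look like the sketch below. The URL layout (`/datasets/<id>/metadata`), the JSON response and both function names are assumptions for illustration only; a real hosting environment defines its own endpoints.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen


def metadata_url(base_url, dataset_id):
    """Build the (assumed) REST endpoint for a dataset's Metadata Record."""
    return f"{base_url.rstrip('/')}/datasets/{quote(dataset_id, safe='')}/metadata"


def fetch_metadata_record(base_url, dataset_id, timeout=10):
    """Retrieve the Metadata Record as parsed JSON.
    Assumes the (hypothetical) endpoint answers with a JSON document."""
    request = Request(
        metadata_url(base_url, dataset_id),
        headers={"Accept": "application/json"},
    )
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Example URL only; no request is sent here.
print(metadata_url("https://repo.example.org/api/", "DS 001"))
```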
Data hosting environment offers the capability to browse and search related Datasets
Maturity Level
2
Category
Hosting Environment
Capability
Searching Capability
Description
This capability provides enhanced contextual interpretation of multiple related datasets when they are linked to a common study or project. It is enabled by the hosting environment capitalising on the contextual metadata and dataset structural metadata made available at this level of maturity and established by DSM-2-C2, DSM-2-C4 and DSM-2-C6.
Related DSM Indicator
DSM-2-C2, DSM-2-C4, DSM-2-C6
Related FAIR Principle
F4. (Meta)data are registered or indexed in a searchable resource
Contextual Metadata necessary to understand and interpret Datasets’ content is defined and conforms to a locally defined Domain Model
Maturity Level
2
Category
Metadata Representation
Granularity Level
Project
Description
This is a metadata-related requirement focusing on the representation of the reported Contextual Metadata (DSM-2-C2). For level 2, a human-interpretable representation suffices to pass this requirement. This can be a visual diagram or textual documentation available from the hosting environment’s documentation pages.
Related DSM Indicator
DSM-2-C2
Related FAIR Principle
R1. (Meta)data are richly described with a plurality of accurate and relevant attributes
Project collected Data are organized into structured Dataset(s) and conform to a locally defined Dataset Model
Maturity Level
2
Category
Data Representation
Granularity Level
Dataset
Description
This is a data-modelling related requirement. More specifically, this indicator focuses on the data model used to describe the structure of the Dataset which is the form that data is modelled against for the purpose of being utilized for FAIR sharing and re-use (DSM-2-C1). At Level 2, this model is simply a set of defined Dataset Types or Names and their respective Dataset Fields to be used consistently by all project related Datasets.
This is often represented in the form of pre-defined templates that data owners define for their project data, or that the data hosting environment makes available for importing and exporting the FAIRified Datasets. This guarantees a minimum level of consistency amongst similarly reported datasets, which directly affects the storage capability of the hosting environment. This consistency enables the hosting environment to store and index multiple datasets against this locally defined dataset model and hence offer better searching and discovery capabilities.
Related DSM Indicator
DSM-2-C1
Related FAIR Principle
R1. (Meta)data are richly described with a plurality of accurate and relevant attributes
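The idea of a locally defined Dataset Model as a set of Dataset Types and their Dataset Fields can be sketched as a simple template check. The `vital_signs` type, its fields and the `check_header` helper are hypothetical and only illustrate how a template could be enforced on import.

```python
# Hypothetical local Dataset Model: Dataset Type -> expected Dataset Fields.
DATASET_MODEL = {
    "vital_signs": ["subject", "visit", "sysbp"],
}


def check_header(dataset_type, header, model):
    """Compare a dataset's column header with the template for its type
    and report fields that are missing or not part of the template."""
    expected = model[dataset_type]
    return {
        "missing": [field for field in expected if field not in header],
        "unexpected": [field for field in header if field not in expected],
    }


print(check_header("vital_signs", ["subject", "visit", "sbp"], DATASET_MODEL))
```

Running the same check on every incoming dataset is one way a hosting environment could keep similarly reported datasets consistent enough to store and index together.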
This is a metadata-modelling requirement focusing on the representation of the Dataset Field level and Field Value level metadata required by DSM-2-C4 and DSM-2-C6. This indicator requires that the standard metadata schema chosen to describe the Dataset be able to represent structural metadata about the Dataset. Each Dataset Field will have a name, description, data type and so on. Value-related metadata may include a reference to a local dictionary of controlled terms used in each field. Examples of generic metadata schemas supporting field-level metadata are DATS and Bioschemas Dataset.
This is a data-related formatting requirement. The exchange format used to share the Dataset(s) should be readable by machines. This is not a requirement to use semantic representations; it is simply a requirement to use standard formats (e.g. CSV, JSON, XML or similar) for data exchanged via the relevant API (DSM-2-H2).
Related DSM Indicator
DSM-2-H2
Related FAIR Principle
I1. (Meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.
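To illustrate what machine-readable exchange means in practice, the sketch below parses a small CSV payload and re-serialises it as JSON using only the Python standard library. The column names and values are made up for the example.

```python
import csv
import io
import json

# Hypothetical CSV payload, as it might be exchanged via an API.
csv_text = "subject,visit,sysbp\nS01,sc,120\nS01,fu,118\n"

# Parse CSV into a list of records keyed by column name.
# Note: the csv module returns every value as a string.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Re-serialise the same records as JSON for exchange.
json_text = json.dumps(rows, indent=2)
print(json_text)
```

No semantic annotation is involved here; the point is only that both formats can be parsed by standard tooling without manual intervention.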