10. Creating a data/variable dictionary

Recipe Overview
Reading Time
15 minutes
Executable Code
No
Difficulty
Creating a data/variable dictionary
Recipe Type
Technical Guidance
Maturity Level & Indicator
DSM-2-C6, DSM-2-C7

10.1. Main FAIRification Objectives

A data dictionary is a file (or collection of files) that unambiguously declares, defines and annotates all the variables collected in a project and associated with a dataset.

Building a FAIR data dictionary means delivering a machine-actionable list of variables, thus greatly helping in assessing the interoperability potential of a dataset.

The FAIR data dictionary template presented here is also intended to help current IMI projects and to guide future ones.

The main purposes of this recipe are to:

  • Provide a guide on what factors should be considered when building a data dictionary for data collection, data processing and analysis.

  • Give an example of a data dictionary.

  • Provide an example of a machine-actionable data dictionary template.


10.2. User Stories

A well-defined data dictionary is essential for data curation and analysis. It should contain all the information needed for data collection and subsequent processing of the data.


10.3. Graphical overview


10.4. FAIRification Objectives, Inputs and Outputs

| Actions.Objectives.Tasks | Input | Output |
| --- | --- | --- |
| text annotation | list of variables | machine-actionable list of annotated variables |

10.5. Table of Data Standards

| Data Formats | Terminologies | Models |
| --- | --- | --- |
| CDISC SDTM | schema.org | OMOP |
| CDISC CDASH | bioschema | |
| | EFO | |
| | UO | |
| | EDAM | |

10.6. An Example of Data Dictionary

| File Name | Variable Name | Variable Label | Variable Ontology ID or RDFtype | Variable ID Source | Variable Statistical Type | Variable Data Type | Variable Size | Max Allowed Value | Min Allowed Value | Regex | Allowed Value Shorthands | Allowed Value Descriptions | Computed Value | Unique (alone) | Unique (Combined with) | Required | Collection Form Name | Comments |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1_Subjects.txt | SUBJECT_ID | Subject number | https://schema.org/identifier | https://schema.org | categorical variable | integer | | | | | | | | Y | | Y | FORM 1 | |
| 1_Subjects.txt | SPECIES | Species name | https://schema.org/name | https://schema.org | categorical variable | string | | | | | | | | | | | FORM 1 | |
| 1_Subjects.txt | STRAIN | Strain | TODO substitute broken link https://bioschemas.org/profiles/Taxon/0.6-RELEASE/identifier | https://schemas.org/ | categorical variable | string | | | | | http://purl.obolibrary.org/obo/NCBITaxon_40674 | | | | | | FORM 1 | |
| 1_Subjects.txt | AGE | Age at study initiation | https://bioschemas.org/types/BioSample/0.1-RELEASE-2019_06_19 | https://bioschemas.org/ | continuous variable | integer | | | | | | | | | | Y | FORM 1 | |
| 1_Subjects.txt | AGE_UNIT | Age unit | http://purl.obolibrary.org/obo/UO_0000003 | http://purl.obolibrary.org/obo/uo | categorical variable | string | | | | | | | | | | Y | FORM 1 | |
| 1_Subjects.txt | SEX | Sex | https://schema.org/gender | https://schema.org | categorical variable | enum | | | | | M;F | M=male;F=female | | | | | FORM 1 | |
| 1_Subjects.txt | SOMEDATE | Date of acquiring subject | https://schema.org/dateCreated | https://schema.org | ordinal variable | date | | | | YYYY-MM-DD | | | | | | | FORM 1 | |
| 1_Subjects.txt | HEMOGLOBIN | Hematology: Hemoglobin | http://www.ebi.ac.uk/efo/EFO_0004509 | http://www.ebi.ac.uk/efo | continuous variable | float | 2,1 | 15.0 | 4.0 | | | | | | | | FORM 1 | Field size denotes “places, decimal places” |
| 1_Subjects.txt | HEMOGLOBIN_UNIT | Hemoglobin unit | http://purl.obolibrary.org/obo/UO_0000003 | http://www.ebi.ac.uk/efo | categorical variable | string | | | | | | | | | | | FORM 1 | Field size denotes “places, decimal places” |
| 1_Subjects.txt | HEIGHT | Body size | https://schema.org/height | https://schema.org | continuous variable | float | | 2,5 | 0,5 | | | | | | | | | |
| 1_Subjects.txt | HEIGHT_UNIT | Body size unit | http://purl.obolibrary.org/obo/UO_0000003 | https://schema.org | categorical variable | string | | | | | | | | | | | | |
| 1_Subjects.txt | WEIGHT | Body weight | https://schema.org/weight | https://schema.org | continuous variable | float | | 300 | 25 | | | | | | | | | |
| 1_Subjects.txt | WEIGHT_UNIT | Body weight unit | http://purl.obolibrary.org/obo/UO_0000003 | https://schema.org | categorical variable | string | | | | | | | | | | | | |
| 1_Subjects.txt | BMI | Body mass index | http://www.ebi.ac.uk/efo/EFO_0004340 | http://www.ebi.ac.uk/efo | continuous variable | float | | 100 | 10 | | | | WEIGHT/(HEIGHT*HEIGHT) | | | | | |
| 1_Subjects.txt | LAB | Laboratory | https://schema.org/identifier | https://schema.org | categorical variable | integer | | | | | 1;2;3 | 1=LabA;2=UniversityB;3=CompanyC | | | | | FORM 1 | |
| 2_Samples.txt | SAMPLE_ID | Sample ID | https://schema.org/identifier | https://schema.org | categorical variable | string | | | | | | | | Y | | Y | FORM 2 | |
| 2_Samples.txt | SAMPLE_SITE | Sample collection site | https://bioschemas.org/types/BioSample/0.1-RELEASE-2019_06_19 | https://bioschemas.org/ | categorical variable | string | | | | | | | | | | Y | FORM 2 | |
| 2_Samples.txt | ANALYTE_TYPE | Type of analysis | http://edamontology.org/operation_2945 | http://edamontology.org | categorical variable | string | | | | | http://edamontology.org/operation_2945 | | | | | Y | FORM 2 | |
| 2_Samples.txt | GENOTYPING_CENTER | GENOTYPING_CENTER | https://schema.org/identifier | https://schema.org | categorical variable | string | | | | | | | | | | | FORM 2 | |
| 2_Samples.txt | SEQUENCING_CENTER | SEQUENCING_CENTER | https://schema.org/identifier | https://schema.org | categorical variable | string | | | | | | | | | | | FORM 2 | |
| 3_SampleMapping.txt | SUBJECT_ID | Subject number | https://schema.org/identifier | https://schema.org | ordinal variable | integer | | | | | | | | | SAMPLE_ID | Y | FORM 3 | |
| 3_SampleMapping.txt | SAMPLE_ID | Sample ID | https://schema.org/identifier | https://schema.org | categorical variable | string | | | | | | | | | SUBJECT_ID | Y | FORM 3 | |


10.7. Elements that should be included when building a data dictionary

  • File Name: The file that contains the annotated variable(s).

  • Variable Name: Name of the variable (field).

  • Variable Label: A self-explanatory annotation of the variable.

  • Ontology or RDF type ID: A unique identifier that captures the type of the variable. Semantic types such as schema.org or ontology terms enhance the findability of the data in repositories.

  • ID Source: The source of the identifier for the variable.

  • Variable Data Type: The type of the variable. It is recommended to use the same type definition as will be implemented in the data capturing system (e.g. an xsd:datatype such as {date, integer, float, string}).

  • Variable Type: Specifies unambiguously whether the data associated with the variable should be treated as a continuous variable, a discrete/polychotomous variable or an ordinal variable.

  • Field Size: The size (length) of the variable value, e.g. 8 digits, 5,3 (for floating numbers)…

  • Max Allowed Value: Upper limit of the allowed value.

  • Min Allowed Value: Lower limit of the allowed value.

  • Regex: a regular expression allowing input validation in the case the value should follow a certain pattern (e.g. “\d{5}” for a 5-digit Post Code).

  • Allowed Values: Customised list of allowed values (e.g. “M” and “F” for Gender).

  • Allowed Value Description: Annotation of the list of allowed values (e.g. M=male;F=female).

  • Computed Value: If a field is computed based on values from other fields, annotate the calculation rule (e.g. BMI = WEIGHT/(HEIGHT*HEIGHT)).

  • Unique (alone): If the value in a field should be unique (e.g. Subject ID).

  • Unique (combined with): If the combination of several fields should be unique (e.g. Sample ID and Visit Number).

  • Required: If the field must not be left empty.

  • Collection Form Name: Optional, if the field is collected in certain forms (e.g. in Case Report Forms from a clinical trial).

  • Comments: Optional, for further information.
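
To illustrate how the elements listed above could be captured in a machine-actionable form, here is a minimal sketch in Python. The class name `VariableEntry` and its attribute names are assumptions made for this example and do not come from any standard; the values are those of the SEX row in the example data dictionary.

```python
import json
from dataclasses import dataclass, asdict, field
from typing import Optional


@dataclass
class VariableEntry:
    """One row of the data dictionary (attribute names are illustrative)."""
    file_name: str
    variable_name: str
    variable_label: str
    ontology_id: str
    id_source: str
    statistical_type: str
    data_type: str
    field_size: Optional[str] = None
    max_allowed: Optional[float] = None
    min_allowed: Optional[float] = None
    regex: Optional[str] = None
    # Maps each allowed shorthand to its description, e.g. "M" -> "male".
    allowed_values: dict = field(default_factory=dict)
    computed_value: Optional[str] = None
    unique_alone: bool = False
    unique_combined_with: Optional[str] = None
    required: bool = False
    collection_form: Optional[str] = None
    comments: Optional[str] = None


sex = VariableEntry(
    file_name="1_Subjects.txt",
    variable_name="SEX",
    variable_label="Sex",
    ontology_id="https://schema.org/gender",
    id_source="https://schema.org",
    statistical_type="categorical variable",
    data_type="enum",
    allowed_values={"M": "male", "F": "female"},
    collection_form="FORM 1",
)

# Serialise to JSON so that software, not only people, can read the entry.
sex_json = json.dumps(asdict(sex), indent=2)
print(sex_json)
```

Keeping the allowed shorthands and their descriptions in one mapping is a design choice of this sketch; the tabular template keeps them in two separate columns.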

10.7.1. What fields to include in a data dictionary?

The right fields to include in a data dictionary are strongly dependent on the needs of the project and its context.

  • As a starting point, review existing community standards or minimum information checklists for your subject area to identify recommended fields (see for example recipes on minimal metadata profiles for transcriptomics metadata and guidance on creating minimal metadata profiles). We recommend consulting three key resources:

  • Make sure you capture all relevant variables for your planned analyses, in particular if you plan any non-standard or novel analyses. Also, ensure that variables are captured in the correct format (standardised if appropriate) in order to minimise the need for transformations later.

  • Capture variables in the most atomic form possible as it is easier to aggregate separate fields into a new, combined value than to extract values from a larger field.

  • Reduce free text use to a minimum for value-sets associated with qualitative or ordinal variables by providing lists of controlled values from standardised vocabularies (e.g. NCI Thesaurus or CDISC vocabulary) suited to the context you operate in (e.g. LOINC or SNOMED-CT in a clinical context).

  • Provide an unambiguous textual definition for each variable, ideally anchored in semantic markup, so that third-party users can understand what the variable represents instead of having to second-guess obscure variable shorthands.

  • Provide units, and where possible, acceptable ranges for continuous variables.

  • Provide regular expressions for input validation where needed (e.g. when expecting an identifier or a particular reporting pattern).

  • Provide the formula if derived variables are computed from primary variables.
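
The recommendations above on controlled values, ranges and regular expressions can be turned into automated checks. The sketch below shows one possible way to do this using only the Python standard library. The `CONSTRAINTS` structure, the `validate` function and both regular expressions are hypothetical; the WEIGHT range and the SEX values are taken from the example data dictionary.

```python
import re

# Hypothetical constraints, mirroring columns of the example data dictionary.
CONSTRAINTS = {
    "SUBJECT_ID": {"required": True, "regex": r"\d+"},
    "SEX": {"allowed": {"M", "F"}},
    "WEIGHT": {"min": 25.0, "max": 300.0},
    "SOMEDATE": {"regex": r"\d{4}-\d{2}-\d{2}"},
}


def validate(record):
    """Return a list of human-readable problems found in one record."""
    problems = []
    for name, rule in CONSTRAINTS.items():
        value = record.get(name, "")
        if value == "":
            # Empty cells are only a problem for required fields.
            if rule.get("required"):
                problems.append(f"{name}: missing required value")
            continue
        if "allowed" in rule and value not in rule["allowed"]:
            problems.append(f"{name}: {value!r} is not an allowed value")
        if "regex" in rule and not re.fullmatch(rule["regex"], value):
            problems.append(f"{name}: {value!r} does not match {rule['regex']}")
        if "min" in rule or "max" in rule:
            try:
                number = float(value)
            except ValueError:
                problems.append(f"{name}: {value!r} is not a number")
                continue
            if "min" in rule and number < rule["min"]:
                problems.append(f"{name}: {number} is below {rule['min']}")
            if "max" in rule and number > rule["max"]:
                problems.append(f"{name}: {number} is above {rule['max']}")
    return problems


good = {"SUBJECT_ID": "12", "SEX": "F", "WEIGHT": "61.5", "SOMEDATE": "2020-03-01"}
bad = {"SUBJECT_ID": "", "SEX": "X", "WEIGHT": "500", "SOMEDATE": "01/03/2020"}

print(validate(good))
for problem in validate(bad):
    print(problem)
```

Values are kept as strings here, as they would arrive from a delimited text file, and are only converted to numbers when a range check requires it.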

10.7.2. A note on using standards such as CDISC

Comprehensive standards such as CDISC offer a complete tabulation model for data capture, consolidation and analysis. CDISC should not be used in a cherry-picking fashion to map individual variables; rather, full compliance with the standard should be ensured, both structurally and in terms of what data is collected.

CDISC-compliant datasets group variables slightly differently from the format suggested here. Records are grouped by Domain, such as vital signs (VS) and demographics (DM). Each record represents a single measurement, so rather than capturing both height and weight in one record, as in the data dictionary here, these would be separate records in the VS domain, with test name (VSTESTCD) height or weight, respectively. CDISC also has a specific way of cross-referencing records, which does not map cleanly to the simpler approach suggested in this sample data dictionary. For further information on the CDISC model, please visit https://www.cdisc.org/.
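
The structural difference described above, one wide record versus one record per measurement, can be sketched as follows. This is only an illustration of the reshaping step: the function name `to_vs_like_records` and the sample values are made up, the output borrows a handful of SDTM variable names (USUBJID, VSTESTCD, VSORRES, VSORRESU), and the result is not a compliant VS dataset.

```python
# One wide record, shaped like the example data dictionary (values are made up).
wide_record = {
    "SUBJECT_ID": 1,
    "HEIGHT": 1.75, "HEIGHT_UNIT": "m",
    "WEIGHT": 70.0, "WEIGHT_UNIT": "kg",
}


def to_vs_like_records(record, tests=("HEIGHT", "WEIGHT")):
    """Split a wide record into one record per measurement.

    The output keys borrow SDTM variable names for illustration only.
    """
    rows = []
    for test in tests:
        if record.get(test) is None:
            continue  # nothing was measured for this test
        rows.append({
            "USUBJID": str(record["SUBJECT_ID"]),
            "VSTESTCD": test,
            "VSORRES": str(record[test]),
            "VSORRESU": record.get(f"{test}_UNIT"),
        })
    return rows


vs_rows = to_vs_like_records(wide_record)
for row in vs_rows:
    print(row)
```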

10.7.3. Indicate how missing values are dealt with

Data collection is never plain sailing. Patients drop out of studies, animals die, cell cultures or laboratory tests fail. This results in holes in the datasets. Without a clear plan to record missing data points unambiguously, empty cells in a record can cause problems during analysis. It is therefore good practice to detail in the data dictionary what constitutes a legitimate way to indicate a missing value, which should be interpreted as null.

How this needs to be specified varies with the persistence system. We provide an example of how to do so in the context of a Frictionless Tabular Data Package. The specifications give more detail on how missing values can be declared, for example with any one of the following:

"missingValues": [""]
"missingValues": ["-"]
"missingValues": ["NaN", "-"]
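
As an illustration of how such a declaration might be applied when reading data, here is a minimal sketch using only the Python standard library. It does not use the Frictionless libraries themselves; the schema fragment, the `read_rows` helper and the sample data are assumptions made for this example.

```python
import csv
import io
import json

# Hypothetical Table Schema fragment; field names follow the example dictionary.
schema = json.loads("""
{
  "fields": [
    {"name": "SUBJECT_ID", "type": "integer"},
    {"name": "WEIGHT", "type": "number"}
  ],
  "missingValues": ["", "-", "NaN"]
}
""")

raw = "SUBJECT_ID,WEIGHT\n1,70.5\n2,-\n3,\n4,NaN\n"


def read_rows(text, schema):
    """Parse CSV text, turning declared missing-value markers into None."""
    missing = set(schema["missingValues"])
    casts = {"integer": int, "number": float}
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        parsed = {}
        for spec in schema["fields"]:
            cell = row[spec["name"]]
            # Compare against the markers before casting, so "-" never reaches float().
            parsed[spec["name"]] = None if cell in missing else casts[spec["type"]](cell)
        rows.append(parsed)
    return rows


rows = read_rows(raw, schema)
print(rows)
```

Checking the raw string against the declared markers before any type conversion means that a marker such as "NaN" is treated as a missing value rather than as a floating-point number.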

10.7.3.1. Remember to provide descriptive metadata for the data dictionary itself

10.7.3.2. Remember to provide the data dictionary in an open syntax

10.8. Data Dictionary Mapping in FAIRplus

While the most desirable approach is to design a fully FAIR data dictionary at the start of a project, it is possible to FAIRify a data dictionary retroactively. The FAIRplus project is working with the Innovative Medicines Initiative (IMI) APPROACH and ABIRISK projects to assist with the FAIRification of their data dictionaries, with a view to improving both the findability and interoperability of their datasets.

10.9. Conclusion

This recipe covered an essential output of any research program, namely a data dictionary that documents all variables recorded about study subjects and the key metadata descriptors used in subsequent analysis. Creating and providing such a data dictionary should be a central component of any data management plan and one of the key deliverables of any IMI project. Why? Simply because it enables several key data management processes:

  • First, it forces data owners to carefully structure core metadata and annotation requirements, by spelling out the nature, purpose and constraints on the data collection.

  • Second, it provides data owners the means to communicate about their scientific outputs, without necessarily disclosing the actual data collected over the course of the projects. It simply brings clarity and removes ambiguity about collected metadata and data. This clarity helps gauge reusability potential as well as interoperability potential of datasets.

  • Third, the availability of the data dictionary proves extremely useful for any curation work, from preparing an ETL process to planning mappings across ontological frameworks. This is especially true if the data dictionary clearly identifies the semantic resources relied upon in a project.

  • Finally, in the context of the Innovative Medicines Initiative, delivering data dictionaries contributes to making research output more FAIR.

10.10. Authors