Findability: Search Engine Optimization


Recipe metadata

identifier: RX.X

version: v1.0

Difficulty level

Reading Time: 10 minutes

Recipe Type: Guidance

Executable Code: No

Intended Audience: Software Developers, Data Scientists


Main Objectives

The main purpose of this recipe is:

to describe what search engine optimization is and to show how to implement markup with the Schema.org vocabulary, and its Bioschemas extension, to improve the discovery and visibility of pages to web page indexers.

There are sub-recipes for embedding search engine optimization into web pages about a specific type of resource:


Graphical Overview of the FAIRification Recipe Objectives

graph TD
    A(HTML page):::box -->|Search Engine Optimization| B{What type of page?}:::box
    B --> C(Dataset):::box
    B --> D(Data catalog):::box
    B --> E(Data page):::box
    E --> F{What type of data page}:::box
    F --> G(Chemical Substance):::box
    F --> H(Gene):::box
    F --> I(Molecular entity):::box
    F --> J(Protein):::box
    F --> K(Sample):::box
    F --> L(Taxon):::box
    C --> M
    D --> M
    G --> M
    H --> M
    I --> M
    J --> M
    K --> M
    L --> M(Schema.org augmented HTML page):::box
    M --> N(fa:fa-search fa:fa-cog fa:fa-fighter-jet improved discoverability):::box
    classDef box font-family:avenir,font-size:14px,fill:#300861,stroke:#222,color:#fff,stroke-width:1px
    linkStyle 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19 stroke:#2a9fc9,stroke-width:1px,color:#2a9fc9,font-family:avenir;

Capability & Maturity Table

| Capability | Initial Maturity Level | Final Maturity Level |
| --- | --- | --- |
| Findability | minimal | repeatable |
| Interoperability | minimal | |

Main body of the recipe

Finding web pages

Providers of content for the Internet serve documents formatted in HTML. The web pages are hosted on servers and accessed via the HTTP protocol. HTML pages can be styled with Cascading Style Sheets (CSS), and interactivity can be delivered via scripting languages such as JavaScript.

With billions of web pages served, a key issue is finding content. To assist in this task, search engines (e.g. Bing, Google, Yandex, Qwant) have been built. They work by crawling the web and performing brute-force keyword indexing, by reading specific files served by the server (e.g. a sitemap), or by targeting specific data structures embedded in the web pages themselves.
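
As an illustration of the "specific files served by the server" mentioned above, here is a minimal sketch that generates a sitemap with the Python standard library. The page URLs and dates are hypothetical, and the function name `build_sitemap` is ours; only the `urlset`, `url`, `loc` and `lastmod` elements and the namespace come from the sitemap protocol.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemap protocol (sitemaps.org).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Return sitemap XML (str) for an iterable of (url, last_modified) pairs."""
    # Serialize the sitemap namespace as the default namespace (no prefix).
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for url, last_modified in pages:
        entry = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(entry, f"{{{SITEMAP_NS}}}loc").text = url
        ET.SubElement(entry, f"{{{SITEMAP_NS}}}lastmod").text = last_modified
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical pages; replace with the real URLs of the site.
sitemap_xml = build_sitemap([
    ("https://example.org/datasets/1", "2021-03-01"),
    ("https://example.org/datasets/2", "2021-03-15"),
])
print(sitemap_xml)
```

The resulting file is typically served from the root of the site so that crawlers can enumerate the pages without following every link.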

What is search engine optimization

Search engines index pages based on their content, as identified by web crawlers. Any misclassification or error in concept identification can therefore affect where a given page appears in the search results. Website designers, maintainers and engineers have developed various techniques to improve ranking in search results. Because ranking significantly affects traffic to a website, and possibly revenue, especially where it depends on advertising, search engine optimization (SEO) covers any practice that aims to improve the position of a web page in search results.

Schema.org Vocabulary

In 2011, a consortium of search engines (Google, Bing and Yahoo!, later joined by Yandex) combined forces to create a structured vocabulary for identifying and annotating entities, so that search engines can index them more efficiently, bringing the power of semantics into the picture. Priorities for adding content to this vocabulary are set by various factors, chiefly content advertising and relevance. Compared to plain keyword-based indexing, annotation with a structured vocabulary affords gains such as query expansion and improved content validation.

How does Schema.org work in practice?

The principle is fairly simple: machine-readable content is embedded in the HTML file. Several syntaxes are available (RDFa, Microdata, JSON-LD). JSON-LD is widely recommended as the most suitable approach.
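
The embedding step can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the helper name `jsonld_script_tag` is hypothetical, and the metadata values are borrowed from the journal example used in this recipe.

```python
import json


def jsonld_script_tag(metadata):
    """Serialize a dict as JSON-LD wrapped in an HTML script element."""
    payload = json.dumps(metadata, indent=2, ensure_ascii=False)
    # Escape "</" so that a string value cannot close the script element early.
    # "\/" is a valid JSON escape for "/", so the data is unchanged once parsed.
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


# Values taken from the journal example in this recipe.
periodical = {
    "@context": "http://schema.org",
    "@type": "Periodical",
    "name": "The Lancet",
    "issn": "0140-6736",
    "publisher": "Elsevier",
}

html_page = f"""<!DOCTYPE html>
<html>
<head>
<title>The Lancet</title>
{jsonld_script_tag(periodical)}
</head>
<body><h1>The Lancet</h1></body>
</html>"""
print(html_page)
```

Generating the markup from the same record that drives the visible page helps keep the human-readable and machine-readable descriptions in sync.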

Below is a plain HTML fragment providing information about one volume of a scientific journal.

<!-- A list of the issues for a single volume of a given periodical. -->
<div>
 <h1>The Lancet</h1>
 <p>Volume 376, July 2010-December 2010</p>
 <p>Published by Elsevier</p>
 <ul>
   <li>ISSN: 0140-6736</li>
 </ul>
 <h3>Issues:</h3>
 <ul>
   <li>No. 9734 Jul 3, 2010 p 1-68</li>
   <li>No. 9735 Jul 10, 2010 p 69-140</li>
 </ul>
</div>

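Next is a JSON-LD document describing comparable bibliographic information (a periodical, one of its issues, and an article in that issue) using Schema.org types such as PublicationIssue and ScholarlyArticle. Note how the JSON-LD is embedded in the page with the HTML script tag.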

<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@graph": [
    {
        "@id": "#issue",
        "@type": "PublicationIssue",
        "issueNumber": "5",
        "datePublished": "2012",
        "isPartOf": {
            "@id": "#periodical",
            "@type": [
                "PublicationVolume",
                "Periodical"
            ],
            "name": "Cataloging & Classification Quarterly",
            "issn": [
                "1544-4554",
                "0163-9374"
            ],
            "volumeNumber": "50",
            "publisher": "Taylor & Francis Group"
        }
    },
    {
        "@type": "ScholarlyArticle",
        "isPartOf": "#issue",
        "description": "The library catalog as a catalog of works was an infectious idea, which together with research led to reconceptualization in the form of the FRBR conceptual model. Two categories of lacunae emerge--the expression entity, and gaps in the model such as aggregates and dynamic documents. Evidence needed to extend the FRBR model is available in contemporary research on instantiation. The challenge for the bibliographic community is to begin to think of FRBR as a form of knowledge organization system, adding a final dimension to classification. The articles in the present special issue offer a compendium of the promise of the FRBR model.",
        "sameAs": "https://doi.org/10.1080/01639374.2012.682254",
        "about": [
            "Works",
            "Catalog"
        ],
        "pageEnd": "368",
        "pageStart": "360",
        "name": "Be Careful What You Wish For: FRBR, Some Lacunae, A Review",
        "author": "Smiraglia, Richard P."
    }
  ]
}
</script>

JSON-LD is an official serialization of RDF, so the document is interpreted as a graph holding a set of triples. The indexing algorithms of search engines exploit the availability of such semantic statements in a web page to provide improved search results.
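
To make the "set of triples" idea concrete, the sketch below pulls the JSON-LD out of a page and lists subject, property and value statements using only the Python standard library. It is a rough approximation under stated assumptions: it does not expand terms against the `@context` as a full JSON-LD processor would, and the names `JsonLdExtractor` and `iter_statements`, as well as the `_:b…` labels for nodes without an `@id`, are ours.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed content of every application/ld+json script element."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.documents.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False


def iter_statements(node, label):
    """Yield (subject, property, value) tuples; a naive stand-in for RDF triples."""
    subject = node.get("@id", label)
    for key, value in node.items():
        if key in ("@id", "@context"):
            continue
        values = value if isinstance(value, list) else [value]
        for i, item in enumerate(values):
            if isinstance(item, dict):
                # Nested node: link to it, then describe it in turn.
                child = item.get("@id", f"{label}.{key}{i}")
                yield (subject, key, child)
                yield from iter_statements(item, child)
            else:
                yield (subject, key, item)


# Shortened version of the JSON-LD example in this recipe.
page = """<html><head>
<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@graph": [
    {"@id": "#issue", "@type": "PublicationIssue", "issueNumber": "5"},
    {"@type": "ScholarlyArticle", "isPartOf": "#issue", "pageStart": "360"}
  ]
}
</script>
</head><body></body></html>"""

extractor = JsonLdExtractor()
extractor.feed(page)

triples = []
for doc in extractor.documents:
    for n, node in enumerate(doc.get("@graph", [doc])):
        triples.extend(iter_statements(node, f"_:b{n}"))

for triple in triples:
    print(triple)
```

A crawler does something similar in spirit: it locates the script element, parses the JSON, and turns each node into statements it can index.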

Tools supporting creation and validation of structured data

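Google produced an online tool that lets developers test their annotations before rolling them out to production, known as the Google Structured Data Testing Tool. Google has since retired this tool: checks for Google-specific features moved to the Rich Results Test, and the general-purpose validator is now maintained by Schema.org as the Schema Markup Validator.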
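
Before submitting markup to an online validator, a quick local sanity check can catch the most basic mistakes. The sketch below is only that: it checks that the text is valid JSON, that the `@context` mentions schema.org, and that every node declares an `@type`. The function name `check_jsonld` and the three rules are our assumptions, not the rules applied by any search engine.

```python
import json


def check_jsonld(text):
    """Return a list of problems found in a JSON-LD string (empty if none)."""
    try:
        doc = json.loads(text)
    except json.JSONDecodeError as err:
        return [f"not valid JSON: {err}"]
    if not isinstance(doc, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    # @context may be a string, an object or a list, so search its serialized form.
    if "schema.org" not in json.dumps(doc.get("@context", "")):
        problems.append("@context does not reference schema.org")

    # Either the document is a single node or it wraps several nodes in @graph.
    for index, node in enumerate(doc.get("@graph", [doc])):
        if not isinstance(node, dict) or "@type" not in node:
            problems.append(f"node {index} has no @type")
    return problems


print(check_jsonld('{"@context": "http://schema.org", "@type": "Dataset"}'))
print(check_jsonld('{"@context": "http://schema.org", "@graph": [{"name": "x"}]}'))
```

Passing this check says nothing about whether the properties are appropriate for the declared type; that is what the online validators are for.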

Bioschemas: trying to address the coverage gap

Schema.org development is mainly driven by commercial applications, and scientific use cases were not a high priority until recently. The Covid-19 pandemic exposed the need to find datasets and disease-related information more effectively. This proved to be good timing for the Bioschemas project, which has been running for a few years with the support of the European ELIXIR organization. Bioschemas focuses on making Schema.org more relevant for the life sciences community by providing:

  1. types for life sciences entities such as chemicals, genes, and proteins.
  2. profiles that identify the most pertinent properties for marking up a life sciences resource of a specific type, to make it more findable.
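
The idea of a profile can be sketched as a list of properties that markup of a given type is expected to carry. In the example below, the property list, the profile IRI and its version, the use of the Dublin Core `conformsTo` property to point at the profile, and all dataset values are assumptions made for illustration; the profile pages on the Bioschemas website are the authoritative source.

```python
import json

# Assumption: an illustrative set of "minimum" properties for a Dataset profile.
ASSUMED_MINIMUM = ("name", "description", "identifier", "keywords", "license", "url")

# Assumption: profile IRI and version; check the Bioschemas site for the current release.
ASSUMED_PROFILE = "https://bioschemas.org/profiles/Dataset/1.0-RELEASE"


def missing_properties(markup, required=ASSUMED_MINIMUM):
    """List the required properties that are absent or empty in the markup."""
    return [prop for prop in required if not markup.get(prop)]


# Hypothetical dataset description.
dataset = {
    "@context": "http://schema.org",
    "@type": "Dataset",
    "http://purl.org/dc/terms/conformsTo": {"@id": ASSUMED_PROFILE},
    "name": "Example expression dataset",
    "description": "Hypothetical gene expression measurements used for illustration.",
    "identifier": "https://example.org/datasets/1",
    "url": "https://example.org/datasets/1",
    "keywords": ["gene expression"],
    "license": "https://creativecommons.org/licenses/by/4.0/",
}

print(missing_properties(dataset))
print(json.dumps(dataset, indent=2))
```

A resource that declares which profile it follows lets consumers decide which properties they can rely on when harvesting the markup.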

The main profiles currently specified by the Bioschemas organisation are as follows:


FAIRification Objectives, Inputs and Outputs

| Actions.Objectives.Tasks | Input | Output |
| --- | --- | --- |
| text annotation | Schema.org | annotated text |
| validation | Schema.org | report |

Table of Data Standards

| Data Formats | Terminologies | Models |
| --- | --- | --- |
| JSON-LD | Schema.org | RDF |
| JSON-LD | Bioschemas | RDF |

Authors:

| Name | Affiliation | ORCID | CRediT role |
| --- | --- | --- | --- |
| Philippe Rocca-Serra | University of Oxford, Data Readiness Group | 0000-0001-9853-5668 | Writing - Original Draft |
| Alasdair Gray | Bioschemas Community Lead / Heriot-Watt University / ELIXIR-UK | 0000-0002-5711-4872 | Contributions to text |
| Leyla Garcia | Bioschemas Community / ZB MED Information Centre for life sciences, Knowledge Management Group | 0000-0003-3986-0510 | External review |

License:

This page is released under the Creative Commons Attribution 4.0 (CC BY 4.0) license.