13. File format validation, FASTQ example

Recipe Overview

Reading Time	15 minutes
Executable Code	Yes
Difficulty	Validating file format - FASTQ example
Recipe Type	Hands-on
Audience	Principal Investigator, Data Manager, Data Scientist
Maturity Level & Indicator	not applicable

Cite me with FCB030

13.1. Main Objectives

The main purpose of this recipe is to:

provide a FASTQ file validation solution

propose a general file validation workflow.

13.2. Graphical Overview

13.3. User Stories

The table below lists common file validation use cases. This recipe uses FASTQ files [1] as the worked example.

As a ..	I want to ..	So that I can ..
Data owner	Validate my sequencing files before depositing them in public archives	Reduce the risk of submitting invalid files or having a submission rejected
Data consumer	Validate files before running analysis	Avoid wasting time and resources processing corrupted files
Data consumer	Integrate file format validation into my data processing pipeline	Build a more reproducible and error-proof pipeline
Data librarian	Check files downloaded from unknown sources before deposition	Ensure the files remain usable in the future

13.4. FAIRification Objectives, Inputs and Outputs

Actions.Objectives.Tasks	Input	Output
Format validation	FASTQ file	Validation results

13.5. Table of Data Standards

Data Formats	Terminologies	Models
FASTQ
Compressed Format

FASTQ is the de facto standard format for sequencing data and one of the most common file formats in bioinformatics analysis [2, 4]. Researchers receive FASTQ files from various sources and use them intensively in automated analysis pipelines. Validating FASTQ files therefore improves data reusability and helps build error-proof data analysis processes.

FASTQ validators detect truncated reads, mismatches between the number of base calls and the number of quality scores, invalid encoding, and similar problems. For paired-end reads, they also check that the forward reads match the reverse reads. Most validators recognise the different FASTQ variants automatically and can read compressed FASTQ files.
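To make these checks concrete, the sketch below tests an uncompressed FASTQ file for three of the problems listed above: a malformed 4-line record, a length mismatch between base calls and quality scores, and a truncated final record. It is an illustration only and is not how any particular validator is implemented; the function name `check_fastq_basic` is hypothetical.

```shell
# Hypothetical helper: minimal structural check of an uncompressed FASTQ file.
# Illustration only; it does not check encodings, duplicates, or read pairing.
check_fastq_basic() {
  awk '
    # Line 1 of each record: header must start with "@"
    NR % 4 == 1 && substr($0, 1, 1) != "@" {
      printf "line %d: header does not start with @\n", NR; bad = 1; exit
    }
    # Line 2: remember how many base calls the read has
    NR % 4 == 2 { seqlen = length($0) }
    # Line 3: separator must start with "+"
    NR % 4 == 3 && substr($0, 1, 1) != "+" {
      printf "line %d: separator does not start with +\n", NR; bad = 1; exit
    }
    # Line 4: one quality score per base call
    NR % 4 == 0 && length($0) != seqlen {
      printf "line %d: %d quality scores for %d base calls\n", NR, length($0), seqlen
      bad = 1; exit
    }
    END {
      if (bad) exit 1
      if (NR % 4 != 0) {
        printf "truncated record: %d lines is not a multiple of 4\n", NR
        exit 1
      }
      printf "OK: %d records\n", NR / 4
    }
  ' "$1"
}
```

The function prints the first problem it finds and returns a non-zero status, so it can be used in a shell condition. A dedicated validator remains the better choice for real data.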

FASTQ-utils is open-source software for validating and processing FASTQ files. It has been used by the European Nucleotide Archive (ENA) and several research initiatives.

This recipe provides an example of validating FASTQ files with FASTQ-utils on macOS and Linux machines.

Warning

⚠️ Quality control is out of the scope of file format validation.

13.5.1. Requirements

Users are expected to be comfortable with a Unix-based operating system and with basic Bash syntax and commands.

Software	Description	Version
conda	Package manager for installing validators	4.8.3
FASTQ-utils	FASTQ validator	0.23.0
wget	File downloader	1.19.4
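Before starting, it can help to confirm that the required commands are available. The sketch below is a small preflight check; the function name `missing_tools` is hypothetical, and `fastq_info` is the FASTQ-utils executable used later in this recipe.

```shell
# Hypothetical preflight helper: print each required command that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" > /dev/null 2>&1 || echo "$tool"
  done
}
```

For example, `missing_tools conda wget` prints nothing when both commands are installed. After Step 1, `missing_tools fastq_info` can be used in the same way to confirm the installation.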

13.5.2. Step 1: Install fastq-utils

The command below installs fastq-utils with Conda from the Bioconda channel. It is also possible to install fastq-utils from source [3].

conda install -c bioconda fastq_utils

13.5.3. Step 2: Get example files for testing

Note

Users can skip this step and test with their own files.

In this step, we download example FASTQ files from ENA for testing. The first example is a single-read file; the others are paired-end read files.

Example 1: Get single read FASTQ file

The command below downloads an Ion Torrent S5 FASTQ file from ENA. The file contains whole-genome sequencing reads of SARS-CoV-2 and is 192 MB in size.

wget -c ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR121/077/SRR12132977/SRR12132977.fastq.gz

Users can inspect the fastq.gz file with gzip -cd SRR12132977.fastq.gz | head -8. The first two records of this FASTQ file are shown below.

@SRR12132977.1 1/1
AACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACAAACTAAAATGTCTGATAATGGACCCCAAAATCAGCGAAATGCACCCCGCATTACGTTTGGTGGACCCTCAG
+
C@CCD>DBC?B692;;;09?<BBBBC>BBBBBBBBB@?ABB@BC<BBB>@A?:999992;=>>@??==:=C;>=<:'555)8;;;;;AG:AAAAADD;CCBB>?@;;;0:<@A>CEE?CFCC
@SRR12132977.2 2/1
AACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAACTAATTACTGTCGTTGACAGGACACGAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTGTTGCAGCCGATCATCAGCAC
+
A>A@@=@@F@D@C<999,:<@ABBBB@B=>=BB@BBB?@@><;;7>??=BBB>BDD;D>????@@;@CDC@@@BBB>BBB@AAC>>9BBBB;;;@@?;><::;99<9<;A;>><@@A:=:>@@@>A@>:>===>:=<<>>;;;>=BCAA?>=A>>>:==>;998<=;===@@@<>>9>>>?;??==:=>>>>:>>;;;;;;;<;;
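Downloads are sometimes interrupted, so it can be worth testing the gzip container itself before looking at the FASTQ content. The sketch below first runs `gzip -t`, then counts lines and complete 4-line records. The function name `gz_fastq_summary` is hypothetical, and the check only shows that the archive decompresses cleanly; it does not validate the FASTQ format.

```shell
# Hypothetical helper: test the gzip container, then count lines and 4-line records.
gz_fastq_summary() {
  # gzip -t returns a non-zero status for a corrupt or truncated archive
  gzip -t "$1" || { echo "gzip integrity check failed: $1" >&2; return 1; }
  gzip -cd "$1" |
    awk 'END { printf "%d lines, %d complete records\n", NR, int(NR / 4) }'
}
```

Running `gz_fastq_summary SRR12132977.fastq.gz` after the download gives a quick sanity check before the full validation in Step 3.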

Example 2: Get paired-read FASTQ files

The command below downloads Illumina iSeq 100 paired-end sequencing files from ENA. These files contain raw sequence reads of a SARS-CoV-2 sample. Each file is 26 MB.

wget -c \
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR115/044/SRR11542244/SRR11542244_1.fastq.gz \
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR115/044/SRR11542244/SRR11542244_2.fastq.gz

The first two records of each file are shown below. The read-pair information is encoded in the read IDs: forward reads end in /1 and reverse reads end in /2.

# Header of the forward read, SRR11542244_1.fastq.gz
@SRR11542244.1 1/1
GTGTGTGTATACATATATATATATATCACATTTTCTTTATCCATTTATCTGTTGTTGGACACTTAGGTTGATTCCATATCTTGGCTATTGTGAATAGTG
+
,,FFFFFFFFFFFFFFFFFFFFFFFF:FFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFF
@SRR11542244.2 2/1
GTGATTCCTCAAAGATTTAGAACCAGAAATACCATGTGACCCAGCAATTCCATTACCAGGTCTAAACCCAAAGGAATATAAATCATTCTGTAATGAAGATA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF


# Header of the reverse read, SRR11542244_2.fastq.gz
@SRR11542244.1 1/2
CTATTGGGTATTTAATCCAAAGAAAGGAAATCGGTATATCAAAGAGACATCTGCATGCCCATGTTTATTGTAGCACTATTCACAATAGCCAAGATATGGAA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFF
@SRR11542244.2 2/2
GAACATATGTGTGCATGTATCTTCATTACAGAATGATTTATATTCCTTTGGGTTTAGACCTGGTAATGGAATTGCTGGGTCACATGGTATTTCTGGTTCTA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
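The pairing convention shown above can be checked by hand. The sketch below extracts the header line of every record from two gzipped files, strips the trailing /1 or /2, and compares the two lists in order. It assumes ENA-style headers like the ones in this example; the function names `read_ids` and `check_pair_ids` are hypothetical, and the sketch is not a substitute for the paired-end validation in Step 3.

```shell
# Hypothetical helper: print read IDs of a gzipped FASTQ file without the /1 or /2 suffix.
# Assumes ENA-style headers such as "@SRR11542244.1 1/1".
read_ids() {
  gzip -cd "$1" | awk 'NR % 4 == 1 { sub(/\/[12]$/, ""); print }'
}

# Hypothetical helper: succeed only if both files list the same read IDs in the same order.
check_pair_ids() {
  ids1=$(mktemp)
  ids2=$(mktemp)
  read_ids "$1" > "$ids1"
  read_ids "$2" > "$ids2"
  if cmp -s "$ids1" "$ids2"; then
    result=0
    echo "read IDs match"
  else
    result=1
    echo "read IDs differ"
  fi
  rm -f "$ids1" "$ids2"
  return "$result"
}
```

For the files downloaded above, the call would be `check_pair_ids SRR11542244_1.fastq.gz SRR11542244_2.fastq.gz`.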

13.5.4. Step 3: Perform validation

The command below validates the single read file in Example 1.

fastq_info -r SRR12132977.fastq.gz

Below are the validation results. fastq_info reports the number of reads, read length details, and the quality encoding. The Quality encoding field indicates the FASTQ variant. fastq_info prints OK for a valid FASTQ file; otherwise, it prints an error message with the validation details.

Skipping check for duplicated read names
1900000
------------------------------------
Number of reads: 1919741
Quality encoding range: 34 77
Quality encoding: 33
Read length: 25 352 215
OK

Paired-end reads are validated in the same way, by passing both files to fastq_info.

fastq_info SRR11542244_1.fastq.gz SRR11542244_2.fastq.gz

Here are the validation results.

fastq_utils 0.23.0
DEFAULT_HASHSIZE=39000001
Scanning and indexing all reads from SRR11542244_1.fastq.gz
700000Scanning complete.

Reads processed: 733611
Memory used in indexing: ~47 MB
File SRR11542244_1.fastq.gz processed
Next file SRR11542244_2.fastq.gz
700000
------------------------------------
Number of reads: 733611
Quality encoding range: 35 70
Quality encoding: 33
Read length: 35 101 96
OK

fastq_info also accepts additional options to tune the validation:

-s: check that the reads in the two files are in the same order.

-r: skip the check for duplicated read names. This uses less memory and runs faster.

-e: allow empty files to pass validation.

-q: do not fail if the quality encoding cannot be determined.
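To use the validator as a gate in a pipeline, the call can be wrapped so that later steps run only when validation succeeds. The sketch below assumes that the validator exits with status 0 for valid input and with a non-zero status otherwise; check this for the installed version before relying on it. The names `validate_or_stop`, `FASTQ_VALIDATOR`, and `VALIDATION_LOG` are hypothetical.

```shell
# Hypothetical pipeline gate around the validator.
# Assumption: the validator exits with 0 for valid input and non-zero otherwise.
FASTQ_VALIDATOR="${FASTQ_VALIDATOR:-fastq_info}"
VALIDATION_LOG="${VALIDATION_LOG:-fastq_validation.log}"

validate_or_stop() {
  # All arguments (options and file names) are passed through unchanged
  if "$FASTQ_VALIDATOR" "$@" > "$VALIDATION_LOG" 2>&1; then
    echo "valid: $*"
  else
    echo "invalid: $* (see $VALIDATION_LOG)" >&2
    return 1
  fi
}
```

A pipeline step could then be written as `validate_or_stop -r SRR12132977.fastq.gz && echo "start analysis"`, so that the analysis starts only for files that pass.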

13.5.4.1. Error messages for invalid files

If a file is invalid, FASTQ-utils prints an error message giving the location of the invalid line and the type of error. Two examples follow.

Invalid file example 1, duplicated reads

ERROR: Error in file SRR11542244_2.fastq: line 16: duplicated sequence SRR11542244.5 5/

Invalid file example 2, wrong base call encoding

ERROR: Error in file SRR11542244_2.fastq: line 5: invalid character 'e' (hex. code:'65'), expected ACGTacgt0123nN.
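When validation runs unattended, it can be useful to pull the file name, line number, and reason out of such a message. The sketch below does this with sed, assuming the "ERROR: Error in file <file>: line <n>: <reason>" layout of the two examples above; other messages may be laid out differently. The function name `parse_fastq_error` is hypothetical.

```shell
# Hypothetical helper: read validator output on stdin and print one summary line
# per error. Lines that do not match the assumed layout are ignored.
parse_fastq_error() {
  sed -n 's/^ERROR: Error in file \(.*\): line \([0-9][0-9]*\): \(.*\)$/file=\1 line=\2 reason=\3/p'
}
```

With the line number in hand, the offending record can be displayed with a command such as `sed -n '13,16p' SRR11542244_2.fastq`, since line 16 is the last line of the fourth record.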

13.5.5. FASTQ-utils feature summary

The table lists technical considerations for selecting a validator, including basic validation functions, performance, and interface. It also summarises which of these features fastq-utils provides.

Aspects	Validation content	Description	FASTQ-utils
Basic validation	4-line format	Check that each record in the FASTQ file spans four lines	☑️
	Character encoding	Check that the base call and quality score encodings are correct	☑️
	Read length	Check that the number of base calls equals the number of quality scores	☑️
	File truncation	Check whether the file is truncated	☑️
Paired-end reads validation	Deinterleaved paired reads	Validate when the forward and reverse reads are in two files	☑️
	Interleaved “8-line” files	Validate when the forward and reverse reads are listed together as an 8-line file	☑️
Compressed file validation	gzip	Validate compressed FASTQ files with the extension `fastq.gz`	☑️
FASTQ variants* validation	fastq-illumina	Validate the fastq-illumina format	☑️
	fastq-sanger	Validate the fastq-sanger format	☑️
	fastq-solexa	Validate the fastq-solexa format	☑️
Performance	Memory		`N/A`
	Speed		`N/A`
Archive compatibility	ENA	Validated files can be submitted to the ENA archive	☑️
	ArrayExpress	Validated files can be submitted to ArrayExpress	☑️
	SRA	Validated files can be submitted to the SRA archive	☑️
Interface	Command line interface	Can be used in a shell and integrated into piped commands	☑️
License	Licensed		☑️ GPL-3
	Commercial use	Can be used for commercial purposes	☑️
Code	Open source	Source code available on public platforms	☑️

*See details in the [FASTQ specification recipe]( TODO include link).

13.6. Conclusion

In this recipe, we have shown how to validate FASTQ files and proposed indicators for evaluating a FASTQ validator. We also identified common file validation use cases and provided a general file validation workflow. This recipe can be extended to other file formats and other use cases.