Whole genome sequence benchmark datasets for validation of bioinformatics workflows

Funding period: 2020-2025
Leads: Burton Blais and Catherine Carrillo
Total GRDI funding: $110,000

Food microbiology testing labs play a vital role in food safety investigations, confirming contamination and identifying its extent and source. The CFIA focuses on testing for Salmonella, Listeria, and Escherichia coli, and routinely uses Whole Genome Sequencing (WGS) for typing and for detecting virulence and resistance genes. This project aims to create benchmark WGS datasets that meet CFIA standards. These datasets are crucial for developing and validating new gene-based methods for foodborne pathogen detection and for verifying bioinformatic pipelines, and they will be included as part of the Microbiological Methods Committee requirements within the purview of the Compendium of Analytical Methods.

Research tool/process

  • Method for selecting diverse genomes from large public datasets for use in in silico validation analyses. The workflow involves: (1) recovering thousands of genomes from public repositories, (2) removing low-quality datasets, and (3) removing clonal or closely related datasets. The resulting large-scale datasets support the implementation of rigorous in silico validation protocols, which help ensure improved performance of gene-based methods for pathogen detection.
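
The filtering and dereplication steps of this workflow can be illustrated with a minimal sketch. It is not the CFIA implementation: the source does not say which quality metrics or relatedness measure are used. So the `Genome` record, the contig-count threshold, the k-mer Jaccard distance, and the `min_distance` cutoff are all hypothetical stand-ins chosen for illustration. Step 1 (downloading genomes from public repositories) is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Genome:
    # Hypothetical record; the real workflow's metadata is not described in the source.
    accession: str
    sequence: str   # assembled sequence (short toy strings in this sketch)
    n_contigs: int  # assembly fragmentation, used here as a simple quality proxy


def kmers(sequence: str, k: int = 4) -> set:
    """Return the set of all length-k substrings of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard_distance(a: set, b: set) -> float:
    """1 - |a & b| / |a | b|; 0.0 means identical k-mer content."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def select_diverse(genomes, max_contigs=500, min_distance=0.1, k=4):
    """Drop low-quality genomes, then greedily keep only genomes that are
    at least `min_distance` away from every genome already kept.
    Both thresholds are assumed values, not ones taken from the project."""
    # Step 2: remove low-quality datasets (assumed criterion: too many contigs).
    passed = [g for g in genomes if g.n_contigs <= max_contigs]

    # Step 3: remove clonal or closely related datasets (greedy dereplication).
    kept = []  # list of (genome, k-mer set) pairs
    for genome in passed:
        profile = kmers(genome.sequence, k)
        if all(jaccard_distance(profile, other) >= min_distance for _, other in kept):
            kept.append((genome, profile))
    return [genome for genome, _ in kept]


if __name__ == "__main__":
    toy = [
        Genome("A", "ACGTACGTACGTAAAA", 10),
        Genome("B", "ACGTACGTACGTAAAA", 20),    # identical to A, treated as clonal
        Genome("C", "TTTTGGGGCCCCTTGG", 30),
        Genome("D", "ACACACACGTGTGTGT", 1000),  # fails the assumed quality filter
    ]
    print([g.accession for g in select_diverse(toy)])
```

The greedy pass compares each candidate against every genome already kept, so it scales quadratically. At the scale of tens of thousands of genomes, a sketch-based distance estimate or a clustering step would likely be needed instead.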

Dataset/database

  • Shiga Toxin Allele Database (StxDB): A comprehensive, curated database of Shiga toxins, including all known nucleotide and protein sequence variants, to enable accurate determination of Shiga toxin variants. The database has now been curated to provide accession numbers for representative genomes so that users can assess the reliability of results. Contributors: Sarah Clarke, Catherine Carrillo, Burton Blais, Adam Koziol, Noor Shubair, Mathu Malar, Liam Brown, Ashley Cooper and Alex Gill.
  • STEC NCBI Database: A set of 10,790 diverse genomes was selected from a total of approximately 100,000 published genomes in the NCBI pathogens database. This reduced dataset has been implemented for in silico validation tools at the CFIA, including the PrimerValidator on FoodPort, and for assessing the reliability of species-specific targets.
  • Listeria monocytogenes NCBI Database: A set of 3,225 diverse genomes was selected from a total of approximately 60,000 published genomes in the NCBI pathogens database. This reduced dataset has been implemented for in silico validation tools at the CFIA, including the PrimerValidator on FoodPort, and for assessing the reliability of species-specific targets. Use of a standardized dataset will ensure comparability of in silico validation analyses.

Contact us

For additional information, please contact:
Genomics R&D Initiative
Email: info@grdi-irdg.collaboration.gc.ca