Nov 08 2018

STRSEQ: A Resource for Sequence-Based STR Analysis


As the forensic DNA community explores the possibility of STR sequencing, and commercially available assays are introduced, we at NIST recognize the importance of standardizing sequence formats, both to allow inter-lab comparison of sequence-based profiles and to preserve backward compatibility with CE-based STR alleles.


written by: Katherine Butler Gettings, NIST



The Applied Genetics Group at NIST has been Sanger sequencing STR alleles since long before modern sequencing instruments were introduced. Much of that in-house data, along with Sanger data from other researchers, was cataloged on STRBase, but the proliferation of published STR sequence data in recent years made STRBase an impractical solution for capturing this information. To address this aspect of the technology transition, the STRSeq consortium and associated NCBI BioProject were initiated last year, with the goal of establishing a catalog of STR sequences formatted according to current forensic guidance.

Our consortium partners are forensic research laboratories at the University of North Texas (UNT), King's College London (KCL), the University of Santiago de Compostela (USC), and the Medical University of Innsbruck. Combining data across NIST, UNT, KCL, and USC yields over 4500 samples sequenced with commercial STR multiplexes, which form the backbone of the STRSeq BioProject. The laboratory in Innsbruck is the home of STRidER, which serves a QC role for CE and sequence-based STR population data. Researchers from these laboratories also constitute the ad hoc STR Sequence Working Group, which continues to refine nomenclature guidance for the forensic community.

The Applied Genetics Group at NIST has led the STRSeq effort by interfacing with NCBI, organizing the consortium, analyzing STR sequence data from thousands of samples, and uploading GenBank records. The bioinformatic experts at NCBI have been instrumental in facilitating this project: we are able to leverage their existing infrastructure for record creation, organization, and searching.

With the goal of creating one record for each unique sequence, we have created 1145 GenBank records to date, distributed across 28 autosomal STR loci. We have approximately 350 additional autosomal STR records that we expect to upload this year. The poster we presented at ISHI 29 explores the overlap of sequences across the four consortium laboratories. Its graphical displays give a broad sense of how the approximately 1500 unique records are distributed across the loci and laboratories, and they contain interesting academic insights regarding the population genetics of the loci.
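
For readers on the software side, the idea of one record per unique sequence, and of comparing which laboratories observed which sequences, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the lab and locus labels, and the toy sequences are hypothetical, and it does not represent the consortium's actual analysis pipeline or the GenBank record format.

```python
from collections import defaultdict


def build_catalog(observations):
    """Collapse (lab, locus, sequence) observations into one entry per
    unique (locus, sequence) pair, remembering which labs reported it."""
    catalog = defaultdict(set)
    for lab, locus, sequence in observations:
        # Normalize case so the same sequence is not cataloged twice.
        key = (locus, sequence.upper())
        catalog[key].add(lab)
    return dict(catalog)


def shared_by_all(catalog, labs):
    """Return the catalog keys that every lab in `labs` has observed."""
    required = set(labs)
    return sorted(key for key, seen in catalog.items() if required <= seen)


# Toy data only: hypothetical labs, loci, and sequences (not real alleles).
observations = [
    ("LabA", "LOCUS1", "TCTATCTATCTA"),
    ("LabB", "LOCUS1", "tctatctatcta"),  # same sequence, different case
    ("LabA", "LOCUS1", "TCTATCTGTCTA"),
    ("LabB", "LOCUS2", "AGATAGATAGAT"),
]

catalog = build_catalog(observations)
overlap = shared_by_all(catalog, ["LabA", "LabB"])
```

Keying each entry on the locus together with the full sequence string is a deliberate simplification: two alleles of the same length but different internal sequence remain distinct entries, which is the information a length-based CE designation alone cannot capture.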

We’ve also been working to ensure software developers are aware of this resource. When we started this project, we surveyed forensic labs worldwide that were interested in STR sequencing, and the overwhelming feedback was that a catalog of sequences would best be accessed from within commercial software. Given that, we’d consider this work most successful if its impact is behind the scenes for average practitioners: their data will be uniformly formatted according to this catalog, even if they never access it directly. For bioinformaticians currently developing software or considering future database design, this catalog is a great starting point.

This project is a great fit for NIST—we have the expertise, motivation and connections to make this a sustainable resource. Once the initial submission of GenBank records is complete for the autosomal STR loci, we will tackle the Y-STR and X-STR loci present in commercial sequencing assays. We also hope to begin work with STRidER to develop a pathway for researchers to submit new STR sequences to the STRSeq catalog.