POPSeq: The Human STR Sequence Diversity Database

Next-generation sequencing (NGS), also referred to as massively parallel sequencing (MPS), is evolving rapidly and enabling more sensitive methods for genetic typing of forensic samples [1]. The technology allows a high degree of sample and genetic marker multiplexing, covering both conventional forensic loci and investigative genomic data. For short tandem repeat (STR) typing, recent publications have described NGS methods and reported sequence diversity in the core repeat and flanking regions of many forensically relevant loci [2-7] that had not previously been reported or catalogued [8,9]. The identification of STR isoalleles, sequence variants of equal size (length), has excited the forensic community, which is eager to use this additional variation in cases where traditional capillary electrophoresis is limited, such as complex mixtures and degraded samples.



Written by: Seth Faith, North Carolina State University



Guidelines for the forensic use of NGS and sequence-based STR data [10-12], together with STRSeq, a sequence repository recently established by NIST [13], will assist the community in validating and implementing NGS technology for casework. To date, however, no comprehensive resource compiles allele frequencies of sequence-based STR loci across multiple populations. The POPSeq Human STR Sequence Diversity Database was designed to meet this need in forensic genetics. The database leverages recent advances in forensic STR typing by NGS, bioinformatics, and analytics. A guiding principle of the database is self-service analytics: the ability to select data based on user-defined criteria. The database therefore catalogues numerous metadata attributes, including population group, NGS kit, geographic location of collection, locus, sample type, sex, and quality control level. For end-user interaction, we have built a web application layer that displays data in table and graphical formats and allows structured data tables to be downloaded. Lastly, the database design follows the most current guidance for reporting STR sequences [11] and is designed to be extended as new data and reporting considerations emerge from the community over the next several years.
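The self-service idea described above, selecting records by user-defined metadata criteria, can be sketched in a few lines of Python. This is only an illustration: the record fields, the `select` helper, and the toy records (including the population, kit, and sequence values) are assumptions made for the example and do not reflect the actual POPSeq schema or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlleleRecord:
    """Hypothetical record layout; field names are assumptions, not the POPSeq schema."""
    sample_id: str
    population: str
    kit: str
    locus: str
    sex: str
    qc_level: int
    sequence: str


def select(records, **criteria):
    """Return the records whose attributes equal every supplied criterion."""
    return [
        r for r in records
        if all(getattr(r, name) == value for name, value in criteria.items())
    ]


# Made-up records for illustration only.
RECORDS = [
    AlleleRecord("S1", "GroupA", "KitX", "D8S1179", "F", 1, "TCTATCTATCTA"),
    AlleleRecord("S2", "GroupB", "KitX", "D8S1179", "M", 1, "TCTATCTATCTG"),
    AlleleRecord("S3", "GroupA", "KitX", "D8S1179", "M", 2, "TCTATCTGTCTA"),
    AlleleRecord("S4", "GroupA", "KitX", "SE33", "F", 1, "AAAGAAAGAAAG"),
]

if __name__ == "__main__":
    # Select by population group and locus, as a user might on the website.
    for record in select(RECORDS, population="GroupA", locus="D8S1179"):
        print(record.sample_id, record.sequence)
```

Passing criteria as keyword arguments keeps the filter open-ended, so a new metadata attribute only needs a new field on the record.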

All samples currently admitted to the database were amplified with the PowerSeq™ Auto/Y System Prototype (Promega Corp., Madison, WI), currently marketed as PowerSeq™ 46GY, and were sequenced with the MiSeq System (Illumina Inc., San Diego, CA). The methods for analyzing STR data with cloud computing were described by Bailey et al. [14] and allowed accurate and efficient STR processing. In addition, a team of forensic NGS experts conducted a one-over-one review of the data, incorporating concordance data when available, to ensure the highest quality; the review is described further on the Documentation page of the website. Future datasets will incorporate samples processed with other NGS platforms, and the platform will be included as a searchable metadata field.

POPSeq Human STR Sequence Diversity Database version 1.0 was released on 18 August 2017 and contained 550 total genotypes from 7 major population groups, 28,410 STR sequence alleles, and 806 unique allele sequences generated from 23 autosomal and 21 Y-STR markers. The population data can be explored through the website, whose HTML application layer is built with R Shiny.

Remarkably, sixty-one percent of the STR loci (27/44) presented isoalleles. When alleles were characterized by the core repeat region, the autosomal markers SE33, D12S391, D21S11, D2S1338, and D8S1179 showed the highest numbers of isoalleles in the dataset. Further sequence variation in STR loci is therefore likely to be observed as POPSeq expands.
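To make the isoallele count concrete, the following Python sketch groups alleles by locus and length and flags any group that contains more than one distinct sequence, then reports the fraction of loci with at least one such group (the kind of ratio reported above as 27/44). The function names, locus labels, and sequences are invented for illustration; this is not the pipeline used to build POPSeq.

```python
from collections import defaultdict


def find_isoalleles(alleles):
    """Map (locus, length) to its sequences when more than one distinct sequence shares that length.

    `alleles` is an iterable of (locus, sequence) pairs.
    """
    by_size = defaultdict(set)
    for locus, sequence in alleles:
        by_size[(locus, len(sequence))].add(sequence)
    return {key: sorted(seqs) for key, seqs in by_size.items() if len(seqs) > 1}


def fraction_of_loci_with_isoalleles(alleles):
    """Fraction of loci that show at least one pair of isoalleles."""
    alleles = list(alleles)
    all_loci = {locus for locus, _ in alleles}
    iso_loci = {locus for locus, _length in find_isoalleles(alleles)}
    return len(iso_loci) / len(all_loci)


# Made-up alleles: LocusA has two 16-base sequences that differ by one base.
TOY_ALLELES = [
    ("LocusA", "TCTATCTATCTGTCTA"),
    ("LocusA", "TCTATCTATCTATCTA"),
    ("LocusA", "TCTATCTA"),
    ("LocusB", "AGATAGAT"),
    ("LocusB", "AGATAGATAGAT"),
]

if __name__ == "__main__":
    print(find_isoalleles(TOY_ALLELES))
    print(fraction_of_loci_with_isoalleles(TOY_ALLELES))
```

Length-based typing alone would report the two 16-base LocusA alleles as the same allele; comparing the sequences is what separates them.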

We aim to exceed 1,000 quality-reviewed profiles in POPSeq by the end of 2017, drawing on existing datasets currently in review. The website will also gain the following enhancements: reporting of flanking regions, bracket notation of core repeats, search compatibility among kits with different start/stop amplicon positions, and bioinformatic tools that use allele frequencies to calculate match statistics. The larger dataset expected in the next release (v1.1) will be used for a thorough sequence-based genetic analysis of different populations to better understand global allelic diversity and distribution. Lastly, this resource is positioned to grow with the addition of data from collaborating laboratories. We ask that interested groups use the Contact link on the website to learn how to provide NGS data to POPSeq as a collaborator.
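As an illustration of what bracket notation of core repeats might look like, the Python sketch below collapses consecutive identical motifs into a `[MOTIF]n` form. It is a deliberately naive version: it assumes a fixed motif length and a fixed reading frame starting at the first base, and it does not attempt to follow any published nomenclature rule or the format POPSeq will adopt. The function name and the example sequences are hypothetical.

```python
def bracket_notation(sequence, motif_length=4):
    """Collapse runs of identical fixed-length motifs into '[MOTIF]n' blocks.

    Assumes the repeat frame starts at the first base of `sequence`.
    """
    if motif_length < 1:
        raise ValueError("motif_length must be a positive integer")
    parts = []
    i = 0
    while i + motif_length <= len(sequence):
        motif = sequence[i:i + motif_length]
        count = 1
        j = i + motif_length
        # Extend the run while the next window repeats the same motif.
        while sequence[j:j + motif_length] == motif:
            count += 1
            j += motif_length
        parts.append(f"[{motif}]{count}")
        i = j
    if i < len(sequence):
        # Keep any trailing partial motif verbatim.
        parts.append(sequence[i:])
    return " ".join(parts)


if __name__ == "__main__":
    print(bracket_notation("TCTATCTATCTGTCTA"))
    print(bracket_notation("TCTATCTATCTATCTA"))
```

The two example sequences have the same length but produce different bracketed strings, which is how an isoallele would become visible in such a notation.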




[1] C. Børsting, N. Morling, Next generation sequencing and its applications in forensic genetics, Forensic Science International: Genetics. 18 (2015) 78-89.

[2] C. Phillips, A genomic audit of newly-adopted autosomal STRs for forensic identification, Forensic Science International: Genetics. 29 (2017) 193-204.

[3] N.M. Novroski, J.L. King, J.D. Churchill, L.H. Seah, B. Budowle, Characterization of genetic sequence variation of 58 STR loci in four major population groups, Forensic Science International: Genetics. 25 (2016) 214-226.

[4] C. Gelardi, E. Rockenbauer, S. Dalsgaard, C. Børsting, N. Morling, Second generation sequencing of three STRs D3S1358, D12S391 and D21S11 in Danes and a new nomenclature for sequenced STR alleles, Forensic Science International: Genetics. 12 (2014) 38-41.

[5] M. Scheible, O. Loreille, R. Just, J. Irwin, Short tandem repeat typing on the 454 platform: Strategies and considerations for targeted sequencing of common forensic markers, Forensic Science International: Genetics. 12 (2014) 107-119.

[6] X. Zhao, H. Li, Z. Wang, K. Ma, Y. Cao, W. Liu, Massively parallel sequencing of 10 autosomal STRs in Chinese using the ion torrent personal genome machine (PGM), Forensic Science International: Genetics. 25 (2016) 34-38.

[7] K.B. Gettings, K.M. Kiesler, S.A. Faith, E. Montano, C.H. Baker, B.A. Young, et al., Sequence variation of 22 autosomal STR loci detected by next generation sequencing, Forensic Science International: Genetics. 21 (2016) 15-21.

[8] C.M. Ruitberg, D.J. Reeder, J.M. Butler, STRBase: a short tandem repeat DNA database for the human identity testing community, Nucleic Acids Research. 29 (2001) 320-322.

[9] M. Bodner, I. Bastisch, J.M. Butler, R. Fimmers, P. Gill, L. Gusmão, et al., Recommendations of the DNA Commission of the International Society for Forensic Genetics (ISFG) on quality control of autosomal Short Tandem Repeat allele frequency databasing (STRidER), Forensic Science International: Genetics. 24 (2016) 97-102.

[10] FBI SWGDAM, Validation Guidelines for Forensic DNA Analysis Methods, Federal Bureau of Investigation, Scientific Working Group on DNA Analysis and Methods. (2016).

[11] W. Parson, D. Ballard, B. Budowle, J.M. Butler, K.B. Gettings, P. Gill, et al., Massively parallel sequencing of forensic STRs: Considerations of the DNA commission of the International Society for Forensic Genetics (ISFG) on minimal nomenclature requirements, Forensic Science International: Genetics. 22 (2016) 54-63.

[12] L. Gusmão, J.M. Butler, A. Linacre, W. Parson, L. Roewer, P.M. Schneider, et al., Revised guidelines for the publication of genetic population data, Forensic Science International: Genetics. 30 (2017) 160-163.

[13] K.B. Gettings, L.A. Borsuk, D. Ballard, M. Bodner, B. Budowle, L. Devesse, J. King, W. Parson, C. Phillips, P.M. Vallone, STRSeq: A catalog of sequence diversity at human identification Short Tandem Repeat loci, Forensic Science International: Genetics. 31 (2017) 111-117.

[14] S. Bailey, M.K. Scheible, C.L. Williams, D.S.B.S. Silva, C. Eichman, S.A. Faith, Secure and Robust Cloud Computing for High-throughput Forensic Microsatellite Sequence Analysis and Databasing, Forensic Science International: Genetics. 31 (2017) 40-47.