Simulating Families from STR Data Derived from Files Exported from CODIS

November 28, 2017
8:50 am

While initially validating a familial searching method, it was determined that sampling a sufficient number of genuine family sets would be a time-consuming and cost-prohibitive task. To overcome these challenges, a simulation approach was pursued to gather the familial data required to conduct the study.

The immediate benefit of using simulated rather than genuine data is that simulations are a more cost-effective and faster means of obtaining the needed data points. Although this can be done manually, creating large sets of simulated families is much easier with a basic knowledge of programming and logic.

Written by: Dan Myers, New York State Police Forensic Investigation Center

Because most subject matter in programming is beyond the scope of the poster presented at ISHI 2017, only a general description of the concepts is given here. If further information about types of files, programming, logic, or language and communication concepts is sought, I would highly recommend programming and logic courses. Programming knowledge is not only beneficial to understanding and conducting simulations but also an invaluable skill to possess in most disciplines and fields today.

After some consideration, I decided to construct the simulations using Visual Basic for Applications (VBA) in Microsoft Excel. This choice of programming language was based on the availability of VBA and my proficiency and experience with it. Had other options been readily accessible, I may have chosen a language such as Java, Python, VB, or C++. If time permitted, creating a stand-alone, executable program could prove to be more useful, portable, and flexible than an Excel VBA macro.

Familial Searching and Simulation

The ultimate goal of a familial searching method is to detect a biological relative of a forensic unknown (i.e. target) in a large database of individuals. To achieve this, an offender profile (i.e. seed) from the database was chosen and used to reverse engineer a pair of biological parents. These parents were then crossed in a Mendelian fashion to simulate biological children. This effectively creates a simulated, biological family set that can be used as targets to search for the original offender.

The focus of the poster presented at ISHI 2017, and of the simulations I conducted to validate the familial searching method, can be summarized in these steps:

1. Splitting a seed genotype into two single-allele genotypes (i.e. haplotypes)
2. Selecting a sister allele for each haplotype to create simulated parental profiles of the seed
3. Crossing the simulated parents to create simulated full-siblings of the seed
4. Randomly choosing one allele from each parent
5. Selecting a sister allele for each to create simulated half-siblings of the seed
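
The steps above can be sketched for a single locus as follows. This is a minimal Python illustration, not the VBA macro used for the validation; the function names and the toy allele pool are assumptions made for the example, and the sister alleles are drawn uniformly here rather than from population data.

```python
import random

# Hypothetical allele pool for one locus; a real run would draw from
# population allele tables instead of this toy list.
ALLELE_POOL = ["10", "11", "12", "13", "14"]


def simulate_parents(seed_genotype, rng):
    """Steps 1-2: split the seed into single alleles and give each a sister allele."""
    allele_a, allele_b = seed_genotype
    parent_1 = (allele_a, rng.choice(ALLELE_POOL))
    parent_2 = (allele_b, rng.choice(ALLELE_POOL))
    return parent_1, parent_2


def simulate_full_sibling(parent_1, parent_2, rng):
    """Step 3: cross the parents, passing on one random allele from each."""
    return (rng.choice(parent_1), rng.choice(parent_2))


def simulate_half_sibling(parent, rng):
    """Steps 4-5: take one random allele from one parent and add a sister allele."""
    return (rng.choice(parent), rng.choice(ALLELE_POOL))


rng = random.Random(2017)  # fixed seed so the run is repeatable
seed = ("11", "13")
parent_1, parent_2 = simulate_parents(seed, rng)
full_sibling = simulate_full_sibling(parent_1, parent_2, rng)
half_sibling = simulate_half_sibling(parent_1, rng)
```

Calling `simulate_half_sibling` once per parent would give one half-sibling related through each simulated parent.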

Since working on this project, I have encountered quite a bit of interest and confusion about exactly what a simulation is. For this application, it simply means selecting alleles artificially as they would be found in the population. This process can be used to create a single profile, multiple profiles, entire databases consisting of profiles, or even mating populations with multiple familial generations. The data to be simulated depends on the study goals and how the data is to be modeled.

The simulations used for this validation were targeted as relatives of an existing individual in the database; therefore, about ¼ of the alleles were given. The remaining alleles were selected from an aggregated list of alleles using the FBI Amended African-American, Caucasian and Southwestern Hispanic populations (any set could be used). In short, simulation in the context of this validation means assigning selected alleles to simulated individuals and/or crossing the simulated individuals following Mendelian rules of inheritance to create simulated children.

Objects and XML

Simulating a large number of alleles involves basic programming and logic using variables, objects, and data storage. Understanding what objects and variables are can be a little difficult at first, but it is not impossible. An object is something in memory that has properties and functions that the computer can reference and manipulate.

To understand an object, consider a physical object, such as a pen as an example. It has certain properties (i.e. variables) such as the type of point, or the color of the ink, and a certain function, writing. Objects in programming are similar to physical objects although a bit more abstract since they only exist in the computer’s memory. Objects are not directly observable, but rather indirectly observed through the computer output. Examples of outputs resulting from an object include a message box, a picture, a spreadsheet on the screen, or a text file containing genotypes.
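
The pen analogy can be written out as a small class. This is only an illustrative Python sketch; the class name, property names, and output format are assumptions chosen for the example.

```python
class Pen:
    """A pen modeled as an object: two properties plus one function."""

    def __init__(self, point_type, ink_color):
        self.point_type = point_type  # property: type of point
        self.ink_color = ink_color    # property: color of the ink

    def write(self, text):
        """The pen's function: return output that can be observed."""
        return f"[{self.ink_color} ink, {self.point_type} point] {text}"


pen = Pen(point_type="ballpoint", ink_color="blue")
line = pen.write("11,11")
print(line)  # the object itself stays in memory; only its output is seen
```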

Instead of using an object in memory, simulations could be done in Excel worksheets using functions rather than code, but this becomes difficult with large amounts of data. Loading information into Excel to display, then looping through it all puts a strain on computational resources. This process is slow and inefficient, especially with hundreds of thousands of profiles. In addition, the limits of Excel itself will prevent processing exponentially larger amounts of data (e.g. databases with millions of 22 locus profiles). For purposes of efficiency, it would be more effective to store and simulate the profiles in memory as an object, save the output in a text file and then use the Excel worksheets for allele frequency storage only.

This proves convenient since the industry standard for transferring information between programs, XML (Extensible Markup Language), uses text files that can be imported into memory as objects by both Excel and CODIS. These text files contain structured tags that define the data so that it can be read, treated, written, and stored. For example, the Amelogenin and CSF1PO loci in a CODIS Interpol format XML file would appear as shown below:

</amelogenin>

</locus>

Each of the bracketed terms, together with their structure, defines the data (e.g. in the XML file above the data would read X,Y and 11,11, respectively). This definition allows programs such as CODIS and Excel to read the data into memory, or to save it, for example as a .csv, .xls, .txt, .doc, or new XML file.
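
To show how tagged text becomes data in memory, here is a minimal Python sketch using the standard library's `xml.dom.minidom`. The XML fragment is hypothetical: its tag and attribute names were invented for this example and are not the CODIS Interpol schema.

```python
from xml.dom.minidom import parseString

# Hypothetical XML fragment: tag and attribute names are illustrative
# only and do not reproduce the CODIS Interpol format.
xml_text = """
<profile id="example-1">
  <locus name="Amelogenin">
    <allele>X</allele>
    <allele>Y</allele>
  </locus>
  <locus name="CSF1PO">
    <allele>11</allele>
    <allele>11</allele>
  </locus>
</profile>
"""

dom = parseString(xml_text.strip())

# Walk the tree and collect the alleles found under each locus tag.
genotypes = {}
for locus in dom.getElementsByTagName("locus"):
    alleles = [
        node.firstChild.data for node in locus.getElementsByTagName("allele")
    ]
    genotypes[locus.getAttribute("name")] = ",".join(alleles)

print(genotypes)
```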

The Document Object Model

The object in question when loading XML files into memory is called a Document Object Model (DOM). Most programming languages will have a definition for this object. Other types of objects such as lists, linked lists, and arrays could probably be used, but it is much easier when the object is already defined. The Microsoft Excel VBA object library has several iterations of the DOM called DOMDocument, which is what I currently use to simulate families. Of course, text files can be read and edited line by line without using a DOM, but just like reading from and writing to an Excel worksheet, this takes a significant amount of time and drains system resources. It is more efficient to conduct the bulk of the processing with a DOM in memory and then save it to a text file, rather than displaying it on a screen or working in an open text file.
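
The idea of processing in memory and saving once can be sketched as follows. This Python example uses `xml.dom.minidom` as a stand-in for the VBA DOMDocument described above; the element names, profile identifiers, and output file name are all hypothetical.

```python
import os
import tempfile
from xml.dom.minidom import Document, parse

# Hypothetical element names, chosen for illustration only.
doc = Document()
root = doc.createElement("profiles")
doc.appendChild(root)

simulated = [
    ("sibling-1", {"D19S433": ("11", "14")}),
    ("sibling-2", {"D19S433": ("13", "14")}),
]

# All edits happen on the in-memory tree.
for profile_id, loci in simulated:
    profile = doc.createElement("profile")
    profile.setAttribute("id", profile_id)
    for locus_name, alleles in loci.items():
        locus = doc.createElement("locus")
        locus.setAttribute("name", locus_name)
        for value in alleles:
            allele = doc.createElement("allele")
            allele.appendChild(doc.createTextNode(value))
            locus.appendChild(allele)
        profile.appendChild(locus)
    root.appendChild(profile)

# The file is written once, after the tree is complete.
out_path = os.path.join(tempfile.mkdtemp(), "simulated_profiles.xml")
with open(out_path, "w", encoding="utf-8") as handle:
    doc.writexml(handle, addindent="  ", newl="\n", encoding="utf-8")

# Reload the saved file to confirm it is well-formed XML.
reloaded = parse(out_path)
profile_count = len(reloaded.getElementsByTagName("profile"))
```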

Simulating alleles

The core of simulating alleles is a random method or subroutine. The function of this method is to choose and return a random number. Importantly, the set of numbers used as the selection pool can be specified, which makes a random number generator extremely useful. For example, in step 3 under Familial Searching and Simulation, the way to cross two genotypes in Mendelian fashion (or to choose an allele from one parent to create a half-sibling) is to code a random choice between 1 and 2, then use it to decide which allele to pass on from parent 1. Repeating this process for parent 2 results in a child's full genotype being simulated.
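
A minimal Python sketch of that random choice between 1 and 2, applied across several loci, might look like this. The function names and the toy parent genotypes are assumptions for illustration.

```python
import random


def pass_allele(parent_genotype, rng):
    """Pick allele 1 or allele 2 of a parent with equal probability."""
    pick = rng.randint(1, 2)  # random choice between 1 and 2, inclusive
    return parent_genotype[pick - 1]


def cross(parent_1, parent_2, rng):
    """Build a child profile locus by locus from two parent profiles."""
    child = {}
    for locus in parent_1:
        child[locus] = (
            pass_allele(parent_1[locus], rng),
            pass_allele(parent_2[locus], rng),
        )
    return child


# Toy parent profiles (hypothetical genotypes at two loci).
parent_1 = {"D19S433": ("11", "14"), "CSF1PO": ("10", "12")}
parent_2 = {"D19S433": ("13", "15"), "CSF1PO": ("11", "11")}

child = cross(parent_1, parent_2, random.Random(1))
```

Every child genotype holds exactly one allele from each parent, so calling `cross` repeatedly with the same parents yields simulated full-siblings.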

Simulating alleles from the population is slightly more complicated than randomly choosing between two numbers. The selection must follow a distribution, preferably one that approximates the applicable population. Some programs interface with MATLAB or R to follow a set distribution function, but this is unnecessary if you have enough space to store alleles (Excel worksheets, an SQL database or even a .txt file will work). In the simulations for this validation, all the alleles from the FBI population study used were tabulated in separate Excel worksheets for each locus. The random number generator was then used to select an allele from each table (steps 2 and 5 above). For example, in the D19S433 allele set, there are twenty-four 11 alleles; that is, the table contains 24 instances of the 11 allele in a total set of 922 allele observations. When the random number generator chooses a number between 1 and 922, an 11 is returned about 2.603% of the time (24/922), in effect following the population of interest.
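
The table lookup can be sketched in Python as below. Only the count for allele 11 (24 of 922 observations) comes from the text; the other 898 observations are lumped under a placeholder label because their breakdown is not given here, and the function name is an assumption.

```python
import random
from fractions import Fraction

# One table entry per allele observation. Only the count for allele 11
# (24 of 922) comes from the text; the remaining 898 observations are
# lumped under a placeholder label.
allele_table = ["11"] * 24 + ["other"] * 898


def draw_allele(table, rng):
    """Pick a row number between 1 and len(table) and return that row's allele."""
    row = rng.randint(1, len(table))
    return table[row - 1]


# Exact share of 11 alleles in the table.
expected = Fraction(allele_table.count("11"), len(allele_table))

rng = random.Random(42)
draws = [draw_allele(allele_table, rng) for _ in range(100_000)]
observed = draws.count("11") / len(draws)

print(f"expected {float(expected):.4%}, observed {observed:.4%}")
```

Because each observation occupies one row, a uniform draw over row numbers returns each allele in proportion to its count, with no separate distribution function needed.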

Repeating the process

What makes computers so instrumental in data procurement and analysis is that anything can be iterated and/or automated. Looping is a very large part of programming and can be used to repeat a function hundreds, thousands, or millions of times in a very short period (often a second or less). For this familial validation, 3,350 biological relatives of 1,150 offender seeds were created in 4 simulations, taking under 1 second each.
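
As a rough illustration of looping, the Python sketch below simulates one full-sibling for each of 1,150 randomly generated toy seeds and times the loop. The allele pool, locus names, and function names are placeholders, and the timing will vary by machine; this is not the validation code itself.

```python
import random
import time

rng = random.Random(0)
POOL = ["10", "11", "12", "13", "14"]        # toy allele pool (assumption)
LOCI = [f"locus_{i}" for i in range(1, 21)]  # 20 placeholder locus names


def random_genotype():
    """Toy seed genotype drawn uniformly from the pool."""
    return (rng.choice(POOL), rng.choice(POOL))


def full_sibling_of(seed):
    """Rebuild parents around the seed, then cross them at every locus."""
    child = {}
    for locus, (a, b) in seed.items():
        parent_1 = (a, rng.choice(POOL))
        parent_2 = (b, rng.choice(POOL))
        child[locus] = (rng.choice(parent_1), rng.choice(parent_2))
    return child


seeds = [{locus: random_genotype() for locus in LOCI} for _ in range(1150)]

start = time.perf_counter()
siblings = [full_sibling_of(seed) for seed in seeds]
elapsed = time.perf_counter() - start

print(f"{len(siblings)} siblings simulated in {elapsed:.3f} s")
```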

Conducting Further Simulations for Familial Searching

Given that databases are changing to include the expanded 20 CODIS core loci, further simulations will have to be done prior to conducting familial searching on any new data. The conversion of the data from 13 CODIS core loci to the expanded set will soon limit the usefulness of familial searching. Conducting new simulations such as those I describe will open up the new expanded sets of data to familial searching.

Validating Familial Searching

I have fielded quite a few questions regarding how I conducted my simulations, and the short answer is programming. If you are validating familial searching in your agency, I would recommend at least consulting with a programmer, or perhaps enrolling in some programming and logic courses, before attempting simulations. The skills and knowledge gained could be invaluable, not only to the validation but to your agency.

Dan Myers has a Bachelor of Science in Biochemistry from SUNY Binghamton and additional coursework at various colleges in statistics, molecular biology and computer science. He has been employed by the New York State Police since June of 2000 as a Laboratory Technician, Serologist and DNA Analyst. As part of his duties as a DNA Analyst he has specialized in kinship analysis, and bone/fetal sampling. In addition, he has been tasked with training NYSP staff in kinship analysis and currently is part of the team validating and implementing a familial searching method for the state of New York. For any questions regarding his poster, simulations or the familial searching validation he can be reached at: danielmyers76@gmail.com.
