A Likelihood Ratio Framework for Evaluating the Statistical Strength of Body Fluid Identification Using Protein Markers

As the amount of DNA needed for identifying individuals from forensic evidence becomes increasingly small, the need to know the source of that DNA (e.g. body fluid or tissue) becomes increasingly important – affecting both the defense and prosecution, as well as the public’s confidence in the accurate presentation of evidence and impartial administration of justice.

 

The New York City Office of Chief Medical Examiner (NYC OCME) has developed and validated a confirmatory body fluid assay for semen, saliva and blood[1-2]. This assay uses mass spectrometry to identify multiple marker proteins from each body fluid to confirm their presence in a forensic sample. The marker proteins selected for each body fluid are involved in the essential function of that body fluid: for example, hemoglobin transports oxygen in blood, semenogelin aids sperm during fertilization, and amylase helps digest carbohydrates in saliva. The protein markers are therefore both abundant in and characteristic of the body fluid. These properties allow for both specificity of body fluid identification and sensitivity of detection, both of which have been confirmed with extensive validation testing of known body fluid samples[2]. Considering the widespread use of likelihood ratios in the presentation of forensic DNA evidence, a similar system for evaluating the statistical strength of body fluid identification using protein markers is desirable.

 

For the identification of a specified body fluid in forensic evidence, the likelihood ratio is calculated as the probability of the evidence if the evidence originated from that body fluid, divided by the probability of the evidence if the evidence originated from any other protein source. While empirical testing of known sources forms the backbone of the probability estimation, the breadth of possible sources of protein in forensic evidence is essentially limitless, and as such is impossible to test fully by empirical means. A thorough investigation of large, publicly available DNA and protein sequence databases provides a more complete understanding of the specificity and sensitivity of the selected body fluid protein markers, allowing for robust estimation of the probability of the evidence under the two opposing hypotheses outlined above.
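
Written out, the ratio described in this paragraph takes the standard form below. The symbols are notation introduced here for illustration and do not appear in the abstract: E stands for the observed evidence (the mass spectrometry detection of the marker proteins), H_1 for the hypothesis that the evidence originated from the specified body fluid, and H_2 for the hypothesis that it originated from any other protein source.

```latex
\[
  \mathrm{LR} \;=\; \frac{P(E \mid H_1)}{P(E \mid H_2)}
\]
```

A value greater than 1 means the evidence is more probable under H_1 than under H_2, and a value less than 1 means the opposite.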

 

The Genome Aggregation Database (gnomAD) v2 is composed of 125,748 exomes and 15,708 genomes from diverse populations around the world[3] and provides a rich resource for investigating the variation present in protein markers for body fluids. While protein-coding regions, to varying degrees, exhibit constraint against deleterious mutation, variants that result in protein sequence differences do occur in the target regions of NYC OCME protein markers, largely at low population frequencies. An evaluation of the frequency of marker protein variation in human populations is used to predict potential effects on the sensitivity of detection of protein markers. The National Center for Biotechnology Information (NCBI)’s non-redundant protein database (https://www.ncbi.nlm.nih.gov/protein/) contains 707,028,945 protein sequences from more than 160,000 organisms (all organisms for which sequence data is available). Using this database, the frequency of occurrence of body fluid marker sequences can be estimated. In addition, the analysis of marker sequence frequency is expanded to evaluate the probability that an arbitrary protein sequence or group of sequences could produce a signal that interferes with or mimics the mass spectrometry assay target. With this information, the probability under the denominator hypothesis, that is, the probability of the evidence if the evidence originated from any protein source other than the identified body fluid, can be accurately determined.
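
As a rough illustration of what "frequency of occurrence" of a marker sequence in a protein database could mean, the sketch below counts the fraction of database entries that contain a given peptide as an exact substring. It is a minimal sketch under stated assumptions, not the NYC OCME analysis: the function names, the toy sequences, and the query peptides are all made up for illustration, and exact substring matching is a simplification that ignores enzymatic digestion, isobaric residues, and mass-based matching.

```python
from typing import Iterator, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def peptide_frequency(fasta_text: str, peptide: str) -> float:
    """Fraction of database entries containing `peptide` as an exact substring."""
    total = 0
    hits = 0
    for _, sequence in parse_fasta(fasta_text):
        total += 1
        if peptide in sequence:
            hits += 1
    return hits / total if total else 0.0


# Hypothetical toy database; these are NOT real marker or database sequences.
TOY_DB = """>toy_protein_A
MKTAYIAKQRQISFVK
SHFSRQLEERLGLIEVQ
>toy_protein_B
MGLSDGEWQLVLNVWGK
>toy_protein_C
MKTAYIAKQR
"""

# "AYIAK" is an arbitrary query peptide chosen for the toy data.
freq = peptide_frequency(TOY_DB, "AYIAK")
print(f"Fraction of toy entries containing the peptide: {freq:.3f}")
```

At the scale of the full non-redundant database, the same count would need a streaming or indexed search rather than an in-memory string, but the quantity being estimated is the same.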

 

This work is funded in part by NIJ Grant 15PNIJ-21-GG-02712-SLFO.

 

  1. Yang, H., Zhou, B., Deng, H., Prinz, M. and Siegel, D., 2013. Body fluid identification by mass spectrometry. International Journal of Legal Medicine, 127, pp.1065-1077.

 

  2. Butler, E., Yang, H., Perez, T., Almubarak, I., Zapata, J. and Siegel, D., 2023. Validation of a Confirmatory Proteomic Mass Spectrometry Body Fluid Assay for Use in Publicly Funded Forensic Laboratories. Presented at the NIJ Forensic Science Research and Development Symposium, AAFS 2023.

 

  3. Karczewski, K.J., Francioli, L.C., Tiao, G., Cummings, B.B., Alföldi, J., Wang, Q., Collins, R.L., Laricchia, K.M., Ganna, A., Birnbaum, D.P. and Gauthier, L.D., 2020. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature, 581(7809), pp.434-443.

Brought to you by

Worldwide Association of Women Forensic Experts

Erin Butler

Forensic Biology Lab, New York City Chief Medical Examiner

During a nine-year tenure at the New York City Office of the Chief Medical Examiner Forensic Biology lab, Erin Butler has focused on the statistical basis for the confident identification of proteins with mass spectrometry and its application in forensic body fluid and species identification, including the design and validation of a comprehensive data analysis strategy for identification of blood, semen, and saliva proteins that has been implemented in OCME’s molecular serology assay.
