Jan 30 2024

Breaking Barriers in DNA Analysis: The Role of Bioinformatics


In this engaging interview, Ellen Greytak, Director of Bioinformatics at Parabon Nanolabs, shares insights from her extensive career in forensic DNA analysis. Ellen, who has been with Parabon since her postdoctoral fellowship in bioinformatics, discusses the challenges and breakthroughs in investigative genetic genealogy, particularly in dealing with mixed DNA samples. She explains the complex process of differentiating DNA from multiple individuals in a single sample, often involving a perpetrator and a victim, and how her team developed innovative bioinformatic solutions to accurately separate and analyze these mixtures.


Ellen highlights the critical role of SNP (Single Nucleotide Polymorphism) profiling in identifying individuals from mixed samples and the importance of uploading accurate profiles to databases for successful genetic genealogy. She emphasizes the delicate balance between obtaining useful genetic profiles and ensuring the integrity and reliability of the data for genealogists. Additionally, Ellen delves into the nuances of phenotyping in forensic work, detailing how predicting physical traits from DNA complements genealogical research and aids investigations.


The conversation also touches on the legal and ethical aspects of forensic DNA analysis, with Ellen noting the increasing involvement of bioinformatics in court cases and the importance of adhering to strict guidelines and respecting consent in genetic genealogy. She concludes with reflections on the evolution of forensic DNA analysis over her career and predictions for future advancements in the field.



Laura: Well, Ellen, thank you so much for joining us again here at ISHI 34. This is the annual video series we do, and we are honored to have you again. Um, let’s talk a little bit about your background for new viewers who might not know who you are yet. I’m sure everyone who attends does.


Ellen: Well, I’m Ellen Greytak, I’m the director of bioinformatics at Parabon Nanolabs. And my background is that I’ve been at Parabon my whole working life. It’s the only job I’ve ever had. So, I did my PhD in evolutionary biology, did a one year fellowship, a postdoctoral fellowship in bioinformatics, and then started working at Parabon in 2012, and I’ve been there ever since.


Laura: That’s wonderful. I’ve really loved watching the work that you’ve done your entire career while you’ve been there.


Ellen: It’s been a very interesting series of years.


Laura: Well, that’s for sure. And this year you’re presenting on investigative genetic genealogy, and in particular, on things like mixed samples and things that are more challenging to work with. For viewers who might not know, what is a mixed sample?


Ellen: So, a mixed sample is where you have DNA in a single sample from multiple people. And so usually the scenario is that it’s the perpetrator as well as the victim. Their DNA has been mixed together in that sample. And so, when it was swabbed it has DNA from both of them. And so that can be really challenging to work with.


Laura: And it makes sense that that would be often the case. So, because they’re so challenging how do you address that?


Ellen: Well, there are some laboratory methods to try to sort of separate out if it’s a sexual assault sample. But really bioinformatically is what we’re doing. So, we get the data back and it clearly is mixed. I mean, we can tell from sort of the signal that there’s multiple people in there. And so, what we’ve needed to do is we had to basically invent a solution for this. So, when we first started doing forensic DNA, it was a bit of a rude awakening for us because we had not done forensics before. And, you know, all the assays we’d developed and all the work we’d done was really using clinical level samples where you have beautiful DNA, beautiful data every time. And we started talking to detectives and they were telling us, well, we only have this little, tiny amount of DNA. And from the very start it was always, can you do mixtures? Can you do mixtures, can you do mixtures? Like, what is a mixture? Why would this come up? And we realized that they’re very, very common in forensics. I mean, so many of the cases we work on have a sexual component. And so, there’s plenty of DNA from the guy, but there’s DNA from the victim as well. And so, it became very clear that we would need a solution for that.


Laura: And talking about that solution, I know you mentioned when we talked earlier about SNPs and STRs. Maybe you can talk about that in relation to mixed samples.


Ellen: So, when the samples come to us, they’ve already gone through STR testing. So, the lab that’s sending it to us already knows. They can tell us this is a mixture and the approximate ratio. And so, with STRs there are so many alleles that with mixtures… So, let’s say you have a locus and you see an 11, 12, 13 and 14. And you know your victim is 11,12. Well, then you know your guy is 13,14. That’s fairly simple. It’s much tougher if you just see an 11 and 12. Your victim is 11 and 12. Well, is he 11,11, 11,12 or 12,12? And that’s really the situation we’re always in. With SNPs there’s only two alleles. So, at a given SNP there’s only A or G. And so, the only signal that we have is how much A and how much G. And then we know what the victim’s genotype was. So, we have to figure out mathematically from that what his unknown genotype is.
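
The reasoning in this answer can be sketched in a few lines of Python. This is only an illustration under strong simplifying assumptions that are not from the interview: exactly two contributors, no allele dropout or noise, and a known mixture proportion. The function names and the nearest-expected-value rule are hypothetical, and this is not Parabon’s method, which Ellen describes later as a machine learning model.

```python
from itertools import combinations_with_replacement


def str_candidates(observed, victim):
    """Genotypes the unknown contributor could have at one STR locus.

    Assumes a two-person mixture with no dropout: every observed allele
    that the victim does not carry must come from the unknown person.
    """
    observed, victim = set(observed), set(victim)
    unexplained = observed - victim
    return [
        pair
        for pair in combinations_with_replacement(sorted(observed), 2)
        if unexplained <= set(pair)
    ]


def expected_a_fraction(poi_a_count, victim_a_count, poi_fraction):
    """Expected share of allele A in the mixed signal at one SNP.

    Each genotype is the number of A alleles (0, 1 or 2); poi_fraction is
    the person of interest's share of the DNA in the sample.
    """
    return (poi_fraction * poi_a_count / 2
            + (1 - poi_fraction) * victim_a_count / 2)


def infer_snp_genotype(observed_a_fraction, victim_a_count, poi_fraction):
    """Pick the unknown genotype whose expected signal is closest to the data."""
    return min(
        (0, 1, 2),
        key=lambda g: abs(
            expected_a_fraction(g, victim_a_count, poi_fraction)
            - observed_a_fraction
        ),
    )


# The two STR scenarios from the interview:
print(str_candidates([11, 12, 13, 14], [11, 12]))  # [(13, 14)]
print(str_candidates([11, 12], [11, 12]))  # [(11, 11), (11, 12), (12, 12)]

# A SNP where the victim is A/G and the person of interest is 60% of the sample:
print(infer_snp_genotype(0.78, victim_a_count=1, poi_fraction=0.6))  # 2, i.e. A/A
```

The STR case shows why four distinct alleles are easy and two shared alleles are ambiguous. The SNP case shows why the only usable signal is the balance between the two alleles, and why the victim’s genotype and the mixture proportion both have to be known or estimated first.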


Laura: Wow, that sounds incredibly complicated. And then if you throw in investigative genetic genealogy, does that complicate the picture? Does that change it? What does that look like?


Ellen: I think I’m on a bit of a crusade this year with the talks that I’m giving to really emphasize that. So, I’ve heard it said that with investigative genetic genealogy, the most important thing is getting a SNP profile that can be uploaded to a database, to Family Tree DNA or GEDmatch. And I’m on a bit of a crusade to say that’s not enough. It’s not just getting a profile that can be uploaded, it’s getting a profile that can be uploaded and get the right matches. So that’s really challenging. It’s not just, you know, can it upload, can I get some matches, but the matches that are there, do they make sense? Do they have the right amount of DNA sharing for a genealogist to be able to put them together? And so, what we learned very early on was that with mixtures, if you upload a mixture to GEDmatch, you just get nonsense. You get so many spurious matches because that data is just a mess. It’s not just matches to the perpetrator and the victim, it’s matches to completely random people in the database, and you just can’t work with that. And so it’s really important that the profile that you upload, if it says you share as much DNA as a second cousin with this person, it’s important that that be right. That it’s not really a first cousin. That it’s not really a third cousin. Because a genealogist is putting together all of these pieces, and if they have the wrong puzzle pieces to start with, they’re not going to be able to build what they need to find that guy.
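
To make the second-cousin point concrete, here is a minimal sketch of how the expected amount of shared autosomal DNA falls with each cousin degree. It uses the standard expectation for full nth cousins, (1/2)^(2n+1), which is general genetics rather than anything stated in the interview. Real matching tools report shared centimorgans and the ranges for different relationships overlap, so the nearest-match function below (a hypothetical helper) is only illustrative.

```python
import math


def expected_sharing(cousin_degree: int) -> float:
    """Expected autosomal fraction shared by full nth cousins: (1/2) ** (2n + 1)."""
    return 0.5 ** (2 * cousin_degree + 1)


def closest_cousin_degree(shared_fraction: float, max_degree: int = 5) -> int:
    """Cousin degree whose expected sharing is nearest on a log scale."""
    return min(
        range(1, max_degree + 1),
        key=lambda n: abs(
            math.log(shared_fraction) - math.log(expected_sharing(n))
        ),
    )


for n in (1, 2, 3):
    print(n, expected_sharing(n))  # 0.125, 0.03125, 0.0078125

# A true second cousin (about 3% shared) whose sharing is inflated to 10%
# by a bad profile would be read as a first cousin instead.
print(closest_cousin_degree(0.03))  # 2
print(closest_cousin_degree(0.10))  # 1
```

Each cousin degree differs by a factor of four in expected sharing, so a profile that distorts the shared amount by even a few-fold moves a match to the wrong place in the tree.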


Laura: Absolutely. It sounds like it could send you down the wrong path, when IGG is really there to narrow things down and help, you know, support the two things together.


Ellen: Yeah, unfortunately, at this point, I mean, mixtures… With the data, it’s very obvious that it’s mixed. And so now if you upload a mixture to GEDmatch, it will actually be flagged and not be usable. So, you at least don’t have to worry about that. But you still need to worry about making sure that you’ve deconvoluted accurately.


Laura: Okay, that makes a lot of sense. Can you talk about, or would it be helpful to talk about the types of mixtures you’ve worked with in relation to IGG?


Ellen: So with mixtures, our first question was what do mixtures even look like when you do SNP genotyping? Because that just wasn’t really something that had been studied before, because that hasn’t come up. I mean, it doesn’t come up in clinical work. And so, the first thing we did was actually generate a bunch of mixtures going down to like 0.1%, 1, 2, and up to 50% and just genotype them just to see, you know, what does this data even look like? Can we tell if it’s a mixture? Can we tell what the proportion is? That sort of thing. And so, it became very clear that really low-level mixtures, you know, less than 10%, basically don’t really impact the data much at all. So, you can just work with it as it is. At 10 to 20%, there might be a little bit that you need to do, but mostly you can just work with it. But above 20%, you very quickly start to see… I mean, it’s genuinely wrong data. I mean, it’s wrong genotypes that are not what you want to see. And so, if it’s above 20%, it really, really has to be deconvoluted. And so, what we did was we took all that data. You know, it’s 850,000 SNPs. So, I built a machine learning model with 850,000 individuals in it, where every SNP was an individual, to try and predict their correct genotype. So, we noticed that if the person of interest is below 40%, there’s really not quite enough signal there from his data to reconstruct his results accurately. And one thing that we noticed when we come into these cases, we have a reported likely mixture ratio based on the STR profile. And one thing we realized is that the mixture ratio that comes out in STRs is not necessarily the same as will come out in SNPs, which is a little strange, but we realized that we can’t just say, oh, they told us that it’s a 30% mixture, we can just say it’s a 30% mixture. We actually need to study the data.
And so, another thing I had to put together was all this research, all of these thresholds, all of this mechanism for going from that data to accurately deconvoluted data and, you know, making sure that you’ve really subtracted out that victim entirely.
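
The thresholds Ellen describes can be written down as a simple triage rule. This is a sketch of the numbers as stated in the interview, not Parabon’s actual procedure. The function name, the labels and the handling of values that fall exactly on a boundary are assumptions, and the proportion is taken to be one estimated from the SNP data itself, since she notes that the STR-reported ratio may not carry over.

```python
def triage_mixture(poi_pct: float) -> str:
    """Suggest how to handle a two-person mixture.

    poi_pct is the person of interest's estimated share of the DNA, in
    percent; the other contributor is assumed to make up the rest.
    """
    if not 0 <= poi_pct <= 100:
        raise ValueError("poi_pct must be between 0 and 100")
    if poi_pct < 40:
        # Below about 40% there is not enough signal to reconstruct the profile.
        return "insufficient_signal"
    other_pct = 100 - poi_pct
    if other_pct < 10:
        # Low-level mixtures barely affect the genotypes.
        return "use_as_is"
    if other_pct <= 20:
        # Mostly usable, with a little extra handling.
        return "minor_adjustment"
    # Above about 20% the raw genotypes are wrong and must be deconvoluted.
    return "deconvolute"


for pct in (95, 85, 60, 30):
    print(pct, triage_mixture(pct))
```

With these inputs the four calls fall into the four categories in order: usable as-is, minor adjustment, deconvolution required, and insufficient signal.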


Laura: That’s really remarkable. Between machine learning and all of the data that you had to process and, you know, really work through so that you could come up with these percentages and understand where, you know, the threshold was, how long did that take?


Ellen: Oh, that’s a great question. I actually don’t know.


Laura: Okay. I was just curious.


Ellen: Well, it’s funny. I mean, this came up, we sort of did this on our own. It was not like a funded project. It was just clearly we need to be able to handle mixtures. And I still remember very early on… It may have been an early ISHI even, talking to someone from a lab and them mentioning, “Well, we always have the victim standard.” Really? You do? Oh, okay. That changes everything. Because if we have the mixture and we know her genotypes, we can do a great job. If we don’t know her genotypes, it’s very hard to accurately reconstruct the data that you’re looking for. But so that just sort of was like a light bulb. I bet I could do that. I bet I could figure this out if we had enough data and a good enough model.


Laura: Well, it’s remarkable. I can’t wait to hear your talk.


Ellen: You’ve pretty much heard it now.


Laura: Well, thank you. It’s very nice to be able to do that. So sometimes I’m in the room.


Ellen: My talk will have more graphs though.


Laura: Okay so I’ll probably look at your PowerPoint for that. So, all of this brings up an interesting question. Since this goes out to a YouTube audience, you know, maybe we can talk about the difference between bioinformatics and, say, a traditional analyst. What’s the difference?


Ellen: So, like a traditional DNA analyst.


Laura: Yeah.


Ellen: Well, with the bioinformatics that I’m doing, a lot of it is really finding new analyses. And I know forensics is so much about, you know, we have a process. It’s been validated. We do that process exactly the same every time. And we get to that point. I mean, we have an SOP for mixture deconvolution. We have all these scripts written that just run the same way every time. But before those existed, I had to build them. And so, I had to write all that new stuff. And one thing I wanted to mention is that when we went through our validation for New York accreditation, mixture deconvolution was one of the things that we took through that validation. And so, it is allowed to be used in New York, which, you know, is not that big of a deal. But I think the big deal is the fact that it was validated to the level that it could be accredited for use in that way.


Laura: I think that is remarkable. I mean, is that something you’ll need to do in other states as well, or how does that work?


Ellen: So, New York has very specific rules about what you can do with DNA originating there. So, for a while you couldn’t do like 23andMe, for example, from New York because they did not have New York accreditation. So, they’re the one state that has these really specific rules, which, you know, is great if you want to make sure that everything is done correctly, which is, like I say, it’s great, but it’s a big hurdle. And so fortunately we will not have to go through that for other states. Although interestingly, as investigative genetic genealogy has become more and more popular, there are starting to be some laws put into place. So, the first one was in Maryland, and one of the things they say is that you have to use a lab that has some sort of accreditation. To be defined later. I mean, they don’t even know what that is. I’ve argued you should certainly accept New York accreditation if that already exists. Sort of grandfather that in. But I mean, they wrote this law, but they don’t have a system in place to actually accredit anyone. So, we’ll see what happens. Hopefully Maryland will get to continue using genetic genealogy, but it’s sort of another hurdle. And so, we’ll see if other states start putting those into place as well, I bet.


Laura: I mean, I bet the next couple of years are really going to determine where that goes and who decides to accredit or what other kind of standards or best practices could be used to achieve the same goal.


Ellen: Yeah, we’ll see what happens. I mean, there’s so many private companies now offering genetic genealogy. It’s become a little cutthroat. But also there’s the FBI and, you know, so this effort to sort of bring it into the lab, this validation, um, to get this new technique usable by forensic practitioners. And so we’ll see how it goes.


Laura: Which also brings up an interesting question. With IGG being used so much in court cases, um, how does that affect everything? Have you been involved? Have you seen some of the challenges or processes, you know, when you’re talking about bioinformatics and deconvoluting… Has that come to a court case yet?


Ellen: No, I don’t think that has yet. So, we have been involved in a lot of court cases. Obviously, I don’t know the numbers, but it’s, you know, somewhere in the 40s in terms of the number of convictions and pleas. And so fortunately at Parabon, we have a legal liaison who works with the attorneys and sort of translates it into questions that I understand. We get a lot of requests from the defense for discovery. You know, all the records that we used, all of our reports, all of our lab notebooks, everything that we’ve done, all of our procedures, we have to turn that over. Thus far, we haven’t actually had to testify in any cases, because in every case, the judge has ruled that the genealogy is not important for… I mean, it’s how they found the guy, but it’s not why he’s being prosecuted. He’s being prosecuted because of the STR match. The genetic genealogy helped them get there, but it was that STR match that says, yeah, this was the guy. And so yeah, we have so many court cases. I mean, just on Friday I was filling out a discovery request with what must have been like 100 questions. And so many of them are like, I’m not sure I even understand what the question is, but I’ll try to write up an answer and say, I think this is what you mean. And so, it’s tricky. I mean, sort of translating what we do into legalese, I guess, and back again, is challenging. But yeah, we are seeing a lot of court cases that are involving IGG.


Laura: It’s definitely all over the news. I mean, every day another case breaks. And it had played some role. You know, certainly it’s a mix of other things. I imagine, yes, asking the questions, if you don’t know what question you’re exactly asking, that’s going to be hard for you to answer.


Ellen: Yeah. Like one of the questions was like, you know, in your paper that you wrote, why did you cite this other paper in this particular way?


Laura: Why, why, yeah. Oh, I can’t even imagine. Um, well, how about phenotyping? Let’s talk about some of the other work that you do at Parabon before we let you go and get back to the conference.


Ellen: And so phenotyping is where Parabon started in the forensics world, of course. And just like mixture deconvolution, it also started as a research project. I mean, I was hired with sort of the question, is it going to be possible to predict what someone looks like, and if so, with what level of confidence, what traits are we going to be able to predict? And those were all open questions at the time. And when I started working at Parabon, it was known that you could distinguish between like blue eyes and brown eyes, and that was about it. You know, what about green eyes? What about hair color? What about everything else? Um, these days, I mean, phenotyping is still an important part of what we do. Um, especially it really works alongside the genetic genealogy. So that was one thing we learned. We realized early on we were like, you know, people really just want genealogy. But the genealogists, they like having that phenotype information. And so, they were asking for it anyway and we were like, well, as long as we’re running it anyway, we might as well give it to the agency. So, we had to sort of put together a smaller report that no longer has the composite. They can still do that if they want to, but most of the time it’s not needed. It just says, here’s this person’s ancestry and their eye color and their hair color and their skin color and their freckling and, you know, just sort of have that information because the genealogists can find it really useful. I mean, sometimes it’s not useful, but other times it can help them figure out what branch of the family they should look at. I mean, I know it’s come across. I said, this guy has bright red hair and she’s looking through this family and it’s like a case from Oregon or something like that. And she’s like, well, there’s this branch of the family that moved to Oregon and they all have bright red hair. I’m going to focus there.
And similarly, there was a case where she had been looking through this family and had found someone who, you know, the victim had said something about, you know, he smelled like grease or something. And so, she was like, oh, he works in a car shop. But, you know, he has blue eyes and you predicted brown eyes. And what are the probabilities? I was like, that’s really unlikely. It’s, you know, and she kept looking and found someone else, you know, who also had that job, you know, and he was the guy. And so, the phenotyping did help. But I say these days, you know… I can tell you that he has blue eyes, but you may not care because CeCe can tell you his name and address.


Laura: Right. So, it’s all these different pieces, depending on what you have or don’t have. Any advancement in phenotyping, say, in the last year since we’ve gotten together, to talk about?


Ellen: The Parabon phenotyping has not changed in the last few years.


Laura: They’re pretty far along. So it makes sense. Yes.


Ellen: I mean, there’s certainly a lot going on in the literature. I guess I say this every year that I’m excited about the possibility of determining age from a DNA sample. Still TBD on forensic samples, but, you know, it’s getting there. And I know there was some research on, like, aging, like people who genetically seem to age faster physically. So, we’ll see. But at this point, what we’ve got seems to be working pretty well.


Laura: Okay, that sounds good. Yes, I did some sequencing of my own and it said I was faster aging. I didn’t appreciate that. If I need you know, I put that in there. I mean, I don’t think I need to know about that. I’d rather just think. No, um, anything we missed that you want to be sure we include?


Ellen: Well, from a genetic genealogy point of view, one of the most important things is following the rules. And I know that sounds very square, but it’s true. I mean, genetic genealogy is sort of the ultimate citizen science. It relies on the support of millions of people who voluntarily put their DNA in Family Tree DNA or in GEDmatch and opt in to law enforcement matching. And if those databases go away, if those people go away, it’s over. And this amazing tool that’s been solving so many cases is gone. And so, we think it’s really, really important to follow the terms of service as they’re written, and not upload to databases that don’t allow law enforcement usage, and not upload in a way that you get around the law enforcement strictures. I mean, because it’s just so important. It’s not worth it for this one case to get some extra matches if you’re going to blow it for everybody else. And so that’s something that we really tell people. I mean, there’s so many genealogy providers out there now. And that’s something that we tell people to ask is like, are they going to follow the rules? Are they going to put your case in danger by uploading somewhere where the people didn’t consent to that usage? And, you know, are you willing to risk that? And so, we think it’s extremely important these days that everybody sort of follow the rules.


Laura: I think that’s an amazing, excellent point to make right now. I mean, I can’t tell you every day there’s another story that really centers around that. And the bottom line is follow the rules and then everything will be okay. Yeah.


Ellen: Yeah, exactly. I mean, we now have this system in place where everybody who’s participating is opted in. And that’s great. The fact that we have that is great. And so that we can be confident that no one is upset about the fact that their DNA is being used this way. And so, we need to respect that. That’s really what it is. We need to respect the choices and the consent that people give.


Laura: Yeah, the opt-in is key, I agree. Yes. Well, since you’ve been with us for so many years, we always like to ask why do you like to come to ISHI and present here? What are some of your favorite parts?


Ellen: So I’ve been coming to ISHI since 2013. I think I’ve missed two: one after I just had a baby, and one because of Covid or something like that. Um, and I was actually considering putting in an abstract this year on an analysis of ISHI talks over the years, um, to show like what proportion of them are about SNPs versus STRs. And, thinking about the amount of work that it would have been to go through every abstract book and how many SNPs are they using? Okay. That’s fine. It would have been a lot of work, so I ended up not doing it. But I think it would be really interesting to see how the field has changed. In the ten years that I’ve been coming, I’ve seen such a change, and I think it would be pretty neat to see sort of a little graph of that. And so, ISHI’s the place to be. You have to come to ISHI if you’re doing forensic DNA, if you want to be able to talk to the people who are doing it. And I mean, being able to speak this year is really wonderful. I mean, I always put in abstracts, but I assume I’ll get a poster. So, I was very surprised this year to be selected to speak. And so, I’m really excited to be able to get back in front of that audience. It’s been a while for me. Um, I think I’ve only spoken at ISHI once before, and it was about phenotyping back in like 2014 or so, long ago. Uh, it was a completely different world.


Laura: You have an incredible memory because I think it was 2014 or 2015, because I was going back to see when we had spoken and did an on-camera interview. Yeah, we did do something over Covid via video, but yes, we did. Yeah. Um, that was harder. Yeah, that was about a poster. I love your idea about the graph. There needs to be a way to work together to do that. We’re coming up on the 35th anniversary next year, so that’ll be a big one. And we’ll do something special. Um, along those lines, these are optional questions for you, but with the 35th year anniversary coming up, what would you say over your career has been the biggest innovation or surprise that you’ve seen? It’s a big question.


Ellen: I don’t know that I have a good answer for that.


Laura: I mean, that’s okay. If you go yeah, it’s definitely a big one. Well, how about this. This is speculative, so we’re holding nobody to this. But what do you think will be the next big thing that we’ll see over the next five, ten, however many years?


Ellen: Well, I think the work that we’ve been doing, this lead generation has really been focused on cold cases, but we started seeing these active cases happen more and more frequently where detectives are thinking ahead. Well, okay, I didn’t get a hit in CODIS. Is there more that DNA can tell me? And so, it does seem like cases that need to be turned around quickly are starting to be done using genetic genealogy, which is very exciting to see. And we’re also seeing a lot of advances on the laboratory side. It’s pretty neat these days, actually. So, a lot of these ideas are coming from the ancient DNA community. So that’s an area where… I mean, I remember when I was applying to grad school, I was like, I’d love to do ancient DNA. And they’re like, it would be so much work to build a lab specifically for that because it was… So I mean, it’s still so technical, but it’s been fascinating to see. We sequenced one ancient mitochondrial genome, and now it’s like, you can’t even get published unless you’re sequencing like 80 individuals or 100 individuals, all of this ancient DNA. And so it’s amazing that they’re able to do that. And so that was sort of like a, well, they’re sequencing things that are thousands of years old. We’ve got something that’s 50 years old. Granted it’s a forensic sample. We can’t just pick and choose among all the different remains in a cave or something like that. But, you know, we should be able to use those same techniques. And so, we’re seeing really that we’re able to work with these really, really challenging samples that, I mean, we’ve gone down at this point to 114 picograms and also to samples that are only like 3% human DNA. And the rest of the DNA is bacterial. You need to be able to work with those samples because so many forensic samples are really difficult like that.
And so, we’re seeing, I mean, really with the use of whole genome sequencing and enrichment techniques and also imputation on the bioinformatics side, you’re taking those really challenging samples. Again, all you have is what’s there. I mean, you don’t have anything else that you can work with and making sure you’re getting the best possible data out of that. And again, the important thing being not just uploading but uploading and getting the right answers so that you can solve those cases.


Laura: That is an amazing answer. Thank you. No, I love that. It’s incredible. Ellen, thank you so much for coming in and being with us. We appreciate it every year. And I can’t wait to hear more about your presentation after you give it. And we really appreciate you at ISHI. Thank you.