Using a program called lobSTR, researchers extracted individual genetic fingerprints from an online database and compared them against gene-based genealogy sites, social media and other online tools to find matches.
The University of Utah will not be notifying roughly 50 people after a team of researchers said they were able to identify them from DNA samples anonymously submitted to the 1000 Genomes Project.
Using a program called lobSTR – named for the short tandem repeats (STRs) it analyzes – researchers from the United States and Israel extracted individual genetic fingerprints and compared them against gene-based genealogy sites, social media and other online tools to find matches.
“What we wanted to do with this project, is illuminate the current limitations for privacy and call for a discussion with the general public about how we should move forward,” Yaniv Erlich, study co-author and researcher at the Whitehead Institute for Biomedical Research, told SciFare.com.
First, the researchers used lobSTR to parse through the database for STRs on the Y-chromosome – the company that provides paternity testing for Maury Povich uses the same markers if the child is a boy.
“Once we have these markers in hand, we take them to a recreational, genetic genealogy database,” Erlich, whose lab develops tools that help researchers harness the power of such databases, said.
Thanks to the falling cost of genome sequencing, the genealogical databases have also started including STR information on the Y-chromosome – there are a lot of Smiths in the world and they’re not all related.
As long as a relative, even one the donor doesn’t know, has deposited the same information in the genealogical databases, the researchers can compare the unknown Y-chromosome STRs with those in the database to obtain a possible surname.
“There’s a strong correlation between these markers on the Y-chromosome and surname,” Erlich said.
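The surname lookup described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the researchers’ actual method: the marker names follow the common DYS naming style, but the repeat counts, the toy database and the helper names (`marker_distance`, `infer_surname`) are all invented for this example.

```python
from collections import Counter

# Hypothetical toy genealogy records: (surname, {marker: repeat count}).
# Every value here is invented for illustration only.
GENEALOGY_DB = [
    ("Smith", {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS393": 13}),
    ("Smith", {"DYS19": 14, "DYS390": 24, "DYS391": 10, "DYS393": 13}),
    ("Jones", {"DYS19": 15, "DYS390": 22, "DYS391": 10, "DYS393": 12}),
    ("Lee",   {"DYS19": 16, "DYS390": 25, "DYS391": 11, "DYS393": 14}),
]


def marker_distance(profile, record):
    """Count markers in the profile whose repeat count differs in the record.

    A marker missing from the record counts as a mismatch.
    """
    return sum(1 for marker, count in profile.items()
               if record.get(marker) != count)


def infer_surname(profile, db, max_mismatches=1):
    """Return the most common surname among close matches, or None."""
    hits = Counter(
        surname for surname, markers in db
        if marker_distance(profile, markers) <= max_mismatches
    )
    return hits.most_common(1)[0][0] if hits else None


# An "anonymous" Y-STR profile, also invented.
unknown = {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS393": 13}
print(infer_surname(unknown, GENEALOGY_DB))  # prints "Smith"
```

Allowing one mismatch reflects the idea that relatives can differ slightly at a marker; the tolerance chosen here is an assumption, not a figure from the study.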
It’s not an identity match though – this is where the detective work starts. Alongside the donor’s DNA are the age of the individual and the state where they resided at the time – these details aren’t protected by privacy laws.
“We take the age and state, together with the surname we inferred, then we use Google and other public search engines,” Erlich said.
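To see why an inferred surname plus age and state narrows the field so quickly, consider this rough Python sketch. The `PublicRecord` type, the `narrow_candidates` helper, the one-year age tolerance and every record in the list are hypothetical stand-ins for whatever a public search might turn up; none of it comes from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PublicRecord:
    """A hypothetical entry found through a public search."""
    name: str
    surname: str
    age: int
    state: str


def narrow_candidates(records, surname, age, state, age_tolerance=1):
    """Keep records matching the inferred surname, the state and roughly the age."""
    return [
        r for r in records
        if r.surname.lower() == surname.lower()
        and r.state == state
        and abs(r.age - age) <= age_tolerance
    ]


# Invented records for illustration only.
records = [
    PublicRecord("A. Smith", "Smith", 67, "UT"),
    PublicRecord("B. Smith", "Smith", 34, "UT"),
    PublicRecord("C. Smith", "Smith", 67, "OH"),
    PublicRecord("D. Jones", "Jones", 67, "UT"),
]

matches = narrow_candidates(records, surname="Smith", age=67, state="UT")
print([r.name for r in matches])  # prints ['A. Smith']
```

Each attribute on its own is shared by many people, but the intersection of all three can leave very few candidates.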
The lobSTR program wasn’t just created to re-identify individuals and expose vulnerabilities in genomic databases. Several diseases – including Fragile X syndrome, which is characterized by an over-copying of the CGG sequence – are linked to regions of DNA where the copying of repeats goes wrong and impairs the affected gene’s ability to function properly.
But Erlich once worked for a private security company hired to hack into and assess the security infrastructure of private businesses, so it’s no shock that he saw lobSTR could serve more than one purpose.
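The idea of a repeated unit such as CGG being copied too many times can be illustrated with a short Python sketch that finds the longest uninterrupted run of a repeat unit in a DNA string. This is a toy under stated assumptions: the `longest_tandem_run` helper and the sample sequence are invented, and the approach is a simple pattern search, not how lobSTR itself profiles repeats from sequencing data.

```python
import re


def longest_tandem_run(sequence, unit="CGG"):
    """Return the largest number of back-to-back copies of `unit` in `sequence`."""
    pattern = re.compile(f"(?:{re.escape(unit)})+")
    runs = [
        len(match.group(0)) // len(unit)
        for match in pattern.finditer(sequence.upper())
    ]
    return max(runs, default=0)


# Invented sequence: seven CGG copies, an interruption, then three more.
toy_sequence = "ATG" + "CGG" * 7 + "AGG" + "CGG" * 3 + "TTA"
print(longest_tandem_run(toy_sequence))  # prints 7
```

The same count taken at a Y-chromosome marker is the kind of number that makes up the genetic fingerprint described earlier.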
Fortunately for Erlich’s team, high-profile researchers – including Craig Venter, who wasn’t available to comment on this story – have made their personal sequences available online.
These sequences act as controls for the data mining experiment and help the team test whether lobSTR can get the job done – during the interview, Erlich used Venter’s DNA to show me how simple it can be to get an accurate surname after the unknown DNA has been analyzed by lobSTR.
The researchers weren’t always able to identify an individual, but in some cases they were even able to identify the individual’s spouse – either way, they weren’t allowed to notify the people involved.
“I’m not allowed to contact them because I was not part of the original study,” Erlich said.
From just five sequences though, researchers assembled three family trees – one spanning six generations – and placed nearly 50 relatives in them.
Because the research was spearheaded by the University of Utah, its institutional review board is tasked with deciding whether to notify the participants.
The Vice President for Research Integrity, Dr. Jeffrey Botkin, told SciFare.com they won’t be notifying individuals because it may create more anxiety and concern than is warranted by the situation.
“If we’re getting concerns from our research community and these participants are picking up on the news articles and expressing concern, then I think we’ll have ongoing dialogue and reconsider all of the options,” Botkin said.
In response to the research findings, the National Institutes of Health announced that they’ve moved the demographic information to a controlled access database.
“Researchers have to request them specifically,” Laura Rodriguez, Director of the Division of Policy, Communications, and Education at the NIH’s National Human Genome Research Institute, told SciFare.com. “The access is definitely provided, but, it’s provided under the standard agreement which includes not attempting to try and identify the individuals.”
George Church is the Director of the Personal Genome Project – another research project that involves the collection of DNA sequences for analysis. He calls the NIH’s reaction a weak bandage to a real concern.
“It’s almost window dressing,” Church, who’s also a Professor of Genetics at Harvard Medical School, told SciFare.com. “We predicted in 2005 there would be problems and you can’t just keep hoping they’ll go away – it’s getting worse, it’s not getting better.”
“There are a number of organizations, in addition to NIH, that are putting bandages on these things ineffectually and pretending like they can continue to keep doing it,” Church said.
So Church and his colleagues at the Personal Genome Project crafted their protocol knowing that incidents of data re-identification and data escape were on the rise and inevitable.
“The combination of escape and re-identification means that no matter where you put this data, you’re making false promises to people,” Church said. “What we decided to do was just be totally transparent.”
“More importantly, we educated our cohort in advance,” he added.
There’s even a test administered to ensure the participants understand the risks and benefits of participating in the study – the researchers call this informed consent.
“The correct way is to be very frank and open,” Church said. “They’re giving of themselves, so, we should be giving back to them a realistic representation, not wishful thinking.”
That’s also why Church is critical of the University of Utah’s decision to not contact those who might be involved.
“Some people just get upset when someone lies to them, even if it’s not a life-or-death thing,” Church said. “It’s an issue of trust and I think that’s going to be on everybody’s lips for a few days.”
The Personal Genome Project (PGP) method isn’t just an alternative protocol. It has also been approved by an Institutional Review Board, which makes it a viable option, and Erlich said it could even be a model going forward.
“If there’s some specific study where the protocol won’t work then, by all means, pursue other possibilities,” Church said.
Because identification isn’t sold as some remote possibility – the method used in the current research isn’t like any of the theoretical possibilities listed in the 1000 Genomes Project consent form – the database also becomes an interactive community.
“To some extent they become living Rosetta stones,” Church said.
And that’s a helpful tool for researchers who study the secrets held within the sequences of As, Ts, Cs and Gs.
Check out the 1000 Genomes Project consent form here.
Check out the Personal Genome Project consent form here.
The research was published in the latest issue of the journal Science.
The response from the NIH can be found here.