Updated Friday, January 18, 2013 at 01:25 PM
The genetic data of more than 1,000 people from around the world seemed stripped of anything that might identify them individually. All that was posted online were those data, the ages of the individuals and the region where each of them lived.
But when a researcher randomly selected the DNA sequences of five people in the database, he figured out who they were and identified their entire families, although the relatives had no part in the study. His foray into genomic sleuthing ended up breaching the privacy of nearly 50 people.
All it took was triangulation, using the genetic data, a genealogy website and Google searches. While the methods for extracting relevant genetic data from the raw genetic-sequence files were specialized enough to be beyond the scope of most laypeople, no one expected it would be so easy to zoom in on individuals.
“We are in what I call an awareness moment,” said Eric Green, director of the National Human Genome Research Institute at the National Institutes of Health (NIH).
The researcher did not publish the names he found. But the exercise revealed a growing tension between the advancement of medical research, which often requires making genetic information public so scientists can use it, and protecting the privacy of study subjects.
The paper, published Thursday in the journal Science, follows other reports that identified people whose genetic data were online. But none had started with such limited information: just the long string of DNA letters, an age and, because the study focused on only U.S. subjects, a state.
“I’ve been worried about this for a long time,” said Barbara Koenig, a researcher at the University of California, San Francisco, who studies issues involving genetic data. The new paper is “amazing,” she added, but “we always should be operating on the assumption that this is possible.”
There is no easy answer about what to do. Make study subjects more aware they could be identified by their DNA sequences? Keep more data locked behind security walls? Institute severe penalties for those who invade the privacy of study subjects? None of the above?
“We don’t have any claim to have the answer,” Green said.
Opinions about what should be done vary greatly among experts.
But after seeing how easy it was to find the individuals and their extended families, the NIH removed people’s ages from the public database, making it more difficult to identify them.
Dr. Jeffrey Botkin, associate vice president for research integrity at the University of Utah, which collected the genetic information of some research participants whose identity was breached, cautioned about overreacting. Genetic data from hundreds of thousands of people have been freely available online, he said, yet there has not been a single report of someone being illicitly identified.
He added that “it is hard to imagine what would motivate anyone to undertake this sort of privacy attack in the real world.” But he said he had serious concerns about publishing a formula to breach subjects’ privacy. By publishing, he said, the investigators “exacerbate the very risks they are concerned about.”
The project was the inspiration of Yaniv Erlich, a human-genetics researcher at the Whitehead Institute, which is affiliated with the Massachusetts Institute of Technology. He stresses that he is a strong advocate of data sharing and would hate to see genomic data locked up. But when his lab developed a new technique, he realized he had the tools to probe a DNA database. And he could not resist trying.
The tool allowed him to quickly find a type of DNA pattern that looks like stutters among billions of chemical letters in human DNA. Those little stutters — short tandem repeats — are inherited.
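To make the idea of a "stutter" concrete, here is a minimal Python sketch that counts how many times a short motif repeats back to back in a DNA string. It is not the lab's tool, which the article does not describe; the function name `longest_tandem_run`, the motif `GATA` and the toy sequence are all invented for illustration, and real sequencing data would need far more careful handling.

```python
import re


def longest_tandem_run(sequence: str, motif: str) -> int:
    """Return the largest number of back-to-back copies of `motif` in `sequence`."""
    # Match one or more consecutive copies of the motif.
    pattern = re.compile(f"(?:{re.escape(motif)})+")
    runs = [len(m.group(0)) // len(motif) for m in pattern.finditer(sequence)]
    # A sequence with no copy of the motif has a repeat count of zero.
    return max(runs, default=0)


# Toy sequence (made up): five copies of GATA in a row, then two more further along.
toy = "TTAC" + "GATA" * 5 + "CCGT" + "GATA" * 2 + "AAG"
print(longest_tandem_run(toy, "GATA"))
```

The repeat count at a given spot in the genome is the number that gets compared between men; a set of such counts at several spots forms the profile a genealogy site would search on.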
Genealogy websites use repeats on the Y chromosome, the one unique to men, to identify men by their surnames, an indicator of ancestry. Any man can submit the short tandem repeats on his Y chromosome and find the surname of men with the same DNA pattern. The sites enable men to find their ancestors and relatives.
So, Erlich asked, could he take a man’s entire DNA sequence, pick out the short tandem repeats on his Y chromosome, search a genealogy site, discover the man’s surname and then fully identify the man?
He tested it with the genome of Craig Venter, a DNA-sequencing pioneer who posted his own DNA sequence on the Web. He knew Venter’s age and the state where he lives. Bingo: Two men popped up in the database. One was Craig Venter.
“Out of 300 million people in the United States, we got it down to two people,” Erlich said.
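The narrowing Erlich describes can be sketched as two filters: match a repeat profile to a surname, then keep only people with that surname, age and state. The Python below is a toy illustration under stated assumptions, not the researchers' method. The marker names (`M1`, `M2`, `M3`), the surnames, the people and every number are invented, and `surnames_for_profile` and `narrow_candidates` are hypothetical helpers, not functions of any genealogy site.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    name: str
    surname: str
    age: int
    state: str


def surnames_for_profile(profile, surname_db, max_mismatches=0):
    """Return surnames whose reference repeat counts match `profile`.

    `surname_db` maps a surname to a dict of marker -> repeat count.
    Only markers present in both profiles are compared.
    """
    hits = []
    for surname, reference in surname_db.items():
        shared = profile.keys() & reference.keys()
        if not shared:
            continue
        mismatches = sum(profile[m] != reference[m] for m in shared)
        if mismatches <= max_mismatches:
            hits.append(surname)
    return sorted(hits)


def narrow_candidates(people, surnames, age, state):
    """Keep only people with a matching surname, age and state."""
    wanted = set(surnames)
    return [p for p in people
            if p.surname in wanted and p.age == age and p.state == state]


# Invented data: a tiny surname database and a tiny public-records list.
SURNAME_DB = {
    "Alder": {"M1": 14, "M2": 23, "M3": 11},
    "Birch": {"M1": 13, "M2": 24, "M3": 11},
}
PEOPLE = [
    Person("Pat Alder", "Alder", 66, "CA"),
    Person("Sam Alder", "Alder", 41, "CA"),
    Person("Lee Alder", "Alder", 66, "NY"),
    Person("Kim Birch", "Birch", 66, "CA"),
]

profile = {"M1": 14, "M2": 23, "M3": 11}
surnames = surnames_for_profile(profile, SURNAME_DB)
matches = narrow_candidates(PEOPLE, surnames, age=66, state="CA")
print(surnames, [p.name for p in matches])
```

Each filter alone leaves many candidates; it is the combination of surname, age and state that shrinks the pool so sharply.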
On the Web and publicly available are DNA sequences from subjects in an international collaboration, the 1000 Genomes Project. People’s ages were included and all the Americans lived in Utah, so the researchers knew their state.
Erlich began with one man from the database. He got the Y chromosome’s short tandem repeats and then went to genealogy databases and searched for men with those same repeats. He got surnames of the paternal and maternal grandfather. Then he did a Google search for those people and found an obituary. That gave him the family tree.
“Oh my God, we really did this,” Erlich said. “I had to digest it. We had so much information.”
He and his colleagues went on to get detailed family trees for other subjects and then visited Green and his colleagues at the NIH to tell them what they’d done.
They were referred to Amy McGuire, a lawyer and ethicist at Baylor College of Medicine in Houston. She, like others, called for more public discussion of the situation.
“To have the illusion you can fully protect privacy or make data anonymous is no longer a sustainable position,” McGuire said.
When the subjects in the 1000 Genomes Project agreed to participate and provide DNA, they signed a form saying the researchers could not guarantee their privacy. But, at the time, it seemed like so much boilerplate. The risk, Green said, seemed “remote.”
“I don’t know that anyone anticipated that someone would go and actually figure out who some of those people were,” McGuire said.