corneliabl: Errors in Sequence Databases

Sandra Porter at Discovering Biology in a Digital World brings up an issue that has been bugging me for two decades [Biologists vs. the Age of Information]. The issue is the accuracy of information in biological databases.

Let's begin with GenBank, the main database of nucleotide sequences at the NCBI. Sequence data are submitted to GenBank by researchers or sequencing centers. If mistakes are found, the records can be updated by the submitters, or by third parties if the corrected versions are published. These corrections don't always happen, though, and the requirement that third-party annotations be published makes it pretty unlikely that anyone will submit small corrections to a sequence.

This is why we see these kinds of quotes from Steven Salzberg (3):
So you think that gene you just retrieved from GenBank [1] is correct? Are you certain? If it is a eukaryotic gene, and especially if it is from an unfinished genome, there is a pretty good chance that the amino acid sequence is wrong. And depending on when the genome was sequenced and annotated, there is a chance that the description of its function is wrong too.

This is a serious problem. Most people don't realize that GenBank is full of sequences that are known to be incorrect and/or poorly annotated. In most cases, the errors are relatively minor, such as one or two incorrect codons or the deletion of a single codon. In other cases, the errors are more important, such as a pseudogene being represented as a real gene, or missing exons. Sometimes the identity of a gene is completely wrong. I've even seen examples where the species is incorrectly identified.

Sandra asks,

So what do we do? Do we care if the database information is up-to-date? If so, who should be responsible for the updates?

I'm sure some people would like the NCBI to be the final authority and just fix everything but I don't think that's very realistic.

Other people have proposed that wikis are the answer. Maybe they're right, but I really wonder if researchers would be any better at updating wikis than they are at updating information in places like the NCBI.

Well, dear readers, what do you think? Does GenBank need to be fixed? Do we just need more alternatives? Does it even matter?

Back in 1992, I spent part of a summer at the GenBank site in Los Alamos (New Mexico, USA). That was before GenBank moved to NCBI in Bethesda. My task was to explore the possibility of curating GenBank to fix all the errors. I worked with the HSP70 sequences since I had already documented most of the errors in those sequences (The HSP70 Sequence Database).

We decided that I could make corrections to any HSP70 sequence as long as I annotated the changes and got permission from the authors by phone.¹ This didn't work. Most of the authors were unwilling to allow changes because they weren't aware that their sequences conflicted with the aligned sequence database. They didn't even know that others had sequenced the same gene and gotten a different sequence.

We discussed this problem. At the time, everyone was aware of the fact that the SwissProt database was curated and that the curators were making decisions on their own about which sequences were correct and which ones were errors. Here's an example of the entry for human HSPA1A showing the conflicts and variations.

Sometimes the SwissProt curators get it wrong and identify the correct sequence as an error and vice versa. Sometimes they really screw up. Here's an example of that mistake [P23931].

Curating a sequence database is incredibly expensive. You need to hire hundreds of competent workers who can analyze every sequence as it comes in. There are some tools that will help identify errors but in order to reach an acceptable level of accuracy you need to build aligned sequence databases for every gene. That can't be done automatically; you need to have real people look at the data and make the best alignment if you are going to use it to make judgements on the accuracy of a submitted sequence.
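To give a rough idea of the kind of check an aligned sequence database makes possible, here is a minimal Python sketch. It assumes the sequences for one gene have already been aligned by a person into rows of equal length, with "-" marking gaps. The function name `flag_conflicts`, the 0.75 agreement threshold, and the toy sequences are all invented for illustration; none of them come from GenBank, SwissProt, or the HSP70 database. The sketch only points at positions where one sequence disagrees with a strong majority. Deciding which sequence is actually correct still takes a human.

```python
from collections import Counter


def flag_conflicts(alignment, min_agreement=0.75):
    """Return {name: [(column, residue, consensus), ...]} listing residues
    that disagree with a strong majority in their alignment column.

    `alignment` maps a sequence name to an already-aligned string.
    Columns are reported 1-based. The threshold is an arbitrary assumption.
    """
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(seq) != length for seq in alignment.values()):
        raise ValueError("all aligned sequences must have the same length")

    conflicts = {name: [] for name in names}
    for col in range(length):
        column = [alignment[name][col] for name in names]
        consensus, count = Counter(column).most_common(1)[0]
        if count / len(column) < min_agreement:
            continue  # no clear majority, so nothing can be flagged here
        for name, residue in zip(names, column):
            if residue != consensus:
                conflicts[name].append((col + 1, residue, consensus))
    return conflicts


# Toy data, made up for this example: seq_D has one substitution and
# one single-residue deletion relative to the other three.
toy_alignment = {
    "seq_A": "MAKAPAVGIDLGTT",
    "seq_B": "MAKAPAVGIDLGTT",
    "seq_C": "MAKAPAVGIDLGTT",
    "seq_D": "MAKGPAVGID-GTT",
}

for name, hits in flag_conflicts(toy_alignment).items():
    for col, residue, consensus in hits:
        print(f"{name}: column {col} has {residue!r}, consensus is {consensus!r}")
```

Note that a majority vote like this can flag the correct sequence as the error if several submissions share the same mistake, which is one more reason the alignment and the final call can't be left to a script.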

The final decision at GenBank was to forget about correcting errors and treat the database as an archive of submitted sequences. It would be up to every researcher to become aware of the error-prone nature of the database before drawing any conclusions. I think this was the correct decision; it was the only realistic decision. Unfortunately, the average researcher doesn't realize how many errors are being propagated in the sequence databases.

1. It was a huge ego-trip to have the power to change records in GenBank. All of the changes I made to other people's sequences have been removed but the ones I made to my own sequences are still there. You can check out [M76613] to see an example of what an annotated sequence could have looked like. Note the references to "old-sequence," "conflict," "variation," and "unsure." These represent differences between the genomic sequence and our older error-prone cDNA sequences.

corneliabl

Friday, June 20, 2008
