Saturday, January 17, 2009

BioGPS

 
BioGPS is billed as a "Biology Gene Portal System." It's another database. You can read the review on genomeweb but you will have to register [GNF Team Rolls Out BioGPS Gene Portal for Users and Contributors].

The brains behind BioGPS is Andrew Su at the Genomics Institute of the Novartis Research Foundation (GNF) in San Diego (USA). According to the genomeweb article ...
As scientists move forward in analyzing experimental results, they generally consult up to a dozen "standard web sites," Su said, such as Entrez Gene, Ensembl, UniProt, or the Mouse Genome Informatics site. Each site delivers "partially overlapping gene annotation," so users must visit each, enter their search, learn the interface, and learn how to find each of the genes of interest on that site, he said. "Often that is a quite daunting process."

The idea behind BioGPS, Su said, is to avoid that process as well as reveal to researchers smaller and less-known gene portals that scientists might have missed.

Call me skeptical. The author of the article, Vivian Marx, contacted me and asked me to check out BioGPS. I have a long-standing interest in biological databases dating back to an early attempt to improve and update GenBank by adding annotation. That attempt was a failure, for very sound reasons [Errors in Sequence Databases].

I looked at my favorite genes on BioGPS. Here's the link to their homepage: BioGPS. The first thing you notice is that the database is restricted to rat, mouse, and human genes. The second thing you notice is that there's no value added. The data appears to be copied from other databases. This includes all of the errors, omissions, and misinterpretations found at each site. The emphasis is on expression data; that's what overwhelms the visible record of each gene.

Here's an example. This is the human HSPA1L gene. It happens to be a member of the HSP70 gene family. HSP70 proteins are the major chaperones of the cell. The HSPA1L version is specifically expressed in testes.


The expression data is correct but none of the databases mention that this gene is a developmentally regulated member of the HSP70 gene family even though that information has been in the literature for almost twenty years. You don't learn anything from visiting BioGPS that you wouldn't learn from visiting most other databases and, more importantly, you don't learn the information that might be most important to your research because it isn't in any of the databases. Anyone looking at this record would be puzzled by the lack of connection between the correct expression profile and all of the other information.

It gets worse. If you check out the rat HSPA1L gene you won't even learn that it is developmentally regulated because the expression profile doesn't include testes. The links to this gene suggest that it responds to stress, but it doesn't.

This is just one example of the problems with biological databases. Collecting together links from a variety of databases doesn't help. It just ensures that the errors from each database will be combined, creating maximum confusion.

I'm quoted correctly in the article ...
Larry Moran, a biochemist at the University of Toronto, told BioInform by e-mail that he had looked at a few of his "favorite genes" in the portal. "I don't think it's a very useful database," he said, since it is a summary of information gleaned from other databases with "no attempt at annotation."

In addition, he said, "much of the information is wrong or misleading," such as some of the expression profiles, which "seem to be incorrect; probably because the data is for another gene and not the one in the database record."

Users "who would rely on that sort of expression data would be making a very serious mistake," he said.

Reacting to these comments, Su said, "I think it is a good thing, in terms of making those errors more widely seen. The more eyes that see it, the more likely that that error will be fixed."

Being able to detect errors, however, has to be connected to the ability to fix it, he said. "This is the wiki principle, everybody can edit it, everybody can fix it, everybody has the responsibility and the power to make sure it's correct."

In an ideal world, researchers will fix errors in the databases and a Wiki-like system seems like a good idea. The experiment is already underway [A Gene Wiki]. But, as it turns out, this approach is incredibly naive, as I discovered from attempts to fix GenBank a few decades ago. Nobody's going to do it. It's way too much work and there's no motivation to share information on public databases.

I received an email message from one of the authors of the expression data. As you might expect, the expression data profiles that are so prominently featured in the BioGPS database records are from the team at The Genomics Institute of the Novartis Research Foundation (e.g. Su et al., 2004). Much of it may be correct (it certainly succeeded with the HSPA1L gene), but I think it's wrong for HSPA1A.

My correspondent pointed out that his expression data has been widely used by hundreds of researchers and the papers have tons of citations.¹ He described several studies that have made important discoveries based on the expression profiles that have been published. I don't doubt that this is true. That's not the point. The point is whether taking the expression data and adding links from other sources makes BioGPS a valuable resource.

Not as far as I can see.


1. The idea that a paper must be correct just because it is widely quoted is something that troubles me greatly. It seems to be part of the new way of doing science.

Su, A.I., Wiltshire, T., Batalov, S., Lapp, H., Ching, K.A., Block, D., Zhang, J., Soden, R., Hayakawa, M., Kreiman, G., Cooke, M.P., Walker, J.R. and Hogenesch, J.B. (2004) A gene atlas of the mouse and human protein-encoding transcriptomes. Proc. Natl. Acad. Sci. (USA) 101:6062-6067. [PubMed]
