- Open Access
The TREC 2004 genomics track categorization task: classifying full text biomedical documents
© Cohen and Hersh; licensee BioMed Central Ltd. 2006
Received: 12 October 2005
Accepted: 14 March 2006
Published: 14 March 2006
The TREC 2004 Genomics Track focused on applying information retrieval and text mining techniques to improve the use of genomic information in biomedicine. The Genomics Track consisted of two main tasks, ad hoc retrieval and document categorization. In this paper, we describe the categorization task, which focused on the classification of full-text documents, simulating the task of curators of the Mouse Genome Informatics (MGI) system and consisting of three subtasks. One subtask of the categorization task required the triage of articles likely to have experimental evidence warranting the assignment of GO terms, while the other two subtasks were concerned with the assignment of the three top-level GO categories to each paper containing evidence for these categories.
The track had 33 participating groups. The mean utility measure for the triage subtask was 0.3303, with a top score of 0.6512. No system was able to substantially improve results over simply using the MeSH term Mice. The overlap of significant features between the training and test sets was found to be less than expected. Sample coverage of GO terms assigned to papers in the collection was very sparse. Determining papers containing GO term evidence will likely need to be treated as separate tasks for each concept represented in GO, and will therefore require much denser sampling than was available in the data sets.
The annotation subtask had a mean F-measure of 0.3824, with a top score of 0.5611. The mean F-measure for the annotation plus evidence codes subtask was 0.3676, with a top score of 0.4224. Gene name recognition was found to be of benefit for this task.
Automated classification of documents for GO annotation is a challenging task, as is the automated extraction of GO code hierarchies and evidence codes. However, automating these tasks would provide substantial benefit to biomedical curation, and therefore work in this area must continue. Additional experience will allow comparison and further analysis of which algorithmic features are most useful in biomedical document classification, and a better understanding of the task characteristics that make automated classification feasible and useful for biomedical document curation. The TREC Genomics Track will continue in 2005, focusing on a wider range of triage tasks and on improving the results from 2004.
Because of the growing size and complexity of the biomedical literature, there is increasing effort devoted to structuring knowledge in databases. One of the many key efforts is to annotate the function of genes. To facilitate this, the research community has come together to develop the Gene Ontology (GO, http://www.geneontology.org), a large, controlled vocabulary based on three axes or hierarchies:
• Molecular function (MF) – the activity of the gene product at the molecular (biochemical) level, e.g., protein binding
• Biological process (BP) – the biological activity carried out by the gene product, e.g., cell differentiation
• Cellular component (CC) – where in the cell the gene product functions, e.g., the nucleus
A major use of the GO has been to annotate the genomes of organisms used in biological research. The annotations are often linked to other information, such as literature, the gene sequence, and the structure of the resulting protein. An increasingly common approach is to develop "model organism databases" that bring together all the information for a specific organism in an easy-to-use format. Some of the better-known model organism databases include those devoted to the mouse (Mouse Genome Informatics, MGI, http://www.informatics.jax.org) and the yeast (Saccharomyces Genome Database, SGD, http://www.yeastgenome.org). These databases require extensive human effort for curation and annotation, which is usually done by PhD-level researchers. These curators could be aided substantially by high-quality information tools, including automated document categorization systems.
In the categorization task, using data extracted for us from the MGI databases by the MGI staff, we simulated two of the classification activities carried out by human annotators for the MGI system: a triage task and two simplified variations of MGI's annotation task. Systems were required to classify full-text documents from a two-year span (2002–2003) of three journals, with the first year's (2002) documents comprising the training data and the second year's (2003) documents making up the test data.
One of the goals of MGI is to provide structured, coded annotation of gene function from the biological literature. Human curators identify genes and assign GO codes about gene function, together with another code describing the type of experimental evidence supporting assignment of the GO code. The huge amount of literature requiring curation creates a challenge for MGI, as their resources are limited. As such, they employ a three-step process to identify the papers most likely to describe gene function:
1. About mouse
The first step is to identify articles about mouse genomics biology. The full text of articles from several hundred journals is searched for the words mouse, mice, or murine. Articles passing this step are further analyzed for inclusion in MGI. At present, articles are searched in a Web browser one at a time because full-text searching is not available for all of the journals included in MGI.
The second step is to determine whether the identified articles should be sent for curation. MGI curates articles not only for GO terms, but also for other aspects of biology, such as gene mapping, gene expression data, phenotype description, and more. For GO curation, MGI strives to select only the articles that contain evidence supporting assignment of a GO code to a specific gene. The goal of this triage process is to limit the number of articles sent to human curators for more exhaustive and specific analysis. Articles that pass this step go into the MGI system with tags for GO, gene mapping, embryological expression, etc. The rest of the articles are not entered into MGI. Our triage task involved correctly classifying which documents had been selected for GO annotation in this process.
The third step is the actual curation with GO terms. Curators identify genes for which there is experimental evidence to warrant assignment of GO codes. Those GO codes are assigned, along with an additional code for each GO code indicating the type of experimental evidence. There can be more than one gene assigned specific GO codes in a given paper, and there can be more than one GO code assigned to a gene. In general, and in our collection, there is only one evidence code per GO code assignment per paper. Our annotation task involved a simplification of this annotation step. The goal of this task was not to select the actual GO term, but rather to automatically select the one or more GO hierarchies (molecular function, biological process, or cellular component) from which terms had been selected to annotate the gene for the article. Systems attempting to automate this step must identify both the individual genes, perhaps using named entity recognition techniques, and the corresponding GO code hierarchy. For the secondary subtask, systems must identify the evidence type code as well.
A shorter, preliminary version of this paper lacking much of the analysis and discussion presented here was posted originally online at "http://trec.nist.gov/pubs/trec13/papers/GEO.OVERVIEW.pdf".
The documents for the categorization task consisted of articles from three journals over two years, reflecting the full-text documents we were able to obtain from Highwire Press (http://www.highwire.org). Highwire is a "value added" electronic publisher of scientific journals. Most journals in their collection are published by professional associations, with the copyright remaining with the associations. Highwire originally began with biomedical journals, but in recent years has expanded into other disciplines. They have also supported IR (information retrieval) and related research by acting as an intermediary between consenting publishers and information systems research groups who want to use their journals, such as the TREC Genomics Track.
Table: Number of papers, total and available in the mouse, mus, or murine subset. Rows: 2002 papers, 2003 papers, total papers. Columns: total, subset.
Table: Data set positive and negative sample counts. Rows: training (year 2002), test (year 2003).
The evaluation measure for the triage task was the utility measure often applied in text categorization research and used by the former TREC Filtering Track. This measure contains coefficients for the utility of retrieving relevant and non-relevant documents. We used a version that was normalized by the best possible score:
Unorm = Uraw / Umax
where Unorm was the normalized score, Uraw the raw score, and Umax the best possible score.
The coefficients for the utility measure were derived as follows. For a test collection of documents to categorize, Uraw is calculated as:
Uraw = (ur * relevant-docs-retrieved) + (unr * non-relevant-docs-retrieved)
• ur = relative utility of relevant document
• unr = relative utility of non-relevant document
We used values for ur and unr that were driven by boundary cases for different results. In particular, we thought it was important that the measure have the following characteristics:
• Completely perfect prediction: Unorm = 1
• All documents designated positive (triage everything): 1 > Unorm > 0
• All documents designated negative (triage nothing): Unorm = 0
• Completely imperfect prediction (all predictions wrong): Unorm < 0
Table: Boundary cases for utility measure of triage task for training and test data. Rows include: completely perfect prediction, completely imperfect prediction.
The measure Umax was calculated by assuming all relevant documents were retrieved and no non-relevant documents were retrieved, i.e., completely perfect prediction and Umax = ur * all-relevant-docs-retrieved.
Thus, with ur = 20 and unr = -1, for the training data,
Uraw = (20 * relevant-docs-retrieved) - nonrelevant-docs-retrieved
Umax = 20 * 375 = 7500
Unorm = [(20 * relevant-docs-retrieved) - nonrelevant-docs-retrieved] / 7500
Likewise, for the test data,
Uraw = (20 * relevant-docs-retrieved) - nonrelevant-docs-retrieved
Umax = 20 * 420 = 8400
Unorm = [(20 * relevant-docs-retrieved) - nonrelevant-docs-retrieved] / 8400
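The normalized utility defined above can be sketched in a few lines of Python. This is a minimal illustration rather than the track's scoring script: the function name `normalized_utility` and its parameter names are assumptions, the defaults ur = 20 and unr = -1 are read off the formulas above, and the count of 5000 non-relevant documents is a hypothetical figure used only to exercise the four boundary cases.

```python
def normalized_utility(rel_retrieved, nonrel_retrieved, total_relevant,
                       u_r=20.0, u_nr=-1.0):
    """Return Unorm = Uraw / Umax for one triage run."""
    if total_relevant <= 0:
        raise ValueError("total_relevant must be positive")
    u_raw = u_r * rel_retrieved + u_nr * nonrel_retrieved
    u_max = u_r * total_relevant  # every relevant document, nothing else
    return u_raw / u_max


# Boundary cases on the training data (375 relevant documents).
# hypothetical_negatives is an illustrative count, not a figure from the text.
hypothetical_negatives = 5000
perfect = normalized_utility(375, 0, 375)
triage_nothing = normalized_utility(0, 0, 375)
triage_everything = normalized_utility(375, hypothetical_negatives, 375)
all_wrong = normalized_utility(0, hypothetical_negatives, 375)
```

With these inputs the four cases fall where the design criteria require: exactly 1 for perfect prediction, exactly 0 for triaging nothing, a value strictly between 0 and 1 for triaging everything, and a negative value when every prediction is wrong.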
The primary goal of the annotation subtask was, given an article and gene name, to correctly identify which of the GO hierarchies (also called domains) had terms within them that were annotated by the MGI curators. Note that the goal of this task was not to select the actual GO term, but rather to select the one or more GO hierarchies (molecular function, biological process, or cellular component) from which terms had been selected to annotate the gene for the article. Papers that were annotated had terms from one to three hierarchies.
For negative examples, we used 555 papers that had a gene name assigned but were used for other purposes by MGI. As such, these papers had no GO annotations. These papers did, however, have one or more genes assigned by MGI for the other annotation purposes.
Table: Data file contents and counts for annotation hierarchy subtasks. Columns: training data count, test data count. Rows:
• Documents – PMIDs
• Genes – Gene symbol, MGI identifier, and gene name for all used
• Document gene pairs – PMID-gene pairs
• Positive examples – PMIDs
• Positive examples – PMID-gene pairs
• Positive examples – PMID-gene-domain tuples
• Positive examples – PMID-gene-domain-evidence tuples
• Positive examples – all PMID-gene-GO-evidence tuples
• Negative examples – PMIDs
• Negative examples – PMID-gene pairs
For the positive examples in the training data, there were 178 documents and 346 document-gene pairs. There were 589 document-gene name-GO domain tuples (out of a possible 346 * 3 = 1038). There were 640 document-gene name-GO domain-evidence code tuples. A total of 872 GO plus evidence codes had been assigned to these documents. For the negative examples, there were 326 documents and 1072 document-gene pairs. This meant that systems could possibly assign 1072*3 = 3216 document-gene name-GO domain tuples. Note that MGI evidence codes refer to the type of evidence, not the specific thing that there is evidence for. Some documents contained evidence of more than one type for a gene and GO domain.
The evaluation measures for the annotation subtasks were based on the notion of identifying tuples of data. Given the article and gene, systems designated one or both of the following tuples:
• <article, gene, GO hierarchy code>
• <article, gene, GO hierarchy code, evidence code>
We employed a global recall, precision, and F-measure evaluation measure for each subtask:
• Recall = number of tuples correctly identified / number of correct tuples
• Precision = number of tuples correctly identified / number of tuples identified
• F = (2 * recall * precision) / (recall + precision)
For the training data, the total number of correct <article, gene, GO hierarchy code> tuples was 589, while the total number of correct <article, gene, GO hierarchy code, evidence code> tuples was 640.
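The tuple-based measures can be illustrated with a short Python sketch. It is an assumption-laden example, not the official evaluation code: the function name `tuple_scores` is invented here, and the PMIDs and gene symbols are placeholders. Tuples are compared as exact matches, pooled over all articles, which is how we read "global" above.

```python
def tuple_scores(predicted, gold):
    """Global precision, recall and F-measure over sets of tuples."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f = 2 * recall * precision / (recall + precision)
    return precision, recall, f


# Hypothetical PMIDs and gene symbols, for illustration only.
gold = {
    ("PMID1", "GeneA", "BP"),
    ("PMID1", "GeneA", "MF"),
    ("PMID2", "GeneB", "CC"),
}
predicted = {
    ("PMID1", "GeneA", "BP"),
    ("PMID2", "GeneB", "CC"),
    ("PMID2", "GeneB", "MF"),
    ("PMID3", "GeneC", "BP"),
}
precision, recall, f_measure = tuple_scores(predicted, gold)
```

The same function would serve the evidence-code subtask by passing four-element tuples that add the evidence code as a final field.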
Example required submission format for each task.
Tab Delimited Submission Entry Format
Annotation hierarchy plus evidence
There were 98 runs submitted from 20 groups for the categorization task. These were distributed across the subtasks of the categorization task as follows: 59 for the triage subtask, 36 for the annotation hierarchy subtask, and three for the annotation hierarchy plus evidence code subtask.
Triage subtask runs, sorted by utility.
Because of these results we further analyzed the text collections, comparing the features identified as strong predictors in the training data (papers from the year 2002) with those in the test data (papers from the year 2003). One of the important issues in applying text classification systems to documents of interest to curators and annotators is how well the available training data represents the documents to be classified.
When classifying a biomedical text, the available training documents must have been written before the text to be classified. This is required for the TREC tasks to realistically simulate automation of the triage task of the GO curators. Papers written after a given article would not be available to the system for training prior to classifying that article. However, by its very nature the field of science changes over time, as does the language used to describe it. How rapidly the written literature of science changes has a direct influence on the development of biomedical text classification systems in terms of how features are generated and chosen, how often the systems need to be retrained, and how large the training increment should be, and it may affect the maximum performance that can be expected from these systems.
We wanted to begin to understand this potentially important issue of terminological drift in the biomedical literature. In order to measure how well the features chosen from the training collection represented the information important in classifying the documents in the test collection, we performed identical feature generation and selection processing on the training and test collections, including stemmed and stopped words, Chi-square feature selection at an alpha of 0.025, and inclusion of MeSH terms in the potential feature set. The process generated a set of 1885 significant features on the training collection and 1899 significant features on the test collection. We then measured how well the training collection feature set represented the test collection feature set by computing similarity metrics between the two sets. The Dice similarity coefficient was 0.2489, the Jaccard similarity was 0.1422, cosine similarity was 0.2489, and the overlap measure was 0.2499. All similarity measures show a low level of similarity between the two sets.
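The four set-similarity measures can be sketched in Python as follows. This is an illustrative reconstruction using the standard set-based definitions, which we assume match those used in the analysis. The feature sets themselves are synthetic stand-ins (integer ranges), and the shared-feature count of 471 is inferred from the reported coefficients rather than stated in the text.

```python
import math


def set_similarities(a, b):
    """Dice, Jaccard, cosine and overlap coefficients for two non-empty sets."""
    a, b = set(a), set(b)
    shared = len(a & b)
    return {
        "dice": 2 * shared / (len(a) + len(b)),
        "jaccard": shared / len(a | b),
        "cosine": shared / math.sqrt(len(a) * len(b)),
        "overlap": shared / min(len(a), len(b)),
    }


# Synthetic stand-ins for the two feature sets: 1885 training features and
# 1899 test features. The shared count of 471 is inferred, not reported.
shared = 471
train_features = set(range(1885))
test_features = set(range(1885 - shared, 1885 - shared + 1899))
sims = set_similarities(train_features, test_features)
```

Under that inferred overlap, the four coefficients round to the values given above, which suggests that roughly a quarter of the significant features in either year were also significant in the other.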
We performed equivalent similarity measures on the individual word frequencies in the training and test collections, filtered out common English words as before, and sorted the words from most frequent to least frequent for both sets. Computing similarity measures between the top 100, 1000, and 10,000 words in both sets showed consistently high similarity, with the maximum being the Dice similarity coefficient of 0.9618 at 100 words, and the minimum being a Jaccard similarity of 0.9232 at 10,000 words.
It is clear that a significant number of documents (48 out of 328, about 15%) have a "most common" GO code that appears only once in the entire corpus. More than half of the documents have a most common GO code that appears fewer than 10 times in the entire corpus.
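A count of this kind could be computed along the following lines. The sketch assumes that a document's "most common" GO code is the one, among its annotated codes, carried by the largest number of documents in the corpus; that reading, the function name, and the toy document ids and code labels are all assumptions made for illustration.

```python
from collections import Counter


def most_common_code_counts(doc_codes):
    """Map each document to the corpus frequency of its most common GO code.

    doc_codes maps a document id to the set of GO codes annotated in it.
    Frequency is the number of documents in the corpus carrying the code.
    """
    corpus_freq = Counter(
        code for codes in doc_codes.values() for code in set(codes)
    )
    return {
        doc: max(corpus_freq[code] for code in codes)
        for doc, codes in doc_codes.items()
        if codes
    }


# Hypothetical toy corpus; document ids and GO code labels are illustrative.
toy_corpus = {
    "doc1": {"GO:A", "GO:B"},
    "doc2": {"GO:A"},
    "doc3": {"GO:C"},
    "doc4": {"GO:D", "GO:E"},
}
counts = most_common_code_counts(toy_corpus)
singletons = [doc for doc, n in counts.items() if n == 1]
fraction_singleton = len(singletons) / len(counts)
```

In the toy corpus, half of the documents have a most common code seen nowhere else; the corresponding figure reported for the real collection is 48 of 328.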
Annotation hierarchy subtask
Annotation hierarchy subtask, sorted by F-score.
Annotation hierarchy plus evidence code subtask, sorted by F-score.
In the annotation hierarchy subtask, the runs varied widely in recall and precision. The best runs, i.e., those with the highest F-measures, had medium levels of recall and precision. The top run came from Indiana University and used a variety of approaches, including a k-nearest-neighbor model, mapping terms to MeSH, using keyword and glossary fields of documents, and recognizing gene names. Further post-submission runs raised their F-measure to 0.639. Across a number of groups, benefit was found from matching gene names appropriately. University of Wisconsin also found that identifying gene names in sentences and modeling features in those sentences provided value.
The TREC 2004 Genomics Track categorization task featured a wide diversity of approaches, resulting in substantial variation across the results. Discerning their relative value is challenging, since few groups performed parameterized experiments or used common baselines.
The triage subtask was limited by the fact that using the MeSH term Mice assigned by the MEDLINE indexers was a better predictor of the MGI triage decision than anything else, including the complex feature extraction and machine learning algorithms of many participating groups. Some expressed concern that MGI might give preference to basing annotation decisions on maximizing coverage of genes instead of exhaustively cataloguing the literature, something that would be useful for users of its system but would compromise the value of its data in tasks like automated article triage. We were assured by the MGI director (J. Blake, personal communication) that the initial triage decision for an article was made independent of the prior coverage of the gene, even though priority decisions made later in the pipeline did take coverage into account. As such, the triage decisions upon which our data were based were sound from the standpoint of document classification.
The annotation decision was also not affected by this since the positive and negative samples were not exhaustive by design, that is, the data set for the annotation task did not include all article GO annotations made by MGI during this time period. The corpora do not need to be exhaustive for the results to be valid for this subtask; they must simply be correct for the training and test samples provided with GO hierarchies and evidence codes approximately evenly distributed.
Another concern about the MGI data was whether the snapshot obtained in mid-2004 was significantly updated by the time the track was completed. This was analyzed in early 2005, and it was indeed found that the number of PMIDs in the triage subtask had increased in size by about 10%, with a very small number of previously positive samples now negatively triaged (curators determined that these papers actually did not contain evidence for GO assignment). We re-ran our submitted methods on the updated data and obtained virtually identical results.
The major question for the triage subtask is why systems were unable to outperform the single MeSH term Mice. It should be noted that this term was far from perfect, achieving a recall of 89% but a precision of only 15%. So why did more elaborate systems not outperform this? There are a variety of possible explanations:
• MGI data is problematic – while MGI does some internal quality checking, they do not carry it out at the level that research groups would, e.g., with kappa scores.
• Our algorithms and systems are imperfect – we are unaware of or there do not exist better predictive feature sets and algorithms for this task.
• Our metrics may be problematic – is the factor = 20 in the utility formula appropriate? How do we determine a more appropriate means of computing utility that more accurately reflects the needs of the MGI curators?
• The terminological drift between the 2002 training corpus and the 2003 test corpus was large enough to reduce the effectiveness of all discriminating features except for the MeSH term Mice. Perhaps an online-style (incremental) training and evaluation method would be more appropriate than the batch method that we used here.
• The GO triage task is significantly more complex than previously studied document classification tasks. Much more data may be necessary to adequately train machine learning algorithms.
To some extent all of these explanations may play a part, but the last is probably the dominant factor. The GO triage task appears significantly more difficult than previously studied biomedical document triage tasks. In the 2002 Knowledge Discovery and Data Mining (KDD) Challenge Cup, a task somewhat similar to the TREC triage task was organized around selection of papers about Drosophila (fruit fly) for curation in FlyBase, also using full text articles. Overall, analysis of the results showed that systems did quite well, with the best system achieving an F-measure of 78% on making yes/no decisions on papers, similar to the triage decision required in the TREC task.
The results of the TREC Genomics Track GO triage task appear significantly worse, with the best submission scoring a utility of 0.6512 and a corresponding F-score of about 27%. However, there are several important differences between the TREC and the KDD triage tasks, besides the obvious, but possibly important, difference that the KDD task focused on fly genomics and the TREC task on mouse. First of all, both the training and test collections for the KDD task had a relatively high proportion of positives (33% and 43%, respectively) as compared to the TREC task (6.5% and 7%). Furthermore, the TREC task used a utility measure heavily weighted towards high recall, while the KDD Cup used F-score, the balanced harmonic mean of recall and precision. Therefore the KDD measure did not take into account a curator preference for not missing many positive articles, as ours did; it weighted the correct prediction of positives and negatives equally; and it was applied to a test collection in which the proportion of positives approached 50%. These factors may have made scoring well on the KDD task easier compared to the TREC task.
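The relationship between the utility score and the F-score can be checked with a short calculation. Writing r for recall and p for precision, the triage utility with ur = 20 and unr = -1 reduces to Unorm = r * (1 - (1 - p) / (20 * p)), which can be solved for p. The Python sketch below applies this to the top utility of 0.6512 together with the 88.8% recall that this article reports for the best-performing triage system; treating those two figures as belonging to the same run is an assumption, and the function names are invented for the example.

```python
def precision_from_utility(u_norm, recall, u_r=20.0):
    """Solve Unorm = recall * (1 - (1 - p) / (u_r * p)) for precision p.

    Assumes u_nr = -1, as in the triage utility formula.
    """
    if not 0 < recall <= 1:
        raise ValueError("recall must be in (0, 1]")
    return 1.0 / (1.0 + u_r * (1.0 - u_norm / recall))


def f_measure(recall, precision):
    """Balanced harmonic mean of recall and precision."""
    return 2 * recall * precision / (recall + precision)


# Top triage utility (0.6512) paired with the 88.8% recall reported for the
# best system; treating both as the same run is an assumption.
best_precision = precision_from_utility(0.6512, 0.888)
best_f = f_measure(0.888, best_precision)
```

Under that assumption the implied precision is a little under 16% and the F-score is close to 27%, in line with the figure quoted above. It also illustrates how a run can score well on the recall-weighted utility while its balanced F-score remains low.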
Another difference between the TREC and KDD shared tasks may be even more important. The KDD FlyBase triage task was to "determine whether the paper meets the FlyBase gene expression curation criteria, and for each gene, indicate whether the full paper has experimental evidence for gene products (mRNA and/or protein)". Positive classification was determined solely on whether the full paper included experimental evidence linking genes to their products. The TREC task was to determine whether the paper contained evidence for assignment of GO codes, any GO code. Currently, there are about 20,000 different terms in the GO, in the areas of cellular component, molecular function, and biological process. This is clearly a much wider range of topics than simply gene transcription products, and makes the TREC GO task much more heterogeneous than the KDD task.
Figures 4 and 5 show that the sampling and coverage of GO terms in the training and testing sets, as well as in the combined collection, are very sparse, both in terms of individual GO terms and for papers containing evidence for common GO terms. With 20,000 different terms in the GO under three main headings, a great variety of different topic areas related to the individual GO terms may be present in our collection.
Each of these individual topics can be viewed as a separate yes/no classification task in itself. The GO triage categorization task may better be thought of as many subtasks, where classification of the presence/absence of each GO code is done individually, and the document is triaged for GO if classified as positive for any of the GO codes. But the individual GO codes are sampled very thinly. When the corpus is split into training and test collections, it is very likely that, for most GO codes, the training or test set will either miss the code entirely, include only one document associated with it, or at best sample it very thinly. Therefore the corpus may contain many GO topics for which there are too few cases to provide meaningful samples in both the test and training sets.
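The decomposition described here can be sketched as follows. This is only an illustration of the idea, not a method used by any participant: the keyword rules stand in for trained per-code classifiers, the code labels and document counts are hypothetical, and the threshold of two documents per set is an arbitrary assumption.

```python
def triage_by_any_code(document, code_classifiers):
    """Triage a document as positive if any per-code classifier fires.

    code_classifiers maps a GO code to a callable returning True or False.
    """
    return any(classify(document) for classify in code_classifiers.values())


def trainable_codes(train_counts, test_counts, min_docs=2):
    """Codes with at least min_docs documents in both training and test sets."""
    return sorted(
        code for code in train_counts
        if train_counts[code] >= min_docs
        and test_counts.get(code, 0) >= min_docs
    )


# Hypothetical keyword rules standing in for trained per-code classifiers.
classifiers = {
    "GO:A": lambda text: "kinase" in text,
    "GO:B": lambda text: "nucleus" in text,
}
positive = triage_by_any_code("a kinase assay in mouse liver", classifiers)
negative = triage_by_any_code("a clinical survey of patients", classifiers)

# Hypothetical per-code document counts in each year.
usable = trainable_codes(
    {"GO:A": 5, "GO:B": 1, "GO:C": 3},
    {"GO:A": 4, "GO:C": 1},
)
```

In the toy counts only one of the three codes has enough documents on both sides of the split, which is the situation the paragraph above describes for most GO codes in the real corpus.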
For about 85% of documents, the most common GO code associated with a document is found associated with two or more documents. Interestingly, this figure is very close to the recall of the best performing system for the GO triage task (88.8%), and may represent an upper limit on recall performance for this data set.
Combining the samples for each of the many GO topics together may result in the strong features for a given topic being obscured by the strong features in other topics, overwhelming any classification system with the resulting noise, with only features common to the majority of individual topics still predictive. It appears that the MeSH term Mice meets this description. The terminological drift showing a difference in significant features between the training and test collection may simply be due to the very sparse sampling of the range of GO topics over both years. This is substantiated by the data that the most common words (after stop word removal) were largely unchanged, but the statistically significant feature set changed quite a bit from the year 2002 to 2003.
All of the above lends support to the theory that the GO triage task is difficult because it contains many sub-problems which are very sparsely sampled. There are many GO codes having only one associated document contained in the corpus, and there are many, many GO codes that are completely missing from the corpus. We believe that the triage subtask data represents an important task (i.e., document triage is valuable in a variety of biomedical settings, such as discerning the best evidence in clinical studies) and that these data provide the initial substrate for work to continue in this area. However, it appears that the corpus has to be much, much larger in order to support machine learning on the full range of GO codes for automated text classification on this specific task. Over time, MGI will collect vast amounts of data during the natural course of curating documents each year, but it may be a very long time before adequate numbers of samples are available for all GO codes. Selecting data specifically to train and test classification systems for identifying papers containing evidence for the most common GO codes and other, more specifically defined triage scenarios (such as embryological expression) may be more tractable tasks to address in the near term.
The annotation hierarchy task had lower participation, and the value of picking the correct hierarchy is unclear. However, there would be great value to systems that could perform automated GO annotation, even though the task is very challenging. These results demonstrated value in identifying gene names and other controlled vocabulary terms in documents for this task.
The automated classification of documents for GO annotation proved to be a challenging task. Automated extraction of GO hierarchy codes was even more challenging. This was the first year that the TREC Genomics Track included a classification task, and so our understanding of the best way to approach these tasks for biomedical curation is just beginning. Current text classification systems are most often optimized for a balanced F-measure, where precision and recall are weighted evenly. However, the asymmetric utility measure used in the triage task was heavily weighted towards recall. This reflected the priorities of the document curators. It is likely that further experience optimizing for this type of utility measure will provide improved results.
Analysis of feature sets showed less correlation between statistically significant features in the training and test sets than expected. While this is most likely due to the sparse sampling of individual GO topics, there is currently insufficient evidence to determine the practical significance and generality of this, and whether this is a general problem for biomedical document classification.
While no approach was able to improve upon the triage performance of simply using the MeSH term Mice, this is likely due to the heterogeneity of the GO triage task and is unlikely to be the case for other, more specific biomedical document triage tasks. Additional research into other tasks will provide more information about the performance expectations for biomedical document classification. This task is likely not representative of document classification for biomedical curation tasks. MGI also curates articles for purposes other than GO annotation. Comparison with these tasks will provide further insight into the true potential of document classification for biomedical curation.
The TREC Genomics Track will be continuing in 2005. The categorization task will consist of selecting papers for a set of four triage categories relevant to MGI curation, including allele phenotypes, embryologic expression, and tumor biology, as well as repeating the GO triage categorization task with updated data. It is hoped that the research community will be able to build on their experience from this year and present improved results in 2005. There is a large potential benefit to biomedical curation, and work in this area must continue in order to realize fully the advantages that automated biomedical document classification and text mining could bring to biomedical research.
The TREC 2004 Genomics Track was supported by NSF Grant ITR-0325160. The track also appreciates the help of Ellen Voorhees and NIST.
The TREC 2004 Genomics Track would like to acknowledge the assistance of Judith Blake and her staff at Mouse Genome Informatics for their support in creating the tasks and preparing the data for this research.
The categorization task data and details for its use are available on the TREC Genomics web site (http://ir.ohsu.edu/genomics/). A second version of data was released in early 2005 that updated the 2004 data to correct some minor errors.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.