Anne O'Tate: A tool to support user-driven summarization, drill-down and browsing of PubMed search results
© Smalheiser et al; licensee BioMed Central Ltd. 2008
Received: 28 September 2007
Accepted: 15 February 2008
Published: 15 February 2008
PubMed is designed to provide rapid, comprehensive retrieval of papers that discuss a given topic. However, because PubMed does not organize the search output further, it is difficult for users to grasp an overview of the retrieved literature according to non-topical dimensions, to drill-down to find individual articles relevant to a particular individual's need, or to browse the collection.
In this paper, we present Anne O'Tate, a web-based tool that processes articles retrieved from PubMed and displays multiple aspects of the articles to the user, according to pre-defined categories such as the "most important" words found in titles or abstracts; topics; journals; authors; publication years; and affiliations. Clicking on a given item opens a new window that displays all papers that contain that item. One can navigate by drilling down through the categories progressively, e.g., one can first restrict the articles according to author name and then restrict that subset by affiliation. Alternatively, one can expand small sets of articles to display the most closely related articles. We also implemented a novel cluster-by-topic method that generates a concise set of topics covering most of the retrieved articles.
Anne O'Tate is an integrated, generic tool for summarization, drill-down and browsing of PubMed search results that accommodates a wide range of biomedical users and needs. It can be accessed from the Arrowsmith homepage. Peer review and editorial matters for this article were handled by Aaron Cohen.
Anne O'Tate was developed as a part of the Arrowsmith project [1–4], which has been developing informatics tools for advanced text mining of the biomedical literature. We sought to create a tool for carrying out PubMed searches that did not require the user to progressively reformulate the initial query; that would assist the user in finding the most relevant articles quickly and efficiently; and that would summarize the salient features of a given set of articles (e.g., given a set of articles discussing gene X, list the diseases in which gene X has been studied, or given a set of articles on disease Y, list the symptoms that have been described in that disease). The present paper describes the current implementation of Anne O'Tate, which is used routinely by our group for conducting PubMed searches. The tool has been placed on the Arrowsmith homepage as a free, public web-based service.
2.1 Query interface
The PubMed query interface was imported into the Anne O'Tate web page, so that when a user types in a query, it is sent to PubMed using the NCBI E-Utilities (ESearch and EFetch) to obtain the PubMed IDs, thereby taking advantage of the pre-processing that occurs within PubMed. Given the set of PubMed IDs, articles are looked up in a local MEDLINE/PubMed database; for articles not included in the local database, E-Utilities are used to download the records of those (generally very recent) articles. There is no restriction on the number of articles retrieved from PubMed and displayed initially to the user. However, to limit the computational load on the system, a limit was placed on the number of papers that are processed further (as discussed below). At present, the default is to process further only the 25,000 most recent articles retrieved by a given query.
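The retrieval step described above can be sketched roughly as follows. This is a minimal illustration, not the Anne O'Tate implementation: the function names and the dictionary standing in for the local MEDLINE/PubMed database are hypothetical, and no network request is made. Only the ESearch endpoint and its `db`, `term` and `retmax` parameters come from the public E-Utilities interface; NCBI limits how many IDs a single request returns, so large result sets may need paging.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Public ESearch endpoint of the NCBI E-Utilities.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(query, retmax=100):
    """Build an ESearch request URL that asks PubMed for matching IDs."""
    params = {"db": "pubmed", "term": query, "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)


def parse_pmids(esearch_xml):
    """Extract the PubMed IDs from an ESearch XML response."""
    root = ET.fromstring(esearch_xml)
    return [node.text for node in root.findall("./IdList/Id")]


def split_by_local_availability(pmids, local_db):
    """Separate IDs present in the local database (hypothetical dict)
    from IDs whose records would have to be downloaded with EFetch."""
    local = [p for p in pmids if p in local_db]
    missing = [p for p in pmids if p not in local_db]
    return local, missing
```

In this sketch the query string is passed to PubMed unchanged, which is what lets the tool benefit from PubMed's own query pre-processing.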
2.2 MEDLINE term database
A database of terms was created including all of the words and phrases [n-grams (n = 1,2,3)] that occur in the title of at least one article in MEDLINE. A simple tokenizer (to remove sentence delimiters and change the text to lower case) and a stemmer (to handle plurals) have been applied. In total, 15.5 million terms were extracted. Document frequency is defined as the number of different articles in MEDLINE that contain the term in either title or abstract. Each term in an article is counted only once, even though it may occur several times in that article. We intend to update the term database yearly.
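As an illustration of how such a term database could be assembled, the sketch below extracts 1-, 2- and 3-grams from titles and counts document frequency over titles and abstracts. The tokenizer and the plural-stripping rule are deliberately crude stand-ins (assumptions), not the Arrowsmith tokenizer and stemmer, and all function names are hypothetical.

```python
from collections import Counter
import re


def tokenize(text):
    """Lower-case the text and keep alphanumeric runs (drops delimiters)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def stem(token):
    """Crude plural handling (assumption): strip one trailing 's'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def ngrams(text, max_n=3):
    """Return the set of stemmed n-grams (n = 1..max_n) found in the text."""
    tokens = [stem(t) for t in tokenize(text)]
    return {
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def document_frequency(articles):
    """articles: iterable of (title, abstract) pairs.

    The vocabulary is restricted to terms seen in at least one title;
    each term is counted once per article (title or abstract)."""
    vocabulary = set()
    terms_per_article = []
    for title, abstract in articles:
        title_terms = ngrams(title)
        vocabulary |= title_terms
        terms_per_article.append(title_terms | ngrams(abstract))
    counts = Counter()
    for terms in terms_per_article:
        counts.update(terms & vocabulary)
    return counts
```

Using a set per article is what enforces the rule that a term is counted only once per article, however often it occurs.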
Terms were run through the NIH MetaMap program (MMTx version 2.0) to assign each term to one or more semantic categories, if possible, as defined by the Unified Medical Language System (UMLS). The 134 semantic categories were grouped into ~15 super-categories as outlined by McCray et al. (For example, a number of individual semantic categories such as Hazardous or Poisonous Substance, Hormone, and Immunologic Factor were subsumed under the super-category of Chemicals & Drugs.) Because MetaMap cannot optimally recognize terms out of context, and because at the time certain terms were poorly represented in the UMLS, including neuroanatomical terms and gene/protein names, the NeuroNames vocabulary and a list of predicted gene and protein names extracted from Entrez Gene were added as complementary semantic categories. Anne O'Tate allows users to restrict important words (see below) or MeSH terms to any of the 15 super-categories or to any of the individual semantic categories therein; alternatively, they can retain all terms that mapped to at least one semantic category while discarding terms that failed to map at all.
2.3 Anne O'Tate categories
1. Important words
Important words distinguish a specific literature L from the rest of MEDLINE. Important words of a literature should occur significantly more frequently within the literature than overall in MEDLINE. That is, they should show high enrichment, forming a literature-specific vocabulary that is similar to the concept of a domain sub-language. At the same time, important words should ideally occur in a high proportion of the articles in literature L (i.e., should have high coverage).
where λ = |L| * n/N is the expected value of f.
For each retrieved literature L, we created a list of all words that had a very high enrichment value (i.e., p ≤ 0.001) as calculated above. These were then displayed in order of their relative "importance score", which takes into account both enrichment and coverage, using the formula: Importance = (f/|L|)^2/n.
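The filtering and ranking just described can be sketched as below. The definitions that precede λ are not present in this text, so the sketch rests on stated assumptions: f is taken to be the number of articles in L containing the word, n its document frequency in MEDLINE, N the total number of MEDLINE articles, and the enrichment p-value is taken to be the upper tail of a Poisson distribution with mean λ = |L| * n/N. The Poisson choice and the function names are assumptions, not a confirmed description of the authors' computation; only λ, the 0.001 threshold and the importance formula come from the text.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability mass, computed in log space for stability."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def poisson_upper_tail(f, lam):
    """P(X >= f) for X ~ Poisson(lam), with lam > 0."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    if f <= 0:
        return 1.0
    if f <= lam:
        # Few terms below the mean: use the complement of the lower tail.
        return max(0.0, 1.0 - sum(poisson_pmf(k, lam) for k in range(f)))
    total, k = 0.0, f
    while True:
        term = poisson_pmf(k, lam)
        total += term
        if term <= total * 1e-16:  # remaining terms are negligible
            break
        k += 1
    return min(1.0, total)


def importance(f, size_l, n):
    """Importance = (f/|L|)^2 / n, as given in the text."""
    return (f / size_l) ** 2 / n


def rank_important_words(word_stats, size_l, total_n, alpha=0.001):
    """word_stats: dict mapping word -> (f, n). Returns enriched words
    (assumed Poisson p <= alpha) in decreasing order of importance."""
    kept = []
    for word, (f, n) in word_stats.items():
        lam = size_l * n / total_n  # expected value of f
        if poisson_upper_tail(f, lam) <= alpha:
            kept.append((importance(f, size_l, n), word))
    return [word for score, word in sorted(kept, reverse=True)]
```

One way to read the importance formula: coverage is f/|L|, enrichment is proportional to (f/|L|)/(n/N), and since N is constant their product ranks words identically to (f/|L|)^2/n.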
2. Topics (i.e., Medical Subject Headings)
Articles in MEDLINE are indexed by Medical Subject Headings (MeSH); these are annotated by expert biologists, follow a standardized hierarchical set of terminology, and are used to describe the main topics discussed. We display the MeSH terms used in the PubMed search output, excluding (stoplisting) the 20 most frequent MeSH terms in MEDLINE, such as Humans, Male, Female, etc., as being too general to be useful.
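A minimal sketch of the stoplisted topic display follows. The text names only three of the stoplisted headings, so the stoplist here is derived from MEDLINE-wide counts supplied by the caller rather than hard-coded; the function names and input shapes are hypothetical.

```python
from collections import Counter


def build_stoplist(medline_mesh_counts, k=20):
    """Return the k most frequent MeSH terms across MEDLINE as a set."""
    return {term for term, _ in Counter(medline_mesh_counts).most_common(k)}


def topic_counts(articles_mesh, stoplist):
    """articles_mesh: one list of MeSH terms per retrieved article.
    Returns (term, number of articles) pairs, most frequent first."""
    counts = Counter()
    for mesh_terms in articles_mesh:
        counts.update(set(mesh_terms) - stoplist)
    return counts.most_common()
```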
3. Affiliations
Within the affiliation field, text segments delimited by commas are extracted, on the assumption that they correspond to meaningful components such as institutions, departments, cities, states, zip codes, or countries. The segments are not tokenized or stemmed. In addition, different text segments that always co-occur are displayed together. For example, "Yale University School of Medicine" always co-occurred with "Connecticut", so Anne O'Tate combines them into a single affiliation term.
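The affiliation handling can be illustrated with the sketch below. It interprets "always co-occur" strictly, merging segments only when they appear in exactly the same set of articles; that interpretation, the function name and the output format are assumptions.

```python
from collections import defaultdict


def affiliation_terms(affiliations):
    """affiliations: one affiliation string per article.
    Returns (term, number of articles) pairs, most frequent first;
    segments found in exactly the same articles are merged."""
    articles_by_segment = defaultdict(set)
    for index, affiliation in enumerate(affiliations):
        for segment in affiliation.split(","):
            segment = segment.strip()  # no tokenizing or stemming
            if segment:
                articles_by_segment[segment].add(index)

    # Group segments whose article sets are identical.
    segments_by_articles = defaultdict(list)
    for segment, articles in articles_by_segment.items():
        segments_by_articles[frozenset(articles)].append(segment)

    terms = [
        (", ".join(segments), len(articles))
        for articles, segments in segments_by_articles.items()
    ]
    return sorted(terms, key=lambda item: (-item[1], item[0]))
```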
4. Other MEDLINE fields
Anne O'Tate also displays the search results according to other MEDLINE fields, including author names, journals, and year of publication, listed in order of frequency within the PubMed search output. These fields allow users to have a quick overview of the retrieved literature from different perspectives.
2.4 Literature expansion
The literature expansion tool was added to assist users who find themselves examining a very small set of articles after running a PubMed query. This situation may arise for at least three reasons: a) The PubMed query may relate to a new or highly specific research area in which few articles are available. b) The query may have been poorly formulated so that most relevant papers were missed. c) The user may have already used the Anne O'Tate tool to drill down a few levels within the initial search output.
The PubMed "related articles" function was employed in batch mode to expand a retrieved literature L containing fewer than 50 articles. For each article in L, a list of its most related articles is retrieved from PubMed using its ELink utility, and the top 100 are kept. These related articles are pooled, and for each of the related articles in the pool, we ask whether it is related to at least 40% of the articles in L. (When L contains only 2–4 articles, a related article must be related to at least 2 of them.) There may be hundreds of related articles satisfying these criteria, but we display only the 50 − |L| most related articles, so that the total number of displayed articles (L plus related articles) equals 50. The expansion not only provides more relevant articles to the user, but also gives Anne O'Tate a reasonably large literature to summarize.
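The pooling and thresholding steps above can be sketched as follows. The input is assumed to be a dictionary mapping each article in L to its ranked list of related article IDs (as might be obtained from ELink; no network call is made here). The text does not say how the "most related" candidates are ordered, so ranking by the number of articles in L that a candidate is related to is an assumption, as are the function names.

```python
from collections import Counter


def is_related_enough(count, size):
    """Apply the rule from the text: at least 2 articles when |L| is 2-4,
    otherwise at least 40% of L (checked in integer arithmetic)."""
    if 2 <= size <= 4:
        return count >= 2
    return 5 * count >= 2 * size


def expand_literature(related, target_size=50, top_k=100):
    """related: dict mapping each ID in L to its ranked related IDs.
    Returns up to target_size - |L| additional article IDs."""
    in_l = set(related)
    size = len(in_l)
    if size == 0 or size >= target_size:
        return []

    # Pool the top-k related articles of every article in L, counting
    # how many articles in L each candidate is related to.
    votes = Counter()
    for ranked in related.values():
        votes.update(set(ranked[:top_k]) - in_l)

    candidates = [p for p, c in votes.items() if is_related_enough(c, size)]
    # Assumed ordering: most widely related first, ties broken by ID.
    candidates.sort(key=lambda p: (-votes[p], p))
    return candidates[: target_size - size]
```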
2.5 The cluster-by-topic function
3.1 Top ranked important words include important biological concepts
Top 20 most important words for the PubMed query "Alzheimer Disease [MeSH Term]".
One possible use for the "important words" function is to annotate a collection of genes and proteins according to the major concepts and items discussed regarding each. Each gene or protein can be used as input to a PubMed search, and the retrieved literature is processed to provide a list of the most important words.
3.2 Categories defined by MEDLINE fields
3.3 Browsing: search results are clustered into topics with a high coverage
For a person unfamiliar with the dicer literature, browsing the various categories may be a useful way to gain an overview and decide which, if any, articles to examine. However, in order to provide an even more succinct overview, we added a "clustered by topic" button which divides any literature into no more than 18 clusters, i.e., the size of a list that can fit comfortably onto one page.
Clustering the search results of the PubMed query "Alzheimer Disease [MeSH Term]" using the cluster-by-topic function (cluster labels include "Most recent articles", "Aged, 80 and over" and "Not indexed by topic").
Some currently available web-based tools that allow users to carry out post-processing of PubMed queries.
TYPE 1: extract relationships and allow for 1) graphical visualization or navigation; or 2) refining queries
- Extract relationships between biological objects and map them into a graphical network
- Extract informative sentences from retrieved results
- Extract biological relationships from search results
- Extract relationships between medical concepts and allow graphical visualization
- Extract several relationships from the search results and then map them into networks
- Extract dependency relations among words and allow users to refine queries using these words
TYPE 2: organize results by ontologies or hierarchies
- Sort PubMed query results through Gene Ontology and MeSH hierarchy
- Summarize search results according to MeSH hierarchy
- Tag gene and protein occurrences in text
- Categories include words, MeSH, authors, journals, year, substances and country
- PubMed Assistant: Lists MeSH and chemicals, with link-outs to PubMed, Google and Google Scholar
TYPE 3: cluster articles into categories
- Vivísimo ClusterMed: Cluster articles into several categories
- Cluster related articles and allow for graphical visualization
TYPE 4: rank articles
- Rank articles by the journal impact factor and volume of forward references
- Rank articles by relevance
Each of the available web-based tools has unique features, and may be preferred for particular users, types of queries or types of analyses. However, Anne O'Tate offers at least 4 unique features that, to our knowledge, are not found in any other tool at present, and that taken together make it a flexible and practical option for summarization, drill-down and browsing of biomedical articles:
First, the current implementation of Anne O'Tate permits analysis of the 25,000 most recent articles retrieved by any PubMed search, which is much larger than can be handled by other tools; this feature makes it an everyday work-horse rather than a prototype. The emphasis on large literatures did not permit us to include computation-intensive visualization capabilities such as are provided by AliBaba or HubMed. However, we were able to include a "clustered by topic" feature that represents the major topics covered by a set of articles in an extremely concise form, by developing a novel clustering algorithm that is computed efficiently and is scalable to very large literatures. This allowed us to cluster tens of thousands of articles in real time, whereas other public interfaces permit users to cluster no more than 500 articles.
Second, search results can be progressively narrowed down by simple clicking, according to any category, allowing one to find articles of interest without needing to modify and re-input the initial query. Thus, Anne O'Tate allows users to direct their attention according to which articles are of greatest interest but does not attempt to predict in advance which articles are likely to be most relevant; this philosophy differs from tools such as HubMed or Relemed, which display articles in order of predicted relevance to the input query.
Third, the "important words" of the retrieved literature are displayed, with an option to restrict these to user-defined semantic categories. A list of "important words" will avoid displaying many general items commonly discussed throughout MEDLINE (such as gene, protein, human, cell, etc.), and thus is more informative than displaying a simple list of the most frequent words.
Fourth, when fewer than 50 articles are displayed, the user has the option to view additional articles that are most closely related to the existing set considered as a whole. This extends the power of the existing PubMed "related articles" feature, which finds the articles most closely related to a single index article.
In our own experience, the web interface has been a useful, daily tool to enhance routine PubMed searching. Anne O'Tate is freely available as a web-based service with no need for log-ins, passwords or downloads; we invite users to employ Anne O'Tate in their own searches and to provide feedback and suggestions for improving its features and aligning it with the needs of the biomedical community.
5. System performance, availability and requirements
Anne O'Tate is currently running on a server with two Xeon 2.4 GHz processors and 6 GB RAM. Computation time increases linearly with the number of articles to be post-processed. At present, times range from <1 second (to compute the important words for 100 articles) to ~100 seconds (to compute the important words for 25,000 articles containing abstracts).
This research is supported by NIH Grants LM 007292 and LM 08364.
- Smalheiser NR, Torvik VI, Bischoff-Grethe A, Burhans LB, Gabriel M, Homayouni R, Kashef A, Martone ME, Perkins GA, Price DL, Talk AC, West R: Collaborative development of the Arrowsmith two node search interface designed for laboratory investigators. J Biomed Discov Collab. 2006, 1: 8. 10.1186/1747-5333-1-8.
- Torvik VI, Smalheiser NR: A quantitative model for linking two disparate literatures in MEDLINE. Bioinformatics. 2007, 23: 1658-1665. 10.1093/bioinformatics/btm161.
- Torvik VI, Weeber M, Swanson DR, Smalheiser NR: A probabilistic similarity metric for Medline records: a model for author name disambiguation. J Am Soc Inform Sci Technol. 2005, 56: 140-158. 10.1002/asi.20105.
- Arrowsmith: Linking documents, disciplines, investigators and databases. [http://arrowsmith.psych.uic.edu]
- Entrez-PubMed. [http://pubmed.gov]
- PubMed E-Utilities. [http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html]
- Biomedical Tokenizer and Stemmer. [http://arrowsmith.psych.uic.edu/cgi-bin/arrowsmith_uic/tokenizer.cgi]
- Aronson AR: Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. Proc Am Med Informatics Assn Symp. 2001, 17-21.
- McCray AT, Burgun A, Bodenreider O: Aggregating UMLS semantic types for reducing conceptual complexity. Medinfo. 2001, 10: 216-220.
- Bowden DM, Martin RF: NeuroNames Brain Hierarchy. Neuroimage. 1995, 2: 63-83. 10.1006/nimg.1995.1009.
- Entrez Gene. [http://www.ncbi.nlm.nih.gov/sites/entrez]
- Grishman R, Kittredge R: Analyzing Language in Restricted Domains: Sub-language Description and Processing. Lawrence Erlbaum Associates. 1986, 19-38.
- Medical Subject Headings. [http://www.nlm.nih.gov/mesh/]
- Wilbur WJ, Yang Y: An analysis of statistical term strength and its use in the indexing and retrieval of molecular biology texts. Comput Biol Med. 1996, 26: 209-222. 10.1016/0010-4825(95)00055-0.
- Plake C, Schiemann T, Pankalla M, Hakenberg J, Leser U: AliBaba: PubMed as a graph. Bioinformatics. 2006, 22: 2444-2445. 10.1093/bioinformatics/btl408.
- Divoli A, Attwood TK: BioIE: extracting informative sentences from the biomedical literature. Bioinformatics. 2005, 21 (9): 2138-2139. 10.1093/bioinformatics/bti296.
- Chen H, Sharp BM: Content-rich biological network constructed by mining PubMed abstracts. BMC Bioinformatics. 2004, 5: 147. 10.1186/1471-2105-5-147.
- ConceptLink. [http://project.cis.drexel.edu/conceptlink/]
- Douglas SM, Montelione GT, Gerstein M: PubNet: a flexible system for visualizing literature derived networks. Genome Biol. 2005, 6: R80. 10.1186/gb-2005-6-9-r80.
- Perez-Iratxeta C, Bork P, Andrade MA: XplorMed: a tool for exploring MEDLINE abstracts. Trends Biochem Sci. 2001, 26: 573-575. 10.1016/S0968-0004(01)01926-0.
- Doms A, Schroeder M: GoPubMed: exploring PubMed with the Gene Ontology. Nucleic Acids Res. 2005, 33: W783-786. 10.1093/nar/gki470.
- Tenner H, Thurmayr GR, Thurmayr R: Data mining with Meva in MEDLINE. Lecture Notes in Computer Science Series. 2003, 2868: 39-46.
- McDonald R, Pereira F: Identifying gene and protein mentions in text using conditional random fields. BMC Bioinformatics. 2005, 6 (Suppl 1): S6. 10.1186/1471-2105-6-S1-S6. Epub 2005 May 24.
- PubReMiner. [http://bioinfo.amc.uva.nl/human-genetics/pubreminer/]
- Ding J, Hughes LM, Berleant D, Fulmer AW, Wurtele ES: PubMed Assistant: a biologist-friendly interface for enhanced PubMed search. Bioinformatics. 2006, 22: 378-380. 10.1093/bioinformatics/bti821.
- ClusterMed. [http://clustermed.info/]
- Eaton AD: HubMed: a web-based biomedical literature search interface. Nucleic Acids Res. 2006, 34: W745-747. 10.1093/nar/gkl037.
- Plikus MV, Zhang Z, Chuong CM: PubFocus: semantic MEDLINE/PubMed citations analytics through integration of controlled biomedical dictionaries and ranking algorithm. BMC Bioinformatics. 2006, 7: 424. 10.1186/1471-2105-7-424.
- Siadaty MS, Shu J, Knaus WA: Relemed: sentence-level search engine with relevance score for the MEDLINE database of biomedical articles. BMC Med Inform Decis Mak. 2007, 7: 1. 10.1186/1472-6947-7-1.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.