Lu 2011 PubMed and Beyond
Review
PubMed and beyond: a survey of web tools
for searching biomedical literature
Zhiyong Lu*
National Center for Biotechnology Information (NCBI), National Library of Medicine, Bethesda, MD 20894, USA
The past decade has witnessed the modern advances of high-throughput technology and rapid growth of research capacity
in producing large-scale biological data, both of which were concomitant with an exponential growth of biomedical
literature. This wealth of scholarly knowledge is of significant importance for researchers in making scientific discoveries
and healthcare professionals in managing health-related matters. However, the acquisition of such information is becoming
increasingly difficult due to its large volume and rapid growth. In response, the National Center for Biotechnology
Information (NCBI) is continuously making changes to its PubMed Web service for improvement. Meanwhile, different
entities have devoted themselves to developing Web tools for helping users quickly and efficiently search and retrieve
relevant publications. These practices, together with maturity in the field of text mining, have led to an increase in the
number and quality of various Web tools that provide comparable literature search service to PubMed. In this study, we
review 28 such tools, highlight their respective innovations, compare them to the PubMed system and one another, and
discuss directions for future development. Furthermore, we have built a website dedicated to tracking existing systems and
future advances in the field of biomedical literature search. Taken together, our work serves information seekers in
choosing tools for their needs and service providers and developers in keeping current in the field.
Database URL: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/search
© The Author(s) 2011. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Review Database, Vol. 2011, Article ID baq036, doi:10.1093/database/baq036
Figure 1. Growth of PubMed citations from 1986 to 2010. Over the past 20 years, the total number of citations in PubMed
has increased at a 4% growth rate. There are currently over 20 million citations in PubMed. Data for 2010 are partial (through
December 1).
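As a rough check on the growth figure quoted in the caption, the sketch below compounds a 4% rate over 20 years. It assumes that the 4% is an annual, compounded rate, which the caption does not state explicitly; the function name is illustrative only.

```python
def growth_factor(rate: float, years: int) -> float:
    """Total multiplicative growth after compounding `rate` once per year."""
    return (1.0 + rate) ** years


# Assumption: the 4% quoted in the caption is an annual compound rate.
factor = growth_factor(0.04, 20)
print(f"Growth over 20 years at 4% per year: x{factor:.2f}")
```

Under that assumption, the number of citations slightly more than doubles over two decades (a factor of about 2.19).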
In response to such a problem of information overload, the NCBI has made efforts (see the detailed discussion in the ‘Changes to PubMed and looking into the future’ section) to enhance standard PubMed searches by suggesting more specific queries (4). At the same time, the free availability of MEDLINE data and the Entrez Programming Utilities (2) makes it possible for external entities, from either academia or industry, to create alternative Web tools that are complementary to PubMed.
We present herein a list of 28 such systems, group them by their unique features, compare their differences (with PubMed and one another), and highlight their individual innovations. First and foremost, we aim to provide general readers with an overview of PubMed and its recent development, as well as short summaries of other comparable systems that are freely accessible on the Internet. The second objective is to provide researchers, developers and service providers with a summary of innovative aspects of recently developed systems, as well as a comparison of different systems. Finally, we have developed a website that is dedicated to online biomedical literature search systems. In addition to the systems discussed in this article, we will keep it updated with new systems so that readers can always be informed of the most current advances in the field.
We believe this work represents the most comprehensive review of systems for seeking information in biomedical literature to date. Unlike many other review articles on text-mining systems (5–11), we limited our focus exclusively to systems that are: (i) for biomedical literature search and (ii) comparable to the PubMed system. The most comparable work is an earlier survey of 18 tools in 2008 (12). However, our review is significantly different in several major aspects. First, the majority of the systems (19/28) in our review were not previously discussed, owing to different selection criteria or to their emergence since 2008. Second, we use different classification criteria for categorizing and comparing systems, so readers can find discussion from different perspectives. Third, we provide a more detailed overview of each system and its unique features. In particular, we describe PubMed and its recent development in greater detail based on our own experience. Lastly, we have built a website with links to existing systems and mechanisms for registering future systems. Altogether, our work complements the previous survey and, more importantly, provides one-stop shopping for biomedical literature search systems.
PubMed: the primary tool for searching biomedical literature
Contents and intended audience
PubMed’s intended users include researchers, healthcare professionals and the general public, who either need some specific articles (e.g. searching with an article title) or, more generally, search for the most relevant articles pertaining to their individual interests (e.g. information about a disease). A general workflow of how users interact with PubMed is displayed in Figure 2: a user queries PubMed or another similar system for a particular biomedical information need. Offered a set of retrieved documents, the user can browse the result set and subsequently click to view abstracts or full-text articles, issue a new query, or abandon the current search.
From a search perspective, PubMed takes as input natural language, free-text keywords and returns a list of
Figure 2. Overview of general user interactions with PubMed (or similar systems) for searching biomedical literature. Adapted
from Islamaj Dogan et al. (3).
citations that match input keywords (PubMed ignores stopwords). Its search strategy has two major characteristics. First, by default it adds Boolean operators into user queries and uses automatic term mapping (ATM). Specifically, the Boolean operator ‘AND’ is inserted between the terms of a multi-term user query to require retrieved documents to contain all the user keywords. For example, if a user issued the query ‘pubmed search’, the Boolean operator ‘AND’ would be automatically inserted between the two words as ‘pubmed AND search’.
In addition, PubMed automatically compares and maps keywords from a user query to lists of pre-indexed terms (e.g. Medical Subject Headings, MeSH®) through its ATM process (http://www.nlm.nih.gov/pubs/techbull/mj08/mj08_pubmed_atm_cite_sensor.html; 13). That is, if a user query can be mapped to one or more MeSH concepts, PubMed will automatically add the corresponding MeSH term(s) to the original query. As a result, in addition to retrieving documents containing the query terms, PubMed also retrieves documents indexed with those MeSH terms. Take the earlier example ‘pubmed search’ for illustration: because the word ‘pubmed’ can be mapped to MeSH, the final executed search is ‘[‘pubmed’ (MeSH terms) OR ‘pubmed’ (all fields)] AND ‘search’ (all fields)’, where the PubMed search tags (all fields) and (MeSH terms) indicate that the preceding word will be searched in all indexed fields or only in the MeSH indexing field, respectively.
The second major characteristic of PubMed is its choice of ranking and displaying search results in reverse chronological order. More specifically, by default PubMed returns matched citations in the time sequence of when they were first entered in PubMed. This date is formally termed the Entrez Date (EDAT) in PubMed.
Other tools comparable to PubMed
Standards for selecting comparable systems
In this work, we selected systems for review based on the following three criteria. First, they should be Web-based and operate on equivalent or similar content as PubMed. Systems that are designed to search beyond abstracts, such as full text (e.g. PubMed Central; Google Scholar) or figures/tables [e.g. BioText (14); Yale Image Finder (15)], are thus not included for consideration in this work. Moreover, we focus on tools developed specifically for the biomedical domain. Hence, some general Web-based services such as Google Scholar are excluded from the discussion. Second, a system should be capable of searching an arbitrary topic in the biomedical literature as opposed to some limited areas. Although most citations in PubMed are of biologically relevant subjects (e.g. gene or disease), the topics in the entire biomedical literature are of a much broader coverage. For example, it includes a number of interdisciplinary subjects such as bioinformatics. In other words, the proposed system needs to be developed generally enough so that different kinds of topics can be searched. Third, the online Web system should require no installation or subscription fee (i.e. freely accessible),
which would allow the users to readily experience the service. By these three standards, a total of 28 qualified systems were found and they are listed in Tables 1 and 2 below. Moreover, we classified them into four categories depending on the best match between their most notable features and the category theme. Note that some systems may have features belonging to multiple groups and that within each group, we list systems in reverse chronological order. In Table 1, we show the year when a system was first introduced and highlight major features that distinguish different systems from the technology development perspective. In Table 2, we compare a set of features that affect the value and utility of different tools from a user perspective. For instance, we report the last content update time for each system, as most users would like to keep informed of the latest publications. Specifically, we used the PubMed content as the study control and searched for the latest PubMed citation (PMID: 20726112 on 23 August 2010) in all the systems during comparison. When the citation can be found in a system, we consider its content as ‘current’ with PubMed. Otherwise, either an exact date (if such information is provided at the website) or an approximate year is labeled.
Table 1. PubMed derivatives are grouped according to their most notable features
Tools are listed in the same order as they appear in Table 1. PubMed was used as the study control (assessed on 23 August 2010) for
content last update (i.e. current means its content is current with the PubMed content). Latest year information was used when no exact
date can be determined. Symbol 3 stands for yes, and for no. Govn’t, government.
Based on the content of both tables, we have the following observations:
(1) The majority (16/28) of systems contain either ‘Pub’ or ‘Med’ in their name, indicating their strong bond to the PubMed system.
(2) All reviewed systems have been developed continuously during the past 10 or so years, starting from the introduction of PubCrawler in 1999 to iPubMed, the newest member in 2010. It is roughly the same period of time in which significant advances and maturation took place in the fields of text mining and Web technology. Many novel techniques in those two fields (e.g. named entity recognition techniques) were driving forces in the development of various systems reviewed in this work.
(3) Most systems were developed by academic researchers. Yet, several systems also came from the private sector (i.e. Hakia, Cognition, ClusterMed, Quertle) or the public sector (e.g. CiteXplore from the European Bioinformatics Institute). In addition to free access (a requirement for all the systems), the source code of two academic systems (MScanner and Twease) is freely available at their websites under the GNU General Public License.
(4) Similar to the general Web search engines such as Google, the presentation of search results in the
reviewed tools is primarily list based. For some systems that perform result clustering, the list can be further grouped into different topics. Other output formats include tabular and graph presentations, which are designed for systems that are able to extract and display semantic relations.
(5) Although only a few systems offer links to full-text and related articles, and allow export to bibliographic management software after searches (desirable functions in literature search), one can always (except in one system) follow the PubMed link to use those utilities.
(6) When comparing the four different development themes, improving ranking and the user interface seem to be the more popular directions. In the following sections, we describe each of the 28 systems in greater detail.
Ranking search results
PubMed returns search results in reverse chronological order by default. In other words, the most recent publications are always returned first. Although returning results in time order has its own advantages, several systems are devoted to seeking alternative strategies for ranking results.
RefMed (16) is a recent development based on both machine learning and information retrieval (IR) techniques. It first retrieves search results based on user queries. Next, it asks for explicit user feedback on relevant documents and uses such information to learn a ranking function with a learning-to-rank algorithm, RankSVM (17,18). Subsequently, the learned function ranks retrieval results by relevance in the next iteration.
Quertle (19) is a recent biomedical literature search engine developed by a for-profit private enterprise. Its core concept recognition features allow users to incorporate concept categories into their searches. For instance, one of its concept categories represents all protein names, so users can search all specific proteins as a whole. It is also claimed that the system extracts relationships based on context to improve text retrieval. However, the details are not clearly described to the public.
MedlineRanker (20) takes as input a set of documents relating to a certain topic, and automatically learns a list of the most discriminative words representing that topic based on a Naïve Bayes classifier. It can then use the learned words to score and rank newly published articles pertaining to the topic.
MiSearch (21) is an online tool that ranks citations by using implicit relevance feedback (22). Unlike RefMed, it uses user clickthrough history as implicit feedback for identifying terms relevant to the user’s information need, in the form of log likelihood ratios. MEDLINE citations that contain a larger number of such relevant terms are ranked higher than those with fewer such terms. The implicit relevance feedback model also takes the recency effect into consideration.
Hakia (23) offers access to more than 10 million MEDLINE citations through pubmed.hakia.com. Because it is a product of a private company, it is unclear which ranking algorithm is employed in the system, except that it is said to use some kind of semantic search technology.
Semantic MEDLINE™ (24) was built on CognitionSearch™, a system developed with Cognition’s proprietary Semantic NLP™ technology, which incorporates word and phrase knowledge for understanding the semantic meaning of the English language. The Semantic MEDLINE system adds specific vocabularies from biomedicine in order to better understand the domain-specific language. As with Hakia, details are not revealed to the public.
MScanner (25) is mostly comparable to MedlineRanker in terms of its functionality. The major difference is that it uses MEDLINE annotations (MeSH and journal identifiers) instead of words (nouns) in the abstract when doing the classification. As a result, MScanner is able to process documents faster, but it cannot process articles with incomplete or missing annotations.
eTBLAST (26) is capable of identifying relevancy by finding documents similar to the input text. Unlike PubMed’s related articles (27), which uses summed weights of overlapping words between two documents, eTBLAST determines text similarity based on word alignment. Thus, abstract-length textual input is superior to short queries in obtaining good results.
PubFocus (28) sorts articles based on a hybrid of domain-specific factors for ranking scientific publications: journal impact factor, volume of forward references, reference dynamics, and authors’ contribution level.
Twease (29) was built on the classic Okapi BM25 ranking algorithm (30) with twists such that retrieval performance can be maintained when query terms are automatically expanded through biomedical thesauri or post-indexing stemming.
Clustering results into topics
The common theme of the five systems in the second group is the categorization of search results, aiming for quicker navigation and easier management of large numbers of returned results. Such a technique was developed in response to the problem of information overload: users are often overwhelmed by a long list of returned documents. As pointed out in ref. (31), this technique is generally shown
to be effective and useful for seeking relevant information from medical journal articles. As discussed in detail below, the five systems mainly differ in the manner by which search results are clustered.
Anne O’Tate (32) post-processes retrieved results from PubMed searches and groups them into one of the pre-defined categories: important words, MeSH topics, affiliations, author names, journals and year of publication. Important words have more frequent occurrences in the result subset than in MEDLINE as a whole; thus they distinguish the result subset from the rest of MEDLINE. Clicking on a given category name will display all articles in that category. To find an article by multiple categories, one can follow the categories progressively (e.g. first restricting results by year of publication, then by journals).
McSyBi (33) presents clustered results in two distinct fashions: hierarchical or non-hierarchical. While the former provides an overview of the search results, the latter shows relationships among the search results. Furthermore, it allows users to re-cluster results by imposing either a MeSH term or a UMLS Semantic Type of their research interest. Updated clusters are automatically labeled by relevant MeSH terms and by signature terms extracted from titles and abstracts.
GOPubMed (34) was originally designed to leverage the hierarchy in Gene Ontology (GO) to organize search results, thus allowing users to quickly navigate results by GO categories. Recently, it was made capable of sorting results into four top-level categories: what (biomedical concepts), who (author names), where (affiliations and journals) and when (date of publications). In the what category, articles are further sorted according to relevant GO, MeSH or UniProt concepts.
ClusterMed (35) can cluster results in six different ways: (i) title, abstract and MeSH terms (TiAbMh); (ii) title and abstract (TiAb); (iii) MeSH terms (Mh); (iv) author names (Au); (v) affiliations (Ad) and (vi) date of publication (Dp). For example, when clustering results by TiAbMh, both selected words from title/abstract and MeSH terms are used as filters. Like Hakia, ClusterMed is a proprietary product from a commercial company (Vivisimo) that specializes in enterprise search platforms. Thus, how the filters are selected is not known to the public.
XplorMed (36) not only organizes results by MeSH classes but also allows users to explore the subject and words of interest. Specifically, it first returns a coarse-level clustering of results using MeSH, offering an opportunity for users to restrict their search to certain categories of interest. Next, the tool displays keywords in the selected abstracts. At this step, users can choose to either go directly to the next step or start a deeper analysis of the displayed subjects. The former would present chains of closely related keywords, while the latter allows users to explore the relationships between different keywords and their mentions in MEDLINE articles. Finally, by selecting one or more chained keywords, the system returns a list of articles ranked by those selected keywords.
Enriching results with semantics and visualization
The five systems in this group aim to analyze search results and present summarized knowledge of semantics (biomedical concepts and their relationships) based on information extraction techniques. They differ in three aspects: (i) the types of biomedical concepts and relations to be extracted; (ii) the computational techniques used for information extraction; and (iii) how they present extraction results.
MedEvi (37) provides 10 concept variables of major biological entities (e.g. gene) to be used in semantic queries such that the search results are bound to the associated biological entities. It also prioritizes search results so that citations whose matching keywords occur in the same order as in the original query are returned first.
EBIMED (38) extracts proteins, GO annotations, drugs and species from retrieved documents. Relationships between extracted concepts are identified based on co-occurrence analysis. The overall results are presented in table format.
CiteXplore (39) is a system that combines literature search with text-mining tools in order to provide integrated access to both literature and biological data. In addition to the content of PubMed, it also contains abstract records of patent applications from the European Patent Office and records from the Shanghai Information Center for Life Sciences, Chinese Academy of Sciences. One other feature of CiteXplore is its inclusion of reference citation information.
MEDIE (40) provides semantic search in addition to standard keyword search in the format of (subject, verb, object) and returns text fragments (abstract sentences) that match the queried semantic relations. Its output is based on both syntactic and semantic parses of the abstract sentences. For example, a semantic search such as ‘what causes colon cancer?’ will require the output sentences to match ‘cause’ and ‘colon cancer’ as the event verb and object, respectively.
PubNet (41) stands for Publication Network Graph Utility. It parses the XML output of standard PubMed queries and creates different kinds of networks depending on the type of nodes and edges a user selects. Nodes can represent articles, authors or database IDs (e.g. PDB IDs), and edges are constructed based on shared authors, MeSH terms or location (articles that have identical affiliation zip codes). The graph
networks are drawn with the aid of private visualization software.
PubCrawler (50,51) checks for daily updates in MEDLINE that match the pre-specified searches saved by users and emails them the results.
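A PubCrawler-style daily check can be approximated with the Entrez Programming Utilities mentioned earlier. The sketch below only builds ESearch request URLs for a set of saved searches; it is not PubCrawler's implementation, the saved-search names and helper functions are hypothetical, and fetching the results and emailing them are deliberately left out.

```python
from typing import Dict
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def daily_update_url(saved_query: str, days: int = 1, retmax: int = 100) -> str:
    """Build an ESearch URL for citations added within the last `days` days."""
    params = {
        "db": "pubmed",
        "term": saved_query,
        "reldate": days,      # look back this many days from today
        "datetype": "edat",   # compare against the Entrez Date (EDAT)
        "retmax": retmax,     # cap on the number of PMIDs returned
    }
    return f"{ESEARCH}?{urlencode(params)}"


def daily_update_urls(saved_searches: Dict[str, str]) -> Dict[str, str]:
    """Map each saved-search name (hypothetical) to its daily-update URL."""
    return {name: daily_update_url(query) for name, query in saved_searches.items()}
```

A scheduler could call `daily_update_urls({"my topic": "pubmed search"})` once a day, fetch each URL, and send any new PMIDs to the user; those steps depend on the deployment and are not shown.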
Use cases beyond typical PubMed searches
Based on the novel features in each system described above, we show in Figure 3 a list of specific use scenarios that are beyond typical searches in PubMed. Specifically, we first identified a diverse set of 12 use cases, to each of which we further attached applicable systems accordingly. For instance, one can use tools surveyed in this work to search for experts on a specific topic or to visualize search results in networks. Although traditionally PubMed cannot meet many of the listed special user needs, its recent development allowed it to perform certain tasks such as identifying similar publications, alerting users with updates and providing feedback in query refinement. More details are presented in the ‘Changes to PubMed and looking into the future’ section.
Searching
Since most users only examine a few returned results on the first result page [Figure 7 in ref. (3)], it is unquestionable that displaying citations by relevance is a desired feature in literature search. The 10 systems listed in the ‘Ranking search results’ section differ from PubMed in this regard. Although most of those systems take user keywords as input, they differ from each other in how they process the keywords and subsequently use them to retrieve relevant citations. Like PubMed’s ATM, Twease also has its own query expansion component where additional MeSH terms and others can be added to the original user keywords. This technique can typically boost recall and is especially useful when the original query retrieves few or zero results (13). On the other hand, the other systems listed in the ‘Ranking search results’ section mostly aim for improved precision over PubMed’s default reverse time sorting scheme. Their
Figure 3. A diverse set of use cases in which different tools may be used.
ranking strategies are very different from one another, ranging from traditional IR techniques like explicit/implicit feedback (RefMed/MiSearch) and relevance ranking (Twease), to utilizing domain-specific importance factors like journal impact factors and citation numbers (PubFocus), to some unknown proprietary semantic NLP technologies (Hakia and SemanticSearch).
Results analysis
By default, PubMed returns 20 search results in a page and displays the title, abstract and other bibliographic information when a result is clicked. Recent studies focus on two kinds of extensions to the standard PubMed output. First, because a PubMed search typically results in a long list of citations for manual inspection, systems mentioned in the ‘Clustering results into topics’ section aim to provide an aid with a short list of major topics summarized from the retrieved articles. Thus, users can navigate and choose to focus on the subjects of interest. This is similar to building filters for the result set (66). In this regard, choosing appropriate topic terms to cluster search results into meaningful groups is the key to the success of such approaches. Currently, most systems rely on selecting either important words from title/abstract or terms from biomedical controlled vocabularies/ontologies (e.g. MeSH) as representa-

summarized results visible in graphs (ALiBaba and PubNet). Second, several systems provide easier access to PDFs (PubGet) and external citation managers (PubMed assistant; HubMed).
Changes to PubMed and looking into the future
In response to the great need and challenge in literature search, PubMed has also gone through a series of significant changes to better serve its users. As shown in Figure 4, many of the recent changes happened during the same time period in which the 28 reviewed systems were developed. So they may have learned from each other. Indeed, some features were first developed in PubMed (e.g. related articles) while others were first developed in third-party applications (e.g. email alerts).
A new initiative geared towards promoting scientific discoveries was introduced to PubMed a few years ago. Specifically, by providing global search across NCBI’s different databases through the Entrez System (http://www.ncbi.nlm.nih.gov/gquery/), users now have integrated access to all the stored information in different databases to know about a biological entity, be it related publications, DNA sequences or protein structures. Furthermore, inter-database links have been established and made obvious in search result pages, making the related data readily accessible between literature and other NCBI’s biological
tive topic terms. databases. For instance, through integrated links originat-
The second extension to the standard PubMed output is ing in PubMed results, users can access information about
due to the advances in text-mining techniques. In particu- chemicals in PubChem or protein structures in the Structure
lar, semantic annotation is believed to be one of the prob- database. Another category of discovery components is
able cornerstones in future scientific publishing (67) despite known as sensors (http://www.nlm.nih.gov/pubs/techbull/
the fact that its full benefits are yet to be determined. Thus nd08/nd08_pm_gene_sensor.html; http://www.nlm.nih
with the development and maturity of techniques in .gov/pubs/techbull/mj08/mj08_pubmed_atm_cite_sensor
named entity recognition and biomedical information ex- .html). A sensor detects certain types of search terms and
traction, some systems present summarized results of deep provides access to relevant information other than litera-
semantic enrichment. Existing systems (‘Enriching results ture. For instance, PubMed’s gene sensor detects gene men-
with semantics and visualization’ section) have mostly tions in user queries and shows links directing users to the
focused on finding genes, proteins, drugs, diseases and spe- associated gene records in Entrez Gene. Although these
cies in free text and their biological relationships such as new additions are specific to PubMed and developed inde-
protein–protein interactions. Problems in these areas have pendently, they nevertheless all reflect the idea of seman-
received the most attention in the text mining community tically enriching the literature with biological data of
(68,69). various kinds, to achieve the goal of more efficient acqui-
sition of knowledge.
Interface and usability With respect to research and retrieval, there are also sev-
In addition to providing improved search quality, a number eral noteworthy endeavors in PubMed development al-
of systems strive to provide a better search interface, though its default sorting schema has been kept intact.
including various changes to input and output. An innova- First, the related article feature was integrated into
tive feature in iPubMed is ‘search-as-you-type’, thus PubMed so that users can readily examine similar articles
enabling users to dynamically choose queries while inspect- in content. eTBLAST has a similar feature, but as explained
ing retrieved results. Other proposals for an alternative earlier, the two systems rely on different techniques for
input interfaces facilitate user-specific questions (PICO, obtaining similar documents. Second, specific tools were
askMedline), allow non-English queries (BabelMeSH), and added into PubMed for different information needs. For
promote use of sliders to set limits (SLIM). With respect to instance, the citation matcher is designed for those who
changes to output, there are two major directions. First, search for specific articles. Another example is clinical
two systems employ additional components to make queries, an interface designed to serve the specific needs
.............................................................................................................................................................................................................................................................................................
Page 10 of 13
Database, Vol. 2011, Article ID baq036, doi:10.1093/database/baq036 Review
.............................................................................................................................................................................................................................................................................................
Figure 4. Technology development timeline for PubMed (in light green color) and other biomedical literature search tools
(in light orange color). For PubMed, it shows the staring year when various recent changes (limited to those mentioned in
‘Changes to PubMed and looking into the future’ section) were introduced. For other tools, we show the time period in which
tools of various features were first appeared.
of clinicians. It is fundamentally akin to the idea of categor- such entities as author names, disorders, genes/proteins and
izing search results (‘Ranking search results’ section) be- chemicals/drugs as they are repeatedly and heavily sought
cause the tool essentially discards any non-clinical results topics (3,70) in biomedicine. In addition, one key factor for
using a set of predefined filters. Finally, in order to help future system developers is the need to keep their content
users avert a long list of return results and narrow their current with the growth of the literature, as literature
searches, a new feature named ‘also try’ was recently intro- search has a recency effect—most users still prefer to be in-
duced, which offers query suggestions from the most popu- formed of the most current findings in the literature. Finally,
lar PubMed queries that contain the user search term (4). to be able to provide one-stop shopping for all 28 reviewed
Regarding the user interface and usability, the My NCBI systems plus the ones in the ‘Other honorable mentions’
tool was introduced to PubMed, which let users select and section and keep track of future developments in this area,
create filter options, save search results, apply personal we have built a website at http://www.ncbi.nlm.nih.gov/
preferences like highlighting search terms in results, and CBBresearch/Lu/search. It contains for every system, a high-
share collections of citations. Similar to PubCrawler, it also light and short description of its unique features, one or
allows users to set automatic emails for receiving updates more related publications, and a link to the actual system
of saved searches. Additional search help such as a spell on the Internet. To facilitate busy scientists to quickly find
checker and query auto-complete have also been deployed appropriate tools for their specific search needs, we have
in PubMed. Finally in 2009, the PubMed interface including built a set of search filters. For instance, one can narrow
its homepage was substantially redesigned such that it is down the entire list of systems to the only ones that keep
now simplified and easier to navigate and use. its content current with PubMed. Future systems will be
Literature search is a fundamentally important problem added to the website either through our quarterly update
in research and it will only become harder as the literature or by individual request. On the website, we have set up a
grows at a faster speed and broader scope (across the trad- mechanism for registering future systems. Once we receive
itional disciplinary boundaries). Therefore we expect con- such a request, we will curate the necessary information (e.g.
tinuous developments and new emerging systems in this system highlights) about the submitted system and make it
field. In particular, with the advances in search and Web immediately available at the website.
technologies in general, we are likely to see progress in
literature search as well. With the maturity of biomedical
text-mining techniques in recognizing biological entities
Conclusions
and their relations, better semantic identification and sum- By our three selection standards, a total of 28 Web systems
marization of search results may be achieved, especially for were included in this review. They are comparable to
PubMed given that they are designed for the same purpose and make use of full or partial PubMed data. We first provided a general description of PubMed including its content and unique characteristics. Next, according to their different features, we classified the 28 systems into four major groups, within which we described each of them in greater detail and showed their differences. Finally, we reviewed the 28 systems as a whole and discussed their innovative aspects with respect to searching, result analysis and enrichment, and user interface/usability. This review can directly serve both non-experts and expert users when they wish to find systems other than PubMed. Moreover, the review provides a detailed summary of the recent advances in the field of biomedical literature search. This is particularly useful for existing service providers and anyone interested in future development in the field. Finally, the constructed website provides integrated and ready access to all reviewed systems and a venue for registering future systems.

Acknowledgements
The author is grateful for the helpful discussions with John Wilbur, Minlie Huang and Natalie Xie.

Funding
Funding for this work and open access charge: Intramural Research Program of the National Institutes of Health, National Library of Medicine.

Conflict of interest: None declared.

References
9. Krallinger,M., Valencia,A. and Hirschman,L. (2008) Linking genes to literature: text mining, information extraction, and retrieval applications for biology. Genome Biol., 9 (Suppl. 2), S8.
10. Cohen,K.B. and Hunter,L. (2008) Getting started in text mining. PLoS Comput. Biol., 4, e20.
11. Clegg,A.B. and Shepherd,A.J. (2008) Text mining. Methods Mol. Biol., 453, 471–491.
12. Kim,J.J. and Rebholz-Schuhmann,D. (2008) Categorization of services for seeking information in biomedical literature: a typology for improvement of practice. Brief. Bioinform., 9, 452–465.
13. Lu,Z., Kim,W. and Wilbur,W.J. (2009) Evaluation of query expansion using MeSH in PubMed. Inf. Retr., 12, 69–80.
14. Hearst,M.A., Divoli,A., Guturu,H. et al. (2007) BioText Search Engine: beyond abstract search. Bioinformatics, 23, 2196–2197.
15. Xu,S., McCusker,J. and Krauthammer,M. (2008) Yale Image Finder (YIF): a new search engine for retrieving biomedical images. Bioinformatics, 24, 1968–1970.
16. Yu,H., Kim,T., Oh,J. et al. (2010) Enabling multi-level relevance feedback on PubMed by integrating rank learning into DBMS. BMC Bioinformatics, 11 (Suppl. 2), S6.
17. Joachims,T. (2002) Optimizing search engines using clickthrough data. In: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, Edmonton, Alberta, Canada.
18. Liu,T.-Y., Joachims,T., Li,H. et al. (2010) Introduction to special issue on learning to rank for information retrieval. Inform. Retr., 13, 197–200.
19. Quertle (2009) http://www.quertle.info (23 August 2010, date last accessed).
20. Fontaine,J.F., Barbosa-Silva,A., Schaefer,M. et al. (2009) MedlineRanker: flexible ranking of biomedical literature. Nucleic Acids Res., 37, W141–W146.
21. States,D.J., Ade,A.S., Wright,Z.C. et al. (2009) MiSearch adaptive pubMed search tool. Bioinformatics, 25, 974–976.
22. Crestani,F., Girolami,M., van Rijsbergen,C. et al. (2002) The use of implicit evidence for relevance feedback in web retrieval. In:
30. Robertson,S.E., Walker,S., Jones,S. et al. (1994) Okapi at TREC-3. Third Text REtrieval Conference. NIST Gaithersburg Maryland, USA.
31. Pratt,W. and Fagan,L. (2000) The usefulness of dynamically categorizing search results. J. Am. Med. Inform. Assoc., 7, 605–617.
32. Smalheiser,N.R., Zhou,W. and Torvik,V.I. (2008) Anne O’Tate: a tool to support user-driven summarization, drill-down and browsing of PubMed search results. J. Biomed. Discov. Collab., 3, 2.
33. Yamamoto,Y. and Takagi,T. (2007) Biomedical knowledge navigation by literature clustering. J. Biomed. Inform., 40, 114–130.
34. Doms,A. and Schroeder,M. (2005) GoPubMed: exploring PubMed with the Gene Ontology. Nucleic Acids Res., 33, W783.
35. ClusterMed (2004) http://demos.vivisimo.com/clustermed (23 August 2010, date last accessed).
36. Perez-Iratxeta,C. (2001) XplorMed: a tool for exploring MEDLINE abstracts. Trends Biochem. Sci., 26, 573–575.
37. Kim,J.J., Pezik,P. and Rebholz-Schuhmann,D. (2008) MedEvi: retrieving textual evidence of relations between biomedical concepts from Medline. Bioinformatics, 24, 1410–1412.
38. Rebholz-Schuhmann,D., Kirsch,H., Arregui,M. et al. (2007) EBIMed–text crunching to gather facts for proteins from Medline. Bioinformatics, 23, e237.
39. CiteXplore (2006) http://www.ebi.ac.uk/citexplore/ (23 August 2010, date last accessed).
40. Ohta,T., Tsuruoka,Y., Takeuchi,J. et al. (2006) An intelligent search engine and GUI-based efficient MEDLINE search tool based on deep syntactic parsing. In: Proceedings of the COLING/ACL on Interactive presentation sessions. Association for Computational Linguistics, Sydney, Australia.
41. Douglas,S., Montelione,G. and Gerstein,M. (2005) PubNet: a flexible system for visualizing literature derived networks. Genome Biol. 2008, 9, S1.
42. Wang,J., Cetindil,I., Ji,S. et al. (2010) Interactive and fuzzy search: a dynamic way to explore MEDLINE. Bioinformatics, 26, 2321–2327.
43. Pubget (2007) http://pubget.com/ (23 August 2010, date last accessed).
44. Liu,F., Ackerman,M. and Fontelo,P. (2006) BabelMeSH: development of a cross-language tool for MEDLINE/PubMed. AMIA Annu. Symp. Proc., 2006, 1012.
45. Eaton,A.D. (2006) HubMed: a web-based biomedical literature search interface. Nucleic Acids Res., 34, W745–W747.
46. Fontelo,P., Liu,F. and Ackerman,M. (2005) askMEDLINE: a free-text, natural language query tool for MEDLINE/PubMed. BMC Med. Inform. Decis. Mak., 5, 5.
47. Fontelo,P., Liu,F., Ackerman,M. et al. (2006) askMEDLINE: a report on a year-long experience. AMIA Annu. Symp. Proc., 923.
48. Muin,M., Fontelo,P., Liu,F. et al. (2005) SLIM: an alternative Web interface for MEDLINE/PubMed searches - a preliminary study. BMC Med. Inform. Decis. Mak., 5, 37.
49. Schardt,C., Adams,M.B., Owens,T. et al. (2007) Utilization of the PICO framework to improve searching PubMed for clinical questions. BMC Med. Inform. Decis. Mak., 7, 16.
50. Hokamp,K. and Wolfe,K.H. (2004) PubCrawler: keeping up comfortably with PubMed and GenBank. Nucleic Acids Res., 32, W16–W19.
51. Hokamp,K. and Wolfe,K. (1999) What’s new in the library? What’s new in GenBank? let PubCrawler tell you. Trends Genet., 15, 471–472.
52. Ding,J., Hughes,L.M., Berleant,D. et al. (2006) PubMed Assistant: a biologist-friendly interface for enhanced PubMed search. Bioinformatics, 22, 378–380.
53. Plake,C., Schiemann,T., Pankalla,M. et al. (2006) ALIBABA: PubMed as a graph. Bioinformatics, 22, 2444.
54. Tsai,R.T., Dai,H.J., Lai,P.T. et al. (2009) PubMed-EX: a web browser extension to enhance PubMed search with text mining features. Bioinformatics, 25, 3031–3032.
55. Fernandez,J.M., Hoffmann,R. and Valencia,A. (2007) iHOP web services. Nucleic Acids Res., 35, W21–W26.
56. Chen,H. and Sharp,B.M. (2004) Content-rich biological network constructed by mining PubMed abstracts. BMC Bioinformatics, 5, 147.
57. Cheng,D., Knox,C., Young,N. et al. (2008) PolySearch: a web-based text mining system for extracting relationships between human diseases, genes, mutations, drugs and metabolites. Nucleic Acids Res., 36, W399–W405.
58. Wermter,J., Tomanek,K. and Hahn,U. (2009) High-performance gene name normalization with GeNo. Bioinformatics, 25, 815–821.
59. Torvik,V.I. and Smalheiser,N.R. (2009) Author name disambiguation in MEDLINE. ACM Trans. Knowl. Discov. Data, 3, 11:1–11:29.
60. Goetz,T. and Von Der Lieth,C.-W. (2005) PubFinder: a tool for improving retrieval rate of relevant PubMed abstracts. Nucleic Acids Res., 33, W774.
61. Siadaty,M.S., Shu,J. and Knaus,W.A. (2007) Relemed: sentence-level search engine with relevance score for the MEDLINE database of biomedical articles. BMC Med. Inform. Decis. Mak., 7, 1.
62. Tanabe,L., Scherf,U., Smith,L.H. et al. (1999) MedMiner: an Internet text-mining tool for biomedical information, with application to gene expression profiling. Biotechniques, 27, 1210–1214, 1216–1217.
63. Fattore,M. and Arrigo,P. (2005) Knowledge discovery and system biology in molecular medicine: an application on neurodegenerative diseases. In Silico Biol., 5, 199–208.
64. Kolchanov,N., Hofestaedt,R., Milanesi,L. et al. (2006) Topical clustering of biomedical abstracts by self-organizing maps. In: Bioinformatics of Genome Regulation and Structure II. Springer, US, pp. 481–490.
65. Lopez-Rubio,E. (2010) Probabilistic self-organizing maps for qualitative data. Neural Netw, 23, 1208–1225.
66. Kilicoglu,H., Demner-Fushman,D., Rindflesch,T.C. et al. (2009) Towards automatic recognition of scientifically rigorous clinical research evidence. J. Am. Med. Inform. Assoc., 16, 25–31.
67. Rinaldi,A. (2010) For I dipped into the future. EMBO Rep., 11, 345–359.
68. Krallinger,M., Leitner,F., Rodriguez-Penagos,C. et al. (2008) Overview of the protein-protein interaction annotation extraction task of BioCreative II. Genome Biol., 9 (Suppl. 2), S4.
69. Morgan,A.A., Lu,Z., Wang,X. et al. (2008) Overview of BioCreative II gene normalization. Genome Biol., 9 (Suppl. 2), S3.
70. Neveol,A., Islamaj-Dogan,R. and Lu,Z. (2010) Semi-automatic semantic annotation of PubMed Queries: a study on quality, efficiency, satisfaction. J. Biomed. Inform.