
F.4 Topic Detection and Tracking

This document discusses topic detection and tracking in news and media. Topic detection and tracking aims to automatically detect and group stories that discuss the same event or topic. It involves tasks like story segmentation, new event detection, cluster detection, and topic tracking. Effective topic detection relies on identifying both named entities and topic terms that are shared across stories on the same topic. Considering both named entities and topic terms separately can help systems more accurately determine whether a new story is discussing an existing topic or a completely new event.

Part F. Web Information Retrieval

F.4 Topic Detection and Tracking

Detecting and Tracking Current Events

On the World Wide Web, in the Deep Web (e.g. the information services of news agencies)
and in other media (e.g. radio or television), there are documents that report on
current events and whose content frequently appears in several different sources
(sometimes with slight variations). This is particularly true of news, but also of certain
weblog entries (Zhou, Zhong, & Li, 2011), of texts in Bulletin Board Systems (Zhao
& Xu, 2011) and of large collections of e‑mails organized by topic (Cselle, Albrecht,
& Wattenhofer, 2007). In contrast to “normal” information retrieval, the search here
does not start with a user’s recognized information need, but with a new
event that must be detected and represented. The user is offered information about
these events via specialized news systems, such as Google News, in the manner of
a push service. It is as if the user had tasked an SDI (selective dissemination of
information) service with keeping them up to date around the clock. Allan, who has
fundamentally shaped this area of research, calls this domain of information retrieval
“topic detection and tracking” (TDT). Allan (2002b, 139) defines the research area
as follows:

Topic Detection and Tracking (TDT) is a body of research and an evaluation paradigm that
addresses event-based organization of broadcast news. The TDT evaluation tasks of tracking,
cluster detection, and first story detection are each information filtering technology in the sense
that they require that “yes or no” decisions be made on a stream of news stories before additional
stories have arrived.

Google News restricts its offer to documents available in the WWW, which are offered
for free by news agencies or the online versions of newspapers. It does not take into
consideration the commercial offers of News Wires, most articles in the print versions
of newspapers and magazines (which still make up the majority of all news) as well
as all information transmitted non-digitally (via radio broadcasting). Google News
places its focus on new articles and those that are of general interest. Bharat (2003,
9) reports:

Specifically, freshness—measurable from the age of articles, and global editorial interest—meas-
urable from the number of original articles published worldwide on the subject, are used to infer
the importance of the story at a given time. If a story is fresh and has caused considerable origi-
nal reporting to be generated it is considered important. The final layout is determined based on
additional factors such as (i) the fit between the story and the section being populated, (ii) the
novelty of the story relative to other stories in the news, and (iii) the interest within the country,
when a country specific edition is being generated.

Unauthenticated
Download Date | 4/30/16 5:54 PM

Four fundamental concepts are of importance at this point:

–– A “story” is a delimitable passage (or an entire document) in which an event is
discussed.
–– A “topic” is the description of an event in the respective stories; according to
Fiscus and Doddington (2002, 18), it is “a seminal event or activity, along with directly
related events and activities.”
–– Grouping current stories into a new topic is “topic detection”.
–– Adding stories to a known topic is “topic tracking”.

Topic detection and tracking consists of several individual tasks (of which the first
five follow Allan, 2002a):

–– Story Segmentation: Isolating those stories that contain the respective topic (in
documents discussing several events),
–– New Event Detection: Identifying the first story that addresses a new event,
–– Cluster Detection: Grouping all stories that contain the same topic,
–– Topic Tracking: Analysis of the ongoing news stream for known topics,
–– Link Detection: Analysis tool for determining the topical similarity of two stories,
–– Allocating a Title to a Cluster: Either the title of the first story or the
first n terms, arranged by weight, from all stories belonging to the cluster,
–– Extract: Writing a short summary of the topic (as a form of automatic extracting),
–– Ranking the Stories: Where a cluster contains several stories, ranking them by
importance.

An overview of the working steps is shown in Figure F.4.1.

Topic Detection

In the case of news from radio and television, the audio signals must first be converted
into (written) text. This is accomplished either via manual transcription or
by using speech recognition systems. A news broadcast (e.g. a fifteen-minute edition of
the German “Tagesschau”) covers several separate stories, each of which is isolated
as an individual unit by segmenting the overall text (Allan et al.,
1998, 196 et seq.). An agency’s news stream is handled analogously. To simplify,
we can assume in this case that each news document contains exactly one story.

Figure F.4.1: Working Steps of Topic Detection and Tracking.

The decisive question in analyzing the news stream is: does the story that has just arrived
discuss a new topic, or does it address one that is already known? A (recognized) topic
is represented by the mean vector (centroid) of its stories. In topic detection we calculate
the similarity, or dissimilarity, between the current story and all others in the
database. If no similarity can be detected (i.e. if a new topic is at hand), this first story
is taken to be representative of the new topic. If, on the other hand, similarities are
observed between the current story and older ones, we are dealing with a case of topic
tracking. What follows is a comparison between the current story and all previously
known topics. A central role in topic detection and tracking is played by “story link
detection”, whose algorithm determines whether the current story deals with a new
topic or a known one.
At this point, Allan et al. (2005) use the Vector Space Model and a variant of
TF*IDF to determine the term weight. For every story from the news stream, a weighting
value is calculated for every single term. $tf_{t,s}$ is the (absolute) frequency of
occurrence of a term $t$ in the story $s$, $df_t$ counts all stories in the database that
contain the term $t$, and $N$ is the number of stories in the database. The term weight
$w_{t,s}$ of $t$ in $s$ is calculated as follows:

$$w_{t,s} = \frac{tf_{t,s} \cdot \log\left(\frac{0.5 + N}{df_t}\right)}{\log(N + 1)}$$
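The weighting formula can be sketched in a few lines of Python. This is only an illustration of the formula as stated above; the function name `term_weight` and its parameter names are our own choices, not taken from Allan et al.

```python
import math


def term_weight(tf: int, df: int, n_stories: int) -> float:
    """Weight of a term in a story: tf * log((0.5 + N) / df) / log(N + 1).

    tf        -- absolute frequency of the term in the story
    df        -- number of stories in the database containing the term
    n_stories -- total number of stories in the database (N)
    """
    if df <= 0 or n_stories <= 0:
        raise ValueError("df and n_stories must be positive")
    return tf * math.log((0.5 + n_stories) / df) / math.log(n_stories + 1)


# A term occurring in few stories is weighted higher than a common one.
rare = term_weight(tf=1, df=1, n_stories=1000)
common = term_weight(tf=1, df=100, n_stories=1000)
```

The division by log(N + 1) keeps the inverse-document-frequency factor below 1, so the weight of a term never exceeds its frequency in the story.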

Allan et al. suggest including the first 1,000 terms, arranged by weight,
in a story vector. With this limit, all words of a story are taken into account,
except in a few long news texts. The similarity between the story vector and all
other story vectors in the database is determined by calculating the cosine. The authors
have empirically determined a threshold of Sim(s1,s2) = 0.21, which separates new stories
from old ones. If the highest similarity between the new story and any old one
is below 0.21, the current story is registered as a new topic; if it is above that
value, the next question is which known topic the new story belongs to.
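The decision rule just described can be sketched as follows. Story vectors are represented as dictionaries mapping terms to weights, which is an assumption of this sketch; the helper names `top_terms`, `cosine` and `is_new_topic` are hypothetical. Only the limit of 1,000 terms and the threshold of 0.21 come from the text.

```python
import math
from typing import Dict, Iterable

Vector = Dict[str, float]

MAX_TERMS = 1000            # number of terms kept per story vector
NEW_TOPIC_THRESHOLD = 0.21  # empirically determined value reported above


def top_terms(vector: Vector, limit: int = MAX_TERMS) -> Vector:
    """Keep only the `limit` highest-weighted terms of a story vector."""
    ranked = sorted(vector.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:limit])


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_new_topic(story: Vector, old_stories: Iterable[Vector],
                 threshold: float = NEW_TOPIC_THRESHOLD) -> bool:
    """True if even the most similar old story stays below the threshold."""
    best = max((cosine(story, old) for old in old_stories), default=0.0)
    return best < threshold
```

With an empty database, every incoming story is treated as the first story of a new topic, because the highest similarity defaults to 0.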
In practice, this general approach turns out not to be enough to separate new
stories from old ones with any reliability. The performance of TDT systems improves
when complementary factors are added.
When identifying a topic, the central roles are played by the names of people, places
and organizations (“named entities”) on the one hand and the remaining words (“topic
terms”) on the other. Two stories address the same topic if they share both “named
entities” and “topic terms”. Kumaran and Allan (2005, 123) justify this approach as
follows:

The intuition behind using these features is that we believe every event is characterized by a set
of people, places, organizations, etc. (named entities), and a set of terms that describe the event.
While the former can be described as the who, where, and when aspects of an event, the latter
relates to the what aspect. If two stories were on the same topic, they would share both named
entities as well as topic terms. If they were on different, but similar, topics, then either named
entities or topic terms will match but not both.

A representative example of a “mismatch” is shown in Figure F.4.2. In this case, the
system has not made use of the distinction between “named entities” and “topic
terms”. Due to the high TF value of “Turkey” and “Turkish”, respectively, and the very
high IDF value of “Ismet Sezgin”, the Vector Space Model judges the upper
story to be similar to the lower one, and hence not new. In fact, the upper text is new.
Not a single “topic term” co-occurs in both reports, leading Kumaran and Allan (2005, 124)
to conclude:

Determining that the topic terms didn’t match would have helped the system to avoid this
mistake.

It thus appears to make sense to calculate the similarity (cosine) between two stories
separately for “named entities” and “topic terms”. Only when both similarity values
exceed a threshold value will a story be classified as belonging to an “old” topic.
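A minimal sketch of this two-vector test is given below. The `Story` class, the function names and the two threshold values are assumptions made for illustration; the text does not state concrete thresholds for the separate vectors, and the toy vectors are invented rather than taken from Figure F.4.2.

```python
import math
from dataclasses import dataclass
from typing import Dict

Vector = Dict[str, float]


@dataclass
class Story:
    named_entities: Vector  # weights of people, places, organizations
    topic_terms: Vector     # weights of all remaining terms


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def same_topic(a: Story, b: Story,
               ne_threshold: float = 0.2,   # hypothetical value
               tt_threshold: float = 0.2) -> bool:  # hypothetical value
    """Old topic only if BOTH similarities exceed their thresholds."""
    ne_sim = cosine(a.named_entities, b.named_entities)
    tt_sim = cosine(a.topic_terms, b.topic_terms)
    return ne_sim > ne_threshold and tt_sim > tt_threshold


# Toy data: the names match strongly, the topic terms do not match at all.
upper = Story({"Turkey": 3.0, "Ismet Sezgin": 2.0}, {"earthquake": 1.0})
lower = Story({"Turkey": 2.0, "Ismet Sezgin": 1.0}, {"election": 1.0})
```

For the toy pair, `same_topic(upper, lower)` returns `False`, because the topic-term similarity is 0 although the named-entity similarity is high. A single combined vector would have been dominated by the shared names.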

Figure F.4.2: The Role of “Named Entities” and “Topic Terms” in Identifying a New Topic. Source:
Kumaran & Allan, 2005, 124.

A very important aspect of news is its relation to time and place. Makkonen,
Ahonen-Myka and Salmenkivi (2004, 354 et seq.) work with “temporal” and “spatial
similarity” in addition to “general similarity”. To determine the temporal relation, it
is first necessary to derive exact dates from the statements in the text. Let us suppose
a piece of news bears the date of May 27th, 2003. Phrasings in the text such as “last
week”, “last Wednesday”, “next Thursday” etc. need to be translated into exact dates,
i.e. “2003-05-19:2003-05-25”, “2003-05-21” and “2003-05-29”, respectively. Stories that
are similar to each other, all other things being equal, but do not overlap in terms of
their dates probably belong to different topics. Reports on Carnival processions in Cologne
for the years 2011 and 2012 hardly differ in regard to the terms that are used (“millions
of visitors”, “float”, “candy” etc.), but they do bear different dates.
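The translation of relative expressions into exact dates can be sketched as below. The function names `resolve` and `overlaps` are hypothetical, and the sketch only covers the three patterns from the example. It assumes that weeks run from Monday to Sunday and that “last Wednesday” means the most recent Wednesday before the story date, which is the reading that reproduces the dates given above.

```python
from datetime import date, timedelta
from typing import Tuple

Interval = Tuple[date, date]  # inclusive start and end date

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def resolve(phrase: str, story_date: date) -> Interval:
    """Translate a relative time expression into an exact date interval."""
    words = phrase.lower().split()
    if words == ["last", "week"]:
        # Monday of the previous week, then six days on to Sunday.
        monday = story_date - timedelta(days=story_date.weekday() + 7)
        return monday, monday + timedelta(days=6)
    if len(words) == 2 and words[1] in WEEKDAYS:
        target = WEEKDAYS.index(words[1])
        if words[0] == "last":
            back = (story_date.weekday() - target) % 7 or 7
            day = story_date - timedelta(days=back)
            return day, day
        if words[0] == "next":
            ahead = (target - story_date.weekday()) % 7 or 7
            day = story_date + timedelta(days=ahead)
            return day, day
    raise ValueError(f"unsupported expression: {phrase!r}")


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two inclusive date intervals share at least one day."""
    return a[0] <= b[1] and b[0] <= a[1]


story_date = date(2003, 5, 27)
last_week = resolve("last week", story_date)
last_wednesday = resolve("last Wednesday", story_date)
next_thursday = resolve("next Thursday", story_date)
```

Two stories whose resolved intervals do not overlap, such as reports on processions one year apart, can then be kept in separate topics even when their term vectors are nearly identical.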
The spatial relation is determined via a geographical concept system. The similarity
between different spatial statements in two stories can be expressed via the path
length between the identified geographical concepts in a geographical knowledge
organization system (KOS) (Makkonen, Ahonen-Myka, & Salmenkivi, 2004, 357 et seq.).
If one source talks about Puchheim,
and another, thematically related source talks about the county of Fürstenfeldbruck,
both stories will be classified as similar to each other due to their path length of 1
(Puchheim is part of Fürstenfeldbruck county). Another pair of news reports likewise
discusses similar topics, with one again talking about Puchheim and the
other about Venlo, in the Netherlands. Since the path length in this instance is, say, 8,
there is no reason to suppose that they are discussing the same event.
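Path length in a geographical hierarchy can be sketched with a simple parent table. The hierarchy below is a toy example assembled for this illustration and is not the concept system used by Makkonen et al.; its levels were chosen so that the two distances from the example come out as 1 and 8. The names `PARENT`, `ancestors` and `path_length` are hypothetical.

```python
from typing import Dict, List

# Toy "part of" hierarchy: each place points to the unit that contains it.
PARENT: Dict[str, str] = {
    "Puchheim": "Fürstenfeldbruck county",
    "Fürstenfeldbruck county": "Upper Bavaria",
    "Upper Bavaria": "Bavaria",
    "Bavaria": "Germany",
    "Germany": "Europe",
    "Venlo": "Limburg",
    "Limburg": "Netherlands",
    "Netherlands": "Europe",
}


def ancestors(place: str) -> List[str]:
    """The place itself followed by all containing units up to the root."""
    chain = [place]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def path_length(a: str, b: str) -> int:
    """Number of edges between two places via their nearest common unit."""
    depth_from_a = {unit: steps for steps, unit in enumerate(ancestors(a))}
    for steps_from_b, unit in enumerate(ancestors(b)):
        if unit in depth_from_a:
            return depth_from_a[unit] + steps_from_b
    raise ValueError(f"no common unit for {a!r} and {b!r}")


near = path_length("Puchheim", "Fürstenfeldbruck county")
far = path_length("Puchheim", "Venlo")
```

A short path suggests that two stories may refer to the same place; where to draw the line between "near" and "far" is a tuning decision that the text leaves open.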

Topic Tracking

When tracking topics, we assume the existence of a large number of known topics.
The similarity calculation now proceeds by comparing the topic vectors, i.e. each topic’s
named-entity and topic-term centroids, with the new stories admitted into the database.
In addition, temporal and spatial relations can be compared.
For a topic’s first story, the centroid is identical to the vector of this story. Only when
there is at least a second story can we meaningfully speak of a “mean vector”. The
centroid changes as long as further stories are identified for the topic. If the centroid
is used to determine the title, e.g. by designating the first ten terms of the centroid,
ranked by weight, as “title”, the title can indeed change as long as new stories are
allocated to the topic.
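How the centroid, and with it the title, shifts as stories arrive can be sketched as follows. The class `Topic` and its methods are hypothetical names for this illustration; story vectors are again assumed to be dictionaries of term weights.

```python
from typing import Dict, List

Vector = Dict[str, float]


class Topic:
    """Keeps running sums so that the centroid is the mean story vector."""

    def __init__(self) -> None:
        self._sums: Vector = {}
        self._count = 0

    def add_story(self, story: Vector) -> None:
        """Allocate a further story to the topic."""
        for term, weight in story.items():
            self._sums[term] = self._sums.get(term, 0.0) + weight
        self._count += 1

    def centroid(self) -> Vector:
        """Mean vector of all stories allocated so far."""
        if self._count == 0:
            return {}
        return {term: total / self._count for term, total in self._sums.items()}

    def title(self, n: int = 10) -> List[str]:
        """The n highest-weighted centroid terms (ties broken alphabetically)."""
        ranked = sorted(self.centroid().items(), key=lambda kv: (-kv[1], kv[0]))
        return [term for term, _ in ranked[:n]]


topic = Topic()
topic.add_story({"flood": 2.0, "river": 1.0})
title_after_first = topic.title(1)
topic.add_story({"river": 4.0, "dam": 1.0})
title_after_second = topic.title(1)
```

After the first story the centroid equals that story's vector and the one-term title is "flood"; after the second story "river" has the highest mean weight, so the title changes.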
Since events of international interest are reported in different languages,
tracking known topics across language borders becomes a task for multilingual
topic tracking. Larkey et al. (2003) use automatic translation, which, however, does not
lead to satisfactory results.
If a topic consists of several stories, these must be ranked. In a patent by Google,
Curtiss, Bharat and Schmitt (2003, 3) pursue the path of developing quality criteria
for the respective sources:

(T)he group of metrics may include the number of articles produced by the news source during
a given time period, an average length of an article from the news source, the importance of
coverage from the news source, a breaking news score, usage patterns, human opinions,
circulation statistics, the size of the staff associated with the news source, the number of news
bureaus associated with the news source, the number of original named entities the source news
produces within a cluster of articles, the breadth of coverage, international diversity, writing style,
and the like.


If a TDT system comprises all relevant sources, it appears obvious that one should
grant the first story the “honor” of being prominently named in the top spot. The rest
of the stories can be arranged according to the Google News criteria.
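This ranking rule can be sketched as below. The `Story` fields and the single `source_quality` number are assumptions of this sketch: the text does not say how the individual source metrics would be combined into one score.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class Story:
    title: str
    published: datetime
    source_quality: float  # hypothetical aggregate of the source metrics


def rank_stories(stories: List[Story]) -> List[Story]:
    """First story on top, the remaining ones by descending source quality."""
    if not stories:
        return []
    first = min(stories, key=lambda s: s.published)
    rest = [s for s in stories if s is not first]
    rest.sort(key=lambda s: s.source_quality, reverse=True)
    return [first] + rest


cluster = [
    Story("Agency follow-up", datetime(2003, 5, 27, 12, 0), 0.9),
    Story("Local first report", datetime(2003, 5, 27, 8, 0), 0.3),
    Story("Blog comment", datetime(2003, 5, 27, 15, 0), 0.1),
]
ranked = rank_stories(cluster)
```

In the toy cluster, the local report comes first although its source score is low, because it was published earliest; the other two stories follow in order of source quality.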

Conclusion

–– In topic detection and tracking, one analyzes the stream of news from the WWW, the Deep Web
(particularly databases from news agencies and newspapers) as well as broadcasting (radio as well
as television). The goals are (1) to identify new events and (2) to allocate stories to already known
topics.
–– When discovering a new topic, one analyzes the similarity between a current story and all
previously stored stories in the database. If there is no match, the story will be introduced as the first
representative of a new topic.
–– If similarities with previously stored stories arise, a second step will calculate the similarity of
the current story to known topics and allocate the story to a topic.
–– For the concrete calculation of similarities between stories, as well as between story and topic,
TF*IDF as well as the Vector Space Model suggest themselves. A topic is represented via the
centroid of all stories that discuss the respective event.
–– In news, “named entities” play a central role. It proves useful to build two vectors for every
document, one for the “named entities” and one for the “topic terms”. Only when both vectors display
similarities to other stories can it be concluded that the current story belongs to a known topic.
–– In addition, it must be considered whether to draw on concrete temporal and spatial relations as
discriminatory characteristics in stories.
–– If several stories are available for a topic, these will be ranked. The top spot should be
occupied by the story that first reported on the event in question. The remaining stories can be
ranked via quality criteria of the sources.

Bibliography
Allan, J. (2002a). Introduction to topic detection and tracking. In J. Allan (Ed.), Topic Detection and
Tracking. Event-based Information Organization (pp. 1-16). Boston, MA: Kluwer.
Allan, J. (2002b). Detection as multi-topic tracking. Information Retrieval, 5(2-3), 139-157.
Allan, J., Carbonell, J., Doddington, G., Yamron, J., & Yang, Y. (1998). Topic detection and tracking
pilot study. Final report. In Proceedings of the DARPA Broadcast News Transcription and
Understanding Workshop (pp. 194-218).
Allan, J., Harding, S., Fisher, D., Bolivar, A., Guzman-Lara, S., & Amstutz, P. (2005). Taking topic
detection from evaluation to practice. In Proceedings of the 38th Annual Hawaii International
Conference on System Sciences.
Bharat, K. (2003). Patterns on the Web. Lecture Notes in Computer Science, 2857, 1-15.
Cselle, G., Albrecht, K., & Wattenhofer, R. (2007). BuzzTrack. Topic detection and tracking in email.
In Proceedings of the 12th International Conference on Intelligent User Interfaces (pp. 190-197).
New York, NY: ACM.
Curtiss, M., Bharat, K., & Schmitt, M. (2003). Systems and methods for improving the ranking of
news articles. Patent No. US 7,577,655 B2.


Fiscus, J.G., & Doddington, G.R. (2002). Topic detection and tracking evaluation overview. In J. Allan
(Ed.), Topic Detection and Tracking. Event-based Information Organization (pp. 17-31). Boston,
MA: Kluwer.
Kumaran, G., & Allan, J. (2005). Using names and topics for new event detection. In Proceedings
of Human Language Technology Conference / Conference on Empirical Methods in Natural
Language Processing, Vancouver (pp. 121-128).
Larkey, L.S., Feng, F., Connell, M., & Lavrenko, V. (2003). Language-specific models in multilingual
topic tracking. In Proceedings of the 27th Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval (pp. 402-409). New York, NY: ACM.
Makkonen, J., Ahonen-Myka, H., & Salmenkivi, M. (2004). Simple semantics in topic detection and
tracking. Information Retrieval, 7(3-4), 347-368.
Zhao, Y., & Xu, J. (2011). A novel method of topic detection and tracking for BBS. In Proceedings of
the 3rd International Conference on Communication Software and Networks, ICCSN 2011 (pp.
453-457). Washington, DC: IEEE.
Zhou, E., Zhong, N., & Li, Y. (2011). Hot topic detection in professional blogs. In AMT’11. Proceedings
of the 7th International Conference on Active Media Technology (pp. 141-152). Berlin, Heidelberg:
Springer.
