

Topics over Time: A Non-Markov Continuous-Time Model of Topical Trends

Xuerui Wang, Andrew McCallum


Department of Computer Science
University of Massachusetts
Amherst, MA 01003
xuerui@cs.umass.edu, mccallum@cs.umass.edu

ABSTRACT

This paper presents an LDA-style topic model that captures not only the low-dimensional structure of data, but also how the structure changes over time. Unlike other recent work that relies on Markov assumptions or discretization of time, here each topic is associated with a continuous distribution over timestamps, and for each generated document, the mixture distribution over topics is influenced by both word co-occurrences and the document's timestamp. Thus, the meaning of a particular topic can be relied upon as constant, but the topics' occurrence and correlations change significantly over time. We present results on nine months of personal email, 17 years of NIPS research papers and over 200 years of presidential state-of-the-union addresses, showing improved topics, better timestamp prediction, and interpretable trends.

Categories and Subject Descriptors

I.2.6 [Artificial Intelligence]: Learning; H.2.8 [Database Management]: Database Applications—data mining

General Terms

Algorithms, Experimentation

Keywords

Graphical Models, Temporal Analysis, Topic Modeling

1. INTRODUCTION

Research in statistical models of co-occurrence has led to the development of a variety of useful topic models—mechanisms for discovering low-dimensional, multi-faceted summaries of documents or other discrete data. These include models of words alone, such as Latent Dirichlet Allocation (LDA) [2, 5], of words and research paper citations [4], of word sequences with Markov dependencies [6, 17], of words and their authors [12], of words in a social network of senders and recipients [10], and of words and relations (such as voting patterns) [18]. In each case, graphical model structures are carefully designed to capture the relevant structure and co-occurrence dependencies in the data.

Many of the large data sets to which these topic models are applied do not have static co-occurrence patterns; they are instead dynamic. The data are often collected over time, and generally patterns present in the early part of the collection are not in effect later. Topics rise and fall in prominence; they split apart; they merge to form new topics; words change their correlations. For example, across 17 years of the Neural Information Processing Systems (NIPS) conference, activity in "analog circuit design" has fallen off somewhat, while research in "support vector machines" has recently risen dramatically. The topic "dynamic systems" used to co-occur with "neural networks," but now co-occurs with "graphical models."

However, none of the above mentioned topic models are aware of these dependencies on document timestamps. Not modeling time can confound co-occurrence patterns and result in unclear, sub-optimal topic discovery. For example, in topic analysis of U.S. Presidential State-of-the-Union addresses, LDA confounds the Mexican-American War (1846-1848) with some aspects of World War I (1914-1918), because LDA is unaware of the 70-year separation between the two events. Some previous work has performed some post-hoc analysis—discovering topics without the use of timestamps and then projecting their occurrence counts into discretized time [5]—but this misses the opportunity for time to improve topic discovery.

This paper presents Topics over Time (TOT), a topic model that explicitly models time jointly with word co-occurrence patterns. Significantly, and unlike some recent work with similar goals, our model does not discretize time, and does not make Markov assumptions over state transitions in time. Rather, TOT parameterizes a continuous distribution over time associated with each topic, and topics are responsible for generating both observed timestamps as well as words. Parameter estimation is thus driven to discover topics that simultaneously capture word co-occurrences and locality of those patterns in time.

When a strong word co-occurrence pattern appears for a brief moment in time then disappears, TOT will create a topic with a narrow time distribution. (Given enough evidence, arbitrarily small spans can be represented, unlike schemes based on discretizing time.) When a pattern of word co-occurrence remains consistent across a long time span, TOT will create a topic with a broad time distribution.


In current experiments, we use a Beta distribution over a (normalized) time span covering all the data, and thus we can also flexibly represent various skewed shapes of rising and falling topic prominence.

The model's generative storyline can be understood in two different ways. We fit the model parameters according to a generative model in which a per-document multinomial distribution over topics is sampled from a Dirichlet, then for each word occurrence we sample a topic; next a per-topic multinomial generates the word, and a per-topic Beta distribution generates the document's timestamp. Here the timestamp (which in practice is always observed and constant across the document) is associated with each word in the document. We can also imagine an alternative, corresponding generative model in which the timestamp is generated once per document, conditioned directly on the per-document mixture over topics. In both cases, the likelihood contribution from the words and the contribution from the timestamps may need to be weighted by some factor, as in the balancing of acoustic models and language models in speech recognition. The latter generative storyline more directly corresponds to common data sets (with one timestamp per document); the former is easier to fit, and can also allow some flexibility in which different parts of the document may be discussing different time periods.

Some previous studies have also shown that topic discovery can be influenced by information in addition to word co-occurrences. For example, the Group-Topic model [18] showed that the joint modeling of word co-occurrence and voting relations resulted in more salient, relevant topics. The Mixed-Membership model [4] also showed interesting results for research papers and their citations.

Note that, in contrast to other work that models trajectories of individual topics over time, TOT topics and their meaning are modeled as constant over time. TOT captures changes in the occurrence (and co-occurrence conditioned on time) of the topics themselves, not changes in the word distribution of each topic. The classical view of splitting and merging of topics is thus reflected as dynamic changes in the co-occurrence of constant topics. While choosing to model individual topics as mutable could be useful, it can also be dangerous. Imagine a subset of documents containing strong co-occurrence patterns across time: first between birds and aerodynamics, then aerodynamics and heat, then heat and quantum mechanics—this could lead to a single topic that follows this trajectory, and lead the user to inappropriately conclude that birds and quantum mechanics are time-shifted versions of the same topic.

Alternatively, consider a large subject like medicine, which has changed drastically over time. In TOT we choose to model these shifts as changes in topic co-occurrence—a decrease in occurrence of topics about blood-letting and bile, and an increase in topics about MRI and retrovirus, while the topics about blood, limbs, and patients continue to co-occur throughout. We do not claim that this point of view is better, but the difference makes TOT much simpler to understand and implement.

In comparison to more complex alternatives, the relative simplicity of TOT is a great advantage—not only for the relative ease of understanding and implementing it, but also because this approach can in the future be naturally injected into other more richly structured topic models, such as the Author-Recipient-Topic model to capture changes in social network roles over time [10], and the Group-Topic model to capture changes in group formation over time [18].

We present experimental results with three real-world data sets. On more than two centuries of U.S. Presidential State-of-the-Union addresses, we show that TOT discovers topics with both time-localization and word-clarity improvements over LDA. On the 17-year history of the NIPS conference, we show clearly interpretable topical trends, as well as a two-fold increase in the ability to predict time given a document. On nine months of the second author's email archive, we show another example of clearly interpretable, time-localized topics, such as springtime faculty recruiting. On all three data sets, TOT provides more distinct topics, as measured by KL divergence.

2. TOPICS OVER TIME

Before introducing the Topics over Time (TOT) model, let us review the basic Latent Dirichlet Allocation model. Our notation is summarized in Table 1, and the graphical model representations of both LDA and our TOT models are shown in Figure 1.

Table 1: Notation used in this paper

  SYMBOL   DESCRIPTION
  T        number of topics
  D        number of documents
  V        number of unique words
  Nd       number of word tokens in document d
  θd       the multinomial distribution of topics specific to document d
  φz       the multinomial distribution of words specific to topic z
  ψz       the Beta distribution of time specific to topic z
  zdi      the topic associated with the ith token in document d
  wdi      the ith token in document d
  tdi      the timestamp associated with the ith token in document d (in Figure 1(c))

Latent Dirichlet Allocation (LDA) is a Bayesian network that generates a document using a mixture of topics [2]. In its generative process, for each document d, a multinomial distribution θd over topics is randomly sampled from a Dirichlet with parameter α, and then to generate each word, a topic zdi is chosen from this topic distribution, and a word, wdi, is generated by randomly sampling from a topic-specific multinomial distribution φzdi. The robustness of the model is greatly enhanced by integrating out uncertainty about the per-document topic distribution θ and the per-topic word distribution φ.

In TOT, topic discovery is influenced not only by word co-occurrences, but also by temporal information. Rather than modeling a sequence of state changes with a Markov assumption on the dynamics, TOT models (normalized) absolute timestamp values. This allows TOT to see long-range dependencies in time, to predict absolute time values given an unstamped document, and to predict topic distributions given a timestamp. It also helps avoid a Markov model's risk of inappropriately dividing a topic in two when there is a brief gap in its appearance.

Time is intrinsically continuous. Discretization of time always begs the question of selecting the slice size, and the size is invariably too small for some regions and too large for others. TOT avoids discretization by associating with each topic a continuous distribution over time. Many parameterized distributions are possible. Our earlier experiments were based on Gaussian distributions. All the results in this paper employ the Beta distribution (which can take versatile shapes), for which the time range of the data used for parameter estimation is normalized to a range from 0 to 1. Another possible choice of bounded distributions is the Kumaraswamy distribution [8]. Double-bounded distributions are appropriate because the training data are bounded in time. If it is necessary to predict in a small window into the future, the bounded region can be extended, yet still estimated based on the data available up to now.
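As a concrete illustration of this normalization step, the following minimal Python sketch (not from the paper; the function and variable names are our own) rescales raw document timestamps onto the unit interval expected by the per-topic Beta distributions:

```python
# A minimal sketch of the time normalization assumed throughout: map raw document
# timestamps (e.g., dates converted to floats) onto the unit interval used by the
# per-topic Beta distributions. A small epsilon keeps values strictly inside (0, 1).
import numpy as np

def normalize_timestamps(raw_times, eps=1e-6):
    t = np.asarray(raw_times, dtype=float)
    t = (t - t.min()) / (t.max() - t.min())   # rescale the observed range to [0, 1]
    return np.clip(t, eps, 1 - eps)           # avoid the zero-density endpoints of the Beta
```

Raw times could be years or seconds since an epoch; only their relative positions within the observed range matter after rescaling.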


Figure 1: Three topic models: (a) LDA, (b) the TOT model (alternate view), and (c) the TOT model as used for Gibbs sampling.

Topics over Time is a generative model of timestamps and the words in the timestamped documents. There are two ways of describing its generative process. The first, which corresponds to the process used in Gibbs sampling for parameter estimation, is as follows:

1. Draw T multinomials φz from a Dirichlet prior β, one for each topic z;

2. For each document d, draw a multinomial θd from a Dirichlet prior α; then for each word wdi in document d:

   (a) Draw a topic zdi from multinomial θd;
   (b) Draw a word wdi from multinomial φzdi;
   (c) Draw a timestamp tdi from Beta(ψzdi).
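The generative process above can be summarized in a short, illustrative Python sketch. This is not the authors' code: the function name, argument names, and the use of symmetric α and β are assumptions made for the example, and ψ is assumed to be a (T, 2) array of per-topic Beta parameters over normalized time.

```python
# Illustrative sketch of the TOT generative process (not the authors' implementation).
# Assumed inputs: T topics, vocabulary size V, document length N_d, symmetric
# alpha/beta, and per-topic Beta parameters psi[z] = (psi_z1, psi_z2) over (0, 1).
import numpy as np

def generate_corpus(D, T, V, N_d, alpha, beta, psi, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Draw T word multinomials phi_z from a Dirichlet prior beta.
    phi = rng.dirichlet(np.full(V, beta), size=T)
    docs = []
    for d in range(D):
        # 2. Draw a per-document topic multinomial theta_d from a Dirichlet prior alpha.
        theta = rng.dirichlet(np.full(T, alpha))
        words, times = [], []
        for _ in range(N_d):
            z = rng.choice(T, p=theta)          # (a) draw a topic z_di
            w = rng.choice(V, p=phi[z])         # (b) draw a word w_di
            t = rng.beta(psi[z][0], psi[z][1])  # (c) draw a timestamp t_di
            words.append(w)
            times.append(t)
        docs.append((words, times))
    return docs
```

When fitting real data, the observed document timestamp is simply copied to every token (as discussed next), so this sampler describes the estimation-time storyline rather than a realistic document generator.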
The graphical model is shown in Figure 1(c). Although, in the above generative process, a timestamp is generated for each word token, all the timestamps of the words in a document are observed as the same as the timestamp of the document. One might also be interested in capturing burstiness, and a solution such as the Dirichlet compound multinomial (DCM) model can be easily integrated into the TOT model [9]. In our experiments there is a fixed number of topics, T, although a non-parametric Bayes version of TOT that automatically integrates over the number of topics would certainly be possible.

As shown in the above process, the posterior distribution of topics depends on information from two modalities—both text and time. The TOT parameterization is:

  θd | α ∼ Dirichlet(α)
  φz | β ∼ Dirichlet(β)
  zdi | θd ∼ Multinomial(θd)
  wdi | φzdi ∼ Multinomial(φzdi)
  tdi | ψzdi ∼ Beta(ψzdi)

Inference cannot be done exactly in this model. We employ Gibbs sampling to perform approximate inference. Note that we adopt conjugate priors (Dirichlet) for the multinomial distributions, and thus we can easily integrate out θ and φ, analytically capturing the uncertainty associated with them. In this way we facilitate the sampling—that is, we need not sample θ and φ at all. Because we use the continuous Beta distribution rather than discretizing time, sparsity is not a big concern in fitting the temporal part of the model. For simplicity and speed we estimate these Beta distributions ψz by the method of moments, once per iteration of Gibbs sampling. One could estimate the values of the hyper-parameters of the TOT model, α and β, from data using a Gibbs EM algorithm [1]. For many applications, topic models are very sensitive to hyper-parameters, and it is extremely important to get the right values for them. In the particular applications discussed in this paper, we find that the sensitivity to hyper-parameters is not very strong. Thus, again for simplicity, we use fixed symmetric Dirichlet distributions (α = 50/T and β = 0.1) in all our experiments.

In the Gibbs sampling procedure, we need to calculate the conditional distribution P(zdi | w, t, z−di, α, β, Ψ), where z−di represents the topic assignments for all tokens except wdi. We begin with the joint probability of a data set, and using the chain rule, we can obtain the conditional probability conveniently as

  P(z_{di} \mid \mathbf{w}, \mathbf{t}, \mathbf{z}_{-di}, \alpha, \beta, \Psi) \propto (m_{d z_{di}} + \alpha_{z_{di}} - 1) \times \frac{n_{z_{di} w_{di}} + \beta_{w_{di}} - 1}{\sum_{v=1}^{V} (n_{z_{di} v} + \beta_v) - 1} \times \frac{(1 - t_{di})^{\psi_{z_{di} 1} - 1} \, t_{di}^{\psi_{z_{di} 2} - 1}}{B(\psi_{z_{di} 1}, \psi_{z_{di} 2})},

where n_{zv} is the number of tokens of word v assigned to topic z, and m_{dz} is the number of tokens in document d assigned to topic z.


Detailed derivation of Gibbs sampling for TOT is provided in Appendix A. An overview of the Gibbs sampling procedure we use is shown in Algorithm 1.

Algorithm 1 Inference on TOT
 1: initialize topic assignment randomly for all tokens
 2: for iter = 1 to Niter do
 3:   for d = 1 to D do
 4:     for w = 1 to Nd do
 5:       draw zdw from P(zdw | w, t, z−dw, α, β, Ψ)
 6:       update nzdww and mdzdw
 7:     end for
 8:   end for
 9:   for z = 1 to T do
10:     update ψz
11:   end for
12: end for
13: compute the posterior estimates of θ and φ
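A hedged Python sketch of one sweep of Algorithm 1 is given below, using the conditional distribution from Section 2 (and Appendix A). It assumes pre-built count matrices and a (T, 2) array of per-topic Beta parameters; all names are illustrative, and the usual remove-then-add bookkeeping of collapsed Gibbs sampling plays the role of the explicit "−1" terms in the equation.

```python
# Sketch of one Gibbs sweep for TOT (Algorithm 1), assuming count matrices
# n_zw (topic-word) and m_dz (document-topic), current assignments z[d][i],
# word tokens w[d][i], normalized document timestamps t[d] strictly in (0, 1),
# scalar symmetric priors alpha and beta, and Beta parameters psi of shape (T, 2).
import numpy as np
from scipy.special import betaln

def gibbs_sweep(w, t, z, n_zw, m_dz, alpha, beta, psi, rng):
    T, V = n_zw.shape
    n_z = n_zw.sum(axis=1)                       # tokens currently assigned to each topic
    for d in range(len(w)):
        for i in range(len(w[d])):
            wi, zi = w[d][i], z[d][i]
            # remove the current assignment from the counts
            n_zw[zi, wi] -= 1; n_z[zi] -= 1; m_dz[d, zi] -= 1
            # P(z | rest) ∝ (m_dz + alpha) * (n_zw + beta)/(n_z + V*beta) * Beta(t_d; psi_z),
            # with the Beta density written as in the text.
            # (Optionally, the Beta factor can be raised to a tunable power to
            #  balance the time and text modalities, as noted in the appendix.)
            log_p = (np.log(m_dz[d] + alpha)
                     + np.log(n_zw[:, wi] + beta) - np.log(n_z + V * beta)
                     + (psi[:, 0] - 1) * np.log(1 - t[d])
                     + (psi[:, 1] - 1) * np.log(t[d])
                     - betaln(psi[:, 0], psi[:, 1]))
            p = np.exp(log_p - log_p.max())
            zi = rng.choice(T, p=p / p.sum())
            # add the new assignment back
            z[d][i] = zi
            n_zw[zi, wi] += 1; n_z[zi] += 1; m_dz[d, zi] += 1
    return z
```

After each sweep, the per-topic Beta parameters ψ would be re-estimated by the method of moments described in Appendix A.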
Although a document is modeled as a mixture of topics, there is typically only one timestamp associated with a document. The above generative process describes data in which there is a timestamp associated with each word. When fitting our model from typical data, each training document's timestamp is copied to all the words in the document. However, after fitting, if actually run as a generative model, this process would generate different timestamps for the words within the same document. In this sense, thus, it is formally a deficient generative model, but it still remains powerful in modeling large dynamic text collections.

An alternative generative process description of TOT (better suited to generating an unseen document) is one in which a single timestamp is associated with each document, generated by rejection or importance sampling from a mixture of per-topic Beta distributions over time, with mixture weights given by the per-document θd over topics. As before, this distribution over time is ultimately parameterized by the set of timestamp-generating Beta distributions, one per topic. The graphical model for this alternative generative process is shown in Figure 1(b).

Using this model we can predict a timestamp given the words in a document. To facilitate the comparison with LDA, we can discretize the timestamps (only for this purpose). Given a document, we predict its timestamp by choosing the discretized timestamp that maximizes the posterior, which is calculated by multiplying the timestamp probabilities of all word tokens under their corresponding topic-wise Beta distributions over time, that is, \arg\max_t \prod_{i=1}^{N_d} p(t \mid \psi_{z_i}).
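This prediction rule can be sketched as follows. The Beta density is written out explicitly to match the form used in the text, and the candidate grid (for example, decade midpoints rescaled to the unit interval) is an assumption of the example, as are all the names:

```python
# Sketch of timestamp prediction for one document: choose the candidate
# (discretized) timestamp t that maximizes prod_i p(t | psi_{z_i}).
# z_tokens are the per-token topic assignments; psi has shape (T, 2).
import numpy as np
from scipy.special import betaln

def log_beta_pdf(t, a, b):
    # density as written in the text: (1 - t)^(a-1) * t^(b-1) / B(a, b)
    return (a - 1) * np.log(1 - t) + (b - 1) * np.log(t) - betaln(a, b)

def predict_timestamp(z_tokens, psi, candidates):
    scores = [sum(log_beta_pdf(t, psi[z, 0], psi[z, 1]) for z in z_tokens)
              for t in candidates]
    return candidates[int(np.argmax(scores))]
```

Mapping the winning candidate back to its decade gives the discrete prediction compared against LDA in Section 5.4.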
It is also interesting to consider obtaining a distribution over topics, conditioned on a timestamp. This allows us to see the topic occurrence patterns over time. By Bayes rule, E(θ_{z_i} | t) = P(z_i | t) ∝ p(t | z_i) P(z_i), where P(z_i) can be estimated from data or simply assumed to be uniform. Examples of expected topic distributions θd conditioned on timestamps are shown in Section 5.
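A minimal sketch of this Bayes-rule inversion, assuming a uniform prior P(z_i) unless one is supplied (names are illustrative), is:

```python
# Sketch of E(theta_z | t) = P(z | t) ∝ p(t | z) P(z), with p(t | z) the per-topic
# Beta density written as in the text and psi of shape (T, 2).
import numpy as np
from scipy.special import betaln

def topic_distribution_given_time(t, psi, prior=None):
    T = psi.shape[0]
    log_like = ((psi[:, 0] - 1) * np.log(1 - t) + (psi[:, 1] - 1) * np.log(t)
                - betaln(psi[:, 0], psi[:, 1]))
    log_prior = np.log(np.full(T, 1.0 / T) if prior is None else np.asarray(prior))
    log_post = log_prior + log_like
    post = np.exp(log_post - log_post.max())
    return post / post.sum()      # expected topic mixture at timestamp t
```

Evaluating this function on a grid of timestamps yields the kind of topic-mixture profiles plotted in Section 5.5.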
Regarding parameter estimation, the two processes in Figure 1(b) and (c) can become equivalent when we introduce a balancing hyper-parameter between the likelihoods from the two modalities. In the second process, not surprisingly, the generation of one timestamp would be overwhelmed by the plurality of words generated under the bag-of-words assumption. To balance the influence from two different modalities, a tunable hyper-parameter is needed which is responsible for the relative weight of the time modality versus the text modality. Thus we use such a weighting parameter to rescale the likelihoods from different modalities, as is also common in speech recognition when the acoustic and language models are combined, and in the Group-Topic model [18] in which relational Blockstructures and topic models are integrated. Here a natural setting for the weighting parameter is the inverse of the number of words Nd in the document, which is equivalent to generating Nd independent and identically distributed (i.i.d.) samples from the document-specific mixture of Beta distributions. Thus, it is probabilistically equivalent to drawing Nd samples from the individual Beta distributions according to the mixture weights θd, which exactly corresponds to the generative process in Figure 1(c). In practice, it is also important to have such a hyper-parameter when the likelihoods from discrete and continuous modalities are combined. We find that this hyper-parameter is quite sensitive, and set it by trial and error.

3. RELATED WORK

Several previous studies have examined topics and their changes across time. Rather than jointly modeling word co-occurrence and time, many of these methods use post-hoc or pre-discretized analysis.

The first style of non-joint modeling involves fitting a time-unaware topic model, and then ordering the documents in time, slicing them into discrete subsets, and examining the topic distributions in each time-slice. One example is Griffiths and Steyvers' study of PNAS proceedings [5], in which they identified hot and cold topics based on examination of topic mixtures estimated from an LDA model.

The second style of non-joint modeling pre-divides the data into discrete time slices, and fits a separate topic model in each slice. Examples of this type include the experiments with the Group-Topic model [18], in which several decades' worth of U.N. voting records (and their accompanying text) were divided into 15-year segments; each segment was fit with the GT model, and trends were compared. One difficulty with this approach is that aligning the topics from each time slice can be difficult, although starting Gibbs sampling using parameters from the previous time slice can help, as shown in [14]. Similarly, the TimeMines system [15] for TDT tasks (single topic in each document) constructs overview timelines of a set of news stories. A χ2 test is performed to identify days on which the number of occurrences of named entities or noun phrases produces a statistic above a given threshold; consecutive days under this criterion are stitched together to form an interval to be added into the timeline.

Time series analysis has a long history in statistics, much of which is based on dynamic models, with a Markov assumption that the state at time t+1 or t+Δt is independent of all other history given the state at time t. Hidden Markov models and Kalman filters are two such examples. For instance, recent work in social network analysis [13] proposes a dynamic model that accounts for friendships drifting over time. Blei and Lafferty recently presented dynamic topic models (DTMs) in which the alignment among topics across time steps is captured by a Kalman filter [3].

Continuous Time Bayesian Networks (CTBNs) [11] are an example of using continuous time without discretization. A CTBN consists of two components: a Bayesian network and a continuous transition model, which avoids various granularity problems due to discretization.


Unlike TOT, however, CTBNs use a Markov assumption.

Another Markov model that aims to find word patterns in time is Kleinberg's "burst of activity model" [7]. This approach uses a probabilistic infinite-state automaton with a particular state structure in which high activity states are reachable only by passing through lower activity states. Rather than leveraging timestamps, it operates on a stream of data, using data ordering as a proxy for time. Its infinite-state automaton has a continuous transition scheme similar to CTBNs. However, it operates only on one word at a time, whereas TOT finds time-localized patterns in word co-occurrences.

TOT uses time quite differently than the above models. First, TOT does not employ a Markov assumption over time, but instead treats time as an observed continuous variable. Second, many other models take the view that the "meaning" (or word associations) of a topic changes over time; instead, in TOT we can rely on topics themselves as constant, while topic co-occurrence patterns change over time.

Although not modeling time, several other topic models have associated the generation of additional modalities with topics. For example, the aforementioned GT model conditions on topics for both word generation and relational links. As in TOT, GT results also show that jointly modeling an additional modality improves the relevance of the discovered topics. Another flexible, related model is the Mixed Membership model [4], which treats the citations of papers as additional "words"; thus the formed topics are influenced by both words and citations.

4. DATA SETS

We present experiments with the TOT model on three real-world data sets: 9 months of email sent and received by the second author, 17 years of NIPS conference papers, and 21 decades of U.S. Presidential State-of-the-Union Addresses. In all cases, for simplicity, we fix the number of topics at T = 50. (It would be straightforward to automatically infer the number of topics using algorithms such as the Hierarchical Dirichlet Process [16].)

4.1 State-of-the-Union Addresses

The State of the Union is an annual message presented by the President to Congress, describing the state of the country and his plan for the future. Our data set (http://www.gutenberg.org/dirs/etext04/suall11.txt) consists of the transcripts of 208 addresses during 1790-2002 (from George Washington to George W. Bush). We remove stopwords and numbers, and all text is downcased. Because the topics discussed in each address are so diverse, and in order to improve the robustness of the discovered topics, we increase the number of documents in this data set by splitting each transcript into 3-paragraph "documents". The resulting data set has 6,427 (3-paragraph) documents, 21,576 unique words, and 674,794 word tokens in total. Each document's timestamp is determined by the date on which the address was given.

4.2 A Researcher's Email

This data set consists of the second author's email archive of the nine months from January to September 2004, including all emails sent and received. In order to model only the new text entered by the author of each message, it is necessary to remove "quoted original messages" in replies. We eliminate this extraneous text by a simple heuristic: all text in a message below a "forwarded message" line or timestamp is removed. This heuristic does incorrectly delete text that is interspersed with quoted email text. Words are formed from sequences of alphabetic characters; stopwords are removed, and all text is downcased. The data set contains 13,300 email messages, 22,379 unique words, and 453,743 word tokens in total. Each document's timestamp is determined by the day and time the message was sent or received.

4.3 NIPS Papers

The NIPS data set (provided to us by Gal Chechik) consists of the full text of the proceedings of the 1987 to 2003 Neural Information Processing Systems (NIPS) conferences, 17 years in all. In addition to downcasing and removing stopwords and numbers, we also remove words appearing fewer than five times in the corpus—many of them produced by OCR errors. Two-letter words (primarily coming from equations) are removed, except for "ML", "AI", "KL", "BP", "EM" and "IR." The data set contains 2,326 research papers, 24,353 unique words, and 3,303,020 word tokens in total. Each document's timestamp is determined by the year of the proceedings.

5. EXPERIMENTAL RESULTS

In this section, we present the topics discovered by the TOT model and compare them with topics from LDA. We also demonstrate the ability of the TOT model to predict the timestamps of documents, more than doubling accuracy in comparison with LDA. We furthermore find topics discovered by TOT to be more distinct from each other than LDA topics (as measured by KL divergence). Finally we show how TOT can be used to analyze topic co-occurrence conditioned on a timestamp. Topics presented in this section are extracted from a single sample at the 1000th iteration of the Gibbs sampler. For the address data set, 1000 iterations of the Gibbs sampler took 3 hours on a dual-processor Opteron (Linux), 2 hours for the email data set, and 10 hours for the NIPS data set.

5.1 Topics Discovered for Addresses

The State-of-the-Union addresses contain the full range of United States history. Analysis of this data set shows strong temporal patterns. Some of them are broad historical issues, such as a clear "American Indian" topic throughout the 1800s and peaking around 1860, or the rise of "Civil Rights" across the second half of the 1900s. Other sharply localized trends are somewhat influenced by the individual president's communication style, such as Theodore Roosevelt's sharply increased use of the words "great", "men", "public", "country", and "work". Unfortunately, space limitations prevent us from showing all 50 topics.

Four TOT topics, their most likely words, their Beta distributions over time, their actual histograms over time, as well as comparisons against their most similar LDA topics (by KL divergence), are shown in Figure 2. Immediately we see that the TOT topics are more neatly and narrowly focused in time (time analysis for LDA is done post-hoc). An immediate and obvious effect is that this helps the reader understand more precisely when and over what length of time the topical trend was occurring.


Figure 2: Four topics discovered by TOT (above) and LDA (below) for the Address data set. The titles are our own interpretation of the topics. Histograms show how the topics are distributed over time; the fitted Beta PDFs are also shown. (For LDA, Beta distributions are fit in a post-hoc fashion.) The top words with their probabilities in each topic are shown below the histograms. The TOT topics are better localized in time, and TOT discovers more event-specific topical words.

TOT topics:
  Mexican War: states 0.02032, mexico 0.01832, government 0.01670, united 0.01521, war 0.01059, congress 0.00951, country 0.00906, texas 0.00852, made 0.00727, great 0.00611
  Panama Canal: government 0.02928, united 0.02132, states 0.02067, islands 0.01167, canal 0.01014, american 0.00872, cuba 0.00834, made 0.00747, general 0.00731, war 0.00660
  Cold War: world 0.01875, states 0.01717, security 0.01710, soviet 0.01664, united 0.01491, nuclear 0.01454, peace 0.01408, nations 0.01069, international 0.01024, america 0.00987
  Modern Tech: energy 0.03902, national 0.01534, development 0.01448, space 0.01436, science 0.01227, technology 0.01227, oil 0.01178, make 0.00994, effort 0.00969, administration 0.00957

LDA topics:
  Mexican War: mexico 0.06697, government 0.02254, mexican 0.02141, texas 0.02109, territory 0.01739, part 0.01610, republic 0.01344, military 0.01111, state 0.00974, make 0.00942
  Panama Canal: government 0.05618, american 0.02696, central 0.02518, canal 0.02283, republic 0.02198, america 0.02170, pacific 0.01832, panama 0.01776, nicaragua 0.01381, isthmus 0.01137
  Cold War: defense 0.05556, military 0.03819, forces 0.03308, security 0.03020, strength 0.02406, nuclear 0.01858, weapons 0.01654, arms 0.01254, maintain 0.01161, strong 0.01106
  Modern Tech: program 0.02674, energy 0.02477, development 0.02287, administration 0.02119, economic 0.01710, areas 0.01585, programs 0.01578, major 0.01534, nation 0.01242, assistance 0.01052

For example, in the leftmost topic, TOT clearly shows that the Mexican-American War (1846-1848) occurred in the few years just before 1850. In LDA, on the other hand, the topic spreads throughout American history; it has its peak around 1850, but seems to be getting confused by a secondary peak around the time of World War I (when "war" words were used again, and relations to Mexico played a small part). It is not so clear what event is being captured by LDA's topic.

The second topic, "Panama Canal," is another vivid example of how TOT can successfully localize a topic in time, and also of how jointly modeling words and time can help sharpen and improve the topical word distribution. The Panama Canal (constructed during 1904-1914) is correctly localized in time, and the topic accurately describes some of the issues motivating canal construction: the sinking of the U.S.S. Maine in a Cuban harbor, and the long time it took U.S. warships to return to the Caribbean via Cape Horn. The LDA counterpart is not only widely spread through time, but also confounds topics such as modern trade relations with Central America and efforts to build the Panama Railroad in the 1850s.

The third topic shows the rise and fall of the Cold War, with a peak in the Reagan years, when Presidential rhetoric on the subject rose dramatically. Both TOT and LDA topics mention "nuclear," but only TOT correctly identifies "soviet". LDA confounds what is mostly a Cold War topic (although it misses "soviet") with words and events from across American history, including small but noticeable bumps for World War I and the Civil War. TOT correctly has its own separate topic for World War I.


Figure 3: Four topics discovered by TOT (above) and LDA (below) for the Email data set, showing improved results with TOT. For example, the Faculty Recruiting topic is correctly identified in the spring in the TOT model, but LDA confuses it with other interactions among faculty.

TOT topics:
  Faculty Recruiting: cs 0.03572, april 0.02724, faculty 0.02341, david 0.02012, lunch 0.01766, schedule 0.01656, candidate 0.01560, talk 0.01355, bruce 0.01273, visit 0.01232
  ART Paper: xuerui 0.02113, data 0.01814, word 0.01601, research 0.01408, topic 0.01366, model 0.01238, andres 0.01238, sample 0.01152, enron 0.01067, dataset 0.00960
  MALLET: code 0.05668, files 0.04212, mallet 0.04073, java 0.03085, file 0.02947, al 0.02479, directory 0.02080, version 0.01664, pdf 0.01421, bug 0.01352
  CVS Operations: check 0.04473, page 0.04070, version 0.03828, cvs 0.03587, add 0.03083, update 0.02539, latest 0.02519, updated 0.02317, checked 0.02277, change 0.02156

LDA topics:
  Faculty Recruiting: cs 0.05137, david 0.04592, bruce 0.02734, lunch 0.02710, manmatha 0.02391, andrew 0.02332, faculty 0.01764, april 0.01740, shlomo 0.01657, al 0.01621
  ART Paper: email 0.09991, ron 0.04536, messages 0.04095, data 0.03408, calo 0.03236, message 0.03053, enron 0.03028, project 0.02415, send 0.02023, part 0.01680
  MALLET: code 0.05947, mallet 0.03922, version 0.03772, file 0.03702, files 0.02534, java 0.02522, cvs 0.02511, directory 0.01978, add 0.01932, checked 0.01481
  CVS Operations: paper 0.06106, page 0.05504, web 0.04257, title 0.03526, author 0.02763, papers 0.02741, email 0.02204, pages 0.02193, nips 0.01967, link 0.01860

Lastly, the rightmost topics in Figure 2, "Modern Tech," show a case in which the TOT topic is not necessarily better—just interestingly different from the LDA topic. The TOT topic, with mentions of energy, space, science, and technology, is about modern technology and energy. Its emphasis on modern times is also very distinct in its time distribution. The closest LDA topic also includes energy, but focuses on economic development and assistance to other nations. Its time distribution shows an extra bump around the decade of the Marshall Plan (1947-1951), and a lower level during George W. Bush's presidency—both inconsistent with the time distribution learned by the TOT topic.

5.2 Topics Discovered for Email

In Figure 3 we demonstrate TOT on the Email data set. Email is typically full of seasonal phenomena (such as paper deadlines, summer semester, etc.). One such seasonal example is the "Faculty Recruiting" topic, which (unlike LDA) TOT clearly identifies and localizes in the spring. The LDA counterpart is widely spread over the whole time period, and consequently, it cannot separate faculty recruiting from other types of faculty interactions and collaboration. The temporal information captured by TOT plays a very important role in forming meaningful time-sensitive topics.

The topic "ART Paper" reflects a surge of effort in collaboratively writing a paper on the Author-Recipient-Topic model. Although the co-occurrence pattern of the words in this topic is strong and distinct, LDA failed to discover a corresponding topic—likely because it was a relatively short-lived phenomenon. The closest LDA topic shows the general research activities, work on the DARPA CALO project, and various collaborations with SRI to prepare the Enron email data set for public release. Not only does modeling time help TOT discover the "ART Paper" task, but an alternative model that relied on coarse time discretization may miss such topics that have small time spans.

The "MALLET" topic shows that, after putting in an intense effort in writing and discussing Java programming for the MALLET toolkit, the second author had less and less time to write code for the toolkit.


In the corresponding LDA topic, MALLET development is confounded with CVS operations—which were later also used for managing collaborative writing of research papers. TOT appropriately and clearly discovers a separate topic for "CVS Operations," seen in the rightmost column. The closest LDA topic is the previously discussed one that merges MALLET and CVS. The second closest LDA topic (bottom right) discusses research paper writing, but not CVS. All these examples show that TOT's use of time can help it pull apart distinct events, tasks and topics that may be confusingly merged by LDA.

5.3 Topics Discovered for NIPS

Research paper proceedings also present interesting trends for analysis. Successfully modeling trends in the research literature can help us understand how research fields evolve, and measure the impact of differently shaped profiles in time.

Figure 4: Two topics discovered by TOT (above) and LDA (below) for the NIPS data set. For example, on the left, two major approaches to dynamic system modeling are confounded by LDA, but TOT more clearly identifies waning interest in Recurrent Neural Networks, with a separate topic (not shown) for rising interest in Markov models.

TOT topics:
  Recurrent NN: state 0.05963, recurrent 0.03765, sequence 0.03616, sequences 0.02462, time 0.02402, states 0.02057, transition 0.01300, finite 0.01242, length 0.01154, strings 0.01013
  Game Theory: game 0.02850, strategy 0.02378, play 0.01490, games 0.01473, player 0.01451, agents 0.01346, expert 0.01281, strategies 0.01123, opponent 0.01088, nash 0.00848

LDA topics:
  Recurrent NN: state 0.05957, sequence 0.03939, sequences 0.02625, time 0.02503, states 0.02338, recurrent 0.01451, markov 0.01398, transition 0.01369, length 0.01164, hidden 0.01072
  Game Theory: game 0.01784, strategy 0.01357, play 0.01131, games 0.00940, algorithm 0.00915, expert 0.00898, time 0.00837, player 0.00834, return 0.00750, strategies 0.00640

Figure 4 shows two topics discovered from the NIPS proceedings. "Recurrent Neural Networks" is clearly identified by TOT, and correctly shown to rise and fall in prominence within NIPS during the 1990s. LDA, unaware of the fact that Markov models superseded Recurrent Neural Networks for dynamic systems in the later NIPS years, and unaware of the time-profiles of both, ends up mixing the two methods together. LDA has a second topic elsewhere that also covers Markov models.

On the right, we see "Games" and game theory. This is an example in which TOT and LDA yield nearly identical results, although, if the terms beyond simply the first ten are examined, one sees that LDA is emphasizing board games, such as chess and backgammon, while TOT used its ramping-up time distribution to more clearly identify game theory as part of this topic (e.g., the word "Nash" occurs in position 12 for TOT, but not in the top 50 for LDA).

We have been discussing the salience and specificity of TOT's topics. Distances between topics can also be measured numerically. Table 2 shows the average distance of word distributions between all pairs of topics, as measured by KL divergence.

Table 2: Average KL divergence between topics for TOT vs. LDA on three data sets. TOT finds more distinct topics.

        Address   Email    NIPS
  TOT   0.6266    0.6416   0.5728
  LDA   0.5965    0.5943   0.5421

In all three data sets, the TOT topics are more distinct from each other. Partially because the Beta distribution is rarely multi-modal, the TOT model strives to separate events that occur during different time spans, and in real-world data, time differences are often correlated with word distribution differences that would have been more difficult to tease apart otherwise. The MALLET-CVS-paper distinction in the email data set is one example. (Events with truly multi-modal time distributions would be modeled with alternatives to the Beta distribution.)
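The distinctness measure of Table 2 can be sketched as follows. The exact smoothing and averaging conventions are not specified in the text, so this is only one plausible reading, with illustrative names:

```python
# Sketch of the topic-distinctness measure in Table 2: average KL divergence
# between the word distributions of all ordered pairs of distinct topics.
# phi is a (T, V) array of per-topic word multinomials.
import numpy as np

def avg_pairwise_kl(phi, eps=1e-12):
    phi = phi + eps                               # guard against log(0)
    phi = phi / phi.sum(axis=1, keepdims=True)
    T = phi.shape[0]
    total, pairs = 0.0, 0
    for i in range(T):
        for j in range(T):
            if i != j:
                total += np.sum(phi[i] * (np.log(phi[i]) - np.log(phi[j])))
                pairs += 1
    return total / pairs
```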
5.4 Time Prediction

One interesting feature of our approach (not shared by state-transition-based Markov models of topical shifts) is the capability of predicting the timestamp given the words in a document. This task also provides another opportunity to quantitatively compare TOT against LDA.

On the State-of-the-Union Address data set, we measure the ability to predict the decade given the text of the address, as measured by accuracy, L1 error, and average L1 distance to the correct decade (the number of decades difference between the predicted and correct decade).


Figure 5: The distribution over topics given time in the NIPS data set. Note the rich collection of shapes that emerge from the Bayesian inversion of the collection of per-topic Beta distributions over time.

Figure 6: Eight topics co-occurring strongly with the "classification" topic in the NIPS data set. Other co-occurring topics are labeled as a combined background topic. Classification with neural networks declined, while co-occurrence with SVMs, boosting and NLP are on the rise.
As shown in Table 3, TOT achieves double the accuracy of LDA, and provides an L1 relative error reduction of 20%.

Table 3: Predicting the decade, in the Address data set. L1 Error is the difference between predicted and true decade. In the Accuracy column, we see that TOT predicts exactly the correct decade nearly twice as often as LDA.

        L1 Error   E(L1)   Accuracy
  TOT   1.98       2.02    0.19
  LDA   2.51       2.58    0.10
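A minimal sketch of the two simpler evaluation quantities (exact-decade accuracy and L1 error in decades) is shown below; how the expected-L1 column E(L1) is computed is not spelled out in the text, so it is omitted, and all names are illustrative:

```python
# Sketch of the evaluation behind Table 3: exact-decade accuracy and mean L1
# error (in decades) between predicted and true decades.
import numpy as np

def decade_metrics(pred_decades, true_decades):
    pred = np.asarray(pred_decades)
    true = np.asarray(true_decades)
    accuracy = np.mean(pred == true)
    l1_error = np.mean(np.abs(pred - true))
    return accuracy, l1_error
```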
5.5 Topic Distribution Profile over Time

It is also interesting to consider the TOT model's distribution over topics as a function of time. The time distribution of each individual topic is described by a Beta distribution (having flexible mean, variance and skewness), but even more rich and complex profiles emerge from the interactions among these Beta distributions. TOT's approach to modeling topic distributions conditioned on a timestamp—based on multiple time-generating Betas, inverted with Bayes rule—has the dual advantages of a relatively simple, easy-to-fit parameterization, while also offering topic distributions with a flexibility that would be more difficult to achieve with a direct, non-inverted parameterization (i.e., one generating topic distributions directly conditioned on time, without Bayes-rule inversion).

The expected topic mixture distributions for the NIPS data set are shown in Figure 5. The topics are consistently ordered in each year, and the height of a topic's region represents the relative weight of the corresponding topic given a timestamp, calculated using the procedure described in Section 2. We can clearly see that topic mixtures change dramatically over time, and have interesting shapes. NIPS begins with more emphasis on neural networks, analog circuits and cells, but now emphasizes more SVMs, optimization, probability and inference.

5.6 Topic Co-occurrences over Time

We can also examine topic co-occurrences over time, which, as discussed in Section 1, are dynamic for many large text collections. In the following, we say two topics z1 and z2 (strongly) co-occur in a document d if both θz1 and θz2 are greater than some threshold h (we set h = 2/T); we can then count the number of documents in which certain topics (strongly) co-occur, and map out how co-occurrence patterns change over time.
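The co-occurrence analysis can be sketched as follows, assuming per-document topic mixtures θ and one timestamp per document; target_topic would be, e.g., the index of the "classification" topic, the binning into time slices is only for plotting, and all names are illustrative:

```python
# Sketch of the co-occurrence analysis in Section 5.6: two topics "strongly
# co-occur" in document d if both mixture weights exceed h = 2/T.
# theta is a (D, T) array of per-document topic mixtures; timestamps holds one
# time value per document; time_bins are bin edges for the plot's time axis.
import numpy as np
from collections import defaultdict

def cooccurrence_over_time(theta, timestamps, target_topic, time_bins):
    D, T = theta.shape
    h = 2.0 / T
    counts = defaultdict(lambda: np.zeros(T))
    for d in range(D):
        if theta[d, target_topic] <= h:
            continue
        b = int(np.digitize(timestamps[d], time_bins))   # time slice of document d
        counts[b] += (theta[d] > h)   # topics strongly present alongside the target topic
    return counts   # per time bin: documents in which each topic co-occurs with target_topic
```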
Figure 6 shows the prominence profile over time of those topics that co-occur strongly with the NIPS topic "classification." We can see that at the beginning of NIPS, this problem was solved primarily with neural networks. It co-occurred with "digit recognition" in the middle 90's. Later, probabilistic mixture models, boosting and SVM methods became popular.

6. CONCLUSIONS

This paper has presented Topics over Time (TOT), a model that jointly models both word co-occurrences and localization in continuous time. Results on three real-world data sets show the discovery of more salient topics that are associated with events, and clearly localized in time. We also show improved ability to predict time given a document. Reversing the inference by Bayes rule yields a flexible parameterization over topics conditioned on time, as determined by the interactions among the many per-topic Beta distributions.

Unlike some related work with similar motivations, TOT does not require discretization of time or Markov assumptions on state dynamics. The relative simplicity of our approach provides advantages for injecting these ideas into other topic models. For example, in ongoing work we are finding patterns in topics and group membership over time, with a Group-Topic model over time. Many other extensions are possible.

7. ACKNOWLEDGMENTS

This work was supported in part by the Center for Intelligent Information Retrieval and in part by the Defense Advanced Research Projects Agency (DARPA), through the Department of the Interior, NBC, Acquisition Services Division, under contract number NBCHD030010, and under contract number HR0011-06-C-0023. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect those of the sponsor. We also thank Charles Sutton and David Mimno for helpful discussions.


8. REFERENCES

[1] C. Andrieu, N. de Freitas, A. Doucet, and M. Jordan. An introduction to MCMC for machine learning. Machine Learning, 50:5–43, 2003.
[2] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[3] D. M. Blei and J. D. Lafferty. Dynamic topic models. In Proceedings of the 23rd International Conference on Machine Learning, 2006.
[4] E. Erosheva, S. Fienberg, and J. Lafferty. Mixed membership models of scientific publications. Proceedings of the National Academy of Sciences, 101(Suppl. 1), 2004.
[5] T. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101(Suppl. 1):5228–5235, 2004.
[6] T. Griffiths, M. Steyvers, D. Blei, and J. Tenenbaum. Integrating topics and syntax. In Advances in Neural Information Processing Systems (NIPS) 17, 2004.
[7] J. Kleinberg. Bursty and hierarchical structure in streams. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002.
[8] P. Kumaraswamy. A generalized probability density function for double-bounded random processes. Journal of Hydrology, 46:79–88, 1980.
[9] R. E. Madsen, D. Kauchak, and C. Elkan. Modeling word burstiness using the Dirichlet distribution. In Proceedings of the 22nd International Conference on Machine Learning, 2005.
[10] A. McCallum, A. Corrada-Emanuel, and X. Wang. Topic and role discovery in social networks. In Proceedings of the 19th International Joint Conference on Artificial Intelligence, 2005.
[11] U. Nodelman, C. Shelton, and D. Koller. Continuous time Bayesian networks. In Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence, pages 378–387, 2002.
[12] M. Rosen-Zvi, T. Griffiths, M. Steyvers, and P. Smyth. The author-topic model for authors and documents. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, 2004.
[13] P. Sarkar and A. Moore. Dynamic social network analysis using latent space models. In The 19th Annual Conference on Neural Information Processing Systems, 2005.
[14] X. Song, C.-Y. Lin, B. L. Tseng, and M.-T. Sun. Modeling and predicting personal information dissemination behavior. In Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2005.
[15] R. Swan and D. Jensen. TimeMines: Constructing timelines with statistical models of word usage. In The 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining Workshop on Text Mining, pages 73–80, 2000.
[16] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Technical report, UC Berkeley Statistics TR-653, 2004.
[17] X. Wang and A. McCallum. A note on topical n-grams. Technical report, UMass UM-CS-2005-071, 2005.
[18] X. Wang, N. Mohanty, and A. McCallum. Group and topic discovery from relations and text. In The 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining Workshop on Link Discovery: Issues, Approaches and Applications, pages 28–35, 2005.

APPENDIX

A. GIBBS SAMPLING DERIVATION FOR TOT

We begin with the joint distribution P(w, t, z | α, β, Ψ). We can take advantage of conjugate priors to simplify the integrals. All symbols are defined in Section 2.

  P(\mathbf{w}, \mathbf{t}, \mathbf{z} \mid \alpha, \beta, \Psi)
  = P(\mathbf{w} \mid \mathbf{z}, \beta) \, p(\mathbf{t} \mid \Psi, \mathbf{z}) \, P(\mathbf{z} \mid \alpha)
  = \int P(\mathbf{w} \mid \Phi, \mathbf{z}) \, p(\Phi \mid \beta) \, d\Phi \; p(\mathbf{t} \mid \Psi, \mathbf{z}) \int P(\mathbf{z} \mid \Theta) \, p(\Theta \mid \alpha) \, d\Theta
  = \int \prod_{d=1}^{D} \prod_{i=1}^{N_d} P(w_{di} \mid \phi_{z_{di}}) \prod_{z=1}^{T} p(\phi_z \mid \beta) \, d\Phi \; \prod_{d=1}^{D} \prod_{i=1}^{N_d} p(t_{di} \mid \psi_{z_{di}}) \times \int \prod_{d=1}^{D} \Big( \prod_{i=1}^{N_d} P(z_{di} \mid \theta_d) \Big) p(\theta_d \mid \alpha) \, d\Theta
  = \int \prod_{z=1}^{T} \prod_{v=1}^{V} \phi_{zv}^{n_{zv}} \prod_{z=1}^{T} \frac{\Gamma(\sum_{v=1}^{V} \beta_v)}{\prod_{v=1}^{V} \Gamma(\beta_v)} \prod_{v=1}^{V} \phi_{zv}^{\beta_v - 1} \, d\Phi \times \int \prod_{d=1}^{D} \prod_{z=1}^{T} \theta_{dz}^{m_{dz}} \prod_{d=1}^{D} \frac{\Gamma(\sum_{z=1}^{T} \alpha_z)}{\prod_{z=1}^{T} \Gamma(\alpha_z)} \prod_{z=1}^{T} \theta_{dz}^{\alpha_z - 1} \, d\Theta \times \prod_{d=1}^{D} \prod_{i=1}^{N_d} p(t_{di} \mid \psi_{z_{di}})
  = \left( \frac{\Gamma(\sum_{v=1}^{V} \beta_v)}{\prod_{v=1}^{V} \Gamma(\beta_v)} \right)^{T} \left( \frac{\Gamma(\sum_{z=1}^{T} \alpha_z)}{\prod_{z=1}^{T} \Gamma(\alpha_z)} \right)^{D} \prod_{d=1}^{D} \prod_{i=1}^{N_d} p(t_{di} \mid \psi_{z_{di}}) \times \prod_{z=1}^{T} \frac{\prod_{v=1}^{V} \Gamma(n_{zv} + \beta_v)}{\Gamma(\sum_{v=1}^{V} (n_{zv} + \beta_v))} \times \prod_{d=1}^{D} \frac{\prod_{z=1}^{T} \Gamma(m_{dz} + \alpha_z)}{\Gamma(\sum_{z=1}^{T} (m_{dz} + \alpha_z))}

Using the chain rule, we can obtain the conditional probability conveniently:

  P(z_{di} \mid \mathbf{w}, \mathbf{t}, \mathbf{z}_{-di}, \alpha, \beta, \Psi)
  = \frac{P(z_{di}, w_{di}, t_{di} \mid \mathbf{w}_{-di}, \mathbf{t}_{-di}, \mathbf{z}_{-di}, \alpha, \beta, \Psi)}{P(w_{di}, t_{di} \mid \mathbf{w}_{-di}, \mathbf{t}_{-di}, \mathbf{z}_{-di}, \alpha, \beta, \Psi)}
  \propto \frac{P(\mathbf{w}, \mathbf{t}, \mathbf{z} \mid \alpha, \beta, \Psi)}{P(\mathbf{w}_{-di}, \mathbf{t}_{-di}, \mathbf{z}_{-di} \mid \alpha, \beta, \Psi)}
  \propto \frac{n_{z_{di} w_{di}} + \beta_{w_{di}} - 1}{\sum_{v=1}^{V} (n_{z_{di} v} + \beta_v) - 1} \, (m_{d z_{di}} + \alpha_{z_{di}} - 1) \, p(t_{di} \mid \psi_{z_{di}})
  \propto (m_{d z_{di}} + \alpha_{z_{di}} - 1) \, \frac{n_{z_{di} w_{di}} + \beta_{w_{di}} - 1}{\sum_{v=1}^{V} (n_{z_{di} v} + \beta_v) - 1} \times \frac{(1 - t_{di})^{\psi_{z_{di} 1} - 1} \, t_{di}^{\psi_{z_{di} 2} - 1}}{B(\psi_{z_{di} 1}, \psi_{z_{di} 2})}

In practice, the balancing hyper-parameter often appears as an exponential power of the last term above. Since timestamps are drawn from continuous Beta distributions, sparsity is not a big problem for parameter estimation of Ψ. For simplicity, we update Ψ after each Gibbs sample by the method of moments, detailed as follows:

  \hat{\psi}_{z1} = \bar{t}_z \left( \frac{\bar{t}_z (1 - \bar{t}_z)}{s_z^2} - 1 \right)
  \hat{\psi}_{z2} = (1 - \bar{t}_z) \left( \frac{\bar{t}_z (1 - \bar{t}_z)}{s_z^2} - 1 \right)

where \bar{t}_z and s_z^2 indicate the sample mean and the biased sample variance of the timestamps belonging to topic z, respectively.
