A two-stage statistical word segmentation system for Chinese
Guohong Fu
Dept of Linguistics, The University of Hong Kong
Pokfulam Road, Hong Kong
ghfu@hkucc.hku.hk

K.K. Luke
Dept of Linguistics, The University of Hong Kong
Pokfulam Road, Hong Kong
kkluke@hkusua.hku.hk
                     Abstract

    In this paper we present a two-stage
    statistical word segmentation system for
    Chinese based on word bigram and word-
    formation models. The system was evaluated
    on the Peking University corpora at the First
    International Chinese Word Segmentation
    Bakeoff. We also report and discuss the
    results of this evaluation.

1   Introduction

Word segmentation, which aims to recognize the implicit word boundaries in Chinese text, is of great importance for Chinese language processing. Over the past decades, considerable progress has been made in Chinese word segmentation (Nie et al., 1995; Yao, 1997; Fu and Wang, 1999; Wang et al., 2000; Zhang et al., 2002). However, two difficult problems remain in developing a practical segmentation system for large open applications: ambiguity resolution and unknown word (so-called OOV word) identification.

In this paper, we present a two-stage statistical word segmentation system for Chinese. In the first stage, we employ a word bigram model to segment known words (viz. the words included in the system dictionary) in the input. In the second stage, we apply a hybrid algorithm to unknown word identification, incorporating word contextual information, word-formation patterns and a word juncture model.

The rest of this paper is organized as follows: Section 2 presents a word bigram solution for known word segmentation. Section 3 describes a hybrid approach for unknown word identification. In Section 4, we report the results of our system at the SIGHAN evaluation, and in the final section we give our conclusions on this work.

2   The first stage: Segmentation of known words

In a sense, known word segmentation is a process of disambiguation. In our system, we use a word bigram language model and the Viterbi algorithm (Viterbi, 1967) to resolve word boundary ambiguities in known word segmentation.

For a particular input Chinese character string C = c_1 c_2 ... c_n, there is usually more than one possible segmentation W = w_1 w_2 ... w_m according to a given system dictionary. Word bigram segmentation aims to find the most appropriate segmentation W^ = w_1 w_2 ... w_m, namely the one that maximizes the probability of the word sequence, i.e.

    W^ = argmax_W Pr(W | C) ≈ argmax_W ∏_{i=1..m} Pr(w_i | w_{i-1})   (1)

where Pr(w_i | w_{i-1}) is the probability that the word w_i occurs given the previous word w_{i-1}, which can easily be estimated from a segmented corpus by maximum likelihood estimation (MLE), i.e.

    Pr(w_i | w_{i-1}) ≈ Count(w_{i-1} w_i) / Count(w_{i-1})   (2)

To avoid the data-sparseness problem of MLE, we apply the linear interpolation technique (Jelinek and Mercer, 1980) to smooth the estimated word bigram probabilities.
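The first-stage search can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the toy dictionary, the probability tables, the interpolation weight `lam` and the `1e-8` unigram floor are all assumptions introduced for the example.

```python
import math

def viterbi_segment(text, dictionary, bigram_p, unigram_p, lam=0.9, max_len=4):
    """Dictionary-based segmentation maximizing the bigram score of Eq. (1),
    with each MLE bigram estimate (Eq. 2) smoothed by linear interpolation:
    lam * Pr(w_i | w_{i-1}) + (1 - lam) * Pr(w_i)."""
    n = len(text)
    # chart[j][w] = (best log score of a path ending at j with last word w,
    #                start position of w, previous word on that path)
    chart = [dict() for _ in range(n + 1)]
    chart[0]["<s>"] = (0.0, 0, None)
    for i in range(n):
        for j in range(i + 1, min(i + max_len, n) + 1):
            w = text[i:j]
            # out-of-dictionary single characters are kept as a fallback
            if w not in dictionary and len(w) > 1:
                continue
            for prev, (score, _, _) in chart[i].items():
                p = lam * bigram_p.get((prev, w), 0.0) + (1 - lam) * unigram_p.get(w, 1e-8)
                s = score + math.log(p)
                if w not in chart[j] or s > chart[j][w][0]:
                    chart[j][w] = (s, i, prev)
    # trace back the best path from the end of the string
    word, (_, start, prev) = max(chart[n].items(), key=lambda kv: kv[1][0])
    words = [word]
    while prev != "<s>":
        _, prev_start, prev_prev = chart[start][prev]
        words.append(prev)
        start, prev = prev_start, prev_prev
    return list(reversed(words))
```

For instance, with toy tables in which Pr(ab | &lt;s&gt;) and Pr(c | ab) are high, `viterbi_segment("abc", ...)` returns `["ab", "c"]` rather than the character-by-character split.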
3   The second stage: Unknown word identification

The second stage deals with the unknown words that remain unresolved after the first stage. This section describes a hybrid algorithm for unknown word identification, which incorporates a word juncture model, word-formation patterns and contextual information. To avoid the complicated normalization of probabilities of different dimensions, the simple superposition principle is used to merge these models.

3.1   Word juncture model

The word juncture model scores an unknown word by assigning word juncture types. Most unknown words appear as a string of known words after the first-stage segmentation; unknown word identification can therefore be viewed as a process of re-assigning the correct juncture type to each known word pair in the input. Given a known word string W = w_1 w_2 ... w_n, each word pair w_i w_{i+1} (1 ≤ i ≤ n-1) forms a word juncture. In general, there are two types of junctures in unknown word identification, namely word boundary (denoted by t_B) and non-word boundary (denoted by t_N). Let t(w_i w_{i+1}) denote the type of the juncture w_i w_{i+1}, and Pr(t(w_i w_{i+1})) denote the relevant conditional probability; then

    Pr(t(w_i w_{i+1})) = Count(t(w_i w_{i+1})) / Count(w_i w_{i+1})   (3)

Thus, the word juncture probability P_WJM(w_U) of a particular unknown word w_U = w_i w_{i+1} ... w_j (1 ≤ i ≤ j ≤ n) can be calculated as

    P_WJM(w_U) = Pr(t_B(w_{i-1} w_i)) × Pr(t_B(w_j w_{j+1})) × ∏_{l=i..j-1} Pr(t_N(w_l w_{l+1}))   (4)

In a sense, the word juncture model mirrors the affinity of known word pairs in forming an unknown word: for a juncture (w_i, w_{i+1}), the larger the probability Pr(t_N(w_i w_{i+1})), the more likely the two words are to merge into one new word.

3.2   Word-formation patterns

The word-formation pattern model scores an unknown word according to the probability of how each internal known word contributes to its formation. In general, a known word w may take one of four patterns while forming a word: (1) w itself is a word; (2) w is the beginning of an unknown word; (3) w is in the middle of an unknown word; (4) w appears at the end of an unknown word. For convenience, we use S, B, M and E to denote these four patterns respectively. Let pttn(w) denote a particular pattern of w in an unknown word and Pr(pttn(w)) denote the relevant probability; then

    Pr(pttn(w)) = Count(pttn(w)) / Count(w)   (5)

Obviously, Σ_pttn Pr(pttn(w)) = 1, and 1 - Pr(S(w)) is the word-formation power of the known word w. Let P_pttn(w_U) be the overall word-formation pattern probability of an unknown word w_U = w_1 w_2 ... w_l; then

    P_pttn(w_U) = ∏_{w_i ∈ w_U} Pr(pttn(w_i))   (6)

In principle, a known word can take any pattern while forming an unknown word, but the probabilities are far from even across different known words and patterns. For example, the word 性 (xing4, nature) is most likely to act as a suffix: according to our investigation of the training corpus, the character 性 appears at the end of a multi-character word in more than 93% of cases.

3.3   Hybrid algorithm for unknown word identification

Our algorithm for unknown word identification consists of three major components: (1) An unknown word extractor first extracts from the output of the first stage a fragment of known words w_1 w_2 ... w_n that may contain unknown words, based on the relevant word-formation powers and word juncture probabilities, together with its left and right contextual words w_L and w_R. (2) A candidate word constructor then generates a lattice of all possible new segmentations {W_U | W_U = x_1 x_2 ... x_m} of the extracted fragment that may involve unknown words. (3) A decoder finally incorporates the word juncture model P_WJM(W_U), the word-formation patterns P_pttn(W_U) and the word bigram probability P_bigram(W_U) to score these candidates, and then applies the Viterbi algorithm again to find the new segmentation W_U^ = x_1 x_2 ... x_m with the maximum score:

    W_U^ = argmax_{W_U} {P_pttn(W_U) + P_WJM(W_U) + P_bigram(W_U)}
         = argmax_{W_U} Σ_{i=1..m} (P_pttn(x_i) + P_WJM(x_i) + Pr(x_i | x_{i-1}))   (7)

where x_0 = w_L and x_{m+1} = w_R. If x_i is an unknown word, then Pr(x_i | x_{i-1}) = Σ_{w_U} Count(x_{i-1} w_U) / Count(x_{i-1}), where w_U ranges over all unknown words in the training corpus.
4   Experiments

Our system participated in both the closed and open tests on the Peking University corpora at the First International Chinese Word Segmentation Bakeoff. This section reports and discusses the evaluation results.

4.1   Measures

In the evaluation program of the First International Chinese Word Segmentation Bakeoff, six measures are employed to score the performance of a word segmentation system: test recall (R), test precision (P), the balanced F-measure (F), the out-of-vocabulary (OOV) rate for the test corpus, the recall on OOV words (R_OOV), and the recall on in-vocabulary words (R_IV). OOV is defined as the set of words in the test corpus not occurring in the training corpus in the closed test, and as the set of words in the test corpus not occurring in the lexicon used in the open test.

4.2   Experimental lexicons and corpora

As shown in Table 1, we used only the training data from the Peking University corpus to train our system in both the open and closed tests. As for the dictionary, we compiled a dictionary for the closed test from the training corpus, which contained 55,226 words, and used a dictionary containing about 65,269 words in the open test.

    Items    # words in lexicon    # train. words    # test. words
    Closed   55,226                1,121,017         17,194
    Open     65,269                1,121,017         17,194

          Table 1: Experimental lexicons and corpora

4.3   Experimental results and discussion

    Items    F      R      P      OOV    R_OOV   R_IV
    Closed   0.939  0.936  0.942  0.069  0.675   0.955
    Open     0.937  0.933  0.941  0.094  0.762   0.950

          Table 2: Test results on the PK corpus

Segmentation speed: There are in all about 28,458 characters in the test corpus. It takes our system about 3.21 and 3.07 seconds in all to perform full segmentation (including known word segmentation and unknown word identification) on the closed and open test corpus respectively, running on an ACER notebook (TM632XC-P4M). This indicates that our system can process about 531,925-556,182 characters per minute.

Results and discussion: The results for the closed and open tests are presented in Table 2. We can draw some conclusions from these results.

Firstly, the overall performance of our system is very stable across the closed and open tests. As shown in Table 2, the out-of-vocabulary rate is 6.9% in the closed test and 9.4% in the open test, yet the overall F-measure drops by only 0.2 percent in the open test compared with the closed test.

Secondly, our approach can handle most unknown words in the input. As can be seen from Table 2, the recall on OOV words is 67.5% in the closed test and 76.2% in the open test. Wang et al. (2000) and Yao (1997) have proposed a character juncture model and word-formation patterns for Chinese unknown word identification. However, being character-based methods, their approaches only work for unknown words made up purely of monosyllabic characters. To address this problem, we introduce both a word juncture model and word-based word-formation patterns into our system. As a result, our system can deal with unknown words composed of different known words, including monosyllabic characters and multi-character words.

Although our system is effective for most ambiguities and unknown words in the input, it has its inherent deficiencies. Firstly, to avoid data sparseness, we do not differentiate known words and unknown words while estimating the word juncture models and word-formation patterns from the
training corpus. This simplification may introduce some noise into these models for identifying unknown words. Our further investigation shows that the precision on OOV words is still very low: 67.1% in the closed test and 72.5% in the open test. As a result, our system may yield a number of mistaken unknown words. Secondly, we regard known word segmentation and unknown word identification as two independent stages. This strategy is simple and easily applicable, but it does not work when the input contains a mixture of ambiguities and unknown words. For example, the test corpus contains the sentence 中行长葛支行注重健身, in which the string 中行长葛 is a fragment mixing ambiguity with unknown words. The correct segmentation is 中行/长葛/, where 中行 (Zhonghang, the Bank of China) is an abbreviated organization name and 长葛 (Changge) is a place name. However, this fragment is segmented as 中/行长/葛/ in the first stage of our system, and the unknown word identification stage has no mechanism to split the word 行长 (hangzhang, president), which finally results in a wrong segmentation.

5   Conclusions

This paper presents a two-stage statistical word segmentation system for Chinese. In the first stage, a word bigram model and the Viterbi algorithm are applied to perform known word segmentation on input plain text; a hybrid approach is then employed in the second stage, incorporating word bigram probabilities, a word juncture model and word-based word-formation patterns to detect OOV words. The experiments on the Peking University corpora have shown that the present system, based on fairly simple word bigram and word-formation models, can achieve an F-score of 93.7% or above. In future work, we hope to improve our strategies for estimating the word juncture model and word-formation patterns, and to develop an integrated technique that performs known word segmentation and unknown word identification in a single pass.

Acknowledgments

We would like to thank all colleagues of the First International Chinese Word Segmentation Bakeoff for their evaluation of the results, and the Institute of Computational Linguistics, Peking University for providing the experimental corpora.

References

Fu, Guohong and Xiaolong Wang. 1999. Unsupervised Chinese word segmentation and unknown word identification. In: Proceedings of NLPRS'99, Beijing, China, 32-37.

Jelinek, Frederick, and Robert L. Mercer. 1980. Interpolated estimation of Markov source parameters from sparse data. In: Proceedings of the Workshop on Pattern Recognition in Practice, Amsterdam, 381-397.

Nie, Jian-Yuan, M.-L. Hannan and W.-Y. Jin. 1995. Unknown word detection and segmentation of Chinese using statistical and heuristic knowledge. Communications of COLIPS, 5(1&2): 47-57.

Viterbi, A. J. 1967. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, IT-13(2): 260-269.

Wang, Xiaolong, Guohong Fu, Danial S. Yeung, James N.K. Liu, and Robert Luk. 2000. Models and algorithms of Chinese word segmentation. In: Proceedings of the International Conference on Artificial Intelligence (IC-AI'2000), Las Vegas, Nevada, USA, 1279-1284.

Yao, Yuan. 1997. Statistics Based Approaches Towards Chinese Language Processing. Ph.D. thesis, National University of Singapore.

Zhang, Hua-Ping, Qun Liu, Hao Zhang, and Xue-Qi Cheng. 2002. Automatic recognition of Chinese unknown words based on roles tagging. In: Proceedings of the First SIGHAN Workshop on Chinese Language Processing, Taiwan, 71-77.