0% found this document useful (0 votes)

26 views32 pages

A Formal Computational Analysis of Indic Scripts: University of Illinois at Urbana-Champaign

A paper.

Uploaded by

rwsproat

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

26 views32 pages

A Formal Computational Analysis of Indic Scripts: University of Illinois at Urbana-Champaign

A paper.

Uploaded by

rwsproat

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 32

A FORMAL COMPUTATIONAL ANALYSIS OF INDIC

SCRIPTS
Richard Sproat
University of Illinois at Urbana-Champaign

Introduction
The Brahmi-derived Indic scripts occupy a special place in the study
of writing systems. They are alphasyllabic scripts (Bright, 1996a)
(though Daniels (1996) prefers the term abugida), meaning that they
are basically segmental in that almost all segments are represented in
the script, yet the fundamental organizing principle of the script is the
(orthographic) syllable: crudely, an orthographic syllable — aks.ara —
in an Indic script consists of a consonantal core consisting of one or
more consonant symbols, with the vowels variously arranged around
the core. The one exception to the principle that segments are repre-
sented is the inherent vowel, usually a schwa or /a/-like phoneme: when
the consonantal core is marked with no vowel, it is interpreted as hav-
ing the inherent vowel (though phonological rules such as Hindi schwa-
deletion (Ohala, 1983), can complicate this characterization).
Brahmi-derived scripts are by no means the only scripts that are ba-
sically segmental but have the syllable as an organizing principle: no-
table other scripts are Ethiopic (Haile, 1996) and Korean Hankul (King,
1996); there have been claims that the latter is derived from Indic
scripts, though King notes (page 219) that there have been “no less than
ten different ‘origin theories’.”1 But Brahmi scripts are indubitably the
most successful, spawning not only several scripts in the Indian sub-
continent, but also spreading into Southeast Asia and spawning scripts
such as Thai and Lao (Diller, 1996), and even arguably influencing the
development in modern times of the Pahawh Hmong messianic script
(Smalley, Vang, and Yang, 1990; Ratliff, 1996).
The Indic scripts have some properties that make them interesting
from a number of perspectives. For instance, while the individual
scripts differ in details, they share a lot of properties; for example,
many scripts have encodings for the full set of Sanskrit-derived con-
sonants, and in all the scripts, vowel-initial aks.ara are written with a
full vowel form,2 whereas in consonant-initial aks.ara are written with
diacritic vowels around the consonant. This is convenient from the
point of view of text encoding since it means that to a large extent to
transliterate between different Indic scripts, one merely needs to shift
the codespace of the text; this is what is done in Unicode (Aliprand,
2003, and www.unicode.org). The parallels between scripts can be
quite strong indeed: as I shall argue in a later section, the encoding of
diacritic vowels in Devanagari and Oriya are identical at the right level
of abstraction. Indic scripts also make a much richer use of the two-
dimensional layout possibilities afforded by the written medium than
do Western alphabetic scripts, since vowels may be written above, be-
low or even before the consonants they logically follow; and in conso-
nant sequences in many scripts—Tamil is a notable exception (Steever,
1996)— all but one of the consonants takes on a reduced form, which
may be written either in linear sequence with, or above or below the
“primary” consonant (usually either the first or last in the logical se-
quence).3 In this paper I will express these two-dimensional layout pos-
sibilities in terms of a planar regular calculus.
Finally Indic scripts have some interesting psycholinguistic proper-
ties. With the exception of the inherent vowel, vowels are always rep-
resented explicitly in the Indic scripts, so in that sense the scripts are
segmental. But vowels are clearly subordinate to consonants (except
in vowel-initial aks.ara) since they are written as diacritics on the con-
sonant. As a result, various studies have shown that readers of Indian
languages written in Indic scripts have less overall segmental aware-
ness than readers of a language like English. Other studies have al-
so demonstrated psycholinguistic consequences of certain properties of
Indic scripts; for example, /i/ in Devanagari, which is written before
the consonant it logically follows, causes processing problems in that
words that contain /i/ take longer to identify and longer to name (i.e.,
pronounce aloud) in controlled experiments. I will discuss these issues
and cite the relevant studies in a later section.
The goal in this paper is threefold. First, I will introduce a formalism
for describing planar arrangements of symbols and apply that formalism
to selected phenomena in Indic scripts. Second, I will present in some
detail a computational model of Devanagari that was developed as part
of a Hindi text-to-speech system. Finally, I will explore some of the
results of psycholinguistic research on Indic scripts, and relate them to
the formal model previously developed: since the formal model as stat-
ed treats Indic scripts as segmental at an abstract level, this potentially
has psycholinguistic implications, which we will explore.
Formal Preliminaries
The relation between writing and the linguistic information it encodes
can naturally be thought of as a mapping in the algebraic sense. Consid-
er that we have a relation , that takes a string of linguistic units (say,

a sequence of phonemes), and maps it to a graphemic representation,

; we may refer to as the image of under . One question
that immediately arises about is what kind of relation it is. In (Sproat,
2000) I argue that it is a regular relation (Kaplan and Kay, 1994) in
the formal sense of regular languages. Regular relations are those rela-
tions that can be implemented using finite-state transducers (FSTs), for
which efficient algorithms are known, (Mohri, 1997; Mohri, Pereira,
and Riley, 1998). This conclusion has implications for implementations
of systems, such as text-to-speech (TTS) systems that compute the map-
ping between text and linguistic structure (Sproat, 1997a).
Regular relations, like regular languages, are definable in terms of an
alphabet, and a set of operations on that alphabet; see (Hopcroft and Ull-
man, 1979) for these notions for regular languages. Specifically, regular
relations are those relations that can be constructed out of a finite alpha-
bet of pairs of symbols using only the operations of (con)catenation
( ), union ( ) and closure ( ). Union and closure are to be understood in
their normal set-theoretic sense, and catenation means string concatena-
tion such that for two strings and , their catenation ( ) is simply
the combined string ( ).
In the analysis of writing systems it is desirable to extend the defini-
tion of catenation to include two-dimensional operations. Specifically,
I propose five planar catenation operators:

left catenation

right catenation

downwards catenation

upwards catenation

surrounding catenation
The semantics of each of these is illustrated in Figure 1.
Surrounding catenation seems not to be needed for Indic scripts,
though it is needed for Chinese; see (Sproat, 2000). However most
Indic scripts employ all the other catenators; I will give illustrations in
the next section. Note that catenators are associative within the same
type so that ; however across types this is not
the case so that in general
.
The planar catenators are, of course, similar to the traditional way of
describing the placement of vowel symbols in Indic scripts, by using
a dash to represent the consonant, and by placing the vowel symbol in
γ(α) γ(β) γ(β) γ(α)

(a) (b)

γ(α) γ(β)

γ(β) γ(α)

(e)

Figure 1: the five planar concatenation operations:

(a) ; (b) ; (c) ; (d) ; (e)
.

the appropriate geometric position relative to the dash, representing the

normal position of that vowel when written with a real consonant sym-
bol. There is a difference, however: the catenators are more abstract in
that they distinguish only four relative positions, namely left-of, right-
of, above and below. Indeed, the specification of the combinatorics of
glyphs in terms of the planar catenators is at an intermediate level of
abstractness: it is more abstract than the traditional description since it
does not give the exact placement of glyphs only the general placement
relative to other glyphs; on the other hand it is less abstract than a rep-
resentation that merely says that two symbols are combined (catenated)
in a logical order, as is the case in the standard Unicode representation.
One may legitimately ask if there is any justification for having such an
intermediate level of representation. I believe there is, and I will attempt
to justify this later in the paper.
All scripts have an overall, or macroscopic, direction.4 In Indic
scripts the macroscopic direction is left-to-right. In terms of the planar
catenators, the catenator encoding the macroscopic direction in Indic
scripts is . Let us term the macroscopic catenator for Indic scripts.
Deviation from the macroscopic direction—choice of a catenator other
than the macroscopic catenator—is always local and occurs within
what I term the small linguistic unit (SLU) in (Sproat, 2000):

L OCALITY HYPOTHESIS : Changes from the macroscopic catena-

tion type can only occur within a graphic unit that corresponds to
a small linguistic unit.
The SLU seems to vary across writing systems, though in many writing
systems the SLU is the syllable. In most Indic scripts, the SLU is the
orthographic syllable, or aks.ara.
One other assumption is that when two elements are adjacent in the
linguistic representation, their graphical expressions must be adjacent,
according to whatever catenator is used. So is , for
some catenator ‘ ’. In particular, if is a sequence, and ‘ ’ is , we
expect to be written before whatever is written first in . Let us
refer to this as the A DJACENCY A XIOM. We can express this formally
as follows, where (read ‘ immediately path-precedes ’) de-

notes the situation where precedes and is adjacent to (Sproat, 2000):

A DJACENCY A XIOM: If and if is not idiosyn-

cratically specified to be written with a single symbol or ligature,
then
.
In what follows I will develop some additional formalisms beyond what
was introduced above.
Illustrations
I will illustrate the formalism by describing some phenomena in a
variety of Indic scripts, starting with Devanagari.
D EVANAGARI
Devanagari (Bright, 1996a) is used to represent several Northern lan-
guages, including Hindi, Nepali and Marathi, and is the script most
commonly used to represent Sanskrit. For the sake of the present discus-
sion, we will assume its use for Hindi. In the transliterations throughout
this paper, I will mostly avoid phonetic symbols. Digraphs representing
single phonemes will be enclosed in square brackets; thus ‘[sh]’. Long
vowels will be shown with doubling; thus ‘[ii]’ versus ‘i’. I will use ‘M’
after vowels to represent vowel nasalization.
The script has several features in common with most other Indic
scripts:
Syllable-initial vowels are represented as full symbols, but post-
consonantal vowels appear as diacritics; see Table 1. /a/ (phonet-
ically / / in Hindi) is the inherent vowel, and thus has no diacritic
mark.
The orthographic syllable or aks.ara starts at the left edge of a se-
quence of consonants, and ends at the right edge of the syllable’s

vowel. Consonant sequences are represented orthographically as

ligatured groups. Thus
k + [.s] +
s +
m yields
k yields sk ;
k[.s]m . Ligatures may

atures, such as the

involve leftwards catenation or downwards catenation; some lig-
k[.s] ligature in the previous example
are synchronically unanalyzable or ‘suppletive’ (on analogy with

morphological suppletion). Diacritic vowels that combine with

(in Devanagari i ) appear before the entire sequence: /ski/ is
represented i+sk
The anusvara symbol, representing a nasalized vowel, or a

postvocalic homorganic nasal stop, appears above the rightmost
symbol of the aks.ara: h[aa]M . Strictly speaking, a pure-

ly nasalized vowel should be written with candrabindu (thus
h[aa]M , which has the same placement as anusvara, but this
is often not observed.

Orthographic-syllable-initial /r/ is represented by a superscript
symbol (repha or arka) occurring above the last symbol of the
aks.ara: thus /varm[aa]/ is represented as vm[aa]+r .
The procedure for marking orthographic syllables, given a string of
phonemes, that applies to most Indic scripts (Tamil is an exception) can
be stated as follows:

Divide the phonological string into orthographic sylla-
bles by placing a syllable boundary :
At the beginning of the word;
Between each pair of adjacent (non-tautosyllabic)
vowels (as between the /[aa]/ and /[ii]/ in bhaaii
‘brother’)
Catenator

Full form

Diacritic form
(Null
a

ka )

[aa] k[aa]

o ko

e
[au] k[au]

k[ai]

ke
[ii] k[ii]

[ai]
u[uu] kuk[uu]
k[ri]

[ri]
i ki
Table 1: Full and diacritic forms for Devanagari vowels, classified by catena-
tor inherent to the diacritic forms.

Before the first consonant of a consonant sequence,

with the exception of a homorganic nasal, which may
optionally be syllabified with the preceding aks.ara.

To get the ligatured consonants we assume a function , which takes

a sequence of consonant symbols and forms a ligatured consonant
sequence:
!#" $ #"

Lexical specifications for certain ligatures, such as + =
k[.s] will override the general ligature. By default the catenation

operator is , but in certain instances, especially when the full form

% d —the catenation operator may be : % +

of the consonant does not have a vertical bar (Bright, 1996a)—e.g.,
becomes & dv .

Unlike the diacritic vowels (see below) where catenation type is purely
a property of the vowel, this choice of seems to be a choice of the pair
of consonants. It also seems to be subject to some stylistic variation:

+ can be ( or )
'
one can sometimes combine a pair of consonants in more than one way:
' '
ll . Sequence-final /r/, is always produced
by a reduced form, which we will notate as +*-,. $/ , subscripted (or
in some cases idiosyncratically fused with) the previous consonant:
0/ $ 1*-,2. $/ so that, e.g., 3 +4 becomes 5
pr .
If represents the full (syllable-initial) form of , then the rule
for graphically encoding syllable-initial vowels can be given as follows:
.

For postconsonantal vowels, the following general rules apply,

where *-,. represents the reduced diacritic vowel:

1*-,2.

That is, /a/ deletes, and all other vowels are represented by their
diacritic form. The catenation operator depends upon the vowel: For a
consonant sequence and vowel , the catenation operators are chosen
as follows:
1*-,. if /a,o,[au],[ii]/
1*-,. if /e,[ai]/
1*-,. if /[uu],u,[ri]/
1*-,. if /i/
The vowels /e,[ai],o,[au]/ allow a deeper analysis, to which we
will return in the next section.
Anusvara and candrabindu are simply conjoined with their logical
predecessor using the operator:

Finally, repha/arka can be described as follows: Given an orthograph-

ic syllable starting with /r/ and remainder starting with a non-null
consonant sequence:

$/ .
This places the arka above the remainder of the aks.ara; to place
it on the righthand side of the aks.ara we assume a general stylistic
principle of Devanagari that prefers to place superscripted materials to
the righthand edge of the aks.ara. We will see in the next section that
this is useful beyond this one case.
Catenator
[sh] Example
p
[sh]p
s[.n] [.th]
+ =

+
t = =
st[.n][.th]
kr
dk br ==
+

+
bd

t[ng] +
sk ==

ts[ng]k
r
+
+

k = rk
p p =

py
+

(fused)

k +
+

[sh] =
k[sh]
Table 2: Catenators used for Oriya consonant sequences. In many of the
above cases, the glyphs take on a reduced form, and ligaturing applies. Note
the representation of rk with a superscript reduced r , as in Devanagari.
k[sh] is represented with a fused glyph, also as in Devanagari.

O RIYA
Oriya (Mahapatra, 1996) is a Northern Indic script distantly related to
Devanagari (Salomon, 1996), though quite different in appearance. The
basic design of the Oriya script is of course identical to that of Devana-
gari: syllable-initial vowels are represented with their own symbols,
otherwise dependent vowel symbols are used, so that the orthographic
syllable parsing procedure given above for Devanagari also works for
Oriya. Consonant sequences are represented by combining the single
consonants using various catenators and often involving ligaturing the
consonants together.
Oriya consonant sequences are complex and there is much sequence-
specificity in the choice of catenators (often or , but sometimes ),

as well as the degree of fusion. Simple cases include

[.n][.th] ; both using
p , which becomes [sh]p ; and [.n] plus
[sh] plus
[.th] which
becomes the operator. Various examples are
given in Table 2.
The diacritic vowels are more orderly. Oriya simplex diacritic vowels
are shown in Table 3. The grammar for these vowels is similar to that
for Devanagari, except that /i/ uses rather than , and /e/ uses rather
than :
Catenator

Diacritic form
(Null
k[aa]
ka )

k[ii]
ki

k[uu]

kuk[ri]
ke
Table 3: Forms for simplex diacritic Oriya vowels, classified by catenator.

1*-,. if /a,[ii]/
1*-,. if /i/
1*-,. if /[uu],u,[ri]/
1*-,. if /e/
The three vowels, /o, [ai], [au]/ are expressed using combinations

of symbols requiring more than one catenator. Thus ko is a
combination of diacritic e and diacritic [aa] , thus

; k[ai]
thus

is a combination of diacritic e and a new superscript diacritic,
; and k[au] is a combination of all three, thus .5 Such
complex vowels are common in Indic scripts (as well as Indic-derived
scripts of Southeast Asia, such as Thai), and Devanagari is more the
exception than the rule in seemingly lacking them.
Let us assume that the complex vowels are represented as a set of

symbols each with their own catenator; furthermore let us assume that
the glyph for [ai] and [au] represents a ‘diphthong’ feature

[ai]
DIPH (since historically at least, these vowels were diphthongs). Then:

* ,2. , DIPH

o[au]
=
= * ,2. , 1*-,.
= * ,2. , DIPH , * ,2.
Let us further assume that the catenation of the elements of a
symbol set is defined according to the following axiom, where ‘ ’ is
the catenator for the th set member:

" ! " "

In other words, the catenation of a symbol with a symbol set is simply
the conjunction of the catenation of that symbol with the individual
symbols of the set. Then following this axiom we have, for k[au] :

=
=
DIPH =
DIPH =

If we return to Devanagari, we can see that exactly the same grammar

works for the vowels /o,[ai],[au]/ as it does in Oriya. Starting with

ke
DIPH, so that

we can suggest that the second stroke in k[ai]
k[ai] involves a combination of e and DIPH.
Similarly, ko is just a combination of e and a . Here, the
encodes

general stylistic principle of Devanagari that we appealed to above that

places superscripts at the right edge of the aks.ara will guarantee that

the

e stroke will appear above the [aa] rather than the k :

. Finally, au is just a combination of e , [aa] and DIPH,
with the placement of the e +DIPH complex above the [aa] being
again due to the stylistic principle: . Thus, in detail:
=
=
DIPH

=
DIPH =

full form vow-
This decomposition also carries over to the Devanagari

els. Thus o is (full) [aa] plus diacritic e : . [ai] is (full)
e plus DIPH: . And [au] is (full) [aa] plus diacritic e

plus DIPH: .6 Note that this explains the otherwise puzzling fact that
for full [ai] , there is only one stroke above the , whereas for all
other ‘diphthongs’ there are two: in this case the single stroke repre-

sents the DIPH feature, and the second stroke, which would normally

represent diacritic e is unnecessary since we already have the full

form e . This system for representing e , o , ai and au ,
would seem to have been inherited from Brahmi; see (Salomon, 1996,
Table 30.2, page 374).
K ANNADA
Kannada is a Southern Brahmi-derived script used to write the
Kannada language (Spencer, 1950; Bright, 1996b). The underlying
structure of the script is again, essentially the same as other Indic
scripts, though the surface details differ. Consonant clusters are written
in a much more regular fashion than Devanagari or Oriya, with the typ-
ical mode of expression being to use the catenator to place a reduced

follows: thus

l becomes

form of a non-initial consonant underneath the consonant it logically
l , so that .
If there are more than two consonants in the sequence, then the
subsequent consonants are not stacked, but are written in a left-to-right

sequence following the reduced and subscripted second consonant:
lak[.s]ma[.n]a ‘Lakshmana’ (Spencer, 1950, page 99).
This motivates a general rule for subscripted sequences whereby in
a sequence of catenations, all but the initial catenator are changed to :

K ANNADA L INEARIZATION:
(Note that this rule must apply right-to-left.) Thus the !

sequence is analyzed as .

Anusvara and arka are written at the end of their aks.aras but are
written inline; thus in Kannada anusvara uses since it is logically at
the end; and arka uses since it is logically at the beginning:7
t rt : $/ /
t t(a)M :

becomes

becomes

The diacritic vowel system is also simplified. The catenator is not

used for vowels, which are limited to and , with being used only
for [ri] , [rii] and the lower symbol of ai (a complex sym-
bol). Table 4 gives the analysis of Kannada diacritic vowels in terms
of these three catenators. I also include the symbol sets for complex
vowels, where LONG is used for a glyph that seems to be associat-
ed only with long vowels.8 With consonant sequences the vowels are
standardly described as being associated with the first (inline) conso-
nant (Bright, 1996b). According to the A DJACENCY A XIOM, vowels
that are written above ( ) the aks.ara must appear above the first con-
sonant, which they do, consistent with the standard description. For
vowels that have an or component, these components must appear
to the right or below the vertically arranged consonant cluster: so either
. . . #" or $ . . . #" . Thus we have (with
Catenator
ka )
Diacritic form
(Null
ku

k[aa]

k[uu]
kike

k[au]

k[ri]

k[rii]
i ( ), LONG ( )

k[ii]
e ( ), [uu] ( )

ko
e ( ), [ai] ( )

k[ai]
e ( ), LONG ( )

k[ee]
e ( ), [uu] ( ), LONG ( )

k[oo]

Table 4: Forms for diacritic Kannada vowels, classified by catenator. Com-
plex vowels are given at the bottom of the table. Note that [rii] only occurs
in Kannada transliterations of Sanskrit (Bill Bright, p.c.).

)

ka[.n][.n][ii]ru ‘tears’ and lak[.s]m[ii][.s]a

‘husband of Laxmi (Vishnu)’ (Spencer, 1950, pages 29, 341); here

the right component of [ii] appears after the consonant sequence.
Similarly (with ) we have

kr[ai]sta ‘Christian’,

[ee][sh]v[ai]kya ‘universal unity’ (Spencer, 1950, page 156, 343);
where the lower component of [ai] appears after the sequence of
lowered consonants. Note in particular the application of K ANNADA
L INEARIZATION to the righthand component of [ai] . It is worth ob-

serving that while this arrangement is consistent with the A DJACENCY
A XIOM, it contradicts the standard description of the arrangement of
vowels with consonant sequences.
TAMIL
From a formal point of view, Tamil (Steever, 1996; Radhakrishnan,
2002) is probably the simplest Indic script since it has eliminated the
aks.ara as a fundamental unit of the script. Virtually all consonant con-
juncts have been eliminated, and consonant sequences are written left-
to-right with unligatured consonant symbols. Steever (1996, page 426)
suggests that this trend may have been related to the introduction of
typography from the West. However, the vowels behave as they do in
other Indic scripts: syllable-initial vowels are written in a full form,
whereas postconsonantal vowels are written with a diacritic form, with
various of the four catenators being used, depending upon the vowel.
Given that consonant sequences have been completely linearized in
Tamil, with all consonant sequences being constructed using the macro-
scopic catenator , there would seem to be no motivation for maintain-
ing that the SLU is the orthographic syllable. Rather the SLU in Tamil
is a maximally CV(V) unit, with the possibilities being V(V) (syllable-
initial short or long vowel), C (consonant-sequence-medial consonant)
and CV(V) (consonant with diacritic short or long vowels). Ignoring
for the moment the decomposability of CV(V) elements and the fact
that there are simplex vowel glyphs representing both long and short
vowels, Tamil orthography is almost what I termed in (Sproat, 2000) a
core syllabary, like Japanese kana. Tamil is moving towards being an
alphasyllabic version of kana.
If the SLU is the CV(V) unit, not an orthographic syllable, the theory
predicts that vowel diacritics in Tamil can only attach to the final conso-
nant in a consonant sequence. In particular, a vowel that uses the cate-
nator must appear before the last consonant of a sequence, and cannot,
for example appear before the initial consonant, as it would in Devana-
gari or Oriya. This is because appearance before a non-final consonant

would require deviation from the macroscopic order in a unit larger than
the SLU. This prediction is correct: thus

ten with the ee symbol appearing before the
hek[.t][ee] is writ-
[.t] .
A procedure that would be appropriate for dividing Tamil words into
SLU’s would be as follows:

Divide the phonological string into orthographic units

by placing a unit boundary ‘ ’:
At the beginning of the word;
Between each pair of adjacent vowels;
Before every consonant of a consonant sequence.

The macroscropic catenator in Tamil is , as in other Indic scripts.

The simplex Tamil vowel diacritics are shown in Table 5. It should be
noted that while the forms with p are fairly transparent, many of the
consonant-vowel combinations involve more or less opaque ligatures so
Catenator Diacritic form

(Null pa )

p[aa]

pi
p[ii]

pu

p[uu]
pe

p[ee]
p[ai]

Table 5: Forms for simplex diacritic Tamil vowels, classified by catenator.

that, for instance [.n] is , whereas [.n][uu] is , which is quite

different from what one might expect on the basis of puu . Tamil
seems to be evolving towards kana in another regard, namely that the
CV(V) units are becoming unanalyzable.

The diacritic forms of o , and [au] , involve the same symbol

sets as in Devanagari and Oriya, and in addition Tamil has [oo] ,
which is formed on the basis of [ee] (these latter two invented by

o po
missionaries, Bill Bright, p.c.):

-,. , 1-,.

[au]
[oo]
=
=
=
*-,.
*-,.
, DIPH , * ,2.
p[au]
p[oo]
, * ,2.

of the [aa] glyph, though we assume that it is specified as catenating
Note that in Tamil the glyph is ligatured to the beginning
DIPH

with to the preceding consonant.

Sequence-medial consonants are marked with a dot—pul..li—that
‘cancels’ the inherent vowel; this has the same function as, say, vir āma9
in Devanagari, but is more common in Tamil since vowellessness

cannot be encoded by membership in a conjunct. Thus note the dot
above the k ( ) in hek[.t][ee] . The pul..li can be placed
by rule as follows:

A similar rule (replacing with ) would be required to encode
cancellation signs in other Indic scripts.

Computational analysis of Devanagari
As part of a Hindi text-to-speech system that we were developing at
Bell Labs (Narasimhan, Sproat, and Kiraz, 2003; Sproat, 1997b) (and
see (Bhaskararao and Mathew, 1992; Bhaskararao, Peri, and Updikar,
1994) for other work on Hindi TTS), we developed a practical imple-
mentation of Hindi orthography as part of the text analysis component
of the system. The task of a text-analysis component of a TTS system is
to map from text into a linguistic representation; it can thus be thought
of as a computational model of what humans do when they read aloud.
The Hindi text-analysis component transformed raw text into a phono-
logical representation, including as part of that process the expansion
of abbreviations, numbers and time expressions, as well as phonolog-
ical phenomena such as schwa deletion. All of this was implemented
in terms of the FST-based approach to text-analysis outlined in (Sproat,
1997a; Sproat, 2000).

The first component of the text-analysis module converted from De-
vanagari text into an abstract orthographic representation, where sym-

bols are left-catenated together in their logical order. Thus
%
would

be represented as varm[aa] , and would be represented as
hiNdii . Note that while this representation could also be thought
of as phonological (Hindi orthography being fairly close to a phone-
mic representation), phonological phenomena such as schwa-deletion
are not represented at this level.
Input text was represented using the mock ISO-8859 xdvng encoding,
for use with the X window system under Unix.10 The fonts were de-
signed to work under an ordinary browser with no special Devanagari-
specific rendering. Super- and subscripted vowels are handled in this
scheme by having the corresponding character codes appear in their log-
ical order (as in Unicode), but with the glyphs so designed that they will
over- or understrike the previous consonant. The placement of super- or

ing. For symbols that are massively ‘out of order’, such as short i

subscripts was thus technically correct if not always esthetically pleas-

or preconsonantal r (arka), these would be coded in their surface

orthographic order rather than their logical order. The consonant sym-
bols are mainly represented in the font without the vertical bar, so that
consonant sequences can be (mostly) represented by simply catenating
the symbols together, reserving a vertical bar (which is a separate glyph
with its own code) for the last consonant.
We adopted a generative approach to the problem in that we started
with the abstract orthographic representation and built FSTs that trans-
duced from that level to the surface orthographic realization. Since
FSTs are closed under inversion (Kaplan and Kay, 1994), we merely
need to invert this model to get something that will map from the sur-
face orthography into the abstract representation from which subsequent
linguistic processing can then take place.
The first stage of the mapping involved parsing the input string into
orthographic syllables (aks.ara). This was handled using the following
rewrite rule, which simply introduces a syllable boundary between a
vowel and any following consonant:11

O RTH S YLL:

Here represents the empty string per convention. Note that

context-dependent rewrite rules have been known since the 1970’s to be
implementable by FSTs (Johnson, 1972), and practical algorithms have
been developed in the last decade or so (Kaplan and Kay, 1994; Mohri
and Sproat, 1996).
The orthographic syllable boundary introduced by O RTH S YLL plays

an important role in various processes. For example, the placement of

i is handled by a pair of rules, one which converts a postconsonantal

i into a ‘trace’ , and a second that inserts a precursor for the
diacritic (non-full-form) i ( . * ) after an orthographic syllable
boundary, before a consonant sequence that is followed by the i

trace:

I - TRACE: i (i)

i . *
ORTH - I: (i)
Note that the I - TRACE is needed since even though the i is

written before the consonant sequence, the consonant sequence still
behaves as if there is a vowel after it: the rule VBAR that later inserts
a vertical bar on the final consonant of the sequence is sensitive to the
presence of a vowel in order to find the end of the sequence.

In a similar way, preconsonantal r is positioned by a pair of rules
to the end of the orthographic syllable, as follows:

R - TRACE: r (r)

r. * (r)
ORTH - R:

In the case of r there is no reason to preserve the trace, so
this can be automatically deleted.
The two-stage process involving deletion and insertion can be viewed
as a poor-man’s implementation of the directional catenators introduced
in the previous discussion, necessitated by the fact that the FST toolk-
it used in the implementation effectively encodes only one catenation
operator. Thus a more faithful implementation would select for i

and leave the relative order of the glyphs up to the final rendering.
Postconsonantal vowels other than /i/ are spelled out in their diacritic
form:

a a. *

aa aa . *

ii ii . *

u u . *

uu uu . *

e e. *

ai ai . *

o o . *

au au . *

ri ri . *

Nasalized vowels are marked with candrabindu, though this is

supposed to be replaced by anusvara if the vowel diacritic is a super-

script. Thus we have 3 for puM with candrabindu, but 3 with
anusvara for peM .12 We had the following rules for nasalized

vowels:

aM a. * [candrabindu]

aa M aa . * [candrabindu]

ii M ii . * [anusvara]

uM u. * [candrabindu]

uu M uu . * [candrabindu]

eM e. * [anusvara]

ai M ai . * [anusvara]

oM o. * [anusvara]

au M au . * [anusvara]

Full vowels are encoded using a set of rules similar to the above
rules for diacritic vowels, except that the rules are context independent
(the post-consonantal vowels having already been taken care of by the
rules above), and the full rather than the diacritic form

were also included for English open-o / / (full form ).
is used. Rules

Additional rules translated consonants into their orthographic encod-

ings, deleted the orthographic syllable boundary,
ed homorganic nasals as anusvara so that 3
and
'
optionally encod-
pensil ‘pencil’

could also be encoded as 3 '
peMsil .
A final set of rules converts the abstract specifications of glyphs as
output by the rules above into actual xdvng glyphs. Consonants are

divided into two classes, those that need a vertical bar to be complete in
final position (e.g. 3 p , b , [bh] , n ), and those that
do not (e.g. t. , [t.h] )

Non-VBAR consonants, including consonant sequences ending in
r or y (which cannot include any following consonant) can be
translated directly into glyphs. Thus, for example:

pr 5
sr
t.k
d.y

A vertical bar is inserted between VBAR consonants and a fol-

lowing vowel, which includes I - TRACE. Within consonant sequences,

no vertical bar is inserted, which produces the appropriate ligatured

forms in most cases. Thus skr would appear as and st as

, with the vertical bar attached to the t in the latter case. Special
ligatures such as ks. are also produced by rule.
Vowels are either spelled out in their full form or diacritic form de-

pending upon the output of the abstract rules above; the same is true for
pre-consonantal r .
The complete set of rules in lextools format13 can be made available
upon request.
Psycholinguistic Implications
The formal model developed above explicitly treats Indic scripts as
segmental at some level of analysis. Nevertheless, they are interestingly
different from Western segmental scripts in that they make far greater
use of two-dimensional layout in arranging their symbols, the vowel
symbols are largely subordinate to the consonant symbols, and there is
often a great deal of fusion between different segments which tends to
render the aks.ara, or various portions thereof, rather opaque. What do
these similarities and differences with Western segmental scripts imply
about how Indic scripts are understood and processed by native readers?
Much has been made in the literature on writing and language on the
issue of “segmental awareness” and its relation to the use of alphabet-
ic scripts. It has been claimed that people who learn writing systems
based on non-alphabetic scripts, including morphosyllabic systems like
Chinese, or quasi-syllabic systems like Japanese kana, are not able to
perform certain tasks that require awareness of phonological structure
at the level of segments. For example, phoneme reversal, whereby one
takes an input like /poki/ and one is required to reverse the vowels to
produce /piko/, is very hard for even literate speakers of Japanese to
do, though it is quite easy for literate speakers of English. In contrast,
the ability of speakers to manipulate syllables — e.g. syllable reversal
whereby /poki/ becomes /kipo/ — is unaffected by the writing system
one learns, and can even be handled by illiterates; see, e.g., (Prakash et
al., 1993).
An obvious question that comes up in this context is where one draws
the line between what one considers ‘alphabetic’ and what is ‘non-
alphabetic’. Indic scripts are particularly interesting in this regard since
they are clearly segmental in their abstract design, and yet the (ortho-
graphic) syllable plays an important role; hence the commonly used
term alphasyllabary (Bright, 1996a). More importantly, perhaps, they
are frequently taught as syllabaries (Karanth, 2003)14 something that
would surely affect literate speakers’ conscious phonological aware-
ness.
The strongest position on this issue is perhaps the one taken by Faber
(Faber, 1992). Faber explicitly argues that phonemic segmentation is an
epiphenomenon due to alphabetic writing (page 111):

. . . investigations of language use suggest that many

speakers do not divide words into phonological segments
unless they have received explicit instruction in such seg-
mentation comparable to that involved in teaching an al-
phabetic writing system.

She ties this claim in with a point made earlier on the same page to the
effect that recent phonological theories have argued against a primitive
status for segments:

There is by now a large and convincing body of evi-

dence that linguistic units representing acoustic or articu-
latory steady states need not be included as primitives in
linguistic representations of phonological structure.15

We first need to observe that Faber is raising two quite distinct issues
here. The first relates to the issue of speakers’ conscious awareness of
their language, and their ability to consciously manipulate phonolog-
ical units of a particular kind. The second relates to speakers uncon-
scious knowledge of language, and phonological and phonetic theories
of that knowledge. The two are not necessarily connected. It is perfect-
ly conceivable, for example, that segments in the traditional structural-
ist linguistic sense are a primitive of people’s unconscious phonological
system, yet they do not become consciously aware of segments unless
explicitly trained (any more than untrained people are conscious of how
their articulators make sounds).
Yet, one activity that would surely seem to require a conscious aware-
ness of segments is the design of a segmental writing system, so if con-
scious awareness of segments is an epiphenomenon we would seem to
have a classic chicken-and-egg problem: how were segmental systems
invented in the first place? For the Greek alphabet, the main question is
how the vowel symbols developed, since the consonants were already
provided by the Phoenician precursor. Faber’s explanation (Faber, 1992,
page 126) is that the vowels were a misinterpretation (emphasis hers) of
several of the Phoenician consonant symbols as vowel symbols, due to
the misperception of consonant sounds not found in Greek. 16 According
to this story, then, the Greek alphabet is the accident that happened just
once. But what about various other scripts including the alphasyllabic
scripts of South Asia and Ethiopia, as well as Hankul? In all of these
scripts the (orthographic) syllable plays a key role in the arrangement of
symbols, yet the design of the system is basically segmental in nature.
Faber defines these cases away by claiming that only scripts where con-
sonants and vowels are both represented (thus Semitic scripts fail), are
on a par (thus Indic scripts fail), and are linearly arranged (thus Hankul
fails), count as alphabetic and thus segmental. So, in Faber’s classifica-
tion, Indic scripts do not count as alphabetic.
Now, there is certainly evidence that people who were taught to read
Indic scripts (and are not literate in a purely alphabetic script such as
English), are less segmentally aware than their counterparts in places
where alphabetic scripts are used. This in turn is in keeping with Faber’s
expectations.
Consider, for example, work by Padakannaya (2000) on Kannada,
where he experimented with a variety of tasks that tested the abilities
of children of different ages to manipulate the sounds of words. These
tasks included recognition or manipulation of syllables:

Rhyme recognition: determining whether two words rhyme or

not;

Syllable deletion;

Syllable reversal;
as well as tasks that involved recognition or manipulation of segments:
Phoneme oddity: given four stimuli, determine the one stimu-
lus for which there is a different phoneme at a specified position.
Thus for nonsense words chota, beti, kale, mito, the odd one out
is kale since there is an /l/ rather than a /t/ in the third position;
Phoneme deletion;
Phoneme reversal.
Padakannaya compared two sets of children in school grades I through
VII. The first set were sighted children learning the standard Kannada
alphasyllabary, and the second were blind children who were learning a
purely alphabetic Kannada braille.
On syllable manipulation tasks there was no significant difference
found between the two sets of children. So for example, except for first
grade, where blind children outperformed sighted children, blind chil-
dren and sighted children were identical in their performance on syllable
deletion. On the other hand, for phoneme manipulation, blind children,
exposed to the alphabetic script, consistently outperformed sighted chil-
dren. For example, Figure 2 shows the performance in terms of percent
correct for the phoneme reversal task for both populations of children
from grades I through VII. For the first three grades, where the blind
children steadily improve in performance, the sighted children are not
able to do the task at all. In fourth grade sighted children start to be able
to do the task, but it is not until fifth grade that they start to catch up with
the blind children’s ability; not coincidentally, it is in the fifth grade
that English is first introduced into the curriculum. Results showing
similarly lower performance for segment manipulation tasks by read-
ers of Indic scripts are reported in (Prakash et al., 1993; Padakannaya,
1999; Karanth, 2002), inter alia.
On the other hand, as we can see even in Figure 2, segmental ma-
nipulation is not a categorical ability: by fourth grade, before they
started learning English, the sighted children had developed some, al-
beit weak, ability to reverse phonemes. One factor that seems to af-
fect phonemic awareness with readers of Indic scripts is how “sepa-
rable” or “noticeable” particular glyphs are are (Prakash et al., 1993,
and Prakash Padakannaya, p.c.). For example, Padakannaya and his
colleagues (1993, page 65) note that their Hindi subjects were 95% suc-

cessful in a phoneme deletion task at deleting the /d/ in doshii
do[sh][ii] to yield osh[ii]. Here, the
%
d forms a separate glyph
%

with an additional symbol to represent the vowel. On the other hand,

Figure 2: Phoneme reversal in sighted and blind Kannada children (the blind
children learning an alphabetic Braille), from (Padakannaya, 2000), used with
permission of the author. The horizontal axis represents the breakdown by
grade in school of the subjects; the vertical axis is percent correct on the
phoneme reversal task. Note in particular that the normals only begin to catch
up with their blind counterparts in fifth grade, which is when they start to learn
English.

they were not able to correctly delete the /n/ in nadii

%
nd[ii] ; in

this case the first syllable has the inherent vowel /a/, which is not written
with a separate glyph from the

ers found it easier to delete the /y/ in
n . Similarly, literate Hindi speak-
py which, while fused with

the preceding p is still in more its full form; than they did with the
/r/ in 5 pr , which is written as a small stroke at the bottom of the
p .
Cross-scriptal variation also results in differences in the ability to treat
certain sounds as separate units. Thus as Padakannaya and his col-
leagues (Prakash et al., 1993, page 66) note, Hindi speakers find it

hard to treat anusvara and arka as separate segments, since they are

written as diacritics above the end of the orthographic syllable: 3
peMsil ; aka+r ( arka ). On the other hand, this is easy for
'

Kannada speakers since in each case these symbols are written inline in

Kannada script (though in the case of arka, out of its “logical” order):

peMsil ;

aka+r .17
So some conscious segment-level manipulation is clearly possible, but
one might wonder how the segmental aspects of Indic scripts affect the
unconscious processing that is presumably involved when fluent read-
ers read words. A study by Vaid and Gupta (Vaid and Gupta, 2002)
addresses this question. Vaid and Gupta investigated naming latencies
(the speed with which readers are able to vocalized printed words) in
40 Hindi-speaking adults, and naming errors (the numbers of errors
made when vocalizing printed words) among 10 Hindi-speaking chil-
dren (7.5–10.0 years old) with words containing short /i/, which is writ-

ered cases where there was a single consonant (

'

ten before the consonants it logically follows. In particular, they consid-
i+tlk /tilak/

syllabic consonant sequence ( %

‘sectarian mark’, versus cases where there was a phonologically hetero-
m i+sjd /masjid/ ‘mosque’).
If Hindi readers of the Devanagari script processed things syllable-by-

syllable (as one might expect if Devanagari is really a syllabary) then
one would expect that the misorder of the i would only cause prob-

cases like
'

lems if a phonological syllable boundary intervened. In particular, in

i+tlk /tilak/, the first syllable would simply be pro-
cessed as a unit, and it should not matter that the i is out of its logi-
cal order; such cases should be as readily processed as cases where the

other hand, with cases like %

vowel symbol is written after the consonant it logically follows. On the
m i+sjd /masjid/, there would be

a processing cost since the i is separated from the syllable it logical-
ly belongs with by the intervening s . The results are consistent with
the view that at least some segmental processing is going on. In partic-
ular, while masjid-type cases resulted in slower naming than tilak-type

cases, tilak-type cases were also significantly slower than cases not in-
volving short i .18 Similarly, children produced more naming errors

for masjid-type cases than for tilak-type cases, they also produced more
errors for tilak-type than for cases not involving short i . Thus, it can-
not be the case that Hindi readers simply process the script syllable-by-
syllable: at least for glyphs that occur ‘inline’ (Vaid and Gupta (2002)
present no evidence of what happens with super- or subscripted vowels),
there is clear evidence that readers also process segment-by-segment.

What the cases cited from (Prakash et al., 1993) and (Vaid and Gupta,
2002) — Devanagari d , y , short i , and Kannada arka and
anusvara — all have in common is that each of these involves a glyph
that is written inline. That is, they involve a catenation operator (ei-
ther or ) that is horizontal to the overall script direction ( ). This
leads one to the hypothesis that the ability to process a symbol as a
separate segment is enhanced when that symbol is written horizontal
to the overall script direction. This view is consistent with other re-
search that shows that superscripted diacritics are less noticeable than
inline glyphs; thus van Heuven (2002) shows for Dutch that errors in
the placement of dieresis (required by the rules of Dutch orthography in
cases where adjacent vowels constitute separate syllables) have hardly
any effect on lexical decision tasks, whereas an error in a letter has a
substantial effect.
On the other hand, this cannot be the whole story. Padakannaya (2001)
notes that in Kannada consonant sequences C C , C had a much lower
success rate in a phoneme deletion task than C . Thus in
rakta
‘blood’, the /t/ should be easier to delete than the /k/. Padakannaya
suggests that this is because the first consonant, while written inline, is
generally ligatured with vowel symbols (though not in this particular

case), whereas the second position (though written underneath the pre-
ceding consonant and though often reduced, as in the case of t here)
is not adorned with vowel symbols. So there must be an additional fac-
tor of visual distinctiveness: if it is easier to parse out the segment in
question, it will be processed more like a segment. The use of ligature
is clearly a factor in making a segmental analysis less accessible.
So, there is clear evidence that properties of Indic scripts have an ef-
fect on processing at the segmental level. But while it is clearly the case
that learners of Indic scripts are not able to perform segmental manip-
ulations as well as learners of purely alphabetic scripts, it is also clear
that segmental abilities do show up in cases where the script supports
it. In particular, segmental glyphs that are written inline, and glyphs
that are readily separable from their surroundings have a better chance
of engendering segmental awareness than glyphs that do not have these
properties.
Still, no matter what the psycholinguistic effects of learning an Indic
script may be, one must still deal with the point that Indic scripts are
segmental in their basic design, and this raises the question of how the
Brahmi script was invented in the first place. Assuming it was not the
same ‘accident’ that has been ascribed to the development of the Greek
alphabet from Phoenician, and perhaps even if it was, this would seem
to presume a fair level of segmental awareness and general phonologi-
cal sophistication on the part of the previously illiterate inventors of the
script (Patel, 1993). One can speculate that such phonological sophisti-
cation may have arisen out of the language games that were played by
Vedic pracitioners for the purpose of preserving the sacred texts in an
oral tradition. These games were quite complex: a particularly baroque
instance was the ghana-pāt.ha (Macdonell, 1900, page 42) which re-
quired one to step through a sequence of words a b c d as ab, ba, abc,
cba, abc; bc, cb, bcd . . . . Since this process required untying segmen-
tal sandhi phenomena at each word juncture, one can imagine that this
exercise would enhance the practitioner’s awareness of segmental phe-
nomena. It is an open question whether the most elaborate of these
games predated literacy, but it seems to be widely believed that Vedic
practitioners did use such techniques even before literacy (and perhaps
even as an antidote to the threat of literacy) in order to guarantee that
the sacred texts were preserved in their correct form; Steve Farmer, p.c.,
and see also (Farmer, Henderson, and Witzel, 2002, page 68). 19
In principle, there is no reason to assume that the ability to manipulate
segments (or phonological units of any size) depends solely upon learn-
ing a script with a particular set of properties. Rather, it is reasonable to
assume that the propensity for this ability is there in all of us, but that it
takes certain external factors, such as learning a script, or participating
in certain kinds of language games, for the ability to fully develop.
Justification for the Level of Abstraction
One issue that we have left hanging is the justification for the partic-
ular level of abstraction I have chosen for the formal description. As
we noted before, there are a number of levels of abstraction that one
could choose in describing a writing system, ranging from a very ab-
stract level at which elements are catenated in their logical order, to
the surface level where the exact placement and rendering of glyphs
is accomplished. We have chosen a level intermediate to that, where
directional catenators allow one to broadly describe the arrangement

of symbols, but we still abstract away from the exact placement. In
this scheme it is sufficient to say that the diacritic i in Devanagari

occurs before the consonant sequence it logically follows, abstracting

away from the detail that the upper stroke of the i is actually written
over the following consonant sequence as in ki . Similarly one

k[uu] , without the finer
can describe the diacritic vowel [uu] in Oriya as occurring under
the consonant it logically follows as in
grained description that it actually occurs on the right at the bottom.
The justification for this is that it seems to be the right level of gran-
ularity to describe some of the psycholinguistic results we have dis-
cussed. So, it seems to be a factor that certain glyphs are written above
or below the main glyphs, but there is no justification for making a

finer-grained distinction than that. In Vaid and Gupta’s (2002) exper-
iment, the placement of diacritic i before the consonant sequence
was clearly relevant, whereas the fact that it also extends above the con-
sonant sequence was not, so far as one knows, a factor.
Of course, this is all extremely sketchy at this point since no psy-
cholinguistic work has yet been done that addresses the question of
whether, e.g., placement of a glyph on top to the right, or on top to
the left, makes a significant difference in some measurable behavior.
My expectation is that no such results will be forthcoming, but that it
will be possible to measure more effects of choice among the four basic
catenators used here.
Conclusions
The formal model presented in this paper allows for the succinct de-
scription of some quite intricate writing systems, such as the writing
systems of India. The theory is able to make predictions, as we saw in
the case of the placement of vowels in Tamil that deviate in their catena-
tion from the macroscopic direction, and the placement of the secondary
- and - catenated vowel components in Kannada, and it is able to pro-

vide some insights into the commonalities among Indic scripts, as in the
case of the encoding of o in diphthongs in Devanagari and Oriya.
The model also has implications for human processing of scripts.
Since it treats Indic scripts as fundamentally segmental, despite their ob-
vious differences from Western alphabets, it implies that there should be
some evidence that readers of Indic scripts are able to develop some seg-
mental awareness, and that psycholinguistic studies of reading should
show some processing at the segment level rather than just at the syl-
lable level. These expectations are borne out: although it is very clear
that readers of Indic scripts are much less able to manipulate segments
than readers of alphabetic scripts, there are clearly also cases where they
are able to do so. And the naming studies reported in (Vaid and Gup-
ta, 2002) show that processing of Devanagari at least partly occurs at
the level of the segment. Through the work of Padakannaya and others,
there is much that is now understood about the psycholinguistic effects
of learning Indic scripts, but there is also much that is not known. To
my mind this will be one of the most fruitful areas of future research on
Indian writing systems.
Acknowledgments
I wish to thank Bill Bright, Steve Farmer, Prakash Padakannaya,
Pramod Pandey, and George Thompson, for helpful comments on pre-
vious versions of this paper.
References
Aliprand, Joan, editor. 2003. The Unicode Standard, Version 4.0. Addison Wesley
Professional, Reading, MA.
Bhaskararao, Peri and Suresh Mathew. 1992. Phonemic transcription rules for text-to-
speech synthesis of Hindi. In R.M.K. Sinha, editor, Computer Processing of Asian
Languages. TATA McGraw Hill, New Delhi.
Bhaskararao, Peri, Venkata Peri, and Vishwas Updikar. 1994. A text-to-speech sys-
tem for application by visually handicapped and illiterate. In Proceedings of ICSLP
94, pages 1239–1242, Yokohama.
Blevins, Juliette. 2002. Evolutionary Phonology. Ms., University of California,
Berkeley, Submitted to Cambridge University Press.
Bright, William. 1996a. The Devanagari script. In Peter Daniels and William Bright,
editors, The World’s Writing Systems. Oxford University Press, New York, NY,
pages 384–390.
Bright, William. 1996b. Kannada and Telugu writing. In Peter Daniels and William
Bright, editors, The World’s Writing Systems. Oxford University Press, New York,
NY, pages 413–419.
Browman, Cathe and Louis Goldstein. 1989. Articulatory gestures as phonological
units. Phonology Yearbook, 6:201–251.
Daniels, Peter. 1996. The study of writing systems. In Peter Daniels and William
Bright, editors, The World’s Writing Systems. Oxford University Press, New York,
NY, pages 3–17.
DeFrancis, John. 1989. Visible Speech: The Diverse Oneness of Writing Systems.
University of Hawaii Press, Honolulu, HI.
Diller, Anthony. 1996. Thai and Lao writing. In Peter Daniels and William Bright,
editors, The World’s Writing Systems. Oxford University Press, New York, NY,
pages 457–466.
Faber, Alice. 1992. Phonemic segmentation as epiphenomenon. evidence from the
history of alphabetic writing. In Pamela Downing, Susan Lima, and Michael Noo-
nan, editors, The Linguistics of Literacy. John Benjamins, Amsterdam, pages 111–
34.
Farmer, Steve, J.B. Henderson, and Michael Witzel. 2002. Neurobiology, layered
texts, and correlative cosmologies: A cross-cultural framework for premodern his-
tory. Bulletin of the Museum of Far Eastern Antiquities, 72:48–90.
Fujimura, Osamu. 1994. The C/D model: A computational model of phonetic imple-
mentation. In DIMACS Series in Discrete Mathematics and Theoretical Computer
Science, volume 17, pages 1–20. American Mathematical Socieaty.
Gill, Harjeet Singh. 1996. The gurmukhi script. In Peter Daniels and William Bright,
editors, The World’s Writing Systems. Oxford University Press, New York, NY,
pages 395–398.
Haile, Getatchew. 1996. Ethiopic writing. In Peter Daniels and William Bright, edi-
tors, The World’s Writing Systems. Oxford University Press, New York, NY, pages
569–576.
Hopcroft, John and Jeffrey Ullman. 1979. Introduction to Automata Theory, Lan-
guages and Computation. Addison-Wesley, Reading, MA.
Johnson, C. Douglas. 1972. Formal Aspects of Phonological Description. Mouton,
Mouton, The Hague.
Kaplan, Ronald and Martin Kay. 1994. Regular models of phonological rule systems.
Computational Linguistics, 20:331–378.
Karanth, Prathibha. 2002. Reading into reading research through nonalphabetic lens-
es: Evidence from Indian languages. Topics in Language Disorders, 22(5):20–31.
Karanth, Prathibha. 2003. Literacy and language processes — orthographic and struc-
tural effects. In Prathibha Karanth and Joe Rozario, editors, Learning Disability in
India: Willing the Mind to Learn. Sage, chapter 6, pages 145–160.
King, Ross. 1996. Korean hankul. In Peter Daniels and William Bright, editors, The
World’s Writing Systems. Oxford University Press, New York, NY, pages 218–227.
Macdonell, Arthur. 1900. A History of Sanskrit Literature. Heinemann, London.
Mahapatra, B. P. 1996. Oriya writing. In Peter Daniels and William Bright, editors,
The World’s Writing Systems. Oxford University Press, New York, NY, pages 404–
407.
Mohri, Mehryar. 1997. Finite-state transducers in language and speech processing.
Computational Linguistics, 23(2).
Mohri, Mehryar, Fernando Pereira, and Michael Riley. 1998. A rational design for a
weighted finite-state transducer library. Lecture Notes in Computer Science, (1436).
Mohri, Mehryar and Richard Sproat. 1996. An efficient compiler for weighted rewrite
rules. In 34th Annual Meeting of the Association for Computational Linguistics,
pages 231–238, Santa Cruz, CA. ACL.
Narasimhan, Bhuvana, Richard Sproat, and George Kiraz. 2003. Schwa-deletion in
Hindi text-to-speech synthesis. International Journal of Speech Technology. forth-
coming.
Ohala, Manjari. 1983. Aspects of Hindi Phonology. Motilal Banarsidass, Delhi.
Padakannaya, Prakash. 1999. Reading disability and knowledge of orthographic prin-
ciples. Psychological Studies, 44:59–64.
Padakannaya, Prakash. 2000. Is phonemic awareness an artefact of alphabetic litera-
cy?! Presented at ARMADILLO 11, Texas A&M, October 13–14, 2000.
Padakannaya, Prakash. 2001. Syllabic and phonemic awareness in children acquiring
literacy in a semisyllabic script. Department of Psychology, University of Mysore.
Patel, P. G. 1993. Ancient India and the orality-literacy divide theory. In Robert
Scholes, editor, Literacy and Language Analysis. Lawrence Erlbaum Associates,
Hillsdale, NJ, pages 199–208.
Prakash, P., D. Rekha, R. Nigam, and P. Karanth. 1993. Phonological awareness, or-
thography and literacy. In Robert Scholes, editor, Literacy and Language Analysis.
Lawrence Erlbaum Associates, Hillsdale, NJ, pages 55–70.
Radhakrishnan, Sankaran. 2002. Tamil Script Learners Manual. University of Texas
at Austin, Department of Asian Studies, Austin, TX. Available at inic.utexas.
edu/asnic/radhakrishnan/pages/tamilscript.html.
Ratliff, Martha. 1996. The Pahawh Hmong script. In Peter Daniels and William
Bright, editors, The World’s Writing Systems. Oxford University Press, New York,
NY, pages 619–624.
Salomon, Richard. 1996. Brahmi and Kharoshthi. In Peter Daniels and William
Bright, editors, The World’s Writing Systems. Oxford University Press, New York,
NY, pages 373–383.
Sampson, Geoffrey. 1985. Writing Systems. Stanford University Press, Stanford, CA.
Smalley, William, Chia Koua Vang, and Gnia Yee Yang. 1990. Mother of Writing:
The Origin and Development of a Hmong Messianic Script. University of Chicago
Press, Chicago, IL.
Spencer, Harold. 1950. Kanarese Grammar. Wesley Press, Mysore. Revised edition
by W. Perston.
Sproat, Richard. 1997a. Multilingual text analysis for text-to-speech synthesis. Jour-
nal of Natural Language Engineering, 2(4):369–380.
Sproat, Richard, editor. 1997b. Multilingual Text to Speech Synthesis: The Bell Labs
Approach. Kluwer Academic Publishers, Boston, MA.
Sproat, Richard. 2000. A Computational Theory of Writing Systems. Cambridge
University Press, Cambridge.
Steever, Sanford. 1996. Tamil writing. In Peter Daniels and William Bright, editors,
The World’s Writing Systems. Oxford University Press, New York, NY, pages 426–
430.
Swiggers, Pierre. 1996. Transmission of the Phoenician writing system to the west.
In Peter Daniels and William Bright, editors, The World’s Writing Systems. Oxford
University Press, New York, NY, pages 261–270.
Vaid, Jyotsna and Ashum Gupta. 2002. Exploring word recognition in a semi-
alphabetic script: The case of Devanagari. Brain and Language, 81:679–690.
van Heuven, Vincent. 2002. The effects of diaeresis on visual word recognition in
Dutch. In Martin Neef, Anneke Neijt, and Richard Sproat, editors, The Relation of
Writing to Spoken Language, number 460 in Linguistische Arbeiten. Max Niemey-
er, Tübingen, pages 99–114.

Notes
1
As is well-known, Sampson (1985), has argued that Hankul is a featural script.
This has been argued against by DeFrancis (1989), and see also (Sproat, 2000).
2
Though in Gurmukhi the full-vowel forms are replaced with vowel-specific “vow-
el bearers”, along with the diacritic form (Gill, 1996), and this trend of using vowel
bearers has developed further in Southeast Asian Brahmi derivatives.
3
I will use the term consonant sequence throughout to refer to a sequence of con-
sonants within an aks.ara. By so doing, I wish to emphasize the fact that these do not
generally represent phonological consonant clusters.
4
Some, like Chinese, have several different overall directions to choose from.

5
In Mahapatra’s Table 35.1 (1996, page 405), k[au] appears as with the
and glyphs unfused.
6
In Oriya, where the decomposition of /o,[ai],[au]/ is less controversial than the
proposal for Devanagari, the full vowel forms are more idiosyncratic.
7
Use of arka is not obligatory in Kannada: one can also write the sequence rC in
its logical way, with the full form of the r glyph and the second consonant in a
reduced form underneath it: e.g., rta .

8
Interestingly Spencer (1950) suggests that the inherent vowel /a/ is actually rep-
resented in the orthography by the headstroke that is included with many of the con-
sonant symbols—e.g. k —and which is lost in the reduced forms: consider the
second k of kk . His argument is that the headstroke is lost with some of the
vowels, as with ki . This seems doubtful, however: the headstroke is only lost
in the reduced forms, and with vowels that would conflict with the headstroke. Thus
the headstroke is not lost with ku .
9
Also used to ‘cancel’ inherent vowels and occasionally used in Devanagari as a
substitute for a ligatured consonant.
10
See http://www.sibal.com/sandeep/jtrans.
11
This rule does not handle the case of a homorganic nasal being written with anus-
vara, and hence being syllabified with the preceding aks.ara.
12
In practice, however, we found that the conventions for candrabindu usage are
often not adhered to; a later ‘stylistic’ rule allows anusvara to substitute for candra-
bindu.
13
http://www.research.att.com/sw/tools/lextools/
14
Though there have been recent attempts to decompose the complex symbols when
they are taught (Karanth, 2002).
15
The work in phonology that argues against the segment as a primitive of phono-
logical representation relies heavily on the concept that phonetic features are features
of syllables rather than segments. Segments appear phonetically as an epiphenomenon
of coordinated timing of specific features. This view has been expressed in vari-
ous phonological theories, but is perhaps most strongly associated with “articulatory
phonology” (Browman and Goldstein, 1989), and see also (Fujimura, 1994).

Problematic for that view is recent work by Blevins (2002) on historical metathe-
sis, which involves shifting articulatory features from one (segmental) position within
a syllable to another, typically flipping between pre- and post-rime position. The his-
torical evidence suggests that language learners have a strong propensity to associate
those features with one or another position, with change occurring when the position
decided on by learners of one generation differs from the position assumed by learn-
ers of the preceding generation. But if the anti-segmental view favored by Faber is
right, one has to wonder why learners would have any propensity at all to associate
the features with a particular segmental position. In particular, why could the features
not just remain associated with the syllable, with no preferred timing with respect to
other segments?
16
On the other hand, Swiggers (1996) seems to imply a rather more conscious de-
sign on the part of the Greeks than is suggested by Faber’s explanation.
17
While arka and anusvara make segmental manipulation easier, interestingly they
also make syllable deletion harder (Padakannaya, 2001).
18
Note that the naming latency difference between tilak and masjid probably cannot
be explained solely by the fact that masjid has more letters than tilak: both were slower
than cases like tasv[ii]r ‘picture’, which has the same number of letters as masjid.
19
George Thompson (p.c.) also notes that the notion of aks.ara itself precedes liter-
acy.

Bbslides
No ratings yet
Bbslides
103 pages
Cpms Long Iwlc 06
No ratings yet
Cpms Long Iwlc 06
19 pages
Altmann, Gabriel - Analyses of Script - Properties of Characters and Writing Systems (Quantitative Linguistics) (German Edition) (2008, Walter de Gruyter)
No ratings yet
Altmann, Gabriel - Analyses of Script - Properties of Characters and Writing Systems (Quantitative Linguistics) (German Edition) (2008, Walter de Gruyter)
184 pages
Siva Sutra Paper
No ratings yet
Siva Sutra Paper
16 pages
Brahmi Script
No ratings yet
Brahmi Script
25 pages
Quantitative Methods For Classifying Writing Systems: Gerald Penn
No ratings yet
Quantitative Methods For Classifying Writing Systems: Gerald Penn
3 pages
Lies SL
No ratings yet
Lies SL
289 pages
Brahmi Script Recognition Study
No ratings yet
Brahmi Script Recognition Study
7 pages
Paninis Grammar
No ratings yet
Paninis Grammar
19 pages
(Ebook) Analyses of Script: Properties of Characters and Writing Systems (Quantitative Linguistics) (German Edition) by Altmann, Gabriel ISBN 9783110196412, 3110196417 2025 Easy Download
100% (1)
(Ebook) Analyses of Script: Properties of Characters and Writing Systems (Quantitative Linguistics) (German Edition) by Altmann, Gabriel ISBN 9783110196412, 3110196417 2025 Easy Download
108 pages
Prakrit Phonology
No ratings yet
Prakrit Phonology
13 pages
Primus Featural Analysis 2004 PDF
No ratings yet
Primus Featural Analysis 2004 PDF
40 pages
Gujarati Script Grammar
No ratings yet
Gujarati Script Grammar
40 pages
Calli Monanty
No ratings yet
Calli Monanty
10 pages
282860malayalam Script Grammar
No ratings yet
282860malayalam Script Grammar
47 pages
282860malayalam Script Grammar PDF
No ratings yet
282860malayalam Script Grammar PDF
47 pages
10 1 1 19 3403
No ratings yet
10 1 1 19 3403
18 pages
Jan Gonda A Concise Elementary Grammar of The Sanskrit Language PDF
No ratings yet
Jan Gonda A Concise Elementary Grammar of The Sanskrit Language PDF
158 pages
Jan Gonda - A Concise Elementary Grammar of The Sanskrit Language
83% (6)
Jan Gonda - A Concise Elementary Grammar of The Sanskrit Language
158 pages
Deciphering Indus Script As A Cryptogram
100% (1)
Deciphering Indus Script As A Cryptogram
59 pages
Deciphering Indus Script As A Cryptogram
No ratings yet
Deciphering Indus Script As A Cryptogram
68 pages
Deciphering Indus Script As A Cryptogram
No ratings yet
Deciphering Indus Script As A Cryptogram
62 pages
Sanskrit The First Programming Language
No ratings yet
Sanskrit The First Programming Language
3 pages
Script Grammar Marathi
No ratings yet
Script Grammar Marathi
44 pages
Brahmi For Lingua Akshara
No ratings yet
Brahmi For Lingua Akshara
43 pages
CS TR 83 965
No ratings yet
CS TR 83 965
159 pages
Netineti PDF
No ratings yet
Netineti PDF
38 pages
DIFFICULTIES IN LEARNING TAMIL - IRREGULARITIES IN ... - Infitt
No ratings yet
DIFFICULTIES IN LEARNING TAMIL - IRREGULARITIES IN ... - Infitt
6 pages
Rosettaproject Mar Ortho-1
No ratings yet
Rosettaproject Mar Ortho-1
7 pages
Phonology-Orthography Interface in Devanāgarī For Hindi : 2nd Proofs
No ratings yet
Phonology-Orthography Interface in Devanāgarī For Hindi : 2nd Proofs
9 pages
Brahmi Language
No ratings yet
Brahmi Language
3 pages
Script Identification in Printed Bilingual Documents
No ratings yet
Script Identification in Printed Bilingual Documents
12 pages
Comp Net - An Approach For - Chandra Sekharam Bondu
100% (1)
Comp Net - An Approach For - Chandra Sekharam Bondu
8 pages
An Analysis of Difficulties in Punjabi Language Automation Due To Non-Standardization of Fonts
No ratings yet
An Analysis of Difficulties in Punjabi Language Automation Due To Non-Standardization of Fonts
8 pages
The Development of Writing
No ratings yet
The Development of Writing
4 pages
The Unicode Standard, Version 12.0
No ratings yet
The Unicode Standard, Version 12.0
69 pages
Iks Unit 2
No ratings yet
Iks Unit 2
40 pages
Theory of Automata: Dr. S. M. Gilani
No ratings yet
Theory of Automata: Dr. S. M. Gilani
29 pages
Flat Notes
No ratings yet
Flat Notes
110 pages
Unit 1 - Introduction To Formal Language Theory.
No ratings yet
Unit 1 - Introduction To Formal Language Theory.
4 pages
FLAT Complete Notes
No ratings yet
FLAT Complete Notes
110 pages
CH 12
No ratings yet
CH 12
73 pages
UNIT 5 - Writing Systems
No ratings yet
UNIT 5 - Writing Systems
3 pages
Sanskrit
90% (10)
Sanskrit
245 pages
00 Introduction To The Devanagari Script
No ratings yet
00 Introduction To The Devanagari Script
245 pages
An Arabic Morphological Analyzer/Synthesizer: M.G. K, A. A - O and S. A - S
No ratings yet
An Arabic Morphological Analyzer/Synthesizer: M.G. K, A. A - O and S. A - S
23 pages
Unicode 6.2通用字符
No ratings yet
Unicode 6.2通用字符
33 pages
Splat
No ratings yet
Splat
10 pages
Interspeech 15
No ratings yet
Interspeech 15
4 pages
Monograph
No ratings yet
Monograph
100 pages
Sproat Shih
No ratings yet
Sproat Shih
29 pages
Late Uruk Pigs in Ancient Mesopotamia
No ratings yet
Late Uruk Pigs in Ancient Mesopotamia
13 pages
Text Normalization for Speech
No ratings yet
Text Normalization for Speech
81 pages
Academic Texts: A Student's Guide
No ratings yet
Academic Texts: A Student's Guide
7 pages
(Ebook) Colloquial Turkish: The Complete Course For Beginners by Backus, Ad Aarssen, Jeroen ISBN 9781138950214, 1138950211
No ratings yet
(Ebook) Colloquial Turkish: The Complete Course For Beginners by Backus, Ad Aarssen, Jeroen ISBN 9781138950214, 1138950211
51 pages
Word Formation Notes
No ratings yet
Word Formation Notes
5 pages
Present Simple Vs Present Continuous
0% (1)
Present Simple Vs Present Continuous
1 page
Bahasa Inggris. Kel 1
No ratings yet
Bahasa Inggris. Kel 1
13 pages
Faulkner R O A Concise Dictionary of Mid
100% (4)
Faulkner R O A Concise Dictionary of Mid
500 pages
Exams Are The Best Way To Find Out What Students Know. Do You Agree or Disagree? Why?
No ratings yet
Exams Are The Best Way To Find Out What Students Know. Do You Agree or Disagree? Why?
3 pages
Language Teaching Frameworks Guide
No ratings yet
Language Teaching Frameworks Guide
1 page
Eapp SLM - Template
100% (1)
Eapp SLM - Template
13 pages
KVS TGT PHE, 2023.02.14, Shift 02, 03.00PM-06.00PM
No ratings yet
KVS TGT PHE, 2023.02.14, Shift 02, 03.00PM-06.00PM
110 pages
Complete Ielts 5.0-6.5
No ratings yet
Complete Ielts 5.0-6.5
5 pages
Language Production 1
No ratings yet
Language Production 1
4 pages
Fast ForWord Scope and Sequence Guide
No ratings yet
Fast ForWord Scope and Sequence Guide
13 pages
GE Simple Complex Compund Sentences Min
No ratings yet
GE Simple Complex Compund Sentences Min
2 pages
Articles in Grammar - Useful Rules, List & Examples - 7ESL
No ratings yet
Articles in Grammar - Useful Rules, List & Examples - 7ESL
22 pages
Homework Cambridge Dictionary
100% (1)
Homework Cambridge Dictionary
5 pages
IELTS General Task Notes 1
No ratings yet
IELTS General Task Notes 1
7 pages
Ethnography of Communication
No ratings yet
Ethnography of Communication
3 pages
English for Learners & Speakers
100% (1)
English for Learners & Speakers
391 pages
Boosting Student Reading Fluency
No ratings yet
Boosting Student Reading Fluency
9 pages
Class Notes
No ratings yet
Class Notes
13 pages
Summative Test
100% (1)
Summative Test
2 pages
Tangkhul Curriculum Overview
No ratings yet
Tangkhul Curriculum Overview
2 pages
8th English Term-1 - WAY To SUCCESS - COMPLETE GUIDE by WTS Teachers Team-Watermarked
No ratings yet
8th English Term-1 - WAY To SUCCESS - COMPLETE GUIDE by WTS Teachers Team-Watermarked
88 pages
Francophone Countries Overview
No ratings yet
Francophone Countries Overview
1 page
Newmark Translation Theory
No ratings yet
Newmark Translation Theory
24 pages
Jobs and Workplaces Lesson Plan
No ratings yet
Jobs and Workplaces Lesson Plan
3 pages
Class 6 Eng WS
No ratings yet
Class 6 Eng WS
3 pages
Gerund and Infinitive Usage Guide
No ratings yet
Gerund and Infinitive Usage Guide
2 pages
English Entrance Exam 2014/2015
No ratings yet
English Entrance Exam 2014/2015
5 pages

A Formal Computational Analysis of Indic Scripts: University of Illinois at Urbana-Champaign

Uploaded by

A Formal Computational Analysis of Indic Scripts: University of Illinois at Urbana-Champaign

Uploaded by

A FORMAL COMPUTATIONAL ANALYSIS OF INDIC

a sequence of phonemes), and maps it to a graphemic representation,

right catenation 

Figure 1: the five planar concatenation operations:

the appropriate geometric position relative to the dash, representing the

L OCALITY HYPOTHESIS : Changes from the macroscopic catena-

A DJACENCY A XIOM: If  and if  is not idiosyn-

atures, such as the

Before the first consonant of a consonant sequence,

To get the ligatured consonants we assume a function  , which takes

% d —the catenation operator may be   : % +

For postconsonantal vowels, the following general rules apply,

     

Finally, repha/arka can be described as follows: Given an orthograph-

 * ,2.  , DIPH 

      "      !         "  "

If we return to Devanagari, we can see that exactly the same grammar

general stylistic principle of Devanagari that we appealed to above that

sequence is analyzed as        .

Divide the phonological string into orthographic units

The macroscropic catenator in Tamil is  , as in other Indic scripts.

 *-,.  , 1*-,.   

with  to the preceding consonant.

or preconsonantal r (arka), these would be coded in their surface

Here represents the empty string per convention. Note that

Additional rules translated consonants into their orthographic encod-

A vertical bar is inserted between VBAR consonants and a fol-

. . . investigations of language use suggest that many

There is by now a large and convincing body of evi-

Rhyme recognition: determining whether two words rhyme or

with an additional symbol to represent the vowel. On the other hand,

they were not able to correctly delete the /n/ in nadii 

ered cases where there was a single consonant (

syllabic consonant sequence (  %

other hand, with cases like  %

You might also like

right catenation

A DJACENCY A XIOM: If and if is not idiosyn-

To get the ligatured consonants we assume a function , which takes

% d —the catenation operator may be : % +

* ,2. , DIPH

" ! " "

sequence is analyzed as .

The macroscropic catenator in Tamil is , as in other Indic scripts.

-,. , 1-,.

with to the preceding consonant.

they were not able to correctly delete the /n/ in nadii

syllabic consonant sequence ( %

other hand, with cases like %