A Formal Computational Analysis of Indic Scripts: University of Illinois at Urbana-Champaign
Richard Sproat
University of Illinois at Urbana-Champaign
Introduction
The Brahmi-derived Indic scripts occupy a special place in the study
of writing systems. They are alphasyllabic scripts (Bright, 1996a)
(though Daniels (1996) prefers the term abugida), meaning that they
are basically segmental in that almost all segments are represented in
the script, yet the fundamental organizing principle of the script is the
(orthographic) syllable: crudely, an orthographic syllable (akṣara)
in an Indic script consists of a consonantal core of one or
more consonant symbols, with the vowels variously arranged around
the core. The one exception to the principle that segments are repre-
sented is the inherent vowel, usually a schwa or /a/-like phoneme: when
the consonantal core is marked with no vowel, it is interpreted as having
the inherent vowel (though phonological rules such as Hindi schwa-deletion
(Ohala, 1983) can complicate this characterization).
Brahmi-derived scripts are by no means the only scripts that are ba-
sically segmental but have the syllable as an organizing principle: no-
table other scripts are Ethiopic (Haile, 1996) and Korean Hankul (King,
1996); there have been claims that the latter is derived from Indic
scripts, though King notes (page 219) that there have been “no less than
ten different ‘origin theories’.”1 But Brahmi scripts are indubitably the
most successful, spawning not only several scripts in the Indian sub-
continent, but also spreading into Southeast Asia and spawning scripts
such as Thai and Lao (Diller, 1996), and even arguably influencing the
development in modern times of the Pahawh Hmong messianic script
(Smalley, Vang, and Yang, 1990; Ratliff, 1996).
The Indic scripts have some properties that make them interesting
from a number of perspectives. For instance, while the individual
scripts differ in details, they share a lot of properties; for example,
many scripts have encodings for the full set of Sanskrit-derived consonants,
and in all the scripts, vowel-initial akṣara are written with a
full vowel form,2 whereas consonant-initial akṣara are written with
diacritic vowels around the consonant. This is convenient from the
point of view of text encoding, since to a large extent one can
transliterate between different Indic scripts merely by shifting
the codespace of the text; this is what is done in Unicode (Aliprand,
2003, and www.unicode.org). The parallels between scripts can be
quite strong indeed: as I shall argue in a later section, the encoding of
diacritic vowels in Devanagari and Oriya is identical at the right level
of abstraction. Indic scripts also make much richer use of the
two-dimensional layout possibilities afforded by the written medium than
do Western alphabetic scripts, since vowels may be written above, below,
or even before the consonants they logically follow; and in consonant
sequences in many scripts (Tamil is a notable exception; Steever,
1996) all but one of the consonants take on a reduced form, which
may be written either in linear sequence with, or above or below the
“primary” consonant (usually either the first or last in the logical se-
quence).3 In this paper I will express these two-dimensional layout pos-
sibilities in terms of a planar regular calculus.
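The codespace-shifting idea mentioned above can be sketched as follows. This is only a minimal illustration: the function name `shift_script` and the table `BLOCK_START` are hypothetical, and the sketch assumes the parallel 128-code-point Unicode blocks for each script. Not every position is assigned in every block, so a real transliterator would need script-specific exceptions.

```python
# Start of each script's 128-code-point block in Unicode.
# (Illustrative subset; the names here are assumptions of this sketch.)
BLOCK_START = {
    "devanagari": 0x0900,
    "oriya": 0x0B00,
    "tamil": 0x0B80,
    "kannada": 0x0C80,
}
BLOCK_SIZE = 0x80


def shift_script(text: str, source: str, target: str) -> str:
    """Move every character of the source block to the same offset in the target block."""
    src = BLOCK_START[source]
    tgt = BLOCK_START[target]
    out = []
    for ch in text:
        cp = ord(ch)
        if src <= cp < src + BLOCK_SIZE:
            # Same offset within the block, different block.
            out.append(chr(cp - src + tgt))
        else:
            # Characters outside the source block are left alone.
            out.append(ch)
    return "".join(out)


# Devanagari "ki" (consonant ka + vowel sign i), stored in logical order.
devanagari_ki = "\u0915\u093f"
kannada_ki = shift_script(devanagari_ki, "devanagari", "kannada")
oriya_ki = shift_script(devanagari_ki, "devanagari", "oriya")
```

Note that the shift operates on the logical order of the characters; where the vowel sign is drawn relative to the consonant is left entirely to rendering.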
Finally, Indic scripts have some interesting psycholinguistic properties.
With the exception of the inherent vowel, vowels are always represented
explicitly in the Indic scripts, so in that sense the scripts are
segmental. But vowels are clearly subordinate to consonants (except
in vowel-initial akṣara) since they are written as diacritics on the
consonant. As a result, various studies have shown that readers of Indian
languages written in Indic scripts have less overall segmental awareness
than readers of a language like English. Other studies have also
demonstrated psycholinguistic consequences of certain properties of
Indic scripts; for example, /i/ in Devanagari, which is written before
the consonant it logically follows, causes processing problems in that
words that contain /i/ take longer to identify and longer to name (i.e.,
pronounce aloud) in controlled experiments. I will discuss these issues
and cite the relevant studies in a later section.
The goal in this paper is threefold. First, I will introduce a formalism
for describing planar arrangements of symbols and apply that formalism
to selected phenomena in Indic scripts. Second, I will present in some
detail a computational model of Devanagari that was developed as part
of a Hindi text-to-speech system. Finally, I will explore some of the
results of psycholinguistic research on Indic scripts, and relate them to
the formal model previously developed: since the formal model as stat-
ed treats Indic scripts as segmental at an abstract level, this potentially
has psycholinguistic implications, which we will explore.
Formal Preliminaries
The relation between writing and the linguistic information it encodes
can naturally be thought of as a mapping in the algebraic sense. Consid-
er that we have a relation , that takes a string of linguistic units (say,
left catenation
downwards catenation
upwards catenation
surrounding catenation
The semantics of each of these is illustrated in Figure 1.
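Surrounding catenation seems not to be needed for Indic scripts,
though it is needed for Chinese; see (Sproat, 2000). However, most
Indic scripts employ all the other catenators; I will give illustrations in
the next section. Note that catenators are associative within the same
type, so that, writing $\rightarrow$ for left catenation, $(\alpha \rightarrow \beta) \rightarrow \gamma = \alpha \rightarrow (\beta \rightarrow \gamma)$; however, across types this is not
the case, so that in general, writing $\downarrow$ for downwards catenation,
$(\alpha \rightarrow \beta) \downarrow \gamma \neq \alpha \rightarrow (\beta \downarrow \gamma)$.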
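As a rough illustration of the associativity property, planar expressions can be modeled as small trees in which nested uses of the same catenator are flattened. The class `Cat`, the helper `cat`, and the operator labels below are hypothetical names chosen for this sketch; they are not part of the formalism itself.

```python
from dataclasses import dataclass
from typing import Tuple, Union

# Hypothetical labels for the five planar catenators.
LEFT, RIGHT, DOWN, UP, SURROUND = "left", "right", "down", "up", "surround"


@dataclass(frozen=True)
class Cat:
    """A catenation of two or more parts under a single operator."""
    op: str
    parts: Tuple["Expr", ...]


Expr = Union[str, Cat]


def cat(op: str, a: Expr, b: Expr) -> Cat:
    """Catenate a and b with op, flattening nested uses of the same op.

    Flattening makes (a op b) op c and a op (b op c) the same object,
    which is exactly associativity within one catenator type.
    """
    parts = []
    for x in (a, b):
        if isinstance(x, Cat) and x.op == op:
            parts.extend(x.parts)
        else:
            parts.append(x)
    return Cat(op, tuple(parts))


same_type_left = cat(LEFT, cat(LEFT, "a", "b"), "c")
same_type_right = cat(LEFT, "a", cat(LEFT, "b", "c"))
mixed_one = cat(DOWN, cat(LEFT, "a", "b"), "c")
mixed_two = cat(LEFT, "a", cat(DOWN, "b", "c"))
```

Here `same_type_left == same_type_right` holds, while `mixed_one` and `mixed_two` remain distinct, mirroring the failure of associativity across catenator types.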
The planar catenators are, of course, similar to the traditional way of
describing the placement of vowel symbols in Indic scripts, by using
a dash to represent the consonant, and by placing the vowel symbol in
[Figure 1: the planar catenators, panels (a) through (e), showing the relative placement of γ(α) and γ(β).]
vowel. Consonant sequences are represented orthographically as
ligatured groups. Thus s + k yields sk; k + [.s] + m yields
k[.s]m. Ligatures may
(in Devanagari i) appear before the entire sequence: /ski/ is
represented i+sk.
The anusvara symbol, representing a nasalized vowel or a
postvocalic homorganic nasal stop, appears above the rightmost
symbol of the akṣara: h[aa]M. Strictly speaking, a purely
nasalized vowel should be written with candrabindu (thus
h[aa]M), which has the same placement as anusvara, but this
is often not observed.
Orthographic-syllable-initial /r/ is represented by a superscript
symbol (repha or arka) occurring above the last symbol of the
akṣara: thus /varm[aa]/ is represented as vm[aa]+r.
The procedure for marking orthographic syllables, given a string of
phonemes, that applies to most Indic scripts (Tamil is an exception) can
be stated as follows:
Divide the phonological string into orthographic sylla-
bles by placing a syllable boundary :
At the beginning of the word;
Between each pair of adjacent (non-tautosyllabic)
vowels (as between the /[aa]/ and /[ii]/ in bhaaii
‘brother’)
Catenator     Full form               Diacritic form
(Null)        a                       ka
left          [aa], o, [au], [ii]     k[aa], ko, k[au], k[ii]
upwards       e, [ai]                 ke, k[ai]
downwards     u, [uu], [ri]           ku, k[uu], k[ri]
right         i                       ki

Table 1: Full and diacritic forms for Devanagari vowels, classified by catenator
inherent to the diacritic forms.
Lexical specifications for certain ligatures, such as k + [.s] =
k[.s], will override the general ligature. By default the catenation
operator is , but in certain instances, especially when the full form
+ can be ( or )
'
one can sometimes combine a pair of consonants in more than one way:
' '
ll . Sequence-final /r/, is always produced
by a reduced form, which we will notate as +*-,. $/ , subscripted (or
in some cases idiosyncratically fused with) the previous consonant:
0/ $ 1*-,2. $/ so that, e.g., 3 +4 becomes 5
pr .
If represents the full (syllable-initial) form of , then the rule
for graphically encoding syllable-initial vowels can be given as follows:
.
That is, /a/ deletes, and all other vowels are represented by their
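diacritic form. The catenation operator depends upon the vowel. For a
consonant sequence and a vowel, the catenation operators are chosen
as follows:
left catenation (vowel to the right of the consonant sequence) if the vowel is one of /[aa], o, [au], [ii]/
upwards catenation (vowel above) if /e, [ai]/
downwards catenation (vowel below) if /u, [uu], [ri]/
right catenation (vowel to the left) if /i/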
The vowels /e,[ai],o,[au]/ allow a deeper analysis, to which we
will return in the next section.
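The vowel-dependent choice of catenator can be sketched as a simple lookup. The names `DEVANAGARI_VOWEL_CATENATOR` and `attach_vowel` are hypothetical, the operator labels are plain strings rather than the paper's symbols, and the first class is read as containing [aa] (as in Table 1), since /a/ itself deletes.

```python
# Catenator chosen for each Devanagari diacritic vowel (labels are assumptions).
DEVANAGARI_VOWEL_CATENATOR = {
    "aa": "left", "o": "left", "au": "left", "ii": "left",  # written to the right
    "e": "up", "ai": "up",                                  # written above
    "u": "down", "uu": "down", "ri": "down",                # written below
    "i": "right",                                           # written to the left
}


def attach_vowel(consonants: str, vowel: str):
    """Return the planar expression for a consonant sequence plus a vowel.

    The inherent vowel /a/ deletes, so the consonant sequence is returned alone;
    otherwise the result is a (catenator, consonants, vowel) triple.
    """
    if vowel == "a":
        return consonants
    return (DEVANAGARI_VOWEL_CATENATOR[vowel], consonants, vowel)


ski = attach_vowel("sk", "i")   # the vowel precedes the whole sequence
ka = attach_vowel("k", "a")     # inherent vowel: nothing is written
```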
Anusvara and candrabindu are simply conjoined with their logical
predecessor using the operator:
$/ .
This places the arka above the remainder of the akṣara; to place
it on the righthand side of the akṣara we assume a general stylistic
principle of Devanagari that prefers to place superscripted material at
the righthand edge of the akṣara. We will see in the next section that
this is useful beyond this one case.
Catenator    Example
             [sh] + p = [sh]p
             [.n] + [.th] = [.n][.th]
             s + t = st
             k + r = kr
             b + d = bd
             t + s = ts
             [ng] + k = [ng]k
             r + k = rk
             p + y = py
(fused)      k + [sh] = k[sh]

Table 2: Catenators used for Oriya consonant sequences. In many of the
above cases, the glyphs take on a reduced form, and ligaturing applies. Note
the representation of rk with a superscript reduced r, as in Devanagari.
k[sh] is represented with a fused glyph, also as in Devanagari.
ORIYA
Oriya (Mahapatra, 1996) is a Northern Indic script distantly related to
Devanagari (Salomon, 1996), though quite different in appearance. The
basic design of the Oriya script is of course identical to that of Devana-
gari: syllable-initial vowels are represented with their own symbols,
otherwise dependent vowel symbols are used, so that the orthographic
syllable parsing procedure given above for Devanagari also works for
Oriya. Consonant sequences are represented by combining the single
consonants using various catenators and often involving ligaturing the
consonants together.
Oriya consonant sequences are complex, and there is much sequence-specificity
in the choice of catenators (often or , but sometimes ),
as well as the degree of fusion. Simple cases include [sh] plus p,
which becomes [sh]p; and [.n] plus [.th], which becomes [.n][.th];
both using the operator. Various examples are
given in Table 2.
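The diacritic vowels are more orderly. Oriya simplex diacritic vowels
are shown in Table 3. The grammar for these vowels is similar to that
for Devanagari, except that /i/ uses upwards rather than right catenation,
and /e/ uses right rather than upwards catenation:

left catenation (vowel to the right) if the vowel is one of /[aa], [ii]/
upwards catenation (vowel above) if /i/
downwards catenation (vowel below) if /u, [uu], [ri]/
right catenation (vowel to the left) if /e/

Catenator    Diacritic form
(Null)       ka
left         k[aa], k[ii]
upwards      ki
downwards    ku, k[uu], k[ri]
right        ke

Table 3: Forms for simplex diacritic Oriya vowels, classified by catenator.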
The three vowels /o, [ai], [au]/ are expressed using combinations
of symbols requiring more than one catenator. Thus ko is a
combination of diacritic e and diacritic [aa], thus
; k[ai] is a combination of diacritic e and a new superscript diacritic,
thus ; and k[au] is a combination of all three, thus .5 Such
complex vowels are common in Indic scripts (as well as Indic-derived
scripts of Southeast Asia, such as Thai), and Devanagari is more the
exception than the rule in seemingly lacking them.
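The composition of the Oriya complex vowels can be pictured with a small table of component sets. This is a sketch only: the name `COMPLEX_VOWELS` and the helper `components` are hypothetical, the label `DIPH` stands for the additional superscript diacritic, and the catenator attached to each component is deliberately left out.

```python
# Component symbols of the Oriya complex diacritic vowels, as described above.
COMPLEX_VOWELS = {
    "o": frozenset({"e", "aa"}),
    "ai": frozenset({"e", "DIPH"}),
    "au": frozenset({"e", "aa", "DIPH"}),
}


def components(vowel: str) -> frozenset:
    """Return the set of diacritic symbols that spell a vowel.

    Simplex vowels are spelled by a single symbol, namely themselves.
    """
    return COMPLEX_VOWELS.get(vowel, frozenset({vowel}))


# /au/ uses exactly the symbols of /o/ and /ai/ taken together.
au_is_union = components("au") == components("o") | components("ai")
```

The observation that `au_is_union` holds is simply the statement that k[au] combines all three symbols.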
Let us assume that the complex vowels are represented as a set of
symbols each with their own catenator; furthermore let us assume that
the glyph for [ai] and [au] represents a ‘diphthong’ feature
[ai]
DIPH (since historically at least, these vowels were diphthongs). Then:
. Finally, au is just a combination of e , [aa] and DIPH,
with the placement of the e +DIPH complex above the [aa] being
again due to the stylistic principle: . Thus, in detail:
=
=
DIPH
=
DIPH =
This decomposition also carries over to the Devanagari full form vowels.
Thus o is (full) [aa] plus diacritic e: . [ai] is (full)
e plus DIPH: . And [au] is (full) [aa] plus diacritic e
plus DIPH: .6 Note that this explains the otherwise puzzling fact that
for full [ai], there is only one stroke above the e, whereas for all
other ‘diphthongs’ there are two: in this case the single stroke represents
the DIPH feature, and the second stroke, which would normally
represent diacritic e, is unnecessary since we already have the full
form e. This system for representing e, o, ai and au
would seem to have been inherited from Brahmi; see (Salomon, 1996,
Table 30.2, page 374).
KANNADA
Kannada is a Southern Brahmi-derived script used to write the
Kannada language (Spencer, 1950; Bright, 1996b). The underlying
structure of the script is, again, essentially the same as that of other Indic
scripts, though the surface details differ. Consonant clusters are written
in a much more regular fashion than in Devanagari or Oriya, with the typical
mode of expression being to use the downwards catenator to place a reduced
form of a non-initial consonant underneath the consonant it logically
follows: thus l becomes l , so that .
If there are more than two consonants in the sequence, then the
subsequent consonants are not stacked, but are written in a left-to-right
sequence following the reduced and subscripted second consonant:
lak[.s]ma[.n]a ‘Lakshmana’ (Spencer, 1950, page 99).
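The layout just described, with the second consonant subscripted and any further consonants written in line after it, can be sketched as below. The function name `kannada_cluster` and the labels "base", "down" and "left" are hypothetical conveniences of this sketch, not notation from the paper.

```python
def kannada_cluster(consonants):
    """Assign a placement to each consonant of a cluster.

    The first consonant is the full-sized base, the second is reduced and
    placed underneath it, and any later consonants follow left-to-right
    after the subscripted one instead of being stacked further.
    """
    layout = []
    for index, consonant in enumerate(consonants):
        if index == 0:
            placement = "base"
        elif index == 1:
            placement = "down"
        else:
            placement = "left"
        layout.append((placement, consonant))
    return layout


# The k[.s]m cluster of lak[.s]ma[.n]a.
ksm = kannada_cluster(["k", "[.s]", "m"])
```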
This motivates a general rule for subscripted sequences whereby in
a sequence of downwards catenations, all but the initial catenator are changed to left catenation:
KANNADA LINEARIZATION:
(Note that this rule must apply right-to-left.) Thus the
used for vowels, which are limited to and , with being used only
for [ri] , [rii] and the lower symbol of ai (a complex sym-
bol). Table 4 gives the analysis of Kannada diacritic vowels in terms
of these three catenators. I also include the symbol sets for complex
vowels, where LONG is used for a glyph that seems to be associat-
ed only with long vowels.8 With consonant sequences the vowels are
standardly described as being associated with the first (inline) consonant
(Bright, 1996b). According to the ADJACENCY AXIOM, vowels
that are written above the akṣara (upwards catenation) must appear above the first consonant,
which they do, consistent with the standard description. For
vowels that have a left-catenated or downwards-catenated component, these components must appear
to the right of or below the vertically arranged consonant cluster.
Thus we have (with left catenation)
ka[.n][.n][ii]ru ‘tears’ and lak[.s]m[ii][.s]a
‘husband of Laxmi (Vishnu)’ (Spencer, 1950, pages 29, 341); here
the right component of [ii] appears after the consonant sequence.
Similarly (with downwards catenation) we have
kr[ai]sta ‘Christian’ and
[ee][sh]v[ai]kya ‘universal unity’ (Spencer, 1950, pages 156, 343),
where the lower component of [ai] appears after the sequence of
lowered consonants. Note in particular the application of KANNADA
LINEARIZATION to the righthand component of [ai]. It is worth
observing that while this arrangement is consistent with the ADJACENCY
AXIOM, it contradicts the standard description of the arrangement of
vowels with consonant sequences.

Catenator     Diacritic form
(Null)        ka
left          k[aa], ku, k[uu]
upwards       ki, ke, k[au]
downwards     k[ri], k[rii]

Complex vowel   Components (catenator)
k[ii]           i (upwards), LONG (left)
ko              e (upwards), [uu] (left)
k[ai]           e (upwards), [ai] (downwards)
k[ee]           e (upwards), LONG (left)
k[oo]           e (upwards), [uu] (left), LONG (left)

Table 4: Forms for diacritic Kannada vowels, classified by catenator. Complex
vowels are given at the bottom of the table. Note that [rii] only occurs
in Kannada transliterations of Sanskrit (Bill Bright, p.c.).
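The component analysis of the Kannada complex vowels happens to be mirrored in the canonical decompositions that Unicode assigns to the corresponding vowel signs, which gives a quick way to inspect it. The sketch below uses Python's standard `unicodedata` module; the dictionary `KANNADA_SIGNS` and the helper `decompose` are hypothetical names, and the parallel with the analysis in Table 4 is offered as an observation rather than as part of the model.

```python
import unicodedata

# Kannada dependent vowel signs that Unicode treats as decomposable.
KANNADA_SIGNS = {
    "ii": "\u0cc0",
    "ee": "\u0cc7",
    "ai": "\u0cc8",
    "o": "\u0cca",
    "oo": "\u0ccb",
}


def decompose(sign: str):
    """Return the code points of the canonical (NFD) decomposition of a sign."""
    return [f"U+{ord(ch):04X}" for ch in unicodedata.normalize("NFD", sign)]


# U+0CBF is vowel sign i, U+0CC6 is vowel sign e, U+0CC2 is vowel sign uu,
# U+0CD5 is the length mark, and U+0CD6 is the ai length mark.
decompositions = {name: decompose(sign) for name, sign in KANNADA_SIGNS.items()}
```

For instance, the entry for "oo" comes out as e, then uu, then the length mark, in parallel with the three components listed for k[oo].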
TAMIL
From a formal point of view, Tamil (Steever, 1996; Radhakrishnan,
2002) is probably the simplest Indic script since it has eliminated the
akṣara as a fundamental unit of the script. Virtually all consonant conjuncts
have been eliminated, and consonant sequences are written left-to-right
with unligatured consonant symbols. Steever (1996, page 426)
suggests that this trend may have been related to the introduction of
typography from the West. However, the vowels behave as they do in
other Indic scripts: syllable-initial vowels are written in a full form,
whereas postconsonantal vowels are written with a diacritic form, with
various of the four catenators being used, depending upon the vowel.
Given that consonant sequences have been completely linearized in
Tamil, with all consonant sequences being constructed using the macro-
scopic catenator , there would seem to be no motivation for maintain-
ing that the SLU is the orthographic syllable. Rather the SLU in Tamil
is a maximally CV(V) unit, with the possibilities being V(V) (syllable-
initial short or long vowel), C (consonant-sequence-medial consonant)
and CV(V) (consonant with diacritic short or long vowels). Ignoring
for the moment the decomposability of CV(V) elements and the fact
that there are simplex vowel glyphs representing both long and short
vowels, Tamil orthography is almost what I termed in (Sproat, 2000) a
core syllabary, like Japanese kana. Tamil is moving towards being an
alphasyllabic version of kana.
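The three possible unit shapes, V(V), C and CV(V), suggest a very simple greedy segmentation, sketched below. This is an illustration under stated assumptions and not the procedure of the paper: the function name `tamil_slus` and the vowel inventory `VOWELS` are hypothetical, and long vowels are treated as single tokens so that V and V(V) are handled alike.

```python
# Transliterated vowel tokens; long vowels are single tokens (an assumption).
VOWELS = {"a", "aa", "i", "ii", "u", "uu", "e", "ee", "ai", "o", "oo", "au"}


def tamil_slus(phonemes):
    """Group a phoneme sequence into units of shape V, C or CV.

    A consonant immediately followed by a vowel forms a CV unit with it;
    any other consonant, and any vowel not preceded by a consonant,
    stands as a unit of its own.
    """
    units = []
    i = 0
    while i < len(phonemes):
        current = phonemes[i]
        has_next = i + 1 < len(phonemes)
        if current not in VOWELS and has_next and phonemes[i + 1] in VOWELS:
            units.append((current, phonemes[i + 1]))
            i += 2
        else:
            units.append((current,))
            i += 1
    return units


# hek[.t][ee]: the k stands alone, and the vowel [ee] belongs with [.t].
units = tamil_slus(["h", "e", "k", "[.t]", "ee"])
```

Because the vowel is grouped only with the consonant immediately before it, a vowel diacritic written to the left can never land in front of an earlier consonant of the cluster.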
If the SLU is the CV(V) unit, not an orthographic syllable, the theory
predicts that vowel diacritics in Tamil can only attach to the final consonant
in a consonant sequence. In particular, a vowel that uses the right
catenator must appear before the last consonant of a sequence, and cannot,
for example, appear before the initial consonant, as it would in Devanagari
or Oriya. This is because appearance before a non-final consonant
would require deviation from the macroscopic order in a unit larger than
the SLU. This prediction is correct: thus hek[.t][ee] is written
with the ee symbol appearing before the [.t].
A procedure that would be appropriate for dividing Tamil words into
SLU’s would be as follows:
that, for instance [.n] is , whereas [.n][uu] is , which is quite
different from what one might expect on the basis of puu . Tamil
seems to be evolving towards kana in another regard, namely that the
CV(V) units are becoming unanalyzable.
The diacritic forms of o and [au] involve the same symbol
sets as in Devanagari and Oriya, and in addition Tamil has [oo],
which is formed on the basis of [ee] (these latter two invented by
missionaries, Bill Bright, p.c.):
o po
of the [aa] glyph, though we assume that it is specified as catenating
Note that in Tamil the glyph is ligatured to the beginning
DIPH
A similar rule (replacing with ) would be required to encode
cancellation signs in other Indic scripts.
Computational analysis of Devanagari
As part of a Hindi text-to-speech system that we were developing at
Bell Labs (Narasimhan, Sproat, and Kiraz, 2003; Sproat, 1997b) (and
see (Bhaskararao and Mathew, 1992; Bhaskararao, Peri, and Updikar,
1994) for other work on Hindi TTS), we developed a practical imple-
mentation of Hindi orthography as part of the text analysis component
of the system. The task of a text-analysis component of a TTS system is
to map from text into a linguistic representation; it can thus be thought
of as a computational model of what humans do when they read aloud.
The Hindi text-analysis component transformed raw text into a phono-
logical representation, including as part of that process the expansion
of abbreviations, numbers and time expressions, as well as phonolog-
ical phenomena such as schwa deletion. All of this was implemented
in terms of the FST-based approach to text-analysis outlined in (Sproat,
1997a; Sproat, 2000).
The first component of the text-analysis module converted from Devanagari
text into an abstract orthographic representation, where symbols
are left-catenated together in their logical order. Thus the Devanagari
spelling of the word varm[aa] would be represented as varm[aa], and that of
the word Hindi would be represented as
hiNdii. Note that while this representation could also be thought
of as phonological (Hindi orthography being fairly close to a phone-
mic representation), phonological phenomena such as schwa-deletion
are not represented at this level.
Input text was represented using the mock ISO-8859 xdvng encoding,
for use with the X window system under Unix.10 The fonts were de-
signed to work under an ordinary browser with no special Devanagari-
specific rendering. Super- and subscripted vowels are handled in this
scheme by having the corresponding character codes appear in their log-
ical order (as in Unicode), but with the glyphs so designed that they will
over- or understrike the previous consonant. The placement of super- or
subscripts was thus technically correct if not always esthetically pleasing.
For symbols that are massively ‘out of order’, such as short i
ORTHSYLL:
i is handled by a pair of rules, one which converts a postconsonantal
i into a ‘trace’ (i), and a second that inserts a precursor for the
diacritic (non-full-form) i (i.*) after an orthographic syllable
boundary, before a consonant sequence that is followed by the i
trace:
I-TRACE: i → (i) / C __
ORTH-I: ∅ → i.* / [orthographic syllable boundary] __ C+ (i)
Note that the I-TRACE is needed since even though the i is
written before the consonant sequence, the consonant sequence still
behaves as if there is a vowel after it: the rule VBAR that later inserts
a vertical bar on the final consonant of the sequence is sensitive to the
presence of a vowel in order to find the end of the sequence.
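The two-rule treatment of short i can be sketched procedurally over a list of tokens. This is a hedged illustration, not the FST implementation: the token names, the boundary marker `#`, the consonant inventory, and the functions `i_trace` and `orth_i` are all assumptions made for the sketch.

```python
CONSONANTS = {"k", "s", "t", "l", "p", "r", "m", "n", "d", "h", "v"}
BOUNDARY = "#"      # orthographic syllable boundary (assumed marker)
I_TRACE = "(i)"     # trace left where the vowel logically belongs
I_GLYPH = "i.*"     # precursor of the diacritic form of i


def i_trace(tokens):
    """First rule: a postconsonantal i becomes a trace."""
    out = []
    for token in tokens:
        if token == "i" and out and out[-1] in CONSONANTS:
            out.append(I_TRACE)
        else:
            out.append(token)
    return out


def orth_i(tokens):
    """Second rule: insert the i precursor after a syllable boundary,
    before a consonant sequence that is followed by the trace."""
    out = list(tokens)
    k = 0
    while k < len(out):
        if out[k] == I_TRACE:
            j = k
            while j > 0 and out[j - 1] in CONSONANTS:
                j -= 1          # walk back to the start of the consonant sequence
            if j > 0 and out[j - 1] == BOUNDARY:
                out.insert(j, I_GLYPH)
                k += 1          # the trace has shifted one position to the right
        k += 1
    return out


ski = orth_i(i_trace(["#", "s", "k", "i"]))
```

The trace stays in place after the consonants, which is what lets a later rule that looks for a following vowel still find the end of the consonant sequence.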
In a similar way, preconsonantal r is positioned by a pair of rules
to the end of the orthographic syllable, as follows:
R - TRACE: r (r)
r. * (r)
ORTH - R:
In the case of r there is no reason to preserve the trace, so
this can be automatically deleted.
The two-stage process involving deletion and insertion can be viewed
as a poor man’s implementation of the directional catenators introduced
in the previous discussion, necessitated by the fact that the FST toolkit
used in the implementation effectively encodes only one catenation
operator. Thus a more faithful implementation would select for i
and leave the relative order of the glyphs up to the final rendering.
Postconsonantal vowels other than /i/ are spelled out in their diacritic
form:
a → a.*
aa → aa.*
ii → ii.*
u → u.*
uu → uu.*
e → e.*
ai → ai.*
o → o.*
au → au.*
ri → ri.*
Nasalized vowels are marked with candrabindu, though this is
supposed to be replaced by anusvara if the vowel diacritic is a superscript.
Thus we have puM with candrabindu, but peM with
anusvara.12 We had the following rules for nasalized
vowels:
aM → a.* [candrabindu]
aaM → aa.* [candrabindu]
iiM → ii.* [anusvara]
uM → u.* [candrabindu]
uuM → uu.* [candrabindu]
eM → e.* [anusvara]
aiM → ai.* [anusvara]
oM → o.* [anusvara]
auM → au.* [anusvara]
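The choice between candrabindu and anusvara in the rules above amounts to a table lookup, which can be sketched against Unicode Devanagari rather than the encoding used in the system. The names `NASAL_MARK`, `MATRA` and `nasalized` are hypothetical; only the vowels covered by the rules above are included.

```python
CANDRABINDU = "\u0901"
ANUSVARA = "\u0902"

# Nasalization mark per vowel, following the rule list above.
NASAL_MARK = {
    "a": CANDRABINDU, "aa": CANDRABINDU, "u": CANDRABINDU, "uu": CANDRABINDU,
    "ii": ANUSVARA, "e": ANUSVARA, "ai": ANUSVARA, "o": ANUSVARA, "au": ANUSVARA,
}

# Unicode dependent vowel signs; the inherent vowel /a/ has no sign.
MATRA = {
    "a": "", "aa": "\u093e", "ii": "\u0940", "u": "\u0941", "uu": "\u0942",
    "e": "\u0947", "ai": "\u0948", "o": "\u094b", "au": "\u094c",
}


def nasalized(consonant: str, vowel: str) -> str:
    """Spell consonant + nasalized vowel in logical order."""
    return consonant + MATRA[vowel] + NASAL_MARK[vowel]


PA = "\u092a"
puM = nasalized(PA, "u")   # subscript vowel sign, so candrabindu
peM = nasalized(PA, "e")   # superscript vowel sign, so anusvara
```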
Full vowels are encoded using a set of rules similar to the above
rules for diacritic vowels, except that the rules are context independent
(the post-consonantal vowels having already been taken care of by the
rules above), and the full rather than the diacritic form is used. Rules
were also included for English open-o /ɔ/ (full form ).

divided into two classes, those that need a vertical bar to be complete in
final position (e.g. p, b, [bh], n), and those that
do not (e.g. t., [t.h])
Non-VBAR consonants, including consonant sequences ending in
r or y (which cannot include any following consonant) can be
translated directly into glyphs. Thus, for example:
pr 5
sr
t.k
d.y
She ties this claim in with a point made earlier on the same page to the
effect that recent phonological theories have argued against a primitive
status for segments:
We first need to observe that Faber is raising two quite distinct issues
here. The first relates to the issue of speakers’ conscious awareness of
their language, and their ability to consciously manipulate phonological
units of a particular kind. The second relates to speakers’ unconscious
knowledge of language, and phonological and phonetic theories
of that knowledge. The two are not necessarily connected. It is perfect-
ly conceivable, for example, that segments in the traditional structural-
ist linguistic sense are a primitive of people’s unconscious phonological
system, yet they do not become consciously aware of segments unless
explicitly trained (any more than untrained people are conscious of how
their articulators make sounds).
Yet, one activity that would surely seem to require a conscious aware-
ness of segments is the design of a segmental writing system, so if con-
scious awareness of segments is an epiphenomenon we would seem to
have a classic chicken-and-egg problem: how were segmental systems
invented in the first place? For the Greek alphabet, the main question is
how the vowel symbols developed, since the consonants were already
provided by the Phoenician precursor. Faber’s explanation (Faber, 1992,
page 126) is that the vowels were a misinterpretation (emphasis hers) of
several of the Phoenician consonant symbols as vowel symbols, due to
the misperception of consonant sounds not found in Greek. 16 According
to this story, then, the Greek alphabet is the accident that happened just
once. But what about various other scripts including the alphasyllabic
scripts of South Asia and Ethiopia, as well as Hankul? In all of these
scripts the (orthographic) syllable plays a key role in the arrangement of
symbols, yet the design of the system is basically segmental in nature.
Faber defines these cases away by claiming that only scripts where con-
sonants and vowels are both represented (thus Semitic scripts fail), are
on a par (thus Indic scripts fail), and are linearly arranged (thus Hankul
fails), count as alphabetic and thus segmental. So, in Faber’s classifica-
tion, Indic scripts do not count as alphabetic.
Now, there is certainly evidence that people who were taught to read
Indic scripts (and are not literate in a purely alphabetic script such as
English), are less segmentally aware than their counterparts in places
where alphabetic scripts are used. This in turn is in keeping with Faber’s
expectations.
Consider, for example, work by Padakannaya (2000) on Kannada,
where he experimented with a variety of tasks that tested the abilities
of children of different ages to manipulate the sounds of words. These
tasks included recognition or manipulation of syllables:
Syllable deletion;
Syllable reversal;
as well as tasks that involved recognition or manipulation of segments:
Phoneme oddity: given four stimuli, determine the one stimu-
lus for which there is a different phoneme at a specified position.
Thus for nonsense words chota, beti, kale, mito, the odd one out
is kale since there is an /l/ rather than a /t/ in the third position;
Phoneme deletion;
Phoneme reversal.
Padakannaya compared two sets of children in school grades I through
VII. The first set were sighted children learning the standard Kannada
alphasyllabary, and the second were blind children who were learning a
purely alphabetic Kannada braille.
On syllable manipulation tasks there was no significant difference
found between the two sets of children. So for example, except for first
grade, where blind children outperformed sighted children, blind chil-
dren and sighted children were identical in their performance on syllable
deletion. On the other hand, for phoneme manipulation, blind children,
exposed to the alphabetic script, consistently outperformed sighted chil-
dren. For example, Figure 2 shows the performance in terms of percent
correct for the phoneme reversal task for both populations of children
from grades I through VII. For the first three grades, where the blind
children steadily improve in performance, the sighted children are not
able to do the task at all. In fourth grade sighted children start to be able
to do the task, but it is not until fifth grade that they start to catch up with
the blind children’s ability; not coincidentally, it is in the fifth grade
that English is first introduced into the curriculum. Results showing
similarly lower performance for segment manipulation tasks by read-
ers of Indic scripts are reported in (Prakash et al., 1993; Padakannaya,
1999; Karanth, 2002), inter alia.
On the other hand, as we can see even in Figure 2, segmental ma-
nipulation is not a categorical ability: by fourth grade, before they
started learning English, the sighted children had developed some, albeit
weak, ability to reverse phonemes. One factor that seems to affect
phonemic awareness with readers of Indic scripts is how “separable”
or “noticeable” particular glyphs are (Prakash et al., 1993,
and Prakash Padakannaya, p.c.). For example, Padakannaya and his
colleagues (1993, page 65) note that their Hindi subjects were 95% successful
in a phoneme deletion task at deleting the /d/ in
do[sh][ii] to yield osh[ii]. Here, the d forms a separate glyph
hard to treat anusvara and arka as separate segments, since they are
written as diacritics above the end of the orthographic syllable:
peMsil; aka+r (arka). On the other hand, this is easy for
Kannada speakers since in each case these symbols are written inline in
Kannada script (though in the case of arka, out of its “logical” order):
peMsil; aka+r.17
So some conscious segment-level manipulation is clearly possible, but
one might wonder how the segmental aspects of Indic scripts affect the
unconscious processing that is presumably involved when fluent readers
read words. A study by Vaid and Gupta (2002)
addresses this question. Vaid and Gupta investigated naming latencies
(the speed with which readers are able to vocalize printed words) in
40 Hindi-speaking adults, and naming errors (the numbers of errors
made when vocalizing printed words) among 10 Hindi-speaking chil-
dren (7.5–10.0 years old) with words containing short /i/, which is writ-
lems if a phonological syllable boundary intervened. In particular, in
cases like i+tlk /tilak/, the first syllable would simply be processed
as a unit, and it should not matter that the i is out of its logical
order; such cases should be as readily processed as cases where the
away from the detail that the upper stroke of the i is actually written
over the following consonant sequence as in ki. Similarly one
can describe the diacritic vowel [uu] in Oriya as occurring under
the consonant it logically follows as in k[uu], without the finer-grained
description that it actually occurs on the right at the bottom.
The justification for this is that it seems to be the right level of gran-
ularity to describe some of the psycholinguistic results we have dis-
cussed. So, it seems to be a factor that certain glyphs are written above
or below the main glyphs, but there is no justification for making a
finer-grained distinction than that. In Vaid and Gupta’s (2002) exper-
iment, the placement of diacritic i before the consonant sequence
was clearly relevant, whereas the fact that it also extends above the con-
sonant sequence was not, so far as one knows, a factor.
Of course, this is all extremely sketchy at this point since no psy-
cholinguistic work has yet been done that addresses the question of
whether, e.g., placement of a glyph on top to the right, or on top to
the left, makes a significant difference in some measurable behavior.
My expectation is that no such results will be forthcoming, but that it
will be possible to measure more effects of choice among the four basic
catenators used here.
Conclusions
The formal model presented in this paper allows for the succinct de-
scription of some quite intricate writing systems, such as the writing
systems of India. The theory is able to make predictions, as we saw in
the case of the placement of vowels in Tamil that deviate in their catenation
from the macroscopic direction, and the placement of the secondary
left- and downwards-catenated vowel components in Kannada, and it is able to provide
some insights into the commonalities among Indic scripts, as in the
case of the encoding of o in diphthongs in Devanagari and Oriya.
The model also has implications for human processing of scripts.
Since it treats Indic scripts as fundamentally segmental, despite their ob-
vious differences from Western alphabets, it implies that there should be
some evidence that readers of Indic scripts are able to develop some seg-
mental awareness, and that psycholinguistic studies of reading should
show some processing at the segment level rather than just at the syl-
lable level. These expectations are borne out: although readers of Indic
scripts are much less able to manipulate segments than readers of
alphabetic scripts, there are clearly cases where they are able to do so.
Moreover, the naming studies reported by Vaid and Gupta (2002) show
that processing of Devanagari occurs at least partly at
the level of the segment. Through the work of Padakannaya and others,
there is much that is now understood about the psycholinguistic effects
of learning Indic scripts, but there is also much that is not known. To
my mind this will be one of the most fruitful areas of future research on
Indian writing systems.
Acknowledgments
I wish to thank Bill Bright, Steve Farmer, Prakash Padakannaya,
Pramod Pandey, and George Thompson, for helpful comments on pre-
vious versions of this paper.
References
Aliprand, Joan, editor. 2003. The Unicode Standard, Version 4.0. Addison-Wesley
Professional, Reading, MA.
Bhaskararao, Peri and Suresh Mathew. 1992. Phonemic transcription rules for text-to-
speech synthesis of Hindi. In R.M.K. Sinha, editor, Computer Processing of Asian
Languages. TATA McGraw Hill, New Delhi.
Bhaskararao, Peri, Venkata Peri, and Vishwas Updikar. 1994. A text-to-speech sys-
tem for application by visually handicapped and illiterate. In Proceedings of ICSLP
94, pages 1239–1242, Yokohama.
Blevins, Juliette. 2002. Evolutionary Phonology. Ms., University of California,
Berkeley. Submitted to Cambridge University Press.
Bright, William. 1996a. The Devanagari script. In Peter Daniels and William Bright,
editors, The World’s Writing Systems. Oxford University Press, New York, NY,
pages 384–390.
Bright, William. 1996b. Kannada and Telugu writing. In Peter Daniels and William
Bright, editors, The World’s Writing Systems. Oxford University Press, New York,
NY, pages 413–419.
Browman, Cathe and Louis Goldstein. 1989. Articulatory gestures as phonological
units. Phonology Yearbook, 6:201–251.
Daniels, Peter. 1996. The study of writing systems. In Peter Daniels and William
Bright, editors, The World’s Writing Systems. Oxford University Press, New York,
NY, pages 3–17.
DeFrancis, John. 1989. Visible Speech: The Diverse Oneness of Writing Systems.
University of Hawaii Press, Honolulu, HI.
Diller, Anthony. 1996. Thai and Lao writing. In Peter Daniels and William Bright,
editors, The World’s Writing Systems. Oxford University Press, New York, NY,
pages 457–466.
Faber, Alice. 1992. Phonemic segmentation as epiphenomenon: Evidence from the
history of alphabetic writing. In Pamela Downing, Susan Lima, and Michael Noo-
nan, editors, The Linguistics of Literacy. John Benjamins, Amsterdam, pages 111–
34.
Farmer, Steve, J.B. Henderson, and Michael Witzel. 2002. Neurobiology, layered
texts, and correlative cosmologies: A cross-cultural framework for premodern his-
tory. Bulletin of the Museum of Far Eastern Antiquities, 72:48–90.
Fujimura, Osamu. 1994. The C/D model: A computational model of phonetic imple-
mentation. In DIMACS Series in Discrete Mathematics and Theoretical Computer
Science, volume 17, pages 1–20. American Mathematical Society.
Gill, Harjeet Singh. 1996. The Gurmukhi script. In Peter Daniels and William Bright,
editors, The World’s Writing Systems. Oxford University Press, New York, NY,
pages 395–398.
Haile, Getatchew. 1996. Ethiopic writing. In Peter Daniels and William Bright, edi-
tors, The World’s Writing Systems. Oxford University Press, New York, NY, pages
569–576.
Hopcroft, John and Jeffrey Ullman. 1979. Introduction to Automata Theory, Lan-
guages and Computation. Addison-Wesley, Reading, MA.
Johnson, C. Douglas. 1972. Formal Aspects of Phonological Description. Mouton,
The Hague.
Kaplan, Ronald and Martin Kay. 1994. Regular models of phonological rule systems.
Computational Linguistics, 20:331–378.
Karanth, Prathibha. 2002. Reading into reading research through nonalphabetic lens-
es: Evidence from Indian languages. Topics in Language Disorders, 22(5):20–31.
Karanth, Prathibha. 2003. Literacy and language processes — orthographic and struc-
tural effects. In Prathibha Karanth and Joe Rozario, editors, Learning Disability in
India: Willing the Mind to Learn. Sage, chapter 6, pages 145–160.
King, Ross. 1996. Korean Hankul. In Peter Daniels and William Bright, editors, The
World’s Writing Systems. Oxford University Press, New York, NY, pages 218–227.
Macdonell, Arthur. 1900. A History of Sanskrit Literature. Heinemann, London.
Mahapatra, B. P. 1996. Oriya writing. In Peter Daniels and William Bright, editors,
The World’s Writing Systems. Oxford University Press, New York, NY, pages 404–
407.
Mohri, Mehryar. 1997. Finite-state transducers in language and speech processing.
Computational Linguistics, 23(2).
Mohri, Mehryar, Fernando Pereira, and Michael Riley. 1998. A rational design for a
weighted finite-state transducer library. Lecture Notes in Computer Science, (1436).
Mohri, Mehryar and Richard Sproat. 1996. An efficient compiler for weighted rewrite
rules. In 34th Annual Meeting of the Association for Computational Linguistics,
pages 231–238, Santa Cruz, CA. ACL.
Narasimhan, Bhuvana, Richard Sproat, and George Kiraz. 2003. Schwa-deletion in
Hindi text-to-speech synthesis. International Journal of Speech Technology. forth-
coming.
Ohala, Manjari. 1983. Aspects of Hindi Phonology. Motilal Banarsidass, Delhi.
Padakannaya, Prakash. 1999. Reading disability and knowledge of orthographic prin-
ciples. Psychological Studies, 44:59–64.
Padakannaya, Prakash. 2000. Is phonemic awareness an artefact of alphabetic litera-
cy?! Presented at ARMADILLO 11, Texas A&M, October 13–14, 2000.
Padakannaya, Prakash. 2001. Syllabic and phonemic awareness in children acquiring
literacy in a semisyllabic script. Department of Psychology, University of Mysore.
Patel, P. G. 1993. Ancient India and the orality-literacy divide theory. In Robert
Scholes, editor, Literacy and Language Analysis. Lawrence Erlbaum Associates,
Hillsdale, NJ, pages 199–208.
Prakash, P., D. Rekha, R. Nigam, and P. Karanth. 1993. Phonological awareness, or-
thography and literacy. In Robert Scholes, editor, Literacy and Language Analysis.
Lawrence Erlbaum Associates, Hillsdale, NJ, pages 55–70.
Radhakrishnan, Sankaran. 2002. Tamil Script Learners Manual. University of Texas
at Austin, Department of Asian Studies, Austin, TX. Available at
inic.utexas.edu/asnic/radhakrishnan/pages/tamilscript.html.
Ratliff, Martha. 1996. The Pahawh Hmong script. In Peter Daniels and William
Bright, editors, The World’s Writing Systems. Oxford University Press, New York,
NY, pages 619–624.
Salomon, Richard. 1996. Brahmi and Kharoshthi. In Peter Daniels and William
Bright, editors, The World’s Writing Systems. Oxford University Press, New York,
NY, pages 373–383.
Sampson, Geoffrey. 1985. Writing Systems. Stanford University Press, Stanford, CA.
Smalley, William, Chia Koua Vang, and Gnia Yee Yang. 1990. Mother of Writing:
The Origin and Development of a Hmong Messianic Script. University of Chicago
Press, Chicago, IL.
Spencer, Harold. 1950. Kanarese Grammar. Wesley Press, Mysore. Revised edition
by W. Perston.
Sproat, Richard. 1997a. Multilingual text analysis for text-to-speech synthesis. Jour-
nal of Natural Language Engineering, 2(4):369–380.
Sproat, Richard, editor. 1997b. Multilingual Text to Speech Synthesis: The Bell Labs
Approach. Kluwer Academic Publishers, Boston, MA.
Sproat, Richard. 2000. A Computational Theory of Writing Systems. Cambridge
University Press, Cambridge.
Steever, Sanford. 1996. Tamil writing. In Peter Daniels and William Bright, editors,
The World’s Writing Systems. Oxford University Press, New York, NY, pages 426–
430.
Swiggers, Pierre. 1996. Transmission of the Phoenician writing system to the west.
In Peter Daniels and William Bright, editors, The World’s Writing Systems. Oxford
University Press, New York, NY, pages 261–270.
Vaid, Jyotsna and Ashum Gupta. 2002. Exploring word recognition in a semi-
alphabetic script: The case of Devanagari. Brain and Language, 81:679–690.
van Heuven, Vincent. 2002. The effects of diaeresis on visual word recognition in
Dutch. In Martin Neef, Anneke Neijt, and Richard Sproat, editors, The Relation of
Writing to Spoken Language, number 460 in Linguistische Arbeiten. Max Niemey-
er, Tübingen, pages 99–114.
Notes
1
As is well known, Sampson (1985) has argued that Hankul is a featural script.
DeFrancis (1989) has argued against this; see also Sproat (2000).
2
Though in Gurmukhi the full-vowel forms are replaced with vowel-specific “vow-
el bearers”, along with the diacritic form (Gill, 1996), and this trend of using vowel
bearers has developed further in Southeast Asian Brahmi derivatives.
3
I will use the term consonant sequence throughout to refer to a sequence of con-
sonants within an aks.ara. By so doing, I wish to emphasize the fact that these do not
generally represent phonological consonant clusters.
4
Some, like Chinese, have several different overall directions to choose from.
5
In Mahapatra’s Table 35.1 (1996, page 405), k[au] appears as with the
and glyphs unfused.
6
In Oriya, where the decomposition of /o,[ai],[au]/ is less controversial than the
proposal for Devanagari, the full vowel forms are more idiosyncratic.
7
Use of arka is not obligatory in Kannada: one can also write the sequence rC in
its logical way, with the full form of the r glyph and the second consonant in a
reduced form underneath it: e.g., rta .
8
Interestingly Spencer (1950) suggests that the inherent vowel /a/ is actually rep-
resented in the orthography by the headstroke that is included with many of the con-
sonant symbols—e.g. k —and which is lost in the reduced forms: consider the
second k of kk . His argument is that the headstroke is lost with some of the
vowels, as with ki . This seems doubtful, however: the headstroke is only lost
in the reduced forms, and with vowels that would conflict with the headstroke. Thus
the headstroke is not lost with ku .
9
Also used to ‘cancel’ inherent vowels and occasionally used in Devanagari as a
substitute for a ligatured consonant.
10
See http://www.sibal.com/sandeep/jtrans.
11
This rule does not handle the case of a homorganic nasal being written with anus-
vara, and hence being syllabified with the preceding aks.ara.
12
In practice, however, we found that the conventions for candrabindu usage are
often not adhered to; a later ‘stylistic’ rule allows anusvara to substitute for candra-
bindu.
13
http://www.research.att.com/sw/tools/lextools/
14
Though there have been recent attempts to decompose the complex symbols when
they are taught (Karanth, 2002).
15
The work in phonology that argues against the segment as a primitive of phono-
logical representation relies heavily on the concept that phonetic features are features
of syllables rather than segments. Segments appear phonetically as an epiphenomenon
of coordinated timing of specific features. This view has been expressed in vari-
ous phonological theories, but is perhaps most strongly associated with “articulatory
phonology” (Browman and Goldstein, 1989), and see also (Fujimura, 1994).
Problematic for that view is recent work by Blevins (2002) on historical metathe-
sis, which involves shifting articulatory features from one (segmental) position within
a syllable to another, typically flipping between pre- and post-rime position. The his-
torical evidence suggests that language learners have a strong propensity to associate
those features with one or another position, with change occurring when the position
decided on by learners of one generation differs from the position assumed by learn-
ers of the preceding generation. But if the anti-segmental view favored by Faber is
right, one has to wonder why learners would have any propensity at all to associate
the features with a particular segmental position. In particular, why could the features
not just remain associated with the syllable, with no preferred timing with respect to
other segments?
16
On the other hand, Swiggers (1996) seems to imply a rather more conscious de-
sign on the part of the Greeks than is suggested by Faber’s explanation.
17
While arka and anusvara make segmental manipulation easier, interestingly they
also make syllable deletion harder (Padakannaya, 2001).
18
Note that the naming latency difference between tilak and masjid probably cannot
be explained solely by the fact that masjid has more letters than tilak: both were slower
than cases like tasv[ii]r ‘picture’, which has the same number of letters as masjid.
19
George Thompson (p.c.) also notes that the notion of aks.ara itself precedes liter-
acy.