becky82/mteh

More than enough Hanzi (MteH)

A curated set of ~4,540 simplified Chinese characters for advanced language learners.

The MteH corpus is designed as an "endgame corpus" for advanced students: once you have learned these characters, you are practically "done for life" with studying simplified Chinese characters (congratulations!). More simplified Chinese characters certainly exist (in proper nouns, scientific terms, chengyu, Chinese history, online usernames, etc.), but at some point you have to draw the line and say "this is my endgame".

Currently, MteH focuses entirely on simplified Chinese characters, especially those you’ll encounter in mainland China and in HSK exams.

There is also an Anki deck (here), which already works but should be considered a work in progress. (On a computer, AnkiDraw lets you handwrite characters. On AnkiDroid, the built-in whiteboard feature does the same.)


Summary

The MteH corpus is built to minimize "missing" characters; any characters not included are extremely rare or niche. Version v0.1.3 merges the following corpora:

| # | Corpus | #chars | #used | Source / Reference |
|---|--------|--------|-------|--------------------|
| 1 | HSK 1.0 | 2,866 | 2,866 | pre-2010, 11 levels |
| 2 | HSK 2.0 | 2,663 | 2,663 | post-2010, 6 levels |
| 3 | HSK 3.0 | 3,000 | 3,000 | 2021 version 3.0 standards, 9 levels |
| 4 | HSK 3.1 | 3,088 | 3,088 | 2025 version 3.0 standards, 9 levels |
| 5 | TOCFL | 3,027* | 3,009 | Taiwan's TOCFL 3100 + 33 traditional chars |
| 6 | K-5 | 1,817 | 1,812 | K-5 word frequency |
| 7 | 通用规范汉字表 | 3,500 | 3,495 | Ministry of Education (2013) |
| 8 | 现代汉语常用字表 | 3,500 | 3,491 | Ministry of Education (1988) |
| 9 | primary school | 2,468 | 2,467 | China primary schools (2016) |
| 10 | Singapore | 1,655 | 1,655 | Singapore primary schools (2015) |
| 11 | Heisig | 3,018 | 3,017 | Heisig & Richardson, Remembering Simplified Hanzi I–II |
| 12 | Hoenig | 2,177 | 2,159 | Learn & Remember 2,178 Characters and Their Meanings |
| 13 | Jun Da | 4,485* | 4,254 | modern Chinese corpus |
| 14 | SUBTLEX | 4,462* | 4,184 | film and TV subtitle corpus |
| 15 | Tsai | 4,329* | 3,975 | Usenet newsgroups (1993-1994) |
| 16 | Wikipedia | 3,476* | 3,221 | Chinese Wikipedia |
| 17 | classical | 1,968* | 1,867 | prior to the end of the Han dynasty |
| 18 | THUOCL | 3,421* | 3,222 | mostly Sogou webpages |
| 19 | Leeds | 4,230* | 4,073 | Internet corpus |
| 20 | BLCU | 4,445* | 4,089 | "balanced", written Chinese |
| 21 | LWC | 4,130* | 3,961 | Sina Weibo |
| 22 | food | 1,182 | 1,101 | food-related terms |
| 23 | species | 4,086 | 3,211 | species names |
| 24 | Chinese surnames | 1,745 | 1,566 | 1,807 Chinese surnames |
| 25 | Chinese names | 2,269 | 1,989 | 1,200,000 Chinese names |
| 26 | city-geo | 1,277 | 1,133 | mainland China city terms |
| 27 | company | 4,363* | 3,645 | company proper nouns |
| 28 | med-orgs | 4,826 | 3,731 | medical organizations |
| 29 | chengyu convention | 2,226 | 2,172 | characters in "chengyu convention" chengyu |
| 30 | Xinhua | 5,357 | 4,081 | Xinhua chengyu and xiehouyu |

Corpora marked * required extraction steps (documented in their respective readmes): selecting the top-N words or characters, or converting from traditional to simplified characters.
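To make the two extraction steps concrete, here is a minimal sketch of what they might look like. It is not the code used to build MteH: the function names, the sample text, and the three-entry `T2S` table are all hypothetical, and a real pipeline would need a complete traditional-to-simplified mapping and a proper frequency source.

```python
from collections import Counter

# Hypothetical, deliberately tiny traditional-to-simplified table.
# A real pipeline would need a complete mapping.
T2S = {"學": "学", "漢": "汉", "語": "语"}


def is_cjk(ch: str) -> bool:
    """True for characters in the main CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def top_n_chars(text: str, n: int) -> list[str]:
    """Convert to simplified, then return the n most frequent hanzi."""
    simplified = (T2S.get(ch, ch) for ch in text)
    counts = Counter(ch for ch in simplified if is_cjk(ch))
    return [ch for ch, _ in counts.most_common(n)]


sample = "學漢語，学汉，学！"
print(top_n_chars(sample, 2))  # prints ['学', '汉']
```

Converting before counting means a traditional form and its simplified counterpart are tallied as one character, which is why 学 comes out on top in the sample.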

Characters are ordered in Unicode order (excluding variants), grouping visually or structurally related forms as much as possible.
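As a rough illustration of plain Unicode ordering, the sketch below sorts a set of characters by code point. The helper name `unicode_order` is hypothetical, and the sketch does not attempt the variant handling or the grouping of related forms described above.

```python
def unicode_order(chars: set[str]) -> list[str]:
    """Sort single characters by Unicode code point."""
    return sorted(chars, key=ord)


# 一 U+4E00, 字 U+5B57, 学 U+5B66, 汉 U+6C49
print("".join(unicode_order({"汉", "字", "一", "学"})))  # prints 一字学汉
```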

MteH also incorporates:

Statistics and debug reports: missing chars; corpus histogram; debug; modifications; syllables.
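The sketch below shows one way reports like these could be computed from plain character sets. It rests on assumptions: that "missing chars" means characters in a source corpus that are absent from MteH, and that the "corpus histogram" counts how many characters appear in exactly k corpora. The function names and toy data are hypothetical and are not taken from the repository.

```python
from collections import Counter


def missing_chars(corpus: set[str], mteh: set[str]) -> list[str]:
    """Characters of a source corpus that are absent from the merged set."""
    return sorted(corpus - mteh, key=ord)


def corpus_histogram(corpora: dict[str, set[str]]) -> Counter:
    """Map k to the number of characters that occur in exactly k corpora."""
    per_char = Counter(ch for chars in corpora.values() for ch in chars)
    return Counter(per_char.values())


# Toy data, not real corpus contents.
corpora = {"a": set("一二三"), "b": set("二三四"), "c": set("三")}
mteh = set("一二三")

print(missing_chars(corpora["b"], mteh))  # prints ['四']
print(len(corpora["b"] & mteh))           # prints 2
print(dict(corpus_histogram(corpora)))
```

Under the same assumptions, the intersection size printed on the second line corresponds to what the #used column in the table would measure for a given corpus.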


License

© 2025 Rebecca J. Stones
Licensed under CC BY-SA 4.0
