jcderyck/EncoDet

EncoDet

PHP code for detecting the character encoding and language of text files.

Encodings detected:

- 7-bit: ISO 2022, HZ, VN_HTML, ASCII
- CJK: GB18030, CP950, BIG5-HKSCS:2004, EUC-JP, CP932, CP949, JOHAB
- Vietnamese: VPS, VNI, VIQR, VISCII
- Single-byte non-Latin:
  - Cyrillic: CP1251, ISO-8859-5, MacCyrillic, KOI8-RU, CP855, CP866
  - Greek: CP1253, ISO-8859-3, MacGreek, CP737, CP869
  - Hebrew: CP1255, ISO-8859-8, MacHebrew, CP862
  - Arabic, Farsi, Urdu: CP1256, ISO-8859-6, MacArabic, CP864
  - Thai: MacThai, CP874
- Single-byte Latin: DOS OEMs, Windows ANSIs, MacOSs, ISO-8859-*

Latin languages detected: English, Breton, Catalan, Dutch, Galician, Icelandic, Portuguese, Brazilian, Danish, Norwegian, Esperanto, Finnish, French, German, Italian, Luxembourgish, Occitan, Spanish, Swedish, Tagalog, Estonian, Latvian, Lithuanian, Malay, Indonesian, Albanian, Bosnian, Croatian, Serbian, Czech, Hungarian, Polish, Romanian, Roman Macedonian, Slovak, Slovenian, Turkish, Vietnamese

Non-Latin languages detected: Russian, Bulgarian, Macedonian, Ukrainian, Cyrillic Serbian, Mongolian, Arabic, Farsi, Urdu, Hebrew, Greek, Hindi, Tamil, Telugu, Malayalam, Thai, Traditional Chinese, Simplified Chinese, Korean, Japanese

See example/Test_encoding for an example PHP script.
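
For readers curious how the 7-bit family in the encoding list can be told apart, here is a minimal sketch using only core PHP functions. It is not EncoDet's code or API: the function name `classify_seven_bit` and its return labels are hypothetical, and the checks are a simplified first pass (ESC bytes for ISO 2022, `~{` and `~}` markers for HZ) rather than a full detector.

```php
<?php
// Hypothetical helper for illustration only; not part of EncoDet.
// Classifies raw bytes as a rough 7-bit encoding candidate.
function classify_seven_bit(string $bytes): string
{
    // Any byte >= 0x80 rules out every 7-bit encoding.
    if (preg_match('/[\x80-\xFF]/', $bytes) === 1) {
        return '8-bit';
    }
    // ISO 2022 variants switch character sets with ESC (0x1B) sequences.
    if (strpos($bytes, "\x1B") !== false) {
        return 'ISO-2022 candidate';
    }
    // HZ wraps GB2312 runs between "~{" and "~}".
    if (strpos($bytes, '~{') !== false && strpos($bytes, '~}') !== false) {
        return 'HZ candidate';
    }
    // Nothing but plain 7-bit bytes remains.
    return 'ASCII';
}

echo classify_seven_bit('plain text'), "\n";
echo classify_seven_bit("caf\xE9"), "\n";
```

A real detector would go on to validate the escape sequences and score the decoded text, which is where language statistics come in.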
