scarletcho/KoLM

KoLM

KoLM is a Python-based Korean text processing package for building Korean language models.

Key features

1) utils: basic tools for Korean text processing

- File management
	- Easy reading and writing of text files, and file merging
- Encoding conversion
- Text processing
	- Cleaning up redundant whitespace
	- Removing TEI headers
	- Building per-sentence eojeol lists

2) normalize: text normalization

- Text normalization
- Splitting a long, continuous corpus into sentences
- Deleting non-Hangul characters
- Transcribing characters other than Korean running text
	- Reading Hangul jamo, Hanja, numbers, alphabet letters, and English words

3) tag: generating morphemes and pseudo-morphemes

- Morphological analysis (integrated with KoNLPy and Mecab)
- Generating two types of pseudo-morphemes from the morphological analysis output
	- Micro pseudo-morphemes (the smallest units, split at every morpheme boundary; micro)
	- Medium pseudo-morphemes (medium-sized units, split only between substantives and particles; medium)
	
	NB. Generating pseudo-morphemes requires text that has already been morphologically analyzed.
	    This code assumes output from the University of Ulsan UTagger morphological analyzer,
	    so additional code changes are needed if the input text file differs from UTagger output.

4) lm & g2p: generating pronunciation information and cleaned text for language model building

- Generating phone sequences from strings (Grapheme-to-Phone; G2P)
- Generating files for language model building
	- Generating the cleaned corpus text (textraw)
	- Generating the pronunciation dictionary (lexicon.txt)


Requirements

  • Python 2.7 or 3
  • Required Python packages:
    • KoNLPy, JPype1, korean, hanja, Mecab
    • [Note] The packages above are installed automatically when you install KoLM via pip


Installation

  • The latest version is available on PyPI:

      $ pip install kolm  
    

Tutorial: How to use KoLM

  • Corpus cleaning guide:

    • (1) Convert all text to UTF-8 encoding
      • utils.convertEncoding
    • (2) Concatenate all text into a single file and save it
      • utils.stackFiles
    • (3) Remove TEI headers (or other tags that are not subject to analysis)
      • utils.removeHeader
    • (4) Normalize the text
      • normalize.Knormalize
    • (5) Morphological analysis
      • tag.morphTag
    • (6) Extract pseudo-morphemes by comparing the raw text with its morphemes
      • tag.pseudomorph
    • (7) Generate the cleaned text (textraw) and the pronunciation dictionary (lexicon.txt)
      • lm.writeTextraw
      • lm.getUniqueWords
      • lm.writeLexicon
  • For concrete usage examples, see runKoLM.py.

1. utils

  • Start by importing every method in kolm.utils

      >> from kolm.utils import *  
    
  1. File management
    • readfileUTF8 (fname)

        # Read a specific UTF-8 encoded file (song15.txt)
        >> readfileUTF8('song15.txt')  
      
    • writefile (body, fname)

        # Write text (body) to a file (fname)
        >> writefile(body, fname)
      
    • stackFiles (path, stackFname, flist=[])

        # Merge all text files in mydir into a single file saved as mystack.txt
        >> stackFiles('mydir', 'mystack.txt')  
      
        # Merge specific files in mydir (song1.txt, song2.txt, song15.txt) into a single file saved as mystack.txt
        >> stackFiles('mydir', 'mystack.txt', ['song1.txt', 'song2.txt', 'song15.txt'])
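
As a rough illustration of what merging a directory into one file involves, the sketch below concatenates UTF-8 text files using only the Python 3 standard library. It is not KoLM's implementation: the name `stack_files_sketch`, the `.txt` filter, and the sorted file order are assumptions made for this example.

```python
import os


def stack_files_sketch(path, stack_fname, flist=None):
    """Concatenate UTF-8 text files under `path` into `stack_fname`.

    Hypothetical stand-in for illustration; not KoLM's stackFiles.
    """
    # Assumption: with no explicit list, take every .txt file, sorted by name.
    if not flist:
        flist = sorted(f for f in os.listdir(path) if f.endswith('.txt'))
    with open(stack_fname, 'w', encoding='utf-8') as out:
        for fname in flist:
            with open(os.path.join(path, fname), encoding='utf-8') as src:
                body = src.read()
            out.write(body)
            # Keep files from running together when one lacks a final newline.
            if body and not body.endswith('\n'):
                out.write('\n')
    return flist
```

Writing the stacked file outside `path` avoids picking it up again on a second run.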
      

  2. Encoding
    • convertEncoding (path, encodingSource, encodingDest, flist=[])

        # Convert the encoding of all text files in mydir from UTF-16 to UTF-8
        >> convertEncoding('mydir', 'utf-16', 'utf-8')  
          
        # Convert the encoding of specific files in mydir (song1.txt, song2.txt, song15.txt) from UTF-16 to UTF-8
        >> convertEncoding('mydir', 'utf-16', 'utf-8', ['song1.txt', 'song2.txt', 'song15.txt'])  
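
For readers who want to see the underlying idea, here is a minimal standard-library sketch of re-encoding files: decode the bytes with the source codec, then encode them with the destination codec. The function name `convert_encoding_sketch`, the in-place overwrite, and the `.txt` filter are assumptions for this example and may differ from what `convertEncoding` actually does.

```python
import os


def convert_encoding_sketch(path, encoding_source, encoding_dest, flist=None):
    """Re-encode text files under `path` in place.

    Hypothetical stand-in for illustration; not KoLM's convertEncoding.
    """
    # Assumption: with no explicit list, convert every .txt file in `path`.
    if not flist:
        flist = sorted(f for f in os.listdir(path) if f.endswith('.txt'))
    for fname in flist:
        full = os.path.join(path, fname)
        # Decode the raw bytes with the source codec ...
        with open(full, 'rb') as src:
            text = src.read().decode(encoding_source)
        # ... and write them back with the destination codec.
        with open(full, 'wb') as dst:
            dst.write(text.encode(encoding_dest))
    return flist
```

Because the files are overwritten, it is safer to try this on a copy of the corpus first.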
      

  3. Text management
    • tightenString (corpus)

        # Clean up and remove redundant whitespace in a text list
        >> tightenString(corpus)  
      
    • getEojeolList (sentlist)

        # Extract eojeol lists from a list of sentences
        >> getEojeolList(['짧은 문장을 넣었다', '새해 복', '집에 갔더니 밥이 없네'])
      
    • removeHeader (headeredfname)

        # Remove the TEI header from a file
        >> removeHeader(headeredfname)
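
The whitespace cleanup and eojeol extraction described for `tightenString` and `getEojeolList` can be approximated with a few lines of standard Python. This is only a sketch: the function names are hypothetical, and the return shapes (a list of cleaned lines, and one eojeol list per sentence) are assumptions rather than KoLM's documented behavior.

```python
import re


def tighten_sketch(corpus):
    """Collapse whitespace runs in each line and drop lines left empty."""
    tightened = (re.sub(r'\s+', ' ', line).strip() for line in corpus)
    return [line for line in tightened if line]


def eojeol_lists_sketch(sentlist):
    """Split each sentence on whitespace into its eojeols (spaced units)."""
    return [sent.split() for sent in sentlist]
```

For example, `eojeol_lists_sketch(['새해 복'])` yields one inner list holding the two eojeols of that sentence.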
      

2. normalize

  • Start by importing every method in kolm.normalize

      >> from kolm.normalize import *  
    
  1. Normalization
    • Knormalize (in_fname, out_fname)

        # Normalize a textfile  
        >> Knormalize(in_fname, out_fname)  
      
    • normalize (corpus)

        # Normalize a text list variable in workspace  
        >> normalize(corpus)  
      
    • bySentence (corpus)

        >> bySentence(corpus)  
      
    • removeNonHangul (line)

        >> removeNonHangul(line)  
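
To make the sentence-splitting and non-Hangul deletion steps concrete, the sketch below shows one simple way to do each with regular expressions. The names `remove_non_hangul_sketch` and `by_sentence_sketch` are hypothetical, and the rules used here (keep only precomposed Hangul syllables and whitespace; split after `.`, `?`, or `!`) are assumptions, not KoLM's actual rules.

```python
import re

# Precomposed Hangul syllables occupy U+AC00..U+D7A3.
NON_HANGUL = re.compile(r'[^\uAC00-\uD7A3\s]')


def remove_non_hangul_sketch(line):
    """Delete everything except Hangul syllables and whitespace."""
    return re.sub(r'\s+', ' ', NON_HANGUL.sub('', line)).strip()


def by_sentence_sketch(text):
    """Split after '.', '?' or '!' when followed by whitespace."""
    parts = re.split(r'(?<=[.?!])\s+', text.strip())
    return [p for p in parts if p]
```

Deleting characters loses information such as digits, which is why the reading functions in the next sections transcribe them into Hangul instead.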
      

  2. Character reading in Korean
    • Alphabets

      • readABC (line)

          >> readABC(line)  
        
      • readAlphabet (line)

          >> readAlphabet(line)  
        
    • Hanja (Chinese characters)

      • readHanja (line)

          >> readHanja(line)  
        
    • Hangul jamos (i.e. single letters which do not make a syllable)

      • readHangulLetter (line)

          >> readHangulLetter('ㅊ을 ㅈ으로 적었다')  
          치읓을 지읒으로 적었다  
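
The example above replaces standalone consonant letters with their Korean names. A minimal sketch of that idea is a lookup table applied character by character. The table below covers only the 14 basic consonants, and `read_hangul_letter_sketch` is a hypothetical name, not KoLM's code.

```python
# Names of the 14 basic consonant letters (Hangul compatibility jamo).
JAMO_NAMES = {
    'ㄱ': '기역', 'ㄴ': '니은', 'ㄷ': '디귿', 'ㄹ': '리을', 'ㅁ': '미음',
    'ㅂ': '비읍', 'ㅅ': '시옷', 'ㅇ': '이응', 'ㅈ': '지읒', 'ㅊ': '치읓',
    'ㅋ': '키읔', 'ㅌ': '티읕', 'ㅍ': '피읖', 'ㅎ': '히읗',
}


def read_hangul_letter_sketch(line):
    """Spell out standalone consonant letters; leave other characters as-is."""
    return ''.join(JAMO_NAMES.get(ch, ch) for ch in line)
```

Full syllables such as 을 are untouched because only isolated jamo appear as keys in the table.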
        

  3. Number reading in Korean
    • readNumber (line)

        >> readNumber(line)  
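
As an illustration of what reading numbers in Korean can involve, here is a small sketch that spells integers from 0 to 9999 with Sino-Korean numerals. It is an assumption-laden toy, not KoLM's `readNumber`: the function names are hypothetical, larger values are rejected, and native Korean numerals (used with many counters) are ignored.

```python
import re

DIGITS = ['', '일', '이', '삼', '사', '오', '육', '칠', '팔', '구']
UNITS = ['', '십', '백', '천']


def read_sino_korean(n):
    """Read an integer in 0..9999 with Sino-Korean numerals."""
    if not 0 <= n <= 9999:
        raise ValueError('this sketch only covers 0..9999')
    if n == 0:
        return '영'
    digits = str(n)
    out = []
    for i, ch in enumerate(digits):
        d = int(ch)
        if d == 0:
            continue
        unit = UNITS[len(digits) - 1 - i]
        # '일' is dropped before 십/백/천 (110 is read 백십).
        out.append(unit if (d == 1 and unit) else DIGITS[d] + unit)
    return ''.join(out)


def read_number_sketch(line):
    """Replace each digit run in `line` with its reading.

    Values above 9999 raise ValueError in this sketch.
    """
    return re.sub(r'\d+', lambda m: read_sino_korean(int(m.group())), line)
```

A production reader would also need to handle 만 and larger units, decimals, phone numbers, and counter-dependent native readings.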
      

3. tag

  • Start by importing every method in kolm.tag

      >> from kolm.tag import *  
    
  1. Morphemes
    • morphTag (in_fname, out_fname)

        # Morphological analysis with Mecab
        >> morphTag(in_fname, out_fname)  
      

  2. Pseudo-morphemes
    • morph2pseudo (raw_sentlist, morph_sentlist, type)

        # Generate a list of pseudo-morpheme (micro) sentences from sentence lists
        >> morph2pseudo(raw_sentlist, morph_sentlist, 'micro')  
      
        # Generate a list of pseudo-morpheme (medium) sentences from sentence lists
        >> morph2pseudo(raw_sentlist, morph_sentlist, 'medium')  
      
    • pseudomorph (rawText, morphText, pseudoType)

        # Generate a pseudo-morpheme (micro) sentence from a single sentence
        >> pseudomorph(rawText, morphText, 'micro')  
      
        # Generate a pseudo-morpheme (medium) sentence from a single sentence
        >> pseudomorph(rawText, morphText, 'medium')  
      

4. lm

  • Start by importing every method in kolm.lm

      >> from kolm.lm import *  
    
  • writeTextraw (corpus)

      # Generate the single cleaned corpus file (textraw)
      >> writeTextraw(corpus)  
    
  • getUniqueWords (text_fname)

      # Extract the list (wordlist.txt) of unique eojeols (or morphemes; i.e. the space-delimited units in the corpus)
      >> getUniqueWords(text_fname)  
    
  • writeLexicon (text_fname)

      # Generate the pronunciation dictionary (lexicon.txt) by applying G2P to the unique eojeol list
      >> writeLexicon(text_fname)  
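
The two steps above (collect unique space-delimited units, then pair each with a pronunciation) can be sketched as follows. This is not KoLM's implementation: the function names are hypothetical, `g2p` stands for any callable that maps a word to a space-separated phone string, and the `word phones` line layout follows the usual Kaldi lexicon convention as an assumption about the output format.

```python
def unique_words_sketch(lines):
    """Return the sorted set of whitespace-delimited units in `lines`."""
    words = set()
    for line in lines:
        words.update(line.split())
    return sorted(words)


def lexicon_lines_sketch(words, g2p):
    """Build one 'word phone phone ...' line per word.

    `g2p` is a caller-supplied callable (hypothetical here) that returns
    a space-separated phone string for a word.
    """
    return ['{} {}'.format(word, g2p(word)) for word in words]
```

With a real G2P function in place of the placeholder, the resulting lines could be written out one per row to form a lexicon file.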
    

5. g2p

  • Start by importing every method in kolm.g2p

      >> from kolm.g2p import *  
    
  1. Main
    • runKoG2P (hangeul_sequence, rulebook_path)

        # Run Korean G2P on a sequence  
        >> runKoG2P(hangeul_sequence, rulebook_path)  
      
    • runTest (rulebook, testset)

        # Run a test on a testset with a specific rulebook  
        >> runTest(rulebook, testset)  
      
    • readRules (pver, rulebook)

        >> readRules(pver, rulebook)  
      

  2. Auxiliaries
    • phone2prono (phones, rule_in, rule_out)

        >> phone2prono(phones, rule_in, rule_out)  
      
    • graph2prono (graph, rule_in, rule_out)

        >> graph2prono(graph, rule_in, rule_out)  
      
    • graph2phone (graphs)

        >> graph2phone(graphs)  
      
    • isHangul (charint)

        >> isHangul(charint)  
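
Grapheme-to-phone conversion for Korean usually starts by checking whether a code point is a Hangul syllable and splitting it into its letters. The sketch below shows the standard Unicode arithmetic for that step. The function names are hypothetical and this is not KoLM's `isHangul` or `graph2phone`; in particular, it returns jamo letters rather than phone symbols and applies no pronunciation rules.

```python
# Jamo inventories in Unicode order for precomposed syllables.
CHOSEONG = list('ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ')            # 19 initials
JUNGSEONG = list('ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ')        # 21 vowels
JONGSEONG = [''] + list('ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ')  # 28 finals


def is_hangul_syllable(charint):
    """True if the code point is a precomposed Hangul syllable."""
    return 0xAC00 <= charint <= 0xD7A3


def decompose_syllable(ch):
    """Split one syllable into (initial, vowel, final); final may be ''."""
    code = ord(ch)
    if not is_hangul_syllable(code):
        raise ValueError('not a Hangul syllable: %r' % ch)
    offset = code - 0xAC00
    # Each initial spans 21 * 28 = 588 code points; each vowel spans 28.
    return (CHOSEONG[offset // 588],
            JUNGSEONG[(offset % 588) // 28],
            JONGSEONG[offset % 28])
```

Pronunciation rules such as those loaded by `readRules` would then operate on the resulting letter sequence.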
      

About

Korean text normalization and language preparation package for LM in Kaldi-based ASR system
