cneud/dta_emb


dta_emb

Instructions and supporting tools for training word embeddings for historical German (ca. 1600-1900) with fastText, based on the texts of the Deutsches Textarchiv (DTA).

A pre-trained model can be obtained here [1.43 GB].

Linux

  1. Follow the instructions for building fastText

  2. Download the DTA normalized XML files
    wget -i dta_normalized.txt -P dta_normalized

  3. Transform the DTA normalized XML files into plain text
    for f in dta_normalized/*.xml; do xsltproc -o "${f%.xml}.txt" tei2txt.xsl "$f"; done

  4. Concatenate all plain text files into a single text file
    cat dta_normalized/*.txt > dta_normalized/dta_normalized_all.txt

  5. Compute word embeddings using fastText
    fasttext skipgram -input dta_normalized/dta_normalized_all.txt -output dta_emb
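
Steps 2 to 5 can also be chained in a single script. What follows is a minimal sketch and not part of the repository: the helper names (`txt_name`, `convert_all`, `concat_txt`) and the `RUN_PIPELINE` switch are hypothetical, and it assumes that the downloaded files end in `.xml` and that `wget`, `xsltproc` and `fasttext` are on the `PATH`.

```shell
#!/usr/bin/env bash
# Sketch of Linux steps 2-5 as one script (helper names are illustrative).
set -euo pipefail

DIR=dta_normalized
ALL="$DIR/dta_normalized_all.txt"

# Map an XML path to the plain-text path written next to it.
txt_name() {
  printf '%s.txt\n' "${1%.xml}"
}

# Run the stylesheet on every XML file, one output file per input.
convert_all() {
  local f
  for f in "$DIR"/*.xml; do
    [ -e "$f" ] || continue            # glob matched nothing
    xsltproc -o "$(txt_name "$f")" tei2txt.xsl "$f"
  done
}

# Concatenate all per-document text files into $2, skipping $2 itself
# so that a rerun does not feed the old corpus back into the new one.
concat_txt() {
  local dir=$1 out=$2 tmp f
  tmp=$(mktemp)
  for f in "$dir"/*.txt; do
    [ -e "$f" ] || continue
    if [ "$f" -ef "$out" ]; then continue; fi
    cat "$f" >> "$tmp"
  done
  mv "$tmp" "$out"
}

# Only run the full pipeline when explicitly asked to.
if [ "${RUN_PIPELINE:-0}" = 1 ]; then
  wget -i dta_normalized.txt -P "$DIR"
  convert_all
  concat_txt "$DIR" "$ALL"
  fasttext skipgram -input "$ALL" -output dta_emb
fi
```

Saved under a name of your choice (for example `pipeline.sh`, also hypothetical), it would be started with `RUN_PIPELINE=1 bash pipeline.sh`. Building the combined file in a temporary location and moving it into place keeps a rerun from appending the previous `dta_normalized_all.txt` to itself.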

Windows

  1. Download wget.exe from https://eternallybored.org/misc/wget/

  2. Download msxsl.exe from https://www.microsoft.com/en-us/download/details.aspx?id=21714

  3. Download fasttext.exe from https://github.com/xiamx/fastText/releases

  4. Download the DTA normalized XML files
    wget.exe -i dta_normalized.txt -P dta_normalized

  5. Run a batch script to convert the XML files to a combined plain text file
    dta2txt.bat

  6. Compute word embeddings using fastText
    fasttext.exe skipgram -input dta_normalized\dta_normalized_all.txt -output dta_emb
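
On either platform, `-output dta_emb` makes fastText write `dta_emb.bin` (the binary model) and `dta_emb.vec` (a text file whose first line holds the vocabulary size and the vector dimension, followed by one word and its vector per line). A quick sanity check of the `.vec` file could look like the sketch below. It is written for a POSIX shell with `awk` (on Windows this assumes something like Git Bash or WSL), and the function name `vec_check` is hypothetical.

```shell
#!/usr/bin/env bash
# vec_check FILE: compare the header of a fastText .vec file with its body.
vec_check() {
  local file=$1 words dim
  # First line: "<number of words> <dimension>"
  read -r words dim < "$file"
  # Count body lines and lines whose field count is not dim + 1
  # (one word followed by dim numbers).
  awk -v words="$words" -v dim="$dim" '
    NR == 1 { next }
    { n++; if (NF != dim + 1) bad++ }
    END {
      printf "words=%d dim=%d rows=%d malformed=%d\n", words, dim, n, bad
      ok = (n == words && bad == 0)
      exit ok ? 0 : 1
    }' "$file"
}
```

Calling `vec_check dta_emb.vec` prints a one-line summary and returns a non-zero status if the row count or the vector length disagrees with the header. With the default settings used above, the reported dimension should be 100.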
