GitHub - cneud/dta_emb: Train word embeddings on DTA texts using fastText

dta_emb

Instructions and supporting tools for training word embeddings for historical (ca. 1600 - 1900) German based on the texts of the Deutsches Textarchiv using fastText.

A pre-trained model can be obtained here [1.43 GB].

Linux

Follow the instructions for building fastText
Download the DTA normalized XML files
wget -i dta_normalized.txt -P dta_normalized
Transform the DTA normalized XML files into plain text
for f in dta_normalized/*; do xsltproc -o "$f.txt" tei2txt.xsl "$f"; done
Concatenate all plain text files into a single text file
cat dta_normalized/*.txt > dta_normalized/dta_normalized_all.txt
Compute word embeddings using fastText
fasttext skipgram -input dta_normalized/dta_normalized_all.txt -output dta_emb
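
The transform and concatenate steps above can also be wrapped in small shell functions so that they are safe to re-run. The following is only a sketch under stated assumptions: the function names `tei_to_txt` and `concat_txt` are hypothetical and not part of this repository, and `xsltproc` is assumed to be on the `PATH`.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for the "transform" and "concatenate" steps.

# Transform every downloaded file in a directory to plain text,
# writing one <input>.txt next to each input file.
tei_to_txt() {
    local dir="$1" xsl="$2" f
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        case "$f" in *.txt) continue ;; esac   # skip outputs of an earlier run
        xsltproc -o "$f.txt" "$xsl" "$f" || echo "failed: $f" >&2
    done
}

# Concatenate all *.txt files in a directory into a single corpus file.
concat_txt() {
    local dir="$1" out="$2" f
    : > "$out"                                 # create or truncate the corpus
    for f in "$dir"/*.txt; do
        [ -f "$f" ] || continue
        [ "$f" -ef "$out" ] && continue        # never read the corpus itself
        cat "$f" >> "$out"
    done
}
```

With these definitions loaded, `tei_to_txt dta_normalized tei2txt.xsl` followed by `concat_txt dta_normalized dta_normalized/dta_normalized_all.txt` would produce the input file expected by the fastText command. Skipping the output file matters because the combined corpus lives in the same directory as the files it is built from.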

Windows

Download wget.exe from https://eternallybored.org/misc/wget/
Download msxsl.exe from https://www.microsoft.com/en-us/download/details.aspx?id=21714
Download fasttext.exe from https://github.com/xiamx/fastText/releases
Download the DTA normalized XML files
wget.exe -i dta_normalized.txt -P dta_normalized
Run a batch script to convert the XML files to a combined plain text file
dta2txt.bat
Compute word embeddings using fastText
fasttext.exe skipgram -input dta_normalized\dta_normalized_all.txt -output dta_emb
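
On either platform, training with `-output dta_emb` writes two files, `dta_emb.bin` and `dta_emb.vec`. The `.vec` file is plain text: a header line with the vocabulary size and the vector dimension, followed by one line per word with its vector. A minimal sketch for a quick sanity check of that file follows. It assumes a POSIX shell with `awk` (on Windows, for example, through Git Bash or WSL), and the function names `vec_info` and `vec_lookup` are hypothetical.

```shell
# Print vocabulary size and vector dimension from the header line of a .vec file.
vec_info() {
    awk 'NR == 1 { print "words=" $1 " dim=" $2; exit }' "$1"
}

# Print the vector stored for one word; return non-zero if the word is absent.
vec_lookup() {
    awk -v w="$2" '
        NR > 1 && $1 == w { $1 = ""; sub(/^ /, ""); print; found = 1; exit }
        END { exit !found }
    ' "$1"
}
```

For example, `vec_info dta_emb.vec` reports the header values, and `vec_lookup dta_emb.vec haus` prints the vector for `haus` if that word is in the vocabulary. For interactive nearest-neighbour queries on the binary model, fastText itself provides `fasttext nn dta_emb.bin`.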
