A simple Python 3.11 script that generates child-parent lemma mappings for import into Lute, using NLTK.
This may only work for English; I'm not sure. It is a work in progress to explore lemmatization for Lute.
- Python 3.11 (earlier versions may work, but are untested)
Create a virtual environment and install the dependencies with pip3.11:
$ python3.11 -m venv .env
$ source .env/bin/activate
$ pip3.11 install -r requirements.txt
$ deactivate
Then get the necessary data files, following the notes at https://www.nltk.org/data.html#interactive-installer. E.g., for English:
$ python3.11
>>> import nltk
>>> nltk.download('punkt')
>>> nltk.download('wordnet')
>>> nltk.download('popular')
Put the full text you wish to lemmatize in a file, and then run main.py with the input and output file paths. For example, using the demo file:
$ source .env/bin/activate
$ python3.11 main.py demo/en_input.txt en_output.txt
$ deactivate
This generates a file of parent-child mappings, one pair per line with the parent first, for the words in the demo text:
go gone
take took
thing things
write wrote
friend friends
report reports
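To inspect the output before importing it, a minimal sketch like the following reads the pairs back into a dictionary. It assumes each line holds a single-word parent followed by a single-word child, separated by whitespace, as in the sample above. `parse_mappings` is a hypothetical helper, not part of this script.

```python
def parse_mappings(lines):
    """Return {child: parent} from lines of the form 'parent child'."""
    mappings = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank lines and anything that is not a simple pair
        parent, child = parts
        mappings[child] = parent
    return mappings


# In practice, pass an open file instead of a list, e.g.:
#   with open("en_output.txt", encoding="utf-8") as f:
#       mappings = parse_mappings(f)
sample = ["go gone", "take took", "thing things"]
print(parse_mappings(sample))
# prints {'gone': 'go', 'took': 'take', 'things': 'thing'}
```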
Note again that this program needs a full text rather than a list of isolated words, because it uses part-of-speech tagging to determine the function of each word as it lemmatizes.
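As a rough illustration of why tagging matters, here is a minimal sketch of tag-aware lemmatization with NLTK. It is not necessarily how main.py does it. `penn_to_wordnet` and `lemma_pairs` are hypothetical names, and the mapping from tag prefixes to WordNet part-of-speech letters is a common simplification that treats anything unrecognized as a noun.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (default: noun)."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb (simplification: also catches particles, RP)
    return "n"      # noun, and the fallback for everything else


def lemma_pairs(text):
    """Return sorted (parent, child) pairs for words whose lemma differs."""
    # Imported here so the tag mapping above can be used without NLTK installed.
    import nltk
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    pairs = set()
    for sentence in nltk.sent_tokenize(text):
        # Tagging whole sentences lets the tagger use context for each word.
        for word, tag in nltk.pos_tag(nltk.word_tokenize(sentence)):
            child = word.lower()
            parent = lemmatizer.lemmatize(child, pos=penn_to_wordnet(tag))
            if parent != child:
                pairs.add((parent, child))
    return pairs and sorted(pairs) or []
```

Calling `lemma_pairs` requires the NLTK data files downloaded earlier. Without the tag, the lemmatizer treats every word as a noun, so a verb form such as "took" would likely be left unchanged.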