A Bi-directional Long Short-Term Memory (BiLSTM) classifier for Japanese sentences that categorizes the verbs they contain by tense and voice, e.g. passive, past, participle.
Only Python <= 3.12.3 works at present, due to compatibility issues with spaCy.
The corpora are included in the repository, but can be re-downloaded:
- Tanaka Corpus (used in the final model)
- SNOW T15 (dropped due to its low number of examples for conjugations like passive and potential, which are the focus of this project)
For spaCy, follow these instructions; the model used is ja_core_news_trf. The commands should be:

```
pip install spacy
python -m spacy download ja_core_news_trf
```
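As a quick sanity check that the model installed correctly, a minimal sketch (the sentence is one of the test examples below):

```python
import spacy

# Load the Japanese transformer pipeline installed above.
nlp = spacy.load("ja_core_news_trf")

# One of the test sentences from test.csv.
doc = nlp("お姉さんに私のりんごを食べられた")
for token in doc:
    if token.pos_ in ("VERB", "AUX"):
        # token.tag_ carries the fine-grained Japanese POS tag
        print(token.text, token.lemma_, token.tag_)
```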
For MeCab, the official installation instructions are in Japanese, but the relevant commands are:

MeCab itself:

```
tar zxfv mecab-0.996.tar.gz
cd mecab-0.996
./configure
make
make check
su
make install
```

MeCab's dictionary:

```
tar zxfv mecab-ipadic-2.7.0-20070610.tar.gz
cd mecab-ipadic-2.7.0-20070610
./configure
make
su
make install
```
Here is a list of dictionaries to try for MeCab in case the commands above cause issues. This repository also requires MeCab in order to run, so if the dictionary's path cannot be found, refer to this. A quick smoke test of the installation is sketched below.
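As a minimal smoke test of the MeCab install (this simply shells out to the mecab CLI; a working setup prints one morpheme per line, ending with EOS):

```python
import subprocess

# Pipe a sample sentence through the mecab binary installed above.
result = subprocess.run(
    ["mecab"],
    input="お姉さんに私のりんごを食べられた",
    capture_output=True,
    text=True,
    check=True,  # raises if mecab or its dictionary is misconfigured
)
print(result.stdout)
```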
Finally, install the remaining Python dependencies:

```
pip install pandas numpy matplotlib scikit-learn torch jamdict jamdict-data
```

(pickle ships with the Python standard library, so it is not installed via pip.)
To try your own examples with each model, modify test.csv, where each row is a sentence segment: a sentence containing only a single verb. If a sentence with multiple verbs is used, the baseline classifies the first verb in the sentence and the BiLSTM classifies the last.
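For illustration, test.csv holds one segment per row; the rows below are copied from the example predictions further down (the exact column layout, e.g. whether there is a header row, is an assumption):

```
食べられる
お姉さんに私のりんごを食べられた
した私は食べる
```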
Run `python3 baseline.py`, which will either load a logistic regression model or train one on /data/data_2/sentences_with_segments.csv. If a new model is trained, new metrics are saved to baseline_model_evaluation.txt. Then the examples from test.csv are classified and printed.
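A minimal sketch of that load-or-train pattern, assuming a pickled scikit-learn model; the feature extraction and CSV column names here are illustrative stand-ins, not the repo's actual pipeline:

```python
import os
import pickle

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

MODEL_PATH = "models/baseline_model.pkl"

if os.path.exists(MODEL_PATH):
    # Reuse the saved model if one exists.
    with open(MODEL_PATH, "rb") as f:
        model = pickle.load(f)
else:
    # Otherwise train from scratch; the column names are assumptions.
    df = pd.read_csv("data/data_2/sentences_with_segments.csv")
    # Character n-grams stand in for the real feature extraction.
    model = make_pipeline(
        CountVectorizer(analyzer="char", ngram_range=(1, 3)),
        LogisticRegression(max_iter=1000),
    )
    model.fit(df["segment"], df["label"])
    with open(MODEL_PATH, "wb") as f:
        pickle.dump(model, f)

print(model.predict(["食べられる"]))  # e.g. ['passive']
```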
Example Predictions:

```
Segment: 誰が一番に着く
Predicted verb type: dict (confidence: 0.90)
Segment: か私には分かりません
Predicted verb type: negative-polite (confidence: 1.00)
Segment: 食べられる
Predicted verb type: passive (confidence: 0.98)
Segment: お姉さんに私のりんごを食べられた
Predicted verb type: passive-past (confidence: 0.96)
Segment: お姉さんは私のりんごが食べられた
Predicted verb type: potential-past (confidence: 0.80)
Segment: お姉さんに私のりんごを食べられました
Predicted verb type: passive-past-polite (confidence: 0.99)
Segment: お姉さんは私のりんごが食べられました
Predicted verb type: passive-past-polite (confidence: 0.97)
Segment: お姉さんに私のりんごを食べられて
Predicted verb type: passive-participle (confidence: 0.97)
Segment: お姉さんは私のりんごが食べられて
Predicted verb type: passive-participle (confidence: 0.90)
Segment: 私に食べられた
Predicted verb type: passive-past (confidence: 0.97)
Segment: 私は食べられた
Predicted verb type: passive-past (confidence: 0.97)
Segment: した私は食べる
Predicted verb type: past (confidence: 0.86)
```
Run `python3 lstm.py`, which will either load a bidirectional long short-term memory model or train one on /data/data_2/sentences_with_segments.csv. If a new model is trained, new metrics are saved to bilstm_model_evaluation.txt. Then the examples from test.csv are classified and printed.
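A minimal sketch of what such a classifier can look like in PyTorch; the layer sizes, character-level input, and use of the final time step are assumptions, not the repo's exact architecture:

```python
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    """Toy character-level BiLSTM classifier; sizes are illustrative."""

    def __init__(self, vocab_size, num_classes, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        # The bidirectional output is 2 * hidden_dim wide.
        self.fc = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, x):                  # x: (batch, seq_len) of char ids
        out, _ = self.lstm(self.embed(x))  # (batch, seq_len, 2 * hidden_dim)
        # Use the output at the final time step as the sentence representation.
        return self.fc(out[:, -1, :])      # (batch, num_classes)

model = BiLSTMClassifier(vocab_size=4000, num_classes=30)
logits = model(torch.zeros(1, 20, dtype=torch.long))  # dummy padded batch
probs = torch.softmax(logits, dim=-1)  # the printed "confidence" values
torch.save(model.state_dict(), "models/bilstm_model.pt")
```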
Example Predictions:

```
Segment: 誰が一番に着く
Predicted verb type: dict (confidence: 1.00)
Segment: か私には分かりません
Predicted verb type: negative-polite (confidence: 1.00)
Segment: 食べられる
Predicted verb type: passive (confidence: 1.00)
Segment: お姉さんに私のりんごを食べられた
Predicted verb type: passive-past (confidence: 1.00)
Segment: お姉さんは私のりんごが食べられた
Predicted verb type: potential-past (confidence: 1.00)
Segment: お姉さんに私のりんごを食べられました
Predicted verb type: passive-past-polite (confidence: 1.00)
Segment: お姉さんは私のりんごが食べられました
Predicted verb type: potential-past-polite (confidence: 0.98)
Segment: お姉さんに私のりんごを食べられて
Predicted verb type: passive-participle (confidence: 1.00)
Segment: お姉さんは私のりんごが食べられて
Predicted verb type: potential-participle (confidence: 1.00)
Segment: 私に食べられた
Predicted verb type: passive-past (confidence: 1.00)
Segment: 私は食べられた
Predicted verb type: passive-past (confidence: 1.00)
Segment: した私は食べる
Predicted verb type: dict (confidence: 1.00)
```
```
jp-verb-classifier/
├── conjugation_info/ *files referenced for conjugation information*
│   ├── conj_tags_to_inflections.csv *conjugation rule tagset for spacy tags --> mecab tags*
│   ├── conjugation_tags.csv *conjugation rule tagset for mecab tags --> tags recognized by inflections.py*
│   └── inflection_types.csv *list of root conjugation rules*
├── data/ *data files*
│   ├── data_1/ *data files for the SNOW T15 Corpus (not used in final model)*
│   │   ├── sentences_cleaned.csv *Step 3: cleaned sentences for T15*
│   │   ├── sentences_labeled.csv *Step 2: labeled sentences for T15*
│   │   ├── sentences_raw.csv *Step 1: raw sentences for T15*
│   │   ├── sentences_with_segments.csv *Step 4: sentences with segments extracted by verb for T15*
│   │   └── T15-2020.1.7.xlsx *original dataset*
│   └── data_2/ *data files for the Tanaka Corpus*
│       ├── examples.utf *original dataset*
│       ├── sentences_cleaned.csv *Step 3: cleaned sentences for Tanaka*
│       ├── sentences_labeled.csv *Step 2: labeled sentences for Tanaka*
│       ├── sentences_raw.csv *Step 1: raw sentences for Tanaka*
│       └── sentences_with_segments.csv *Step 4: sentences with segments extracted by verb for Tanaka*
├── legacy/ *legacy files from previous attempts at the labeling pipeline; no longer works*
│   ├── baseline.py
│   ├── inflections.py
│   ├── tense_check_test.py
│   └── tense_check.py
├── models/ *saved models and related classifier evaluation data*
│   ├── baseline_model_evaluation.txt *evaluation of baseline*
│   ├── baseline_model.pkl *saved baseline model (not included in repo due to size)*
│   ├── bilstm_model_evaluation.txt *evaluation of bilstm*
│   └── bilstm_model.pt *saved bilstm model*
├── modules/ *modules for data processing*
│   ├── classify.py *module for classifying verbs*
│   ├── inflections.py *conjugation/inflection script that takes a root verb and its conjugation rule and returns a full dictionary of conjugations (see the sketch after this tree)*
│   └── sentence_tokenize.py *deprecated, but saves a .pkl of sentences_cleaned.csv with an additional column containing spacy token data*
├── plots/ *figures*
│   ├── baseline_metrics.png *bar chart of precision, recall, and F1 for baseline*
│   ├── bilstm_metrics.png *bar chart of precision, recall, and F1 for bilstm*
│   └── label-process-graph.png *visual flow chart of how verbs are classified by classify.py*
├── results/ *results*
│   ├── bilstm_epochs.csv *epoch training metrics from training bilstm*
│   ├── report.py *generates tables and figures for metrics found in /plots*
│   └── test.csv *test examples*
├── baseline.py *trains a new baseline model or loads the saved one*
├── data_setup.py *completes steps 3 and 4 of data processing*
├── label.py *completes steps 1 and 2 of data processing*
└── lstm.py *trains a new BiLSTM model or loads the saved one*
```
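To make the contract of modules/inflections.py concrete, a hypothetical call could look like the following; the function name, rule label, and returned keys are all illustrative, inferred only from the file's description above:

```python
# Hypothetical usage; the real function name and conjugation-rule labels
# in modules/inflections.py may differ.
from modules.inflections import inflect

conjugations = inflect("食べる", "ichidan")  # root verb + its conjugation rule
print(conjugations["passive"])               # e.g. 食べられる
```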
- Python 3 Reference Manual: Python v3.12.3 (highest version compatible with spaCy)
- Tanaka Corpus: managed by the Tatoeba project; UTF-8 version downloaded
- SNOW T15 Corpus: Japanese Simplified Corpus with Core Vocabulary (not used in the final model)
- spaCy (Industrial-strength Natural Language Processing in Python): tokenization, lemmatization, POS tagging, conjugation rule type
- spaCy Japanese models
- scikit-learn (Machine Learning in Python): train/test split, logistic regression, classifier evaluation
- PyTorch: Dataset, DataLoader, BiLSTM model
- pandas (powerful Python data analysis toolkit): data cleaning
- NumPy (Nature article): data cleaning
- pickle (Python object serialization): model saving and loading
- Matplotlib: plotting
- Jamdict (a Japanese-Multilingual Dictionary interface): lemmatization verification
- Japanese Verb Conjugation Scripts: original script for classifying verbs based on rules
- Japanese MeCab Part-of-Speech Tagset: reference for classes of Japanese verbs
- MeCab (Yet Another Part-of-Speech and Morphological Analyzer): used by spaCy