EstNLTK analyzer#818
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #818 +/- ##
=======================================
Coverage 99.63% 99.63%
=======================================
Files 93 95 +2
Lines 7141 7170 +29
=======================================
+ Hits 7115 7144 +29
Misses 26 26 ☔ View full report in Codecov by Sentry. |
|
The test coverage is not 100%. I think that's due to how optional backends are handled in the initialization code, that I copied from the spaCy analyzer. The same problem is already there for spaCy, so I think it would make sense to fix that first in main, then apply the same solution in this PR. |
d2a0051 to
66f577d
Compare
|
The initialization problem was fixed for spaCy in PR #820, already merged to main. This still needs wiki documentation, maybe also a mention in the Annif tutorial. |
|
I added a brief section about this analyzer on the wiki page for Analyzers. I don't see any mention of specific analyzers in the Annif tutorial, so I don't think this needs to be mentioned there. |
|
I added an estnltk section to the Optional features page of wiki too (and updated the whole page to use poetry instead of pip for installing dependencies when using dev installation). |
|
I realized that EstNLTK is GPL licensed, so probably should be mentioned alongside YAKE when we talk about licensing. Need to fix that before merging this PR. |
|
|
I asked the EstNLTK developers about their thoughts on license compatibility. They were very positive about the Annif integration. Now that the licensing situation is at least somewhat clarified in the README, I don't think there are obstacles for merging this, so I will do it now. |
This PR adds a new analyzer to support lemmatization using EstNLTK, a natural language analysis toolkit for the Estonian language.
Note that the indirect dependencies of EstNLTK are quite large, with around ~500MB of libraries.