Add JSON fulltext corpus format #872
Codecov Report
All modified and coverable lines are covered by tests.

@@           Coverage Diff            @@
##             main     #872    +/-   ##
========================================
  Coverage   99.66%   99.66%
========================================
  Files         102      103     +1
  Lines        7665     7838   +173
========================================
+ Hits         7639     7812   +173
  Misses         26       26
|
juhoinkinen
left a comment
Principles look good.
jsonschema could be used for validating the JSON data and for error handling. For such a simple schema it might be unnecessary, but on the other hand, Annif already uses jsonschema to customize API request validation.
The JSON format for documents actually looks like a step toward the functionality to train projects via REST API: #634
Yes, that's a great idea! I think that the schema is just about complex enough to justify JSON Schema validation. I'll look into it. (I already added basic error handling for cases where the file is empty or has broken JSON syntax) |
|
Note to self: there's now a subtle bug in EDIT: Should be fixed by 3031971 |
|
@juhoinkinen I added JSON Schema validation, made it possible to specify subjects via labels instead of URIs (as in the .txt/.tsv/.key format), moved the JSON support into its own submodule json.py, and fixed some issues with Also @c-poley @RietdorfC feel free to test and comment! |
|
|
Hi @osma, The first one completed without any problems and produced the expected output, but the second one, which refers to the JSON file, generated an error message (you can find the message in the file error.txt). We couldn't trace the error back to our input, but maybe we overlooked something? A question occurred when we tried to run the Thanks a lot and best regards
|
Thank you @RietdorfC for testing and for reporting the problems. For the first problem, it appears that I will investigate and see if I can reproduce these problems. |
|
@RietdorfC just verifying - the testing was performed using the final PR code that was merged to |
|
Hi @osma, |
|
I found the first problem. There is a bugfix in PR #874. Would you like to test it? As for the second problem, this happens because |
|
Hi @osma , Thanks for the clarification! The JSON support for the |
|
Hi @osma, We tested the Thanks again to you and your team!
This PR adds a new JSON-based fulltext corpus format, based on the discussion in issue #868 (ping @RietdorfC @c-poley).
It extends the existing support for document directories. Previously, the directory had to contain .txt files, with gold standard subjects stored in .tsv (or .key) files with the same basename. This PR adds another option: the directory may instead (or also) contain .json files with JSON data of the following form:

    {
      "text": "A quick brown fox jumped over the lazy dog.",
      "metadata": {
        "title": "As We May Think",
        "author": "Bush, Vannevar"
      },
      "subjects": [
        { "uri": "http://www.yso.fi/onto/yso/p817", "label": "future" },
        { "uri": "http://www.yso.fi/onto/yso/p3295", "label": "visions (prospects)" },
        { "uri": "http://www.yso.fi/onto/yso/p15527", "label": "science fiction" }
      ]
    }

All top level fields (text, metadata and subjects) are individually optional. Subject labels are also optional and included only for illustration; they do not affect the parsing. The JSON format is the same as the one used by the REST API method learn, specified by the IndexedDocument schema in the OpenAPI specification, except that the set of required fields is a bit different (either text or metadata is required on the top level, and if subjects are included, they must specify either uri or label).

Closes #868.
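
To make the field requirements described above concrete, here is a minimal sketch of how one such JSON document could be parsed and checked. It is not the code from this PR: the PR uses JSON Schema validation, whereas this sketch hand-codes the same rules with only the standard library so that it runs without extra dependencies. The function name `parse_document` and the error messages are hypothetical.

```python
import json


def parse_document(raw: str) -> dict:
    """Parse one JSON document and check the rules from the PR description.

    Hypothetical helper for illustration; not part of Annif's API.
    """
    # An empty file is reported separately from broken JSON syntax.
    if not raw.strip():
        raise ValueError("empty JSON document")
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as err:
        raise ValueError(f"invalid JSON syntax: {err}") from err
    if not isinstance(data, dict):
        raise ValueError("top level must be a JSON object")

    # Either "text" or "metadata" is required on the top level.
    if "text" not in data and "metadata" not in data:
        raise ValueError("either 'text' or 'metadata' is required")

    # "subjects" is optional, but if present each entry must
    # specify "uri" or "label" (or both).
    subjects = data.get("subjects", [])
    if not isinstance(subjects, list):
        raise ValueError("'subjects' must be a list")
    for subject in subjects:
        if not isinstance(subject, dict):
            raise ValueError("each subject must be a JSON object")
        if "uri" not in subject and "label" not in subject:
            raise ValueError("each subject needs a 'uri' or a 'label'")
    return data


if __name__ == "__main__":
    example = '{"text": "A quick brown fox.", "subjects": [{"label": "future"}]}'
    print(parse_document(example)["subjects"])
```

A document with only `metadata` and no `subjects` passes these checks, while a document containing only `subjects` is rejected, which matches the rule that either `text` or `metadata` must be present.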