Add JSON fulltext corpus format #872

Merged: osma merged 14 commits into main from issue868-json-corpus-format on Aug 14, 2025

@osma osma commented Aug 13, 2025

Member

This PR adds a new JSON-based fulltext corpus format, based on the discussion in issue #868 (ping @RietdorfC @c-poley).

It extends the existing support for document directories. Previously, the directory had to contain .txt files, with gold standard subjects stored in .tsv (or .key) files with the same basename. This PR adds another option: the directory may instead (or also) contain .json files with JSON data of the following form:

{
  "text": "A quick brown fox jumped over the lazy dog.",
  "metadata": {
    "title": "As We May Think",
    "author": "Bush, Vannevar"
  },
  "subjects": [
    { "uri": "http://www.yso.fi/onto/yso/p817", "label": "future" },
    { "uri": "http://www.yso.fi/onto/yso/p3295", "label": "visions (prospects)" },
    { "uri": "http://www.yso.fi/onto/yso/p15527", "label": "science fiction" }
  ]
}

All top-level fields (text, metadata and subjects) are optional. Subject labels are also optional and included only for illustration; they do not affect the parsing. The JSON format is the same as the one used by the REST API method learn, specified by the IndexedDocument schema in the OpenAPI specification, except that the set of required fields is a bit different: either text or metadata is required at the top level, and if subjects are included, each one must specify either uri or label.
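
The constraints described above can be illustrated with a small standard-library check. This is only a sketch of the rules as stated in this description, not the code added to annif/corpus; the function name `check_document` is hypothetical.

```python
import json


def check_document(raw: str) -> dict:
    """Parse one JSON corpus document and check the constraints described above.

    Hypothetical helper for illustration only; not part of Annif.
    """
    doc = json.loads(raw)
    if not isinstance(doc, dict):
        raise ValueError("top level must be a JSON object")
    # Either "text" or "metadata" has to be present at the top level.
    if "text" not in doc and "metadata" not in doc:
        raise ValueError("document needs 'text' or 'metadata'")
    # Subjects are optional, but each one must carry a URI or a label.
    for subject in doc.get("subjects", []):
        if not isinstance(subject, dict):
            raise ValueError("each subject must be a JSON object")
        if "uri" not in subject and "label" not in subject:
            raise ValueError("subject needs 'uri' or 'label'")
    return doc
```

A document with only a `metadata` object passes this check, while a document that contains nothing but `subjects` is rejected.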

Closes #868.

@osma osma added this to the 1.4 milestone Aug 13, 2025
@osma osma self-assigned this Aug 13, 2025
Comment thread annif/corpus/document.py Dismissed

codecov Bot commented Aug 13, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.66%. Comparing base (140a1b6) to head (9592eef).
⚠️ Report is 15 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff            @@
##             main     #872    +/-   ##
========================================
  Coverage   99.66%   99.66%            
========================================
  Files         102      103     +1     
  Lines        7665     7838   +173     
========================================
+ Hits         7639     7812   +173     
  Misses         26       26            

@osma osma changed the title WIP: Add JSON fulltext corpus format Add JSON fulltext corpus format Aug 13, 2025
@osma osma requested a review from juhoinkinen August 13, 2025 14:10

@juhoinkinen juhoinkinen left a comment

Member

Principles look good.

jsonschema could be used for validating the JSON data and for error handling. However, for such a simple schema, it might be unnecessary. On the other hand, Annif already uses jsonschema to customize API request validation.

The JSON format for documents actually looks like a step toward the functionality to train projects via REST API: #634

osma commented Aug 14, 2025

Member Author

@juhoinkinen

> jsonschema could be used for validating the JSON data and for error handling. However, for such a simple schema, it might be unnecessary.

Yes, that's a great idea! I think that the schema is just about complex enough to justify JSON Schema validation. I'll look into it. (I already added basic error handling for cases where the file is empty or has broken JSON syntax.)

osma commented Aug 14, 2025

Member Author

Note to self: there's now a subtle bug in annif index. The list of paths returned by DocumentDirectory may contain broken/empty files, because their content is not checked during initial iteration. That means that the list of file names may get out of sync with the list of suggestion results, causing results to be written into the wrong files.

EDIT: Should be fixed by 3031971
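
The pitfall in the note above can be avoided by producing each path together with its parsed document in a single pass, so that a skipped file can never shift the pairing. The following is a minimal sketch of that idea, not the actual fix in the referenced commit; `iter_valid_documents` is a hypothetical name and not the DocumentDirectory API.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_valid_documents(directory: str) -> Iterator[tuple[Path, dict]]:
    """Yield (path, document) pairs, skipping empty or broken JSON files.

    Hypothetical helper for illustration only; not part of Annif.
    """
    for path in sorted(Path(directory).glob("*.json")):
        try:
            doc = json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            # Empty files and syntax errors both end up here and are skipped.
            continue
        if isinstance(doc, dict):
            # The path travels with its document, so results written later
            # always go to the file they were computed from.
            yield path, doc
```

Collecting the file names first and the parsed documents second, as two separate lists, is what allows the two to drift apart when one file fails to parse.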

@osma osma marked this pull request as ready for review August 14, 2025 09:54

osma commented Aug 14, 2025

Member Author

@juhoinkinen I added JSON Schema validation, made it possible to specify subjects via labels instead of URIs (as in the .txt/.tsv/.key format), moved the JSON support into its own submodule json.py, and fixed some issues with annif index. I can't think of anything else right now, so from my side, this is getting ready for merging. Do you want to take another look?

Also @c-poley @RietdorfC feel free to test and comment!
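
For readers who want to see how the rules from the PR description might look as JSON Schema, here is a rough sketch. It is an assumption based on the prose in this thread, not the schema that was committed in json.py; `DOCUMENT_SCHEMA` and `validate_document` are hypothetical names, and the third-party `jsonschema` package is needed only when the function is called.

```python
# Sketch of a schema for one corpus document (assumed, not Annif's actual schema).
DOCUMENT_SCHEMA = {
    "type": "object",
    "properties": {
        "text": {"type": "string"},
        "metadata": {"type": "object"},
        "subjects": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "uri": {"type": "string"},
                    "label": {"type": "string"},
                },
                # A subject may be given by URI, by label, or by both.
                "anyOf": [{"required": ["uri"]}, {"required": ["label"]}],
            },
        },
    },
    # At least one of "text" and "metadata" must be present.
    "anyOf": [{"required": ["text"]}, {"required": ["metadata"]}],
}


def validate_document(doc: dict) -> None:
    """Raise jsonschema.ValidationError if doc does not match the sketch schema."""
    import jsonschema  # third-party package, imported lazily

    jsonschema.validate(instance=doc, schema=DOCUMENT_SCHEMA)
```

Using `anyOf` with `required` clauses expresses "either field is enough", which also covers subjects that are identified by label only.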

@osma osma requested a review from juhoinkinen August 14, 2025 09:58
@sonarqubecloud

Quality Gate failed

Failed conditions
24.1% Duplication on New Code (required ≤ 3%)

See analysis details on SonarQube Cloud

@osma osma merged commit 65db9a4 into main Aug 14, 2025
13 of 14 checks passed
@osma osma deleted the issue868-json-corpus-format branch August 14, 2025 11:08
@osma osma mentioned this pull request Aug 14, 2025
@RietdorfC

Collaborator

Hi @osma,
we started testing the JSON fulltext format but unfortunately encountered one error and one problem.
The error occurred when we tried to run the annif index command. We prepared two files from the same document: a .txt file containing only the full text of the document (attached as 1000694364.txt) and a .json file with an additional metadata field for the title and its subjects (attached as 1000694364.json). We placed each file in its own directory to use with the annif index command. In our test, we used the same model for both runs (an omikuji model trained on full texts). To test whether both files could be processed, we ran the following commands:

annif index A-omikuji-FT tmp-txt/
annif index A-omikuji-FT tmp-json/

The first one completed without any problems and produced the expected output, but the second one, which refers to the JSON file, generated an error message (you can find the message in the file error.txt). We couldn't trace the error back to our input, but maybe we overlooked something?

A question arose when we tried to run the annif suggest command on a JSON file. First, we tested the annif suggest command with an omikuji model without the transform=select(title) parameter. We tested the model on a file from the classic full text corpus format (the same .txt file as above) and got a valid result. Then we tested the same model on a JSON file (the same one as mentioned above) and also got valid results.
Next, we tested an omikuji model with the new parameter transform=select(title) set. First, we tested it on the .txt file and, as expected, got no result, as there is no title for the model to process. Second, we tested the model on the JSON file but again got no result, even though a title is given in the metadata section of the JSON file. Why didn't it work? You can find the complete input and output in the file suggest.txt.

Thanks a lot and best regards
Clemens

1000694364.txt
1000694364.json
error.txt
suggest.txt

osma commented Aug 15, 2025

Member Author

Thank you @RietdorfC for testing and for reporting the problems.

For the first problem, it appears that subject_index is not properly set for some reason. For the second problem, I'm not sure what happened.

I will investigate and see if I can reproduce these problems.

osma commented Aug 15, 2025

Member Author

@RietdorfC just verifying - the testing was performed using the final PR code that was merged to main yesterday and not an earlier version?

@RietdorfC

Collaborator

Hi @osma,
Yes, we used the newest version with the final pull request merged.

osma commented Aug 15, 2025

Member Author

I found the first problem. There is a bugfix in PR #874. Would you like to test it?

As for the second problem, this happens because annif suggest is not aware of JSON files. It expects to be given a plain text file and does not care about the filename extension. If you give it a JSON file, it will still parse it as plain text. Therefore there is no title and the suggestion result is empty. Maybe this could be changed (open an issue?) but for now this is how things stand.
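
To make the difference concrete, here is a small sketch that contrasts the two ways of reading a file. It is not how annif suggest is implemented; `read_document` is a hypothetical helper that shows what extension-aware reading could look like.

```python
import json
from pathlib import Path


def read_document(path: str) -> tuple[str, dict]:
    """Return (text, metadata), choosing the parser by filename extension.

    Hypothetical helper for illustration only; not part of Annif.
    """
    raw = Path(path).read_text(encoding="utf-8")
    if path.endswith(".json"):
        doc = json.loads(raw)
        return doc.get("text", ""), doc.get("metadata", {})
    # Plain-text reading: the whole file content is the text and there is no
    # metadata, so a transform that selects the title has nothing to select.
    return raw, {}
```

When a JSON file is read the plain-text way, the raw JSON string becomes the document text and the metadata stays empty, which matches the empty result reported with transform=select(title).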

@RietdorfC

Collaborator

Hi @osma,
Yes, we will test the bugfix!

Thanks for the clarification! The JSON support for the suggest command is not essential for us. So there is no need for us to change it.

@RietdorfC

Collaborator

Hi @osma,

We tested the index command again and now everything works perfectly. Thank you! We also tested the train, eval, hyperopt and optimize commands with the new JSON fulltext corpus format and did not encounter any problems. Everything seems to work. In addition, we tried using the content of the JSON files directly as (part of) the request body for the suggest_batch function of the REST API. That worked really well too.

Thanks again to you and your team!
Best regards
Clemens
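
For anyone wanting to try the same thing, the sketch below assembles a batch request body from a directory of JSON corpus files. The `documents` wrapper and the `document_id` field are assumptions about the request shape and should be checked against the OpenAPI specification; `build_batch_body` is a hypothetical helper, and no HTTP request is made here.

```python
import json
from pathlib import Path


def build_batch_body(directory: str) -> dict:
    """Collect JSON corpus files into one request body (assumed shape).

    Hypothetical helper for illustration only; not part of Annif.
    """
    documents = []
    for path in sorted(Path(directory).glob("*.json")):
        doc = json.loads(path.read_text(encoding="utf-8"))
        # Keep only the input fields; gold standard subjects are not sent.
        item = {key: doc[key] for key in ("text", "metadata") if key in doc}
        # Assumed field: use the file's basename to match results to inputs.
        item["document_id"] = path.stem
        documents.append(item)
    return {"documents": documents}
```

The resulting dictionary can be serialized with `json.dumps` and sent with any HTTP client once the field names have been confirmed.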

Development

Successfully merging this pull request may close these issues.

Metadata support in fulltext corpus format

4 participants