Add JSON fulltext corpus format by osma · Pull Request #872 · NatLibFi/Annif

osma · 2025-08-13T13:24:02Z

This PR adds a new JSON-based fulltext corpus format, based on the discussion in issue #868 (ping @RietdorfC @c-poley).

It extends the existing support for document directories. Previously, the directory had to contain .txt files, with gold standard subjects stored in .tsv (or .key) files with the same basename. This PR adds another option: the directory may instead (or also) contain .json files with JSON data of the following form:

{
  "text": "A quick brown fox jumped over the lazy dog.",
  "metadata": {
    "title": "As We May Think",
    "author": "Bush, Vannevar"
  },
  "subjects": [
    { "uri": "http://www.yso.fi/onto/yso/p817", "label": "future" },
    { "uri": "http://www.yso.fi/onto/yso/p3295", "label": "visions (prospects)" },
    { "uri": "http://www.yso.fi/onto/yso/p15527", "label": "science fiction" }
  ]
}

All top level fields (text, metadata and subjects) are optional. Subject labels are also optional and included only for illustration; they do not affect the parsing. The JSON format is the same as the one used by the REST API method learn, specified by the IndexedDocument schema in the OpenAPI specification, except ~~there are no required fields~~ the set of required fields is a bit different (either text or metadata is required on the top level, and if subjects are included, they must specify either uri or label).
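The required-field rules described above can be sketched with a small stdlib-only checker. This is an illustration, not the validation code in the PR (which uses JSON Schema); the function name `check_document` and the wording of the messages are hypothetical.

```python
import json


def check_document(doc):
    """Return a list of problems found in one parsed JSON corpus document."""
    if not isinstance(doc, dict):
        return ["top level must be a JSON object"]
    problems = []
    # Rule 1: at least one of "text" or "metadata" must be present.
    if "text" not in doc and "metadata" not in doc:
        problems.append("either 'text' or 'metadata' is required")
    # Rule 2: every subject needs a "uri" or a "label" (or both).
    for i, subject in enumerate(doc.get("subjects", [])):
        if not isinstance(subject, dict) or not (subject.keys() & {"uri", "label"}):
            problems.append(f"subject {i} needs 'uri' or 'label'")
    return problems


good = json.loads('{"text": "A quick brown fox.", "subjects": [{"label": "future"}]}')
bad = json.loads('{"subjects": [{"uri": "http://www.yso.fi/onto/yso/p817"}, {}]}')
print(check_document(good))
print(check_document(bad))
```

The first document passes because it has `text` and a label-only subject; the second is flagged twice, once for lacking both `text` and `metadata` and once for the empty subject entry.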

Closes #868.

codecov · 2025-08-13T13:27:22Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.66%. Comparing base (140a1b6) to head (9592eef).
⚠️ Report is 15 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff            @@
##             main     #872    +/-   ##
========================================
  Coverage   99.66%   99.66%            
========================================
  Files         102      103     +1     
  Lines        7665     7838   +173     
========================================
+ Hits         7639     7812   +173     
  Misses         26       26

juhoinkinen

Principles look good.

jsonschema could be used for validating the JSON data and for error handling. However, for such a simple schema, it might be unnecessary. On the other hand, Annif already uses jsonschema to customize API request validation.

The JSON format for documents actually looks like a step toward the functionality to train projects via REST API: #634

osma · 2025-08-14T07:25:34Z

@juhoinkinen

> jsonschema could be used for validating the JSON data and for error handling. However, for such a simple schema, it might be unnecessary.

Yes, that's a great idea! I think that the schema is just about complex enough to justify JSON Schema validation. I'll look into it. (I already added basic error handling for cases where the file is empty or has broken JSON syntax.)

osma · 2025-08-14T09:28:31Z

Note to self: there's now a subtle bug in annif index. The list of paths returned by DocumentDirectory may contain broken/empty files, because their content is not checked during initial iteration. That means that the list of file names may get out of sync with the list of suggestion results, causing results to be written into the wrong files.

EDIT: Should be fixed by 3031971
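The out-of-sync problem described above, and the idea of carrying the file path inside each document, can be sketched as follows. This is a simplified illustration and not the code from commit 3031971; the `Doc` class, `read_docs`, `output_names` and the `.out` suffix are all hypothetical.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Doc:  # hypothetical stand-in, not Annif's Document class
    text: str
    file_path: Path


def read_docs(directory):
    """Yield one Doc per readable .json file, skipping empty or broken ones."""
    for path in sorted(Path(directory).glob("*.json")):
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            continue  # empty or syntactically broken file
        if not isinstance(data, dict):
            continue  # valid JSON, but not a document object
        yield Doc(text=data.get("text", ""), file_path=path)


def output_names(directory, suffix=".out"):
    """Derive each output name from the Doc itself, not from a separate path list."""
    return [doc.file_path.with_suffix(suffix).name for doc in read_docs(directory)]


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.json").write_text('{"text": "first"}', encoding="utf-8")
    Path(tmp, "b.json").write_text("", encoding="utf-8")  # empty, gets skipped
    Path(tmp, "c.json").write_text('{"text": "third"}', encoding="utf-8")
    names = output_names(tmp)
    print(names)
```

Because each result is paired with the path stored on its own document, a skipped file cannot shift later results onto the wrong file name.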

osma · 2025-08-14T09:56:32Z

@juhoinkinen I added JSON Schema validation, made it possible to specify subjects via labels instead of URIs (as in the .txt/.tsv/.key format), moved the JSON support into its own submodule json.py, and fixed some issues with annif index. I can't think of anything else right now, so from my side, this is getting ready for merging. Do you want to take another look?

Also @c-poley @RietdorfC feel free to test and comment!

sonarqubecloud · 2025-08-14T10:29:26Z

Quality Gate failed

Failed conditions
24.1% Duplication on New Code (required ≤ 3%)

See analysis details on SonarQube Cloud

RietdorfC · 2025-08-15T10:27:58Z

Hi @osma,
we started testing the JSON fulltext format but unfortunately encountered one error and one problem.
The error occurred when we tried to run the annif index command. We prepared two files from the same document: a .txt file containing simply the full text of the document (you can find this file attached as 1000694364.txt) and one .json file with an additional metadata field for the title and its subjects (you can find this file attached as 1000694364.json). We placed both files in directories to use the annif index command. In our test, we used the same model to run the index command (an omikuji model trained on full texts). To test if both files could be processed, we ran the following commands:

annif index A-omikuji-FT tmp-txt/
annif index A-omikuji-FT tmp-json/

The first one completed without any problems and produced the expected output, but the second one, which refers to the JSON file, generated an error message (you can find the message in the file error.txt). We couldn’t trace the error back to our input, but maybe we overlooked something?

A question occurred when we tried to run the annif suggest command on a JSON file. First, we tested the annif suggest command with an omikuji model without the transform=select(title) parameter. We tested the model on a file from the classic full text corpus format (same .txt file as above) and got a valid result. Then we tested the same model but on a JSON file (same as mentioned above) and also got valid results.
Next, we tested an omikuji model with the new parameter transform=select(title) set. First, we tested it on the .txt file and, as expected, got no result, as there is no title for the model to process. Second, we tested the model on the JSON file but again got no result, even though there is a title given in the metadata section of the JSON file. Why didn't it work? You can find the complete input and output in the file suggest.txt.

Thanks a lot and best regards
Clemens

1000694364.txt
1000694364.json
error.txt
suggest.txt

osma · 2025-08-15T10:58:19Z

Thank you @RietdorfC for testing and for reporting the problems.

For the first problem, it appears that subject_index is not properly set for some reason. For the second problem, I'm not sure what happened.

I will investigate and see if I can reproduce these problems.

osma · 2025-08-15T11:04:17Z

@RietdorfC just verifying - the testing was performed using the final PR code that was merged to main yesterday and not an earlier version?

RietdorfC · 2025-08-15T11:24:10Z

Hi @osma,
Yes, we used the newest version with the final pull request merged.

osma · 2025-08-15T11:32:21Z

I found the first problem. There is a bugfix in PR #874. Would you like to test it?

As for the second problem, this happens because annif suggest is not aware of JSON files. It expects to be given a plain text file and does not care about the filename extension. If you give it a JSON file, it will still parse it as plain text. Therefore there is no title and the suggestion result is empty. Maybe this could be changed (open an issue?) but for now this is how things stand.

RietdorfC · 2025-08-15T11:51:46Z

Hi @osma,
Yes, we will test the bugfix!

Thanks for the clarification! The JSON support for the suggest command is not essential for us. So there is no need for us to change it.

RietdorfC · 2025-08-18T12:16:38Z

Hi @osma,

We tested the index command again and now everything works perfectly. Thank you! We also tested the train, eval, hyperopt and optimize commands with the new JSON fulltext corpus format and did not encounter any problems. Everything seems to work. In addition, we tried using the content of the JSON files directly as (part of) the request body for the suggest_batch function with the REST API. That worked really well too.

Thanks again to you and your team!
Best regards
Clemens
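As a rough sketch of the last point, corpus documents can be wrapped into a batch request body with the standard library alone. The envelope key `documents` is an assumption about the REST API's request shape, and `build_batch_body` is a hypothetical helper; check the OpenAPI specification for the actual schema before relying on this.

```python
import json


def build_batch_body(corpus_docs):
    """Wrap parsed corpus documents in an assumed {"documents": [...]} envelope."""
    return {"documents": list(corpus_docs)}


# One parsed corpus document, in the format described in this PR.
doc = json.loads(
    '{"text": "A quick brown fox jumped over the lazy dog.",'
    ' "metadata": {"title": "As We May Think"}}'
)
body = build_batch_body([doc])
print(json.dumps(body, indent=2))
```

The sketch only builds the body; sending it (and any limit or threshold parameters) is left out because those details are not stated in this thread.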

osma added 2 commits August 13, 2025 15:41

refactor DocumentDirectory in preparation for adding JSON corpus format support

c8699df

implement JSON fulltext corpus format (directory of JSON files)

03ac192

osma added this to the 1.4 milestone Aug 13, 2025

osma self-assigned this Aug 13, 2025

osma added the enhancement label Aug 13, 2025

github-advanced-security AI found potential problems Aug 13, 2025

Comment thread annif/corpus/document.py Dismissed

osma added 3 commits August 13, 2025 16:30

remove redundant list() calls to make SonarCloud happier

d775348

add unit test for eval command with JSON fulltext corpus

3402d92

modify test_suggest_two_files unit test to use one JSON corpus file

23cb801

osma changed the title ~~WIP: Add JSON fulltext corpus format~~ Add JSON fulltext corpus format Aug 13, 2025

osma requested a review from juhoinkinen August 13, 2025 14:10

osma added 2 commits August 13, 2025 17:16

avoid equality checks on float values to make SonarCloud happier

78ee9fb

handle empty and broken JSON corpus files

86235dd

juhoinkinen approved these changes Aug 14, 2025

osma added 4 commits August 14, 2025 11:35

support JSON corpus files with subjects given by label only

8a98d8d

refactor: move JSON functionality to separate module corpus/json.py

24035c5

validate JSON corpus files using JSON Schema

c18b9e1

make sure 'annif index' works on JSON files as well as TXT

42c56f4

osma added 2 commits August 14, 2025 12:46

store file_path within Document, for use by annif index

3031971

show less verbose JSON schema validation warning

faf32ff

osma marked this pull request as ready for review August 14, 2025 09:54

osma requested a review from juhoinkinen August 14, 2025 09:58

force utf-8 encoding when reading JSON files, just in case

9592eef

juhoinkinen approved these changes Aug 14, 2025

osma merged commit 65db9a4 into main Aug 14, 2025
13 of 14 checks passed

osma deleted the issue868-json-corpus-format branch August 14, 2025 11:08

osma mentioned this pull request Aug 14, 2025

Flexible fusion backend #813

Closed

osma mentioned this pull request Aug 15, 2025

JSON corpus bugfix: avoid parsing subjects in annif index #874

Merged

osma mentioned this pull request Aug 18, 2025

Add JSONL short text corpus format #876

Merged

4 participants
