Add JSONL short text corpus format #876

Merged
osma merged 2 commits into main from issue875-json-lines-shorttext-format on Aug 18, 2025
osma (Member) commented Aug 18, 2025

This PR adds a new short text (many documents in a single file) corpus format based on JSON Lines. Each line of the file should contain a JSON object that follows the same schema as the fulltext JSON format that was added in PR #872. The file may optionally be gzip compressed.
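For readers unfamiliar with JSON Lines, here is a minimal sketch of how such a file could be read: one JSON object per line, with transparent handling of gzip compression. This is not the implementation from this PR; the function name `read_jsonl_docs` and the choice to detect compression by the `.gz` suffix are assumptions made for illustration, and the sketch does not validate objects against the fulltext JSON schema.

```python
import gzip
import json
from pathlib import Path
from typing import Iterator


def read_jsonl_docs(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a JSONL file.

    Hypothetical helper, not part of the project's API. Files whose name
    ends in ".gz" are assumed to be gzip compressed.
    """
    opener = gzip.open if Path(path).suffix == ".gz" else open
    # Text mode with an explicit encoding works for both open() and gzip.open()
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines, e.g. a trailing newline at end of file
                yield json.loads(line)
```

Because the function is a generator, documents are parsed one line at a time, which suits a format meant to hold many short documents in a single file.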

Closes #875.

osma self-assigned this Aug 18, 2025
osma added this to the 1.4 milestone Aug 18, 2025
Comment thread tests/test_corpus.py Fixed
Comment thread tests/test_corpus.py Fixed
codecov Bot commented Aug 18, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.67%. Comparing base (da18362) to head (b6610b8).
⚠️ Report is 7 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #876   +/-   ##
=======================================
  Coverage   99.66%   99.67%           
=======================================
  Files         103      103           
  Lines        7843     7905   +62     
=======================================
+ Hits         7817     7879   +62     
  Misses         26       26           

osma force-pushed the issue875-json-lines-shorttext-format branch from f21d914 to d8b3035 August 18, 2025 09:43
osma changed the title from "WIP: Add JSONL short text corpus format" to "Add JSONL short text corpus format" Aug 18, 2025
osma marked this pull request as ready for review August 18, 2025 09:54
osma requested a review from juhoinkinen August 18, 2025 09:55
osma merged commit 3eb8ce2 into main Aug 18, 2025
14 checks passed
osma deleted the issue875-json-lines-shorttext-format branch August 18, 2025 13:37
Development

Successfully merging this pull request may close these issues.

Short text corpus format based on JSON Lines

3 participants