Reimplement NN ensemble using PyTorch by osma · Pull Request #926 · NatLibFi/Annif

osma · 2026-01-13T11:34:52Z

This PR reimplements the NN ensemble using PyTorch instead of Keras/TensorFlow.

To test this, you will have to use uv sync --group all --extra torch-cpu or similar (see comments below).

Some notes about the implementation:

- The neural network architecture has been radically simplified; it turned out that a much simpler model (separate linear models for each concept) gives better results than the old MLP-based model.
- The old code displayed top_k_categorical_accuracy, but this was not easily available in PyTorch, so I switched to the nDCG metric, computed on a random subset (n=512 documents) of the given train set. This metric is also used for early stopping.
- The progress bar shown during training now uses tqdm, so it looks a bit different from the Keras one; it is also displayed on stderr rather than on stdout as the old one was.
- The code implements early stopping: training can run for up to 50 epochs (configurable with the max-epochs parameter), but nDCG is tracked on a small sample (n=512) of the train set and training stops when scores start to decline (with patience=2).
- The old code showed a detailed error message when model loading failed; I couldn't figure out (yet) how to do that with PyTorch models, but the model is stored with metadata (Python version, torch version etc.) that may be helpful in implementing such an error message later on if it turns out to be necessary. In general, the models should be pretty much PyTorch-version-agnostic, so there may not be a need for this.
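The nDCG check on a random subset described in the notes above can be illustrated with a small pure-Python sketch. According to the commit log, the PR itself computes nDCG with the torchmetrics package; the functions `ndcg` and `sample_indices` below are hypothetical stand-ins written for illustration only, and they assume binary (0/1) integer targets.

```python
import math
import random


def ndcg(scores, targets):
    """nDCG for one document: scores are predictions, targets are 0/1 labels."""
    # Rank concept indices by predicted score, highest first
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # DCG: a relevant concept at 1-based rank r contributes 1 / log2(r + 1)
    dcg = sum(targets[i] / math.log2(pos + 2) for pos, i in enumerate(ranked))
    # Ideal DCG: all relevant concepts occupy the top ranks
    n_relevant = int(sum(targets))
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(n_relevant))
    return dcg / idcg if idcg > 0 else 0.0


def sample_indices(n_samples, n_eval=512, seed=0):
    """Pick up to n_eval distinct row indices for the early-stopping check."""
    rng = random.Random(seed)
    return rng.sample(range(n_samples), min(n_eval, n_samples))


# Average nDCG over a sampled subset of (scores, targets) pairs
documents = [([0.9, 0.1, 0.5], [1, 0, 1]), ([0.1, 0.9, 0.5], [1, 0, 0])]
subset = sample_indices(len(documents))
mean_ndcg = sum(ndcg(*documents[i]) for i in subset) / len(subset)
print(f"mean nDCG: {mean_ndcg:.4f}")
```

Sampling a fixed, small subset keeps the per-epoch evaluation cost roughly constant regardless of the size of the train set.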
This PR sets up a dependency group all for installing all extras (a substitute for --all-extras, which won't work anymore) as well as special extras for selecting the PyTorch variant. For now, there are torch-cu126 and torch-cu130 extras (for CUDA 12.6 and 13.0, respectively), but I think the setup could quite easily be extended to other PyTorch variants such as CUDA 13.2, ROCm or Intel XPU, though obviously these would require more configuration in pyproject.toml.
This NN ensemble will not make use of a GPU anyway; the model is trained and inference is performed using CPU only. The model is so small that using GPU computation would not bring any practical benefit. But the infrastructure for GPU use is now in place for other PyTorch-based backends such as EBM or XTransformer that would benefit from GPU computing.

Fixes #895

codecov · 2026-01-13T11:43:59Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.63%. Comparing base (14b7443) to head (e318f64).

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #926   +/-   ##
=======================================
  Coverage   99.63%   99.63%           
=======================================
  Files         103      103           
  Lines        8241     8242    +1     
=======================================
+ Hits         8211     8212    +1     
  Misses         30       30


osma · 2026-01-15T14:57:51Z

Selecting the PyTorch variant (CPU or CUDA x.y or ROCm or...) when setting up the development environment using uv sync has been a headache, but I think I've found a workable solution. It's not super elegant, but at least it seems to work.

The problem is that uv sync wants to perform "universal resolution", that is, resolve all the transitive dependencies once and for all, then write the result into the uv.lock file. This can be parameterized by OS, Python version and some other factors, but not by anything that the user could set when running uv sync. Since different PyTorch variants have different dependencies (e.g. CUDA libraries), dependencies for each of them would have to be resolved separately.

But fortunately, it is possible to have some degree of control over the resolution by setting up "extras" and then declaring a "conflict" between them. This causes uv to "fork" the resolution into different "branches", each having their own dependency tree.

So in commit e629963, I added two new extras: torch-cpu (CPU only) and torch-cu128 (CUDA 12.8 GPU), and declared a conflict between them, i.e., you can't install both extras at the same time. (This will unfortunately cause --all-extras to stop working, which is a shame, since it means that lots of specific --extra parameters are needed in typical situations.) These extras are then tied to specific PyTorch package indexes and thus different variants of the torch package.

The end result is that these two extras can be used to select the PyTorch variant at uv sync time. The torch dependency is still also defined for the nn extra, without a specific index. This means that installing only the nn extra will install whatever is the default PyTorch variant (on Linux it is a CUDA variant).

Here are examples of how this works now:

1. `uv sync` without extras

This installs 439MB of dependencies, no PyTorch.

$ uv sync
Resolved 212 packages in 1.71s
      Built annif @ file:///home/oisuomin/git/Annif
Prepared 1 package in 261ms
Uninstalled 1 package in 0.21ms
Installed 1 package in 0.50ms
 ~ annif==1.5.0.dev0 (from file:///home/oisuomin/git/Annif)

$ du -sh .venv
439M	.venv

2. `uv sync` with just the `nn` extra

This installs the default PyTorch CUDA variant, for a total of 2.2GB of dependencies.

$ uv sync --extra nn
Resolved 212 packages in 0.77ms
Installed 6 packages in 96ms
 + lmdb==1.7.5
 + mpmath==1.3.0
 + networkx==3.6.1
 + setuptools==80.9.0
 + sympy==1.14.0
 + torch==2.9.1

$ du -sh .venv
2.2G	.venv

3. `uv sync` with both `nn` and `torch-cpu` extras

This switches to the CPU-only variant of PyTorch. Dependencies are now only 1.2GB.

$ uv sync --extra nn --extra torch-cpu
Resolved 212 packages in 0.78ms
Uninstalled 1 package in 69ms
Installed 1 package in 93ms
 - torch==2.9.1
 + torch==2.9.1+cpu

$ du -sh .venv
1.2G	.venv

4. `uv sync` with both `nn` and `torch-cu128` extras

This installs the PyTorch CUDA 12.8 variant and lots of nvidia-* library packages, for a whopping 7.0GB of dependencies. (I wonder why this isn't the same as the default PyTorch CUDA build that got installed in step 2 above?)

$ uv sync --extra nn --extra torch-cu128
Resolved 212 packages in 0.77ms
Uninstalled 1 package in 72ms
Installed 17 packages in 97ms
 + nvidia-cublas-cu12==12.8.4.1
 + nvidia-cuda-cupti-cu12==12.8.90
 + nvidia-cuda-nvrtc-cu12==12.8.93
 + nvidia-cuda-runtime-cu12==12.8.90
 + nvidia-cudnn-cu12==9.10.2.21
 + nvidia-cufft-cu12==11.3.3.83
 + nvidia-cufile-cu12==1.13.1.3
 + nvidia-curand-cu12==10.3.9.90
 + nvidia-cusolver-cu12==11.7.3.90
 + nvidia-cusparse-cu12==12.5.8.93
 + nvidia-cusparselt-cu12==0.7.1
 + nvidia-nccl-cu12==2.27.5
 + nvidia-nvjitlink-cu12==12.8.93
 + nvidia-nvshmem-cu12==3.3.20
 + nvidia-nvtx-cu12==12.8.90
 - torch==2.9.1+cpu
 + torch==2.9.1+cu128
 + triton==3.5.1

$ du -sh .venv
7.0G	.venv


osma · 2026-01-16T10:44:17Z

I refined the above solution by adding an all dependency group (because --all-extras cannot be used anymore). Now a basic developer install with all CPU-only extra features can be installed with:

uv sync --group all --extra torch-cpu

Maybe not ideal, but it works.

juhoinkinen · 2026-01-22T13:55:56Z

I ran benchmarks using the Annif-tutorial YSO-NLF dataset on the annif-data-kk server (it has 6 CPUs).

The script used and the output data are in the benchmarking branch.

train

| | Before (main) -j1 | After (this PR) -j1 | Before (main) -j6 | After (this PR) -j6 |
|---|---|---|---|---|
| user time (seconds) | 2810.63 | 3023.01 | 2948.25 | 3208.04 |
| percent CPU | 106% | 112% | 571% | 538% |
| wall time | 44:26.96 | 45:36.19 | 8:45.21 | 10:10.04 |
| max RSS | 3_368_876 | 7_076_980 | 2_599_604 | 6_764_364 |
| model disk size (bytes) | 1_304_759_580 | 1_131_495_858 | (same as -j1) | (same as -j1) |

eval

| | Before (main) -j1 | After (this PR) -j1 | Before (main) -j6 | After (this PR) -j6 |
|---|---|---|---|---|
| user time (seconds) | 475.29 | 471.15 | 485.92 | 473.70 |
| percent CPU | 99% | 99% | 498% | 507% |
| wall time | 7:58.65 | 7:53.83 | 1:38.66 | 1:34.24 |
| max RSS | 2_666_460 | 2_176_184 | 2_105_688 | 1_840_860 |
| nDCG | 0.4805 | 0.4750 | 0.4775 | 0.4691 |

Compared to the TensorFlow implementation, PyTorch requires more than twice as much memory in training and is slightly slower (about 8% more user time). In inference the situation is the opposite: PyTorch is slightly faster (~98%) and takes less memory.
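The training overhead can be recomputed directly from the train table. The snippet below is a minimal sketch; the figures are copied from the table, the dictionary and function names are ad hoc, and max RSS is kept in whatever unit the benchmark script reported.

```python
# Figures copied from the train table above: (main, this PR)
user_time = {"-j1": (2810.63, 3023.01), "-j6": (2948.25, 3208.04)}  # seconds
max_rss = {"-j1": (3_368_876, 7_076_980), "-j6": (2_599_604, 6_764_364)}


def ratio(pair):
    """Return the PR figure divided by the main-branch figure."""
    before, after = pair
    return after / before


for jobs in ("-j1", "-j6"):
    print(
        f"{jobs}: user time x{ratio(user_time[jobs]):.3f}, "
        f"max RSS x{ratio(max_rss[jobs]):.2f}"
    )
```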

osma · 2026-01-22T14:07:43Z

Thanks @juhoinkinen! The doubling of RAM usage is interesting. First hypothesis: maybe PyTorch uses higher-precision floats than TensorFlow? I'll investigate.


Copilot

Pull request overview

Copilot reviewed 9 out of 10 changed files in this pull request and generated 3 comments.

Copilot

Pull request overview

Copilot reviewed 9 out of 10 changed files in this pull request and generated 4 comments.

Copilot

Pull request overview

Copilot reviewed 9 out of 10 changed files in this pull request and generated 3 comments.

+            n_samples = len(dataset)
+            n_eval = min(self.EARLY_STOP_EVAL_ROWS, n_samples)
+            eval_indices = rng.choice(n_samples, size=n_eval, replace=False)
+            eval_inputs, eval_targets = dataset.get_subset(eval_indices.tolist())


+            criterion = nn.BCEWithLogitsLoss()
+            early_stopping = EarlyStopping(patience=self.EARLY_STOPPING_PATIENCE)
+
+            for epoch in range(max_epochs):


    except ImportError:
-        raise ValueError(
-            "Keras and TensorFlow not available, cannot use " + "nn_ensemble backend"
-        )
+        raise ValueError("PyTorch not available, cannot use nn_ensemble backend")


osma · 2026-06-02T09:14:43Z

I dropped the last two commits (related to the schemathesis test failures) from this PR branch and moved them into their own PR #941, to be merged first.

sonarqubecloud · 2026-06-02T12:39:51Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

juhoinkinen · 2026-06-10T07:03:19Z

Results of current implementation measured like before:

train

| | Before (main) -j1 | After (this PR) -j1 | Before (main) -j6 | After (this PR) -j6 |
|---|---|---|---|---|
| user time (seconds) | 2810.63 | 2815.48 | 2948.25 | 2973.60 |
| percent CPU | 106% | 107% | 571% | 576% |
| wall time | 44:26.96 | 44:35.79 | 8:45.21 | 8:50.06 |
| max RSS | 3_368_876 | 3_744_964 | 2_599_604 | 2_955_204 |
| model disk size (bytes) | 1_304_759_580 | 1_247_592_467 | (same as -j1) | (same as -j1) |

eval

| | Before (main) -j1 | After (this PR) -j1 | Before (main) -j6 | After (this PR) -j6 |
|---|---|---|---|---|
| user time (seconds) | 475.29 | 474.49 | 485.92 | 494.97 |
| percent CPU | 99% | 99% | 498% | 505% |
| wall time | 7:58.65 | 7:57.95 | 1:38.66 | 1:38.75 |
| max RSS | 2_666_460 | 2_189_608 | 2_105_688 | 1_788_876 |
| nDCG | 0.4805 | 0.4774 | 0.4775 | 0.4755 |

mjsuhonos · 2026-06-10T09:13:42Z

🚀🚀🚀

osma · 2026-06-10T10:02:45Z

I think this is now good enough. Internal benchmarks show that with the Finto AI YSO models, results are on average better than with the old NN ensemble, though on some data sets they got worse.

There is still another model variant (torch_nn_split) implemented as a prototype in the nn-ensemble-experiments repository. It is a bit more complex than this one and I know it works better on some data sets (esp. KOKO). Maybe it can be integrated with this backend as a variant in the future.

san-uh · 2026-06-12T12:34:12Z

Here are the results from a new test that @mfakaehler and I carried out, too.
This is a repeat of the test whose results we posted on 23 February 2026. For details of the test conditions (test case settings, single models, nn-ensemble parameters and technical settings), please see the earlier comment above: #926 (comment)

We used the classic NN ensemble based on Keras/TensorFlow with Annif 1.4.1 and the new PyTorch-based NN ensemble with Annif 1.5.0.dev0.

train (25,000 tocs)

| | nn classic 25k Keras/TensorFlow (main) -j80 | nn new 25k PyTorch (this PR) -j80 |
|---|---|---|
| real time | 66m12,713s | 17m53,712s |
| model disk size | 1024 MB (nn-train.mdb), 2.6 GB (nn-model.keras) | 99 MB (nn-train.mdb), 8.8 MB (nn-model.pt) |

The nn-train.mdb and model files are much smaller with the new PyTorch NN ensemble than with the classic one. The training time is also shorter (though this is, of course, partly due to the early stopping functionality).

Note: The new PyTorch NN ensemble stopped after the 4th epoch due to the early stopping functionality. Message:
Backend nn_ensemble: Epoch 4/50: NDCG=0.9869
Backend nn_ensemble: Model no longer improving, using best epoch 2.
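The log above is consistent with patience-based early stopping using patience=2: the best score was reached in epoch 2, and training stopped after two further epochs without improvement. The class below is a hypothetical minimal sketch of that convention, not the `EarlyStopping` class from the PR (whose internals are not shown here). The per-epoch nDCG values are invented for illustration, except for the epoch 4 value taken from the log.

```python
class EarlyStopping:
    """Stop when the monitored score has not improved for `patience` epochs."""

    def __init__(self, patience=2):
        self.patience = patience
        self.best_score = float("-inf")
        self.best_epoch = None
        self.bad_epochs = 0  # consecutive epochs without improvement

    def step(self, epoch, score):
        """Record one epoch's score; return True when training should stop."""
        if score > self.best_score:
            self.best_score = score
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Hypothetical nDCG values per epoch; only epoch 4 matches the log above
scores = {1: 0.9850, 2: 0.9875, 3: 0.9871, 4: 0.9869, 5: 0.9900}
stopper = EarlyStopping(patience=2)
for epoch, score in scores.items():
    if stopper.step(epoch, score):
        break
print(f"stopped after epoch {epoch}, best epoch {stopper.best_epoch}")
```

A real training loop would also keep a copy of the model weights from the best epoch so that they can be restored after stopping.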

eval (40,885 tocs)

| | nn classic 25k Keras/TensorFlow (main) -j80 | nn new 25k PyTorch (this PR) -j80 |
|---|---|---|
| real time | 12m53,328s | 10m23,170s |
| F1@5 (doc avg) | 0.3941 | 0.4075 |
| F1@10 (doc avg) | 0.2880 | 0.2946 |
| NDCG@10 | 0.6383 | 0.6613 |

The new PyTorch NN ensemble is 2 minutes 30 seconds faster in evaluation than the classic one.

There is only a small difference in quality, but the new PyTorch NN ensemble scores slightly better than the classic one on all three metrics.

index (40,885 tocs)

| | nn classic 25k Keras/TensorFlow (main) -j80 | nn new 25k PyTorch (this PR) -j80 |
|---|---|---|
| real time | 66m12,713s | 105m58,835s |

The classic NN ensemble requires about 0.097 seconds per document. The new PyTorch NN ensemble requires about 0.156 seconds per document.
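The per-document figures follow from dividing the index real time by the number of documents (40,885). A small sketch of that arithmetic; `parse_real_time` is a hypothetical helper that accepts the decimal comma used in the timings above.

```python
import re


def parse_real_time(text):
    """Convert a shell `time` value such as '66m12,713s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+)[,.](\d+)s", text)
    if match is None:
        raise ValueError(f"unrecognised time format: {text!r}")
    minutes, seconds, fraction = match.groups()
    return int(minutes) * 60 + float(f"{seconds}.{fraction}")


N_DOCUMENTS = 40_885
for label, real_time in [("nn classic", "66m12,713s"), ("nn new", "105m58,835s")]:
    per_doc = parse_real_time(real_time) / N_DOCUMENTS
    print(f"{label}: {per_doc:.4f} s/document")
```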

Thank you very much for this development @osma and best regards to Helsinki!

osma self-assigned this Jan 13, 2026

osma added the enhancement label Jan 13, 2026

osma force-pushed the issue895-nn-ensemble-pytorch branch from 5bdbf64 to d82a54a Compare January 13, 2026 11:40

osma added 3 commits January 15, 2026 14:25

switch dependency from tensorflow-cpu to torch (only cpu variant for now)

e04644e

NN ensemble basic functionality implemented using PyTorch

2d3e434

add pytorch and python version to NN model; remove model metadata functionality (for now)

da479eb

osma force-pushed the issue895-nn-ensemble-pytorch branch from d82a54a to da479eb Compare January 15, 2026 12:25

enable selecting PyTorch CPU or CUDA (12.8) variant through extras

e629963

osma mentioned this pull request Jan 15, 2026

Add ebm backend #914

Open

osma added 3 commits January 15, 2026 17:15

use torch-cpu extra in CI/CD and Dockerfile to select CPU-only PyTorch variant

3784155

define dependency group 'all' as a substitute for '--all-extras' and document it in README

541f2af

drop torch-cpu from 'all' group as it is not needed

e3fc7f9

osma added 4 commits January 16, 2026 15:03

add progress bar (using tqdm) for NN ensemble training loop

ff8c692

calculate nDCG after every training epoch (using torchmetrics package)

1660273

specify num_workers and weight_decay parameters

bf0cba0

cleanup

85057cd

osma requested a review from juhoinkinen January 16, 2026 13:54

osma added this to the 1.5 milestone Jan 16, 2026

osma added 2 commits January 16, 2026 16:07

remove unnecessary TensorFlow log level adjustment

fb38ef8

remove test for TF log level setting

9681ab3

osma marked this pull request as ready for review January 16, 2026 15:54

osma changed the title ~~[WIP] Reimplement NN ensemble using PyTorch~~ Reimplement NN ensemble using PyTorch Jan 16, 2026

osma added 3 commits January 21, 2026 14:05

adjust PyTorch model to better match old Keras model

faf3de7

fix test that broke

311a29c

switch to BCELoss (requires clamping output values)

1437d43

osma requested a review from Copilot June 2, 2026 06:25

Copilot started reviewing on behalf of osma June 2, 2026 06:25 View session


osma added 3 commits June 2, 2026 09:42

adjust pytorch and lmdb dependencies for latest versions (torch now supports CUDA 12.6 and 13.0 + 13.2 but no longer 12.8); declare tqdm dependency

ab8c413

adjust LMDBDataset __getitem__ and get_subset type hints and docstrings to match reality

b442c80

use binary targets for ndcg_batch

0bc987f

osma requested a review from Copilot June 2, 2026 06:52

Copilot started reviewing on behalf of osma June 2, 2026 06:53 View session

Copilot AI reviewed Jun 2, 2026

View reviewed changes

Comment thread annif/backend/nn_ensemble.py

Comment thread annif/backend/nn_ensemble.py Outdated

Comment thread README.md Outdated

osma added 2 commits June 2, 2026 09:59

update README for CUDA 12.6 and 13.0 specific pytorch extras

93e4063

fix early stopping patience calculation to match conventions

55eeaa3

osma requested a review from Copilot June 2, 2026 07:19

Copilot started reviewing on behalf of osma June 2, 2026 07:19 View session

Copilot AI reviewed Jun 2, 2026

View reviewed changes

Comment thread annif/backend/nn_ensemble.py Outdated

Comment thread annif/backend/nn_ensemble.py

Comment thread tests/test_backend_nn_ensemble.py

Comment thread README.md

remove unused constant EVAL_BATCH_SIZE

e6b934c

osma requested a review from Copilot June 2, 2026 07:56

Copilot started reviewing on behalf of osma June 2, 2026 07:57 View session

Copilot AI reviewed Jun 2, 2026

View reviewed changes

osma mentioned this pull request Jun 2, 2026

Fix REST API 500 error and schemathesis flaky test #941

Merged

osma force-pushed the issue895-nn-ensemble-pytorch branch from 57d2d88 to e6b934c Compare June 2, 2026 09:13

Merge branch 'main' into issue895-nn-ensemble-pytorch

e318f64

osma requested a review from juhoinkinen June 2, 2026 14:12

osma merged commit f6030a3 into main Jun 10, 2026
15 checks passed

osma deleted the issue895-nn-ensemble-pytorch branch June 10, 2026 10:03
