Skip to content

Stop EmbeddingIndex.build normalizing the caller's embeddings in place#40

Open
Gh-Novel wants to merge 1 commit into
adrida:mainfrom
Gh-Novel:fix/embedding-index-normalization
Open

Stop EmbeddingIndex.build normalizing the caller's embeddings in place#40
Gh-Novel wants to merge 1 commit into
adrida:mainfrom
Gh-Novel:fix/embedding-index-normalization

Conversation

@Gh-Novel

Copy link
Copy Markdown

Summary

Split out of #26 as requested — this is just the faiss/embedding-consistency bugfix, nothing else.

EmbeddingIndex.build calls faiss.normalize_L2(X), which works in place, on an array that aliases the caller's embeddings (api.py passes embeddings.astype(np.float32, copy=False)). Two consequences:

  1. fit() silently L2-normalizes the user's input array as a side effect.
  2. The persisted index.embeddings.npy is unit-norm, so a later update() vstacks normalized history with raw new embeddings and refits the surrogate on an inconsistent feature space — quiet drift in exactly the path the continual-learning loop depends on.

Change

Normalize a copy for the FAISS index only; persist (and expose) the embeddings in their original scale. metric="l2" path also gets an ascontiguousarray guard before idx.add. One file, +10/−3.

Tests

  • test_embedding_index_does_not_mutate_or_normalize_callerbuild() leaves the caller's array untouched and stores embeddings at original scale (fails on main, passes here).
  • test_fit_persists_raw_embeddings — the artifact round-trips raw embeddings through fit()EmbeddingIndex.load.

Full suite green.

The telemetry feature (routing_trace.json, n_retrains, temporal deltas) will follow separately in #26 once this lands.

faiss.normalize_L2 mutates its argument, and build() received an array that
aliases the embeddings the user passed to fit() (astype with copy=False). So
fit() silently rescaled the caller's array, and the persisted index stored
unit-norm vectors. A later update() then vstacked that normalized history with
raw new embeddings and refit the surrogate on an inconsistent feature space.

Normalize a copy for the FAISS index only and keep stored embeddings in their
original scale. The l2 metric path gets an ascontiguousarray guard before add.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant