Stop EmbeddingIndex.build normalizing the caller's embeddings in place#40
Open
Gh-Novel wants to merge 1 commit into
Open
Stop EmbeddingIndex.build normalizing the caller's embeddings in place#40Gh-Novel wants to merge 1 commit into
Gh-Novel wants to merge 1 commit into
Conversation
faiss.normalize_L2 mutates its argument, and build() received an array that aliases the embeddings the user passed to fit() (astype with copy=False). So fit() silently rescaled the caller's array, and the persisted index stored unit-norm vectors. A later update() then vstacked that normalized history with raw new embeddings and refit the surrogate on an inconsistent feature space. Normalize a copy for the FAISS index only and keep stored embeddings in their original scale. The l2 metric path gets an ascontiguousarray guard before add.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Split out of #26 as requested — this is just the faiss/embedding-consistency bugfix, nothing else.
EmbeddingIndex.buildcallsfaiss.normalize_L2(X), which works in place, on an array that aliases the caller's embeddings (api.pypassesembeddings.astype(np.float32, copy=False)). Two consequences:fit()silently L2-normalizes the user's input array as a side effect.index.embeddings.npyis unit-norm, so a laterupdate()vstacks normalized history with raw new embeddings and refits the surrogate on an inconsistent feature space — quiet drift in exactly the path the continual-learning loop depends on.Change
Normalize a copy for the FAISS index only; persist (and expose) the embeddings in their original scale.
metric="l2"path also gets anascontiguousarrayguard beforeidx.add. One file, +10/−3.Tests
test_embedding_index_does_not_mutate_or_normalize_caller—build()leaves the caller's array untouched and stores embeddings at original scale (fails onmain, passes here).test_fit_persists_raw_embeddings— the artifact round-trips raw embeddings throughfit()→EmbeddingIndex.load.Full suite green.
The telemetry feature (
routing_trace.json,n_retrains, temporal deltas) will follow separately in #26 once this lands.