-
Notifications
You must be signed in to change notification settings - Fork 195
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[REVIEW] Add tfidf bm25 #2353
base: branch-24.12
Are you sure you want to change the base?
[REVIEW] Add tfidf bm25 #2353
Conversation
[skip ci] Update master references for main branch
REL Fix `21.06` Release Changelog
[HOTFIX] Remove `-g` from cython compile commands
[RELEASE] v22.04
Our `devel` Docker containers need to be switched to using `conda` compilers to resolve a linking error. `raft` is in those containers, but hasn't yet been built with `conda` compilers. This PR addresses that. These changes won't cleanly merge into `branch-22.08` unfortunately due to the changes in rapidsai#641, but we can address that another time. Authors: - AJ Schmidt (https://github.com/ajschmidt8) - Corey J. Nolet (https://github.com/cjnolet) - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Corey J. Nolet (https://github.com/cjnolet)
[RELEASE] v22.06 raft
FIX update-version.sh
@shwina I'm going to apologize ahead of time for this, but i was trying to forward merge your branch 22.10 locally to create a new PR from it and I accidentally pushed to your remote branch. I cherry-picked the commits over to a new branch for the hotfix. Authors: - Bradley Dice (https://github.com/bdice) - Ashwin Srinath (https://github.com/shwina) Approvers: - Ray Douglass (https://github.com/raydouglass)
[RELEASE] raft v22.10.01
[RELEASE] raft v22.12.01 [skip-gpuci]
REL Update changelog v23.04
…to add-tfidf-bm25
SparseKNNInputs<value_idx, value_t> params; | ||
}; | ||
|
||
const std::vector<SparseKNNInputs<int, float>> inputs_i32_f = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible to add some additional test cases that generate random csr
matrix instead of hardcoding them? Just a suggestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh boy, I thought this has been updated / fixed. We definitely don't want to be hardcoding these.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, those hardcoded values come from the test that was in place. Take a look at what is currently available in main https://github.com/rapidsai/raft/blob/branch-24.12/cpp/test/sparse/neighbors/brute_force.cu. I felt it was acceptable to leave/use those hardcoded values, because the point of these tests here is not to ensure the brute_force works correctly, it is to check that the new interfaces I created for COO and CSR work correctly. If you want me to change, that is fine but then I think that means I need to create a Kernel for this. I dont believe that is the goal of this PR. But let me know if you think I need to make this change in this PR @cjnolet @rhdong
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It depends on you, the brute force(and other neighbor algo) has been moved to CUVS, and the test cases there could be more solid, just for your reference: https://github.com/rapidsai/cuvs/blob/branch-24.12/cpp/test/neighbors/brute_force.cu#L469 .
@@ -103,4 +106,171 @@ void brute_force_knn(const value_idx* idxIndptr, | |||
metricArg); | |||
} | |||
|
|||
/** | |||
* Search the sparse kNN for the k-nearest neighbors of a set of sparse query vectors | |||
* using some distance implementation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should add the comments for the template parameters.
float metricArg = 0) | ||
{ | ||
cudaStream_t stream = raft::resource::get_cuda_stream(handle); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we could add a judgment for 0
size data for idx
and query
, though it should happen rarely. (Considering the following code includes the logic of size() - 1
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rhdong do you think I should raise and error or just return before performing bfknn?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should depend on the logic: to return directly (keeping no change on the outputs), if it is normal to have zero-size input, or you could use RAFT_EXPECTS
to notify the caller.
auto host_matrix = raft::make_host_matrix<T2, int64_t>(handle, num_rows, num_cols); | ||
raft::copy(host_matrix.data_handle(), device_matrix.data_handle(), device_matrix.size(), stream); | ||
|
||
for (int i = 0; i < elements_size; i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unclear on the primary objective of this logic, but just a heads-up: unless you explicitly sync on the stream before this line, we can't assume host_matrix
will have the same value as device_matrix
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is happening here is that I am loading the incoming host data into a dense matrix. So we represent the COO arrays as a dense matrix and then I am copying that dense matrix from host memory to GPU memory. Before that line I am expecting that both the host and device matrices are zero filled. I did it this way to use raft APIs as much as possible. For loop on host memory did not seem like the most efficient way to fill a matrix.
* @param csr_in: Input CSR matrix | ||
* @param values_out: Output values array | ||
*/ | ||
template <typename T1, typename T2, typename IdxT> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The T1
, T2
might be a bit unclear; feel free to rename them to something more meaningful if you prefer.
using SparsePreprocessBm25Csr = SparsePreprocessCSR<float, int>; | ||
TEST_P(SparsePreprocessBm25Csr, Result) { Run(true); } | ||
|
||
const std::vector<SparsePreprocessInputs<float, int>> sparse_preprocess_inputs = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be conservative and ensure that there are no surprises after merging, it is best to add some use cases for larger matrices.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
larger random matrices is difficult to ensure, because rmat currently makes many duplicates during edge creation. This results in much smaller than anticipated number of edges. I think in its current form it would be misleading. But I can definitely pass much bigger parameters to RMAT. I dont think the end result will be what we expect. We need to first create a function that creates an RMAT and then removes duplicates and keeps looping through this logic until we get a set of edges of the desired amount that have no duplicates. This is outside of the purview of this PR, IMO. How do you feel about it @cjnolet?
This PR will add support for tfidf and BM25 preprocessing of sparse matrix. It does not require the user to work within the confines of the COO or CSR matrix. It only requires the triplets of data ( row, column, value). With this information, we are able to preprocess the values accordingly. Putting this up to get eyes on this, to make sure this is going in the correct direction or if not, to adjust.
Unit tests are still required for these features.