This is a multi-dimensional learned index that processes the following queries efficiently.
- Enumerate data points inside a hyperrectangle parallel to the axes
- Add/Delete data points
We used gnu++17 and python3 for the internal implementation.
We also used scikit-learn.
The sample code is main.cpp.
First, generate a sample dataset using the following command.
./data_generator.sh 3 100000 1000 2000000 ./data/data.csv ./data/sample_query.csv ./data/query.csv
After that, main.cpp can be executed by the following command.
./running.sh main.cpp ./data/data.csv ./data/sample_query.csv ./data/query.csv
The following is a detailed document.
Sections marked with 💻 are for those who understand the internal algorithm.
We designed it so that even if the sections marked with 💻 were ignored, it would not be a problem to use it as a black box.
(1) FlexFlood<typename Type, int dimension> index(std::vector<std::array<Type, dimension>>::iterator begin, std::vector<std::array<Type, dimension>>::iterator end, std::vector<std::array<Type, dimension>> sample_lowers, std::vector<std::array<Type, dimension>> sample_uppers, int num_sampling = 30 * dimension, double learning_time = 60.0 * dimension)
(2 💻) FlexFlood<typename Type, int dimension> index(std::vector<std::array<Type, dimension>>::iterator begin, std::vector<std::array<Type, dimension>>::iterator end, std::array<int, dimension> partitioning)- Constructs a FlexFlood
indexfromdimension-dimensional data of typeTypebetweenbeginandend. - (1) Optimizes the internal structure based on sample queries. Provide the coordinates of the lower corners of the hyperrectangles in
sample_lowersand the upper corners insample_uppers. - (1) 💻 The number of samples for learning the distribution can be specified with
num_sampling, and the learning time withlearning_time. - (2) Initializes based on the number of partitions for each axis specified in
partitioning. Note that at least one value inpartitioningmust be$1$ for the internal algorithm to function properly.
Constraints
- (1)
sample_lowersandsample_uppersmust have the same length. - (2) At least one value in
partitioningmust be$1$ . - (2) Each value in
partitioningmust be positive.
void index.clear()Deletes all elements in index.
size_t index.size()Returns the number of elements in index.
std::vector<std::array<Type, dimension>> enumerate(std::array<Type, dimension> lower, std::array<Type, dimension> upper)Returns all data points within index that are inside the hyperrectangle defined by the lower and upper corners.
Constraints
- For each dimension, the value of
lowermust be less than or equal to the value ofupper(behavior is undefined if this condition is not met).
void insert(std::array<Type, dimension> dat, std::pair<int, int> lower = std::make_pair(3, 1), std::pair<int, int> upper = std::make_pair(1, 2))- Adds
dattoindex. - If
indexalready containsdat, no action is taken, andsize()remains unchanged. - 💻
lowerandupperspecify thresholds for internal processing ofsplitandmerge/equalize. - 💻
lower = std::make_pair(a, b)means that the threshold coefficient forsplitis$\frac{b}{a}$ . - 💻
upper = std::make_pair(a, b)means that the threshold coefficient formerge/equalizeis$\frac{b}{a}$ .
void erase(std::array<Type, dimension> dat, std::pair<int, int> lower = std::make_pair(3, 1), std::pair<int, int> upper = std::make_pair(1, 2))- Deletes
datfromindex. - If
indexdoes not containdat, no action is taken, andsize()remains unchanged. - 💻
lowerandupperspecify thresholds for internal processing ofsplitandmerge/equalize. - 💻
lower = std::make_pair(a, b)means that the threshold coefficient forsplitis$\frac{b}{a}$ . - 💻
upper = std::make_pair(a, b)means that the threshold coefficient formerge/equalizeis$\frac{b}{a}$ .