-
Notifications
You must be signed in to change notification settings - Fork 229
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FilterStore: unifying filter specific logic #452
base: main
Are you sure you want to change the base?
Conversation
…ered_index now takes raw universal label
… accordingly + filter build takes raw label file now
9bcedad
to
887e644
Compare
include/in_mem_filter_store.h
Outdated
// universal label | ||
bool _use_universal_label = false; | ||
tsl::robin_set<label_type> _mapped_universal_labels; | ||
std::set<std::string> _raw_universal_labels; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Label_type , string for single filter.
structure for multiple filter will be decided later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean, it does not hurt to keep this. the logic works fine. but if you think we just need single filter for now. feel free to change the logic.
apps/build_stitched_index.cpp
Outdated
@@ -343,7 +344,9 @@ int main(int argc, char **argv) | |||
path labels_file_to_use = final_index_path_prefix + "_label_formatted.txt"; | |||
path labels_map_file = final_index_path_prefix + "_labels_map.txt"; | |||
|
|||
convert_labels_string_to_int(label_data_path, labels_file_to_use, labels_map_file, universal_label); | |||
std::set<std::string> raw_universal_label_set = {universal_label}; | |||
diskann::InMemFilterStore<uint32_t>::convert_labels_string_to_int(label_data_path, labels_file_to_use, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't it be the responsibility of filter store to convert labels from strings to int?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this here because, the end user does not know if we have an internal mapping and how its done, similarly Index does not know that string labels is a thing, it just expects number labels. So filter store seems like an ideal place to do this conversion. That way it can also keep the mapping logic and also the map itself.
DISKANN_DLLEXPORT virtual void save_medoids(const std::string &save_path) = 0; | ||
DISKANN_DLLEXPORT virtual void save_label_map(const std::string &save_path) = 0; | ||
DISKANN_DLLEXPORT virtual void save_universal_label(const std::string &save_path) = 0; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we also need a public load() which calls the protected load* methods?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, need to do this, and perhaps remove the "friend" relation between filter store and index class?
include/abstract_index.h
Outdated
@@ -103,7 +103,8 @@ class AbstractIndex | |||
// memory should be allocated for vec before calling this function | |||
template <typename tag_type, typename data_type> int get_vector_by_tag(tag_type &tag, data_type *vec); | |||
|
|||
template <typename label_type> void set_universal_label(const label_type universal_label); | |||
virtual void set_universal_labels(const std::vector<std::string> &raw_universal_labels, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if Index class should get involved in setting universal labels. Why not set it directly in the filter store when it is created?
// This class is responsible for filter actions in index, and should not be used outside. | ||
template <typename label_type> class AbstractFilterStore | ||
{ | ||
public: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also didn't find methods to expand and shrink the filter store.
include/in_mem_filter_store.h
Outdated
std::set<std::string> _raw_universal_labels; | ||
|
||
// populates _loaction_to labels and _labels from given label file | ||
size_t parse_label_file(const std::string &label_file); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rename filter_utils to gt_filter_utils
bool detect_common_filters(uint32_t point_id, bool search_invocation, | ||
const std::vector<label_type> &incoming_labels, | ||
const FilterMatchStrategy filter_match_strategy) override; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
define 2 detect_common_filters for base points and queries
type of incoming_labels vector<vector<>>
search_invocation - not required if 2 functions are defined
@@ -369,18 +372,14 @@ template <typename T, typename TagT = uint32_t, typename LabelT = uint32_t> clas | |||
// Filter Support | |||
|
|||
bool _filtered_index = false; | |||
// Location to label is only updated during insert_point(), all other reads are protected by |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
insert_point should take string formatted labels (since exposed to user). conversion to label type should happen internally
…ft/DiskANN into patelyash/filter_handler
1. fix SSD label mapping 2. Removed dynamic index parameter from set_universal_label funtion
- Get converted label to Get Numeric Label and convert_label_string_to_int to convert_label_to_numeric - SSD Label mapping fix - Calculate global_string_to_label in entire large file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once we address the comments (mostly about who owns medoids), and test the streaming filtered index, I am ok with this PR, good to go!
@@ -311,12 +308,11 @@ void build_incremental_index(const std::string &data_path, diskann::IndexWritePa | |||
{ | |||
const auto save_path_inc = get_save_filename(save_path + ".after-delete-", points_to_skip, | |||
points_to_delete_from_beginning, last_point_threshold); | |||
std::string labels_file_to_use = save_path_inc + "_label_formatted.txt"; | |||
std::string labels_file_to_use = save_path_inc + "_label_numeric.txt"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if numeric labels are not exposed, why are we using numeric labels here?
DISKANN_DLLEXPORT virtual void save_medoids(const std::string &save_path) = 0; | ||
DISKANN_DLLEXPORT virtual void save_label_map(const std::string &save_path) = 0; | ||
DISKANN_DLLEXPORT virtual void save_universal_label(const std::string &save_path) = 0; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, need to do this, and perhaps remove the "friend" relation between filter store and index class?
src/disk_utils.cpp
Outdated
@@ -755,9 +768,16 @@ int build_merged_vamana_index(std::string base_file, diskann::Metric compareMetr | |||
diskann::cout << timer.elapsed_seconds_for_step("building indices on shards") << std::endl; | |||
|
|||
timer.reset(); | |||
const std::string disk_index_path = mem_index_path.substr(0, mem_index_path.size() - 9) + "disk.index"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make this hardcoding easier to parse, what is -9?
src/in_mem_filter_store.cpp
Outdated
template <typename label_type> | ||
void InMemFilterStore<label_type>::update_medoid_by_label(const label_type &label, const uint32_t new_medoid) | ||
{ | ||
_label_to_medoid_id[label] = new_medoid; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if label_to_medoid_id is property of index, why is it here?
src/in_mem_filter_store.cpp
Outdated
} | ||
|
||
template <typename label_type> | ||
size_t InMemFilterStore<label_type>::load_medoids(const std::string &labels_to_medoid_file) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
check if needed here or not..
} | ||
map_writer.close(); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need a uber-method called save which can call the necessary saves the store requires like label map, etc.
// Find common filter between a node's labels and a given set of labels, while | ||
// taking into account universal label | ||
template <typename T, typename TagT, typename LabelT> | ||
bool Index<T, TagT, LabelT>::detect_common_filters(uint32_t point_id, bool search_invocation, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we remove index::detect_common_filters, and directly call filter_Store?
src/index.cpp
Outdated
@@ -1758,6 +1697,7 @@ void Index<T, TagT, LabelT>::build(const std::string &data_file, const size_t nu | |||
template <typename T, typename TagT, typename LabelT> | |||
std::unordered_map<std::string, LabelT> Index<T, TagT, LabelT>::load_label_map(const std::string &labels_map_file) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
index need not have a load_label_map itself.. all label related loads should occur in filter_store->load()
src/index.cpp
Outdated
diskann::cerr << stream.str() << std::endl; | ||
throw diskann::ANNException(stream.str(), -1, __FUNCSIG__, __FILE__, __LINE__); | ||
} | ||
|
||
template <typename T, typename TagT, typename LabelT> | ||
void Index<T, TagT, LabelT>::parse_label_file(const std::string &label_file, size_t &num_points) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should remove index::parse_label_file, and call load_labels in the calling function.. _filter_store->populate_labels can be called in build_filtered_index function, or whereever the label file is given in the build pipeline?
What does this implement/fix? Briefly explain your changes.
This is an effort to unify all filter specific operations and also abstract away detection of common filters.
Idea is to have a class "FilterManager" that takes charge of all the filter specific data structures, their operations and also make this class responsible for its own synchronization. Makes code in index even more readable and organized.
Any other comments?