Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FilterStore: unifying filter specific logic #452

Open
wants to merge 52 commits into
base: main
Choose a base branch
from

Conversation

yashpatel007
Copy link
Contributor

@yashpatel007 yashpatel007 commented Sep 11, 2023

  • Does this PR have a descriptive title that could go in our release notes?
  • Does this PR add any new dependencies?
  • Does this PR modify any existing APIs?
    • Is the change to the API backwards compatible?
  • Should this result in any changes to our documentation, either updating existing docs or adding new ones?

What does this implement/fix? Briefly explain your changes.

This is an effort to unify all filter specific operations and also abstract away detection of common filters.

Idea is to have a class "FilterManager" that takes charge of all the filter specific data structures, their operations and also make this class responsible for its own synchronization. Makes code in index even more readable and organized.

Any other comments?

@yashpatel007 yashpatel007 changed the title A FilterHandler unifying filter specific logic A FilterManager unifying filter specific logic Sep 12, 2023
@yashpatel007 yashpatel007 requested a review from rakri September 13, 2023 14:53
@yashpatel007 yashpatel007 changed the title A FilterManager unifying filter specific logic FilterManager: unifying filter specific logic Sep 13, 2023
@yashpatel007 yashpatel007 force-pushed the patelyash/filter_handler branch from 9bcedad to 887e644 Compare September 18, 2023 15:09
@yashpatel007 yashpatel007 changed the title FilterManager: unifying filter specific logic FilterStore: unifying filter specific logic Sep 18, 2023
// universal label
bool _use_universal_label = false;
tsl::robin_set<label_type> _mapped_universal_labels;
std::set<std::string> _raw_universal_labels;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Label_type , string for single filter.
structure for multiple filter will be decided later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, it does not hurt to keep this. the logic works fine. but if you think we just need single filter for now. feel free to change the logic.

@@ -343,7 +344,9 @@ int main(int argc, char **argv)
path labels_file_to_use = final_index_path_prefix + "_label_formatted.txt";
path labels_map_file = final_index_path_prefix + "_labels_map.txt";

convert_labels_string_to_int(label_data_path, labels_file_to_use, labels_map_file, universal_label);
std::set<std::string> raw_universal_label_set = {universal_label};
diskann::InMemFilterStore<uint32_t>::convert_labels_string_to_int(label_data_path, labels_file_to_use,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't it be the responsibility of filter store to convert labels from strings to int?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this here because, the end user does not know if we have an internal mapping and how its done, similarly Index does not know that string labels is a thing, it just expects number labels. So filter store seems like an ideal place to do this conversion. That way it can also keep the mapping logic and also the map itself.

include/abstract_filter_store.h Show resolved Hide resolved
include/abstract_filter_store.h Outdated Show resolved Hide resolved
include/abstract_filter_store.h Show resolved Hide resolved
include/abstract_filter_store.h Show resolved Hide resolved
include/abstract_filter_store.h Outdated Show resolved Hide resolved
DISKANN_DLLEXPORT virtual void save_medoids(const std::string &save_path) = 0;
DISKANN_DLLEXPORT virtual void save_label_map(const std::string &save_path) = 0;
DISKANN_DLLEXPORT virtual void save_universal_label(const std::string &save_path) = 0;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we also need a public load() which calls the protected load* methods?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, need to do this, and perhaps remove the "friend" relation between filter store and index class?

@@ -103,7 +103,8 @@ class AbstractIndex
// memory should be allocated for vec before calling this function
template <typename tag_type, typename data_type> int get_vector_by_tag(tag_type &tag, data_type *vec);

template <typename label_type> void set_universal_label(const label_type universal_label);
virtual void set_universal_labels(const std::vector<std::string> &raw_universal_labels,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if Index class should get involved in setting universal labels. Why not set it directly in the filter store when it is created?

include/abstract_filter_store.h Show resolved Hide resolved
// This class is responsible for filter actions in index, and should not be used outside.
template <typename label_type> class AbstractFilterStore
{
public:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also didn't find methods to expand and shrink the filter store.

std::set<std::string> _raw_universal_labels;

// populates _loaction_to labels and _labels from given label file
size_t parse_label_file(const std::string &label_file);
Copy link
Contributor

@NeelamMahapatro NeelamMahapatro Oct 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename filter_utils to gt_filter_utils

bool detect_common_filters(uint32_t point_id, bool search_invocation,
const std::vector<label_type> &incoming_labels,
const FilterMatchStrategy filter_match_strategy) override;

Copy link
Contributor

@NeelamMahapatro NeelamMahapatro Oct 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

define 2 detect_common_filters for base points and queries
type of incoming_labels vector<vector<>>
search_invocation - not required if 2 functions are defined

@@ -369,18 +372,14 @@ template <typename T, typename TagT = uint32_t, typename LabelT = uint32_t> clas
// Filter Support

bool _filtered_index = false;
// Location to label is only updated during insert_point(), all other reads are protected by
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

insert_point should take string formatted labels (since exposed to user). conversion to label type should happen internally

NeelamMahapatro and others added 12 commits November 22, 2023 14:59
1. fix SSD label mapping
2. Removed dynamic index parameter from set_universal_label funtion
- Get converted label to Get Numeric Label and convert_label_string_to_int to convert_label_to_numeric

- SSD Label mapping fix - Calculate global_string_to_label in entire large file
Copy link
Contributor

@rakri rakri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once we address the comments (mostly about who owns medoids), and test the streaming filtered index, I am ok with this PR, good to go!

@@ -311,12 +308,11 @@ void build_incremental_index(const std::string &data_path, diskann::IndexWritePa
{
const auto save_path_inc = get_save_filename(save_path + ".after-delete-", points_to_skip,
points_to_delete_from_beginning, last_point_threshold);
std::string labels_file_to_use = save_path_inc + "_label_formatted.txt";
std::string labels_file_to_use = save_path_inc + "_label_numeric.txt";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if numeric labels are not exposed, why are we using numeric labels here?

include/abstract_filter_store.h Outdated Show resolved Hide resolved
DISKANN_DLLEXPORT virtual void save_medoids(const std::string &save_path) = 0;
DISKANN_DLLEXPORT virtual void save_label_map(const std::string &save_path) = 0;
DISKANN_DLLEXPORT virtual void save_universal_label(const std::string &save_path) = 0;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, need to do this, and perhaps remove the "friend" relation between filter store and index class?

@@ -755,9 +768,16 @@ int build_merged_vamana_index(std::string base_file, diskann::Metric compareMetr
diskann::cout << timer.elapsed_seconds_for_step("building indices on shards") << std::endl;

timer.reset();
const std::string disk_index_path = mem_index_path.substr(0, mem_index_path.size() - 9) + "disk.index";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make this hardcoding easier to parse, what is -9?

template <typename label_type>
void InMemFilterStore<label_type>::update_medoid_by_label(const label_type &label, const uint32_t new_medoid)
{
_label_to_medoid_id[label] = new_medoid;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if label_to_medoid_id is property of index, why is it here?

}

template <typename label_type>
size_t InMemFilterStore<label_type>::load_medoids(const std::string &labels_to_medoid_file)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check if needed here or not..

}
map_writer.close();
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need a uber-method called save which can call the necessary saves the store requires like label map, etc.

// Find common filter between a node's labels and a given set of labels, while
// taking into account universal label
template <typename T, typename TagT, typename LabelT>
bool Index<T, TagT, LabelT>::detect_common_filters(uint32_t point_id, bool search_invocation,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we remove index::detect_common_filters, and directly call filter_Store?

src/index.cpp Outdated
@@ -1758,6 +1697,7 @@ void Index<T, TagT, LabelT>::build(const std::string &data_file, const size_t nu
template <typename T, typename TagT, typename LabelT>
std::unordered_map<std::string, LabelT> Index<T, TagT, LabelT>::load_label_map(const std::string &labels_map_file)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

index need not have a load_label_map itself.. all label related loads should occur in filter_store->load()

src/index.cpp Outdated
diskann::cerr << stream.str() << std::endl;
throw diskann::ANNException(stream.str(), -1, __FUNCSIG__, __FILE__, __LINE__);
}

template <typename T, typename TagT, typename LabelT>
void Index<T, TagT, LabelT>::parse_label_file(const std::string &label_file, size_t &num_points)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should remove index::parse_label_file, and call load_labels in the calling function.. _filter_store->populate_labels can be called in build_filtered_index function, or whereever the label file is given in the build pipeline?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants