FilterStore: unifying filter specific logic #452

Open
wants to merge 52 commits into
base: main
Changes from 44 commits
Commits
c445d20
setting up filter handler class
yashpatel007 Sep 11, 2023
cf389f8
build error fix minor
yashpatel007 Sep 11, 2023
2c07ffc
replacing code in index.cpp with filter handler code
yashpatel007 Sep 12, 2023
497abca
resolving labels build failure
yashpatel007 Sep 12, 2023
b5dfd90
renaming from filter_handler to filter_manager
yashpatel007 Sep 12, 2023
bc92303
fixing build error
yashpatel007 Sep 12, 2023
77c5444
small error fix
yashpatel007 Sep 12, 2023
0fa86c2
error fix for disk build
yashpatel007 Sep 12, 2023
1faa5b8
some minor chnages + small update
yashpatel007 Sep 13, 2023
ff7955a
build error fix
yashpatel007 Sep 13, 2023
c86f085
finally fixing disk build & search error
yashpatel007 Sep 13, 2023
275afc1
seperating .h and .cpp file for filter manager
yashpatel007 Sep 13, 2023
91eb6c0
adding DLLEXPORT to methods
yashpatel007 Sep 13, 2023
a0cd607
fixing build err
yashpatel007 Sep 13, 2023
9402f01
adding abstract class + renaming filter manager to store + build_filt…
yashpatel007 Sep 14, 2023
26aa806
refac to support multiple universal labels + disk index is refactored…
yashpatel007 Sep 15, 2023
0c73589
minor update
yashpatel007 Sep 15, 2023
9bcedad
minor update
yashpatel007 Sep 15, 2023
887e644
resolving some errors
yashpatel007 Sep 18, 2023
a9ab92f
small naming changes
yashpatel007 Sep 19, 2023
119ee63
internally updating code to use multiplr universal labels
yashpatel007 Sep 19, 2023
332de43
resolving build error with stitched vamana index
yashpatel007 Sep 19, 2023
e471cf9
experiment : Matching filetrs based on filter match stratagy
yashpatel007 Sep 20, 2023
3992c97
set_universal_labels takes raw labels and maps it accordingly
yashpatel007 Sep 21, 2023
b18ce98
updating code + minor syntax fix
yashpatel007 Sep 21, 2023
8a5c700
updating code to use one source for mapping labels
yashpatel007 Sep 22, 2023
eded185
merging from main
yashpatel007 Oct 4, 2023
392c0ec
build error fix
yashpatel007 Oct 5, 2023
8cfcd5f
minor fix sync frommain
yashpatel007 Oct 5, 2023
f4b430b
fixing error for build stitched index
yashpatel007 Oct 5, 2023
21925ee
improve label loading time
NeelamMahapatro Oct 16, 2023
1cb4aae
small fix
NeelamMahapatro Oct 16, 2023
7edc594
Merge branch 'patelyash/filter_handler' of https://github.com/microso…
NeelamMahapatro Nov 22, 2023
8ba9475
Merge latest main
NeelamMahapatro Nov 22, 2023
11f8be4
Merge latest main branch and fix compile error
NeelamMahapatro Nov 22, 2023
e978f98
fix SSD label mapping
NeelamMahapatro Nov 24, 2023
ccb187c
changing insert_point interface for exposing to user
NeelamMahapatro Nov 27, 2023
9ab9445
test it later
NeelamMahapatro Dec 11, 2023
3cfaf3e
fix build errors
NeelamMahapatro Dec 20, 2023
4f1e81b
Renaming Converted label functions and SSD label to medoid fix
NeelamMahapatro Dec 20, 2023
a7f6b44
move medoid calculation to index class
NeelamMahapatro Jan 14, 2024
6f0f8f5
merge main branch
NeelamMahapatro Jan 14, 2024
68e1dbf
fix
NeelamMahapatro Jan 14, 2024
615247a
clang formatted
Jan 23, 2024
0b9118f
minor changes as per PR discussion
Jan 24, 2024
7462bd0
remove medoid related data and methods to index class
NeelamMahapatro Jan 24, 2024
84174bd
clang format fix
NeelamMahapatro Jan 24, 2024
86159e8
fix disk index build
NeelamMahapatro Jan 25, 2024
ce7db4d
fix in builder
NeelamMahapatro Jan 28, 2024
82f7182
clang fix
NeelamMahapatro Jan 28, 2024
e669521
change raw_labels to populate_labels
NeelamMahapatro Jan 29, 2024
df38242
combine load and save method
NeelamMahapatro Jan 30, 2024
1 change: 1 addition & 0 deletions apps/build_memory_index.cpp
@@ -144,6 +144,7 @@ int main(int argc, char **argv)
.with_index_write_params(index_build_params)
.is_enable_tags(false)
.is_use_opq(use_opq)
+ .is_filtered(label_file != "")
.is_pq_dist_build(use_pq_build)
.with_num_pq_chunks(build_PQ_bytes)
.build();
7 changes: 5 additions & 2 deletions apps/build_stitched_index.cpp
@@ -18,6 +18,7 @@
#include "memory_mapper.h"
#include "parameters.h"
#include "utils.h"
+ #include "in_mem_filter_store.h"
#include "program_options_utils.hpp"

namespace po = boost::program_options;
@@ -341,10 +342,12 @@ int main(int argc, char **argv)
handle_args(argc, argv, data_type, input_data_path, final_index_path_prefix, label_data_path, universal_label,
num_threads, R, L, stitched_R, alpha);

- path labels_file_to_use = final_index_path_prefix + "_label_formatted.txt";
+ path labels_file_to_use = final_index_path_prefix + "_label_numeric.txt";
path labels_map_file = final_index_path_prefix + "_labels_map.txt";

- convert_labels_string_to_int(label_data_path, labels_file_to_use, labels_map_file, universal_label);
+ std::string raw_universal_label = universal_label;
+ diskann::InMemFilterStore<uint32_t>::convert_label_to_numeric(label_data_path, labels_file_to_use, labels_map_file,
+ raw_universal_label);

// 2. parse label file and create necessary data structures
std::vector<label_set> point_ids_to_labels;
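The hunk above swaps `convert_labels_string_to_int` for `InMemFilterStore<uint32_t>::convert_label_to_numeric`, which turns raw string labels into numeric IDs and writes a map file. A minimal sketch of that kind of mapping follows. It assumes comma-separated labels per point, IDs handed out in first-seen order starting at 1, and ID 0 reserved for the universal label (suggested by the removed `u_label = 0` lines, but still an assumption). `LabelMapper`, `to_numeric`, and `convert_line` are hypothetical names, not the PR's API.

```cpp
#include <cstdint>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper, not part of DiskANN: assigns numeric IDs to raw string labels.
class LabelMapper
{
  public:
    // Assumption: ID 0 is reserved for the universal label, if one is given.
    explicit LabelMapper(const std::string &universal_label = "")
    {
        if (!universal_label.empty())
            _map[universal_label] = 0;
    }

    // Returns the existing ID for a label, or assigns the next unused one.
    uint32_t to_numeric(const std::string &raw_label)
    {
        auto it = _map.find(raw_label);
        if (it != _map.end())
            return it->second;
        uint32_t id = _next_id++;
        _map.emplace(raw_label, id);
        return id;
    }

    // Converts one comma-separated line of raw labels into numeric IDs.
    std::vector<uint32_t> convert_line(const std::string &line)
    {
        std::vector<uint32_t> ids;
        std::istringstream iss(line);
        std::string token;
        while (std::getline(iss, token, ','))
        {
            if (!token.empty())
                ids.push_back(to_numeric(token));
        }
        return ids;
    }

    // Read-only view of the string-to-ID map, e.g. for writing a map file.
    const std::unordered_map<std::string, uint32_t> &mapping() const
    {
        return _map;
    }

  private:
    std::unordered_map<std::string, uint32_t> _map;
    uint32_t _next_id = 1;
};
```

Keeping a single map as the source of truth means the same raw label always resolves to the same ID, whether it arrives at build time or in a query filter.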
4 changes: 2 additions & 2 deletions apps/search_disk_index.cpp
@@ -239,11 +239,11 @@ int search_disk_index(diskann::Metric &metric, const std::string &index_path_pre
LabelT label_for_search;
if (query_filters.size() == 1)
{ // one label for all queries
- label_for_search = _pFlashIndex->get_converted_label(query_filters[0]);
+ label_for_search = _pFlashIndex->get_numeric_label(query_filters[0]);
}
else
{ // one label for each query
- label_for_search = _pFlashIndex->get_converted_label(query_filters[i]);
+ label_for_search = _pFlashIndex->get_numeric_label(query_filters[i]);
}
_pFlashIndex->cached_beam_search(
query + (i * query_aligned_dim), recall_at, L, query_result_ids_64.data() + (i * recall_at),
18 changes: 7 additions & 11 deletions apps/test_insert_deletes_consolidate.cpp
@@ -94,7 +94,7 @@ std::string get_save_filename(const std::string &save_path, size_t points_to_ski

template <typename T, typename TagT, typename LabelT>
void insert_till_next_checkpoint(diskann::AbstractIndex &index, size_t start, size_t end, int32_t thread_count, T *data,
- size_t aligned_dim, std::vector<std::vector<LabelT>> &location_to_labels)
+ size_t aligned_dim, std::vector<std::vector<std::string>> &location_to_labels)
{
diskann::Timer insert_timer;
#pragma omp parallel for num_threads(thread_count) schedule(dynamic)
@@ -187,10 +187,10 @@ void build_incremental_index(const std::string &data_path, diskann::IndexWritePa
diskann::IndexFactory index_factory = diskann::IndexFactory(index_config);
auto index = index_factory.create_instance();

+ /* remove set_universal_label from here and set it through filter store only*/
if (universal_label != "")
{
- LabelT u_label = 0;
- index->set_universal_label(u_label);
+ index->set_universal_labels(universal_label);
}

if (points_to_skip > num_points)
@@ -255,18 +255,15 @@ void build_incremental_index(const std::string &data_path, diskann::IndexWritePa
<< " points since the data file has only that many" << std::endl;
}

- std::vector<std::vector<LabelT>> location_to_labels;
+ std::vector<std::vector<std::string>> location_to_labels;
if (concurrent)
{
// handle labels
const auto save_path_inc = get_save_filename(save_path + ".after-concurrent-delete-", points_to_skip,
points_to_delete_from_beginning, last_point_threshold);
- std::string labels_file_to_use = save_path_inc + "_label_formatted.txt";
- std::string mem_labels_int_map_file = save_path_inc + "_labels_map.txt";
if (has_labels)
{
- convert_labels_string_to_int(label_file, labels_file_to_use, mem_labels_int_map_file, universal_label);
- auto parse_result = diskann::parse_formatted_label_file<LabelT>(labels_file_to_use);
+ auto parse_result = diskann::parse_raw_label_file(label_file);
location_to_labels = std::get<0>(parse_result);
}

@@ -311,12 +308,11 @@ void build_incremental_index(const std::string &data_path, diskann::IndexWritePa
{
const auto save_path_inc = get_save_filename(save_path + ".after-delete-", points_to_skip,
points_to_delete_from_beginning, last_point_threshold);
- std::string labels_file_to_use = save_path_inc + "_label_formatted.txt";
+ std::string labels_file_to_use = save_path_inc + "_label_numeric.txt";
Contributor
if numeric labels are not exposed, why are we using numeric labels here?

std::string mem_labels_int_map_file = save_path_inc + "_labels_map.txt";
if (has_labels)
{
- convert_labels_string_to_int(label_file, labels_file_to_use, mem_labels_int_map_file, universal_label);
- auto parse_result = diskann::parse_formatted_label_file<LabelT>(labels_file_to_use);
+ auto parse_result = diskann::parse_raw_label_file(labels_file_to_use);
location_to_labels = std::get<0>(parse_result);
}

13 changes: 5 additions & 8 deletions apps/test_streaming_scenario.cpp
@@ -87,7 +87,7 @@ std::string get_save_filename(const std::string &save_path, size_t active_window

template <typename T, typename TagT, typename LabelT>
void insert_next_batch(diskann::AbstractIndex &index, size_t start, size_t end, size_t insert_threads, T *data,
- size_t aligned_dim, std::vector<std::vector<LabelT>> &pts_to_labels)
+ size_t aligned_dim, std::vector<std::vector<std::string>> &pts_to_labels)
{
try
{
@@ -211,16 +211,14 @@ void build_incremental_index(const std::string &data_path, const uint32_t L, con
size_t dim, aligned_dim;
size_t num_points;

- std::vector<std::vector<LabelT>> pts_to_labels;
+ std::vector<std::vector<std::string>> pts_to_labels;

const auto save_path_inc =
get_save_filename(save_path + ".after-streaming-", active_window, consolidate_interval, max_points_to_insert);
- std::string labels_file_to_use = save_path_inc + "_label_formatted.txt";
- std::string mem_labels_int_map_file = save_path_inc + "_labels_map.txt";

if (has_labels)
{
- convert_labels_string_to_int(label_file, labels_file_to_use, mem_labels_int_map_file, universal_label);
- auto parse_result = diskann::parse_formatted_label_file<LabelT>(labels_file_to_use);
+ auto parse_result = diskann::parse_raw_label_file(label_file);
pts_to_labels = std::get<0>(parse_result);
}

@@ -253,8 +251,7 @@ void build_incremental_index(const std::string &data_path, const uint32_t L, con

if (universal_label != "")
{
- LabelT u_label = 0;
- index->set_universal_label(u_label);
+ index->set_universal_labels(universal_label);
}

if (max_points_to_insert == 0)
76 changes: 76 additions & 0 deletions include/abstract_filter_store.h
@@ -0,0 +1,76 @@
#pragma once
#include "common_includes.h"
#include "utils.h"
#include <any>

namespace diskann
{

enum class FilterMatchStrategy
{
SET_INTERSECTION
};
// This class is responsible for filter actions in index, and should not be used outside.
template <typename label_type> class AbstractFilterStore
{
public:
Contributor
I also didn't find methods to expand and shrink the filter store.

DISKANN_DLLEXPORT AbstractFilterStore(const size_t num_points);
virtual ~AbstractFilterStore() = default;

// needs some internal lock + abstract implementation
DISKANN_DLLEXPORT virtual bool detect_common_filters(
uint32_t point_id, bool search_invocation, const std::vector<label_type> &incoming_labels,
const FilterMatchStrategy strategy = FilterMatchStrategy::SET_INTERSECTION) = 0;

DISKANN_DLLEXPORT virtual const std::vector<label_type> &get_labels_by_location(const location_t point_id) = 0;
DISKANN_DLLEXPORT virtual void set_labels_to_location(const location_t location,
const std::vector<std::string> &labels) = 0;
DISKANN_DLLEXPORT virtual void swap_labels(const location_t location_first, const location_t location_second) = 0;

DISKANN_DLLEXPORT virtual const tsl::robin_set<label_type> &get_all_label_set() = 0;
DISKANN_DLLEXPORT virtual void add_to_label_set(label_type &label) = 0;
// Throws: out of range exception
DISKANN_DLLEXPORT virtual void add_label_to_location(const location_t point_id, label_type label) = 0;
// returns internal mapping for given raw_label
DISKANN_DLLEXPORT virtual label_type get_numeric_label(const std::string &raw_label) = 0;

DISKANN_DLLEXPORT virtual void update_medoid_by_label(const label_type &label, const uint32_t new_medoid) = 0;
DISKANN_DLLEXPORT virtual const uint32_t &get_medoid_by_label(const label_type &label) = 0;
DISKANN_DLLEXPORT virtual const std::unordered_map<label_type, uint32_t> &get_labels_to_medoids() = 0;
DISKANN_DLLEXPORT virtual bool label_has_medoid(const label_type &label) = 0;

// TODO: in future we may accept a set or vector of universal labels
// DISKANN_DLLEXPORT virtual void set_universal_label(label_type universal_label) = 0;
DISKANN_DLLEXPORT virtual void set_universal_labels(const std::string &universal_labels) = 0;
DISKANN_DLLEXPORT virtual std::pair<bool, label_type> get_universal_label() = 0;

// takes a raw label file, generates the internal mapping file, and keeps the mapping info
DISKANN_DLLEXPORT virtual size_t load_raw_labels(const std::string &raw_labels_file,
const std::string &raw_universal_label) = 0;

DISKANN_DLLEXPORT virtual void save_labels(const std::string &save_path, const size_t total_points) = 0;
// For dynamic filtered build, we compact the data and hence location_to_labels, we need the compacted version of
// raw labels to compute GT correctly.
DISKANN_DLLEXPORT virtual void save_raw_labels(const std::string &save_path, const size_t total_points) = 0;
DISKANN_DLLEXPORT virtual void save_medoids(const std::string &save_path) = 0;
DISKANN_DLLEXPORT virtual void save_label_map(const std::string &save_path) = 0;
DISKANN_DLLEXPORT virtual void save_universal_label(const std::string &save_path) = 0;

Contributor
I think we also need a public load() which calls the protected load* methods?

Contributor
Yes, need to do this, and perhaps remove the "friend" relation between filter store and index class?

protected:
// This is for internal use and only loads already parsed file
DISKANN_DLLEXPORT virtual size_t load_labels(const std::string &labels_file) = 0;
DISKANN_DLLEXPORT virtual size_t load_medoids(const std::string &labels_to_medoid_file) = 0;
DISKANN_DLLEXPORT virtual void load_label_map(const std::string &labels_map_file) = 0;
DISKANN_DLLEXPORT virtual void load_universal_labels(const std::string &universal_labels_file) = 0;

private:
size_t _num_points;

// populates pts_to labels and _labels from given label file
virtual size_t parse_label_file(const std::string &label_file) = 0;

// mark Index as friend so it can access protected loads
template <typename T, typename TagT, typename LabelT> friend class Index;
};

} // namespace diskann
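`detect_common_filters` takes a `FilterMatchStrategy`, and `SET_INTERSECTION` is the only value defined so far. A minimal sketch of what such a check could look like follows, assuming unsorted label vectors and an optional universal label that matches every filter. The free function `has_common_label` is hypothetical; it ignores the `search_invocation` flag and the internal locking that the header comment says the real method needs.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-alone check; the PR's method lives on the filter store instead.
// universal.first says whether a universal label is set, universal.second is its value
// (mirroring the std::pair<bool, label_type> returned by get_universal_label()).
template <typename label_type>
bool has_common_label(const std::vector<label_type> &point_labels,
                      const std::vector<label_type> &incoming_labels,
                      const std::pair<bool, label_type> &universal = {false, label_type{}})
{
    // Assumption: a point carrying the universal label matches any filter.
    if (universal.first &&
        std::find(point_labels.begin(), point_labels.end(), universal.second) != point_labels.end())
    {
        return true;
    }
    // Otherwise the sets intersect if any incoming label appears among the point's labels.
    return std::any_of(incoming_labels.begin(), incoming_labels.end(), [&](const label_type &l) {
        return std::find(point_labels.begin(), point_labels.end(), l) != point_labels.end();
    });
}
```

The linear scans keep the sketch short; they are reasonable when each point carries only a handful of labels, which is an assumption about the workload rather than something the PR states.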
8 changes: 4 additions & 4 deletions include/abstract_index.h
@@ -79,8 +79,8 @@ class AbstractIndex
float *distances);

// insert points with labels, labels should be present for filtered index
- template <typename data_type, typename tag_type, typename label_type>
- int insert_point(const data_type *point, const tag_type tag, const std::vector<label_type> &labels);
+ template <typename data_type, typename tag_type>
+ int insert_point(const data_type *point, const tag_type tag, const std::vector<std::string> &labels);

// insert point for unfiltered index build. do not use with filtered index
template <typename data_type, typename tag_type> int insert_point(const data_type *point, const tag_type tag);
@@ -103,7 +103,8 @@ class AbstractIndex
// memory should be allocated for vec before calling this function
template <typename tag_type, typename data_type> int get_vector_by_tag(tag_type &tag, data_type *vec);

- template <typename label_type> void set_universal_label(const label_type universal_label);
+ // required for dynamic index (they don't use filter store / data store yet)
+ virtual void set_universal_labels(const std::string &raw_universal_labels) = 0;

private:
virtual void _build(const DataType &data, const size_t num_points_to_load, TagVector &tags) = 0;
@@ -122,6 +123,5 @@ class AbstractIndex
virtual size_t _search_with_tags(const DataType &query, const uint64_t K, const uint32_t L, const TagType &tags,
float *distances, DataVector &res_vectors) = 0;
virtual void _search_with_optimized_layout(const DataType &query, size_t K, size_t L, uint32_t *indices) = 0;
- virtual void _set_universal_label(const LabelType universal_label) = 0;
};
} // namespace diskann
5 changes: 3 additions & 2 deletions include/disk_utils.h
@@ -65,7 +65,8 @@ DISKANN_DLLEXPORT int merge_shards(const std::string &vamana_prefix, const std::
const std::string &idmaps_prefix, const std::string &idmaps_suffix,
const uint64_t nshards, uint32_t max_degree, const std::string &output_vamana,
const std::string &medoids_file, bool use_filters = false,
- const std::string &labels_to_medoids_file = std::string(""));
+ const std::string &labels_to_medoids_file = std::string(""),
+ const std::unordered_map<std::string, uint32_t> &disk_labels_map = {});

DISKANN_DLLEXPORT void extract_shard_labels(const std::string &in_label_file, const std::string &shard_ids_bin,
const std::string &shard_label_file);
@@ -81,7 +82,7 @@ DISKANN_DLLEXPORT int build_merged_vamana_index(std::string base_file, diskann::
std::string centroids_file, size_t build_pq_bytes, bool use_opq,
uint32_t num_threads, bool use_filters = false,
const std::string &label_file = std::string(""),
- const std::string &labels_to_medoids_file = std::string(""),
+ const std::string &disk_labels_to_medoids_file = std::string(""),
const std::string &universal_label = "", const uint32_t Lf = 0);

template <typename T, typename LabelT>
3 changes: 1 addition & 2 deletions include/filter_utils.h
@@ -57,8 +57,7 @@ DISKANN_DLLEXPORT void generate_label_indices(path input_data_path, path final_i
DISKANN_DLLEXPORT load_label_index_return_values load_label_index(path label_index_path,
uint32_t label_number_of_points);

- template <typename LabelT>
- DISKANN_DLLEXPORT std::tuple<std::vector<std::vector<LabelT>>, tsl::robin_set<LabelT>> parse_formatted_label_file(
+ DISKANN_DLLEXPORT std::tuple<std::vector<std::vector<std::string>>, tsl::robin_set<std::string>> parse_raw_label_file(
path label_file);

DISKANN_DLLEXPORT parse_label_file_return_values parse_label_file(path label_data_path, std::string universal_label);
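The new `parse_raw_label_file` declaration returns per-point string labels together with the set of all labels seen. A minimal sketch of a parser with that return shape follows, assuming one line per point with comma-separated labels. To stay self-contained it reads from a `std::istream` instead of a `path` and uses `std::set` in place of `tsl::robin_set`; `parse_raw_labels` is a hypothetical name, not the PR's function.

```cpp
#include <istream>
#include <set>
#include <sstream>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical stand-in for parse_raw_label_file: line i holds the labels of point i.
std::tuple<std::vector<std::vector<std::string>>, std::set<std::string>> parse_raw_labels(std::istream &in)
{
    std::vector<std::vector<std::string>> point_labels;
    std::set<std::string> all_labels;
    std::string line;
    while (std::getline(in, line))
    {
        // Tolerate Windows line endings.
        if (!line.empty() && line.back() == '\r')
            line.pop_back();
        std::vector<std::string> labels;
        std::istringstream iss(line);
        std::string token;
        while (std::getline(iss, token, ','))
        {
            if (token.empty())
                continue;
            labels.push_back(token);
            all_labels.insert(token);
        }
        // An empty line still produces an entry, so line numbers keep matching point IDs.
        point_labels.push_back(std::move(labels));
    }
    return std::make_tuple(std::move(point_labels), std::move(all_labels));
}
```

Callers can unpack the result the same way the test apps in this PR do, with `std::get<0>(parse_result)` for the per-point labels.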