Adding a "Discover" object for finding candidate tables that may be joined on the main table #1153

rcap107 · 2024-11-21T16:31:42Z

Hello, this will be a long one!

The objective of this PR is to adapt part of the pipeline from this paper: https://arxiv.org/abs/2402.06282 (repo: https://github.com/rcap107/retrieve-merge-predict).

Main points:

The user has access to a path that contains a collection of unstructured tables, and wants to find which of those tables may be joined on a main table (what would be "X").
A table is a "candidate table" for a column in X if one of its columns has a large Jaccard containment with that column.
This object takes as parameters the path that contains the tables (which may include subfolders), the list of columns to check as possible join keys ("query columns"), and a budget to keep only the top-k candidates; X is passed at fit time.
During fit/transform, all the tables in the path are loaded in memory, all categorical/string columns* are selected, and the Jaccard containment of each query column (from X) is measured against all the selected columns in the loaded tables.
The output of transform is a ranking of candidate joins that includes the query column in the main table, the path to the candidate table, the key in the candidate table, and the containment.
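
To make the points above concrete, here is a minimal sketch of the ranking step. It assumes the usual definition of containment (the fraction of distinct query values that also appear in the candidate column) and uses plain dicts of lists instead of dataframes. The names `jaccard_containment` and `rank_candidates` are illustrative only and are not taken from `_discover.py`.

```python
def jaccard_containment(query_values, candidate_values):
    """Fraction of distinct query values that also appear in the candidate."""
    query_set = set(query_values)
    if not query_set:
        return 0.0
    return len(query_set & set(candidate_values)) / len(query_set)


def rank_candidates(main_table, query_columns, candidate_tables, budget):
    """Return the top-`budget` candidate joins, best containment first.

    `main_table` and every candidate table are dicts mapping a column
    name to a list of values (an assumption made for this sketch).
    Each result is (query_column, table_path, candidate_column, containment).
    """
    ranking = []
    for query_column in query_columns:
        for table_path, table in candidate_tables.items():
            for candidate_column, values in table.items():
                score = jaccard_containment(main_table[query_column], values)
                ranking.append((query_column, table_path, candidate_column, score))
    ranking.sort(key=lambda row: row[3], reverse=True)
    return ranking[:budget]


# Toy data: the paths are only labels here, nothing is read from disk.
main = {"city": ["Paris", "Rome", "Oslo", "Paris"]}
candidates = {
    "data/population.csv": {
        "town": ["Paris", "Rome", "Berlin"],
        "country": ["FR", "IT", "DE"],
    },
    "data/weather.csv": {"location": ["Oslo", "Lima"]},
}
top = rank_candidates(main, ["city"], candidates, budget=2)
```

With this toy data, `town` ranks first (two of the three distinct cities are contained in it) and `location` second (one of three).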

Some of the problems/things that should be considered for later:

*We may not be interested only in categorical columns: integer columns may also work as join keys. However, integers may cause trouble because they can lead to false positives.
Measuring the containment is an expensive operation that should be optimized well (but it can also be parallelized easily).
Loading all tables in memory may cause memory issues (more optimization needed).
The MultiAggJoiner could be used at the end to combine all the candidates into a big table.
Jaccard containment is only one metric; more profiling tools (maybe target-based) could be included.
We can start from providing a path, but later we might want to directly include bindings to connect to a DBMS.
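
On the memory point above, one possible direction is to score one table at a time and keep only the current top-k in a heap, so that a single table is in memory at any moment. The sketch below is not the current implementation: it uses only the standard library, reads CSV files only, and the names `read_columns` and `top_k_candidates` are hypothetical.

```python
import csv
import heapq
from pathlib import Path


def read_columns(csv_path):
    """Load one CSV file and return {column_name: set of distinct values}.

    The csv module reads every value as a string, so no dtype filtering
    is done in this sketch.
    """
    columns = {}
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            for name, value in row.items():
                columns.setdefault(name, set()).add(value)
    return columns


def top_k_candidates(root, query_values, k):
    """Scan every CSV under `root` (subfolders included), one file at a time.

    Returns up to `k` tuples (containment, path, column), best first.
    """
    query_set = set(query_values)
    if not query_set:
        return []
    best = []  # min-heap: the weakest kept candidate sits at best[0]
    for csv_path in sorted(Path(root).rglob("*.csv")):
        for column, values in read_columns(csv_path).items():
            score = len(query_set & values) / len(query_set)
            entry = (score, str(csv_path), column)
            if len(best) < k:
                heapq.heappush(best, entry)
            else:
                # Push the new entry and drop whichever entry is now smallest.
                heapq.heappushpop(best, entry)
    return sorted(best, reverse=True)
```

Because each file is scored independently of the others, the loop over paths is also the natural place to parallelize, for example by mapping it over a pool from `concurrent.futures`.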

Notes on the current implementation:

The code is in _discover.py for now.
Tests are still missing (working on it). How do I set up testing for files that will be on a specific path?
I had to create read_parquet and read_csv dispatched functions, but I am having trouble because the only argument that they need is input_path, and the dispatcher can't use that to decide which library to use. For the time being, I am passing X just to give that information.
For now, the functions used by the Discover object are not in the class itself. Should I move them in? It's probably the better option.
I put TODOs for some of the important points.

The current version of the code is barebones, but it runs. Mostly, I am having trouble integrating the code properly.
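
On the testing question in the notes above, one common option is pytest's built-in `tmp_path` fixture, which gives each test a fresh temporary directory as a `pathlib.Path`. The sketch below writes a few small files into it and checks a file-listing step. `list_table_paths` is a hypothetical stand-in for whatever part of the Discover object walks the path.

```python
from pathlib import Path


def list_table_paths(root):
    """Hypothetical helper: every CSV or Parquet file under `root`, subfolders included."""
    return sorted(
        p for p in Path(root).rglob("*") if p.suffix in {".csv", ".parquet"}
    )


def test_list_table_paths_finds_nested_files(tmp_path):
    # pytest injects `tmp_path`; it is removed automatically afterwards.
    (tmp_path / "sub").mkdir()
    (tmp_path / "a.csv").write_text("town\nParis\n")
    (tmp_path / "sub" / "b.csv").write_text("location\nOslo\n")
    (tmp_path / "notes.txt").write_text("not a table")

    found = list_table_paths(tmp_path)

    # The text file is skipped and the nested CSV is picked up.
    assert [p.name for p in found] == ["a.csv", "b.csv"]
```

Since the files are created inside the test, nothing depends on a local `data/` folder.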

rcap107 · 2024-11-21T16:32:37Z

.gitignore

@@ -20,6 +20,9 @@ var/
 .installed.cfg
 *.egg
 *.pkl
+data/


This is to ignore the files I am using locally to test the object.

rcap107 added 9 commits October 24, 2024 12:02

🎉 Initial commit for the Discover object

a8ca6e7

Merge remote-tracking branch 'upstream/main' into discover

55a7866

Updating gitignore

a73a114

Cleaning up, moving code to source dir

5b65db2

Adding example

02017fa

Updating common with the new functions

abc9174

Starting to implement datframe API

df3b4dd

Updating gitignore

fc064cd

Initial commit

920546b

rcap107 commented Nov 21, 2024


Updating code to test on CTU example

7bcb0d1
