Adding a "Discover" object for finding candidate tables that may be joined on the main table #1153

Draft
wants to merge 10 commits into
base: main

rcap107
Contributor

@rcap107 rcap107 commented Nov 21, 2024

Hello, this will be a long one!

The objective of this PR is to adapt part of the pipeline from this paper: https://arxiv.org/abs/2402.06282 (repo: https://github.com/rcap107/retrieve-merge-predict).

Main points:

  • The user has access to a path that contains a collection of unstructured tables, and wants to find which of those tables can be joined on a main table (what would be "X").
  • A table is a "candidate table" for a column in X if one of its columns has a large Jaccard containment with that column, i.e. it contains a large fraction of the unique values found in the column of X.
  • this object takes as parameters the path that contains the tables (may include subfolders), the list of columns that must be checked as possible join keys ("query columns"), a budget to keep only the top-k candidates, and X at fit time
  • during fit/transform, all the tables in the path are loaded into memory, all categorical/string columns* are selected, and the Jaccard containment of each query column (from X) is measured against every selected column in the loaded tables
  • the output of transform is a ranking of candidate joins that includes the query column in the main table, the path to the candidate table, the key in the candidate table and the containment
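
To make the steps above concrete, here is a minimal pure-Python sketch of the containment measure and the resulting ranking. It is not the code in _discover.py: tables are represented as plain dicts of column name to list of values (an assumption made to keep the example dependency-free), and the names `jaccard_containment` and `rank_candidates` are hypothetical. Containment is taken here as |Q ∩ C| / |Q| over unique values, where Q is the query column and C the candidate column.

```python
def jaccard_containment(query_values, candidate_values):
    """Fraction of unique query values that also appear in the candidate."""
    query = set(query_values)
    if not query:
        return 0.0
    return len(query & set(candidate_values)) / len(query)


def rank_candidates(main_table, query_columns, candidate_tables, top_k=3):
    """Rank (query column, table path, candidate column) by containment.

    Tables are plain dicts mapping column names to lists of values; this is
    an illustrative stand-in for dataframes, not the PR's data model.
    """
    ranking = []
    for query_col in query_columns:
        for table_path, table in candidate_tables.items():
            for cand_col, values in table.items():
                # Only string columns are considered, mirroring the
                # categorical/string restriction described above.
                if not all(isinstance(v, str) for v in values):
                    continue
                score = jaccard_containment(main_table[query_col], values)
                if score > 0:
                    ranking.append((query_col, table_path, cand_col, score))
    # Highest containment first, then apply the top-k budget.
    ranking.sort(key=lambda row: row[3], reverse=True)
    return ranking[:top_k]


main = {"city": ["Paris", "Rome", "Berlin", "Madrid"]}
lake = {
    "lake/population.csv": {"town": ["Paris", "Rome", "Berlin"], "pop": [2, 3, 4]},
    "lake/sub/weather.csv": {"place": ["Paris", "Oslo"]},
}
result = rank_candidates(main, ["city"], lake)
for row in result:
    print(row)
```

Each output row carries the same four fields as the ranking described above: query column, path to the candidate table, key in the candidate table, and containment.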

Some of the problems/things that should be considered for later:

  • *We may not be interested only in categorical columns: integer columns (that represent integers) may also work as join keys. However, integers can cause trouble because they may lead to false positives.
  • Measuring the containment is an expensive operation that should be optimized well (but it can also be parallelized easily).
  • Loading all tables in memory may cause memory issues (more optimization).
  • The MultiAggJoiner could be used at the end to combine all the candidates into a big table.
  • Jaccard containment is only one metric; more profiling tools (maybe target-based) could be included.
  • We can start by providing a path, but later we might want to include bindings to connect directly to a DBMS.
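
One possible way to address the memory point above is to load tables one at a time and keep only the k best candidates in a small heap. The sketch below assumes a hypothetical `table_loader` callable that returns a table as a dict of column name to list of values; it is an illustration of the idea, not what the PR currently does.

```python
import heapq


def top_k_streaming(query_values, table_loader, table_paths, k):
    """Keep only the k best (score, path, column) entries while streaming.

    `table_loader` is a hypothetical callable returning one table as a
    dict of column name -> list of values; one table is loaded at a time.
    """
    query = set(query_values)
    heap = []  # min-heap: the weakest kept candidate sits at heap[0]
    for path in table_paths:
        table = table_loader(path)
        for column, values in table.items():
            score = len(query & set(values)) / len(query) if query else 0.0
            entry = (score, path, column)
            if len(heap) < k:
                heapq.heappush(heap, entry)
            elif entry > heap[0]:
                # The new entry beats the weakest kept one: swap them.
                heapq.heapreplace(heap, entry)
    return sorted(heap, reverse=True)


# An in-memory dict stands in for files on disk in this example.
fake_lake = {
    "a.csv": {"x": ["p", "q"], "y": ["z"]},
    "b.csv": {"x": ["p", "q", "r"]},
    "c.csv": {"w": ["p"]},
}
best = top_k_streaming(
    ["p", "q", "r", "s"], fake_lake.__getitem__, sorted(fake_lake), k=2
)
print(best)
```

Because each table is scored independently of the others, the loop over paths is also the natural unit to hand to a worker pool if the containment computation is parallelized later.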

Notes on the current implementation:

  • The code is in _discover.py for now.
  • All the testing is missing (working on it). How do I set up testing for files that will be on a specific path?
  • I had to create dispatched read_parquet and read_csv functions, but I am having trouble because their only argument is input_path, and the dispatcher can't use that to decide which library to use. For the time being, I am also passing X just to provide that information.
  • For now, the functions used by the Discover object are not in the class itself. Should I move them in? It's probably the better option.
  • I put TODOs for some of the important points.
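
On the testing question above, one common approach is to build the folder of tables inside a temporary directory created by the test itself (pytest's built-in `tmp_path` fixture provides one). The sketch below uses only the standard library; `find_table_files`, `write_csv` and the set of supported suffixes are hypothetical names and assumptions, not part of the PR.

```python
import csv
import tempfile
from pathlib import Path

SUPPORTED_SUFFIXES = {".csv", ".parquet"}  # assumed set of formats


def find_table_files(root):
    """Recursively list table files under `root`, sorted for determinism."""
    return sorted(
        p
        for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED_SUFFIXES
    )


def write_csv(path, header, rows):
    """Create parent folders as needed and write a small CSV file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


# Build a throwaway "data lake" with a subfolder and an unrelated file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    write_csv(root / "cities.csv", ["city"], [["Paris"], ["Rome"]])
    write_csv(root / "nested" / "weather.csv", ["place"], [["Oslo"]])
    (root / "notes.txt").write_text("not a table")
    found = [p.relative_to(root).as_posix() for p in find_table_files(root)]

print(found)
```

In a pytest test, the `with tempfile.TemporaryDirectory()` block would be replaced by a `tmp_path` argument, and the resulting directory would be passed to the Discover object as its path.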

The current code version is barebones, but it runs. Mostly, I am having trouble with integrating the code properly.

@@ -20,6 +20,9 @@ var/
.installed.cfg
*.egg
*.pkl
data/
Contributor Author

This is to ignore the files I am using locally to test the object.
