Joiner store state in fit + add other distance scaling strategies #821

Merged 76 commits on Dec 12, 2023
Commits
255a184
start adding matching strategies
jeromedockes Nov 8, 2023
7c58a6f
iter
jeromedockes Nov 9, 2023
80e9a6a
Merge remote-tracking branch 'upstream/main' into refactor_joiner
jeromedockes Nov 9, 2023
5ee25e8
add to Joiner, pass aux to fit in matchers
jeromedockes Nov 9, 2023
1af0103
use hashing vectorizer + better handling of sparsity
jeromedockes Nov 9, 2023
f5c8c0d
add actual join
jeromedockes Nov 9, 2023
1663a75
use pd merge rather than concat to let pandas handle rows without mat…
jeromedockes Nov 9, 2023
c0e0e4b
update example
jeromedockes Nov 9, 2023
ee189db
update fuzzy_join
jeromedockes Nov 9, 2023
e968a79
better names in fuzzy_join key checks
jeromedockes Nov 10, 2023
b13fedd
add distance rescaling and max_dist
jeromedockes Nov 10, 2023
c02bc74
update joiner docstring
jeromedockes Nov 13, 2023
be23bc8
update fuzzy_join docstring
jeromedockes Nov 13, 2023
dd73ada
allow None or "inf" as max_dist
jeromedockes Nov 13, 2023
d6167d0
update example
jeromedockes Nov 13, 2023
d74391f
unused import
jeromedockes Nov 13, 2023
3f87007
add note
jeromedockes Nov 13, 2023
4cec389
iter
jeromedockes Nov 13, 2023
c275329
select matching as string + use 2nd neighbor as default
jeromedockes Nov 13, 2023
46cbceb
rename matching -> ref_dist
jeromedockes Nov 13, 2023
079286f
outdated comments
jeromedockes Nov 13, 2023
402bda9
update example
jeromedockes Nov 13, 2023
0d3f079
update tests
jeromedockes Nov 13, 2023
94b332d
Merge remote-tracking branch 'upstream/main' into refactor_joiner
jeromedockes Nov 14, 2023
3a62108
iter
jeromedockes Nov 20, 2023
bf747ca
add rescaling with percentile of aux-aux distances
jeromedockes Nov 21, 2023
5cf7c0a
improve docstring
jeromedockes Nov 21, 2023
b04b676
update fuzzy_join doctest
jeromedockes Nov 21, 2023
8645cda
update test
jeromedockes Nov 21, 2023
3244b31
change default max_dist to inf
jeromedockes Nov 21, 2023
c7b3c1e
docstrings
jeromedockes Nov 21, 2023
7c1516a
docstring
jeromedockes Nov 21, 2023
4fa4cbc
fix sparse distance calculation for old scipy
jeromedockes Nov 22, 2023
911dc4e
add changelog
jeromedockes Nov 22, 2023
bbd88f9
insert match info by default & update example
jeromedockes Nov 22, 2023
4b2e020
update doctests
jeromedockes Nov 22, 2023
0ef7d12
Merge remote-tracking branch 'upstream/main' into refactor_joiner
jeromedockes Nov 23, 2023
b96dace
fuzzy join insert match info false by default
jeromedockes Nov 23, 2023
ef48bbb
apply name changes suggested in code review to example
jeromedockes Nov 27, 2023
226f69d
start addressing review
jeromedockes Nov 27, 2023
066ab17
check value of ref_dist
jeromedockes Nov 27, 2023
e76e097
check key lengths match
jeromedockes Nov 27, 2023
7880b7b
use linalg.norm
jeromedockes Nov 27, 2023
1011b39
fix test
jeromedockes Nov 27, 2023
dd02852
better error message when column names overlap
jeromedockes Nov 27, 2023
5164f1a
add note on not using GapEncoder
jeromedockes Nov 27, 2023
5c2fe87
remove unused param + fix MaxDist
jeromedockes Nov 27, 2023
c41f67d
aux percentile -> quartile
jeromedockes Nov 27, 2023
c848208
matching strategies docstrings
jeromedockes Nov 27, 2023
977b065
docstrings
jeromedockes Nov 27, 2023
ae23511
address review on example
jeromedockes Nov 27, 2023
80c311d
rename insert_match_info → add_match_info
jeromedockes Nov 27, 2023
94a8f7a
add add_column_name_suffix function
jeromedockes Nov 27, 2023
cce86c1
rename matching classes
jeromedockes Nov 30, 2023
b3843e0
add max_dist_ and document public attributes
jeromedockes Nov 30, 2023
6da246a
use check_random_state
jeromedockes Nov 30, 2023
ecda94b
skip vectorizing main matching columns if possible
jeromedockes Nov 30, 2023
a51dfe4
remove worst match rescaling option
jeromedockes Dec 1, 2023
effaa54
fix docstring
jeromedockes Dec 1, 2023
4530fa7
capitalize param description
jeromedockes Dec 1, 2023
4ad0147
add tests
jeromedockes Dec 1, 2023
c869b67
update fuzzy_join docstring
jeromedockes Dec 1, 2023
ea1b5cb
rename aux_quartile, Percentile
jeromedockes Dec 1, 2023
a132d73
full stop at end of param description
jeromedockes Dec 1, 2023
3131039
detail
jeromedockes Dec 1, 2023
2f672ab
fix Joiner bug when aux table index is not range(shape[0])
jeromedockes Dec 1, 2023
1429309
simpler reset index
jeromedockes Dec 1, 2023
3dc1ad2
duplicate column name checking
jeromedockes Dec 4, 2023
24f0498
better way of passing table names
jeromedockes Dec 4, 2023
19b92c9
type hints
jeromedockes Dec 4, 2023
f37856f
convert polars dataframes to pandas until we have actual polars support
jeromedockes Dec 8, 2023
aa79283
details
jeromedockes Dec 8, 2023
86fbb65
Apply suggestions from code review
jeromedockes Dec 11, 2023
f24bc4b
apply suggestions from review
jeromedockes Dec 11, 2023
cc7b63c
Merge branch 'refactor_joiner' of github.com:jeromedockes/skrub into …
jeromedockes Dec 11, 2023
771464d
Merge remote-tracking branch 'upstream/main' into refactor_joiner
jeromedockes Dec 11, 2023
Merge remote-tracking branch 'upstream/main' into refactor_joiner
jeromedockes committed Nov 14, 2023
commit 94b332dcffdef9611486d08424e00acd37369146

This merge commit was added into this branch cleanly.

There are no new changes to show, but you can still view the diff.