MAIN Improve DatetimeEncoder #784

Merged
Changes from 20 commits
37 commits
c9069bc
estimator refacto
Vincent-Maladiere Oct 4, 2023
da7d678
revamp all tests from datetime_encoder
Vincent-Maladiere Oct 5, 2023
0c08aad
update docstrings
Vincent-Maladiere Oct 5, 2023
d57691c
update example
Vincent-Maladiere Oct 5, 2023
581c834
Merge branch 'main' into refacto_datetime_encoder
Vincent-Maladiere Oct 5, 2023
b39c2da
split the transform method with _parse_datetime_cols
Vincent-Maladiere Oct 5, 2023
edf11dd
small typo in a comment
Vincent-Maladiere Oct 5, 2023
367d207
add to_datetime and rework the backend
Vincent-Maladiere Oct 12, 2023
65657c3
docstring typo
Vincent-Maladiere Oct 12, 2023
53a04d2
docstring typo 2
Vincent-Maladiere Oct 12, 2023
998859c
docstring typo 3
Vincent-Maladiere Oct 12, 2023
6bec3e6
add TODO
Vincent-Maladiere Oct 12, 2023
d4b9cbc
enhance tests
Vincent-Maladiere Oct 13, 2023
3d87da3
Merge branch 'main' into refacto_datetime_encoder
Vincent-Maladiere Oct 13, 2023
8710fcb
apply Jerome's suggestions
Vincent-Maladiere Oct 13, 2023
4d11fb9
Merge branch 'main' into refacto_datetime_encoder
Vincent-Maladiere Oct 13, 2023
dff7b22
fix old pandas version errors
Vincent-Maladiere Oct 13, 2023
ff5b575
fix doctest
Vincent-Maladiere Oct 16, 2023
7f463bc
add scalar and 1d array support for to_datetime
Vincent-Maladiere Oct 16, 2023
1bf6a9f
fix test on py310-min
Vincent-Maladiere Oct 16, 2023
6653591
Merge branch 'main' into refacto_datetime_encoder
Vincent-Maladiere Oct 31, 2023
4311a5e
update the example
Vincent-Maladiere Oct 31, 2023
3b37ec1
Merge branch 'main' into refacto_datetime_encoder
Vincent-Maladiere Nov 2, 2023
77771f7
improve to_datetime docstring and parameters validation
Vincent-Maladiere Nov 2, 2023
b8e8699
Merge branch 'main' into refacto_datetime_encoder
Vincent-Maladiere Nov 2, 2023
60cfad6
fix _dataframe import path
Vincent-Maladiere Nov 2, 2023
cd6672d
improve doc and add some tests
Vincent-Maladiere Nov 3, 2023
1f1e128
fix docstring format
Vincent-Maladiere Nov 4, 2023
d57eca5
Merge branch 'main' into refacto_datetime_encoder
Vincent-Maladiere Nov 4, 2023
581fd88
make doctest happy
Vincent-Maladiere Nov 4, 2023
25d0457
fix min pandas tests
Vincent-Maladiere Nov 4, 2023
bc72e81
fix tests for min pandas
Vincent-Maladiere Nov 4, 2023
1f52d1e
make doctest happy
Vincent-Maladiere Nov 4, 2023
0875958
Update skrub/_datetime_encoder.py
Vincent-Maladiere Nov 6, 2023
4137f89
apply suggestions
Vincent-Maladiere Nov 8, 2023
0bf4896
missing remarks
Vincent-Maladiere Nov 9, 2023
d5a4091
fix min pandas version test
Vincent-Maladiere Nov 9, 2023
9 changes: 9 additions & 0 deletions CHANGES.rst
@@ -15,6 +15,10 @@ development and backward compatibility is not ensured.
Major changes
-------------

+* :func:`to_datetime` is now available to support pandas.to_datetime
+  over dataframes and 2d arrays.
+  :pr:`784` by :user:`Vincent Maladiere <Vincent-Maladiere>`

* :func:`dataframe.pd_join`, :func:`dataframe.pd_aggregate`,
:func:`dataframe.pl_join` and :func:`dataframe.pl_aggregate`
are now available in the dataframe submodule.
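
The first changelog entry describes applying `pandas.to_datetime` across a whole dataframe. As a rough illustration of that idea only, and not the code added by this PR, the sketch below uses plain pandas. The helper name `to_datetime_frame` and its `fmt` parameter are hypothetical; the actual `skrub.to_datetime` signature is not shown in this diff.

```python
import pandas as pd


def to_datetime_frame(df, fmt="%Y-%m-%d %H:%M:%S"):
    """Hypothetical helper: parse each string column that matches `fmt`."""
    out = df.copy()
    for col in out.columns:
        if not pd.api.types.is_string_dtype(out[col]):
            continue  # only string-like columns are candidates
        try:
            out[col] = pd.to_datetime(out[col], format=fmt)
        except (ValueError, TypeError):
            pass  # values do not match the format: leave the column untouched
    return out


# Toy data (assumed for illustration, not the dataset used in the example).
df = pd.DataFrame(
    {
        "city": ["Paris", "London"],
        "date.utc": ["2023-10-04 12:00:00", "2023-10-05 13:30:00"],
    }
)
parsed = to_datetime_frame(df)
print(parsed.dtypes)
```

Only the column whose values all match the format is converted; the `city` column is returned unchanged.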
@@ -47,6 +51,11 @@ Major changes

Minor changes
-------------
+* :class:`DatetimeEncoder` doesn't remove constant features anymore.
+  It also supports an 'errors' argument to raise or coerce errors during
+  transform, and an 'add_total_seconds' argument to include the number of
+  seconds since Epoch.
+  :pr:`784` by :user:`Vincent Maladiere <Vincent-Maladiere>`

* :class:`TableVectorizer` is now able to apply parallelism at the column level rather than the transformer level. This is the default for univariate transformers, like :class:`MinHashEncoder`, and :class:`GapEncoder`.
:pr:`592` by :user:`Leo Grinsztajn <LeoGrin>`
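
To make the `DatetimeEncoder` entry above more concrete, here is a minimal pandas-only sketch of the kind of features it describes: calendar fields down to the minute, the day of the week, and the total number of seconds since Epoch, with unparseable values coerced to missing. This is an assumption-laden illustration, not the encoder's implementation; the column names and the toy input are made up.

```python
import pandas as pd

# Toy input (assumed): one valid timestamp and one unparseable value.
dates = pd.Series(["2023-10-04 12:30:00", "not a date"])

# errors="coerce" turns unparseable entries into NaT instead of raising.
parsed = pd.to_datetime(dates, format="%Y-%m-%d %H:%M:%S", errors="coerce")

epoch = pd.Timestamp("1970-01-01")
features = pd.DataFrame(
    {
        "year": parsed.dt.year,
        "month": parsed.dt.month,
        "day": parsed.dt.day,
        "hour": parsed.dt.hour,
        "minute": parsed.dt.minute,
        "day_of_week": parsed.dt.dayofweek,  # pandas convention: Monday == 0
        # Seconds elapsed since Epoch, as a float (NaN for missing dates).
        "total_seconds": (parsed - epoch) / pd.Timedelta(seconds=1),
    }
)
print(features)
```

Because constant features are no longer dropped, a column such as `year` would stay in the output even when every row shares the same value.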
15 changes: 4 additions & 11 deletions examples/03_datetime_encoder.py
@@ -80,7 +80,7 @@

 encoder = make_column_transformer(
     (OneHotEncoder(handle_unknown="ignore"), ["city"]),
-    (DatetimeEncoder(add_day_of_the_week=True, extract_until="minute"), ["date.utc"]),
+    (DatetimeEncoder(add_day_of_the_week=True, resolution="minute"), ["date.utc"]),
     remainder="drop",
 )

@@ -89,11 +89,8 @@

###############################################################################
# We see that the encoder is working as expected: the "date.utc" column has
-# been replaced by features extracting the month, day, hour, and day of the
-# week information.
-#
-# Note the year and minute features are not present, this is because they
-# have been removed by the encoder as they are constant the whole period.
+# been replaced by features extracting the month, day, hour, minute, day of the
+# week and total seconds since Epoch information.

###############################################################################
# One-liner with the |TableVectorizer|
@@ -144,14 +141,9 @@
# ```py
# from sklearn.experimental import enable_hist_gradient_boosting
# ```

-import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import make_pipeline

-table_vec = TableVectorizer(
-    datetime_transformer=DatetimeEncoder(add_day_of_the_week=True),
-)
pipeline = make_pipeline(table_vec, HistGradientBoostingRegressor())

###############################################################################
@@ -164,6 +156,7 @@
#
# Instead, we can use the |TimeSeriesSplit|,
# which ensures that the test set is always in the future.
+import numpy as np

sorted_indices = np.argsort(X["date.utc"])
X = X.iloc[sorted_indices]
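
The sort-then-split step in the hunk above can be sketched on toy data as follows. The timestamps and the choice of `n_splits=3` are assumptions made for illustration; the example in this PR applies the same idea to its own `X["date.utc"]` column.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy timestamps in shuffled order (assumed data).
dates = np.array(
    [
        "2023-01-05", "2023-01-01", "2023-01-08", "2023-01-03",
        "2023-01-02", "2023-01-07", "2023-01-04", "2023-01-06",
    ],
    dtype="datetime64[D]",
)

# Sort chronologically first, as the example does with np.argsort.
sorted_indices = np.argsort(dates)
dates_sorted = dates[sorted_indices]

splits = list(TimeSeriesSplit(n_splits=3).split(dates_sorted))
for train_idx, test_idx in splits:
    # Every test fold starts strictly after the end of its training fold.
    print(dates_sorted[train_idx].max(), "<", dates_sorted[test_idx].min())
```

Sorting matters here because `TimeSeriesSplit` splits by row position, so it only guarantees "test in the future" when the rows are already in chronological order.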
3 changes: 2 additions & 1 deletion skrub/__init__.py
@@ -5,7 +5,7 @@

from ._agg_joiner import AggJoiner, AggTarget
from ._check_dependencies import check_dependencies
-from ._datetime_encoder import DatetimeEncoder
+from ._datetime_encoder import DatetimeEncoder, to_datetime
from ._deduplicate import compute_ngram_distance, deduplicate
from ._fuzzy_join import fuzzy_join
from ._gap_encoder import GapEncoder
@@ -33,6 +33,7 @@
     "TargetEncoder",
     "deduplicate",
     "compute_ngram_distance",
+    "to_datetime",
     "AggJoiner",
     "AggTarget",
]