MAIN Improve DatetimeEncoder #784
Conversation
skrub/_datetime_encoder.py (outdated)

```python
np_dtypes_candidates = [np.object_, np.str_, np.datetime64]
if any(np.issubdtype(X.dtype, np_dtype) for np_dtype in np_dtypes_candidates):
    try:
        _ = pd.to_datetime(X)
```
I think `to_datetime` is stricter than `DatetimeIndex`, which was used before (@LeoGrin also mentioned that in #768); as a result, the docstring example prints an empty vectorized dataframe on my machine.

Being a bit more strict is probably a good idea, especially at first; using `to_datetime` instead of `DatetimeIndex` avoids issues such as this one, but the example needs to be updated.
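For context, the code above decides whether a column is datetime-parsable by attempting a full strict parse and catching failures; a minimal sketch of that pattern (the function name is illustrative, not one of skrub's actual helpers):

```python
import numpy as np
import pandas as pd


def is_datetime_parsable(X):
    # A single unparsable entry makes the strict parse raise, so the
    # whole column is rejected (illustrative sketch of the try/except
    # in the diff above).
    try:
        pd.to_datetime(X)
        return True
    except (ValueError, TypeError):
        return False


print(is_datetime_parsable(np.asarray(["2023-10-06", "2023-10-07"])))  # True
print(is_datetime_parsable(np.asarray(["not", "a date"])))             # False
```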
In a later PR the `TableVectorizer` will start using this module's
I wonder if we could have a transformer that just parses string columns that contain dates and replaces them with columns with datetime64 dtype? This is one thing that is annoying to do in pandas if we don't know in advance which columns are dates:

```python
import pandas as pd

df = pd.DataFrame({"date": ["2023-10-06", "2023-10-07"]})
df["date"] = pd.to_datetime(df["date"])  # avoid having to do this
```
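A minimal sketch of what such a transformer could look like (the class name and the detection heuristic are hypothetical, not skrub's actual API):

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class DatetimeColumnParser(BaseEstimator, TransformerMixin):
    """Parse string columns that look like dates into datetime64 columns."""

    def fit(self, X, y=None):
        # Remember which object-dtype columns parse cleanly as datetimes.
        self.datetime_columns_ = []
        for col in X.columns:
            if X[col].dtype != object:
                continue
            try:
                pd.to_datetime(X[col])
                self.datetime_columns_.append(col)
            except (ValueError, TypeError):
                pass
        return self

    def transform(self, X):
        X = X.copy()
        for col in self.datetime_columns_:
            X[col] = pd.to_datetime(X[col])
        return X
```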
My idea was to run
Yes, I need to add some of the logic from
I like this suggestion a lot; this is indeed very annoying. Does it need to be a transformer? I feel we could use a "switch-bait" function in the sense of
@jeromedockes I went for the switch-bait; I need to update the tests.
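For illustration, a bare-bones version of such a dispatching function (a sketch under assumptions; skrub's real `to_datetime` handles more input types, error modes, and format guessing):

```python
import pandas as pd


def _maybe_to_datetime(col):
    # Only attempt parsing on string-like columns; leave others untouched.
    if col.dtype != object:
        return col
    try:
        return pd.to_datetime(col)
    except (ValueError, TypeError):
        return col


def to_datetime(X):
    # "Switch-bait" dispatch: DataFrame in, DataFrame out; otherwise
    # defer to pandas' own to_datetime.
    if isinstance(X, pd.DataFrame):
        return X.apply(_maybe_to_datetime)
    return pd.to_datetime(X)
```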
Thanks, it looks great! I wonder if we should make it for dataframes only (and possibly 2d arrays), and maybe give it a different name, because not all columns are converted. I get your "switch-bait" point, but it may make the function more complex; 1d or smaller inputs are well handled by the polars and pandas functions, and the skrub use case is dataframes.
I found a couple of issues; anything related to datetimes, timezones, and the interactions between standard-library datetimes, pandas, and polars is always a bit tricky :D
skrub/_datetime_encoder.py (outdated)

```python
X_split = {col: X_split[col_idx] for col_idx, col in enumerate(X.columns)}
X = pd.DataFrame(X_split, index=index)
# conversion if px is Polars, no-op if Pandas
return px.DataFrame(X)
```
Could we build a polars dataframe directly, instead of building a pandas one first and then converting? AFAIK, at the moment initializing a polars dataframe with a pandas one will make a copy.
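For instance, the snippet above could build the Polars frame straight from the dict of columns, skipping the pandas round-trip (a sketch; `X_split` here is a stand-in for the column dict from the diff, and the index would simply be unused on the Polars side):

```python
import polars as pl

# Hypothetical stand-in for the dict of column name -> values built above.
X_split = {"a": [1, 2], "b": ["x", "y"]}

# Construct directly; no intermediate pandas DataFrame, hence no extra copy.
df = pl.DataFrame(X_split)
print(df)
```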
Also, it is not due to the skrub code, but the conversion fails when timezones are involved:

```python
import polars as pl
from skrub import to_datetime

df = pl.DataFrame({"date": ["2023-10-12T17:36:50+01:00"]})
to_datetime(df)  # pl.exceptions.ComputeError
```
But this seems to be more of a polars & pandas compatibility issue:

```python
import pandas as pd
import polars as pl

df = pd.DataFrame({"date": ["2023-10-12T17:36:50+02:00"]})
df["date"] = pd.to_datetime(df["date"])
pl.DataFrame(df)
```
I'll open an issue on the polars repo
> could we build a polars dataframe directly instead of pandas first and then convert? AFAIK at the moment initializing a polars dataframe with a pandas one will make a copy

Polars is still an optional dependency, so I guess we can't.
> I'll open an issue on the polars repo

Thank you for this! Note that I haven't written tests for Polars yet; maybe in a subsequent PR, to avoid obfuscating this one.
> Polars is still an optional dependency, so I guess we can't

Maybe not in this PR, but in the polars/pandas namespace you added, we could have a function to build a DataFrame from a dict of columns and an index (in the polars version, the index would be ignored, possibly after checking that it is None).
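A possible shape for that helper pair, sketched with a hypothetical `make_dataframe_*` naming (not the actual skrub namespace API):

```python
import pandas as pd
import polars as pl


def make_dataframe_pandas(columns, index=None):
    # Pandas version: the index is used directly.
    return pd.DataFrame(columns, index=index)


def make_dataframe_polars(columns, index=None):
    # Polars has no index; fail loudly rather than silently dropping one.
    if index is not None:
        raise ValueError("Polars dataframes have no index; pass index=None.")
    return pl.DataFrame(columns)
```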
Yes indeed! I'll create an issue for it.
> I'll open an issue on the polars repo

I opened pola-rs/polars#11774
Thanks! I get your idea, but we might end up with different behavior for series (
Thanks for pointing out the issues! I haven't tested it extensively with Polars yet.
The CI errors with the old Pandas version.
Yes, that makes sense. For example, pandas can convert to datetime a dataframe with separate columns for year, month, etc., whereas skrub looks at each column separately:

```python
import pandas as pd
from skrub import to_datetime

df = pd.DataFrame({"year": [2023, 2024], "month": [1, 2], "day": [12, 13]})
print("pandas:")
dt = pd.to_datetime(df)
print(dt)
print("""
------------
skrub:""")
sdt = to_datetime(df)
print(sdt)
```
Also, I like your choice to handle series basically as dataframes with one column, and I wonder if we should have this harmonization for arrays and scalars too?

```python
import pandas as pd
from skrub import to_datetime

s = pd.Series(["a", "b"])
print(to_datetime(s))
print("""
-----------------
""")
print(to_datetime(s.values))
```
Yes, doctest compares the output as strings, so from this point of view "2022" and "2.022e+03" are not the same. I'll look into setting numpy and pandas formatting options so we get more sensible outputs.
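Something along these lines, for illustration (the exact options are an assumption, not what the PR ultimately uses):

```python
import numpy as np
import pandas as pd

# Avoid scientific notation in numpy reprs that doctests compare as strings.
np.set_printoptions(suppress=True)
# Render pandas floats with a fixed, human-readable format.
pd.set_option("display.float_format", "{:.0f}".format)

# Without suppress=True this prints in exponent form because of the large
# value range; with it, values stay in plain decimal notation.
print(np.array([2022.0, 0.5]))
```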
One caveat of this first implementation of `to_datetime`, illustrated on numpy arrays:

```python
import numpy as np
import pandas as pd
from skrub import to_datetime

df = pd.DataFrame(
    dict(
        a=["2020-01-01", "2021-01-01"],
        b=[2020, 2021],
    )
)

# OK
to_datetime(df)
#             a     b
# 0  2020-01-01  2020
# 1  2021-01-01  2021

# Not very helpful
to_datetime(df.values)
# array([[1609459200000000000, 2020],
#        [1609545600000000000, 2021]], dtype=object)

# Better, but needs all columns to be datetime
to_datetime(df.values[:, [0]])
# array([['2020-01-01T00:00:00.000000000'],
#        ['2021-01-01T00:00:00.000000000']], dtype='datetime64[ns]')

# OK, by converting to pydatetime
X_split = list(df.values.T)
X_split[0] = pd.to_datetime(X_split[0]).to_pydatetime()
out = np.vstack(X_split).T
out
# array([[datetime.datetime(2020, 1, 1, 0, 0), 2020],
#        [datetime.datetime(2021, 1, 1, 0, 0), 2021]], dtype=object)

# Notice how we still lose the integer representation of the second column.
pd.DataFrame(out).dtypes
# 0    datetime64[ns]
# 1            object
```

That being said, we can't cast

Anyway, we convert dataframes to a set of numpy 1d arrays, which might be problematic for the Polars datetime representation, for example. Therefore, we should consider improving the
For this one I see a different result, which corresponds to the timestamps of the 2 dates. Why do you think that is not helpful?
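For reference, the integers in question are nanosecond Unix timestamps, which is how datetime64[ns] values are stored under the hood; a minimal check:

```python
import pandas as pd

dates = pd.to_datetime(["2020-01-01", "2021-01-01"])
# datetime64[ns] values are nanoseconds since the Unix epoch.
print(dates.astype("int64").tolist())
# [1577836800000000000, 1609459200000000000]
```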
In the case where we have to output a numerical numpy array, the timestamp is probably the best we can do. But as you said, in most cases we probably want dataframes as inputs and outputs, with proper datetime columns in the outputs.

```python
"""
Parameters
----------
X_col : ndarray of shape ``(n_samples,)``
```
If missing values are not allowed in `X_col`, I think we should mention it in the docstring.
```python
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=UserWarning)
    # pd.unique handles None
    month_first_formats = pd.unique(vfunc(X_col, dayfirst=False))
```
Not sure if it could be a problem, but note that here we might end up with the string `'None'` in the result, e.g. for `_guess_datetime_format(np.asarray(["01/23/2021", ""]))` we get `array(['%m/%d/%Y', 'None'], dtype='<U8')` in `month_first_formats`.
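One way to guard against that, assuming the vectorized guesser stringifies failed guesses to 'None' as in the example above (a sketch, not necessarily the fix the PR adopts):

```python
import numpy as np

# Stand-in for the result of pd.unique(vfunc(X_col, dayfirst=False)).
month_first_formats = np.array(["%m/%d/%Y", "None"], dtype="<U8")

# Drop the 'None' placeholders before reasoning about candidate formats.
valid_formats = month_first_formats[month_first_formats != "None"]
print(valid_formats)  # ['%m/%d/%Y']
```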
Yes, the results are just for illustration; don't mind them. More specifically, it is not helpful in the context of working with dataframes, because we lose the datetime representation (using
LGTM!
@LeoGrin it has changed a bit since your approval; do you want to have another look? If not, I will merge it.
I'll have a look tonight, thanks!
```python
Parameters
----------
X_col : ndarray of shape ``(n_samples,)``
```
Upping Jerome's comment: "If missing values are not allowed in `X_col`, I think we should mention it in the docstring."
Just a few comments, but otherwise LGTM, thanks!
Let's merge then!
References

Addresses issues raised in #743 and detailed in #768.

What changes?

- `DatetimeEncoder` exposes `format_per_column_`, which maps datetime-parsable columns (seen during `fit`) to their first non-null entry. We aim to use this column selection in the `TableVectorizer` to unify the datetime column selection logic.
- `extract_until=None` doesn't create an extra column `"total_second"`. It simply doesn't extract features among `{"year", ..., "nanoseconds"}`.
- `add_total_second` creates the `"total_second"` feature. The default is `True`.
- `errors` either raises errors (`errors="raise"`) or outputs `NaT` (`errors="coerce"`) when non-datetime-parsable values are given during `transform`. The default is `"coerce"`.

This PR required additional tests.

cc @LeoGrin @jeromedockes @GaelVaroquaux