Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MAIN Improve DatetimeEncoder #784

Merged

Conversation

Vincent-Maladiere
Copy link
Member

@Vincent-Maladiere Vincent-Maladiere commented Oct 5, 2023

References
Addresses issues raised in #743 and detailed in #768.

What changes?

  1. DatetimeEncoder exposes format_per_column_, which maps datetime parsable columns (seen during fit) to their first non-null entry. We aim to use this column selection in the TableVectorizer to unify the datetime column selection logic.
  2. We now keep constant features, as removing them dynamically changes the shape of the output. This previous behavior might break pipelines requiring a specific shape and is surprising for the user.
  3. On the other hand, we detect date columns (in opposition to datetime columns), e.g. formatted like "2023-01-01", and we only extract features up to the "day" level for those.
  4. extract_until = None doesn't create an extra column "total_second". It simply doesn't extract features among {"year", ..., "nanoseconds"}.
  5. The parameter add_total_second creates the "total_second" feature. The default is True.
  6. The parameter errors either raises errors (errors="raise") or outputs NaT (errors="coerce") when non-datetime-parsable values are given during transform. The default is "coerce".

This PR required additional tests.

cc @LeoGrin @jeromedockes @GaelVaroquaux

np_dtypes_candidates = [np.object_, np.str_, np.datetime64]
if any(np.issubdtype(X.dtype, np_dtype) for np_dtype in np_dtypes_candidates):
try:
_ = pd.to_datetime(X)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think to_datetime is stricter than DateTimeIndex which was used before (@LeoGrin also mentioned that in #768 here), as a result the docstring example prints an empty vectorized dataframe on my machine.

being a bit more strict is probably a good idea especially at first; using to_datetime instead of DateTimeIndex avoids issues such as this one, but the example needs to be updated

skrub/_datetime_encoder.py Outdated Show resolved Hide resolved
skrub/_datetime_encoder.py Outdated Show resolved Hide resolved
skrub/_datetime_encoder.py Outdated Show resolved Hide resolved
skrub/_datetime_encoder.py Outdated Show resolved Hide resolved
skrub/_datetime_encoder.py Outdated Show resolved Hide resolved
skrub/_datetime_encoder.py Outdated Show resolved Hide resolved
@jeromedockes
Copy link
Member

in a later PR the TableVectorizer will start using this module's is_datetime_parsable to identify datetime columns? Note there is some logic in the TableVectorizer to handle month first vs day first that may need to be moved into the DateTimeEncoder module

@jeromedockes
Copy link
Member

I wonder if we could have a transformer that just parses string columns that contain dates and replaces them with columns with datetime64 dtype? This is one thing that is annoying to do in pandas if we don't know in advance which columns are dates

import pandas as pd

df = pd.DataFrame({"date": ["2023-10-06", "2023-10-07"]})
df["date"] = pd.to_datetime(df["date"]) # avoid having to do this

@Vincent-Maladiere
Copy link
Member Author

Vincent-Maladiere commented Oct 6, 2023

in a later PR the TableVectorizer will start using this module's is_datetime_parsable to identify datetime columns?

My idea was to run enc = DateTimeEncoder().fit(X) in the TableVectorizer then look up enc.format_per_column_ to get the datetime parsable columns. WDYT?

is_datetime_parsable can also be used externally but is less convenient IMO.

Note there is some logic in the TableVectorizer to handle month first vs day first that may need to be moved into the DateTimeEncoder module

Yes, I need to add some of the the logic from _infer_date_format to handle dayfirst vs monthfirst better.

@Vincent-Maladiere
Copy link
Member Author

his is one thing that is annoying to do in pandas if we don't know in advance which columns are dates

I like this suggestion a lot, this is indeed very annoying. Does it need to be a transformer? I feel we could use a "switch-bait" function in the sense of fuzzy_join, like skrub.to_datetime(). WDYT? cc @GaelVaroquaux

@Vincent-Maladiere
Copy link
Member Author

Vincent-Maladiere commented Oct 12, 2023

@jeromedockes I went for the switch-bait to_datetime with a rework on the backend side to decouple DatetimeEncoder from the column selection, the format guessing, and the parsing logic. LMKWYT :)

I need to update the tests.

Copy link
Member

@jeromedockes jeromedockes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, it looks great! I wonder if we should make it for dataframes only (and possibly 2d arrays), and maybe give it a different names because not all columns are converted. I get your "switchbait" point but it may make the function more complex, 1d or less inputs are well handled by polars and pandas functions, and the skrub use case is DataFrames

I found a couple of issues, anything related to datetimes, timezones, and the interactions between standard library datetimes, pandas and polars is always a bit tricky :D

skrub/_datetime_encoder.py Show resolved Hide resolved
skrub/_datetime_encoder.py Outdated Show resolved Hide resolved
skrub/_datetime_encoder.py Outdated Show resolved Hide resolved
skrub/_datetime_encoder.py Show resolved Hide resolved
X_split = {col: X_split[col_idx] for col_idx, col in enumerate(X.columns)}
X = pd.DataFrame(X_split, index=index)
# conversion is px is Polars, no-op if Pandas
return px.DataFrame(X)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we build a polars dataframe directly instead of pandas first and then convert? AFAIK at the moment initializing a polars dataframe with a pandas one will make a copy

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also it is not due to the skrub code but the conversion fails when timezones are involved:

import polars as pl
from skrub import to_datetime

df = pl.DataFrame({"date": ["2023-10-12T17:36:50+01:00"]})
to_datetime(df) # pl.exceptions.ComputeError

but this seems to be more of a polars & pandas compatibility issue:

import pandas as pd
import polars as pl

df = pd.DataFrame({"date": ["2023-10-12T17:36:50+02:00"]})
df["date"] = pd.to_datetime(df["date"])
pl.DataFrame(df)

I'll open an issue on the polars repo

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we build a polars dataframe directly instead of pandas first and then convert? AFAIK at the moment initializing a polars dataframe with a pandas one will make a copy

Polars is still an optional dependency, so I guess we can't

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll open an issue on the polars repo

Thank you for this! Note that I haven't written tests for Polars yet, maybe in a subsequent PR to avoid obfuscating this one.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Polars is still an optional dependency, so I guess we can't

maybe not in this PR but in the polars/pandas namespace you added we could have a function to build a DataFrame from a dict of columns and an index (in the polars version the index would be ignored, possibly after checking that it is None)

Copy link
Member Author

@Vincent-Maladiere Vincent-Maladiere Oct 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes indeed! I'll create an issue for it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll open an issue on the polars repo

I opened pola-rs/polars#11774

skrub/_datetime_encoder.py Outdated Show resolved Hide resolved
skrub/_datetime_encoder.py Outdated Show resolved Hide resolved
skrub/_datetime_encoder.py Outdated Show resolved Hide resolved
skrub/_datetime_encoder.py Outdated Show resolved Hide resolved
skrub/_datetime_encoder.py Show resolved Hide resolved
@Vincent-Maladiere
Copy link
Member Author

Thanks, it looks great! I wonder if we should make it for dataframes only (and possibly 2d arrays), and maybe give it a different names because not all columns are converted. I get your "switchbait" point but it may make the function more complex, 1d or less inputs are well handled by polars and pandas functions, and the skrub use case is DataFrames

Thanks! I get your idea, but we might end up with different behavior for series (pd.to_datetime) vs dataframes (skrub.to_datetime). For consistency, I'd advocate for handling both 1d and 2d outputs, so that the user can always rely on skrub.to_datetime, WDYT? I also simplified the logic for series, to be the same as dataframes.

I found a couple of issues, anything related to datetimes, timezones, and the interactions between standard library datetimes, pandas and polars is always a bit tricky :D

Thanks for pointing out the issues! I haven't tested it extensively with Polars yet.

@Vincent-Maladiere
Copy link
Member Author

Vincent-Maladiere commented Oct 13, 2023

The CI errors with the old Pandas version.

Screen Shot 2023-10-13 at 15 56 00

@jeromedockes
Copy link
Member

Thanks! I get your idea, but we might end up with different behavior for series (pd.to_datetime) vs dataframes (skrub.to_datetime). For consistency, I'd advocate for handling both 1d and 2d outputs, so that the user can always rely on skrub.to_datetime, WDYT? I also simplified the logic for series, to be the same as dataframes.

yes that makes sense. skrub.to_datetime will be different than pd.to_datetime and handle only a subset of usecases, so we should be careful to document that and about how we handle parameters that don't apply (such as "unit")

for example pandas can convert to datetime a dataframe with separate columns for year, month, etc. whereas skrub looks at each column separately

import pandas as pd
from skrub import to_datetime

df = pd.DataFrame({"year": [2023, 2024], "month": [1, 2], "day": [12, 13]})
print("pandas:")
dt = pd.to_datetime(df)
print(dt)
print("""
------------
skrub:""")
sdt = to_datetime(df)
print(sdt)
pandas:
0   2023-01-12
1   2024-02-13
dtype: datetime64[ns]

------------
skrub:
   year  month  day
0  2023      1   12
1  2024      2   13

@jeromedockes
Copy link
Member

also I like your choice to handle series basically as dataframes with one column and I wonder if we should have this harmonization for arrays and scalars too?

import pandas as pd
from skrub import to_datetime

s = pd.Series(["a", "b"])
print(to_datetime(s))
print("""
-----------------
""")
print(to_datetime(s.values))
0    a
1    b
dtype: object

-----------------

/home/jerome/workspace/backedup_repositories/skrub/skrub/_datetime_encoder.py:85: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
  return pd.to_datetime(X, **kwargs)
DatetimeIndex(['NaT', 'NaT'], dtype='datetime64[ns]', freq=None)

@jeromedockes
Copy link
Member

We also have doctest errors everywhere. It seems very picky.

yes doctests compares the output as strings, so from this point of view "2022" and "2.022e+03" are not the same. I'll look into setting numpy and pandas formatting options so we get more sensible outputs

@Vincent-Maladiere
Copy link
Member Author

Vincent-Maladiere commented Nov 3, 2023

One caveat of this first implementation of skrub.to_datetime is that when we pass a numpy array of object dtypes, with mixed-typed entries, we might end up with some numeric representations of the date that are not helpful.

import numpy as np
import pandas as pd
from skrub import to_datetime

df = pd.DataFrame(
    dict(
        a=["2020-01-01", "2021-01-01"],
        b=[2020, 2021],
    )
)

# OK
to_datetime(df)
# a	b
# 0	2020-01-01	2020
# 1	2021-01-01	2021

# Not very helpful
to_datetime(df.values)
# array([[1609459200000000000, 2020],
#       [1609545600000000000, 2021]], dtype=object)

# Better, but needs all columns to be datetime
to_datetime(df.values[:, [0]])
# array([['2020-01-01T00:00:00.000000000'],
#       ['2021-01-01T00:00:00.000000000']], dtype='datetime64[ns]')

# OK, by converting to pydatetime
X_split = list(df.values.T)
X_split[0] = pd.to_datetime(X_split[0]).to_pydatetime()
out = np.vstack(X_split).T
out
# array([[datetime.datetime(2020, 1, 1, 0, 0), 2020],
#      [datetime.datetime(2021, 1, 1, 0, 0), 2021]], dtype=object)

# Notice how we still lose the integer representation of the second column.
pd.DataFrame(out).dtypes
# 0    datetime64[ns]
# 1            object

That being said, we can't cast datetime.datetime or pd.Timestamp entries to integers or float, so maybe the numeric representation of numpy i.e. the output to_datetime(df.values) is not that bad, WDYT @jeromedockes?

Anyway, we convert dataframes to a set of numpy 1d array, which might be problematic for Polars datetime representation for example.

Therefore, we should consider improving the to_datetime logic in a subsequent PR to specialize the ndarray, Polars, and Pandas dataframes / series cases, instead of reducing everything to a set of 1d numpy array.

@jeromedockes
Copy link
Member

# Not very helpful
to_datetime(df.values)
# array([[1609459200000000000, 2020],
#       [1609545600000000000, 2021]], dtype=object)

for this one I see a different result:

[[1577836800000000000 2020]
 [1609459200000000000 2021]]

which corresponds to the timestamps of the 2 dates. Why do you think that is not helpful?

@jeromedockes
Copy link
Member

in the case where we have to output a numerical numpy array, the timestamp is probably the best we can do. but as you said in most cases we probably want dataframes as input and outputs, with proper datetime columns in the outputs

skrub/_datetime_encoder.py Outdated Show resolved Hide resolved
skrub/_datetime_encoder.py Show resolved Hide resolved
"""
Parameters
----------
X_col : ndarray of shape ``(n_samples,)``
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If missing values are not allowed in X_col I think we should mention it in the docstring

skrub/_datetime_encoder.py Outdated Show resolved Hide resolved
with warnings.catch_warnings():
warnings.simplefilter("ignore", category=UserWarning)
# pd.unique handles None
month_first_formats = pd.unique(vfunc(X_col, dayfirst=False))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if it could be a problem but note here we might end up with the string 'None' in the result, eg for _guess_datetime_format(np.asarray(["01/23/2021", ""])) we get array(['%m/%d/%Y', 'None'], dtype='<U8') in month_first_formats

skrub/_datetime_encoder.py Outdated Show resolved Hide resolved
skrub/_datetime_encoder.py Outdated Show resolved Hide resolved
skrub/_datetime_encoder.py Outdated Show resolved Hide resolved
Co-authored-by: Jérôme Dockès <[email protected]>
@Vincent-Maladiere
Copy link
Member Author

for this one I see a different result:

Yes, the results are just for illustration don't mind them.

which corresponds to the timestamps of the 2 dates. Why do you think that is not helpful?

More specifically, not helpful in the context of working with dataframes because we lose the datetime representation (using pd.to_datetime with timestamps is always tricky because of the unit kwarg, and this is not something we can do with skrub.to_datetime), but helpful for ML downstream applications.

Copy link
Member

@jeromedockes jeromedockes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@jeromedockes
Copy link
Member

@LeoGrin it has changed a bit since your approval, do you want to have another look? If not I will merge it

@LeoGrin
Copy link
Contributor

LeoGrin commented Nov 8, 2023

@LeoGrin it has changed a bit since your approval, do you want to have another look? If not I will merge it

I'll have a look tonight, thanks!


Parameters
----------
X_col : ndarray of shape ``(n_samples,)``
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Upping Jerome's comment "If missing values are not allowed in X_col I think we should mention it in the docstring"

@LeoGrin
Copy link
Contributor

LeoGrin commented Nov 8, 2023

Just a few comments but otherwise LGTM, thanks!

@Vincent-Maladiere
Copy link
Member Author

Let's merge then!

@Vincent-Maladiere Vincent-Maladiere merged commit 2bda119 into skrub-data:main Nov 9, 2023
21 checks passed
@Vincent-Maladiere Vincent-Maladiere deleted the refacto_datetime_encoder branch November 9, 2023 10:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants