-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for ingesting CSV files #2
Conversation
bcbaa7f
to
a44922a
Compare
a44922a
to
87ac971
Compare
This fixes read from the database through Polars and sqlalchemy after recent fixes to the Polars Python package.
Can you describe the bugs you've fixed so we can pay attention as we review the code? |
At this point the code has widely diverged from the original (which was never reviewed), so it might be best to look at all code as if it was new. |
76f02f5
to
32c7173
Compare
32c7173
to
eb51d55
Compare
) -> DuckDBPyRelation: | ||
"""Add a datetime column to the relation.""" | ||
# TODO | ||
raise NotImplementedError |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like we'll need two handling, one without tz-offset and one with.
src/chronify/time_configs.py
Outdated
"If None, timestamps are timezone-naive.", | ||
), | ||
] = None | ||
start: datetime # TODO: what if the time zone is specified here? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While this is supported in other packages like pandas, IMO, it can be ambiguous here because:
If only offset is specified, we can only assume the same offset throughout, making it a standard time assumption.
That said, we can support that as a future feature?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This behavior is now different. There is no longer a time_zone field. The user has to specify the time zone, if any, in this field. Please raise any concerns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added support for what we discussed last Thursday. Essentially, we will store whatever the user passes: time zones or not. datetimes will be converted to strings when using SQLite, and so we have to implement custom code to read it back. I didn't get anything to work with the sqlalchemy DateTime type with timezone=True.
Now, all of the applicable tests run on DuckDB and SQLite. Spark will come later.
@@ -8,7 +8,3 @@ repos: | |||
args: [ --fix ] | |||
# Run the formatter. | |||
- id: ruff-format | |||
- repo: https://github.com/pre-commit/mirrors-mypy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pesap My take is to remove it. Running it here doesn't use pyproject.toml. We would have to make other changes. The action for mypy does follow pyproject.toml.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you pass language system it uses the mypy that you have installed from the environment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But that still doesn't respect settings in pyproject.toml. With the changes here, if the developer runs mypy
in the terminal, the right things will happen. The CI job does that as a check. The downside is that pre-commit doesn't run mypy. We would have to add duplicate settings in the pre-commit config file.
src/chronify/time_configs.py
Outdated
"If None, timestamps are timezone-naive.", | ||
), | ||
] = None | ||
start: datetime # TODO: what if the time zone is specified here? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This behavior is now different. There is no longer a time_zone field. The user has to specify the time zone, if any, in this field. Please raise any concerns.
"""Read a database query into a Pandas DataFrame.""" | ||
df = pl.read_database(query, connection=conn).to_pandas() | ||
config = schema.time_config | ||
if config.needs_utc_conversion(conn.engine.name): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note SQLite special cases here and in the next function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to add pandas just to do the datetime conversion? Maybe we can just make simple function that uses bare python
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we agreed that pandas is part of the public API. It is what the user gets when they call read_table
. Numpy array is not a good option because the table will have mixed types. Arrow table is an option, but that's not what users want. I'm skeptical that people want a duckdb relation. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pandas is fine with me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Nothing major to address only some comments.
@@ -8,7 +8,3 @@ repos: | |||
args: [ --fix ] | |||
# Run the formatter. | |||
- id: ruff-format | |||
- repo: https://github.com/pre-commit/mirrors-mypy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you pass language system it uses the mypy that you have installed from the environment
def read_csv(path: Path | str, schema: CsvTableSchema, **kwargs: Any) -> DuckDBPyRelation: | ||
"""Read a CSV file into a DuckDB relation.""" | ||
if schema.column_dtypes: | ||
dtypes = {x.name: get_duckdb_type_from_sqlalchemy(x.dtype) for x in schema.column_dtypes} | ||
rel = duckdb.read_csv(str(path), dtype=dtypes, **kwargs) | ||
else: | ||
rel = duckdb.read_csv(str(path), **kwargs) | ||
|
||
expr = ",".join(rel.columns) | ||
return duckdb.sql(f"SELECT {expr} FROM rel") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we catch any errors or just let duckdb to trow and error for a badly format csv
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DuckDB should have it covered.
"""Read a database query into a Pandas DataFrame.""" | ||
df = pl.read_database(query, connection=conn).to_pandas() | ||
config = schema.time_config | ||
if config.needs_utc_conversion(conn.engine.name): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to add pandas just to do the datetime conversion? Maybe we can just make simple function that uses bare python
@@ -169,15 +174,15 @@ def get_standard_time(tz: TimeZone) -> TimeZone: | |||
|
|||
|
|||
def get_prevailing_time(tz: TimeZone) -> TimeZone: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add definition here of what is prevailing time?
conn.commit() | ||
self.update_table_schema() | ||
|
||
def read_query(self, query: Selectable | str, schema: TableSchema) -> pd.DataFrame: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is currently an unfortunate limitation here. You have to pass the schema because SQLite is forcing us to perform custom logic on the time columns. We don't know which columns are time columns yet. We need a second table to store that metadata. I propose that we defer that to a subsequent PR.
"""Read a database query into a Pandas DataFrame.""" | ||
df = pl.read_database(query, connection=conn).to_pandas() | ||
config = schema.time_config | ||
if config.needs_utc_conversion(conn.engine.name): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pandas is fine with me.
Welcome to Codecov 🎉Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests. Thanks for integrating Codecov - We've got you covered ☂️ |
This PR adds support for the following:
It fixes an assortment of bugs and oversights from the initial version.
It also adds test coverage for most of the code. We are now at 94%.
The Polars bug with sqlalchemy has been fixed. You will now get v1.11.0. Also, sqlalchemy is now fixed to v2.0.35 because duckdb_engine is not yet compatible with 2.0.36.