Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

deps: update to ibis-framework 9.x and newer sqlglot #827

Merged
merged 86 commits into from
Sep 13, 2024

Conversation

tswast
Copy link
Collaborator

@tswast tswast commented Jul 8, 2024

This change is updating to Ibis-framework 9.x and a newer version of SQLGLot. The Ibis upgrading also removes previous version restrictions on certain packages. Specifically, it expands the allowable version of pyarrow (from 15.0.2 to 17.0.0) and numpy (from 1.26.4 to 2.1.1).

  • Fixes internal issue 350749011
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes internal issue 350749011 🦕

@product-auto-label product-auto-label bot added size: m Pull request size is medium. api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. labels Jul 8, 2024
@tswast
Copy link
Collaborator Author

tswast commented Jul 9, 2024

Getting

pytest tests/system/small/test_dataframe.py::test_dataframe_agg_int_single_string
...
../../envs/bigframes/lib/python3.11/site-packages/sqlglot/dialects/bigquery.py:53: in <listcomp>
    exp.PropertyEQ(this=exp.to_identifier(name), expression=fld)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

name = Column(
  this=Identifier(this=col_18, quoted=True)), quoted = None, copy = True

    def to_identifier(name, quoted=None, copy=True):
        """Builds an identifier.
    
        Args:
            name: The name to turn into an identifier.
            quoted: Whether to force quote the identifier.
            copy: Whether to copy name if it's an Identifier.
    
        Returns:
            The identifier ast node.
        """
    
        if name is None:
            return None
    
        if isinstance(name, Identifier):
            identifier = maybe_copy(name, copy)
        elif isinstance(name, str):
            identifier = Identifier(
                this=name,
                quoted=not SAFE_IDENTIFIER_RE.match(name) if quoted is None else quoted,
            )
        else:
>           raise ValueError(f"Name needs to be a string or an Identifier, got: {name.__class__}")
E           ValueError: Name needs to be a string or an Identifier, got: <class 'sqlglot.expressions.Column'>

../../envs/bigframes/lib/python3.11/site-packages/sqlglot/expressions.py:6798: ValueError

for DataFrame.agg and related tests like describe. Might have something to do with how we're unnesting memtables. Will have to investigate further.

@product-auto-label product-auto-label bot added size: l Pull request size is large. and removed size: m Pull request size is medium. labels Jul 9, 2024
@tswast
Copy link
Collaborator Author

tswast commented Jul 15, 2024

Re: FAILED tests/system/small/test_pandas.py::test_get_dummies_dataframe[kwargs2] - AssertionError: DataFrame.iloc[:, 11] (column name="time_col.11:14:34.701606") are different

Looks like the time scalar is losing microsecond precision.

  COALESCE(`t0`.`time_col` = time(11, 14, 34), FALSE) AS `col_15`,
  COALESCE(`t0`.`time_col` = time(11, 41, 43), FALSE) AS `col_17`,
  COALESCE(`t0`.`time_col` = time(12, 0, 0), FALSE) AS `col_19`,
  COALESCE(`t0`.`time_col` = time(15, 16, 17), FALSE) AS `col_21`,
  COALESCE(`t0`.`time_col` = time(23, 59, 59), FALSE) AS `col_23`,

Edit: looks like an ibis or maybe sqlglot bug. Filed ibis-project/ibis#9609

@tswast tswast added the do not merge Indicates a pull request not ready for merge, due to either quality or timing. label Jul 16, 2024
@tswast
Copy link
Collaborator Author

tswast commented Jul 16, 2024

Marking as do not merge because to fix 3.9, we'll need to vendor ibis 9.x into third_party.

@product-auto-label product-auto-label bot added size: l Pull request size is large. and removed size: xl Pull request size is extra large. labels Sep 10, 2024
@@ -31,6 +31,17 @@
scalar_compiler = scalar_compilers.scalar_op_compiler


# TODO(swast): We can remove this if ibis adds general approx_quantile
# See: https://github.com/ibis-project/ibis/issues/9541
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI (no change necessary for this PR): An approximate quantile node type has been merged into ibis. ibis-project/ibis#9881

Once we start vendoring the Ibis expressions in addition to the compiler, we can take advantage of that (if we want).

tests/system/small/test_dataframe.py Outdated Show resolved Hide resolved
column_min = X[column].min()
column_max = X[column].max()

# Use Python value rather than Numpy value to serialization.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems a bit fragile - should the sql generator be more strict in its formatting instead?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 This is a bit scary and may indicate a bigger problem. Would adding numpy literal support to BigQueryCompiler.visit_NonNullLiteral allow us to avoid this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved changes to the ML SQL generation resided in ml/sql.py. Other data core APIs tests didn't have same complains so that BigQueryCompiler.visit_NonNullLiteral remain unaffected.

Comment on lines +182 to +186
return (
value.fill_null(ibis_types.literal("$NULL_SENTINEL$"))
if hasattr(value, "fill_null")
else value.fillna(ibis_types.literal("$NULL_SENTINEL$"))
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is going on here? This seems fragile - what types no longer have fillna?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fillna was renamed to fill_null in ibis-project/ibis#9300 (ibis 9.1.0)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, fillna was renamed to fill_null in Ibis 9.1.0. Using fillna with this version or later will trigger a deprecation warning. However, we cann't use fill_null because it's not available in Ibis 9.0.0. Since Ibis 9.1.0 and later dropped support for Python 3.9, we must use Ibis 9.0.0 to maintain compatibility with Python 3.9

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I do hate to have hasattr checks in our code. Should we just use fillna and suppress warnings?

left = x.cast(ibis_dtypes.str).fillna(ibis_types.literal("$NULL_SENTINEL$"))
right = y.cast(ibis_dtypes.str).fillna(ibis_types.literal("$NULL_SENTINEL$"))
literal = ibis_types.literal("$NULL_SENTINEL$")
if hasattr(x, "fill_null"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is causing the fillna/fill_nbull divide?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ibis 9.0.0 -> Ibis 9.1.0. We need Ibis 9.0.0 for Python 3.9 support, at least until we vendor the ibis expr APIs too, which we'll want to do sometime between now and bigframes 2.0.

Comment on lines +73 to +77
# In NumPy versions 2 and later, `np.floor` and `np.ceil` now produce integer
# outputs for the "int64_col" column.
if opname in ["floor", "ceil"] and isinstance(
pd_result["int64_col"].dtypes, pd.Int64Dtype
):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a big behavior change?! I thought numpy 2 was backwards compatible?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There were behavior changes. See: https://numpy.org/doc/stable/numpy_2_0_migration_guide.html#numpy-2-migration-guide

I suspect it may relate to this change: https://numpy.org/neps/nep-0050-scalar-promotion.html#nep50

Type promotion is no longer value dependent, only dtype dependent.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we match the new behavior then? dtype-dependent is something we can actually emulate perfectly, unlike value-dependency

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we match the new behavior then? dtype-dependent is something we can actually emulate perfectly, unlike value-dependency

Yes. Maybe bigframes 2.0? Could you file an issue @TrevorBergeron ?

column_min = X[column].min()
column_max = X[column].max()

# Use Python value rather than Numpy value to serialization.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 This is a bit scary and may indicate a bigger problem. Would adding numpy literal support to BigQueryCompiler.visit_NonNullLiteral allow us to avoid this?

Comment on lines +73 to +77
# In NumPy versions 2 and later, `np.floor` and `np.ceil` now produce integer
# outputs for the "int64_col" column.
if opname in ["floor", "ceil"] and isinstance(
pd_result["int64_col"].dtypes, pd.Int64Dtype
):
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There were behavior changes. See: https://numpy.org/doc/stable/numpy_2_0_migration_guide.html#numpy-2-migration-guide

I suspect it may relate to this change: https://numpy.org/neps/nep-0050-scalar-promotion.html#nep50

Type promotion is no longer value dependent, only dtype dependent.

@cpcloud
Copy link

cpcloud commented Sep 13, 2024

Reading through some of the review here, I would really love for y'all to chime in more proactively on the Ibis issue tracker.

I understand that contributing to Ibis isn't top priority, but it might help ease some of the pain here if we had some advance notice about any difficulty a particular change might create. I also realize that may not be knowable until you actually do the work to upgrade to a later version of Ibis.

Generally speaking, if there's a decision that would cause lots of headaches, or that we expect might, we'd like to discuss and get feedback on it.

Copy link
Collaborator Author

@tswast tswast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@chelsea-lin
Copy link
Contributor

chelsea-lin commented Sep 13, 2024

Thanks for your input, @cpcloud. We appreciate you taking the time to engage with us on this thread.
We generally aim to report issues that are relevant to the wider Ibis community directly in the issue tracker, so everyone can benefit from the discussion and resolution. However, we also recognize that some issues might be specific to our use case, and in those cases, we'll handle them internally.
We value your insights and want to ensure we're collaborating effectively. Are there any particular issues you've observed in our usage of Ibis that you'd like to discuss further? Your feedback helps us to better understand your priorities and identify areas for improvement.

@chelsea-lin chelsea-lin merged commit 89ea44f into main Sep 13, 2024
20 of 23 checks passed
@chelsea-lin chelsea-lin deleted the b350749011-ibis-9 branch September 13, 2024 22:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. size: l Pull request size is large.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants