Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(data frame): Support polars #1474

Merged
merged 38 commits into from
Jul 9, 2024

Conversation

machow
Copy link
Collaborator

@machow machow commented Jun 21, 2024

This PR addresses #1439, by generalizing pandas.DataFrame specific logic to include Polars. It adds a module for DataFrame specific logic (_tbl_data.py) and simple tests for each piece.

From pairing with @schloerke, it seems like these next steps might be useful:

  • Move DataFrame specific logic into _tbl_data.py
  • Test and verify it works as expected on both Polars and Pandas
  • Evaluate subbing in narwhals for _tbl_data.py logic (cc @MarcoGorelli)
    • This should be easy with all the logic in one place / tests around behavior. Seems doable!

Notes:

  • It appears that shiny currently evaluates datetime columns to "unknown"
  • Polars coerces subclasses of str back up to str, so certain htmltools tags can't be identified in a Polars Series.
    • (we discussed subclassing collections.UserString instead, but also probably worth bringing up w/ Polars)
  • Barret and I have pairing time set up Monday and Tues, so should be able to knock it out then!

@schloerke
Copy link
Collaborator

schloerke commented Jun 21, 2024

Notes:

  • It appears that shiny currently evaluates datetime columns to "unknown"

We currently do not have special support in the browser... so we let the json to-string method handle it.

But we should definitely add support at some point!

@schloerke schloerke added the data frame Related to @render.data_frame label Jun 21, 2024
@schloerke schloerke added this to the v0.11.0 milestone Jun 21, 2024
@machow
Copy link
Collaborator Author

machow commented Jun 22, 2024

It seems like the JSON string method converts it to an int

image

opened issue here: #1477

@MarcoGorelli
Copy link

Evaluate subbing in narwhals for _tbl_data.py logic (cc @MarcoGorelli)

Thanks for the ping! Happy to help here, please do let me know if there's anything missing which you'd need or which would make your work easier

As a potential consumer, I'd be particularly interested in your thoughts on the StableAPI proposal here: narwhals-dev/narwhals#259 (comment). The objective is to give consumers a way such that, so long as they use StableAPI with a given version, then we will never break any public function they use. The idea's inspired by Rust's Editions

@machow
Copy link
Collaborator Author

machow commented Jun 24, 2024

From pairing w/ Barret, we're able to serialize pandas datetime series to ISO 8601 strings. There's a note in the code about possible results from pandas infer_dtype() function. A key to this function is that afaik it's responsible for inferring types from lists, so some of the possible outputs are induced by list[some_type].

# investigating the different outputs from infer_dtype
import pandas as pd
from datetime import date, datetime

val_dt = pd.to_datetime(["2020-01-01"])
val_per = pd.Period('2012-1-1', freq='D')

col_per = pd.Series([val_per])    # period
col_date = [date.today()]   # date
col_datetime = [datetime.now()]    # datetime
col_datetime64 = pd.Series(pd.to_datetime(["2020-01-01"]))    # datetime64


pd.api.types.infer_dtype(col_date)

@machow
Copy link
Collaborator Author

machow commented Jun 27, 2024

From pairing w/ @schloerke, we discovered that pandas.DataFrame.to_json has an interesting serialization strategy for custom objects.

import pandas as pd
from dataclasses import dataclass

class C:
    x: int

    def __init__(self, x: int):
        self.x = x

    def __str__(self):
        return f"I am C({self.x})"


@dataclass
class D:
    y: int

df = pd.DataFrame({"x": [C(1), D(2)]})
df.to_json()
#>  {"x":{"0":{"x":1},"1":{"y":2}}}

Notice that it somehow serialized C(1) to {"x": 1}. This is because it seems to use C(1).__dict__. However, you can pass a default handler to override this behavior

df.to_json(default_handler=str)
#> {"x":{"0":"I am C(1)","1":"D(y=2)"}}

Notice that the outputs are now the result of called .__str__(). Critically, .to_json() passes things like built-in dictionaries and lists through, so default_handler does not affect them.

pd.DataFrame({"x": [{"A": 1}, [8, 9]]}).to_json()
#> {"x":{"0":{"A":1},"1":[8,9]}

@machow
Copy link
Collaborator Author

machow commented Jun 27, 2024

Alright, handing off to @schloerke. There are two outstanding pieces:

  • replacing pandas logic in _data_frame.py (and gridtable file). The functions in _tbl_data are stubbed out with basic tests, and should be rough-but-ready to replace them. We need renderer tests (playwright?) to validate they work as expected when swapped in.
  • swapping in narwhals: I think a nice approach to this is the strangler fig strategy. Essentially, we could register narwhals implementations in _tbl_data.py. Once all paths work with narhwals, we could always wrap dataframes to be narwhals objects (then, can eventually remove other dataframe logic from _tbl_data.py / delete it.).

More on swapping in narwhals

Currently, _tbl_data.py means to hold all DataFrame specific logic. This makes it easy to communicate and test what DataFrame actions we need. This way, we can start on refactoring in narwhals, while manually adding extra bits of DataFrame logic as needed.

For example, @schloerke mentioned needing a get_column_names() function.

@singledispatch
def get_column_names(data: DataFrameLike) -> list[str]:
    raise TypeError()

@get_column_names.register
def _(data: pd.DataFrame) -> list[str]:
    # note that technically column names don't have to be strings in Pandas
    # so you might add validation, etc.. here
    return list(data.columns)

@get_column_names.register
def _(data: pl.DataFrame) -> list[str]:
    return data.columns

@get_column_names.register
def _(data: nw.DataFrame) -> list[str]:
    return data.columns

Notice that once narwhals is fully wired up everywhere, we can just always wrap inputs to the functions with narwhals.DataFrame. This is a classic refactor approach (strangler fig pattern), because we just wrap the old system, and keep things rolling, until we can fully flip the new one on (e.g. everything narwhals).

The number 1 reason IMO for not going directly to narwhals is that the requirements on the shiny side need to be fleshed out. I think it'll help to flesh out support for pandas DataFrames, and add test cases against Polars and Pandas, to indicate when a refactor has succeeded (or is blocked).

* main:
  fix(tests): dynamically determine the path to the shiny app (posit-dev#1485)
  tests(deploys): use a stable version of html tools instead of main branch (posit-dev#1483)
  feat(data frame): Support basic cell styling (posit-dev#1475)
  fix: support static files on pyodide / py.cafe under a prefix (posit-dev#1486)
  feat: Dynamic theming (posit-dev#1358)
  Add return type for `_task()` (posit-dev#1484)
  tests(controls): Change API from controls to controller (posit-dev#1481)
  fix(docs): Update path to reflect correct one (posit-dev#1478)
  docs(testing): Add quarto page for testing (posit-dev#1461)
  fix(test): Remove unused testrail reporting from nightly builds (posit-dev#1476)
* main:
  test(controllers): Refactor column sort and filter methods for Dataframe class (posit-dev#1496)
  Follow up to posit-dev#1453: allow user roles when normalizing a dictionary (posit-dev#1495)
  fix(layout_columns): Fix coercion of scalar row height to list for python <= 3.9 (posit-dev#1494)
  Add `shiny.ui.Chat` (posit-dev#1453)
  docs(Theme): Fix example and clarify usage (posit-dev#1491)
  chore(pyright): Pin pyright version to `1.1.369` to avoid CI failures (posit-dev#1493)
  tests(dataframe): Add additional tests for dataframe (posit-dev#1487)
  bug(data frame): Export `render.StyleInfo` (posit-dev#1488)
@machow
Copy link
Collaborator Author

machow commented Jul 8, 2024

From chatting with @schloerke, I glanced over the code just now and it LGTM. I think the _pandas.py module could be moved into _tbl_data.py, so there's a very clear rule: all pandas and polars logic live in one place. But IMO having an extra file is reasonable.

I didn't know enough about a lot of the shiny bits to have an opinion on things outside _tbl_data.py 😅

@schloerke schloerke changed the title feat: support polars in data grid feat(data frame): Support polars Jul 8, 2024
@schloerke schloerke marked this pull request as ready for review July 8, 2024 21:05
@schloerke schloerke added this pull request to the merge queue Jul 9, 2024
Merged via the queue into posit-dev:main with commit 08cd1f0 Jul 9, 2024
31 checks passed
schloerke added a commit that referenced this pull request Jul 9, 2024
* main:
  feat(data frame): Support `polars` (#1474)
  api(playwright): Code review of complete playwright API (#1501)
  fix: Move `www/shared/py-shiny` to `www/py-shiny` (#1499)
  test(controllers): Refactor column sort and filter methods for Dataframe class (#1496)
  Follow up to #1453: allow user roles when normalizing a dictionary (#1495)
  fix(layout_columns): Fix coercion of scalar row height to list for python <= 3.9 (#1494)
  Add `shiny.ui.Chat` (#1453)
  docs(Theme): Fix example and clarify usage (#1491)
  chore(pyright): Pin pyright version to `1.1.369` to avoid CI failures (#1493)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
data frame Related to @render.data_frame
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants