Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: Table.Sample doesn't sample on DuckDB after ibis version > 9.0.0 #10116

Closed
1 task done
balintpto opened this issue Sep 13, 2024 · 16 comments
Closed
1 task done

bug: Table.Sample doesn't sample on DuckDB after ibis version > 9.0.0 #10116

balintpto opened this issue Sep 13, 2024 · 16 comments
Labels
bug Incorrect behavior inside of ibis

Comments

@balintpto
Copy link

balintpto commented Sep 13, 2024

What happened?

Sampling doesn't seem to have an effect with ibis versions higher then 9.0.0 (I have to use 9.0.0 because of this)

9.5.0:
image

9.0.0:
image

Code:

import ibis as ib

t = ib.memtable({"x": [1, 2, 3, 4], "y": ["a", "b", "c", "d"]})

t.sample(0.5, seed=1234).execute()

What version of ibis are you using?

9.5.0

What backend(s) are you using, if any?

DuckDB

Relevant log output

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@balintpto balintpto added the bug Incorrect behavior inside of ibis label Sep 13, 2024
@gforsyth
Copy link
Member

Hi @balintpto -- thanks for the report! I can't reproduce this -- can you try also updating duckdb itself?

[ins] In [1]: import ibis as ib
         ...: 
         ...: t = ib.memtable({"x": [1, 2, 3, 4], "y": ["a", "b", "c", "d"]})
         ...: 
         ...: t.sample(0.5, seed=1234).execute()
Out[1]: 
   x  y
0  2  b
1  3  c

[ins] In [2]: import duckdb

[ins] In [3]: duckdb.__version__
Out[3]: '1.1.0'

@balintpto
Copy link
Author

Hey, I just reinstalled duckdb and getting the same problem. image pip freeze: asttokens==2.4.1 atpublic==4.1.0 attrs==23.2.0 bidict==0.23.1 bleach==6.1.0 bokeh==3.4.1 certifi==2024.2.2 cffi==1.16.0 charset-normalizer==3.3.2 colorama==0.4.6 colorcet==3.1.0 comm==0.2.2 contourpy==1.2.1 cryptography==42.0.7 cycler==0.12.1 debugpy==1.8.1 decorator==5.1.1 duckdb==1.1.0 executing==2.0.1 fonttools==4.51.0 h11==0.14.0 holoviews==1.18.3 hvplot==0.10.0 ibis-framework==9.5.0 idna==3.7 ipykernel==6.29.4 ipython==8.24.0 jedi==0.19.1 Jinja2==3.1.4 joblib==1.4.2 jupyter_client==8.6.2 jupyter_core==5.7.2 kiwisolver==1.4.5 linkify-it-py==2.0.3 llvmlite==0.43.0 Markdown==3.6 markdown-it-py==3.0.0 MarkupSafe==2.1.5 matplotlib==3.9.0 matplotlib-inline==0.1.7 mdit-py-plugins==0.4.1 mdurl==0.1.2 nest-asyncio==1.6.0 numba==0.60.0 numpy==1.26.4 oracledb==2.2.0 outcome==1.3.0.post0 packaging==24.0 pandas==2.2.2 panel==1.4.2 param==2.1.0 parso==0.8.4 parsy==2.1 pillow==10.3.0 platformdirs==4.2.2 polars==0.20.31 prompt_toolkit==3.0.45 psutil==5.9.8 pure-eval==0.2.2 pyarrow==16.1.0 pyarrow-hotfix==0.6 pycparser==2.22 Pygments==2.18.0 pyparsing==3.1.2 PySocks==1.7.1 python-dateutil==2.9.0.post0 pytz==2024.1 pyviz_comms==3.0.2 pywin32==306 PyYAML==6.0.1 pyzmq==26.0.3 regex==2024.5.15 requests==2.32.1 rich==13.7.1 scikit-learn==1.5.0 scipy==1.13.0 seaborn==0.13.2 selenium==4.21.0 six==1.16.0 sniffio==1.3.1 sortedcontainers==2.4.0 sqlglot==23.12.2 stack-data==0.6.3 threadpoolctl==3.5.0 toolz==0.12.1 tornado==6.4 tqdm==4.66.4 traitlets==5.14.3 trio==0.25.1 trio-websocket==0.11.1 typing_extensions==4.11.0 tzdata==2024.1 uc-micro-py==1.0.3 urllib3==2.2.1 wcwidth==0.2.13 webencodings==0.5.1 wsproto==1.2.0 xyzservices==2024.4.0

@gforsyth
Copy link
Member

Hm, that's odd. I can't tell from the screenshot what kind of IDE you're running in, but could you try running in a non-fancy Python REPL so we can rule out outside factors?

@balintpto
Copy link
Author

Sure, I'm using vscode. I tried from terminal with prints added and the same happens

@cpcloud
Copy link
Member

cpcloud commented Sep 13, 2024

Also if you can run in DuckDB without ibis that'd be useful in ruling out a bug on our side.

@balintpto
Copy link
Author

Also if you can run in DuckDB without ibis that'd be useful in ruling out a bug on our side.

Can you post an example command? Never used duckdb without ibis

@gforsyth
Copy link
Member

Here's one you can run from Python:

>>> import pandas as pd

>>> df = pd.DataFrame.from_dict({"a": [1, 2, 3, 4], "b": ["a", "b", "c", "d"]})

>>> import duckdb

>>> con = duckdb.connect()

>>> con.sql("SELECT * FROM df TABLESAMPLE bernoulli (50.0 PERCENT) REPEATABLE (1234)")
┌───────┬─────────┐
│   ab    │
│ int64varchar │
├───────┼─────────┤
│     2b       │
│     3c       │
└───────┴─────────┘

@lostmygithubaccount
Copy link
Member

it does look fine for me in VSCode (I'm in a branch off main):

image

@cpcloud
Copy link
Member

cpcloud commented Sep 13, 2024

I tried this on main and on the commit associated with the 9.5.0 tag and I also am not able to reproduce the problem:

70de8db

In [1]: import ibis as ib
   ...:
   ...: t = ib.memtable({"x": [1, 2, 3, 4], "y": ["a", "b", "c", "d"]})
   ...:
   ...: t.sample(0.5, seed=1234).execute()
Out[1]:
   x  y
0  2  b
1  3  c

9.5.0

In [1]: import ibis as ib
   ...:
   ...: t = ib.memtable({"x": [1, 2, 3, 4], "y": ["a", "b", "c", "d"]})
   ...:
   ...: t.sample(0.5, seed=1234).execute()
Out[1]:
   x  y
0  2  b
1  3  c

@cpcloud
Copy link
Member

cpcloud commented Sep 13, 2024

@balintpto Can you also show the generated SQL? It should look something like this:

In [3]: ib.to_sql(t.sample(0.5, seed=1234))
Out[3]:
SELECT
  *
FROM "ibis_pandas_memtable_l5yj4ykbenhr5j7jl22uazkkyi" AS "t0" TABLESAMPLE bernoulli (50.0 PERCENT) REPEATABLE (1234)

@lostmygithubaccount
Copy link
Member

I'll just note that if I remove the seed, this operation results in anywhere from 0 to all 4 rows. since sampling is non-deterministic, it's possible this is just the result you're getting for your environment with this particular seed

could you try other seeds? clearly this isn't ideal (ideally, you'd get exactly half of the number of rows) but I'm not sure there's a ton Ibis can do about it

@gforsyth
Copy link
Member

Hi @balintpto -- well, you've got our attention, certainly!

Could you also try this out with a (slightly) larger dataset? It could be that on your machine, with the seed=1234 that it's just returning all the rows (sampling is inexact):

>>> import ibis

>>> import string

>>> t = ibis.memtable({"num": range(26), "let": list(string.ascii_lowercase)})

>>> ibis.options.interactive = True

>>> t.sample(0.5, seed=1234)

@balintpto
Copy link
Author

balintpto commented Sep 16, 2024

Thanks for all the help, well this is interesting:

import ibis as ib t = ib.memtable({"x": [1, 2, 3, 4], "y": ["a", "b", "c", "d"]})ib.to_sql(t.sample(0.5, seed=12))

SELECT * FROM "ibis_pandas_memtable_pau72qgn2zcztfvqrli6xtxvq4" AS "t0"

With a longer dataset: image

Also with only duckdb it seems to be okay: image

@balintpto
Copy link
Author

I uninstalled and reinstalled sqlglot but the sampling is still not showing up after the .to_sql method

@balintpto
Copy link
Author

Okay, this is interesting. I made a fresh venv and sampling works fine. So maybe it's some library not updating? I don't think creating a new venv with every minor version update is a good solution though.

@gforsyth
Copy link
Member

Thanks, @balintpto ! We try to keep our version bounds wide to avoid making people upgrade unnecessarily, but we can't always control which versions get updated or which don't. Glad you got it sorted out.

@github-project-automation github-project-automation bot moved this from backlog to done in Ibis planning and roadmap Sep 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Incorrect behavior inside of ibis
Projects
Archived in project
Development

No branches or pull requests

4 participants