Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs: add raw_sql docstring to explain usage tradeoffs #9985

Closed
1 task done
djouallah opened this issue Sep 1, 2024 · 6 comments
Closed
1 task done

docs: add raw_sql docstring to explain usage tradeoffs #9985

djouallah opened this issue Sep 1, 2024 · 6 comments
Labels
docs Documentation related issues or PRs

Comments

@djouallah
Copy link

What happened?

when testing con.raw_sql(' select 42 ;'), it seems to be backend specific

for mssql/duckdb, you need to use con.raw_sql(' select 42 ;').fetchall() which is not compatible with other backend like datafusion/pyspark

What version of ibis are you using?

9.3

What backend(s) are you using, if any?

mssql,datafusion,pyspark

ideally con.raw_sql should return a standard object regardless of the backend

Relevant log output

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@djouallah djouallah added the bug Incorrect behavior inside of ibis label Sep 1, 2024
@djouallah
Copy link
Author

testing further

for engine in ['dwh','datafusion','pyspark','duckdb']:
 con = get_ibis_connection(workspace ='pythononly',lakehouse ='NY',engine = engine)
 print(engine)
 print(type(con.raw_sql(' select 42 ;')))

I am getting

dwh
<class 'pyodbc.Cursor'>
datafusion
<class 'datafusion.DataFrame'>
pyspark
<class 'pyspark.sql.dataframe.DataFrame'>
duckdb
<class 'duckdb.duckdb.DuckDBPyConnection'>

maybe it will be nice if all the backend return a pyarrow dataset or something like that ?

@gforsyth
Copy link
Member

gforsyth commented Sep 1, 2024

Hi @djouallah -- we should have a docstring for raw_sql, so I'll leave this issue open to track that work.

Standardizing the output of raw_sql would require wrapping the backend's dbapi cursor, and the point of raw_sql is to expose that cursor directly, without interference. There is no cross-backend guarantee for what comes out of raw_sql (nor is there any guarantee that SQL you pass to raw_sql is in the correct dialect to execute on the backend in question).

It's a trade-off, but for raw_sql to remain an effective escape hatch, we don't mess with the output from it.

@gforsyth gforsyth added docs Documentation related issues or PRs and removed bug Incorrect behavior inside of ibis labels Sep 1, 2024
@gforsyth gforsyth changed the title con.raw_sql(' select 42 ;') is not a standard docs: add raw_sql docstring to explain usage tradeoffs Sep 1, 2024
@jcrist
Copy link
Member

jcrist commented Sep 1, 2024

maybe it will be nice if all the backend return a pyarrow dataset or something like that

You might be looking for con.sql instead? This method takes in a SQL string and returns an ibis.Table, which can be executed to return a pyarrow table with to_pyarrow. https://ibis-project.org/how-to/extending/sql#backend.sql

@djouallah
Copy link
Author

djouallah commented Sep 1, 2024

@jcrist I endup creating my own function, it will take a sql query using duckdb dialect and transpile it using sqlglot,

def sql_to_dataframe(query,engine):
    con = get_ibis_connection(workspace =workspace,lakehouse = lakehouse,engine = engine)
    if engine =='mssql':
       engine = 'tsql'
    standard_sql = sqlglot.transpile(query, read="duckdb", write=engine)[0]
    if engine =='tsql':
        import pyarrow as pa
        cursor = con.raw_sql(standard_sql)
        columns = [column[0] for column in cursor.description]
        data = cursor.fetchall()
        columnar_data = list(zip(*data))
        arrow_table = pa.Table.from_arrays([pa.array(col) for col in columnar_data], columns)
        return arrow_table
    elif engine =='duckdb':
       return  con.raw_sql(standard_sql).arrow()
    elif engine =='datafusion':
       return  con.raw_sql(standard_sql).to_arrow_table()
    elif engine =='pyspark':
       return  con.raw_sql(standard_sql).toPandas()

@jcrist
Copy link
Member

jcrist commented Sep 3, 2024

If that works for you, that's fine. Note that that's functionally equivalent to:

# Execute some duckdb sql on any ibis sql backend, returning a pyarrow table
con.sql(query, dialect="duckdb").to_pyarrow()

@djouallah
Copy link
Author

@jcrist this is brilliant !!!

@github-project-automation github-project-automation bot moved this from backlog to done in Ibis planning and roadmap Sep 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
docs Documentation related issues or PRs
Projects
Archived in project
Development

No branches or pull requests

3 participants