Dataframe serialization loses schema information #94

Open
quartox opened this issue Jan 12, 2021 · 3 comments

quartox commented Jan 12, 2021

We have found some edge cases when using read to return a dataframe, related to the toJSON transformation followed by json.loads.

Specifically, if all values of a column are null, the column is dropped from the pandas dataframe. We also lose type information when coercing all columns to JSON: for example, timestamps that would previously have been converted to pandas datetimes are now returned as strings.
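As a minimal sketch of what we are seeing (the column names here are made up for illustration; Spark's toJSON omits null fields, so an all-null column never reaches pandas, and timestamps come back as plain strings):

import json

import pandas as pd

# Rows as returned by df.toJSON().collect() on the server: null fields are
# omitted from the JSON, and timestamp values arrive as strings.
lines = [
    '{"id": 1, "event_time": "2021-01-12 10:00:00"}',
    '{"id": 2, "event_time": "2021-01-12 11:00:00"}',
]

df = pd.DataFrame([json.loads(line) for line in lines])
print(df.dtypes)
# id             int64
# event_time    object   <- a string column, not datetime64; an all-null
#                           column would be missing entirely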

Is there an alternative approach we could explore for serializing the dataframe, such as using Apache Arrow as the toPandas function does in pyspark?

acroz commented Jan 27, 2021

Hi Jesse, sorry for the delay in getting back to you!

I've thought about implementing an Arrow-based approach for doing this, but never quite got around to it. I think it would make sense to implement this as "use Arrow if it's installed, otherwise use JSON", and then set up extras_require to declare pyarrow as an optional dependency.

What do you think? The hardest part is probably figuring out the snippet we need to run on the server side to spit out the Arrow-serialised dataframe as a byte string and print it in the output of the Spark session. We may need to do something like base64-encode the byte string to avoid issues with Livy returning it inside the JSON response from the API.
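Something along these lines is what I have in mind, though it's just a sketch and not tested; it assumes pyarrow (and pandas) are importable inside the Spark session, and df is the Spark dataframe being read:

# Server side: convert to Arrow, serialise with the Arrow IPC stream format,
# and base64-encode so the bytes survive being returned as text by Livy.
import base64
import io

import pyarrow as pa

table = pa.Table.from_pandas(df.toPandas())

sink = io.BytesIO()
with pa.ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)

print(base64.b64encode(sink.getvalue()).decode("ascii"))

On the client side we would then decode the printed string and rebuild the dataframe, e.g. pa.ipc.open_stream(base64.b64decode(text)).read_all().to_pandas(), which should preserve the schema, including all-null and timestamp columns.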

quartox commented Jan 27, 2021

Yeah, that sounds great. I might have some time to help with this as well. From looking at the Livy API I was also uncertain how to encode the bytes, but I will try to test this if I get some free time.

ryanpetm commented Jan 4, 2022

Hi guys, I have hit a similar issue recently where duplicated columns in the Spark DF do not survive serialization into a pandas DF.
I think we could get cleaner behaviour by leveraging the object_pairs_hook argument of json.loads here:

rows.append(json.loads(line))

This should at least allow us to recover duplicated columns in the pandas DF:

line
'{"firstname":"James","middlename":"","lastname":"Smith","id":"36636","gender":"M","salary":3000,"middlename":"","lastname":"Smith","id":"36636","gender":"M","salary":3000}'
def dict_rename_on_duplicates(ordered_pairs):
    """Rename duplicate keys by appending '*' until they are unique."""
    d = {}
    for k, v in ordered_pairs:
        while k in d:
            k = '{}*'.format(k)
        d[k] = v
    return d
x = json.loads(line, object_pairs_hook=dict_rename_on_duplicates)
x
{'firstname': 'James', 'middlename': '', 'lastname': 'Smith', 'id': '36636', 'gender': 'M', 'salary': 3000, 'middlename*': '', 'lastname*': 'Smith', 'id*': '36636', 'gender*': 'M', 'salary*': 3000}
