Dataframe serialization loses schema information #94

Open
quartox opened this issue Jan 12, 2021 · 3 comments

quartox commented Jan 12, 2021

We have found some edge cases when using read to return a dataframe, related to the toJSON transformation followed by json.loads.

Specifically, if all values of a column are null, the column is dropped from the pandas dataframe. We also lose type information when coercing all columns to JSON: for example, timestamps that would previously have been converted to pandas datetimes are now returned as strings.
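As a minimal sketch of what we are seeing (the column names here are made up for illustration; Spark's toJSON omits null fields, so an all-null column never reaches pandas, and timestamps come back as plain strings):

import json

import pandas as pd

# Rows as returned by df.toJSON().collect() on the server: null fields are
# omitted from the JSON, and timestamp values arrive as strings.
lines = [
    '{"id": 1, "event_time": "2021-01-12 10:00:00"}',
    '{"id": 2, "event_time": "2021-01-12 11:00:00"}',
]

df = pd.DataFrame([json.loads(line) for line in lines])
print(df.dtypes)
# id             int64
# event_time    object   <- a string column, not datetime64; an all-null
#                           column would be missing entirely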

Is there an alternative approach we could explore for serializing the dataframe, such as using Apache Arrow as the toPandas function does in pyspark?

acroz commented Jan 27, 2021

Hi Jesse, sorry for the delay in getting back to you!

I've thought about implementing an Arrow-based approach for doing this, but never quite got around to it. I think it would make sense to implement this as "use Arrow if it's installed, otherwise use JSON", and then set up extras_require to declare pyarrow as an optional dependency.

What do you think? The hardest part is probably figuring out the snippet we need to run on the server side to spit out the Arrow-serialised dataframe as a byte string and print it in the output of the Spark session. We may need to do something like base64-encode the byte string to avoid issues with Livy returning it inside the JSON response from the API.
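Something along these lines is what I have in mind, though it's just a sketch and not tested; it assumes pyarrow (and pandas) are importable inside the Spark session, and df is the Spark dataframe being read:

# Server side: convert to Arrow, serialise with the Arrow IPC stream format,
# and base64-encode so the bytes survive being returned as text by Livy.
import base64
import io

import pyarrow as pa

table = pa.Table.from_pandas(df.toPandas())

sink = io.BytesIO()
with pa.ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)

print(base64.b64encode(sink.getvalue()).decode("ascii"))

On the client side we would then decode the printed string and rebuild the dataframe, e.g. pa.ipc.open_stream(base64.b64decode(text)).read_all().to_pandas(), which should preserve the schema, including all-null and timestamp columns.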

quartox commented Jan 27, 2021

Yeah, that sounds great. I might have some time to help with this as well. From looking at the Livy API I was also uncertain how to encode the bytes, but I will try to test this if I get some free time.

ryanpetm commented Jan 4, 2022

Hi guys, I have hit a similar issue recently where duplicated columns in the Spark DF do not survive serialization into a pandas DF.
I think we could get cleaner behaviour by leveraging the object_pairs_hook argument of json.loads here:

rows.append(json.loads(line))

This should at least allow us to recover duplicated columns in the pandas DF:

line
'{"firstname":"James","middlename":"","lastname":"Smith","id":"36636","gender":"M","salary":3000,"middlename":"","lastname":"Smith","id":"36636","gender":"M","salary":3000}'
def dict_rename_on_duplicates(ordered_pairs):
    """Rename duplicate keys by appending '*' until they are unique."""
    d = {}
    for k, v in ordered_pairs:
        while k in d:
            k = '{}*'.format(k)
        d[k] = v
    return d
x = json.loads(line, object_pairs_hook=dict_rename_on_duplicates)
x
{'firstname': 'James', 'middlename': '', 'lastname': 'Smith', 'id': '36636', 'gender': 'M', 'salary': 3000, 'middlename*': '', 'lastname*': 'Smith', 'id*': '36636', 'gender*': 'M', 'salary*': 3000}
