Dataframe serialization loses schema information #94
Hi Jesse, sorry for the delay in getting back to you! I've thought about implementing an arrow-based approach for doing this, but never quite got around to it. I think it would make sense to implement this as "use arrow if it's installed, otherwise use JSON", and then set up […]. What do you think? The hardest part is probably figuring out the snippet we need to run on the server side to spit out the arrow-serialised dataframe as a byte string and print it in the output of the Spark session. We may need to do something like base64-encode the byte string to avoid issues with Livy returning it inside the JSON response from the API.
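A rough sketch of what the two halves might look like, assuming pyarrow is available on both the Livy driver and the client; the helper names here are illustrative, not part of pylivy:

```python
# Server side (runs inside the Livy Spark session): serialise the Spark
# DataFrame `df` with the Arrow IPC stream format and base64-encode the
# bytes so they survive inside Livy's JSON statement output.
import base64

import pyarrow as pa

def _serialise_dataframe(df):
    table = pa.Table.from_pandas(df.toPandas())
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, table.schema) as writer:
        writer.write_table(table)
    return base64.b64encode(sink.getvalue().to_pybytes()).decode("ascii")

print(_serialise_dataframe(df))

# Client side: decode the printed statement output back into a pandas
# DataFrame, with the schema carried over by the Arrow stream.
def _deserialise_dataframe(output):
    reader = pa.ipc.open_stream(base64.b64decode(output))
    return reader.read_all().to_pandas()
```

Base64 keeps the Arrow stream safe inside the JSON response at the cost of roughly 33% size overhead, which seems like a reasonable trade-off for correctness.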
Yeah, that sounds great. I might have some time to help with this as well. From looking at the Livy API I was also uncertain how to encode the bytes, but I will try to test this if I get some free time.
Hi guys, I have hit a similar issue recently where duplicated columns in the Spark DF do not survive the serialization process into a pandas DF. See line 51 in 6c7bf18: this should at least allow us to recover duplicated columns in the pandas DF.
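As a hedged guess at the shape of that change (an illustrative sketch, not the actual code at that line): if the server also ships the Spark schema's column list, e.g. `df.columns`, the client can rebuild the pandas frame against it, so duplicated names survive even though `json.loads` keeps only one value per repeated key:

```python
import json

import pandas

def deserialise_dataframe(text, columns=None):
    """Rebuild a pandas DataFrame from newline-delimited JSON rows.

    `columns` is assumed to be the Spark DataFrame's column list, sent
    alongside the data; it may contain duplicate names.
    """
    records = [json.loads(line) for line in text.split("\n") if line]
    df = pandas.DataFrame.from_records(records)
    if columns is not None:
        # reindex keeps duplicate labels and reinstates all-null columns
        # (as NaN), which toJSON drops from every row.
        df = df.reindex(columns=list(columns))
    return df
```

Note that the duplicated columns will share the last value seen for that key in each JSON record; the original per-column data cannot be recovered from JSON alone.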
We have found some edge cases when using `read` to return a dataframe, related to the transformation `toJSON` followed by `json.loads`. Specifically, if all values of a column are `null` then the column is dropped from the pandas dataframe. In addition, we lose type information when coercing all columns to JSON; for example, timestamps would be converted to pandas datetimes but are now returned as strings. Is there an alternative approach we could explore for serializing the dataframe, such as using Apache Arrow as the `toPandas` function does in PySpark?