We want the list method to be able to accept limit and offset parameters. However, there is no OFFSET keyword in Spark SQL, so we have tried to find some workarounds. Here is a summary of the different potential solutions we have explored.
First approach
Our first idea was to query limit + offset rows and then slice the result array in Java.
While this solution does work, it is highly inefficient for large offsets, as we fetch numerous useless rows, which wastes both memory and time.
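For reference, a minimal sketch of this approach using Spark's Java API (the listWithOffset helper and its parameters are our own naming, not the actual project code):

```java
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Fetch offset + limit rows from Spark, then drop the first `offset` rows on the driver.
static List<Row> listWithOffset(Dataset<Row> df, int limit, int offset) {
    List<Row> rows = df.limit(offset + limit).collectAsList();
    return rows.subList(Math.min(offset, rows.size()), rows.size());
}
```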
monotonically_increasing_id
Our second idea was to add a temporary auto-generated id column using the Spark function monotonically_increasing_id.
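In code, the idea was roughly the following (a minimal sketch with our own method name, not the actual implementation):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.monotonically_increasing_id;

// Tag each row with an auto-generated id and keep only the [offset, offset + limit) slice.
// As explained below, this is only valid when the generated ids happen to be consecutive.
static Dataset<Row> listWithOffset(Dataset<Row> df, int limit, int offset) {
    return df.withColumn("_id", monotonically_increasing_id())
             .where(col("_id").geq(offset).and(col("_id").lt(offset + limit)))
             .drop("_id");
}
```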
At first, this solution seemed to work properly and performed far better than our first approach. However, we realized that the code did not always work when running tests on Databricks with large datasets. In fact, as soon as Spark uses more than one partition for the data, it stops working. After checking the documentation, we realized that the generated IDs are not consecutive: they are consecutive within a partition, but there is an offset of 2^33 (= 8589934592) between partitions.
nth_value
Next we tried to use the nth_value function (https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.nth_value.html). However, it is not suited to our need, since it is quite tricky to handle the case where there are not enough rows and nth_value returns null.
User Defined Function
Our last solution was to define a User Defined Function (UDF), with the help of this StackOverflow post. The idea is to add an _id column generated from a monotonically_increasing_id column and a partition_id column.
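We do not reproduce the post's exact code here; a minimal sketch of the idea with Spark's Java API might look like this (the withConsecutiveId helper and all names are our own):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF2;
import org.apache.spark.sql.types.DataTypes;
import static org.apache.spark.sql.functions.*;

static Dataset<Row> withConsecutiveId(SparkSession spark, Dataset<Row> df) {
    // Tag every row with Spark's non-consecutive id and its partition id.
    Dataset<Row> tagged = df
        .withColumn("mon_id", monotonically_increasing_id())
        .withColumn("part_id", spark_partition_id());

    // Collect, per partition, the row count and the smallest mon_id.
    List<Row> stats = tagged.groupBy("part_id")
        .agg(count(lit(1)).alias("cnt"), min("mon_id").alias("min_id"))
        .orderBy("part_id")
        .collectAsList();

    // firstId: first consecutive id of each partition (cumulative row counts).
    // minMonId: smallest mon_id of each partition.
    Map<Integer, Long> firstId = new HashMap<>();
    Map<Integer, Long> minMonId = new HashMap<>();
    long running = 0L;
    for (Row r : stats) {
        int pid = r.getInt(0);
        firstId.put(pid, running);
        minMonId.put(pid, r.getLong(2));
        running += r.getLong(1);
    }

    // The UDF: local row number within the partition + number of rows in all earlier partitions.
    spark.udf().register("consecutive_id",
        (UDF2<Integer, Long, Long>) (pid, monId) ->
            firstId.get(pid) + (monId - minMonId.get(pid)),
        DataTypes.LongType);

    return tagged
        .withColumn("_id", callUDF("consecutive_id", col("part_id"), col("mon_id")))
        .drop("mon_id", "part_id");
}
```

Once the _id column is consecutive, computing the offset becomes a simple filter on _id, as in the previous sketch.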
Test
We checked the method with a notebook on Databricks.
There are 15160640 entries spread across several partitions, so the monotonically_increasing_id (the mon_id column here) goes up to 68719915055.
The _id column is properly generated, with IDs going from 0 to 15160639, and they are all distinct.
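The checks boil down to a simple aggregation (assuming indexed is the DataFrame returned by the sketch above):

```java
import static org.apache.spark.sql.functions.countDistinct;
import static org.apache.spark.sql.functions.max;
import static org.apache.spark.sql.functions.min;

// Expected on our dataset: min = 0, max = 15160639, distinct count = 15160640.
indexed.agg(min("_id"), max("_id"), countDistinct("_id")).show();
```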
This means we can use this solution to properly compute the offset!
NB: We could have written a much simpler function that would just subtract partition_id * 2^33 from the monotonically_increasing_id, but the documentation says that the 2^33 offset is only the current implementation, implying it could change in the future. That is why we thought it would be safer to use this more complex but more generic function.
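For illustration, that simpler variant would replace the min(mon_id) lookup in the UDF above with the documented bit layout (again, a sketch under the assumption that the layout holds):

```java
// Relies on the implementation detail that each partition's ids start at
// partition_id * 2^33 -- simpler, but could break if Spark changes the layout.
(UDF2<Integer, Long, Long>) (pid, monId) ->
    firstId.get(pid) + (monId - ((long) pid << 33))
```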