Description
Is your feature request related to a problem or challenge?
The TABLESAMPLE statement is used to sample the table.
Different DBs have different sample implementations.
Spark:
https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-sampling.html
example with replacement, poisson sample.
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.sample.html
it implements
- Sample the table down to the given number of rows.
- Sample the table down to the given percentage.
a. poisson sample, (only in dataframe api)
b. bernoulli sample.
Spark introduced a Sample logical plan, and many other dataframe apis are also based on this logical plan. e.g.
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.randomSplit.html
Hive:
https://cwiki.apache.org/confluence/display/hive/languagemanual+sampling
- Sample the table down to the given number of rows.
- Sample the table down to the given percentage. (bernoulli)
- Sample on column, it's useful for clustered table.
Clickhouse:
https://clickhouse.com/docs/sql-reference/statements/select/sample
- Sample the table down to the given number of rows.
- Sample the table down to the given percentage.
- Sample with offset.
Postgres:
https://wiki.postgresql.org/wiki/TABLESAMPLE_Implementation
- BERNOULLI sample.
- SYSTEM
Describe the solution you'd like
Add a Sample logical plan.
Describe alternatives you've considered
I have considered resusing current logical plan, e.g. Filter. But it seems that it's hard to implement poisson sample with current logical plan.
in spark, real_seed = input_seed + partition_id. then different partitions have different sample results. it makes sense to me. it's also hard to implement with current logical plan.
Additional context
No response