Skip to content

Feature: Support Sample #16533

Open
@chenkovsky

Description

@chenkovsky

Is your feature request related to a problem or challenge?

The TABLESAMPLE statement is used to sample the table.

Different DBs have different sample implementations.

Spark:

https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-sampling.html

example with replacement, poisson sample.
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.sample.html

it implements

  1. Sample the table down to the given number of rows.
  2. Sample the table down to the given percentage.
    a. poisson sample, (only in dataframe api)
    b. bernoulli sample.

Spark introduced a Sample logical plan, and many other dataframe apis are also based on this logical plan. e.g.
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.randomSplit.html

Hive:
https://cwiki.apache.org/confluence/display/hive/languagemanual+sampling

  1. Sample the table down to the given number of rows.
  2. Sample the table down to the given percentage. (bernoulli)
  3. Sample on column, it's useful for clustered table.

Clickhouse:

https://clickhouse.com/docs/sql-reference/statements/select/sample

  1. Sample the table down to the given number of rows.
  2. Sample the table down to the given percentage.
  3. Sample with offset.

Postgres:

https://wiki.postgresql.org/wiki/TABLESAMPLE_Implementation

  1. BERNOULLI sample.
  2. SYSTEM

Describe the solution you'd like

Add a Sample logical plan.

Describe alternatives you've considered

I have considered resusing current logical plan, e.g. Filter. But it seems that it's hard to implement poisson sample with current logical plan.
in spark, real_seed = input_seed + partition_id. then different partitions have different sample results. it makes sense to me. it's also hard to implement with current logical plan.

Additional context

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions