Implement HistogramTransformation #338

Open
osopardo1 opened this issue Jun 25, 2024 · 0 comments
Labels
enhancement New feature or request


After analyzing the efficiency of distribution functions for indexing (see issue #336), we can start implementing the HistogramTransformation.

The idea is to build it as another type of transformation, and eventually turn it into the default.

The API can be something like:

df.write.format("qbeast").option("columnsToIndex", "id:histogram").save("/tmp/test-histogram")

And under the hood:

case class HistogramTransformation(hist: IndexedSeq[String]) extends Transformation {

  override def transform(value: Any): Double = ???

  /**
   * This method should determine if the new data will cause the creation of a new revision.
   *
   * @param newTransformation
   *   the new transformation created with statistics over the new data
   * @return
   *   true if the domain of the newTransformation is not fully contained in this one.
   */
  override def isSupersededBy(newTransformation: Transformation): Boolean = ???

  /**
   * Merges two transformations. The domain of the resulting transformation is the union of this
   * transformation's domain and the other's.
   *
   * @param other
   *   the transformation to merge with this one
   * @return
   *   a new Transformation that contains both this and other.
   */
  override def merge(other: Transformation): Transformation = ???
}

object HistogramTransformation {
  def apply(hist: IndexedSeq[String]): HistogramTransformation = new HistogramTransformation(hist)
}
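
To make the three `???` bodies more concrete, here is a minimal standalone sketch of one possible behaviour. It is only an assumption about how the pieces could fit together, not the planned implementation: `HistogramSketch` is a hypothetical name, it does not extend the real `Transformation` trait, it takes `String` values instead of `Any`, and it assumes the bin boundaries are sorted and distinct. The "domain" is read here as the range between the first and the last boundary.

```scala
import scala.collection.Searching._

// Hypothetical stand-in for HistogramTransformation, kept free of Qbeast types.
// `bins` plays the role of `hist` and is assumed to be sorted and distinct.
final case class HistogramSketch(bins: IndexedSeq[String]) {
  require(bins.nonEmpty, "the histogram needs at least one bin boundary")

  /** Maps a value to [0, 1] according to its position among the bin boundaries. */
  def transform(value: String): Double = {
    if (bins.length == 1) 0.0
    else {
      // Binary search: index of the boundary, or of the slot where it would go.
      val position = bins.search(value) match {
        case Found(i) => i
        case InsertionPoint(i) => i
      }
      // Values above the last boundary are clamped to 1.0.
      math.min(position, bins.length - 1).toDouble / (bins.length - 1)
    }
  }

  /** True if the range covered by `other` is not fully contained in this one. */
  def isSupersededBy(other: HistogramSketch): Boolean =
    other.bins.head < bins.head || other.bins.last > bins.last

  /** Union of both sets of boundaries, kept sorted and distinct. */
  def merge(other: HistogramSketch): HistogramSketch =
    HistogramSketch((bins ++ other.bins).distinct.sorted)
}
```

Note that merging boundaries this way does not keep the bins equally populated, so a real implementation would probably need to recompute the histogram from the data instead.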

We would take advantage of the first step of OTreeDataAnalyzer, getDataFrameStats, to compute an approximate histogram or quartiles of the specified columns:

  /**
   * Analyze a specific group of columns of the dataframe and extract valuable statistics
   * @param data
   *   the data to analyze
   * @param columnTransformers
   *   the columns to analyze
   * @return
   *   a Row with the statistics of each column, followed by the total row count
   */
  private[index] def getDataFrameStats(
      data: DataFrame,
      columnTransformers: IISeq[Transformer]): Row = {
    val columnStats = columnTransformers.map(_.stats)
    val columnsExpr = columnStats.flatMap(_.statsSqlPredicates)
    data.selectExpr(columnsExpr ++ Seq("count(1) AS count"): _*).first()
  }
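
Since `getDataFrameStats` already evaluates a list of SQL expressions in a single pass, one option would be for the histogram stats to contribute an expression based on Spark SQL's `approx_percentile`. The sketch below only builds that expression string. The helper name `approxHistogramExpr`, the `_histogram` alias suffix and the `numberOfBins` parameter are assumptions made for illustration, and `approx_percentile` works on numeric, date and timestamp columns, so string columns would need a different approach.

```scala
// Hypothetical helper: builds a Spark SQL expression that requests
// numberOfBins + 1 approximate quantiles (bin boundaries) of a column.
def approxHistogramExpr(column: String, numberOfBins: Int): String = {
  require(numberOfBins > 0, "numberOfBins must be positive")
  // Evenly spaced percentages from 0.0 to 1.0, both ends included.
  val percentages = (0 to numberOfBins).map(i => i.toDouble / numberOfBins)
  s"approx_percentile($column, array(${percentages.mkString(", ")})) AS ${column}_histogram"
}

// Possible use inside a stats query (requires a Spark DataFrame named `data`):
//   data.selectExpr(approxHistogramExpr("id", 10), "count(1) AS count").first()
```

The resulting array column could then be read from the `Row` and passed to the transformation as its bin boundaries.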