feat: Adds .summarize() to compute statistics #3810

Merged: 2 commits, Feb 16, 2025

rchowell
Contributor

Adds .summarize(), which computes column statistics as a new dataframe. It works by aggregating each statistic into a list expression (one entry per input column), then exploding the lists so that each input column becomes one row of the summary.

Example

>>> import daft

>>> df = daft.from_pydict(
...     {
...         "A": [1, 2, 3, 4, 5],
...         "B": [1.5, 2.5, 3.5, 4.5, 5.5],
...         "C": [True, True, False, False, None],
...         "D": [None, None, None, None, None],
...     }
... )

>>> df.summarize().show()
╭────────┬─────────┬───────┬────────────┬────────┬─────────────┬───────────────────────╮
│ column ┆ type    ┆ min   ┆      …     ┆ countcount_nullsapprox_count_distinct │
│ ---------   ┆            ┆ ---------                   │
│ Utf8Utf8Utf8  ┆ (1 hidden) ┆ UInt64UInt64UInt64                │
╞════════╪═════════╪═══════╪════════════╪════════╪═════════════╪═══════════════════════╡
│ AInt641     ┆ …          ┆ 505                     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ BFloat641.5   ┆ …          ┆ 505                     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ CBooleanfalse ┆ …          ┆ 412                     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ DNullNone  ┆ …          ┆ 050                     │
╰────────┴─────────┴───────┴────────────┴────────┴─────────────┴───────────────────────╯

Explain

>>> df.summarize().explain()

== Unoptimized Logical Plan ==

* Explode: col(column), col(type), col(min), col(max), col(count), col(count_nulls), col(approx_count_distinct)
|   Schema = column#Utf8, type#Utf8, min#Utf8, max#Utf8, count#UInt64, count_nulls#UInt64, approx_count_distinct#UInt64
|
* Aggregation: list(lit("A"), lit("B"), lit("C"), lit("D")) as column, list(lit("Int64"), lit("Float64"),
|     lit("Boolean"), lit("Null")) as type, list(cast(min(col(A)) as Utf8), cast(min(col(B)) as Utf8), cast(min(col(C))
|     as Utf8), cast(min(col(D)) as Utf8)) as min, list(cast(max(col(A)) as Utf8), cast(max(col(B)) as Utf8),
|     cast(max(col(C)) as Utf8), cast(max(col(D)) as Utf8)) as max, list(count(col(A), Valid), count(col(B), Valid),
|     count(col(C), Valid), count(col(D), Valid)) as count, list(count(col(A), Null), count(col(B), Null), count(col(C),
|     Null), count(col(D), Null)) as count_nulls, list(approx_count_distinct(col(A)), approx_count_distinct(col(B)),
|     approx_count_distinct(col(C)), approx_count_distinct(col(D))) as approx_count_distinct
|   Output schema = column#List(Utf8), type#List(Utf8), min#List(Utf8), max#List(Utf8), count#List(UInt64),
|     count_nulls#List(UInt64), approx_count_distinct#List(UInt64)
|
* Source:
|   Number of partitions = 1
|   Output schema = A#Int64, B#Float64, C#Boolean, D#Null
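
For readers who want to see the aggregate-then-explode shape without running Daft, here is a minimal plain-Python sketch of the same idea. It is a toy model, not the PR's implementation: `summarize_rows` is a hypothetical name, it omits the `type` and `approx_count_distinct` columns, and it stringifies values with Python's `str` (so a boolean minimum prints as `False` rather than Daft's `false`).

```python
from typing import Any


def summarize_rows(data: dict[str, list[Any]]) -> list[dict[str, Any]]:
    """Toy model of aggregate-into-lists, then explode (plain Python, not Daft)."""
    # Step 1 ("aggregation"): one list per statistic, one entry per input column.
    agg: dict[str, list[Any]] = {
        "column": [], "min": [], "max": [], "count": [], "count_nulls": [],
    }
    for name, values in data.items():
        valid = [v for v in values if v is not None]
        agg["column"].append(name)
        # min/max are cast to strings so columns of different types share one list.
        agg["min"].append(str(min(valid)) if valid else None)
        agg["max"].append(str(max(valid)) if valid else None)
        agg["count"].append(len(valid))
        agg["count_nulls"].append(len(values) - len(valid))
    # Step 2 ("explode"): zip the equal-length lists so each input column
    # becomes one output row.
    keys = list(agg)
    return [dict(zip(keys, row)) for row in zip(*agg.values())]


rows = summarize_rows(
    {
        "A": [1, 2, 3, 4, 5],
        "C": [True, True, False, False, None],
        "D": [None, None, None, None, None],
    }
)
for row in rows:
    print(row)
```

The string cast on min and max mirrors the `cast(... as Utf8)` calls in the plan above: a single list column can only hold one type, so mixed-type statistics have to be normalized before they are collected.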

@rchowell
Contributor Author

This PR replaces #3711

codspeed-hq bot commented Feb 14, 2025

CodSpeed Performance Report

Merging #3810 will degrade performances by 34.53%

Comparing rchowell/summarize (bb569d4) with main (fba938d)

Summary

❌ 2 regressions
✅ 25 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

| Benchmark | BASE | HEAD | Change |
|---|---|---|---|
| test_count[1 Small File] | 3.2 ms | 3.7 ms | -13.7% |
| test_iter_rows_first_row[100 Small Files] | 204.2 ms | 311.9 ms | -34.53% |
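
A note on reading the Change column in the breakdown above: the reported figures appear consistent with `base / head - 1`, rather than the more common `(head - base) / base`. That formula is an inference from the numbers in this report, not something stated in the thread, and `relative_change` is a hypothetical helper used only to check the arithmetic.

```python
def relative_change(base_ms: float, head_ms: float) -> float:
    """Percent change as base/head - 1 (assumed formula, inferred from the report)."""
    return (base_ms / head_ms - 1.0) * 100.0


# test_iter_rows_first_row[100 Small Files]: matches the reported -34.53%.
print(round(relative_change(204.2, 311.9), 2))  # -34.53

# test_count[1 Small File]: the rounded timings give -13.5, close to the
# reported -13.7%; the gap is presumably due to the displayed ms values
# being rounded to one decimal place.
print(round(relative_change(3.2, 3.7), 1))  # -13.5
```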

@rchowell
Contributor Author

I posted this in the other thread too, but here's a pure-Python helper that can be used today.

import daft

def summarize(df: daft.DataFrame) -> daft.DataFrame:
    # One list per statistic; each list gets one entry per input column.
    cols = []  # column                :: utf8
    typs = []  # type                  :: utf8
    mins = []  # min                   :: utf8
    maxs = []  # max                   :: utf8
    cnts = []  # count                 :: uint64
    nuls = []  # count_nulls           :: uint64
    unqs = []  # approx_count_distinct :: uint64
    for field in df.schema():
        col = daft.col(field.name)
        cols.append(daft.lit(field.name))
        typs.append(daft.lit(str(field.dtype)))
        mins.append(col.min().cast(daft.DataType.string()))
        maxs.append(col.max().cast(daft.DataType.string()))
        cnts.append(col.count("valid"))
        nuls.append(col.count("null"))
        unqs.append(col.approx_count_distinct())
    df = df.agg(
        [
            daft.list_(*cols).alias("column"),
            daft.list_(*typs).alias("type"),
            daft.list_(*mins).alias("min"),
            daft.list_(*maxs).alias("max"),
            daft.list_(*cnts).alias("count"),
            daft.list_(*nuls).alias("count_nulls"),
            daft.list_(*unqs).alias("approx_count_distinct"),
        ]
    )
    return df.explode(*df.columns)

codecov bot commented Feb 14, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 77.89%. Comparing base (f9a4b70) to head (bb569d4).
Report is 6 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3810      +/-   ##
==========================================
+ Coverage   75.60%   77.89%   +2.29%     
==========================================
  Files         748      751       +3     
  Lines       99035    94852    -4183     
==========================================
- Hits        74875    73886     -989     
+ Misses      24160    20966    -3194     

| Files with missing lines | Coverage Δ |
|---|---|
| daft/dataframe/dataframe.py | 85.30% <100.00%> (+0.05%) ⬆️ |
| daft/logical/builder.py | 90.81% <100.00%> (+0.15%) ⬆️ |
| src/daft-logical-plan/src/builder/mod.rs | 87.81% <100.00%> (+0.17%) ⬆️ |
| src/daft-logical-plan/src/ops/summarize.rs | 100.00% <100.00%> (ø) |

... and 43 files with indirect coverage changes

}

/// Creates a list constructor for the given items.
fn list_(items: Vec<ExprRef>, alias: &str) -> ExprRef {
Contributor

list isn't a keyword in Rust, so we can just call it list.

Suggested change
- fn list_(items: Vec<ExprRef>, alias: &str) -> ExprRef {
+ fn list(items: Vec<ExprRef>, alias: &str) -> ExprRef {

@rchowell rchowell merged commit b34c2bf into main Feb 16, 2025
43 of 44 checks passed
@rchowell rchowell deleted the rchowell/summarize branch February 16, 2025 20:44
2 participants