feat: Adds .summarize() to compute statistics #3810

rchowell · 2025-02-14T19:45:26Z

Adds .summarize(), which computes column statistics as a new dataframe. It works by aggregating each statistic into a list expression with one element per input column, then exploding those lists so that each input column becomes one row of the result.

Example

>>> import daft
>>>
>>> df = daft.from_pydict(
...     {
...         "A": [1, 2, 3, 4, 5],
...         "B": [1.5, 2.5, 3.5, 4.5, 5.5],
...         "C": [True, True, False, False, None],
...         "D": [None, None, None, None, None],
...     }
... )

>>> df.summarize().show()
╭────────┬─────────┬───────┬────────────┬────────┬─────────────┬───────────────────────╮
│ column ┆ type    ┆ min   ┆      …     ┆ count  ┆ count_nulls ┆ approx_count_distinct │
│ ---    ┆ ---     ┆ ---   ┆            ┆ ---    ┆ ---         ┆ ---                   │
│ Utf8   ┆ Utf8    ┆ Utf8  ┆ (1 hidden) ┆ UInt64 ┆ UInt64      ┆ UInt64                │
╞════════╪═════════╪═══════╪════════════╪════════╪═════════════╪═══════════════════════╡
│ A      ┆ Int64   ┆ 1     ┆ …          ┆ 5      ┆ 0           ┆ 5                     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ B      ┆ Float64 ┆ 1.5   ┆ …          ┆ 5      ┆ 0           ┆ 5                     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ C      ┆ Boolean ┆ false ┆ …          ┆ 4      ┆ 1           ┆ 2                     │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ D      ┆ Null    ┆ None  ┆ …          ┆ 0      ┆ 5           ┆ 0                     │
╰────────┴─────────┴───────┴────────────┴────────┴─────────────┴───────────────────────╯

Explain

>>> df.summarize().explain()

== Unoptimized Logical Plan ==

* Explode: col(column), col(type), col(min), col(max), col(count), col(count_nulls), col(approx_count_distinct)
|   Schema = column#Utf8, type#Utf8, min#Utf8, max#Utf8, count#UInt64, count_nulls#UInt64, approx_count_distinct#UInt64
|
* Aggregation: list(lit("A"), lit("B"), lit("C"), lit("D")) as column, list(lit("Int64"), lit("Float64"),
|     lit("Boolean"), lit("Null")) as type, list(cast(min(col(A)) as Utf8), cast(min(col(B)) as Utf8), cast(min(col(C))
|     as Utf8), cast(min(col(D)) as Utf8)) as min, list(cast(max(col(A)) as Utf8), cast(max(col(B)) as Utf8),
|     cast(max(col(C)) as Utf8), cast(max(col(D)) as Utf8)) as max, list(count(col(A), Valid), count(col(B), Valid),
|     count(col(C), Valid), count(col(D), Valid)) as count, list(count(col(A), Null), count(col(B), Null), count(col(C),
|     Null), count(col(D), Null)) as count_nulls, list(approx_count_distinct(col(A)), approx_count_distinct(col(B)),
|     approx_count_distinct(col(C)), approx_count_distinct(col(D))) as approx_count_distinct
|   Output schema = column#List(Utf8), type#List(Utf8), min#List(Utf8), max#List(Utf8), count#List(UInt64),
|     count_nulls#List(UInt64), approx_count_distinct#List(UInt64)
|
* Source:
|   Number of partitions = 1
|   Output schema = A#Int64, B#Float64, C#Boolean, D#Null
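
The two stages in the plan above (aggregate into lists, then explode) can be sketched in plain Python. This is only an illustration of the idea, not Daft's implementation: `summarize_rows` is a hypothetical name, it works on a dict of Python lists rather than a dataframe, and it covers only a subset of the statistics.

```python
from typing import Any


def summarize_rows(data: dict[str, list[Any]]) -> list[dict[str, Any]]:
    """Hypothetical plain-Python analogue of aggregate-then-explode."""
    # Stage 1 (aggregation): build one list per statistic,
    # with one element per input column.
    names, mins, maxs, counts, null_counts = [], [], [], [], []
    for name, values in data.items():
        valid = [v for v in values if v is not None]
        names.append(name)
        # Cast to str so every element of the min/max lists shares one type,
        # mirroring the cast to Utf8 in the plan.
        mins.append(str(min(valid)) if valid else None)
        maxs.append(str(max(valid)) if valid else None)
        counts.append(len(valid))
        null_counts.append(len(values) - len(valid))
    aggregated = {
        "column": names,
        "min": mins,
        "max": maxs,
        "count": counts,
        "count_nulls": null_counts,
    }
    # Stage 2 (explode): walk the parallel lists together so that
    # each input column becomes one output row.
    keys = list(aggregated)
    return [dict(zip(keys, row)) for row in zip(*aggregated.values())]


data = {
    "A": [1, 2, 3, 4, 5],
    "B": [1.5, 2.5, 3.5, 4.5, 5.5],
    "C": [True, True, False, False, None],
    "D": [None, None, None, None, None],
}
rows = summarize_rows(data)
for row in rows:
    print(row)
```

Note that Python renders the boolean minimum as "False", whereas the Daft output above shows "false".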


rchowell · 2025-02-14T19:47:29Z

This PR replaces #3711

codspeed-hq · 2025-02-14T19:56:35Z

CodSpeed Performance Report

Merging #3810 will degrade performances by 34.53%

Comparing rchowell/summarize (bb569d4) with main (fba938d)

Summary

❌ 2 regressions
✅ 25 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

| | Benchmark | `BASE` | `HEAD` | Change |
|---|---|---|---|---|
| ❌ | `test_count[1 Small File]` | 3.2 ms | 3.7 ms | -13.7% |
| ❌ | `test_iter_rows_first_row[100 Small Files]` | 204.2 ms | 311.9 ms | -34.53% |
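
For readers checking the numbers: the Change column looks consistent with `BASE / HEAD - 1`, i.e. the relative loss in speed rather than the relative increase in time. That formula is inferred from the two rows above, not taken from CodSpeed's documentation, and `relative_change` is a hypothetical helper name.

```python
def relative_change(base_ms: float, head_ms: float) -> float:
    """Percent change in speed, assuming Change = BASE / HEAD - 1."""
    return (base_ms / head_ms - 1.0) * 100.0


# Second row of the table: 204.2 ms on main vs 311.9 ms on the PR branch.
slow_row = round(relative_change(204.2, 311.9), 2)
print(slow_row)

# First row: the displayed timings are rounded to 0.1 ms, so this only
# approximates the reported -13.7%.
fast_row = relative_change(3.2, 3.7)
print(fast_row)
```

Under this assumption the second row reproduces the reported -34.53%; the first row lands near, but not exactly on, -13.7% because the timings shown are rounded.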

rchowell · 2025-02-14T20:01:05Z

Posted in the other thread, but here's a pure-Python helper that can be used today.

import daft

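def summarize(df: daft.DataFrame) -> daft.DataFrame:
    """Return one row of summary statistics per column of df."""
    # One list of expressions per output column (name :: output type).
    cols = []  # column                :: utf8
    typs = []  # type                  :: utf8
    mins = []  # min                   :: utf8
    maxs = []  # max                   :: utf8
    cnts = []  # count                 :: uint64
    nuls = []  # count_nulls           :: uint64
    unqs = []  # approx_count_distinct :: uint64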
    for field in df.schema():
        col = daft.col(field.name)
        cols.append(daft.lit(field.name))
        typs.append(daft.lit(str(field.dtype)))
        mins.append(col.min().cast(daft.DataType.string()))
        maxs.append(col.max().cast(daft.DataType.string()))
        cnts.append(col.count("valid"))
        nuls.append(col.count("null"))
        unqs.append(col.approx_count_distinct())
    df = df.agg(
        [
            daft.list_(*cols).alias("column"),
            daft.list_(*typs).alias("type"),
            daft.list_(*mins).alias("min"),
            daft.list_(*maxs).alias("max"),
            daft.list_(*cnts).alias("count"),
            daft.list_(*nuls).alias("count_nulls"),
            daft.list_(*unqs).alias("approx_count_distinct"),
        ]
    )
    return df.explode(*df.columns)

codecov · 2025-02-14T20:07:55Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 77.89%. Comparing base (f9a4b70) to head (bb569d4).
Report is 6 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3810      +/-   ##
==========================================
+ Coverage   75.60%   77.89%   +2.29%     
==========================================
  Files         748      751       +3     
  Lines       99035    94852    -4183     
==========================================
- Hits        74875    73886     -989     
+ Misses      24160    20966    -3194
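
As a quick sanity check on the diff above, coverage is hits divided by lines, and hits plus misses should equal lines. The sketch below assumes the percentages are truncated (not rounded) to two decimals, which is what the reported 77.89% suggests; `coverage_pct` is a hypothetical helper name.

```python
import math


def coverage_pct(hits: int, lines: int) -> float:
    """Coverage in percent, truncated to two decimals (assumed display rule)."""
    return math.floor(hits / lines * 100 * 100) / 100


# Values copied from the coverage diff: (hits, misses, lines).
base = (74875, 24160, 99035)
head = (73886, 20966, 94852)

base_pct = coverage_pct(base[0], base[2])
head_pct = coverage_pct(head[0], head[2])
delta = round(head_pct - base_pct, 2)
print(base_pct, head_pct, delta)
```

The coverage percentage rises even though hits fall, because the total line count drops by more than the hit count does.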

| Files with missing lines | Coverage Δ | |
|---|---|---|
| daft/dataframe/dataframe.py | `85.30% <100.00%> (+0.05%)` | ⬆️ |
| daft/logical/builder.py | `90.81% <100.00%> (+0.15%)` | ⬆️ |
| src/daft-logical-plan/src/builder/mod.rs | `87.81% <100.00%> (+0.17%)` | ⬆️ |
| src/daft-logical-plan/src/ops/summarize.rs | `100.00% <100.00%> (ø)` | |

... and 43 files with indirect coverage changes

universalmind303 · 2025-02-16T20:35:52Z

src/daft-logical-plan/src/ops/summarize.rs

+}
+
+/// Creates a list constructor for the given items.
+fn list_(items: Vec<ExprRef>, alias: &str) -> ExprRef {


`list` isn't a keyword in Rust, so we can just call it `list`.

Suggested change

-fn list_(items: Vec<ExprRef>, alias: &str) -> ExprRef {
+fn list(items: Vec<ExprRef>, alias: &str) -> ExprRef {

feat: Adds .summarize() to compute statistics

b1994d1

rchowell requested a review from desmondcheongzx February 14, 2025 19:45

github-actions bot added the feat label Feb 14, 2025

rchowell mentioned this pull request Feb 14, 2025

feat: Add a .describe() API for stats on dataframe columns #3711

Closed

Fix doctests?

bb569d4

universalmind303 approved these changes Feb 16, 2025

rchowell merged commit b34c2bf into main Feb 16, 2025
43 of 44 checks passed

rchowell deleted the rchowell/summarize branch February 16, 2025 20:44
