feat: Adds .summarize() to compute statistics #3810
This PR replaces #3711
CodSpeed Performance Report
Merging #3810 will degrade performances by 34.53%
Posted in the other thread, but here's a pure-python helper that can be used today.

import daft

def summarize(df: daft.DataFrame):
    cols = []  # column :: utf8
    typs = []  # type :: utf8
    mins = []  # min :: utf8
    maxs = []  # max :: utf8
    cnts = []  # count :: int64
    nuls = []  # nulls :: int64
    unqs = []  # approx_distinct :: int64
    for field in df.schema():
        col = daft.col(field.name)
        cols.append(daft.lit(field.name))
        typs.append(daft.lit(str(field.dtype)))
        mins.append(col.min().cast(daft.DataType.string()))
        maxs.append(col.max().cast(daft.DataType.string()))
        cnts.append(col.count("valid"))
        nuls.append(col.count("null"))
        unqs.append(col.approx_count_distinct())
    df = df.agg(
        [
            daft.list_(*cols).alias("column"),
            daft.list_(*typs).alias("type"),
            daft.list_(*mins).alias("min"),
            daft.list_(*maxs).alias("max"),
            daft.list_(*cnts).alias("count"),
            daft.list_(*nuls).alias("count_nulls"),
            daft.list_(*unqs).alias("approx_count_distinct"),
        ]
    )
    return df.explode(*df.columns)
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3810      +/-   ##
==========================================
+ Coverage   75.60%   77.89%   +2.29%
==========================================
  Files         748      751       +3
  Lines       99035    94852    -4183
==========================================
- Hits        74875    73886     -989
+ Misses      24160    20966    -3194
}

/// Creates a list constructor for the given items.
fn list_(items: Vec<ExprRef>, alias: &str) -> ExprRef {
`list` isn't a keyword in Rust, so we can just call it `list`.
Suggested change:
- fn list_(items: Vec<ExprRef>, alias: &str) -> ExprRef {
+ fn list(items: Vec<ExprRef>, alias: &str) -> ExprRef {
Adds .summarize(), which computes column statistics as a new dataframe. It works by aggregating each statistic into a list expression (one entry per input column), then exploding those lists so that each input column becomes one row of the result.
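To make the aggregate-then-explode idea concrete, here is a minimal plain-Python sketch. It is not Daft code and not the implementation in this PR: the `summarize` function below, its dict-of-lists input, and the exact `count_distinct` statistic are assumptions made purely for illustration (the helper earlier in the thread uses an approximate distinct count and also reports the column type, which is omitted here).

```python
from typing import Any


def summarize(table: dict[str, list[Any]]) -> list[dict[str, Any]]:
    """Illustrative stand-in: `table` maps column name -> list of values (None = null)."""
    # Step 1, "aggregate": build one list per statistic,
    # with one entry per input column.
    stats: dict[str, list[Any]] = {
        "column": [],
        "min": [],
        "max": [],
        "count": [],
        "count_nulls": [],
        "count_distinct": [],
    }
    for name, values in table.items():
        valid = [v for v in values if v is not None]
        stats["column"].append(name)
        # min/max are cast to strings so the lists share one type across columns
        stats["min"].append(str(min(valid)) if valid else None)
        stats["max"].append(str(max(valid)) if valid else None)
        stats["count"].append(len(valid))
        stats["count_nulls"].append(len(values) - len(valid))
        stats["count_distinct"].append(len(set(valid)))  # exact, for simplicity

    # Step 2, "explode": walk the parallel lists in lockstep,
    # so that each input column becomes one output row.
    keys = list(stats)
    return [dict(zip(keys, row)) for row in zip(*stats.values())]


rows = summarize({"a": [3, 1, None, 3], "b": ["x", "y", "z", None]})
for row in rows:
    print(row)
```

Casting `min` and `max` to strings mirrors the helper's `.cast(daft.DataType.string())`: every statistic list has to hold a single type before the lists can be exploded together.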
Example
Explain