docs: Rewrite 'Getting started' page #19028

Merged: 6 commits, Oct 1, 2024
9 changes: 9 additions & 0 deletions docs/README.md
@@ -0,0 +1,9 @@
The documentation is split across two subfolders, `source` and `assets`.
The folder `source` contains the static source files that make up the user guide: mostly Markdown files and code snippets.
The folder `assets` contains (dynamically generated) assets used by those files, including data files for the snippets and images with plots or diagrams.

Do _not_ merge the two folders together.
In [PR #18773](https://github.com/pola-rs/polars/pull/18773) we introduced this split to fix live reloading of the MkDocs server.
If everything lives in a single folder `docs`, the MkDocs server watches that folder, so any change triggers a live reload and rebuild.
The rebuild runs scripts that write asset files back into `docs`, which counts as another change and leads to an infinite reload loop.
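The reload loop described above can be sketched with a tiny standalone watcher. This is a hypothetical illustration, not MkDocs internals: it only shows why a build that writes its outputs into the watched folder retriggers the watcher indefinitely.

```python
# Illustrative sketch of the reload loop, assuming a single watched folder.
# NOT MkDocs internals; file names here are made up for the example.
import pathlib
import tempfile

def rebuild(docs: pathlib.Path) -> pathlib.Path:
    # The build step writes a generated asset *into* the watched folder...
    asset = docs / "generated_asset.csv"
    asset.write_text("a,b\n1,2\n")
    return asset

def watch_once(docs: pathlib.Path, seen: dict) -> bool:
    # A naive watcher: any new or modified file counts as a change.
    changed = False
    for f in docs.rglob("*"):
        mtime = f.stat().st_mtime_ns
        if seen.get(f) != mtime:
            seen[f] = mtime
            changed = True
    return changed

docs = pathlib.Path(tempfile.mkdtemp())
(docs / "index.md").write_text("# hi\n")
seen: dict = {}
watch_once(docs, seen)          # initial scan
rebuild(docs)                   # an edit triggers a rebuild...
loop1 = watch_once(docs, seen)  # ...whose output retriggers the watcher
print(loop1)
```

With the split, the build writes into `assets` while the server watches only `source`, so `watch_once` would see no change and the loop terminates.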
46 changes: 41 additions & 5 deletions docs/source/_build/API_REFERENCE_LINKS.yml
@@ -8,8 +8,9 @@ python:
filter: https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.filter.html
with_columns: https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.with_columns.html
group_by: https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.group_by.html
agg: https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.dataframe.group_by.GroupBy.agg.html
join: https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.join.html
hstack: https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.hstack.html
vstack: https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.vstack.html
read_csv: https://docs.pola.rs/api/python/stable/reference/api/polars.read_csv.html
write_csv: https://docs.pola.rs/api/python/stable/reference/api/polars.DataFrame.write_csv.html
read_excel: https://docs.pola.rs/api/python/stable/reference/api/polars.read_excel.html
@@ -34,7 +35,12 @@ python:
fold: https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.fold.html
concat_str: https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.concat_str.html
str.split: https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.str.split.html
Expr.list: https://docs.pola.rs/api/python/stable/reference/expressions/list.html
Expr.list:
name: "list namespace"
link: https://docs.pola.rs/api/python/stable/reference/expressions/list.html
Expr.str:
name: "str namespace"
link: https://docs.pola.rs/api/python/stable/reference/expressions/string.html
element: https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.element.html
all: https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.all.html
exclude: https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.exclude.html
@@ -122,6 +128,9 @@ python:
is_selector: https://docs.pola.rs/api/python/stable/reference/selectors.html#polars.selectors.is_selector
expand_selector: https://docs.pola.rs/api/python/stable/reference/selectors.html#polars.selectors.expand_selector

Expr.dt:
name: "dt namespace"
link: https://docs.pola.rs/api/python/stable/reference/expressions/temporal.html
dt.convert_time_zone:
name: dt.convert_time_zone
link: https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.dt.convert_time_zone.html
@@ -147,6 +156,10 @@ python:

struct.field: https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.struct.field.html
struct.rename_fields: https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.struct.rename_fields.html
Expr.name:
name: "name namespace"
link: https://docs.pola.rs/api/python/stable/reference/expressions/name.html
round: https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.Expr.round.html#polars.Expr.round

rust:
DataFrame: https://docs.pola.rs/api/rust/dev/polars/frame/struct.DataFrame.html
@@ -160,12 +173,13 @@ rust:
filter: https://docs.pola.rs/api/rust/dev/polars_lazy/frame/struct.LazyFrame.html#method.filter
with_columns: https://docs.pola.rs/api/rust/dev/polars_lazy/frame/struct.LazyFrame.html#method.with_columns
group_by: https://docs.pola.rs/api/rust/dev/polars_lazy/frame/struct.LazyFrame.html#method.group_by
agg: https://docs.rs/polars/latest/polars/prelude/struct.LazyGroupBy.html#method.agg
group_by_dynamic:
name: group_by_dynamic
link: https://docs.pola.rs/api/rust/dev/polars_lazy/frame/struct.LazyFrame.html#method.group_by_dynamic
feature_flags: [dynamic_group_by]
join: https://docs.pola.rs/api/rust/dev/polars/prelude/trait.DataFrameJoinOps.html#method.join
hstack: https://docs.pola.rs/api/rust/dev/polars_core/frame/struct.DataFrame.html#method.hstack
vstack: https://docs.pola.rs/api/rust/dev/polars_core/frame/struct.DataFrame.html#method.vstack
concat: https://docs.pola.rs/api/rust/dev/polars_lazy/dsl/functions/fn.concat.html

explain: https://docs.rs/polars/latest/polars/prelude/struct.LazyFrame.html#method.explain
@@ -283,15 +297,24 @@ rust:
n_unique: https://docs.pola.rs/api/rust/dev/polars_lazy/dsl/enum.Expr.html#method.n_unique
null_count: https://docs.pola.rs/api/rust/dev/polars_lazy/dsl/enum.Expr.html#method.null_count
interpolate: https://docs.pola.rs/api/rust/dev/polars_lazy/dsl/enum.Expr.html#method.interpolate
is_between: https://github.com/pola-rs/polars/issues/11285
is_between:
name: is_between
link: https://docs.pola.rs/api/rust/dev/polars/prelude/enum.Expr.html#method.is_between
feature_flags: [is_between]
is_duplicated: https://docs.pola.rs/api/rust/dev/polars_lazy/dsl/enum.Expr.html#method.is_duplicated
is_null: https://docs.pola.rs/api/rust/dev/polars/prelude/enum.Expr.html#method.is_null
value_counts:
name: value_counts
link: https://docs.pola.rs/api/rust/dev/polars_lazy/dsl/enum.Expr.html#method.value_counts
feature_flags: [dtype-struct]

Expr.list: https://docs.pola.rs/api/rust/dev/polars_lazy/dsl/struct.ListNameSpace.html
Expr.list:
name: "list namespace"
link: https://docs.pola.rs/api/rust/dev/polars_lazy/dsl/struct.ListNameSpace.html
Expr.str:
name: "str namespace"
link: https://docs.pola.rs/api/rust/dev/polars/prelude/trait.StringNameSpaceImpl.html
feature_flags: [strings]
Series.arr: https://docs.pola.rs/api/rust/dev/polars_lazy/dsl/struct.ArrayNameSpace.html

date_range:
@@ -374,3 +397,16 @@ rust:
struct.field:
name: struct.field_by_name
link: https://docs.pola.rs/api/rust/dev/polars_lazy/dsl/struct.StructNameSpace.html#method.field_by_name

Expr.name:
name: "name namespace"
link: https://docs.pola.rs/api/rust/dev/polars_lazy/dsl/struct.ExprNameNameSpace.html
feature_flags: [lazy]
Expr.dt:
name: "dt namespace"
link: https://docs.pola.rs/api/rust/dev/polars_lazy/dsl/dt/struct.DateLikeNameSpace.html
feature_flags: [temporal]
round:
name: "round"
link: https://docs.pola.rs/api/rust/dev/polars/prelude/enum.Expr.html#method.round
feature_flags: [round_series]
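As the hunks above show, entries in `API_REFERENCE_LINKS.yml` come in two shapes: a bare URL string, or a mapping with `name`, `link`, and an optional `feature_flags` list. A hedged sketch of normalizing both shapes (the function is hypothetical; the docs build's actual loader may differ):

```python
# Sketch: normalize API_REFERENCE_LINKS.yml entries, which are either a
# bare URL string or a mapping with name/link/feature_flags.
# Hypothetical helper, not the project's real build code.
from typing import Any

def normalize_entry(key: str, entry: Any) -> dict:
    if isinstance(entry, str):
        # Bare-string form: the key doubles as the display name.
        return {"name": key, "link": entry, "feature_flags": []}
    return {
        "name": entry.get("name", key),
        "link": entry["link"],
        "feature_flags": entry.get("feature_flags", []),
    }

print(normalize_entry("agg", "https://example.invalid/agg"))
print(normalize_entry(
    "is_between",
    {"name": "is_between",
     "link": "https://example.invalid/is_between",
     "feature_flags": ["is_between"]},
))
```

Normalizing to one shape lets the rendering step treat every entry uniformly, e.g. appending a feature-flag badge only when `feature_flags` is non-empty.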
4 changes: 2 additions & 2 deletions docs/source/index.md
@@ -23,7 +23,7 @@ Polars is a blazingly fast DataFrame library for manipulating structured data. T
- **Fast**: Written from scratch in Rust, designed close to the machine and without external dependencies.
- **I/O**: First class support for all common data storage layers: local, cloud storage & databases.
- **Intuitive API**: Write your queries the way they were intended. Polars, internally, will determine the most efficient way to execute using its query optimizer.
- **Out of Core**: The streaming API allows you to process your results without requiring all your data to be in memory at the same time
- **Out of Core**: The streaming API allows you to process your results without requiring all your data to be in memory at the same time.
- **Parallel**: Utilises the power of your machine by dividing the workload among the available CPU cores without any additional configuration.
- **Vectorized Query Engine**: Using [Apache Arrow](https://arrow.apache.org/), a columnar data format, to process your queries in a vectorized manner and SIMD to optimize CPU usage.
- **GPU Support**: Optionally run queries on NVIDIA GPUs for maximum performance for in-memory workloads.
@@ -45,7 +45,7 @@ The goal of Polars is to provide a lightning fast DataFrame library that:
- A consistent and predictable API.
- Adheres to a strict schema (data-types should be known before running the query).

Polars is written in Rust which gives it C/C++ performance and allows it to fully control performance critical parts in a query engine.
Polars is written in Rust which gives it C/C++ performance and allows it to fully control performance-critical parts in a query engine.

## Example

135 changes: 135 additions & 0 deletions docs/source/src/python/user-guide/getting-started.py
@@ -0,0 +1,135 @@
# --8<-- [start:df]
import polars as pl
import datetime as dt

df = pl.DataFrame(
{
"name": ["Alice Archer", "Ben Brown", "Chloe Cooper", "Daniel Donovan"],
"birthdate": [
dt.date(1997, 1, 10),
dt.date(1985, 2, 15),
dt.date(1983, 3, 22),
dt.date(1981, 4, 30),
],
"weight": [57.9, 72.5, 53.6, 83.1], # (kg)
"height": [1.56, 1.77, 1.65, 1.75], # (m)
}
)

print(df)
# --8<-- [end:df]

# --8<-- [start:csv]
df.write_csv("docs/assets/data/output.csv")
df_csv = pl.read_csv("docs/assets/data/output.csv", try_parse_dates=True)
print(df_csv)
# --8<-- [end:csv]

# --8<-- [start:select]
result = df.select(
pl.col("name"),
pl.col("birthdate").dt.year().alias("birth_year"),
(pl.col("weight") / (pl.col("height") ** 2)).alias("bmi"),
)
print(result)
# --8<-- [end:select]

# --8<-- [start:expression-expansion]
result = df.select(
pl.col("name"),
(pl.col("weight", "height") * 0.95).round(2).name.suffix("-5%"),
)
print(result)
# --8<-- [end:expression-expansion]

# --8<-- [start:with_columns]
result = df.with_columns(
birth_year=pl.col("birthdate").dt.year(),
bmi=pl.col("weight") / (pl.col("height") ** 2),
)
print(result)
# --8<-- [end:with_columns]

# --8<-- [start:filter]
result = df.filter(pl.col("birthdate").dt.year() < 1990)
print(result)
# --8<-- [end:filter]

# --8<-- [start:filter-multiple]
result = df.filter(
pl.col("birthdate").is_between(dt.date(1982, 12, 31), dt.date(1996, 1, 1)),
pl.col("height") > 1.7,
)
print(result)
# --8<-- [end:filter-multiple]

# --8<-- [start:group_by]
result = df.group_by(
(pl.col("birthdate").dt.year() // 10 * 10).alias("decade"),
maintain_order=True,
).len()
print(result)
# --8<-- [end:group_by]

# --8<-- [start:group_by-agg]
result = df.group_by(
(pl.col("birthdate").dt.year() // 10 * 10).alias("decade"),
maintain_order=True,
).agg(
pl.len().alias("sample_size"),
pl.col("weight").mean().round(2).alias("avg_weight"),
pl.col("height").max().alias("tallest"),
)
print(result)
# --8<-- [end:group_by-agg]

# --8<-- [start:complex]
result = (
df.with_columns(
(pl.col("birthdate").dt.year() // 10 * 10).alias("decade"),
pl.col("name").str.split(by=" ").list.first(),
)
.select(
pl.all().exclude("birthdate"),
)
.group_by(
pl.col("decade"),
maintain_order=True,
)
.agg(
pl.col("name"),
pl.col("weight", "height").mean().round(2).name.prefix("avg_"),
)
)
print(result)
# --8<-- [end:complex]

# --8<-- [start:join]
df2 = pl.DataFrame(
{
"name": ["Ben Brown", "Daniel Donovan", "Alice Archer", "Chloe Cooper"],
"parent": [True, False, False, False],
"siblings": [1, 2, 3, 4],
}
)

print(df.join(df2, on="name", how="left"))
# --8<-- [end:join]

# --8<-- [start:concat]
df3 = pl.DataFrame(
{
"name": ["Ethan Edwards", "Fiona Foster", "Grace Gibson", "Henry Harris"],
"birthdate": [
dt.date(1977, 5, 10),
dt.date(1975, 6, 23),
dt.date(1973, 7, 22),
dt.date(1971, 8, 3),
],
"weight": [67.9, 72.5, 57.6, 93.1], # (kg)
"height": [1.76, 1.6, 1.66, 1.8], # (m)
}
)

print(pl.concat([df, df3], how="vertical"))
# --8<-- [end:concat]
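The `# --8<-- [start:name]` / `# --8<-- [end:name]` comments throughout the file above are section markers that let the Markdown pages pull in one named snippet at a time (via the pymdown-extensions "snippets" machinery). A standalone sketch of that extraction, for illustration only, not the real implementation:

```python
# Sketch: extract one named section delimited by --8<-- markers.
# Illustrative re-implementation; the docs build uses pymdownx.snippets.
import re

def extract_section(text: str, name: str) -> str:
    pattern = re.compile(
        rf"#\s*--8<--\s*\[start:{re.escape(name)}\]\n(.*?)"
        rf"#\s*--8<--\s*\[end:{re.escape(name)}\]",
        re.DOTALL,
    )
    match = pattern.search(text)
    if match is None:
        raise KeyError(name)
    return match.group(1)

source = """\
# --8<-- [start:df]
import polars as pl
# --8<-- [end:df]
# --8<-- [start:csv]
df.write_csv("out.csv")
# --8<-- [end:csv]
"""
print(extract_section(source, "csv"))
```

Keeping every snippet in one runnable file means the examples can be executed end to end in CI, while each page embeds only the section it discusses.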

29 changes: 0 additions & 29 deletions docs/source/src/python/user-guide/getting-started/joins.py

This file was deleted.
