Merge pull request #4612 from szarnyasg/latest-docs-version
Update CONTRIBUTING guide
szarnyasg authored Jan 20, 2025
2 parents 3c93b5f + a4cce49 commit 17220f0
Showing 9 changed files with 16 additions and 12 deletions.
10 changes: 7 additions & 3 deletions CONTRIBUTING.md
@@ -144,10 +144,14 @@ Some of this style guide is automated with GitHub Actions, but feel free to run
* :white_check_mark: ```see the [`COPY ... FROM` statement]({% link docs/sql/statements/copy.md %}#copy-from)```
* In most cases, linking related GitHub issues/discussions is discouraged. This allows the documentation to be self-contained.

-## Archive and Generated Pages
+## Latest and Stable Pages

-* The archive pages (e.g., <https://duckdb.org/docs/archive/0.8.1/>) contain documentation for old versions of DuckDB. In general, we do not accept contributions to these pages – please target the latest version of the page when submitting your contributions.
-* Many of the documentation's pages are auto-generated. Before editing, please check the [`scripts/generate_all_docs.sh`](scripts/generate_all_docs.sh) script. Do not edit the generated content, instead, edit the source files (often found in the [`duckdb` repository](https://github.com/duckdb/duckdb)).
+* The latest page, <https://duckdb.org/docs/latest/>, contains documentation for the latest `main` branch of DuckDB.
+* The versioned pages (e.g., <https://duckdb.org/docs/v1.1/>) contain documentation for the stable versions of DuckDB. We generally only accept contributions to the latest stable version. Older pages are only maintained if they contain a critical error.
+
+## Generated Pages
+
+Many of the documentation's pages are auto-generated. Before editing, please check the [`scripts/generate_all_docs.sh`](scripts/generate_all_docs.sh) script. Avoid directly editing the generated content; instead, edit the source files (often found in the [`duckdb` repository](https://github.com/duckdb/duckdb)), and run the generator script.

## Notice

2 changes: 1 addition & 1 deletion _posts/2021-06-25-querying-parquet.md
@@ -8,7 +8,7 @@ tags: ["using DuckDB"]

Apache Parquet is the most common "Big Data" storage format for analytics. In Parquet files, data is stored in a columnar-compressed binary format. Each Parquet file stores a single table. The table is partitioned into row groups, which each contain a subset of the rows of the table. Within a row group, the table data is stored in a columnar fashion.

<img src="/images/blog/parquet.svg" alt="Example parquet file shown visually. The parquet file (taxi.parquet) is divided into row-groups that each have two columns (pickup_at and dropoff_at)" title="Taxi Parquet File" style="max-width:30%"/>
<img src="/images/blog/parquet.svg" alt="Example parquet file shown visually. The parquet file (taxi.parquet) is divided into row groups that each have two columns (pickup_at and dropoff_at)" title="Taxi Parquet File" style="max-width:30%"/>

The Parquet format has a number of properties that make it suitable for analytical use cases:

2 changes: 1 addition & 1 deletion _posts/2021-12-03-duck-arrow.md
@@ -369,7 +369,7 @@ new_table = pa.Table.from_pandas(res)

The difference in times between DuckDB and Pandas is a combination of all the integration benefits we explored in this article. In DuckDB the filter pushdown is applied to perform partition elimination (i.e., we skip reading the Parquet files where the year is <= 2014). The filter pushdown is also used to eliminate unrelated row_groups (i.e., row groups where the total amount is always <= 100). Due to our projection pushdown, Arrow only has to read the columns of interest from the Parquet files, which allows it to read only 4 out of 20 columns. On the other hand, Pandas is not capable of automatically pushing down any of these optimizations, which means that the full dataset must be read. **This results in the 4 orders of magnitude difference in query execution time.**

-In the table above, we also depict the comparison of peak memory usage between DuckDB (Streaming) and Pandas (Fully-Materializing). In DuckDB, we only need to load the row-group of interest into memory. Hence our memory usage is low. We also have constant memory usage since we only have to keep one of these row groups in-memory at a time. Pandas, on the other hand, has to fully materialize all Parquet files when executing the query. Because of this, we see a constant steep increase in its memory consumption. **The total difference in memory consumption of the two solutions is around 3 orders of magnitude.**
+In the table above, we also depict the comparison of peak memory usage between DuckDB (Streaming) and Pandas (Fully-Materializing). In DuckDB, we only need to load the row group of interest into memory. Hence our memory usage is low. We also have constant memory usage since we only have to keep one of these row groups in-memory at a time. Pandas, on the other hand, has to fully materialize all Parquet files when executing the query. Because of this, we see a constant steep increase in its memory consumption. **The total difference in memory consumption of the two solutions is around 3 orders of magnitude.**
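
A rough sketch of the integration pattern the paragraphs above describe, registering a partitioned Arrow dataset with DuckDB so that projection and filter pushdown apply. The dataset path, partitioning scheme, and column names are illustrative placeholders, not the exact benchmark setup:

```python
import duckdb
import pyarrow.dataset as ds

# Expose the partitioned Parquet files as a lazily scanned Arrow dataset.
nyc = ds.dataset("nyc-taxi/", format="parquet", partitioning="hive")

con = duckdb.connect()
con.register("nyc", nyc)

# DuckDB reads only the referenced columns (projection pushdown) and hands the
# filters to Arrow, which can skip whole files and row groups (filter pushdown).
result = con.execute("""
    SELECT passenger_count, avg(total_amount) AS avg_amount
    FROM nyc
    WHERE year > 2014 AND total_amount > 100
    GROUP BY passenger_count
""").arrow()
print(result)
```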

## Conclusion and Feedback

2 changes: 1 addition & 1 deletion _posts/2023-10-27-csv-sniffer.md
@@ -13,7 +13,7 @@ tags: ["using DuckDB"]
width="300"
/>

-There are many different file formats that users can choose from when storing their data. For example, there are performance-oriented binary formats like Parquet, where data is stored in a columnar format, partitioned into row-groups, and heavily compressed. However, Parquet is known for its rigidity, requiring specialized systems to read and write these files.
+There are many different file formats that users can choose from when storing their data. For example, there are performance-oriented binary formats like Parquet, where data is stored in a columnar format, partitioned into row groups, and heavily compressed. However, Parquet is known for its rigidity, requiring specialized systems to read and write these files.

On the other side of the spectrum, there are files with the CSV (comma-separated values) format, which I like to refer to as the 'Woodstock of data'. CSV files offer the advantage of flexibility; they are structured as text files, allowing users to manipulate them with any text editor, and nearly any data system can read and execute queries on them.
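
A small sketch of the automatic dialect and schema detection the post discusses, using DuckDB's `read_csv_auto` from the Python client; the `events.csv` file name is a placeholder:

```python
import duckdb

# read_csv_auto samples the file to detect the delimiter, quote character,
# header, and column types before the query runs.
rel = duckdb.sql("SELECT * FROM read_csv_auto('events.csv')")
print(rel.columns)  # detected column names
print(rel.types)    # detected column types
print(rel.limit(5))
```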

2 changes: 1 addition & 1 deletion _posts/2024-10-23-whats-new-in-the-vss-extension.md
@@ -12,7 +12,7 @@ In the [previous blog post]({% post_url 2024-05-03-vector-similarity-search-vss

## Indexing Speed Improvements

-As previously documented, creating an HNSW (Hierarchical Navigable Small Worlds) index over an already populated table is much more efficient than first creating the index and then inserting into the table. This is because it is much easier to predict how large the index will be if the total amount of rows are known up-front, which makes its possible to divide the work into chunks large enough to distribute over multiple threads. However, in the initial release this work distribution was a bit too coarse-grained as we would only schedule an additional worker thread for each [_row-group_]({% link docs/internals/storage.md%}#row-groups) (about 120,000 rows by default) in the table.
+As previously documented, creating an HNSW (Hierarchical Navigable Small Worlds) index over an already populated table is much more efficient than first creating the index and then inserting into the table. This is because it is much easier to predict how large the index will be if the total number of rows is known up front, which makes it possible to divide the work into chunks large enough to distribute over multiple threads. However, in the initial release this work distribution was a bit too coarse-grained, as we would only schedule an additional worker thread for each [_row group_]({% link docs/internals/storage.md %}#row-groups) (about 120,000 rows by default) in the table.

We've now introduced an extra buffer step in the index creation pipeline which enables more fine-grained work distribution, smarter memory allocation and less contention between worker threads. This results in much higher CPU saturation and a significant speedup when building HNSW indexes in environments with many threads available, regardless of how big or small the underlying table is.
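
A rough sketch of the populate-then-index pattern described above, using the `vss` extension from the Python client; the table, column, and index names are made up for illustration:

```python
import duckdb

con = duckdb.connect()        # in-memory database
con.execute("INSTALL vss")    # requires network access to install
con.execute("LOAD vss")

# Populate the table first: building the HNSW index over existing rows lets
# the extension size the index up front and parallelize construction.
con.execute("CREATE TABLE items (id INTEGER, vec FLOAT[3])")
con.execute("""
    INSERT INTO items VALUES
        (1, array_value(0.1, 0.2, 0.3)),
        (2, array_value(0.4, 0.5, 0.6)),
        (3, array_value(0.7, 0.8, 0.9))
""")
con.execute("CREATE INDEX items_hnsw ON items USING HNSW (vec)")

# The index accelerates nearest-neighbor queries over the vector column.
print(con.execute("""
    SELECT id
    FROM items
    ORDER BY array_distance(vec, array_value(0.2, 0.2, 0.2)::FLOAT[3])
    LIMIT 2
""").fetchall())
```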

2 changes: 1 addition & 1 deletion docs/extensions/delta.md
@@ -86,7 +86,7 @@ While the `delta` extension is still experimental, many (scanning) features and

* multithreaded scans and Parquet metadata reading
* data skipping/filter pushdown
-* skipping row-groups in file (based on Parquet metadata)
+* skipping row groups in file (based on Parquet metadata)
* skipping complete files (based on Delta partition information)
* projection pushdown
* scanning tables with deletion vectors
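
For context on the feature list above, a minimal sketch of querying a Delta table via the extension's `delta_scan` function; the local table path is a placeholder:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL delta")  # requires network access to install
con.execute("LOAD delta")

# delta_scan() exposes a Delta Lake table as a regular table function; the
# pushdowns listed above (file and row-group skipping, projection) apply to it.
print(con.execute("""
    SELECT count(*)
    FROM delta_scan('./my_delta_table')
""").fetchone())
```
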
4 changes: 2 additions & 2 deletions docs/index.md
@@ -19,13 +19,13 @@ title: Documentation
<div class="box-link-wrapper">
<div class="box-link half-width">
<a href="{% link docs/api/cli/overview.md %}"></a>
<span class="symbol"><img src="{%link images/icons/cli.svg %}"></span>
<span class="symbol"><img src="{% link images/icons/cli.svg %}"></span>
<span>CLI (Command Line Interface)</span>
<span class="chevron"></span>
</div>
<div class="box-link half-width">
<a href="{% link docs/api/java.md %}"></a>
<span class="symbol"><img src="{%link images/icons/java.svg %}"></span>
<span class="symbol"><img src="{% link images/icons/java.svg %}"></span>
<span>Java</span>
<span class="chevron"></span>
</div>
2 changes: 1 addition & 1 deletion faq.md
@@ -50,7 +50,7 @@ Ducks are amazing animals. They can fly, walk and swim. They can also live off p
DuckDB is fully open-source under the MIT license and its development takes place [on GitHub in the `duckdb/duckdb` repository](https://github.com/duckdb/duckdb).
All components of DuckDB are available in the free version under this license: there is no “enterprise version” of DuckDB.

-The intellectual property of DuckDB has been purposefully moved to a non-profit entity to disconnect the licensing of the project from the commercial company, DuckDB Labs.
+Most of the intellectual property of DuckDB has been purposefully moved to a non-profit entity to disconnect the licensing of the project from the commercial company, DuckDB Labs.
The DuckDB Foundation's statutes also ensure DuckDB remains open-source under the MIT license in perpetuity.
The [CWI (Centrum Wiskunde & Informatica)](https://cwi.nl/) has a seat on the board of the DuckDB Foundation
and donations to the DuckDB Foundation directly fund DuckDB development.
2 changes: 1 addition & 1 deletion foundation/index.html
@@ -20,7 +20,7 @@ <h1>DuckDB Foundation</h1>
<h1>Our Purpose</h1>
<p>
The independent non-profit DuckDB Foundation safeguards the long-term maintenance and
-development of DuckDB. The foundation holds much of the intellectual property (IP)
+development of DuckDB. The foundation holds most of the intellectual property (IP)
of the project. The DuckDB Foundation is funded by charitable donations. All collected funds go directly to DuckDB development.
A number of organizations have thankfully chosen to support the DuckDB
project.
