feat: honor appendOnly table config #1747

Merged
merged 2 commits into from
Oct 25, 2023

Conversation

junjunjd
Contributor

@junjunjd junjunjd commented Oct 20, 2023

Description

Throw an error if a transaction includes Remove action with data change but the Delta Table is append-only.

Related Issue(s)

@github-actions github-actions bot added binding/rust Issues for the Rust crate rust labels Oct 20, 2023
rust/src/operations/transaction/mod.rs Outdated Show resolved Hide resolved
rust/src/operations/transaction/mod.rs Outdated Show resolved Hide resolved
@rtyler rtyler added this to the Rust v0.17 milestone Oct 20, 2023
Collaborator

@roeap roeap left a comment

Thanks for taking care of this @junjunjd!

Looking great, I especially appreciate the test coverage!

There are a few smaller things we may consider. I believe the commit or commit_with_retry function may be a little bit more fitting place to do the check?

On a more general level (and this is completely optional for this PR): for the different validations we have been creating *Checker structs, like the DataChecker or the ConflictChecker. I am currently working on broader protocol support, and planned to introduce something like a PreconditionChecker which handles more pre-commit checks - e.g. writer version compatibility.

Just for append-only this may be overkill, so I leave it up to you if you want to adopt this here.
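
For illustration, a checker in the style described above might look roughly like the sketch below. Everything here is hypothetical: `TableState`, `PendingCommit`, `PreconditionError`, and the shape of `PreconditionChecker` are simplified stand-ins invented for this example and are not the delta-rs types or the design that was eventually implemented.

```rust
/// Hypothetical, simplified view of the table state a checker would need.
pub struct TableState {
    pub append_only: bool,
    pub min_writer_version: i32,
}

/// Hypothetical summary of the commit that is about to be written.
pub struct PendingCommit {
    /// Number of remove actions that carry `dataChange = true`.
    pub removes_with_data_change: usize,
}

#[derive(Debug, PartialEq)]
pub enum PreconditionError {
    AppendOnlyViolation,
    UnsupportedWriterVersion { required: i32, supported: i32 },
}

/// Groups pre-commit validations in one place, following the `*Checker`
/// naming pattern (assumed shape, not the real implementation).
pub struct PreconditionChecker<'a> {
    state: &'a TableState,
    supported_writer_version: i32,
}

impl<'a> PreconditionChecker<'a> {
    pub fn new(state: &'a TableState, supported_writer_version: i32) -> Self {
        Self { state, supported_writer_version }
    }

    /// Runs every precondition and returns the first failure, if any.
    pub fn check(&self, commit: &PendingCommit) -> Result<(), PreconditionError> {
        // The table requires a newer writer than this client implements.
        if self.state.min_writer_version > self.supported_writer_version {
            return Err(PreconditionError::UnsupportedWriterVersion {
                required: self.state.min_writer_version,
                supported: self.supported_writer_version,
            });
        }
        // Append-only tables must not lose data through remove actions.
        if self.state.append_only && commit.removes_with_data_change > 0 {
            return Err(PreconditionError::AppendOnlyViolation);
        }
        Ok(())
    }
}
```

A caller would build the checker once per commit attempt and call `check` before writing the log entry, so further pre-commit rules can be added without touching the commit path itself.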

rust/src/operations/transaction/mod.rs Outdated Show resolved Hide resolved
@junjunjd junjunjd force-pushed the feat/honor-append-only-table-config branch from b9cbd1a to aa8062d Compare October 23, 2023 05:08
@junjunjd junjunjd requested review from rtyler and roeap October 23, 2023 05:09
@junjunjd junjunjd force-pushed the feat/honor-append-only-table-config branch from aa8062d to d01aac3 Compare October 23, 2023 05:18
@junjunjd
Contributor Author

junjunjd commented Oct 23, 2023

There are a few smaller things we may consider. I believe the commit or commit_with_retry function may be a little bit more fitting place to do the check?

@roeap I agree logically commit or commit_with_retry function would be the right place to do the check. However, the code that iterates the Action vector exists in log_entry_from_actions. commit_with_retry calls log_entry_from_actions (and a bunch of intermediate functions) to prepare the commit. I think it makes sense to put the check in log_entry_from_actions so that the Action vector is only iterated once.
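
A minimal sketch of the single-pass idea, assuming simplified stand-in types: `Action`, `Remove`, `CommitError`, and `log_entry_sketch` are hypothetical names for this example and do not mirror the real delta-rs signatures. The point is only that the append-only check can ride along with the loop that already serializes each action.

```rust
/// Simplified stand-in for a remove action (hypothetical shape).
#[derive(Debug)]
pub struct Remove {
    pub path: String,
    /// True when removing the file changes the table's logical data.
    pub data_change: bool,
}

/// Simplified stand-in for the action enum (hypothetical shape).
#[derive(Debug)]
pub enum Action {
    Add { path: String },
    Remove(Remove),
}

#[derive(Debug, PartialEq)]
pub enum CommitError {
    /// A data-changing remove was attempted on an append-only table.
    AppendOnlyViolation { path: String },
}

/// Builds the log entry and validates the append-only rule in the same
/// pass, so the action slice is traversed only once.
pub fn log_entry_sketch(actions: &[Action], append_only: bool) -> Result<String, CommitError> {
    let mut lines = Vec::with_capacity(actions.len());
    for action in actions {
        if let Action::Remove(remove) = action {
            if append_only && remove.data_change {
                return Err(CommitError::AppendOnlyViolation {
                    path: remove.path.clone(),
                });
            }
        }
        // Debug formatting stands in for the real JSON serialization.
        lines.push(format!("{:?}", action));
    }
    Ok(lines.join("\n"))
}
```

Removes with `data_change: false` (for example from compaction) pass the check, which matches the PR description: only removes with data change are rejected.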

I am currently working on broader protocol support, and planned to introduce something like a PreconditionChecker which handles more pre-commit checks - e.g. writer version compatibility.

Building a high-level checker is a good idea. For append-only, we can add some high-level checking to the delete, merge and update operations so they return an error immediately. I have created a follow-up issue to add high-level checking: #1759.

@junjunjd
Contributor Author

@rtyler @roeap @wjones127 I have addressed the comments. This MR is ready for final review.

@roeap
Collaborator

roeap commented Oct 24, 2023

I think it makes sense to put the check in log_entry_from_actions so that the Action vector is only iterated once.

I see your point; however, we discussed this with Databricks, and even in their environments commits rarely get larger than the low tens of actions (with some extreme outliers like shallow copy). So all in all I would guess the cost of one traversal is negligible compared to all the IO we just did writing files and the potentially many traversals in conflict checking ...

That said, we can revisit this when we create a higher level helper.

Collaborator

@roeap roeap left a comment

LGTM! thanks for this work!!

There is one more thing I realised: could we update the README to reflect the feature support in the tables towards the end?

Throw an error if a transaction includes Remove action with data change
but the Delta Table is append-only.
@junjunjd junjunjd force-pushed the feat/honor-append-only-table-config branch from d01aac3 to 45e10a7 Compare October 24, 2023 21:35
@junjunjd
Copy link
Contributor Author

junjunjd commented Oct 24, 2023

There is one more thing I realised: could we update the README to reflect the feature support in the tables towards the end?

@roeap README is updated. Thanks for the review.

@junjunjd junjunjd requested a review from roeap October 24, 2023 21:39
@roeap roeap enabled auto-merge (squash) October 25, 2023 04:54
@roeap roeap merged commit 2cbf938 into delta-io:main Oct 25, 2023
21 checks passed
ryanaston pushed a commit to segmentio/delta-rs that referenced this pull request Nov 1, 2023
* feat: extend unit catalog support

* chore: draft datafusion integration

* fix: allow passing catalog options from python

* chore: clippy

* feat: add more azure credentials

* fix: add defaults for return types

* fix: simpler defaults

* Update rust/src/data_catalog/unity/mod.rs

Co-authored-by: nohajc <[email protected]>

* fix: imports

* fix: add some defaults

* test: add failing provider test

* feat: list catalogs

* merge main

* fix: remove artifact

* fix: errors after merge with main

* Start python api docs

* docs: update Readme (delta-io#1440)

# Description

With summit coming up I thought we might update our README, since
delta-rs has evolved quite a bit since the README was first written...

Just opening the Draft to get feedback on the general "patterns" i.e.
how the tables are formatted, how detailed we want to show the features
and mostly the looks of the header.

Also hoping our community experts may have some content they want to add
here 😆.

cc @dennyglee @MrPowers @wjones127 @rtyler @houqp @fvaleye

---------

Co-authored-by: Will Jones <[email protected]>
Co-authored-by: R. Tyler Croy <[email protected]>

* Pin chrono to 0.4.30

v0.4.31 was just released which introduces some spurious deprecation warnings

* docs: update Readme (delta-io#1633)

# Description
- Changed the icons as, at first glance, it looked like AWS was not
supported (in blue), while the green open icon looked like it was
completed
- Added one line linking to the Delta Lake docker
- Fixed some minor grammar issues

Including community experts @roeap @MrPowers @wjones127 @rtyler @houqp
@fvaleye to ensure these updates make sense. Thanks!

* chore: update datafusion to 31, arrow to 46 and object_store to 0.7 (delta-io#1634)

# Description

Update datafusion to 31

* chore: relax chrono pin to 0.4 (delta-io#1635)

# Description

relax chrono pin to improve downstream compatibility.

* make create_checkpoint_for public

* add documentation to create_checkpoint_for

* Implement parsing for the new `domainMetadata` actions in the commit log

The Delta Lake protocol which will be released in conjunction with "3.0.0"
(currently at RC1) introduces `domainMetadata` actions to the commit log to
enable system or user-provided metadata about the commits to be added to the
log. With DBR 13.3 in the Databricks ecosystem, tables are already being written
with this action via the "liquid clustering" feature.

This change enables the clean reading of these tables, but at present nothing
novel is done with this information.

[Read more here](https://www.databricks.com/blog/announcing-delta-lake-30-new-universal-format-and-liquid-clustering)

Fixes delta-io#1626

Sponsored-by: Databricks Inc

* fix: include in-progress row group when calculating in-memory buffer length (delta-io#1638)

# Description
`PartitionWriter.buffer_len()` is documented as returning: 

> the current byte length of the in memory buffer.

However, this doesn't currently include the length of the in-progress
row group. This means that until a row group is flushed, `buffer_len()`
returns `0`. Based on the documented description, its length should
probably include the bytes currently in-memory as part of an unflushed
row group.

`buffered_record_batch_count` _does_ include in-progress row groups, so
this change also means record count and buffered bytes are reported
consistently.
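
A rough sketch of the accounting change described above. `PartitionWriterSketch` and its fields are hypothetical stand-ins, not the real `PartitionWriter`; the sketch only contrasts counting flushed bytes alone with counting flushed bytes plus the in-progress row group.

```rust
/// Hypothetical stand-in for a partition writer's in-memory state.
pub struct PartitionWriterSketch {
    /// Bytes of row groups already flushed into the in-memory buffer.
    flushed: Vec<u8>,
    /// Estimated bytes of the row group still being assembled.
    in_progress_bytes: usize,
    /// Record batches buffered so far, flushed or not.
    batch_count: usize,
}

impl PartitionWriterSketch {
    pub fn new() -> Self {
        Self { flushed: Vec::new(), in_progress_bytes: 0, batch_count: 0 }
    }

    /// Buffers one batch; `encoded_size` is the caller's size estimate.
    pub fn write(&mut self, encoded_size: usize) {
        self.in_progress_bytes += encoded_size;
        self.batch_count += 1;
    }

    /// Moves the in-progress row group into the flushed buffer.
    pub fn flush_row_group(&mut self) {
        let new_len = self.flushed.len() + self.in_progress_bytes;
        self.flushed.resize(new_len, 0);
        self.in_progress_bytes = 0;
    }

    /// Earlier behaviour: only flushed bytes are counted.
    pub fn buffer_len_flushed_only(&self) -> usize {
        self.flushed.len()
    }

    /// Behaviour after the fix: flushed bytes plus the in-progress row group.
    pub fn buffer_len(&self) -> usize {
        self.flushed.len() + self.in_progress_bytes
    }

    /// Already included in-progress batches before the fix.
    pub fn buffered_record_batch_count(&self) -> usize {
        self.batch_count
    }
}
```

After a single `write(100)` and no flush, the flushed-only count is still zero while the batch count is one; the corrected `buffer_len` reports 100, so the two metrics agree.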

# Related Issue(s)
- closes delta-io#1637

# Documentation

[`buffer_len` on
`RecordBatchWriter`](https://docs.rs/deltalake/0.15.0/deltalake/writer/record_batch/struct.RecordBatchWriter.html#method.buffer_len)

---------

Co-authored-by: Will Jones <[email protected]>

* feat: allow multiple incremental commits in optimize

Currently "optimize" executes the whole plan in one commit, which might
fail. The larger the table, the more likely it is to fail and the more
expensive the failure is.

Add an option in OptimizeBuilder that allows specifying a commit
interval. If that is provided, the plan executor will periodically
commit the accumulated actions.
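
One way the periodic-commit behaviour could be sketched, under stated assumptions: `execute_with_interval`, `run_task`, and `commit` are hypothetical names for this example, and the real `OptimizeBuilder` executor is asynchronous and considerably more involved. The sketch only shows the accumulate-then-commit loop.

```rust
use std::time::{Duration, Instant};

/// Runs `tasks` in order and commits the accumulated actions whenever at
/// least `min_commit_interval` has elapsed since the previous commit, plus
/// once at the end for any remainder. With `None`, everything lands in a
/// single final commit. Returns the number of commits made.
pub fn execute_with_interval<T, A, E>(
    tasks: impl IntoIterator<Item = T>,
    min_commit_interval: Option<Duration>,
    mut run_task: impl FnMut(T) -> Result<Vec<A>, E>,
    mut commit: impl FnMut(Vec<A>) -> Result<(), E>,
) -> Result<usize, E> {
    let mut pending: Vec<A> = Vec::new();
    let mut commits = 0;
    let mut last_commit = Instant::now();
    for task in tasks {
        pending.extend(run_task(task)?);
        let due = min_commit_interval
            .map(|interval| last_commit.elapsed() >= interval)
            .unwrap_or(false);
        if due && !pending.is_empty() {
            // Hand the accumulated actions over and start a fresh batch.
            commit(std::mem::take(&mut pending))?;
            commits += 1;
            last_commit = Instant::now();
        }
    }
    if !pending.is_empty() {
        commit(std::mem::take(&mut pending))?;
        commits += 1;
    }
    Ok(commits)
}
```

With an interval, a failure late in the plan loses only the work since the last commit rather than the whole run, which is the motivation given in the commit message.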

* fix: explicitly require chrono 0.4.31 or greater

The Python binding relies on `timestamp_nanos_opt()`, which requires 0.4.31 or
greater from chrono since it did not previously exist.

As a [cargo dependency
refresher](https://doc.rust-lang.org/cargo/reference/specifying-dependencies.html#specifying-dependencies-from-cratesio)
this version range is >=0.4.31, < 0.5.0 which is I believe what we need for
optimal downstream compatibility.

* Correct some merge related errors with redundant package names from the workspace

* Address some latent clippy failures after merging main

* Correct the incorrect documentation for `Backoff`

* fix: avoid excess listing of log files

* feat: pass known file sizes to filesystem in Python (delta-io#1630)

# Description
Currently the Filesystem implementation always makes a HEAD request when
opening a file, to determine the file size. The proposed change is to
read the file sizes from the delta log instead, and to pass them down to
the `open_input_file` call, eliminating the HEAD request.

* Proposed updated CODEOWNERS to allow better review notifications

Based on current pull request feedback and maintenance trends I'm suggesting
these rules to get the right people on the reviews by default.

Closes delta-io#1553

* fix: add support for Microsoft OneLake

This change introduces tests and support for Microsoft OneLake. This specific
commit is a rebase of the work done by our pals at Microsoft.

Co-authored-by: Mohammed Muddassir <[email protected]>
Co-authored-by: Christopher Watford <[email protected]>

* Ignore failing integration tests which require a special environment to operate

The OneLake support should be considered unsupported and experimental until such
time when we can add integration testing to our CI process

* Compensate for invalid log files created by Delta Live Tables

It would appear that in some cases Delta Live Tables will create a Delta table
which does not adhere to the Delta Table protocol.

The metaData action has a **required** `schemaString` property which simply
doesn't exist. Since it appears that this only exists at version zero of the
transaction log, and the _actual_ schema exists in the following versions of the
table (e.g. 1), this change introduces a default deserializer on the MetaData
action which provides a simple empty schema.

This is an alternative implementation to delta-io#1305 which is a bit more invasive and
makes our schema_string struct member `Option<String>` which I do not believe is
worth it for this unfortunate compatibility issue

Closes delta-io#1305, delta-io#1302, delta-io#1357

Sponsored-by: Databricks Inc

* chore: fix the incorrect Slack link in our readme

not sure what the deal is with the go.delta.io service, no idea where that lives

Fixes delta-io#1636

* enable offset listing for s3

* Make docs.rs build docs with all features enabled

I was confused that I could not find the documentation integrating datafusion with delta-rs.

With this PR, everything should show up. Perhaps docs for a feature gated method should also mention which feature is required. Similar to what Tokio does. Perhaps it could be done in followup PRs.

* feat: expose min_commit_interval to `optimize.compact` and `optimize.z_order` (delta-io#1645)

# Description
Exposes min_commit_interval in the Python API to `optimize.compact` and
`optimize.z_order`. Added one test-case to verify the
min_commit_interval.

# Related Issue(s)
closes delta-io#1640

---------

Co-authored-by: Will Jones <[email protected]>

* docs: add docstring to protocol method (delta-io#1660)

* fix: percent encoding of partition values and paths

* feat: handle path encoding in serde and encode partition values in file names

* fix: always unquote partition values extracted from path

* test: add tests for related issues

* fix: consistent serialization of partition values

* fix: roundtrip special characters

* chore: format

* fix: add feature requirement to load example

* test: add timestamp col to partitioned roundtrip tests

* test: add rust roundtrip test for special characters

* fix: encode characters illegal on windows

* docs: fix some typos (delta-io#1662)

# Description
Saw two typos and marking merge in rust as half-done with a comment on
its current limitation.

* feat: use url parsing from object store

* fix: ensure config for ms fabric

* chore: drive-by simplify test files

* fix: update aws http config key

* fix: feature gate azure update

* feat: more robust azure config handling

* fix: in memory store handling

* feat: use object-store's s3 store if copy-if-not-exists headers are specified (delta-io#1356)

* refactor: re-organize top level modules (delta-io#1434)

# Description

~This contains changes from delta-io#1432, will rebase once that's merged.~

This PR constitutes the bulk of re-organising our top level modules.
- move `DeltaTable*` structs into new `table` module
- move table configuration into `table` module
- move schema related modules into `schema` module
- rename `action` module to `protocol` - hoping to isolate everything
that can one day be the log kernel.

~It also removes the deprecated commit logic from `DeltaTable` and
updates call sites and tests accordingly.~

I am planning one more follow up, where I hope to make `transactions`
currently within `operations` a top level module. While the number of
touched files here is already massive, I want to do this in a follow up,
as it will also include some updates to the transactions itself, that
should be more carefully reviewed.

# Related Issue(s)

closes: delta-io#1136

* chore: increment python library version (delta-io#1664)

* fix exception string in writer.py

The exception message is ambiguous as it interchanges the table and data schemas.

* Update docs

* add read me

* Add space

* feat: allow to set large dtypes for the schema check in `write_deltalake` (delta-io#1668)

# Description
Previously the schema check always assumed non-large types. I didn't
know we could change that, so in polars we added some schema casting
from large to non-large types. This became a problem today when I
wanted to write 200M records at once, because the array was too big to
fit in the normal string type.

```python
ArrowInvalid: Failed casting from large_string to string: input array too large
```

Adding this flag will allow libraries like polars to write directly with
their large dtypes in arrow. If this is merged, I can work on fix in
polars to remove the schema casting for these large types.

* fix: change partitioning schema from large to normal string for pyarrow<12 (delta-io#1671)

# Description
If pyarrow is below v12.0.0 it changes the partitioning schema fields
from large_string to string.

# Related Issue(s)
closes delta-io#1669 

# Documentation
apache/arrow#34546 (comment)

---------

Co-authored-by: Will Jones <[email protected]>

* chore: bump rust crate version

* fix: use epoch instead of ce for date stats (delta-io#1672)

# Description
date32 statistics logic was subjectively wrong. It was using
`from_num_days_from_ce_opt` which
> Makes a new NaiveDate from a day's number in the proleptic Gregorian
calendar, with January 1, 1 being day 1.

while date32 is commonly represented as days since UNIX epoch
(1970-01-01)



# Related Issue(s)
closes delta-io#1670

# Documentation
It doesn't seem like parquet actually has a spec for what a `date`
should be, but many other tools seem to use the epoch logic.

duckdb, and polars seem to use epoch instead of gregorian. 

Also arrow spec states that date32 should be epoch.

for example, if i write using polars
```py
import polars as pl

# %%
df = pl.DataFrame(
    {
        "a": [
            10561,
            9200,
            9201,
            9202,
            9203,
            9204,
            9205,
            9206,
            9207,
            9208,
            9199,
        ]
    }
)
# %%

df.select(pl.col("a").cast(pl.Date)).write_delta("./db/polars/")
```
the stats are correctly interpreted
```
{"add":{"path":"0-7b8f11ab-a259-4673-be06-9deedeec34ff-0.parquet","size":557,"partitionValues":{},"modificationTime":1695779554372,"dataChange":true,"stats":"{\"numRecords\": 11, \"minValues\": {\"a\": \"1995-03-10\"}, \"maxValues\": {\"a\": \"1998-12-01\"}, \"nullCount\": {\"a\": 0}}"}}
```
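
To make the difference concrete, here is a small std-only sketch (no chrono dependency) that converts the same integers both ways. The function names are hypothetical; the conversion uses the well-known "civil from days" algorithm, and the constant 719_163 is the CE day number of 1970-01-01.

```rust
/// CE day number of 1970-01-01 (day 1 is 0001-01-01 in the proleptic
/// Gregorian calendar).
pub const UNIX_EPOCH_DAYS_FROM_CE: i64 = 719_163;

/// Converts days since 1970-01-01 to (year, month, day).
pub fn date_from_epoch_days(days: i64) -> (i64, u32, u32) {
    let z = days + 719_468; // shift so day 0 is 0000-03-01
    let era = z.div_euclid(146_097); // completed 400-year cycles
    let doe = z.rem_euclid(146_097); // day of era, 0..=146_096
    let yoe = (doe - doe / 1_460 + doe / 36_524 - doe / 146_096) / 365; // 0..=399
    let doy = doe - (365 * yoe + yoe / 4 - yoe / 100); // day of March-based year
    let mp = (5 * doy + 2) / 153; // month index, 0 = March
    let day = (doy - (153 * mp + 2) / 5 + 1) as u32;
    let month = (if mp < 10 { mp + 3 } else { mp - 9 }) as u32;
    // January and February belong to the next civil year.
    let year = yoe + era * 400 + i64::from(month <= 2);
    (year, month, day)
}

/// Interprets the same integer as a CE day number, which is effectively
/// what a days-from-CE constructor does with a date32 value.
pub fn date_from_ce_days(days: i64) -> (i64, u32, u32) {
    date_from_epoch_days(days - UNIX_EPOCH_DAYS_FROM_CE)
}
```

Fed the smallest and largest values from the polars example, `date_from_epoch_days(9199)` gives 1995-03-10 and `date_from_epoch_days(10561)` gives 1998-12-01, matching the stats shown above, while `date_from_ce_days(9199)` lands in the first century.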

* chore: update changelog for the rust-v0.16.0 release

* Remove redundant changelog entry for 0.16

* update readme

* fix: update the delta-inspect CLI to be built again by Cargo

This sort of withered on the vine a bit, this pull request allows it to be built
properly again

* update readme

* chore: bump the version of the Rust crate

* fix: unify environment variables referenced by Databricks docs

Long-term fix will be for Databricks to release a Rust SDK for Unity 😄

Fixes delta-io#1627

* feat: support CREATE OR REPLACE

* docs: get docs.rs configured correctly again (delta-io#1693)

# Description

The docs build was changed in delta-io#1658 to compile on docs.rs with all
features, but our crate cannot compile with all-features due to the TLS
features, which are mutually exclusive.

# Related Issue(s)

For example:

- closes delta-io#1692

This has been tested locally with the following command:

```
cargo doc --features azure,datafusion,datafusion,gcs,glue,json,python,s3,unity-experimental
```

* fix!: ensure predicates are parsable (delta-io#1690)

# Description
Resolves two issues that impact Datafusion implemented operators

1. When a user has an expression with a built-in scalar function
we are unable to parse the output predicate since the
`DummyContextProvider`'s methods are unimplemented. The provider now
uses the user provided state or a default. More work is required in the
future to allow a user provided Datafusion state to be used during the
conflict checker.

2. The string representation was not parsable by sqlparser since it was
not valid SQL. New code was written to transform an expression into a
parsable sql string. Current implementation is not exhaustive however
common use cases are covered.

The delta_datafusion.rs file is getting large so I transformed it into a
module.

This implementation makes reuse of some code from Datafusion. I've added
the Apache License at the top of the file. Let me know if anything else is
required to be compliant.


# Related Issue(s)
- closes delta-io#1625

---------

Co-authored-by: Will Jones <[email protected]>

* fix typo in readme

* fix: address formatting errors

* fix: remove an unused import

* feat(python): expose delete operation (delta-io#1687)

# Description
Naively expose the delete operation, with the option to provide a
predicate.

I first tried to expose a richer API with the Python `FilterType` and
DNF expressions, but from what I understand delta-rs doesn't implement
generic filters but only `PartitionFilter`. The `DeleteBuilder` also
only accepts datafusion expressions. So Instead of hacking my way around
or proposing a refactor I went for the simpler approach of sending a
string predicate to the rust lib.

If this implementation is OK I will add tests.

# Related Issue(s)
- closes delta-io#1417

---------

Co-authored-by: Will Jones <[email protected]>

* docs(python): document the delete operation

* Introduce some redundant type definitions to the mypy stub

* chore: fix new clippy lints introduced in Rust 1.73

* Update the sphinx ignore for building

=_=

* Enable prebuffer

* implement issue 1169

* fix format

* feat: add version number in `.history()` and display in reversed chronological order (delta-io#1710)

# Description
Adds the version number to each commit info.

# Related Issue(s)
- Closes delta-io#1561
- Closes delta-io#1680

---------

Co-authored-by: R. Tyler Croy <[email protected]>

* feat(python): expose UPDATE operation (delta-io#1694)

# Description

- Exposes UPDATE operation to Python.
- Added two test cases, with predicate and without
- Took some learnings in simplifying the code (will apply it in MERGE PR
as well)


# Related Issue(s)
Closes delta-io#1505

---------

Co-authored-by: Will Jones <[email protected]>

* fix: merge operation with string predicates (delta-io#1705)

# Description
Fixes an issue when users use string predicates with the merge
operation.

Parsing a string predicate did not properly handle table references and
would always assume a bare table with a table name of the empty string.
Now the qualifier is `None` however a `DFSchema` with qualifiers can be
supplied where it makes sense.

Now users must provide source and target aliases whenever both sides
share a column name otherwise the operation will error out.

Minor refactoring of the expression parser was also done and allowed
using of case expressions.


# Related Issue(s)
- closes delta-io#1699

---------

Co-authored-by: Will Jones <[email protected]>

* refactor!: remove a layer of lifetimes from PartitionFilter (delta-io#1725)

# Description
This commit removes a bunch of lifetime restrictions on the
`PartitionFilter` and `PartitionFilterValue` classes to make them easier
to use. While the original discussion in Slack and delta-io#1501 made mention of
using a reference type, there doesn't seem to be a need for it. A
particular instance of a `PartitionFilter` is created once and just
borrowed and read for the remainder of its life.

Functions, when necessary continue to accept the non-container types
(i.e, `&str` and `&[&str]`), allowing their containerized counterparts
to continue working with them without needing to borrow or clone the
containers (i.e, `String` and `Vec<String>`).

# Related Issue(s)
- resolves delta-io#1501 

# Documentation

* feat(python): expose MERGE operation (delta-io#1685)

# Description
This exposes MERGE commands to the Python API. The updates and
predicates are first kept in the class `TableMerger` and only dispatched
to Rust after `TableMerger.execute()`.

This was my first thought on how to implement it since I have limited
experience with Rust and PyO3 (still learning 😄). Maybe a more elegant
solution is that every class method on TableMerger is dispatched to Rust
and then the Rust MergeBuilder gets serialized and sent back to Python
(back and forth). Let me know your thoughts on this. If this is better,
I could also do this in the next PR, so we at least can push this one
out sooner.

Couple of issues at the moment, I need feedback on, where the first one
is blocking since I can't test it now:

~- Source_alias is not applying, somehow during a schema check the
prefix is missing, however when I printed the lines inside merge, it
showed the prefix correctly. So not sure where the issue is~
~- I had to make datafusion_utils public since I needed to get the
Expression Struct from it, is this the right way to do that? @Blajda~

Edit:
I will pull @Blajda's changes
delta-io#1705 once merged with develop:


# Related Issue(s)
closes  delta-io#1357

* chore: remove deprecated functions

* chore: bump the python package version (delta-io#1734)

* fix: reorder encode_partition_value() checks and add tests (delta-io#1733)

# Description
The `isinstance(val, datetime)` check was after `isinstance(val, date)`
which meant that it was never found. I added a test for each encoding
type.

---------

Co-authored-by: Robert Pack <[email protected]>

* Relax `pyarrow` pin

* fix: remove `pandas` pin (delta-io#1746)

# Description

Removes the `pandas` pin.

# Related Issue(s)

Resolves delta-io#1745

* docs: get docs.rs configured correctly again (delta-io#1693)

# Description

The docs build was changed in delta-io#1658 to compile on docs.rs with all
features, but our crate cannot compile with all-features due to the TLS
features, which are mutually exclusive.

# Related Issue(s)

For example:

- closes delta-io#1692

This has been tested locally with the following command:

```
cargo doc --features azure,datafusion,datafusion,gcs,glue,json,python,s3,unity-experimental
```

* Make this a patch release to fix docs.rs

* Remove the hdfs feature from the docsrs build

* refactor!: update operations to use delta scan (delta-io#1639)

# Description
Recently implemented operations did not use `DeltaScan` because it had some
gaps. These gaps would make it harder to switch towards logical plans, which
is required for merge.

Gaps:
- It was not possible to include file lineage in the result
- The subset of files to be scanned is known ahead of time. Users had to
reconstruct a parquet scan based on those files

The PR introduces a `DeltaScanBuilder` that allows users to specify which
files to use when constructing the scan, if the scan should be enhanced
to include additional metadata columns, and allows a projection to be
specified. It also retains previous functionality of pruning based on
the provided filter when files to scan are not provided.

`DeltaScanConfig` is also introduced which allows users to deterministically
obtain the names of any added metadata columns or allows them to specify
the name if required.

The public interface for `find_files` has changed but functionality
remains the same.

A new table provider was introduced which accepts a `DeltaScanConfig`.
This is required for future merge enhancements so unmodified files can
be pruned prior to writes.

---------

Co-authored-by: Robert Pack <[email protected]>

* chore: update datafusion (delta-io#1741)

Updates arrow and datafusion dependencies to latest.

* docs: convert docs to use mkdocs (delta-io#1731)

# Description
Completed the outstanding tasks in delta-io#1708

Also changed theme from readthedocs to mkdocs - both are built-in but
latter looks sleeker

# Related Issue(s)
closes delta-io#1708

---------

Co-authored-by: Robert Pack <[email protected]>
Co-authored-by: R. Tyler Croy <[email protected]>

* docs: dynamodb lock configuration (delta-io#1752)

# Description
I have added documentation in the API and also on the Python usage page
regarding this configuration. Please let me know if it is satisfactory,
and if not, I am more than happy to address any issues or make any
necessary adjustments.

# Related Issue(s)
- closes delta-io#1674

# Documentation

* feat: ignore binary columns for stats generation

* feat: honor appendOnly table config (delta-io#1747)

# Description
Throw an error if a transaction includes Remove action with data change
but the Delta Table is append-only.

# Related Issue(s)
- closes delta-io#352

* chore: fix building/running tests without the datafusion feature

This looks like an oversight that our CI didn't test because we have the
datafusion feature typically enabled for our tests. The build error would only
show up when building tests without it.

* add write support explicitly for pyarrow dataset

* feat(python): expose FSCK (repair) operation  (delta-io#1730)

# Description
This PR exposes the FSCK operation as a `repair` method under the
`DeltaTable `class.

# Related Issue(s)
- closes delta-io#1727

---------

Co-authored-by: Will Jones <[email protected]>

* refactor: perform bulk deletes during metadata cleanup

In addition to doing bulk deletes, I removed what seems like (at least to me)
unnecessary code. At its core, files are considered up for deletion
when their last_modified time is older than the cutoff time AND the version
is less than the specific version (usually the latest version).
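
The deletion rule stated above can be sketched as a simple predicate. `LogFileMeta`, `is_expired`, and `expired_versions` are hypothetical names for illustration; the real cleanup code works on object store listings rather than an in-memory slice.

```rust
/// Hypothetical metadata for one log file.
pub struct LogFileMeta {
    pub version: i64,
    /// Last-modified time in milliseconds since the Unix epoch.
    pub last_modified_ms: i64,
}

/// A file is up for deletion only when BOTH conditions hold: it is older
/// than the cutoff and its version is below the version to keep from.
pub fn is_expired(file: &LogFileMeta, cutoff_ms: i64, keep_from_version: i64) -> bool {
    file.last_modified_ms < cutoff_ms && file.version < keep_from_version
}

/// Collects every expired version so the files can be deleted in bulk.
pub fn expired_versions(
    files: &[LogFileMeta],
    cutoff_ms: i64,
    keep_from_version: i64,
) -> Vec<i64> {
    files
        .iter()
        .filter(|f| is_expired(f, cutoff_ms, keep_from_version))
        .map(|f| f.version)
        .collect()
}
```

Collecting the candidates first is what makes a single bulk delete possible, instead of issuing one delete call per file.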

* Make an attempt at improving the utilization of delete_stream for cleaning up expired logs

This change builds on @cmackenzie1's work and feeds the list stream directly into
the delete_stream with a predicate function to identify paths for deletion

* start to add vacuum into transaction log

* add vacuum operations in transaction log

* attempt to calculate size

* add test

* chore: bump Python package version

* fix: ignore inf in stats

* doc(README): remove typo

* enhance docs to enable multi-lingual examples

* use official Python API for references

* chore: refactor into the deltalake meta crate and deltalake-core crates

This puts the groundwork in place for starting to partition into smaller crates
in a simpler and more manageable fashion.

See delta-io#1713

* Correct the working directory for the parquet2 tests

* feat: add deltalake sql crate (delta-io#1757)

# Description

This is a fairly early draft to create logical plans from sql using the
datafusion abstractions. Adopted the patterns over there quite closely
since the ultimate goal would be to ask the datafusion community if they
would accept these changes within the datafusion sql crate ...

---------

Co-authored-by: R. Tyler Croy <[email protected]>

* rollback resolve bucket region change

---------

Co-authored-by: Robert Pack <[email protected]>
Co-authored-by: Robert Pack <[email protected]>
Co-authored-by: nohajc <[email protected]>
Co-authored-by: Will Jones <[email protected]>
Co-authored-by: R. Tyler Croy <[email protected]>
Co-authored-by: Denny Lee <[email protected]>
Co-authored-by: QP Hou <[email protected]>
Co-authored-by: haruband <[email protected]>
Co-authored-by: Ben Magee <[email protected]>
Co-authored-by: Constantin S. Pan <[email protected]>
Co-authored-by: Eero Lihavainen <[email protected]>
Co-authored-by: Mohammed Muddassir <[email protected]>
Co-authored-by: Christopher Watford <[email protected]>
Co-authored-by: Simon Vandel Sillesen <[email protected]>
Co-authored-by: Ion Koutsouris <[email protected]>
Co-authored-by: Matthew Powers <[email protected]>
Co-authored-by: Sébastien Diemer <[email protected]>
Co-authored-by: Cory Grinstead <[email protected]>
Co-authored-by: Trinity Xia <[email protected]>
Co-authored-by: hnaoto <[email protected]>
Co-authored-by: universalmind303 <[email protected]>
Co-authored-by: David Blajda <[email protected]>
Co-authored-by: Josiah Parry <[email protected]>
Co-authored-by: Guilhem de Viry <[email protected]>
Co-authored-by: Nikolay Ulmasov <[email protected]>
Co-authored-by: Cole Mackenzie <[email protected]>
Co-authored-by: ldacey <[email protected]>
Co-authored-by: Dave Hirschfeld <[email protected]>
Co-authored-by: David Blajda <[email protected]>
Co-authored-by: Brayan Jules <[email protected]>
Co-authored-by: emcake <[email protected]>
Co-authored-by: Junjun Dong <[email protected]>
Co-authored-by: Ion Koutsouris <[email protected]>
Co-authored-by: Deep145757 <[email protected]>
Labels
binding/rust Issues for the Rust crate rust
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Honor appendOnly table config in table writes
3 participants