
[Feature Request] Enable LOAD DATA for delta tables. #1354

Open
PadenZach opened this issue Aug 24, 2022 · 5 comments
Labels
enhancement New feature or request

Comments

@PadenZach

Feature request

Overview

Currently, Delta Lake on Databricks has the COPY INTO DML statement, and vanilla Parquet datasets in Spark support the LOAD DATA DML statement. However, I can't find any existing work toward supporting either of these for Delta tables. There are also currently tests that make sure they raise not-supported warnings.
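
For reference, these existing statements look roughly like the following (a sketch only; the table and path names are made up, and COPY INTO is only available on Databricks today):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("existing-dml-examples").getOrCreate()

// COPY INTO, as it exists in Databricks SQL today:
spark.sql("""
  COPY INTO events
  FROM '/landing/events/2022-08-24/'
  FILEFORMAT = PARQUET
""")

// LOAD DATA, as Spark supports it for non-Delta (Hive/Parquet) tables:
spark.sql("""
  LOAD DATA INPATH '/landing/events/part-00000.parquet'
  INTO TABLE events PARTITION (dt = '2022-08-24')
""")
```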

Motivation

Currently Delta has great support for inserting and writing new data into Delta tables. However, patterns where we want to add an existing Parquet file to a Delta table currently require us to read the file into memory and then write it back out into the table.

Ideally, for cases where the file already exists and can simply be 'copied' into the Delta table, we would support a DML statement that does this while also tracking the change in the Delta log.

This would make some Delta use cases more efficient, for example writing staging partitions somewhere else, testing them, and then using LOAD DATA to load them into the final Delta table.
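
To make the current workaround concrete, here is a minimal sketch (paths, table names, and the proposed syntax are illustrative only, not an agreed design):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("load-data-workaround").getOrCreate()

// Today: the staging Parquet data has to be read and rewritten as brand-new
// files under the Delta table's directory.
spark.read
  .parquet("/staging/events/2022-08-24/")
  .write
  .format("delta")
  .mode("append")
  .save("/warehouse/events")

// Proposed: something like the statement below would instead copy the existing
// files into place and record the change in the Delta log, without rewriting
// the data (hypothetical syntax):
// spark.sql("LOAD DATA INPATH '/staging/events/2022-08-24/part-00000.parquet' " +
//   "INTO TABLE delta.`/warehouse/events` PARTITION (dt = '2022-08-24')")
```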

Further details

I'd need to look into this more to figure out exactly what would be needed, but I'd imagine something like the following (a rough sketch appears after the list):

  1. Inspect the Parquet file path for schema compatibility
  2. Check the partition spec
  3. Calculate the Delta log changes
  4. Copy the file into its new location
  5. Commit the transaction log
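
A very rough sketch of that flow against Delta's internal Scala classes (DeltaLog, OptimisticTransaction, AddFile), just to make the steps concrete. These are internal APIs whose exact names and signatures may differ across versions, the paths and partition values are made up, and the checks shown are nowhere near complete:

```scala
import org.apache.hadoop.fs.{FileUtil, Path}
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.delta.{DeltaLog, DeltaOperations}
import org.apache.spark.sql.delta.actions.AddFile

val spark = SparkSession.builder().appName("load-data-sketch").getOrCreate()

val sourceFile      = new Path("/staging/events/dt=2022-08-24/part-00000.parquet")
val tablePath       = new Path("/warehouse/events")
val partitionValues = Map("dt" -> "2022-08-24")

val deltaLog = DeltaLog.forTable(spark, tablePath.toString)
val txn      = deltaLog.startTransaction()

// 1. Inspect the Parquet file for schema compatibility with the table
//    (a real implementation would need a much more careful comparison).
val fileSchema = spark.read.parquet(sourceFile.toString).schema
require(fileSchema == txn.metadata.schema, "source file schema does not match the table schema")

// 2. Check the user-supplied partition spec against the table's partitioning.
require(txn.metadata.partitionColumns == partitionValues.keys.toSeq,
  "partition spec does not match the table's partition columns")

// 3 & 4. Copy the file into the table directory; the Delta log change itself
//        is the AddFile action built below.
val fs     = tablePath.getFileSystem(spark.sparkContext.hadoopConfiguration)
val target = new Path(tablePath, s"dt=2022-08-24/${sourceFile.getName}")
FileUtil.copy(fs, sourceFile, fs, target, false, fs.getConf)

// 5. Commit an AddFile action to the transaction log.
val status = fs.getFileStatus(target)
val addFile = AddFile(
  path = s"dt=2022-08-24/${sourceFile.getName}",
  partitionValues = partitionValues,
  size = status.getLen,
  modificationTime = status.getModificationTime,
  dataChange = true)
txn.commit(Seq(addFile), DeltaOperations.Write(SaveMode.Append))
```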

Willingness to contribute

The Delta Lake Community encourages new feature contributions. Would you or another member of your organization be willing to contribute an implementation of this feature?

  • Yes. I can contribute this feature independently.
  • Yes. I would be willing to contribute this feature with guidance from the Delta Lake community.
  • No. I cannot contribute this feature at this time.
    (I'd be willing, but if this isn't a good 'first' issue, it may require more knowledge/expertise than I have.)
@PadenZach PadenZach added the enhancement New feature or request label Aug 24, 2022
@nkarpov nkarpov self-assigned this Aug 30, 2022
@tdas
Contributor

tdas commented Sep 8, 2022

At a high level, it makes sense to support LOAD DATA on Delta where, after running a whole bunch of checks, the input file is copied to the delta table directory and added to the log.

My concern is that the whole bunch of checks that need to pass is going to be pretty complicated, given that the Delta protocol has so many restrictions on schema, etc. Furthermore, as we add more things like column mapping (where the column names in the Parquet file need to follow a special naming scheme to be valid) and data constraints, this can get pretty complex, since any of those features requires actually processing the data (i.e., blindly adding the file to the Delta log would corrupt the table). So while I agree that the absolutely simple case, where none of this additional stuff is enabled on a table, could take advantage of this, it's going to be a challenge to implement all the checks, and it will be very brittle from the user's point of view, since enabling many advanced features will break it.
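
For example, a minimal pre-flight guard could simply refuse the fast path when such features are enabled (a sketch only; the table name is a placeholder, and the property keys `delta.columnMapping.mode` and the `delta.constraints.` prefix are assumptions about how these features show up in table properties):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("load-data-preflight").getOrCreate()

// Collect the table properties; SHOW TBLPROPERTIES returns (key, value) rows.
val props = spark
  .sql("SHOW TBLPROPERTIES events")
  .collect()
  .map(row => row.getString(0) -> row.getString(1))
  .toMap

// Refuse to blindly add a file when features that change the physical file
// layout (column mapping) or require validating rows (CHECK constraints)
// are enabled on the table.
val columnMappingEnabled = props.get("delta.columnMapping.mode").exists(_ != "none")
val hasCheckConstraints  = props.keys.exists(_.startsWith("delta.constraints."))

require(!columnMappingEnabled && !hasCheckConstraints,
  "LOAD DATA fast path rejected: the table uses features that require rewriting or validating the data")
```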

@liquidaty

How about, as an initial scope for this feature, just aiming for basic parity with common RDBMS tools such as BCP? For our use case, the feature does not need to be very sophisticated to be of high value.

@zsxwing
Member

zsxwing commented Sep 14, 2022

This is definitely something worth looking at. We are not putting this on our roadmap (#1307) right now, as there are many items there already. But if anyone in the community would like to give it a try, feel free to discuss it in this issue.

@liquidaty

My company might consider contributing to this effort; the outcome likely depends on a few things, including others' support and various technicalities, such as:

  1. Can this feature be structured as a standalone code module, dynamic library, and/or CLI that interfaces with a service API operating on data passed from memory (so the source data file format (Parquet, JSON, etc.) is moot, since the API only cares about the cell data bytes sent to it)?
  2. Can any existing codebases (such as FreeTDS) be modified for this purpose?
  3. Are there any restrictions on the language or dev environment this would need to be built in (presumably, if the answer to question 1 is "yes", then the answer to this is "no")?

@nkarpov nkarpov removed their assignment Sep 20, 2022
@zsxwing
Member

zsxwing commented Sep 27, 2022

Can this feature be structured as a standalone code module, dynamic library, and/or CLI that interfaces with a service API operating on data passed from memory (so the source data file format (Parquet, JSON, etc.) is moot, since the API only cares about the cell data bytes sent to it)?

We would leverage Spark to read other source formats (Parquet, JSON, etc.); we don't want to rebuild those readers.

Can any existing codebases (such as FreeTDS) be modified for this purpose?

There is probably no existing code you can use as an example.

Are there any restrictions on the language or dev environment this would need to be built in (presumably, if the answer to question 1 is "yes", then the answer to this is "no")?

We prefer Scala, since the entire code path uses Scala heavily.
