get_add_actions does not return any records #2507
Comments
CREATE OR REPLACE TABLE AS SELECT
statement
@antonsteenvoorden the previous behavior was incorrect, as it would not remove the added file actions when doing a create or replace. Now it properly does what it's supposed to: doing a create with mode overwrite replaces the table and removes the content.
@ion-elgreco thanks for your reply. Are you referring to #2437? If I understand correctly, that PR fixes a bug that occurs when writing a table with delta-rs. The problem described above by @antonsteenvoorden, however, is about an issue when reading a table with delta-rs that was written using Spark. I've delved into this problem a bit further so I can provide more details of what (I think) is going on:
Now to be honest I don't think this should be fixed on the delta-rs side. This approach of recreating a table in a location where there is already another table is of course hacky and dangerous and we didn't like it in the first place; we'll look for a different approach. It is also a clear violation of the delta protocol so we cannot expect you to support this. Thanks for your help and you may close the issue.
@diederikperdok thanks for the additional info, it wasn't entirely clear before. As it is an odd thing to see, it could be considered a bug or missing robustness of the log replay. I'll keep it open for now. @roeap do you have any input here? I think theoretically we should be able to read add, remove, add of the same path, as long as the sum of the path is not >1.
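To make the "add, remove, add of the same path" case concrete, here is a minimal sketch of log replay in plain Python. It is a simplified model written for illustration, not the delta-rs implementation: the `replay_log` function and the `(kind, path)` tuple representation are hypothetical, and real Delta actions carry more fields than a path.

```python
from typing import Iterable, List, Tuple

# Hypothetical, simplified action: (kind, path), where kind is "add" or "remove".
Action = Tuple[str, str]


def replay_log(actions: Iterable[Action]) -> List[str]:
    """Apply actions in log order and return the paths that are still active."""
    active = set()
    for kind, path in actions:
        if kind == "add":
            # A later add re-activates a path even if it was removed earlier.
            active.add(path)
        elif kind == "remove":
            # Removing a path that is not active is treated as a no-op.
            active.discard(path)
        else:
            raise ValueError(f"unknown action kind: {kind!r}")
    return sorted(active)


# add, remove, add of the same path: the file ends up active exactly once.
log = [
    ("add", "part=1/a.parquet"),
    ("remove", "part=1/a.parquet"),
    ("add", "part=1/a.parquet"),
    ("add", "part=2/b.parquet"),
    ("remove", "part=2/b.parquet"),
]
print(replay_log(log))
```

Because the active files are kept in a set, a path can never count more than once, which matches the "sum of the path is not >1" condition mentioned above.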
Ok, on 2nd look I might have misdiagnosed the issue. From the code it still looks like the It looks like that in our case certain actions that I can see in the deltalog .json for that version are simply not returned by
I dove into it again and it seems to have something to do with how checkpoints are handled since >= 0.15.2. Every time there is a checkpoint in the delta log, When I look into this checkpoint parquet file, I can see that the add actions are there (and there aren't remove actions for the same path) so I would expect to get output. This is also true for the oldest checkpoint that is still in our delta log. However, I still cannot reproduce this behavior :( To reproduce it I tried the following:
CREATE DATABASE IF NOT EXISTS mytmp;
DROP TABLE IF EXISTS mytmp.mytbl;
CREATE TABLE mytmp.mytbl
USING DELTA
PARTITIONED BY (part)
AS
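SELECT 1 AS part, "foo";
# Append 100 single-row commits with Spark (assumes an active `spark` session).
for i in range(100):
    spark.sql(f"INSERT INTO mytmp.mytbl SELECT {i} AS part, 'foo'")
# Read one version of the table with delta-rs and list its add actions.
from deltalake import DeltaTable
delta_table = DeltaTable("{your spark.sql.warehouse.dir}/mytmp.db/mytbl/", version=36)
delta_table.get_add_actions(flatten=True).to_pandas()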
I believe the behavior of 0.15.2 is correct here, while 0.15.1 is wrong. So there still is no explanation for why we get empty results for checkpoints on our production table on >=0.15.2 as @antonsteenvoorden described. Will let you know if I find out more.
Hi, this is still an issue that blocks us from bumping our delta-rs version, which is getting many great improvements. Any chance you can take a look at this, @ion-elgreco?
Maybe a possible reproduction would be
in the final ( Also the value of
@sherlockbeard in our actual use case, the select is not (and should not be) empty. We overwrite the last 2 weeks' worth of data. It could be that those last 2 weeks are temporarily missing, but we still have a few years' worth of data in this table, so those files should theoretically still show up in the
Nope @antonsteenvoorden
Sorry, I'm too busy with other stuff nowadays.
Environment
Delta-rs version: python-0.15.2 up until python-0.17.3
Binding: python
Bug
deltalake versions higher than v0.15.1 return an empty dataframe for get_add_actions after a CREATE OR REPLACE TABLE AS SELECT commit on our delta table.
What happened:
This correctly shows the columns we partitioned on
The columns of the dataframe are present and correct.
What you expected to happen:
(and to have records of course ;)
More details:
We are interested in the values for the partitioned columns. Since deltalake does not add the partition columns when reading a DeltaTable, we were using the get_add_actions method to join the partition values onto the table using the _file_uri.
(Side note: if anyone knows of a better way to do this, I would be happy to hear.)
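The join described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the reporters' actual code: the helper `attach_partition_values` is hypothetical, the add actions are assumed to be available as flattened rows with a relative `path` and `partition.<column>` keys, and each data row is assumed to carry the full file URI under `_file_uri`.

```python
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]


def attach_partition_values(
    rows: Iterable[Row],
    add_actions: Iterable[Row],
    table_root: str,
    uri_key: str = "_file_uri",
) -> List[Row]:
    """Return copies of rows with partition values looked up by file URI."""
    root = table_root.rstrip("/")
    lookup: Dict[str, Row] = {}
    for action in add_actions:
        # Assumption: partition columns are flattened as "partition.<name>".
        partitions = {
            key.split(".", 1)[1]: value
            for key, value in action.items()
            if key.startswith("partition.")
        }
        # Assumption: add-action paths are relative to the table root.
        lookup[f"{root}/{action['path']}"] = partitions
    # Rows whose file has no matching add action are returned unchanged.
    return [{**row, **lookup.get(row[uri_key], {})} for row in rows]


add_actions = [
    {"path": "part=1/a.parquet", "partition.part": 1},
    {"path": "part=2/b.parquet", "partition.part": 2},
]
rows = [{"_file_uri": "s3://bucket/tbl/part=2/b.parquet", "value": "foo"}]
print(attach_partition_values(rows, add_actions, "s3://bucket/tbl/"))
```

With this approach, an empty set of add actions leaves every row without partition values, which is consistent with the symptom reported in this issue.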
Initially, we found no issues switching to v0.17.3. However, the values for the partition columns were suddenly missing. We looked at the table history (see below) and found that it broke after an ETL job caused a CREATE OR REPLACE TABLE AS SELECT.
We found that with v0.17.3 it still worked for version 217 of our table but is broken from version 218 onwards.
We tried all versions in between to pinpoint which release broke it, and it appears to be broken from v0.15.2 onwards.
v0.15.1 continues to work for our delta table after this commit and does not display this behavior.
We are not familiar enough with the codebase and Rust to determine whether this is by design or broken.
How to reproduce it:
We were unable to. We tried to do the below on a small example:
Reading is just fine, then we do the potentially troublesome operation:
But afterwards, we are still able to read using v0.17.3