
Iceberg fails ACID verification test #10454

Closed
matthijseikelenboom opened this issue Jun 6, 2024 · 14 comments
Labels
bug Something isn't working

Comments

@matthijseikelenboom

Apache Iceberg version

1.5.2 (latest release)

Query engine

Spark

Please describe the bug 🐞

Problem statement

For work we needed concurrent read/write support for our data lake, which uses Spark. We were noticing some inconsistencies, so we wrote a test that can verify whether something like Iceberg adheres to ACID. We found, however, that Iceberg fails this test.

Now, it may be that we've misconfigured Iceberg, or that there is some mistake in the test code.

My question is whether someone can take a look at it and perhaps explain what is going wrong here.

To Reproduce

How to run the test and its findings are described in the README of the repository, but here is a short rundown.

Steps to reproduce the behavior:

  1. Check out repo: iceberg-acid-verification
  2. Start Docker if not already running
  3. Run the test TransactionManagerTest.java
  4. Observe that the test fails.

Expected behavior

  • I expect that the full number of transactions is executed and that Iceberg passes the ACID verification test

Environment Description

  • Iceberg version : 1.5.2
  • Spark version : 3.5.1
  • Hive version : 4.0.0-beta-1
  • Hadoop version : 3.2.2
  • Storage (HDFS/S3/GCS..) : NTFS(Windows), APFS(macOS) & HDFS

Additional context

It's worth noting that other solutions, Hudi and Delta Lake, have also been tested this way. Hudi also didn't pass this test initially, but that was resolved via a bug ticket on GitHub. Delta Lake did pass the test.

@matthijseikelenboom matthijseikelenboom added the bug Something isn't working label Jun 6, 2024
@nastra
Contributor

nastra commented Jun 6, 2024

@matthijseikelenboom can you please share details of what exactly is failing in which scenario, for people who aren't running the test? That makes it easier to reason about what the issue is and what your expectation within a particular scenario is.

@matthijseikelenboom
Author

@nastra The README of the repo describes in detail what the test does. But in short: the test fails because what it expects to be present on disk isn't there. As for the term "scenario", there is only one: a transaction is generated, picked up by a writer thread, and then inserted into the data lakehouse. The reader thread(s) then verify whether the transaction happened and whether it happened correctly.

@RussellSpitzer
Member

Is the Catalog Cache in use?

@matthijseikelenboom
Author

matthijseikelenboom commented Jun 6, 2024

SparkSessionProvider.java - line 68: .config("spark.sql.catalog.iceberghive.cache-enabled", "false")

If this is the property you mean, then no, it has been disabled. Though I've tested with this property set to true as well; it doesn't make a difference.

@RussellSpitzer
Member

Looking briefly at the code, it looks like the error may be in the retry logic for the transaction statements as written. From what I can tell, the code does not handle Iceberg validation exceptions. I had to put away my computer for a flight but will look again tomorrow.

@szehon-ho
Collaborator

szehon-ho commented Jun 7, 2024

Yeah, I think @RussellSpitzer has a good point.

I think this code https://github.com/matthijseikelenboom/iceberg-acid-verification/blob/master/src/main/java/org/example/writer/TransactionWriter.java#L167 swallows the exception.

Iceberg isolation was designed to work optimistically by throwing exceptions like ValidationException when it detects concurrent changes to the table that would break the given isolation level. The user then needs to retry the query.

Running this test, I saw some instances of IllegalStateException:

    Runtime file filtering is not possible: the table has been concurrently modified. Row-level operation scan snapshot ID: %s, current table snapshot ID: %s. If an external process modifies the table, enable table caching in the catalog. If multiple threads modify the table, use independent Spark sessions in each thread.

This is actually different from ValidationException, and is thrown because a certain optimization for MERGE INTO (runtime file filtering) seems to require the table to be completely up to date. If you want more lenient behavior that throws an exception only when isolation constraints are violated, this optimization can be turned off with spark.sql.optimizer.runtime.rowLevelOperationGroupFilter.enabled=false.
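As a sketch of where that flag would go in this test's setup (the builder chain mirrors the cache-enabled setting quoted earlier in the thread; the appName is an illustrative placeholder, not from the repo):

```java
// Hypothetical sketch: disabling runtime group filtering for row-level
// operations (MERGE/DELETE/UPDATE) when building the Spark session.
SparkSession spark = SparkSession.builder()
        .appName("acid-verification") // illustrative name
        .config("spark.sql.optimizer.runtime.rowLevelOperationGroupFilter.enabled", "false")
        .getOrCreate();
```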

But for either case, the test should handle the exception and retry.
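The retry pattern described above can be sketched as follows. This is a minimal, self-contained illustration, not the repo's actual code: ConcurrentModificationFailure stands in for Iceberg's real exception types (such as ValidationException and CommitFailedException in org.apache.iceberg.exceptions), and writeOnce simulates a commit that fails twice before succeeding, as concurrent committers might cause it to.

```java
public class RetryDemo {
    // Stand-in for Iceberg's optimistic-concurrency exceptions; the real
    // classes are not reproduced here.
    static class ConcurrentModificationFailure extends RuntimeException {}

    static int attemptsUsed = 0;

    // Simulated write: fails on the first two attempts, succeeds on the third.
    static void writeOnce(int attempt) {
        if (attempt < 2) throw new ConcurrentModificationFailure();
    }

    // The pattern the test needs: catch the conflict exception and retry,
    // instead of swallowing it and moving on.
    static boolean writeWithRetry(int maxRetries) {
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                writeOnce(attempt);
                attemptsUsed = attempt + 1;
                return true; // commit succeeded
            } catch (ConcurrentModificationFailure e) {
                // refresh table state / back off here before retrying
            }
        }
        return false; // retries exhausted; surface the failure to the caller
    }

    public static void main(String[] args) {
        System.out.println(writeWithRetry(5));
    }
}
```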

@RussellSpitzer
Member

RussellSpitzer commented Jun 7, 2024 via email

@matthijseikelenboom
Author

@szehon-ho I saw that I forgot to copy over the retry mechanism that we built. It has been included in the latest commit. Another interesting thing is that when you set withNumberOfWriterThreads to 1 and withNumberOfSparkSessionsForWriters also to 1, there are no concurrent writers, and the test still fails.

@RussellSpitzer interesting find, curious to what else you'll find

@RussellSpitzer
Member

RussellSpitzer commented Jun 7, 2024 via email

@RussellSpitzer
Member

RussellSpitzer commented Jun 7, 2024

I didn't go to sleep, here is the error. This is what your delete statements look like:

    DELETE FROM iceberghive.concurrencytestdb.acid_verification WHERE primaryKeyValue IN ("Record54""Record57""Record29")

The IN list has no separators because the code joins the values without a delimiter:

    var primaryKeyValues = transaction.dataManipulations
            .stream()
            .map(dataManipulation -> "\"" + dataManipulation.primaryKeyValue + "\"")
            .collect(Collectors.joining());
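A minimal fix is to pass a separator to Collectors.joining. A self-contained sketch (DeletePredicateDemo and buildInClause are illustrative names, not from the repo):

```java
import java.util.List;
import java.util.stream.Collectors;

public class DeletePredicateDemo {
    // Builds the IN (...) list; joining(", ") supplies the comma separator
    // that the original joining() omitted.
    static String buildInClause(List<String> primaryKeyValues) {
        return primaryKeyValues.stream()
                .map(v -> "\"" + v + "\"")
                .collect(Collectors.joining(", "));
    }

    public static void main(String[] args) {
        String in = buildInClause(List.of("Record54", "Record57", "Record29"));
        // Prints: DELETE FROM t WHERE primaryKeyValue IN ("Record54", "Record57", "Record29")
        System.out.println("DELETE FROM t WHERE primaryKeyValue IN (" + in + ")");
    }
}
```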

@RussellSpitzer
Member

The Delete verification is also wrong

        var deleteSucceededExpectation = ExpectRecordPresence.create(recordToDelete);

This should be ExpectedRecordAbsence

@RussellSpitzer
Member

RussellSpitzer commented Jun 7, 2024

I fixed all the bugs and the test suite now passes. I would strongly encourage you to write unit tests for frameworks like this to make sure they behave correctly before submitting issues to the project.

https://github.com/matthijseikelenboom/iceberg-acid-verification/pull/1/files

@matthijseikelenboom
Author

Oh wow, I'll take this back to the team at work! Thanks for taking the time, but my issue wasn't important enough to not go to sleep, I think 😅.

About the unit tests, those were written, but I didn't copy all of them over to my repo. I should have, because the ExpectRecordPresence should indeed have been an ExpectRecordAbsence, so that is a typo on my end.

@matthijseikelenboom
Author

Closing as this has been resolved
