Prevent writing checkpoints with a version that does not exist in table state

I have seen this in a production environment where the same writer is issuing append transactions using the operations API, which returns the newly created version, such as 10.

If the caller then attempts to create a checkpoint for version 10, the operation will produce an inconsistency in the `_last_checkpoint` file if the caller's in-memory table state has *not* been reloaded since the append operation completed.

In this scenario the _delta_log/ directory may contain:

    .
    ├── 00000000000000000000.json
    ├── 00000000000000000001.json
    ├── 00000000000000000002.json
    ├── 00000000000000000003.json
    ├── 00000000000000000004.json
    ├── 00000000000000000005.json
    ├── 00000000000000000006.json
    ├── 00000000000000000007.json
    ├── 00000000000000000008.json
    ├── 00000000000000000009.json
    ├── 00000000000000000010.checkpoint.parquet
    ├── 00000000000000000010.json
    └── _last_checkpoint

While `_last_checkpoint` contains the following:

    {"num_of_add_files":null,"parts":null,"size":342,"size_in_bytes":95104,"version":9}

This will result in an error on any attempt to read the Delta table:

    >>> from deltalake import DeltaTable
    >>> dt = DeltaTable('.')
    [2023-11-14T18:05:59Z DEBUG deltalake_core::protocol] loading checkpoint from _delta_log/_last_checkpoint
    [2023-11-14T18:05:59Z DEBUG deltalake_core::table] update with latest checkpoint CheckPoint { version: 9, size: 342, parts: None, size_in_bytes: Some(95104), num_of_add_files: None }
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/tyler/venv/lib64/python3.11/site-packages/deltalake/table.py", line 253, in __init__
        self._table = RawDeltaTable(
                      ^^^^^^^^^^^^^^
    FileNotFoundError: Object at location /home/tyler/corrupted-table/_delta_log/00000000000000000009.checkpoint.parquet not found: No such file or directory (os error 2)
    >>>

To prevent this error condition, the create_checkpoint_for() function should ensure that the provided checkpoint version (used to write the `.checkpoint.parquet` file) matches the table state's version (used to write the `_last_checkpoint` file).

This has the added benefit of helping prevent the caller from passing in a nonsensical version number that could also lead to a broken table.

Sponsored-by: Scribd Inc
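
The guard described above can be illustrated with a minimal Python sketch. This is not the delta-rs implementation: the `create_checkpoint_for` below is a hypothetical stand-in that only borrows the name, the `state_version` and `log_dir` parameters and the `CheckpointVersionMismatch` exception are assumptions made for illustration, and the checkpoint it writes is an empty placeholder rather than a real Parquet file. It only shows why comparing the two versions before writing anything keeps the `.checkpoint.parquet` file name and the `_last_checkpoint` contents in agreement.

```python
import json
from pathlib import Path


class CheckpointVersionMismatch(Exception):
    """Hypothetical error: requested version differs from the loaded table state."""


def create_checkpoint_for(version: int, state_version: int, log_dir: Path) -> None:
    """Illustrative stand-in for the guard, not the real delta-rs function.

    `version` is what the caller asks to checkpoint; `state_version` is the
    version of the caller's in-memory table state (both names are assumptions).
    """
    # Refuse to write anything if the two versions disagree. Without this
    # check, the two files below would be derived from different versions.
    if version != state_version:
        raise CheckpointVersionMismatch(
            f"cannot checkpoint version {version}: "
            f"in-memory table state is at version {state_version}"
        )

    # The checkpoint file name is derived from the caller-supplied version,
    # zero-padded to 20 digits as in the directory listing above.
    checkpoint = log_dir / f"{version:020d}.checkpoint.parquet"
    checkpoint.write_bytes(b"")  # placeholder; a real checkpoint is Parquet data

    # The _last_checkpoint pointer is derived from the table state's version.
    pointer = {"version": state_version}
    (log_dir / "_last_checkpoint").write_text(json.dumps(pointer))
```

With a stale state, for example `create_checkpoint_for(10, 9, log_dir)`, the sketch raises before touching the log directory, so a reader never finds a `_last_checkpoint` that points at a missing checkpoint file. Reloading the table state first and then calling it with matching versions writes both files consistently.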