
Unable to read Glue Spark external table when table path does not match the default db pattern #2062

Closed
yogyang opened this issue Jan 9, 2024 · 1 comment
Labels
bug Something isn't working

Comments

yogyang commented Jan 9, 2024

Environment

Delta-rs version:

0.8.1

Binding:
AWS Glue

Environment:

  • OS:
    ubuntu

Bug

What happened:
We have a table in AWS Glue created by Spark as an external table, db.fake, with the database location set to s3://bucket-staging/delta/. The resulting storage descriptor looks like this:

"StorageDescriptor": {
            "Columns": [
                {
                    "Name": "col",
                    "Type": "array<string>",
                    "Comment": "from deserializer"
                }
            ],
            "Location": "s3://bucket-staging/delta/fake-__PLACEHOLDER__",
            "InputFormat": "org.apache.hadoop.mapred.SequenceFileInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat",
            "Compressed": false,
            "NumberOfBuckets": -1,
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                "Parameters": {
                    "serialization.format": "1",
                    "path": "s3://bucket-staging/delta/fake/csv_logs"
                }
            },
            "BucketColumns": [],
            "SortColumns": [],
            "Parameters": {},
            "SkewedInfo": {
                "SkewedColumnNames": [],
                "SkewedColumnValues": [],
                "SkewedColumnValueLocationMaps": {}
            },
            "StoredAsSubDirectories": false
        }

You can see that:

StorageDescriptor.Location = s3://bucket-staging/delta/fake-__PLACEHOLDER__
StorageDescriptor.SerdeInfo.Parameters.path = s3://bucket-staging/delta/fake/csv_logs

The real S3 path for this table is the one in StorageDescriptor.SerdeInfo.Parameters.path.
Using this code:

from deltalake import DeltaTable
from deltalake import DataCatalog
database_name = "db"
table_name = "fake"
data_catalog = DataCatalog.AWS
dt = DeltaTable.from_data_catalog(data_catalog=data_catalog, database_name=database_name, table_name=table_name)

it throws this error:

File "/home/runner/.local/lib/python3.8/site-packages/deltalake/table.py", line 154, in from_data_catalog
    return cls(table_uri=table_uri, version=version)
  File "/home/runner/.local/lib/python3.8/site-packages/deltalake/table.py", line 122, in __init__
    self._table = RawDeltaTable(
deltalake.PyDeltaTableError: Not a Delta table: No snapshot or version 0 found, perhaps s3://bucket-staging/delta/fake is an empty dir?
Error: Process completed with exit code 1.

It seems the Glue catalog resolves the table location only from StorageDescriptor.Location, in:

let location = response
    .table
    .ok_or(GlueError::MissingMetadata {
        metadata: "Table".to_string(),
    })
    .map_err(<GlueError as Into<DataCatalogError>>::into)?
    .storage_descriptor
    .ok_or(GlueError::MissingMetadata {
        metadata: "Storage Descriptor".to_string(),
    })
    .map_err(<GlueError as Into<DataCatalogError>>::into)?
    .location
    .map(|l| l.replace("s3a", "s3"))
    .ok_or(GlueError::MissingMetadata {
        metadata: "Location".to_string(),
    });
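For illustration, here is a Python sketch (not the actual Rust implementation) of what the chain above does: only StorageDescriptor.Location is consulted, SerdeInfo.Parameters.path is never read, and the s3a rewrite is a plain substring replace rather than a scheme-only rewrite.

```python
def glue_location(response: dict) -> str:
    """Python sketch of the Rust resolution above: reads only
    StorageDescriptor.Location and rewrites "s3a" to "s3"."""
    table = response.get("Table")
    if table is None:
        raise KeyError("missing metadata: Table")
    sd = table.get("StorageDescriptor")
    if sd is None:
        raise KeyError("missing metadata: Storage Descriptor")
    location = sd.get("Location")
    if location is None:
        raise KeyError("missing metadata: Location")
    # Note: a substring replace, so any "s3a" occurring anywhere in the
    # path would also be rewritten, not just the URI scheme.
    return location.replace("s3a", "s3")


# Applied to the Glue response from this issue, the SerDe path is ignored:
resp = {"Table": {"StorageDescriptor": {
    "Location": "s3://bucket-staging/delta/fake-__PLACEHOLDER__",
    "SerdeInfo": {"Parameters": {"path": "s3://bucket-staging/delta/fake/csv_logs"}},
}}}
print(glue_location(resp))  # s3://bucket-staging/delta/fake-__PLACEHOLDER__
```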

What you expected to happen:

The table should be read from the correct path, i.e. StorageDescriptor.SerdeInfo.Parameters.path when it is set.

How to reproduce it:

Create an external table in Glue using Spark with an explicit location, then read that table from Glue with DeltaTable.from_data_catalog.
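Until resolution changes, one possible workaround is to fetch the table definition yourself and prefer the SerDe path over StorageDescriptor.Location before handing the URI to DeltaTable. The resolve_table_path helper below is hypothetical (not part of the deltalake API), and is demonstrated on the descriptor from this issue:

```python
def resolve_table_path(table: dict) -> str:
    """Prefer SerdeInfo.Parameters.path over StorageDescriptor.Location.

    `table` is the "Table" dict from a Glue get_table response.
    Hypothetical helper, not part of deltalake.
    """
    sd = table["StorageDescriptor"]
    serde_params = sd.get("SerdeInfo", {}).get("Parameters", {})
    location = serde_params.get("path") or sd["Location"]
    # Rewrite only the URI scheme, not arbitrary "s3a" substrings.
    return location.replace("s3a://", "s3://")


# Demonstrated on the StorageDescriptor from this issue:
table = {
    "StorageDescriptor": {
        "Location": "s3://bucket-staging/delta/fake-__PLACEHOLDER__",
        "SerdeInfo": {"Parameters": {"path": "s3://bucket-staging/delta/fake/csv_logs"}},
    }
}
print(resolve_table_path(table))  # s3://bucket-staging/delta/fake/csv_logs

# With AWS credentials configured, one could then do (not executed here):
#   import boto3
#   from deltalake import DeltaTable
#   glue = boto3.client("glue")
#   resp = glue.get_table(DatabaseName="db", Name="fake")
#   dt = DeltaTable(resolve_table_path(resp["Table"]))
```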

@yogyang added the bug label on Jan 9, 2024
@ion-elgreco closed this as not planned on Dec 7, 2024
@ion-elgreco (Collaborator) commented:
The catalog support is not in the library anymore.
