
[BUG] Include Additional Metadata Fields in PhotonScan Node Parsing #1385

Open
parthosa opened this issue Oct 17, 2024 · 1 comment

Labels: bug (Something isn't working), core_tools (Scope the core module (scala))

@parthosa (Collaborator)

Photon Scan nodes contain additional metadata fields not present in Spark Scan nodes, such as RequiredDataFilters and DictionaryFilters.

Excerpt from the JSON representation of a PhotonScan node, showing the relevant fields:

  "nodeName" : "PhotonScan parquet ",
  "simpleString" : "PhotonScan parquet [ss_sold_time_sk#806,ss_hdemo_sk#810,ss_store_sk#812,ss_sold_date_sk#828] DataFilters: [isnotnull(ss_hdemo_sk#810), isnotnull(ss_sold_time_sk#806), isnotnull(ss_store_sk#812)], DictionaryFilters: [], Format: parquet, Location: InMemoryFileIndex(1 paths)[s3://ndsv2-data/parquet_sf3000/store_sales], PartitionFilters: [], ReadSchema: struct<ss_sold_time_sk:int,ss_hdemo_sk:int,ss_store_sk:int>, RequiredDataFilters: [isnotnull(ss_hdemo_sk#810), isnotnull(ss_sold_time_sk#806), isnotnull(ss_store_sk#812)]",
  "metadata" : {
    "Location" : "InMemoryFileIndex(1 paths)[s3://*****/parquet_sf3000/store_sales]",
    "ReadSchema" : "struct<ss_sold_time_sk:int,ss_hdemo_sk:int,ss_store_sk:int>",
    "Format" : "parquet",
    "RequiredDataFilters" : "[isnotnull(ss_hdemo_sk#810), isnotnull(ss_sold_time_sk#806), isnotnull(ss_store_sk#812)]",
    "DictionaryFilters" : "[]",
    "PartitionFilters" : "[]",
    "DataFilters" : "[isnotnull(ss_hdemo_sk#810), isnotnull(ss_sold_time_sk#806), isnotnull(ss_store_sk#812)]"
  },

We store this metadata in data_source_information.csv and should ensure these additional fields are included when parsing the PhotonScan node.
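
For illustration, a minimal Scala sketch of how the extra keys could be read from a scan node's metadata map (the "metadata" object shown above, which Spark exposes as SparkPlanInfo.metadata). The ScanFilters case class and extractScanFilters helper are hypothetical names, not the existing parser API:

  // Minimal sketch (not the existing parser code): pull the Photon-only keys
  // out of a scan node's metadata map. Names here are hypothetical.
  case class ScanFilters(
      dataFilters: String,
      partitionFilters: String,
      requiredDataFilters: String,  // Photon-only; absent for regular Spark Scan nodes
      dictionaryFilters: String)    // Photon-only; absent for regular Spark Scan nodes

  def extractScanFilters(metadata: Map[String, String]): ScanFilters = {
    // getOrElse keeps non-Photon event logs working: the Photon-only keys
    // simply fall back to an empty string.
    ScanFilters(
      dataFilters = metadata.getOrElse("DataFilters", ""),
      partitionFilters = metadata.getOrElse("PartitionFilters", ""),
      requiredDataFilters = metadata.getOrElse("RequiredDataFilters", ""),
      dictionaryFilters = metadata.getOrElse("DictionaryFilters", ""))
  }

With empty-string defaults, non-Photon scans would simply produce empty values for the two new fields.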

@parthosa (Collaborator, Author) commented Oct 25, 2024

The metadata properties RequiredDataFilters and DictionaryFilters are present only in Photon event logs. This metadata is saved in the data_source_information.csv file.

Current schema of data_source_information.csv:

root
 |-- appIndex: integer (nullable = true)
 |-- sqlID: integer (nullable = true)
 |-- sql_plan_version: integer (nullable = true)
 |-- nodeId: integer (nullable = true)
 |-- format: string (nullable = true)
 |-- buffer_time: integer (nullable = true)
 |-- scan_time: integer (nullable = true)
 |-- data_size: long (nullable = true)
 |-- decode_time: integer (nullable = true)
 |-- location: string (nullable = true)
 |-- pushedFilters: string (nullable = true)
 |-- schema: string (nullable = true)
 |-- data_filters: string (nullable = true)
 |-- partition_filters: string (nullable = true)
 |-- from_final_plan: boolean (nullable = true)

After adding these fields, there will be two new columns:

root
 |-- appIndex: integer (nullable = true)
 ...
 |-- data_filters: string (nullable = true)
 |-- partition_filters: string (nullable = true)
 |-- required_data_filters: string (nullable = true)
 |-- dictionary_filters: string (nullable = true)
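
To make the proposed columns concrete, a small Spark StructType sketch of the two appended fields is below. The tool may not actually build this CSV through a Spark schema object, so treat this purely as an illustration of the column names and types:

  import org.apache.spark.sql.types.{StringType, StructField, StructType}

  // Illustration only: the two proposed columns, appended to whatever schema the
  // data_source_information table already uses. Names mirror the CSV column names.
  val photonOnlyColumns = Seq(
    StructField("required_data_filters", StringType, nullable = true),
    StructField("dictionary_filters", StringType, nullable = true))

  def extendedSchema(existing: StructType): StructType =
    StructType(existing.fields ++ photonOnlyColumns)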

Questions

  • The columns required_data_filters and dictionary_filters will always be empty for non-Photon event logs. Should we use a dynamic schema to exclude these columns for non-Photon event logs, or keep them as empty fields? (A rough sketch of what the dynamic-schema option could look like is below.)
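
Purely to make the question concrete, here is a rough sketch of the dynamic-schema option; nothing here is existing writer code, and the static-schema option would simply keep the columns and write empty strings:

  // Rough sketch of the dynamic-schema option only (hypothetical helper):
  // drop the two Photon-only columns when every row has an empty value for them.
  def pruneEmptyPhotonColumns(
      header: Seq[String],
      rows: Seq[Seq[String]]): (Seq[String], Seq[Seq[String]]) = {
    val photonOnly = Set("required_data_filters", "dictionary_filters")
    // Indices of Photon-only columns that are empty in every row.
    val dropIdx = header.zipWithIndex.collect {
      case (name, idx) if photonOnly(name) && rows.forall(_(idx).isEmpty) => idx
    }.toSet
    val keep = header.indices.filterNot(dropIdx)
    (keep.map(header), rows.map(row => keep.map(row)))
  }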

cc: @amahussein @tgravescs @mattahrens
