
[FEA] Skip Databricks Photon jobs at app level in Qualification tool #886

Merged
5 commits merged on Mar 27, 2024

Conversation

cindyyuanjiang
Collaborator

@cindyyuanjiang cindyyuanjiang commented Mar 26, 2024

Fixes #804

This PR implements the logic to skip Databricks Photon jobs at the app level in the Qualification tool. Each Photon event log will be marked as SKIPPED in rapids_4_spark_qualification_output_status.csv and removed from the other generated files.

Changes

  • Add class SkippedQualAppResult to represent a skipped Qualification App creation, and a skippedCounter for console progress-bar visualization
  • Look for the keyword Photon when processing SQL plans and short-circuit processing if one is found
  • Add case class FailureApp to handle the different kinds of exceptions raised while creating a Qualification App
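The pieces above can be sketched roughly as follows. This is a hypothetical, simplified Scala sketch, not the PR's actual code: the real SkippedQualAppResult, FailureApp, and exception types live in the tools codebase and carry more fields than shown here.

```scala
// Hypothetical sketch; names mirror the PR description but are simplified.

// Exception used to short-circuit processing of a Photon event log.
case class PhotonEventLogException(message: String) extends Exception(message)

// Simplified stand-in for FailureApp: captures the status and description
// that end up in rapids_4_spark_qualification_output_status.csv.
case class FailureApp(status: String, description: String)

object PhotonSkipSketch {
  // Photon operators observed so far all start with the "Photon" prefix
  // (hence the review suggestion to use startsWith instead of contains).
  def isPhotonNode(nodeName: String): Boolean = nodeName.startsWith("Photon")

  // Scan plan node names and fail fast on the first Photon operator.
  def checkPlanNodes(nodeNames: Seq[String]): Unit = {
    if (nodeNames.exists(isPhotonNode)) {
      throw PhotonEventLogException(
        "Encountered Databricks Photon event log: skipping this file!")
    }
  }

  // Map an exception to the (status, description) pair written to the
  // status CSV; unrecognized exceptions fall through as plain failures.
  def toFailureApp(e: Exception): FailureApp = e match {
    case p: PhotonEventLogException =>
      FailureApp("SKIPPED", s"PhotonEventLogException: ${p.message}")
    case other =>
      FailureApp("FAILURE", other.getMessage)
  }
}
```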

Tests

spark_rapids_user_tools databricks-azure qualification --cpu_cluster my_cpu_cluster --eventlogs my_event_logs --tools_jar my_tools_jar

Before this PR:

rapids_4_spark_qualification_output.csv
App Name,App ID,Recommendation,Estimated GPU Speedup,Estimated GPU Duration,Estimated GPU Time Saved,SQL DF Duration,SQL Dataframe Task Duration,App Duration,GPU Opportunity,Executor CPU Time Percent,SQL Ids with Failures,Unsupported Read File Formats and Types,Unsupported Write Data Format,Complex Types,Nested Complex Types,Potential Problems,Longest SQL Duration,NONSQL Task Duration Plus Overhead,Unsupported Task Duration,Supported SQL DF Task Duration,Task Speedup Factor,App Duration Estimated,Unsupported Execs,Unsupported Expressions,Estimated Job Frequency (monthly),Cluster Tags
"Databricks Shell","app-20240220023733-0000","Not Recommended",1.0,3520729.91,22972.08,501797,1682938390,3543702,41203,41.59,"","JSON[timestamp:map:string:array:boolean:date:int:struct:bigint:dec:byte]","","struct;struct,size:bigint,modificationTime:bigint,dataChange:boolean,stats:string,tags:map,deletionVector:struct,baseRowId:bigint,defaultRowCommitVersion:bigint,clusteringProvider:string>;struct,size:bigint,tags:map,deletionVector:struct,baseRowId:bigint,defaultRowCommitVersion:bigint>;struct>,schemaString:string,partitionColumns:array,configuration:map,createdTime:bigint>;struct,writerFeatures:array>;struct,size:bigint,tags:map>;struct>;struct>;struct;struct,job:struct,notebook:struct,clusterId:string,readVersion:bigint,isolationLevel:string,isBlindAppend:boolean,operationMetrics:map,userMetadata:string,tags:map,engineInfo:string,txnId:string>","struct,size:bigint,modificationTime:bigint,dataChange:boolean,stats:string,tags:map,deletionVector:struct,baseRowId:bigint,defaultRowCommitVersion:bigint,clusteringProvider:string>;struct,size:bigint,tags:map,deletionVector:struct,baseRowId:bigint,defaultRowCommitVersion:bigint>;struct>,schemaString:string,partitionColumns:array,configuration:map,createdTime:bigint>;struct,writerFeatures:array>;struct,size:bigint,tags:map>;struct>;struct>;struct,job:struct,notebook:struct,clusterId:string,readVersion:bigint,isolationLevel:string,isBlindAppend:boolean,operationMetrics:map,userMetadata:string,tags:map,engineInfo:string,txnId:string>","UDF:NESTED COMPLEX TYPE",241288,516242735,1544550551,138387839,2.25,true,"SerializeFromObject;PhotonShuffleMapStage;PhotonGroupingAgg;PhotonShuffleExchangeSink;PhotonShuffleExchangeSource;PhotonProject;MapPartitions;PhotonRowToColumnar;Execute ClearCacheCommand$;Scan ExistingRDD;DeserializeToObject;PhotonFilter;LocalTableScan;Project;Scan json;PhotonResultStage;ObjectHashAggregate;PhotonScan parquet ;MapElements;PhotonParquetWriter;Execute OptimizeTableCommandEdge;PhotonWriteStage","UDF;from_json;filesizehistogramagg",30,"ClusterId -> 0220-023022-xvizld3;created_by -> at657h;DatabricksEnvironment -> workerenv-2876790984019916;ClusterName -> job-518534639051370-run-104068185572167-deDup_lteDist_cluster;BU_Name -> IQI-Standard-Compute-Pool;JobId -> 518534639051370;RunName -> DeviceAnalyzer Production;Creator -> [email protected];Vendor -> Databricks;mots_id -> 30277;contact_dl -> [email protected];env -> prod;BU_Mots_ID -> 26015"
rapids_4_spark_qualification_output_status.csv
Event Log,Status,Description
".../photon-eventlogs","SUCCESS","app-20240220023733-0000,Took 37812ms to process"
Console Log
| 24/03/26 14:04:08 INFO SuccessQualAppResult: File: .../photon-eventlogs, Message: Took 37812ms to process
| Qual Tool Progress 100% [=========================================================] (1 succeeded + 0 failed + 0 N/A) / 1
| 
| Qual Tool execution time: 37849ms
| 	process.success.count = 1
| 	process.failure.count = 0
| 	process.NA.count = 0
| 	execution.total.count = 1

After this PR:

rapids_4_spark_qualification_output.csv
App Name,App ID,Recommendation,Estimated GPU Speedup,Estimated GPU Duration,Estimated GPU Time Saved,SQL DF Duration,SQL Dataframe Task Duration,App Duration,GPU Opportunity,Executor CPU Time Percent,SQL Ids with Failures,Unsupported Read File Formats and Types,Unsupported Write Data Format,Complex Types,Nested Complex Types,Potential Problems,Longest SQL Duration,NONSQL Task Duration Plus Overhead,Unsupported Task Duration,Supported SQL DF Task Duration,Task Speedup Factor,App Duration Estimated,Unsupported Execs,Unsupported Expressions,Estimated Job Frequency (monthly)
rapids_4_spark_qualification_output_status.csv
Event Log,Status,Description
".../photon-eventlogs","SKIPPED","PhotonEventLogException: Encountered Databricks Photon event log: skipping this file!"
Console Log
| 24/03/26 14:08:35 WARN SkippedQualAppResult: File: .../photon-eventlogs, Message: PhotonEventLogException: Encountered Databricks Photon event log: skipping this file!
| Qual Tool Progress 100% [=============================================] (0 succeeded + 0 failed + 1 skipped + 0 N/A) / 1
| 
| Qual Tool execution time: 683ms
| 	process.success.count = 0
| 	process.failure.count = 0
| 	process.skipped.count = 1
| 	process.NA.count = 0
| 	execution.total.count = 1

@cindyyuanjiang cindyyuanjiang self-assigned this Mar 26, 2024
@cindyyuanjiang cindyyuanjiang added feature request New feature or request core_tools Scope the core module (scala) labels Mar 26, 2024
Collaborator

@nartal1 nartal1 left a comment


Thanks @cindyyuanjiang ! Just a couple of nits.

@@ -795,6 +795,10 @@ class QualificationAppInfo(
val sqlPlanInfoGraphEntry = SqlPlanInfoGraphBuffer.createEntry(sqlID, planInfo)
checkMetadataForReadSchema(sqlPlanInfoGraphEntry)
for (node <- sqlPlanInfoGraphEntry.sparkPlanGraph.allNodes) {
if (node.name.contains("Photon")) {
Collaborator

Nit: Do Photon operators always begin with the Photon keyword? If so, can we use startsWith instead of contains here?

Collaborator Author

I am not certain, but from previous observations in #449 (comment), it looks like they begin with "Photon". I will update this.

case gpuLog: GpuEventLogException =>
gpuLog.message
("unknow", gpuLog.message)
Collaborator

Nit: unknow -> unknown

Collaborator Author

Thanks, updated!

Signed-off-by: cindyyuanjiang <[email protected]>
Collaborator

@amahussein amahussein left a comment


Thanks @cindyyuanjiang !
I am fine with merging this as is for now, provided we investigate moving the detection logic into the eventlog-processing phase, as described in my comment.

No need to file a follow-up, as the investigation can be part of the existing #845, which is pretty similar to this feature.

@@ -795,6 +795,10 @@ class QualificationAppInfo(
val sqlPlanInfoGraphEntry = SqlPlanInfoGraphBuffer.createEntry(sqlID, planInfo)
checkMetadataForReadSchema(sqlPlanInfoGraphEntry)
for (node <- sqlPlanInfoGraphEntry.sparkPlanGraph.allNodes) {
if (node.name.startsWith("Photon")) {
Collaborator

I thought we could detect Photon in an earlier phase, as we process the event logs.
For example, doSparkListenerSQLExecutionStart() could check for certain conditions to disqualify the entire app.
The pros of having the code in doSparkListenerSQLExecutionStart():

  • For the Qualification tool, the performance bottleneck is actually in reading and processing the event logs. Therefore, if we fail during that stage, we save a significant amount of useless processing and computation.
  • If we need to propagate that logic to Profiling, it becomes much easier, as we only need to move the implementation to EventProcessorBase.doSparkListenerSQLExecutionStart instead of QualificationEventProcessor.doSparkListenerSQLExecutionStart.
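A rough sketch of that suggestion, with simplified stand-in types: SQLExecutionStartEvent below is hypothetical (the real listener receives a SparkListenerSQLExecutionStart event), and only the method name mirrors the comment.

```scala
// Hypothetical sketch of detecting Photon during event-log processing,
// rather than after all SQL plans have been parsed.
case class SQLExecutionStartEvent(physicalPlanDescription: String)

class EventProcessorSketch {
  // Checking here, while the event log is still being read, fails fast and
  // skips the expensive per-plan analysis that would otherwise follow.
  def doSparkListenerSQLExecutionStart(event: SQLExecutionStartEvent): Unit = {
    if (event.physicalPlanDescription.contains("Photon")) {
      throw new IllegalStateException(
        "Encountered Databricks Photon event log: skipping this file!")
    }
  }
}
```

Placing the check on the shared base class (rather than the Qualification-specific processor) is what would let the Profiling tool inherit the same behavior for free.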

@amahussein amahussein merged commit b1916ea into NVIDIA:dev Mar 27, 2024
15 checks passed
@cindyyuanjiang cindyyuanjiang deleted the spark-rapids-tools-804 branch March 27, 2024 17:56
Successfully merging this pull request may close these issues.

[FEA] Qualification tool: Disqualify Databricks Photon jobs at app level