Hive table is case-insensitive and its location contains only Parquet files whose schema has mixed-case column names (e.g. componentId, userName); after enabling Blaze, Spark SQL with a mixed-case filter condition returns no data #670

Closed
frencopei opened this issue Nov 29, 2024 · 1 comment · Fixed by #777
Labels: bug

Comments


frencopei commented Nov 29, 2024

Describe the bug
The Hive table is case-insensitive, and the table location contains only Parquet files whose schema has mixed-case column names (e.g. componentId, userName). After enabling Blaze, a Spark SQL query with a mixed-case filter condition returns no data.
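For reference, the baseline behavior this relies on: with spark.sql.caseSensitive=false, vanilla Spark resolves a mixed-case column in a predicate against a lowercase schema. A minimal sketch with in-memory data (not the production table) showing that resolution succeeding; the report is that the same query shape stops returning rows once Blaze is enabled:

    // Baseline sketch (hypothetical in-memory data): with spark.sql.caseSensitive=false,
    // the mixed-case predicate column `componentId` resolves against the lowercase
    // schema column `componentid` and the row comes back.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[1]").appName("case-baseline").getOrCreate()
    import spark.implicits._

    spark.conf.set("spark.sql.caseSensitive", "false")

    val df = Seq(("649409512", "255")).toDF("dnum", "componentid") // lowercase schema
    df.createOrReplaceTempView("t")

    // Case-insensitive resolution: returns one row.
    spark.sql("select dnum from t where componentId = '255'").show()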

To Reproduce
Steps to reproduce the behavior:

  1. spark.sql("set spark.sql.caseSensitive=false")
    
  2. val executSql = """
        select  dnum
                 from report.tb_39e85e2e76e444e195c6db2df728751e_34b7dfe549
                 where dt between '2024-11-20' and   '2024-11-27'
                 and componentId='255'    limit 50
     """
    
  3. val df = spark.sql(executSql) 
     println(df.schema)
     df.show(10)
    
  4. Package the Scala application into a jar.

  5. spark-submit --class com.***.myapp.Test --master yarn --conf spark.sql.hive.convertMetastoreParquet=true --conf spark.blaze.enable=true --conf spark.sql.extensions=org.apache.spark.sql.blaze.BlazeSparkSessionExtension --conf spark.shuffle.manager=org.apache.spark.sql.execution.blaze.shuffle.BlazeShuffleManager --conf spark.sql.caseSensitive=false cosn://dc-sh-prod-03-1323003688/tasklibs/spark3.2.2_myapp.jar

  6. Executor logs (with Blaze enabled):

Test SQL:

       select  dnum, 3680 as moneys
                from report.tb_39e85e2e76e444e195c6db2df728751e_34b7dfe549
                where to_date_udf(year,month,day) between  date_sub('2024-11-27',7) and  '2024-11-27'
                and componentId='255' limit 50

userGroupInfo.getUserField : dnum

StructType(StructField(dnum,StringType,true), StructField(moneys,IntegerType,false))
+----+------+
|dnum|moneys|
+----+------+
+----+------+

Obviously, it returns no data; the filter condition on componentId is what causes this (a condensed standalone sketch follows).
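Below is a condensed, self-contained reproduction sketch. The database, table name, and path are hypothetical stand-ins for the production objects above; the essential shape is identical: Parquet files whose footer uses mixed-case names, registered under an all-lowercase Hive schema, then filtered by the mixed-case name.

    // Reproduction sketch (hypothetical database/table/path; submit with the Blaze
    // confs from step 5 for the failing case, or without them for the working case).
    import org.apache.spark.sql.SparkSession

    object BlazeCaseSensitivityRepro extends App {
      val spark = SparkSession.builder()
        .appName("blaze-case-sensitivity-repro")
        .enableHiveSupport()
        .getOrCreate()
      import spark.implicits._

      spark.sql("set spark.sql.caseSensitive=false")

      // 1. Write Parquet files whose footer schema uses a mixed-case column name.
      val location = "hdfs:///tmp/blaze_case_repro" // hypothetical path
      Seq(("649409512", "255"), ("666687060", "256"))
        .toDF("dnum", "componentId") // note the upper-case 'I'
        .write.mode("overwrite").parquet(location)

      // 2. Register a Hive external table over those files; the metastore stores
      //    the column names in lowercase.
      spark.sql("drop table if exists default.blaze_case_repro")
      spark.sql(
        s"""create external table default.blaze_case_repro (
           |  dnum string,
           |  componentid string
           |) stored as parquet
           |location '$location'""".stripMargin)

      // 3. Filter on the mixed-case name. Without Blaze this returns one row;
      //    with spark.blaze.enable=true it returned no rows in our environment.
      spark.sql("select dnum from default.blaze_case_repro where componentId = '255'").show()
    }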

Expected behavior (when spark.blaze.enable=false):

SQL imported by the automated analysis task:

       select  dnum, 3680 as moneys
                from report.tb_39e85e2e76e444e195c6db2df728751e_34b7dfe549
                where to_date_udf(year,month,day) between  date_sub('2024-11-27',7) and  '2024-11-27'
                and componentId='255' limit 50

dataframe schema:

StructType(StructField(dnum,StringType,true), StructField(moneys,IntegerType,false))
+---------+------+
| dnum|moneys|
+---------+------+
|649409512| 3680|
|666687060| 3680|
|667198577| 3680|
|672462560| 3680|
|668511291| 3680|
|661643626| 3680|
|669103964| 3680|
|660927197| 3680|
|671793888| 3680|
|637719401| 3680|
+---------+------+
only showing top 10 rows

Appendix
A. Hive table create script:

CREATE EXTERNAL TABLE report.tb_39e85e2e76e444e195c6db2df728751e_34b7dfe549(
android_id string,
systempid string,
appnm string,
appversion string,
appversioncode string,
biversion string,
cardstyleid string,
city string,
clientdatetime string,
componentcontentid string,
componentid string,
componentname string,
componentposition string,
componenttypeid string,
componentversion string,
datasource string,
dateofweek string,
datetime string,
dayofquarter string,
dayofyear string,
deviceid string,
devicetype string,
dnum string,
hour string,
id string,
imei string,
ip string,
launcherversionname string,
launcherdnum string,
launchervercode string,
mac string,
minute string,
nation string,
networktype string,
packagenm string,
phonetype string,
postconfigversion string,
projectid string,
province string,
region string,
remote_addr string,
scenetemplateid string,
scenetemplatename string,
second string,
sendtime string,
signature string,
systype string,
sysversion string,
systemvercode string,
tabposition string,
tclosversion string,
type string,
userid string,
weekofyear string,
wlanmac string,
xforwarded string,
packagename string,
componentstatus string,
musicstatus string,
componenttitle string,
vid string,
receipttime string)
PARTITIONED BY (
year bigint,
month bigint,
day bigint,
cleanhour bigint)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
'hdfs://xxxxxx/data/report/584f9c5bab31fb1d59e138e1/39e85e2e76e444e195c6db2df728751e/34B7DFE549'

B. Parquet schema at the table location: the attached image is not reproduced here; per the report, the file schema uses mixed-case names such as componentId.
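A small sketch (spark-shell style; table name and location taken verbatim from the DDL above) for confirming the mismatch: the metastore reports all-lowercase names, while the Parquet footers at the location report mixed-case names.

    // Compare the metastore schema with the Parquet footer schema.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

    // Metastore view of the table: all-lowercase names (componentid, ...).
    spark.table("report.tb_39e85e2e76e444e195c6db2df728751e_34b7dfe549").printSchema()

    // Raw Parquet files at the table location: mixed-case names (componentId, ...).
    spark.read
      .parquet("hdfs://xxxxxx/data/report/584f9c5bab31fb1d59e138e1/39e85e2e76e444e195c6db2df728751e/34B7DFE549")
      .printSchema()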

richox added the bug label Dec 3, 2024

github-actions bot commented Jan 4, 2025

This issue is stale because it has been open for 30 days with no activity.

github-actions bot added the stale label Jan 4, 2025
richox self-assigned this Jan 15, 2025
github-actions bot removed the stale label Jan 16, 2025
richox mentioned this issue Jan 19, 2025