Hive table is case-insensitive and its location contains only Parquet files whose schema has mixed-case column names (e.g. componentId, userName); after enabling Blaze, Spark SQL with a mixed-case filter condition returns no data #670

Closed
frencopei opened this issue Nov 29, 2024 · 1 comment · Fixed by #777
Labels: bug

Comments


frencopei commented Nov 29, 2024

Describe the bug
The Hive table is case-insensitive, and the table location contains only Parquet files whose schema has mixed-case column names (e.g. componentId, userName). After enabling Blaze, a Spark SQL query with a mixed-case filter condition returns no data.
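For reference, the baseline behavior this relies on: with spark.sql.caseSensitive=false, vanilla Spark resolves a mixed-case column in a predicate against a lowercase schema. A minimal sketch with in-memory data (not the production table) showing that resolution succeeding; the report is that the same query shape stops returning rows once Blaze is enabled:

    // Baseline sketch (hypothetical in-memory data): with spark.sql.caseSensitive=false,
    // the mixed-case predicate column `componentId` resolves against the lowercase
    // schema column `componentid` and the row comes back.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[1]").appName("case-baseline").getOrCreate()
    import spark.implicits._

    spark.conf.set("spark.sql.caseSensitive", "false")

    val df = Seq(("649409512", "255")).toDF("dnum", "componentid") // lowercase schema
    df.createOrReplaceTempView("t")

    // Case-insensitive resolution: returns one row.
    spark.sql("select dnum from t where componentId = '255'").show()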

To Reproduce
Steps to reproduce the behavior:

  1. spark.sql("set spark.sql.caseSensitive=false")
    
  2. val executSql = """
        select  dnum
                 from report.tb_39e85e2e76e444e195c6db2df728751e_34b7dfe549
                 where dt between '2024-11-20' and   '2024-11-27'
                 and componentId='255'    limit 50
     """
    
  3. val df = spark.sql(executSql) 
     println(df.schema)
     df.show(10)
    
  4. Package the Scala application into a jar.

  5. spark-submit --class com.***.myapp.Test --master yarn --conf spark.sql.hive.convertMetastoreParquet=true --conf spark.blaze.enable=true --conf spark.sql.extensions=org.apache.spark.sql.blaze.BlazeSparkSessionExtension --conf spark.shuffle.manager=org.apache.spark.sql.execution.blaze.shuffle.BlazeShuffleManager --conf spark.sql.caseSensitive=false cosn://dc-sh-prod-03-1323003688/tasklibs/spark3.2.2_myapp.jar

  6. Executor logs (with Blaze enabled):

Test SQL:

       select  dnum, 3680 as moneys
                from report.tb_39e85e2e76e444e195c6db2df728751e_34b7dfe549
                where to_date_udf(year,month,day) between  date_sub('2024-11-27',7) and  '2024-11-27'
                and componentId='255' limit 50

userGroupInfo.getUserField : dnum

StructType(StructField(dnum,StringType,true), StructField(moneys,IntegerType,false))
+----+------+
|dnum|moneys|
+----+------+
+----+------+

Obviously, it returns no data; the filter condition on componentId is what causes this (a condensed standalone sketch follows).
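Below is a condensed, self-contained reproduction sketch. The database, table name, and path are hypothetical stand-ins for the production objects above; the essential shape is identical: Parquet files whose footer uses mixed-case names, registered under an all-lowercase Hive schema, then filtered by the mixed-case name.

    // Reproduction sketch (hypothetical database/table/path; submit with the Blaze
    // confs from step 5 for the failing case, or without them for the working case).
    import org.apache.spark.sql.SparkSession

    object BlazeCaseSensitivityRepro extends App {
      val spark = SparkSession.builder()
        .appName("blaze-case-sensitivity-repro")
        .enableHiveSupport()
        .getOrCreate()
      import spark.implicits._

      spark.sql("set spark.sql.caseSensitive=false")

      // 1. Write Parquet files whose footer schema uses a mixed-case column name.
      val location = "hdfs:///tmp/blaze_case_repro" // hypothetical path
      Seq(("649409512", "255"), ("666687060", "256"))
        .toDF("dnum", "componentId") // note the upper-case 'I'
        .write.mode("overwrite").parquet(location)

      // 2. Register a Hive external table over those files; the metastore stores
      //    the column names in lowercase.
      spark.sql("drop table if exists default.blaze_case_repro")
      spark.sql(
        s"""create external table default.blaze_case_repro (
           |  dnum string,
           |  componentid string
           |) stored as parquet
           |location '$location'""".stripMargin)

      // 3. Filter on the mixed-case name. Without Blaze this returns one row;
      //    with spark.blaze.enable=true it returned no rows in our environment.
      spark.sql("select dnum from default.blaze_case_repro where componentId = '255'").show()
    }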

Expected behavior (when spark.blaze.enable=false):

SQL imported by the automated analysis task:

       select  dnum, 3680 as moneys
                from report.tb_39e85e2e76e444e195c6db2df728751e_34b7dfe549
                where to_date_udf(year,month,day) between  date_sub('2024-11-27',7) and  '2024-11-27'
                and componentId='255' limit 50

dataframe schema:

StructType(StructField(dnum,StringType,true), StructField(moneys,IntegerType,false))
+---------+------+
| dnum|moneys|
+---------+------+
|649409512| 3680|
|666687060| 3680|
|667198577| 3680|
|672462560| 3680|
|668511291| 3680|
|661643626| 3680|
|669103964| 3680|
|660927197| 3680|
|671793888| 3680|
|637719401| 3680|
+---------+------+
only showing top 10 rows

Appendix
A. Hive table create script:

CREATE EXTERNAL TABLE report.tb_39e85e2e76e444e195c6db2df728751e_34b7dfe549(
android_id string,
systempid string,
appnm string,
appversion string,
appversioncode string,
biversion string,
cardstyleid string,
city string,
clientdatetime string,
componentcontentid string,
componentid string,
componentname string,
componentposition string,
componenttypeid string,
componentversion string,
datasource string,
dateofweek string,
datetime string,
dayofquarter string,
dayofyear string,
deviceid string,
devicetype string,
dnum string,
hour string,
id string,
imei string,
ip string,
launcherversionname string,
launcherdnum string,
launchervercode string,
mac string,
minute string,
nation string,
networktype string,
packagenm string,
phonetype string,
postconfigversion string,
projectid string,
province string,
region string,
remote_addr string,
scenetemplateid string,
scenetemplatename string,
second string,
sendtime string,
signature string,
systype string,
sysversion string,
systemvercode string,
tabposition string,
tclosversion string,
type string,
userid string,
weekofyear string,
wlanmac string,
xforwarded string,
packagename string,
componentstatus string,
musicstatus string,
componenttitle string,
vid string,
receipttime string)
PARTITIONED BY (
year bigint,
month bigint,
day bigint,
cleanhour bigint)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
'hdfs://xxxxxx/data/report/584f9c5bab31fb1d59e138e1/39e85e2e76e444e195c6db2df728751e/34B7DFE549'

B. Parquet schema at the table location: the attached image is not reproduced here; per the report, the file schema uses mixed-case names such as componentId.
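A small sketch (spark-shell style; table name and location taken verbatim from the DDL above) for confirming the mismatch: the metastore reports all-lowercase names, while the Parquet footers at the location report mixed-case names.

    // Compare the metastore schema with the Parquet footer schema.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

    // Metastore view of the table: all-lowercase names (componentid, ...).
    spark.table("report.tb_39e85e2e76e444e195c6db2df728751e_34b7dfe549").printSchema()

    // Raw Parquet files at the table location: mixed-case names (componentId, ...).
    spark.read
      .parquet("hdfs://xxxxxx/data/report/584f9c5bab31fb1d59e138e1/39e85e2e76e444e195c6db2df728751e/34B7DFE549")
      .printSchema()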

richox added the bug label Dec 3, 2024

github-actions bot commented Jan 4, 2025

This issue is stale because it has been open for 30 days with no activity.

github-actions bot added the stale label Jan 4, 2025
richox self-assigned this Jan 15, 2025
github-actions bot removed the stale label Jan 16, 2025
richox mentioned this issue Jan 19, 2025