
ORC-1403: ORC supports reading empty field name #1458

Open · wants to merge 2 commits into main
Conversation

cxzl25
Contributor

@cxzl25 cxzl25 commented Apr 7, 2023

What changes were proposed in this pull request?

Remove the empty field name check in ParserUtils.

Why are the changes needed?

apache/spark#35253 (comment)

java.lang.IllegalArgumentException: Empty quoted field name at 'struct<``^:string>'
    at org.apache.orc.impl.ParserUtils.parseName(ParserUtils.java:114)
    at org.apache.orc.impl.ParserUtils.parseStruct(ParserUtils.java:170)
    at org.apache.orc.impl.ParserUtils.parseType(ParserUtils.java:228)
    at org.apache.orc.TypeDescription.fromString(TypeDescription.java:202)
    at org.apache.orc.mapred.OrcInputFormat.buildOptions(OrcInputFormat.java:122)
    at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initialize(OrcColumnarBatchReader.java:130)
    at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$2(OrcFileFormat.scala:207) 

How was this patch tested?

Added a unit test.
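To illustrate what removing the check changes, here is a simplified, self-contained sketch of backtick-quoted field name parsing. It approximates, but does not copy, ORC's actual ParserUtils.parseName; with the commented-out empty check removed, a name written as two backticks parses to an empty string instead of throwing the exception shown in the stack trace above.

```java
// Simplified sketch of quoted-field-name parsing, approximating (not
// copying) ORC's ParserUtils.parseName. Doubled backticks inside quotes
// stand for a literal backtick; the commented-out check is the one this
// PR removes, so `` now parses to an empty name instead of throwing.
public class FieldNameParser {
  public static String parseName(String source, int pos) {
    if (source.charAt(pos) == '`') {
      StringBuilder sb = new StringBuilder();
      int i = pos + 1;
      while (i < source.length()) {
        char c = source.charAt(i);
        if (c == '`') {
          if (i + 1 < source.length() && source.charAt(i + 1) == '`') {
            sb.append('`'); // escaped backtick inside a quoted name
            i += 2;
            continue;
          }
          // Pre-ORC-1403 behavior rejected empty quoted names:
          // if (sb.length() == 0) {
          //   throw new IllegalArgumentException(
          //       "Empty quoted field name at '" + source + "'");
          // }
          return sb.toString();
        }
        sb.append(c);
        i++;
      }
      throw new IllegalArgumentException("Unmatched quote in '" + source + "'");
    }
    // Unquoted names: read identifier characters.
    int start = pos;
    while (pos < source.length()
        && (Character.isLetterOrDigit(source.charAt(pos)) || source.charAt(pos) == '_')) {
      pos++;
    }
    return source.substring(start, pos);
  }

  public static void main(String[] args) {
    // Field name starts right after "struct<" (index 7).
    System.out.println("[" + parseName("struct<``:string>", 7) + "]"); // prints []
  }
}
```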

@github-actions github-actions bot added the JAVA label Apr 7, 2023
IllegalArgumentException e = assertThrows(IllegalArgumentException.class, () -> {
    TypeDescription.fromString("struct<``:int>");
});
assertTrue(e.getMessage().contains("Empty quoted field name at 'struct<``^:int>'"));
Member

This test coverage shows that this was intentional behavior before.

Contributor

I'd probably prefer to just ignore it instead of removing it.

Member

@dongjoon-hyun dongjoon-hyun left a comment

Does Apache Hive support empty field names?

@cxzl25
Contributor Author

cxzl25 commented Apr 9, 2023

Does Apache Hive support empty field names?

See apache/spark#35253 (comment).

When spark.sql.orc.impl=native, writing an empty field name is supported but reading it fails; when spark.sql.orc.impl=hive, neither writing nor reading an empty field name is supported.

Before ORC-529 (released in 1.6.0), ORC supported reading empty field names.

Apache Hive does not support empty field names.
For example, the Hive SQL create table t as select '' will automatically generate _c1 if no alias is specified.

Member

@dongjoon-hyun dongjoon-hyun left a comment

@cxzl25, sorry, but the existing way is the correct direction. Since Apache Hive (HiveFileFormat) doesn't support it, we should not remove the test coverage testQuotedField2.

Apache Hive does not support empty field.

As you can see in https://issues.apache.org/jira/browse/SPARK-20901, the Apache Spark and ORC communities are working together to reduce the differences among different configurations.

@cxzl25
Contributor Author

cxzl25 commented Apr 10, 2023

Sorry but the existing way is the correct direction. When Apache Hive (HiveFileFormat) doesn't support it, we should not remove a test coverage, testQuotedField2 .

Currently the Spark ORC datasource does not check field names, but the Spark Hive ORC format does, which is why the Spark ORC datasource is able to write empty field names.
Should we check this at the Spark level and disallow writing such field names?

org.apache.spark.sql.execution.datasources.orc.OrcFileFormat 
org.apache.spark.sql.hive.execution.HiveFileFormat#supportFieldName

In addition, could we validate the schema in the ORC writer? Otherwise, data may be written that cannot be read back.

org.apache.orc.OrcFile.WriterOptions#setSchema
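The writer-side validation suggested above could look roughly like this. It is a hypothetical sketch with invented names, not ORC's actual OrcFile.WriterOptions API: reject empty field names up front, before any data is written.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of schema validation at writer setup. The class
// and method names are invented for illustration and are not part of
// ORC's real OrcFile.WriterOptions API.
public class SchemaValidator {
  public static void checkFieldNames(List<String> fieldNames) {
    for (String name : fieldNames) {
      if (name == null || name.isEmpty()) {
        throw new IllegalArgumentException(
            "Empty field name is not allowed in the writer schema");
      }
    }
  }

  public static void main(String[] args) {
    checkFieldNames(Arrays.asList("id", "name")); // passes silently
    try {
      checkFieldNames(Arrays.asList("id", ""));   // rejected up front
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```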

@dongjoon-hyun
Member

Ya, we can do some of that. However, instead of switching behaviors back and forth in this layer, I believe we need to focus on your end goals in the higher layers. What is your end goal as a user? For example, did you hit the following? If so, do we have test coverage in the Apache Spark codebase across multiple data sources (Hive, ORC, Parquet)? Can we start from there?

automatically generate _c1 if no alias is specified

@cxzl25
Contributor Author

cxzl25 commented Apr 10, 2023

What is your end goal as a user? For example, did you hit the following?

We have encountered some problems: some older versions of Spark used an older version of ORC to write data with empty field names, but reading that data with Spark 3.2 / ORC 1.6 fails.

We have injected custom rules into Spark through SparkSessionExtensions to add aliases automatically, which avoids this problem.

So I was thinking, it would be better if this problem could be solved at Spark or ORC level.

hive.autogen.columnalias.prefix.label
Default Value: _c
Added In: Hive 0.8.0

String used as a prefix when auto generating column alias. By default the prefix label will be appended with a column position number to form the column alias. Auto generation would happen if an aggregate function is used in a select clause without an explicit alias.
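The auto-aliasing rule quoted above can be sketched as follows; this is a simplified illustration of the naming scheme, not Hive's implementation:

```java
// Simplified illustration of Hive-style auto aliasing: an expression
// without an explicit alias gets <prefix><position>, where the prefix
// comes from hive.autogen.columnalias.prefix.label (default "_c").
public class AutoAlias {
  public static String alias(String explicitAlias, int position, String prefix) {
    return (explicitAlias == null || explicitAlias.isEmpty())
        ? prefix + position
        : explicitAlias;
  }

  public static void main(String[] args) {
    System.out.println(alias(null, 0, "_c")); // _c0
    System.out.println(alias("x", 1, "_c"));  // x
    System.out.println(alias(null, 2, "_c")); // _c2
  }
}
```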

@deshanxiao
Contributor

Based on the above scenario, and to avoid additional side effects, maybe we could make this limitation skippable by adding a new configuration?
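A configuration-gated check along those lines might look like this. The configuration key below is invented purely for illustration and is not a real ORC configuration:

```java
import java.util.Properties;

// Sketch of gating the empty-field-name check behind a configuration
// flag, as suggested. The key name is hypothetical, not a real ORC key.
public class EmptyNameConfig {
  static final String ALLOW_EMPTY = "orc.schema.allow.empty.field.names"; // hypothetical

  public static String checkName(String name, Properties conf) {
    boolean allow = Boolean.parseBoolean(conf.getProperty(ALLOW_EMPTY, "false"));
    if (name.isEmpty() && !allow) {
      throw new IllegalArgumentException("Empty quoted field name");
    }
    return name;
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    conf.setProperty(ALLOW_EMPTY, "true");
    System.out.println("[" + checkName("", conf) + "]"); // allowed when enabled
  }
}
```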
