
ORC-1403: ORC supports reading empty field name #1458

Open · wants to merge 2 commits into main
Conversation

cxzl25
Contributor

@cxzl25 cxzl25 commented Apr 7, 2023

What changes were proposed in this pull request?

Remove the empty field name check in ParserUtils.

Why are the changes needed?

apache/spark#35253 (comment)

java.lang.IllegalArgumentException: Empty quoted field name at 'struct<``^:string>'
    at org.apache.orc.impl.ParserUtils.parseName(ParserUtils.java:114)
    at org.apache.orc.impl.ParserUtils.parseStruct(ParserUtils.java:170)
    at org.apache.orc.impl.ParserUtils.parseType(ParserUtils.java:228)
    at org.apache.orc.TypeDescription.fromString(TypeDescription.java:202)
    at org.apache.orc.mapred.OrcInputFormat.buildOptions(OrcInputFormat.java:122)
    at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initialize(OrcColumnarBatchReader.java:130)
    at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$2(OrcFileFormat.scala:207) 

How was this patch tested?

Added a unit test.
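To illustrate what removing the check changes, here is a simplified, self-contained sketch of backtick-quoted field name parsing. It approximates, but does not copy, ORC's actual ParserUtils.parseName; with the commented-out empty check removed, a name written as two backticks parses to an empty string instead of throwing the exception shown in the stack trace above.

```java
// Simplified sketch of quoted-field-name parsing, approximating (not
// copying) ORC's ParserUtils.parseName. Doubled backticks inside quotes
// stand for a literal backtick; the commented-out check is the one this
// PR removes, so `` now parses to an empty name instead of throwing.
public class FieldNameParser {
  public static String parseName(String source, int pos) {
    if (source.charAt(pos) == '`') {
      StringBuilder sb = new StringBuilder();
      int i = pos + 1;
      while (i < source.length()) {
        char c = source.charAt(i);
        if (c == '`') {
          if (i + 1 < source.length() && source.charAt(i + 1) == '`') {
            sb.append('`'); // escaped backtick inside a quoted name
            i += 2;
            continue;
          }
          // Pre-ORC-1403 behavior rejected empty quoted names:
          // if (sb.length() == 0) {
          //   throw new IllegalArgumentException(
          //       "Empty quoted field name at '" + source + "'");
          // }
          return sb.toString();
        }
        sb.append(c);
        i++;
      }
      throw new IllegalArgumentException("Unmatched quote in '" + source + "'");
    }
    // Unquoted names: read identifier characters.
    int start = pos;
    while (pos < source.length()
        && (Character.isLetterOrDigit(source.charAt(pos)) || source.charAt(pos) == '_')) {
      pos++;
    }
    return source.substring(start, pos);
  }

  public static void main(String[] args) {
    // Field name starts right after "struct<" (index 7).
    System.out.println("[" + parseName("struct<``:string>", 7) + "]"); // prints []
  }
}
```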

@github-actions github-actions bot added the JAVA label Apr 7, 2023
IllegalArgumentException e = assertThrows(IllegalArgumentException.class, () -> {
    TypeDescription.fromString("struct<``:int>");
});
assertTrue(e.getMessage().contains("Empty quoted field name at 'struct<``^:int>'"));
Member

This test coverage shows that this was intentional behavior before.

Contributor

I'd probably prefer to just ignore it instead of removing it.

Member

@dongjoon-hyun dongjoon-hyun left a comment

Does Apache Hive support empty field names?

@cxzl25
Contributor Author

cxzl25 commented Apr 9, 2023

Does Apache Hive support empty field names?

See apache/spark#35253 (comment).

When spark.sql.orc.impl=native, writing an empty field name is supported but reading it fails; when spark.sql.orc.impl=hive, neither writing nor reading an empty field name is supported.

Before ORC-529 (released in 1.6.0), ORC supported reading empty field names.

Apache Hive does not support empty field names.
For example, the Hive SQL create table t as select '' will automatically generate _c1 if no alias is specified.

Member

@dongjoon-hyun dongjoon-hyun left a comment

@cxzl25, sorry, but the existing way is the correct direction. Since Apache Hive (HiveFileFormat) doesn't support it, we should not remove the test coverage testQuotedField2.

Apache Hive does not support empty field.

As you can see in https://issues.apache.org/jira/browse/SPARK-20901, the Apache Spark and ORC communities are working together to reduce the differences among different configurations.

@cxzl25
Contributor Author

cxzl25 commented Apr 10, 2023

Sorry but the existing way is the correct direction. When Apache Hive (HiveFileFormat) doesn't support it, we should not remove a test coverage, testQuotedField2 .

Currently the Spark ORC datasource does not check field names, but the Spark Hive ORC format does, which is why the Spark ORC datasource is able to write empty field names.
Should we check this at the Spark level and disallow writing such field names?

org.apache.spark.sql.execution.datasources.orc.OrcFileFormat 
org.apache.spark.sql.hive.execution.HiveFileFormat#supportFieldName

In addition, could we validate the schema in the ORC writer? Otherwise, data may be written that cannot be read back.

org.apache.orc.OrcFile.WriterOptions#setSchema
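The writer-side validation suggested above could look roughly like this. It is a hypothetical sketch with invented names, not ORC's actual OrcFile.WriterOptions API: reject empty field names up front, before any data is written.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of schema validation at writer setup. The class
// and method names are invented for illustration and are not part of
// ORC's real OrcFile.WriterOptions API.
public class SchemaValidator {
  public static void checkFieldNames(List<String> fieldNames) {
    for (String name : fieldNames) {
      if (name == null || name.isEmpty()) {
        throw new IllegalArgumentException(
            "Empty field name is not allowed in the writer schema");
      }
    }
  }

  public static void main(String[] args) {
    checkFieldNames(Arrays.asList("id", "name")); // passes silently
    try {
      checkFieldNames(Arrays.asList("id", ""));   // rejected up front
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```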

@dongjoon-hyun
Member

Ya, we can do some of that. However, instead of switching behaviors back and forth in this layer, I believe we need to focus on your end goals in the higher layers. What is your end goal as a user? For example, did you hit the following? If so, do we have test coverage in the Apache Spark codebase across multiple data sources (Hive, ORC, Parquet)? Can we start from there?

automatically generate _c1 if no alias is specified

@cxzl25
Contributor Author

cxzl25 commented Apr 10, 2023

What is your end goal as a user? For example, did you hit the following?

We have encountered some problems: some older versions of Spark used an older version of ORC to write data with empty field names, but reading that data with Spark 3.2 / ORC 1.6 fails.

We have injected custom rules into Spark through SparkSessionExtensions to add aliases automatically, which avoids this problem.

So I was thinking, it would be better if this problem could be solved at Spark or ORC level.

hive.autogen.columnalias.prefix.label
Default Value: _c
Added In: Hive 0.8.0

String used as a prefix when auto generating column alias. By default the prefix label will be appended with a column position number to form the column alias. Auto generation would happen if an aggregate function is used in a select clause without an explicit alias.
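The auto-aliasing rule quoted above can be sketched as follows; this is a simplified illustration of the naming scheme, not Hive's implementation:

```java
// Simplified illustration of Hive-style auto aliasing: an expression
// without an explicit alias gets <prefix><position>, where the prefix
// comes from hive.autogen.columnalias.prefix.label (default "_c").
public class AutoAlias {
  public static String alias(String explicitAlias, int position, String prefix) {
    return (explicitAlias == null || explicitAlias.isEmpty())
        ? prefix + position
        : explicitAlias;
  }

  public static void main(String[] args) {
    System.out.println(alias(null, 0, "_c")); // _c0
    System.out.println(alias("x", 1, "_c"));  // x
    System.out.println(alias(null, 2, "_c")); // _c2
  }
}
```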

@deshanxiao
Contributor

Based on the above scenario, and to avoid additional side effects, maybe we could make this limitation skippable by adding a new configuration?
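A configuration-gated check along those lines might look like this. The configuration key below is invented purely for illustration and is not a real ORC configuration:

```java
import java.util.Properties;

// Sketch of gating the empty-field-name check behind a configuration
// flag, as suggested. The key name is hypothetical, not a real ORC key.
public class EmptyNameConfig {
  static final String ALLOW_EMPTY = "orc.schema.allow.empty.field.names"; // hypothetical

  public static String checkName(String name, Properties conf) {
    boolean allow = Boolean.parseBoolean(conf.getProperty(ALLOW_EMPTY, "false"));
    if (name.isEmpty() && !allow) {
      throw new IllegalArgumentException("Empty quoted field name");
    }
    return name;
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    conf.setProperty(ALLOW_EMPTY, "true");
    System.out.println("[" + checkName("", conf) + "]"); // allowed when enabled
  }
}
```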
