Fix escape in HiveConnectorSplit
Summary:
Exported to PR: facebookincubator#9662.

TextReader produces incorrect results when the input file contains an
escaped field delimiter. For example, it incorrectly splits the field "a\,bc"
into two fields, "a\" and "bc". This happens because configureReaderOptions()
doesn't set SerDeOptions::isEscaped according to hiveSplit. In the Java
implementation, isEscaped is set to true when the escapedChar field is
non-empty
(https://github.com/apache/hive/blob/2d855b27d31db6476f18870651db6987816bb5e3/serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazySerDeParameters.java#L105-L106).

This diff fixes the bug: parseSerdeParameters() now reads the "escape.delim"
serde parameter and, when its value is non-empty, constructs SerDeOptions with
that escape character and isEscaped set to true.
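
To illustrate the failure mode described above, here is a minimal standalone sketch of delimiter splitting with and without escape handling. It is not Velox's TextReader; splitFields and its parameters are hypothetical names introduced only for this example, and the sketch keeps the escape character in the output rather than unescaping it.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not part of Velox): splits `line` on `delim`.
// When `isEscaped` is true, the character that follows `escapeChar` is
// treated as field content, so an escaped delimiter does not end the field.
std::vector<std::string> splitFields(
    const std::string& line,
    char delim,
    bool isEscaped,
    char escapeChar = '\\') {
  std::vector<std::string> fields(1);
  for (size_t i = 0; i < line.size(); ++i) {
    const char c = line[i];
    if (isEscaped && c == escapeChar && i + 1 < line.size()) {
      // Copy the escape sequence as-is and skip delimiter detection for
      // the escaped character.
      fields.back().push_back(c);
      fields.back().push_back(line[++i]);
    } else if (c == delim) {
      // Unescaped delimiter: start a new field.
      fields.emplace_back();
    } else {
      fields.back().push_back(c);
    }
  }
  return fields;
}
```

With isEscaped left false, splitFields("a\\,bc", ',', false) yields the two fields "a\" and "bc", which mirrors the bug; with isEscaped set to true the same input stays a single field.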

Reviewed By: Yuhta

Differential Revision: D56728664

fbshipit-source-id: c40cd3fc55eb067f21215d82634c640dbfe418ce
kagamiori authored and facebook-github-bot committed Apr 30, 2024
1 parent 6e253f7 commit 44e9ec1
Showing 2 changed files with 18 additions and 2 deletions.
19 changes: 17 additions & 2 deletions velox/connectors/hive/HiveConnectorUtil.cpp
@@ -436,12 +436,16 @@ std::unique_ptr<dwio::common::SerDeOptions> parseSerdeParameters(
auto mapKeyIt =
serdeParameters.find(dwio::common::SerDeOptions::kMapKeyDelim);

+auto escapeCharIt =
+serdeParameters.find(dwio::common::SerDeOptions::kEscapeChar);

auto nullStringIt = tableParameters.find(
dwio::common::TableParameter::kSerializationNullFormat);

if (fieldIt == serdeParameters.end() &&
collectionIt == serdeParameters.end() &&
mapKeyIt == serdeParameters.end() &&
+escapeCharIt == serdeParameters.end() &&
nullStringIt == tableParameters.end()) {
return nullptr;
}
@@ -458,8 +462,19 @@ std::unique_ptr<dwio::common::SerDeOptions> parseSerdeParameters(
if (mapKeyIt != serdeParameters.end()) {
mapKeyDelim = parseDelimiter(mapKeyIt->second);
}
-auto serDeOptions = std::make_unique<dwio::common::SerDeOptions>(
-fieldDelim, collectionDelim, mapKeyDelim);
+
+uint8_t escapeChar;
+bool hasEscapeChar = false;
+if (escapeCharIt != serdeParameters.end() && !escapeCharIt->second.empty()) {
+hasEscapeChar = true;
+escapeChar = escapeCharIt->second[0];
+}
+
+auto serDeOptions = hasEscapeChar
+? std::make_unique<dwio::common::SerDeOptions>(
+fieldDelim, collectionDelim, mapKeyDelim, escapeChar, true)
+: std::make_unique<dwio::common::SerDeOptions>(
+fieldDelim, collectionDelim, mapKeyDelim);
if (nullStringIt != tableParameters.end()) {
serDeOptions->nullString = nullStringIt->second;
}
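
The rule this hunk applies (escaping is enabled only when "escape.delim" is present and non-empty, and the escape character is the first byte of its value) can be sketched in isolation as follows. EscapeSetting and parseEscape are hypothetical names used only for illustration, and the backslash default is an assumption of the sketch, not something taken from SerDeOptions.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the escape-related state of SerDeOptions.
struct EscapeSetting {
  bool isEscaped{false};
  uint8_t escapeChar{'\\'}; // Assumed default; only meaningful if isEscaped.
};

// Hypothetical helper: derives the escape setting from serde parameters.
EscapeSetting parseEscape(
    const std::unordered_map<std::string, std::string>& serdeParameters) {
  EscapeSetting setting;
  auto it = serdeParameters.find("escape.delim");
  // A missing key or an empty value leaves escaping disabled.
  if (it != serdeParameters.end() && !it->second.empty()) {
    setting.isEscaped = true;
    setting.escapeChar = static_cast<uint8_t>(it->second[0]);
  }
  return setting;
}
```

Returning a small struct with defaults is one way to avoid carrying a separate flag next to an uninitialized character; the hunk itself achieves the same effect with hasEscapeChar and a conditional constructor call.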
1 change: 1 addition & 0 deletions velox/dwio/common/Options.h
@@ -83,6 +83,7 @@ class SerDeOptions {
inline static const std::string kFieldDelim{"field.delim"};
inline static const std::string kCollectionDelim{"collection.delim"};
inline static const std::string kMapKeyDelim{"mapkey.delim"};
+inline static const std::string kEscapeChar{"escape.delim"};

explicit SerDeOptions(
uint8_t fieldDelim = '\1',