PARQUET-2347: Add interface layer between Parquet and Hadoop Configuration #1141
Add interface layer for the Hadoop Configuration.
Thanks for doing this! I have left some inline comments.
BTW, I am also curious about the following:
- It seems that we still have some compatibility issues. Can you confirm? If so, could you please list them explicitly?
- Are there any follow-up work items? It would be good to know the whole picture in advance.
- Is it possible to add a simple test case to prove that a simple writer and reader roundtrip works without a Hadoop dependency?
parquet-pig/src/main/java/org/apache/parquet/pig/TupleReadSupport.java
parquet-thrift/src/main/java/org/apache/parquet/hadoop/thrift/AbstractThriftWriteSupport.java
if (conf instanceof HadoopParquetConfiguration) {
  return ((HadoopParquetConfiguration) conf).getConfiguration();
}
Configuration configuration = new Configuration();
When will it happen?
A user of HadoopParquetConfiguration has not yet decoupled from Hadoop, as it is just a wrapper for Configuration. Users who want to decouple from Hadoop can implement their own ParquetConfiguration that does not rely on Hadoop's Configuration (a simple implementation can be added afterwards; this PR was already getting a bit large for that). Some code, mainly around the codecs, still needs a Hadoop Configuration. So while we are still removing these last references to Hadoop, it is important that we can get such an instance from any ParquetConfiguration, in order not to break anything.
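To make the fallback path concrete, here is a minimal, self-contained sketch of the bridge described in this reply. All type names are hypothetical stand-ins so that it compiles without Hadoop or Parquet on the classpath. The copy loop is an assumption, since the quoted diff only shows the first line of the fallback branch.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for Hadoop's Configuration; not the real class.
class HadoopConfigurationStandIn implements Iterable<Map.Entry<String, String>> {
  private final Map<String, String> props = new LinkedHashMap<>();
  public void set(String name, String value) { props.put(name, value); }
  public String get(String name) { return props.get(name); }
  @Override public Iterator<Map.Entry<String, String>> iterator() { return props.entrySet().iterator(); }
}

// Hypothetical stand-in for the ParquetConfiguration interface; only what the bridge needs.
interface ParquetConfigurationStandIn extends Iterable<Map.Entry<String, String>> {
  void set(String name, String value);
  String get(String name);
}

// Plays the role of HadoopParquetConfiguration: a thin wrapper around the Hadoop object.
class HadoopBackedConfiguration implements ParquetConfigurationStandIn {
  private final HadoopConfigurationStandIn configuration;
  HadoopBackedConfiguration(HadoopConfigurationStandIn configuration) { this.configuration = configuration; }
  HadoopConfigurationStandIn getConfiguration() { return configuration; }
  @Override public void set(String name, String value) { configuration.set(name, value); }
  @Override public String get(String name) { return configuration.get(name); }
  @Override public Iterator<Map.Entry<String, String>> iterator() { return configuration.iterator(); }
}

// A user-supplied implementation that does not touch Hadoop at all.
class MapBackedConfiguration implements ParquetConfigurationStandIn {
  private final Map<String, String> props = new LinkedHashMap<>();
  @Override public void set(String name, String value) { props.put(name, value); }
  @Override public String get(String name) { return props.get(name); }
  @Override public Iterator<Map.Entry<String, String>> iterator() { return props.entrySet().iterator(); }
}

final class ConfigurationBridge {
  private ConfigurationBridge() {}

  // Unwrap when possible; otherwise build a fresh Hadoop-style object from the entries.
  static HadoopConfigurationStandIn toHadoop(ParquetConfigurationStandIn conf) {
    if (conf instanceof HadoopBackedConfiguration) {
      return ((HadoopBackedConfiguration) conf).getConfiguration();
    }
    HadoopConfigurationStandIn configuration = new HadoopConfigurationStandIn();
    for (Map.Entry<String, String> entry : conf) { // assumed: entries are copied one by one
      configuration.set(entry.getKey(), entry.getValue());
    }
    return configuration;
  }
}
```

In this sketch the wrapper case returns the same object, so Hadoop-based callers see no behaviour change, while a custom implementation only pays for a copy in the code paths that still need a Hadoop-style object.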
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/api/InitContext.java
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/CodecFactory.java
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/InternalParquetRecordReader.java
parquet-avro/src/main/java/org/apache/parquet/avro/AvroWriteSupport.java
parquet-thrift/src/main/java/org/apache/parquet/hadoop/thrift/ThriftReadSupport.java
…tContext.java Co-authored-by: Gang Wu <[email protected]>
@wgtmac thanks a lot for the review! It's quite a big one, so I really appreciate that you took the time for it. To address your concerns:
Correct, japicmp reports two incompatibilities, in the classes CodecFactory and ParquetReader. Both are changes of private and protected field types from Configuration to ParquetConfiguration. These changes are strictly necessary for the effort to unhadoop the read/write API.
After this, the next steps are: 1. creating a simple unhadooped implementation of the ParquetConfiguration interface, and 2. adding a simple way for users to avoid the Hadoop codecs, as the OOTB implementations of everything still rely very heavily on Hadoop classes. These changes should allow users to drop the Hadoop runtime dependency. The Hadoop client API dependency will still be necessary.
We do not yet have any serious way for users to use the API without Hadoop dependencies. The parameters added to the TestReadWrite fixture make sure the read/write API still functions when using the ParquetConfiguration interface.
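As a rough illustration of the first follow-up step, a Hadoop-free implementation could be little more than a map with typed getters. The sketch below is not the planned implementation. The interface is a hypothetical subset, and the getter names only assume that the real interface mirrors Hadoop's Configuration.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical subset of a configuration interface; not the real ParquetConfiguration.
interface KeyValueConfiguration extends Iterable<Map.Entry<String, String>> {
  void set(String name, String value);
  String get(String name);
  String get(String name, String defaultValue);
  int getInt(String name, int defaultValue);
  boolean getBoolean(String name, boolean defaultValue);
}

// A plain in-memory implementation with no Hadoop classes involved.
final class PlainConfiguration implements KeyValueConfiguration {
  private final Map<String, String> properties = new HashMap<>();

  @Override public void set(String name, String value) { properties.put(name, value); }

  @Override public String get(String name) { return properties.get(name); }

  @Override public String get(String name, String defaultValue) {
    String value = properties.get(name);
    return value != null ? value : defaultValue;
  }

  @Override public int getInt(String name, int defaultValue) {
    String value = properties.get(name);
    // Simplification: a malformed number throws NumberFormatException.
    return value != null ? Integer.parseInt(value.trim()) : defaultValue;
  }

  @Override public boolean getBoolean(String name, boolean defaultValue) {
    String value = properties.get(name);
    // Simplification: anything other than "true" (case-insensitive) reads as false.
    return value != null ? Boolean.parseBoolean(value.trim()) : defaultValue;
  }

  @Override public Iterator<Map.Entry<String, String>> iterator() {
    return properties.entrySet().iterator();
  }
}
```

Keeping the method names identical to the ones callers already use means existing call sites such as conf.getInt(...) would not need to change when the backing type does.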
Thanks for the detailed explanation! This LGTM. You could also create JIRA issues in advance for all planned work items.
I would like to hear feedback from more maintainers before proceeding. @gszadovszky @shangxinli @ggershinsky @Fokko
@@ -547,6 +547,8 @@
</excludeModules>
<excludes>
<exclude>${shade.prefix}</exclude>
<exclude>org.apache.parquet.hadoop.CodecFactory</exclude> <!-- change field type from Configuration to ParquetConfiguration -->
Just curious: can we remove this in a future PR? Only this PR introduces the breaking change here.
I'm not that well versed in how japicmp works exactly, but I reckon this can be removed after the next minor release.
@shangxinli @gszadovszky Do you have any comments? If not, I plan to merge this next week. Thanks! @amousavigourabi Could you please resolve the conflicts? Thank you!
@wgtmac, @amousavigourabi, sorry, but I won't have the time to properly review this PR in the upcoming days.
parquet-common/src/main/java/org/apache/parquet/conf/ParquetConfiguration.java
parquet-hadoop/src/main/java/org/apache/parquet/ParquetReadOptions.java
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputFormat.java
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetWriter.java
*
* @deprecated override {@link ReadSupport#init(InitContext)} instead
*/
@Deprecated
Do we still need to add a deprecated method?
This PR is focused on transitioning from Configuration to the ParquetConfiguration interface. This included some calls to deprecated methods which I could not quickly transition away from. I would consider this out of scope for this PR.
Map<String, String> keyValueMetaData,
MessageType fileSchema,
ReadContext readContext) {
  throw new UnsupportedOperationException("Override prepareForRead(ParquetConfiguration, Map<String, String>, MessageType, ReadContext)");
The error message should be more meaningful.
I followed the example set by ReadSupport#init(Configuration, Map, MessageType). As this error will not occur unless you are implementing your own ReadSupport class, I am not sure whether there needs to be that much more information in the exception. I'll add a reference to the ReadSupport class though.
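The pattern under discussion, a new overload that throws unless a subclass overrides it, can be sketched in isolation as follows. The class and method names are hypothetical stand-ins rather than the actual ReadSupport API, and the message format is only one assumed way to make the error more descriptive by naming both the offending subclass and the base class.

```java
import java.util.Map;

// Hypothetical stand-in for the new configuration interface.
interface Conf {
  String get(String name);
}

// Hypothetical base class: the new overload is not abstract, so existing
// subclasses keep compiling, but calling it without an override fails loudly.
abstract class RecordReadSupport<T> {
  T prepareForRead(Conf configuration, Map<String, String> keyValueMetaData) {
    throw new UnsupportedOperationException(
        getClass().getName() + " must override RecordReadSupport#prepareForRead(Conf, Map)");
  }
}

// A subclass written before the overload existed; it does not override it.
class LegacySupport extends RecordReadSupport<String> {}

// A subclass that has been updated to the new overload.
class UpdatedSupport extends RecordReadSupport<String> {
  @Override
  String prepareForRead(Conf configuration, Map<String, String> keyValueMetaData) {
    return keyValueMetaData.getOrDefault("writer.model.name", "unknown");
  }
}
```

The trade-off is that a missing override surfaces at runtime instead of at compile time, which is why the wording of the exception message matters to implementers of custom subclasses.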
* @return the information needed to write the file
*/
public WriteContext init(ParquetConfiguration configuration) {
  throw new UnsupportedOperationException("Override init(ParquetConfiguration)");
Same here
parquet-avro/src/main/java/org/apache/parquet/avro/AvroWriteSupport.java
parquet-thrift/src/main/java/org/apache/parquet/thrift/ThriftRecordConverter.java
@amousavigourabi we have added so many public methods to keep the backward compatibility. So which one is preferred (I think it should be the |
the |
Thank you for the review @ConeyLiu. This PR is one of the efforts to use the parquet library without the Hadoop dependency. At the moment, we do not want to deprecate Hadoop dependency yet.
Thanks, @amousavigourabi for the response and your efforts. LGTM.
This PR is one of the efforts to use the parquet library without the Hadoop dependency. At the moment, we do not want to deprecate Hadoop dependency yet.
@wgtmac This is a good start. It would be nice to have a best practice guide doc to help users understand the different configuration usage.
I totally agree with you. @ConeyLiu
* @param name the property to set
* @param value the value to set the property to
*/
void set(String name, String value);
setString()?
I just went with the way Hadoop's Configuration does it for this and other methods in the interface, in order to not have to change references to conf.set() everywhere.
I will merge this if there are no more comments by the end of this week. Thanks @amousavigourabi for working on this!
This broke the CI check.
private static final Map<Class<?>, Constructor<?>> CONSTRUCTOR_CACHE = new ConcurrentHashMap<Class<?>, Constructor<?>>();

@SuppressWarnings("unchecked")
public static <T> T newInstance(Class<T> theClass) {
Not sure why this breaks the API check. However, this class seems redundant since it has no usages.
Hmm, this is strange. Should we exclude this class in the japicmp config? @amousavigourabi
As there is no current usage, we can get away with deleting this class for now. My main concern with the CI failure is the incompatibility japicmp complains about at org.apache.parquet.conf.ParquetConfiguration.getClass(java.lang.String,java.lang.Class,java.lang.Class):CLASS_GENERIC_TEMPLATE_CHANGED. This did not come up before and the interface is completely new, so I have no idea why this is happening. Neither of the matters japicmp flags should be an issue AFAIK, so might that be something to investigate in our configuration?
I think this is a bug in japicmp, because both of the methods are generic methods. It should be safe to suppress the check.
I'll open a new PR shortly, deleting the unused stuff and suppressing the warnings.
It has also started failing at other completely new methods in the Parquet Hadoop module: org.apache.parquet.conf.HadoopParquetConfiguration.getClass(java.lang.String,java.lang.Class,java.lang.Class):CLASS_GENERIC_TEMPLATE_CHANGED,org.apache.parquet.hadoop.util.SerializationUtil.readObjectFromConfAsBase64(java.lang.String,org.apache.parquet.conf.ParquetConfiguration):CLASS_GENERIC_TEMPLATE_CHANGED. Might be something to report to the japicmp maintainer?
I tested with 0.16.0 and it has no problem. A little strange.
Probably the current 0.18.1 is more aggressive than 0.16.0
@@ -48,7 +51,7 @@ public class CodecFactory implements CompressionCodecFactory {
private final Map<CompressionCodecName, BytesCompressor> compressors = new HashMap<>();
private final Map<CompressionCodecName, BytesDecompressor> decompressors = new HashMap<>();

protected final Configuration configuration;
This causes the Iceberg build to break when bumping to 1.14.0-SNAPSHOT:
> Task :iceberg-parquet:compileJava FAILED
/Users/fokkodriesprong/Desktop/iceberg/parquet/src/main/java/org/apache/iceberg/parquet/ParquetCodecFactory.java:60: error: cannot find symbol
codecClass = configuration.getClassLoader().loadClass(codecClassName);
^
symbol: method getClassLoader()
location: variable configuration of type ParquetConfiguration
/Users/fokkodriesprong/Desktop/iceberg/parquet/src/main/java/org/apache/iceberg/parquet/ParquetCodecFactory.java:62: error: incompatible types: ParquetConfiguration cannot be converted to Configuration
codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, configuration);
Does it make sense to fix this on the Iceberg side?
Anyway this is a breaking change. We have to fix it before the release.
For Iceberg this is not an issue, since we would remove this class with the 1.14 release (that contains #1134):
https://github.com/apache/iceberg/blob/866021d7d34f274349ce7de1f29d113395e7f28c/parquet/src/main/java/org/apache/iceberg/parquet/ParquetCodecFactory.java#L28-L32
In Iceberg we shade Parquet, so this will not interfere with other libraries that use a different version of Parquet. I'm curious whether we also have this issue on the Spark side (cc @vinooganesh).
Make sure you have checked all steps below.
Jira
Tests
Additional parameters to run the read/write tests using the new interface-using methods.
Commits
Documentation
japicmp exclusions have been added for the following classes: org.apache.parquet.hadoop.CodecFactory, org.apache.parquet.hadoop.ParquetReader. When these exclusions are removed, the following incompatibilities are detected:
This PR is part of an effort that has been discussed on the dev mailing list.