-
Notifications
You must be signed in to change notification settings - Fork 513
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
In SMB and ParquetAvroIOTap set GenericDataSupplier and read schemas #5121
Conversation
@@ -0,0 +1,38 @@ | |||
package com.spotify.scio.parquet.avro |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved this class out from package object
val filePattern = ScioUtil.filePattern(path, params.suffix) | ||
val conf = ParquetConfiguration | ||
.ofNullable(params.conf) | ||
conf.setReadSchemas(params) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Except of these two lines, the rest of the class remains the same, just moved
@@ -241,21 +241,24 @@ public Read<T> withPredicate(Predicate<T> predicate) { | |||
return toBuilder().setPredicate(predicate).build(); | |||
} | |||
|
|||
@SuppressWarnings("unchecked") | |||
@Override | |||
FileOperations<T> getFileOperations() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added this to reflect the same style as in Write and TransformOutput
@Test | ||
public void testSpecificRecord() throws Exception { | ||
final ParquetAvroFileOperations<AvroGeneratedUser> fileOperations = | ||
ParquetAvroFileOperations.of(AvroGeneratedUser.getClassSchema()); | ||
ParquetAvroFileOperations.of(AvroGeneratedUser.class); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In Prod we use the overload with Schema for GenericRecords only
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #5121 +/- ##
==========================================
+ Coverage 62.99% 63.06% +0.06%
==========================================
Files 298 300 +2
Lines 10946 10945 -1
Branches 754 760 +6
==========================================
+ Hits 6895 6902 +7
+ Misses 4051 4043 -8 ☔ View full report in Codecov by Sentry. |
final Schema schema = | ||
getRecordClass() == null | ||
? getSchema() | ||
: new ReflectData(getRecordClass().getClassLoader()).getSchema(getRecordClass()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is identical resolution of schema through ReflectData in ParquetAvroFileOperations
|
||
implicit class ConfigurationSyntax(conf: Configuration) { | ||
|
||
def setReadSchemas[A, T](params: ParquetAvroIO.ReadParam[A, T]): Unit = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think of broadening the scope of this method and moving it into a class method of ParquetAvroIO.ReadParam
? Like:
final case class ReadParam[A: ClassTag, T: ClassTag] private (
projectionFn: A => T,
projection: Schema = ReadParam.DefaultProjection,
predicate: FilterPredicate = ReadParam.DefaultPredicate,
conf: Configuration = ReadParam.DefaultConfiguration,
suffix: String = null
) {
val avroClass: Class[A] = ScioUtil.classOf[A]
val isSpecific = classOf[SpecificRecord] isAssignableFrom ScioUtil.classOf[T]
private[avro] def getReadConfiguration(): Configuration = {
val c = ParquetConfiguration.ofNullable(conf)
// Set schema/projection
val readSchema: Schema = if (isSpecific) ReflectData.get().getSchema(avroClass) else params.projection
AvroReadSupport.setAvroReadSchema(readSchema)
if (params.projection != null) { /* set projection... */ }
if (params.predicate != null) { /* set predicate */ }
// Set data supplier
if (!isSpecific && conf.get(AvroReadSupport.AVRO_DATA_SUPPLIER) == null) {
....
}
}
That way it's easy to access from both read
and tap
methods 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great! I missed it, the great place for common functions on input conf!
1673bf2
to
b952934
Compare
@@ -219,13 +212,27 @@ object ParquetAvroIO { | |||
} | |||
} | |||
|
|||
def setReadSchemas[A, T](params: ParquetAvroIO.ReadParam[A, T]): Unit = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ReadParam
is already parameterized with [A, T]
. I don't think we want type shadowing here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing this function
val filePattern = ScioUtil.filePattern(path, params.suffix) | ||
val conf = ParquetConfiguration | ||
.ofNullable(params.conf) | ||
params.setReadSchemas(params) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we want to do this ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably was tired when committed this, I am changing the approach
aa95561
to
1b8ddac
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that using an empty config as parameter allows some simplification, but I'd still prefer to pass a null there for consistency.
This is a personal preference, not a blocker. @clairemcginty might also have an opinion about it.
@@ -159,7 +145,7 @@ object ParquetAvroIO { | |||
object ReadParam { | |||
val DefaultProjection: Schema = null | |||
val DefaultPredicate: FilterPredicate = null | |||
val DefaultConfiguration: Configuration = null | |||
def DefaultConfiguration: Configuration = ParquetConfiguration.empty() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TBH, I'd prefer keeping this as null and handle it internally
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, I think, partly because it matches other ScioIO param conventions and partly because I'm worried that over time we'll forget why this was made a def
and refactor it back into a val
:)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to challenge this approach :) but fully agree, it is so easy to introduce a bug. As an alternative would be full clone of ParquetConfiguration
after used passed it, because we modify this instance and we don't know what else a user does
Functions.serializableFn { x => | ||
cleanedProjectionFn(x) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit. can we revert to previous syntax?
current = reader.read() | ||
r | ||
} | ||
private[avro] def setupConfigAndGetSchema[T: ClassTag](): Schema = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I would prefer getting rid of this helper method an just inlining it in ParquetAvroIO::write as before. Especially since once we merge the Parquet 1.13.1 upgrade, we'll be able to remove all the logical type checks, so there won't be much logic left :D
Alternately, we could convert it it into a method that just gets the Schema for a type T
:
private[avro] def getSchemaForClass[T: ClassTag](): Schema = {
val avroClass = ScioUtil.classOf[T]
val isSpecific: Boolean = classOf[SpecificRecord] isAssignableFrom avroClass
if (isSpecific) ReflectData.get().getSchema(avroClass) else schema
}
since that logic is reused a bit in this file :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. In case of write it is used only once, but I aimed to have similarity with read param handling. I will inline
scio-parquet/src/main/scala/com/spotify/scio/parquet/avro/ParquetAvroTap.scala
Outdated
Show resolved
Hide resolved
scio-parquet/src/main/scala/com/spotify/scio/parquet/avro/ParquetAvroSink.scala
Outdated
Show resolved
Hide resolved
public static <V extends IndexedRecord> ParquetAvroFileOperations<V> of( | ||
Schema schema, CompressionCodecName compression, FilterPredicate predicate, | ||
Configuration conf) { | ||
if (conf.get(AvroReadSupport.AVRO_DATA_SUPPLIER) == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prefer setting this in ParquetAvroReader::prepareReader, similar to how we set model in ParquetAvroSink::open -- that way this logic doesn't depend on the user choosing this constructor specifically :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh right, with the help of newly introduced private final Class<ValueT> recordClass;
we can do that
0cbfe14
to
338dc55
Compare
_.parquetAvroFile[TestRecord](_).map(identity) | ||
)(_.saveAsParquetAvroFile(_)) | ||
|
||
forAllCases(readConfigs) { case (c, _) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the main idea of changes in this file is to test both SDR and Legacy in the same test cases
Addressing: