Iceberg read support in Velox #5977
This is the first cut to support reading Iceberg tables with delete files. The design doc can be found at facebookincubator#5977
@yingsu00 Thanks for the detailed design document. We need to separate Iceberg positional delete support and equality delete support into two separate issues, as they are independent pieces. Equality deletes also have the design choice of using the coordinator. From a contribution standpoint as well, we need to add them independently. Now you mentioned
I do not see a lot of Iceberg domain-specific information here except for the delete list materialization and updating the mutation. As you mentioned, the mutation update will also go away with offset support. Can we add a readPositionalDeletes() to HiveConnector that updates the mutation? Inside readPositionalDeletes(), we can specialize based on the Iceberg split.
@majetideepak Equality deletes will be a separate implementation for sure, as I already mentioned in the doc:
We need to evaluate whether both are possible and do not incur too much of a performance hit. The positional delete PR does not include it, as its title already says, but I put the discussion in the design doc so people can learn about the options and chime in with their thoughts. For the HiveIcebergDataSource, I think everything inside it is Iceberg specific: how the delete files are defined, what types the columns in the delete files have, how to interpret the row numbers (they are file positions instead of split positions), how to pre-filter the delete rows, and how to match the base file paths are all very Iceberg specific. Hudi and Delta tables will be very different in these respects. For example, Iceberg defines its metadata columns differently than other protocols. For another example, Hudi's log file is different from Iceberg's delete files and will be opened in a different way. It'll be a mess to mix everything into the already huge HiveDataSource. Note that HiveIcebergDataSource doesn't have all features implemented yet and is already several hundred lines of code. In the next few PRs we will introduce those Iceberg-specific operations like pre-filtering the delete rows, matching the base file paths, and maybe column projections, which are also defined by the Iceberg spec. To me, having this layer handle these domain-specific things and construct the Mutation object is very natural and easier to maintain than having a huge HiveDataSource file with many large if-else blocks. In the future, if we want to support Hudi and Delta, we can easily extend this structure without making the code too big and too hard to read. Also, Meta already has an internal extended HiveDataSource for Prism, so this is not new.
I see some code related to equality deletes here https://github.com/facebookincubator/velox/pull/5897/files#diff-54e191b935da6cdc7f5542e8e6997d7a56e2ec6cf8f8094f570d38aadcc57dfbR76. Am I missing something?
Sure, we can abstract away the deleteList -> mutation into another class and add all Iceberg-specific details there.
This is again delete list -> mutation
How does this metadata impact reading files in the worker?
You mentioned that this and schema evolution should be done in the HiveConnector itself.
IMO creating a new DataSource is almost like adding a new connector. @Yuhta can you share some pointers on what was done for Prism?
readEqualityDeletes is VELOX_NYI. I can remove it for now.
Could you explain a bit how to abstract it into another class in a way that is fundamentally different from the HiveIcebergDataSource? Also, this is actually 1) IcebergSplit -> 2) IcebergDeleteFile -> 3) vectors for the data file path and delete list -> 4) Mutation. The Iceberg delete file and all metadata columns are defined by the Iceberg specification and are therefore Iceberg specific. Apart from these data structures, the HiveIcebergDataSource is also about its behavior: it needs to deserialize the split's extraFileInfo to get the IcebergDeleteFiles, open the delete files, and construct the delete lists. This is additional work on top of the plain HiveDataSource, which just reads the base data file. This behavior sits on top of HiveDataSource and is different for other table formats like Hudi and Delta.
The pre-filter needs to use the lowerBounds and upperBounds of the delete file.
The Iceberg delete file has two required metadata columns and one optional metadata column, defined by the Iceberg specification. It's 2 or 3 columns only, instead of 1 or 3, and the types of the required columns are Iceberg StringType and LongType. Other table formats define them differently; for example, a Hudi log file may contain multiple log records for multiple transactions. For positional deletes, we don't need to add metadata columns to the base data file. But in the future, we may need to add these metadata columns to the Iceberg scan. For example, Iceberg updates may require the scan to add the "row" and "_pos" columns to the existing columns so that the rows can be merged with new values in downstream operators. As another example, this query also needs to add additional metadata columns
I added more explanation in this section in the design doc.
Yes, but the design of schema evolution in Hive is not clear now and this is only a thought. We will need to discuss this later. But no matter whether schema evolution is implemented in the base HiveDataSource or the derived HiveIcebergDataSource, Iceberg column projections need to be supported to query Iceberg-specific metadata columns.
This is just a derived HiveDataSource that handles Iceberg-specific things on top of Hive, and is far from being a full connector. They share the same configs, rely on the same storage adapters, and use the same HiveTableHandle and HiveColumnHandles. They can additionally share some of the partition functions. The difference is that it needs to open additional Iceberg files and construct Mutation objects, or add Iceberg-specific metadata columns. These are all specific to Iceberg. I think we all agree that we want to abstract them into a separate Iceberg class that handles Iceberg-related operations. Now we have several choices
Do these type checks look awkward? As a common practice, when we see many such type checks in if-else blocks, we naturally rework them to use the OOP paradigm, with either inheritance or delegation. It also helps make the code easier to read and maintain. HiveDataSource is already almost a thousand lines. Adding these if-else blocks will only make the code longer and harder to search and read, and adding more features would make it even longer over time. It's a non-trivial amount of code. Separating the code into multiple logical units or files is friendlier to us humans. Also, if we put everything in the base HiveDataSource, it would pollute the code, because we may have to add Iceberg- or Hudi-specific member fields. In summary, I think the original hope to use a standard way to express deletes in the HiveConnectorSplit does NOT work, because
Now that we need to do the work in Velox anyway, it's just a matter of how we organize this table-format-specific code. Having it mingled in the same class and file doesn't seem like a good choice to me. Having a derived HiveIcebergDataSource does NOT mean it's a new connector; it's just a better way to hold the code we have to write anyway.
The abstraction analogy I can think of is how we have the same
This should allow re-use of abstractions across various table formats that support deletes.
@majetideepak This only works if these table format protocols share the same semantics and behaviors, so that you can extract the common behaviors (e.g. read deletes) and apply them in a pre-defined order. But unfortunately they are drastically different. For example, Hudi uses a record ID that is unique across the whole partition. For another example, Hudi does NOT use standalone delete files; instead the reader needs to open the log file and take different actions for different block types. It defines different merge policies and requires the reader to
This means the reader needs to look up keys and may need to do sorting and comparisons. The operations will be quite different from the Iceberg operations, both in their content and in their steps. Mixing them together would only make the base HiveDataSource logic too complex and hard to maintain. Having separate HiveIcebergDataSource and HiveHudiDataSource classes would give us much more flexibility in supporting the different behaviors of these table formats, making the code files shorter and easier to maintain. Plus, it does NOT necessarily introduce duplicated code. We can easily extract common operations into separate util functions, e.g. open a file and read its content, or open a file and read one block, etc.
@yingsu00 I believe you are referring to the Hudi real-time view. We cannot use the existing Parquet reader to support this. We might need to implement a new reader that supports real-time merging for this. The HiveHudiDataSource abstraction will not solve this. I think it will help if @Yuhta can share some insights into this. We should probably start a discussion in the Prestissimo working group as well to have formal community input. CC @aditi-pandit
Yes I was referring to Hudi real-time view which needs to read Hudi log files.
The Hudi read-optimized view does NOT need to read the log files, and there's nothing that needs to be done in Velox. But to support real-time reads, which do require reading the log files, Velox needs to
Looking at these operations, they are all Hudi specific. It doesn't sound right to push them into the Parquet reader or DWRF reader, whose task is just to decode the data of a single file in batches. But these Hudi-specific operations, e.g. opening log files, merging with base file result vectors, etc., can be easily handled in a HiveHudiDataSource. Another fact is that Iceberg metadata columns are not persisted in the base file, but Hudi metadata columns are. Therefore if we need to read the metadata columns, they need to be constructed somehow, e.g. _is_deleted, or _row. This work can also be easily done in such a data source.
This is not true. These table format specifications and implementations are quite complex, and I do think they deserve their own data sources. If we have everything in HiveDataSource, the class will become too big and hard to maintain. Even though you still think
As I suggested in #5897, mutation is a split-level operation, so we should do it at the split level instead of the data source level. I proposed a split-level processor for this. The delete file can be in different formats and it's the job of the processor implementation to read it. However, the result will always be a bitmask (row numbers) or a hash table. This part we will push down to the reader, and it has nothing to do with the processor. Some post-processing will need to be done on the result vector when pushdown is not possible. For updates we will do the same, except that we need some common format for the updated values.
@yingsu00 : Thanks for this detailed doc. Please link to prestodb/presto#19811 to give context of how this is used in Prestissimo. Regarding my take on some of the issues debated here:
@yingsu00, @majetideepak : We could add a
@aditi-pandit The idea of
On the split representation, we can compose the base split with additional framework-specific data. It does not need to be a subclass of any interface.
Thanks @Yuhta. I think we cross-posted. Yeah, there is a lot of overlap. I'm having a bit of trouble understanding how Hudi MOR would translate to a bitmask/hash table. Any ideas you can share here?
@aditi-pandit Hudi MOR should translate to a hash table. To summarize, the Iceberg and Hudi table formats provide deletes and updates (i.e. deletes + inserts). The deletes can be of two types: 1) positional (Iceberg), 2) equality (Iceberg, Hudi). The AsyncDataCache can be applied to delete files, though it would be less useful since they are ephemeral.
@aditi-pandit Hudi MOR is based on equality, so for single-column keys, we could get by with manipulating the
@majetideepak Yes, the new layer will sit between the data source and the file reader, keeping internal state about the deltas on a per-split basis. To the file reader it will provide positional and value-based data (with possibly updated cell values). From the data source it will be invoked at add split, read next batch, add dynamic filter, and finish split, and these can be the main interface it provides to the data source.
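For illustration only, a rough sketch of what such a split-level delta layer could look like; every name and type below is an assumption made for the example, not the actual proposal:

```cpp
// Hypothetical sketch only; names and types are illustrative, not from the PR.
#include <cstdint>
#include <string>
#include <unordered_set>
#include <vector>

// Result of reading the delete data for a split: either positional
// (row numbers / bitmask) or equality-based (a key set to probe).
struct DeleteResult {
  std::vector<int64_t> deletedRowNumbers;   // positional deletes
  std::unordered_set<int64_t> deletedKeys;  // single-column equality deletes
};

// A split-level "delta processor" sitting between the data source and the
// file reader, keeping per-split state about deletes and updates.
class SplitDeltaProcessor {
 public:
  virtual ~SplitDeltaProcessor() = default;

  // Called when the data source receives a new split; reads the delete data
  // referenced by the split (Iceberg delete files, Hudi log files, ...).
  virtual void addSplit(const std::string& splitInfo) = 0;

  // Called while producing the next batch; returns the rows to suppress (or
  // keys to reject) so they can be pushed down to the reader when possible.
  virtual const DeleteResult& deletesForBatch(
      uint64_t batchStartRow,
      uint64_t batchSize) = 0;

  // Dynamic filters discovered at runtime are forwarded here as well.
  virtual void addDynamicFilter(
      const std::string& column,
      const std::string& serializedFilter) = 0;

  // Called when the split is exhausted; clears per-split state.
  virtual void finishSplit() = 0;
};
```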
@majetideepak : I'm in agreement with @Yuhta on creating the new abstraction layer between the Reader and DataSource with these operations (reading/interpreting delete files and positional deletes) that might be significantly different between Iceberg and Hudi. @yingsu00 : Any thoughts? Your current PR would only need a little refactoring to achieve this new design.
I think this may lose the opportunity for performance gains, because if we push these as an IN filter or equality filter on the recordIds, the reader can potentially skip a lot of rows without having to decompress and decode them. Making another hash table and probing it after all recordIds are read may be much slower. Besides, the IN filters already build hash tables internally and there's no need to duplicate that part of the code. Moreover, we may not need to copy the full scanSpec. Instead, we can just remember and restore the filters.
The "remember and restore" is the tricky part; the only way to do it correctly is to keep the filter aside (maybe in
Creating the hash table is about the same cost as creating the filter, but the benefit comes from skipping rows in the readers.
Yes, many customers use Iceberg as a meta store.
Yes, Gluten supports some features but not all yet. We used Ying's PR and will collaborate on the remaining features.
@FelixYBW I have tested Ying's PR for Iceberg MOR tables with Gluten, and it works with Gluten. After these PRs are merged, we can add support for Iceberg MOR tables in Gluten.
Thank you for the update, @liujiayi771, and thanks to @yingsu00 for the work.
Description
This document describes Iceberg read support in Velox.
Iceberg is a widely adopted data lake table format. It uses a set of metadata and manifest files to manage the tables, and was carefully designed to provide the following functionalities and features:
Supports both Merge On Read and Copy On Write
In MOR mode
The table and its data files are listed in the manifests, therefore no expensive directory listing is required when scheduling the splits
In this doc we will focus on Velox support for 1, reading data with Iceberg DeleteFiles.
Overview
In Prestissimo, the existing Iceberg connector achieves these functionalities during the planning and scheduling phases by calling into the Iceberg library on the coordinator. The planner and MetadataManager on the coordinator get the Iceberg table information, including the snapshot information, by issuing API calls to the Iceberg library. Since Velox only supports filter pushdown in the HiveConnector, a special planner rule is applied during the planning phase to push down the filters to the scan. The scheduler creates Prestissimo IcebergSplits, which are serialized and converted to HiveIcebergSplits for Velox.
HiveIcebergSplit and IcebergDeleteFile
The HiveIcebergSplit extends the HiveConnectorSplit and adds a special field called `deleteFiles`, which is a vector of `iceberg::IcebergDeleteFile`. The `IcebergDeleteFile` contains information about which rows in the base data file are deleted. The delete files can be in any Hive file format, e.g. Parquet, ORC, AVRO, but in Prestissimo they would be in the same file format as the base data files. There are two types of Iceberg delete files:
Positional delete files: mark a row deleted by its data file path and the row position in the data file
Equality delete files: mark a row deleted by one or more column values, like id = 5
The `IcebergDeleteFile` has a `filePath` field which points to an Iceberg delete file, which could be either a positional delete file or an equality delete file. `equalityFieldIds` is for equality delete files only; `lowerBounds` and `upperBounds` are for positional delete files only.
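For illustration, a simplified sketch of these structures; the field names follow the description above, but the exact types and layout are assumptions rather than the actual Velox definitions:

```cpp
// Simplified sketch; not the actual Velox definitions.
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

enum class FileContent { kPositionalDeletes, kEqualityDeletes };

struct IcebergDeleteFile {
  FileContent content;                    // positional or equality deletes
  std::string filePath;                   // path to the delete file
  std::string fileFormat;                 // e.g. PARQUET, ORC, AVRO
  std::vector<int32_t> equalityFieldIds;  // equality delete files only
  // Per-field min/max bounds, keyed by Iceberg field id (serialized values);
  // positional delete files use them for the range pre-filter described below.
  std::unordered_map<int32_t, std::string> lowerBounds;
  std::unordered_map<int32_t, std::string> upperBounds;
};

// The Iceberg split is the Hive split plus the list of delete files.
struct HiveIcebergSplit /* : HiveConnectorSplit */ {
  std::string filePath;  // base data file
  uint64_t start = 0;    // byte offset of the split within the file
  uint64_t length = 0;
  std::vector<IcebergDeleteFile> deleteFiles;
};
```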
IcebergDataSource
The `IcebergDataSource` is used to read the delete files and interpret the positional deletes as row numbers, encapsulating them in the Mutation object, or to construct predicates from the equality delete files. These delete rows or predicates are then pushed down into the readers. Bearing in mind the common practice of writing code, namely reducing code duplication while keeping necessary structure and encapsulation, we create this class and make it extend HiveDataSource. The operations in this class do NOT overlap with those in `HiveDataSource`, but form an Iceberg domain-specific layer on top of `HiveDataSource`. The major Iceberg operations are:
The `IcebergDataSource` is created when the HiveConnector receives a `HiveIcebergSplit`, which is detected by checking the `customSplitInfo`. There are alternative ways to do this check, e.g. using a special connectorId, or checking the TableHandle type, etc. We discarded the connectorId approach because Velox uses non-fixed connector Ids and the tests could use other names. Other ways are also possible.
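For concreteness, a minimal sketch of this dispatch, assuming `customSplitInfo` carries a table-format entry; the key name, value, and helper function below are hypothetical:

```cpp
// Hypothetical sketch of the customSplitInfo check; key/value are assumptions.
#include <memory>
#include <string>
#include <unordered_map>

struct HiveConnectorSplit {
  std::unordered_map<std::string, std::string> customSplitInfo;
};

struct DataSource {
  virtual ~DataSource() = default;
};
struct HiveDataSource : DataSource {};
struct HiveIcebergDataSource : HiveDataSource {};

std::unique_ptr<DataSource> makeDataSource(const HiveConnectorSplit& split) {
  auto it = split.customSplitInfo.find("table_format");
  if (it != split.customSplitInfo.end() && it->second == "hive-iceberg") {
    // Iceberg split: this data source will also open the delete files.
    return std::make_unique<HiveIcebergDataSource>();
  }
  return std::make_unique<HiveDataSource>();
}
```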
Positional Deletes
The positional delete file contains two special Iceberg columns, "file_path" and "pos", and an optional third column "row".
They have special Iceberg field IDs and are defined in IcebergMetadataColumn.h. "file_path" should match the base data file path, and "pos" holds the positions of the rows deleted in that file. "row" is not included in the first PR but will be added in the future.
In Velox, the "pos" column values are the row numbers of the deleted rows in the data file, and are converted into a delete bitmap within each batch, with row numbers adjusted to be relative to the beginning of the batch. For this we introduced the field `readOffset_`, which is the read offset of the current batch from the beginning of the split in number of rows, and `firstRowInSplit_`, which is the file row position of the first row in the split. The conversion is done in `HiveIcebergDataSource::readNext(uint64_t size)`; this is because `Mutation` does not support an offset yet, so we have to construct the bitmap for every batch instead of for the whole split. Improving the mutation performance is something we will do later.

In fact, the row numbers could be passed directly to the readers. The current implementation has to convert the row numbers into bitmaps, then read the bitmap later in the SelectiveStructColumnReader and re-populate them as row numbers. This extra conversion can be avoided and we will do that in a future PR.
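A simplified sketch of this conversion, assuming the delete positions are already sorted and using a plain std::vector<bool> in place of the actual bitmap representation:

```cpp
// Sketch of converting file-level delete positions into a per-batch bitmap.
// readOffset_ / firstRowInSplit_ mirror the fields mentioned above.
#include <cstdint>
#include <vector>

std::vector<bool> makeDeleteBitmapForBatch(
    const std::vector<int64_t>& deletedFilePositions,  // sorted "pos" values
    int64_t firstRowInSplit,  // file row number of the first row in the split
    uint64_t readOffset,      // rows already read within the split
    uint64_t batchSize) {
  std::vector<bool> bitmap(batchSize, false);
  const int64_t batchStart = firstRowInSplit + static_cast<int64_t>(readOffset);
  const int64_t batchEnd = batchStart + static_cast<int64_t>(batchSize);
  for (int64_t pos : deletedFilePositions) {
    if (pos < batchStart) {
      continue;  // deleted row is before this batch
    }
    if (pos >= batchEnd) {
      break;  // positions are sorted, so the rest are after this batch
    }
    bitmap[pos - batchStart] = true;  // row number relative to the batch start
  }
  return bitmap;
}
```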
Note that this process doesn't need to read and construct extra columns like Trino does. Trino adds an additional column "_pos" to the existing columns in the base data file, and then constructs the positional delete filter on this extra column. This incurs extra CPU and memory costs, which Velox avoids.
Delete row range pre-filter
The IcebergDataSource maintains two fields, startRowInSplit_ and endRowInSplit_, which identify the start and end row numbers of this split in its data file. It obtains startRowInSplit_ and endRowInSplit_ when adding each split in addSplit(). The DeleteFile also has two fields called lowerBounds and upperBounds. When opening the delete file in openDeletes(), the lower bound and upper bound are compared against startRowInSplit_ and endRowInSplit_. If the delete file doesn't contain any rows in this split, we can skip opening it.
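A minimal sketch of this pre-filter, assuming the lower and upper bounds of the "pos" column have already been decoded to row numbers:

```cpp
// Sketch of the delete-file range pre-filter described above.
#include <cstdint>

bool deleteFileMayApplyToSplit(
    int64_t deleteFileLowerPos,  // smallest deleted position in the delete file
    int64_t deleteFileUpperPos,  // largest deleted position in the delete file
    int64_t startRowInSplit,
    int64_t endRowInSplit) {
  // If the delete file's position range does not overlap the split's row
  // range, the delete file can be skipped without opening it.
  return deleteFileUpperPos >= startRowInSplit &&
      deleteFileLowerPos <= endRowInSplit;
}
```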
Equality Deletes
Equality delete files identify deleted rows in a collection of data files by one or more column values. Equality delete files store any subset of a table’s columns and use the table’s field ids.
An equality delete file can contain multiple columns and multiple rows. The columns are conjuncts (AND) and the rows are disjuncts (OR).
example delete file 1:
id = 4 AND category = ‘Bear’
example delete file 2:
id IN {4, 7, 9, 100, 145}
example delete file 3:
(id = 4 AND category IS NULL) OR (id = 7 AND category = ‘Bear’) OR (id = 9 AND category = ‘Bear’)
Note that before we decide which part of the filters can be pushed to scan and readers, we need to convert the disjuncts into conjuncts. E.g.
(partkey < 2 and suppkey < 2) or (partkey > 9 and suppkey > 9)
will be converted to
(partkey < 2 OR partkey > 9) AND (suppkey < 2 OR suppkey > 9) AND (partkey < 2 OR suppkey > 9) AND (suppkey < 2 OR partkey > 9)
In Presto, this is done by the `LogicalExpressionRewriter` in the `SimplifyRowExpressions` rule. It will be simplified and transformed a few more times in several planner rules. The `HiveFilterPushdown` rule will push the non-dynamic filters into the scan. In the above example, (partkey < 2 OR partkey > 9) AND (suppkey < 2 OR suppkey > 9) will become domain filters pushed to the reader, while (partkey < 2 OR suppkey > 9) AND (suppkey < 2 OR partkey > 9) will be the remaining filters, evaluated after the ColumnReaders have loaded the two columns. There are no other remaining filters that have to be evaluated in a FilterNode.

There are two places we can do the above process:
For 1 to happen, note that the delete files can only be acquired after the split is created. This is after planning is done, and requires the scheduler to open the delete files, create a `HivePageSource`, read the content, construct the expression, reuse the `LogicalExpressionRewriter` to rewrite the expression, then merge the filters with the existing ones in the `HiveTableLayoutHandle`. The concern is that opening the delete files on the coordinator might become a bottleneck. On the other hand, we can build local caches to avoid opening the same files over and over, and the expressions only need to be rewritten once. This is expected to work considering the whole partition or table may share the same delete files. Velox doesn't need changes since the filters would be handled on the coordinator and pushed down to the scan.

If we choose 2 and open the delete files in the Velox scan, we will have to manually rewrite and simplify the expressions, because there is no equivalent of the LogicalExpressionRewriter in Velox. But luckily, since they are all equality predicates, all of them can be handled in TableScan because there are no dynamic filters involved. We can treat these situations separately
We can also have a local equality delete file cache on the Velox workers to improve the performance.
We will need to make POCs for both options and see if the performance is ok. If you have thoughts please feel free to comment in this issue.
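As a rough illustration of option 2 (handling equality deletes inside the Velox scan), the sketch below turns single-column equality deletes into a value set that could be rejected in the column reader, and keeps multi-column deletes as a residual row-level check. All types and names are simplified stand-ins, not Velox APIs:

```cpp
// Illustrative stand-ins only; not Velox APIs.
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

struct EqualityDeleteBatch {
  std::vector<std::string> columnNames;    // columns stored in the delete file
  std::vector<std::vector<int64_t>> rows;  // deleted key tuples (one per row)
};

struct ScanDeleteFilters {
  // columnName -> values to reject; can be pushed down to the column reader.
  std::unordered_map<std::string, std::unordered_set<int64_t>> rejectedValues;
  // Residual "is this row deleted" predicates, evaluated on the loaded key
  // columns (in the same order as the delete file's columns).
  std::vector<std::function<bool(const std::vector<int64_t>&)>> remaining;
};

void addEqualityDeletes(
    const EqualityDeleteBatch& deletes,
    ScanDeleteFilters& out) {
  if (deletes.columnNames.size() == 1) {
    // e.g. id IN {4, 7, 9}: reject these values directly in the reader.
    auto& values = out.rejectedValues[deletes.columnNames[0]];
    for (const auto& row : deletes.rows) {
      values.insert(row[0]);
    }
    return;
  }
  // Multi-column deletes: each row is a conjunction and the file is a
  // disjunction of rows, so keep it as a residual check.
  out.remaining.push_back(
      [rows = deletes.rows](const std::vector<int64_t>& keyValues) {
        for (const auto& deleted : rows) {
          if (deleted == keyValues) {
            return true;  // matches one delete tuple: the row is deleted
          }
        }
        return false;
      });
}
```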
Type Conversion
Iceberg has its own data types. But for Prestissimo, the type conversions for the base data files are done in the planning phase on the coordinator, so by the time Velox receives the split, the types in the column handles are already Velox types.
The delete files need to be treated separately. But since the delete files have a pre-defined schema and the types are already known, we will use Velox types directly. E.g. the "pos" column would be read directly as a BIGINT column.
Metadata Columns
Iceberg defines metadata columns in the spec https://iceberg.apache.org/spec/#table-metadata-fields and in the Iceberg library:
These metadata columns will be used internally by Prestissimo. For example, an update query may require the "row" and "pos" fields to be added to the TableScan so that new values can be merged into the right rows by later operators.
We will also support directly querying these metadata columns in the future. Currently Presto Hive and Iceberg support reading two metadata columns, "$path" and "$file_modified_time":
We can also query
Note that these columns are not in the actual data files, but need to be constructed by the data source and readers according to Iceberg spec. These extra columns can be added through Column Projections in HiveIcebergDataSource.
However, to support reads with positional delete files alone, we found there is no need to include additional metadata fields like "_file" or "_pos" in the base data file read. The information for these two columns is self-contained in the data files. This is different from Trino's implementation, which does need to add the "_pos" field to the base file read.
Column Projections and Schema Evolution
Columns in Iceberg data files are selected by field id. The table schema's column names and order may change after a data file is written, so projection must be done using field ids. This can be used to implement schema evolution. It can also be used to query metadata columns that do not exist in the data file.
In Presto, the column mappings are done at the HivePageSource level, and not just for Iceberg. We can do the same in Velox in HiveDataSource. Also, schema evolution is better done for all Hive tables, not just Iceberg, if they share the same semantics. As far as I know, Meta already has plans to support it in the Hive connector for all table formats and file formats, but the design is unclear at the moment. We will need to learn more about it before deciding how to support schema evolution.
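A minimal sketch of field-id based projection, using plain standard-library types and a hypothetical helper name:

```cpp
// Hypothetical sketch of projecting columns by Iceberg field id.
#include <cstdint>
#include <optional>
#include <unordered_map>
#include <vector>

std::vector<std::optional<int32_t>> projectByFieldId(
    const std::vector<int32_t>& requestedFieldIds,  // from the table schema
    const std::vector<int32_t>& fileFieldIds) {     // ids stored in the file
  std::unordered_map<int32_t, int32_t> fileIndexById;
  for (int32_t i = 0; i < static_cast<int32_t>(fileFieldIds.size()); ++i) {
    fileIndexById.emplace(fileFieldIds[i], i);
  }
  std::vector<std::optional<int32_t>> projection;
  projection.reserve(requestedFieldIds.size());
  for (int32_t id : requestedFieldIds) {
    auto it = fileIndexById.find(id);
    if (it == fileIndexById.end()) {
      // Column is missing from this file (added later, or a metadata column):
      // the data source must fill it with nulls or a constructed value.
      projection.push_back(std::nullopt);
    } else {
      projection.push_back(it->second);
    }
  }
  return projection;
}
```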