Is it possible to write a Parquet file with a column of type array of struct/class? For instance:
Hi @nchabra, this isn't currently possible with the row-oriented API, but we do want to improve support for nested data in the future. It can be done with the lower-level API, but that gets fairly complicated because you have to specify the repetition levels and definition levels yourself.

Here's an example of how you could write data like this:

using System;
using System.Linq;
using ParquetSharp;
using ParquetSharp.Schema;

// MyColumnItemType is assumed to be your item class, with float members Field1 and Field2,
// and filePath is the path of the file to write.
// Create column data, each row has an array of MyColumnItemType
var values = new []
{
    new []
    {
        new MyColumnItemType(0.0f, 1.0f),
        new MyColumnItemType(2.0f, 3.0f),
    },
    Array.Empty<MyColumnItemType>(),
    new []
    {
        new MyColumnItemType(4.0f, 5.0f),
        new MyColumnItemType(6.0f, 7.0f),
        new MyColumnItemType(8.0f, 9.0f),
    }
};
// Construct Parquet file schema,
// see https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists for the required schema for lists.
var float1 = new PrimitiveNode("float1", Repetition.Required, LogicalType.None(), PhysicalType.Float);
var float2 = new PrimitiveNode("float2", Repetition.Required, LogicalType.None(), PhysicalType.Float);
var element = new GroupNode("element", Repetition.Required, new[] {float1, float2});
var list = new GroupNode("list", Repetition.Repeated, new[] {element});
var column1 = new GroupNode("column_name", Repetition.Optional, new[] {list}, LogicalType.List());
var schemaNode = new GroupNode("schema", Repetition.Required, new[] {column1});
using var builder = new WriterPropertiesBuilder();
using var fileWriter = new ParquetFileWriter(filePath, schemaNode, builder.Build());
using var rowGroupWriter = fileWriter.AppendRowGroup();
// numValues is the length of definition level and repetition level data,
// so we need to add an extra 1 for each empty array
var numValues = values.Select(innerValues => Math.Max(innerValues.Length, 1)).Sum();
// Repetition levels are 0 for the start of a new list or 1 for a subsequent value.
// We also need to include a 0 value for the start of an empty list.
var repetitionLevels = values.SelectMany(
    innerValues => innerValues.Length == 0
        ? new short[]{0}
        : innerValues.Select((_, idx) => idx == 0 ? (short) 0 : (short) 1)).ToArray();
// Definition levels are 1 if only the list is defined with no value (so the list is empty),
// otherwise 2 when we have a valid element value.
// Note: ParquetSharp currently only supports list columns that are optional, but here we're assuming all
// arrays are non-null. Null arrays should be represented with a 0 definition level.
var definitionLevels = values.SelectMany(
    innerValues => innerValues.Length == 0
        ? new short[]{1}
        : innerValues.Select(_ => (short) 2)).ToArray();
// Write the float1 leaf column; the levels are shared by both leaf columns.
using (var columnWriter = (ColumnWriter<float>) rowGroupWriter.NextColumn())
{
    var leafValues = values.SelectMany(
        itemArray => itemArray.Select(item => item.Field1)).ToArray();
    columnWriter.WriteBatch(
        numValues, definitionLevels.AsSpan(), repetitionLevels.AsSpan(), leafValues.AsSpan());
}
// Write the float2 leaf column with the same levels.
using (var columnWriter = (ColumnWriter<float>) rowGroupWriter.NextColumn())
{
    var leafValues = values.SelectMany(
        itemArray => itemArray.Select(item => item.Field2)).ToArray();
    columnWriter.WriteBatch(
        numValues, definitionLevels.AsSpan(), repetitionLevels.AsSpan(), leafValues.AsSpan());
}
fileWriter.Close();
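To make the level encoding easier to check in isolation, here is a minimal sketch of the same rules as a standalone helper. `ListLevels` and `Compute` are hypothetical names (they are not part of ParquetSharp), and the sketch assumes the schema shape used above: an optional list whose elements are required. Unlike the example above, it also handles null arrays by giving them a definition level of 0.

```csharp
using System.Collections.Generic;

public static class ListLevels
{
    // Hypothetical helper, not a ParquetSharp API.
    // Repetition level: 0 starts a new row, 1 continues the current list.
    // Definition level: 0 = null list, 1 = empty list, 2 = a defined element.
    public static (short[] Repetition, short[] Definition) Compute<T>(IEnumerable<T[]> rows)
    {
        var repetition = new List<short>();
        var definition = new List<short>();
        foreach (var row in rows)
        {
            if (row == null)
            {
                // A null list still occupies one entry in the level data.
                repetition.Add(0);
                definition.Add(0);
            }
            else if (row.Length == 0)
            {
                // An empty list also occupies one entry, but is marked as defined.
                repetition.Add(0);
                definition.Add(1);
            }
            else
            {
                for (var i = 0; i < row.Length; i++)
                {
                    repetition.Add((short) (i == 0 ? 0 : 1));
                    definition.Add(2);
                }
            }
        }
        return (repetition.ToArray(), definition.ToArray());
    }
}
```

For rows with two items, no items, and three items (the shape of `values` above), this gives repetition levels 0, 1, 0, 0, 1, 1 and definition levels 2, 2, 1, 2, 2, 2. That is six entries, which matches `numValues`, while only five leaf values are written per column.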
If you then want to read data back with ParquetSharp you can either use the low-level ColumnReader.ReadBatch method to reconstruct the data yourself using the definition and repetition levels, or you can use the LogicalColumnReader API, which will let you read the leaf-level column data (e.g. all field1 values as an array of arrays and a separate column of field2 values) but there are some limitations to inte…
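For the low-level route, the reconstruction step could look roughly like the sketch below. It is plain C# with no ParquetSharp calls: `ListDecoder.Decode` is a hypothetical helper, and it assumes you have already read the definition levels, repetition levels, and the compact leaf values for one column into arrays, using the same level meanings as the write example (0 = null list, 1 = empty list, 2 = defined element).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ListDecoder
{
    // Hypothetical helper, not a ParquetSharp API.
    // Rebuilds one array per row from the levels and the leaf values of a single column.
    public static List<T[]> Decode<T>(short[] defLevels, short[] repLevels, T[] leafValues)
    {
        if (defLevels.Length != repLevels.Length)
        {
            throw new ArgumentException("Definition and repetition levels must have the same length.");
        }

        var rows = new List<List<T>>();
        var valueIndex = 0;
        for (var i = 0; i < defLevels.Length; i++)
        {
            if (repLevels[i] == 0)
            {
                // Repetition level 0 starts a new row; definition level 0 means the list is null.
                rows.Add(defLevels[i] == 0 ? null : new List<T>());
            }
            if (defLevels[i] == 2)
            {
                // Only fully defined entries consume a leaf value.
                rows[rows.Count - 1].Add(leafValues[valueIndex++]);
            }
        }
        return rows.Select(row => row?.ToArray()).ToList();
    }
}
```

Running this once per leaf column (field1 and field2) and pairing the results by position would give back the per-row item arrays.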