Column of complex type array #275

nchabra · 2022-05-26T02:42:59Z

Is it possible to write a Parquet file with a column of type array of struct/class?

For instance:

struct MyColumnItemType { float field1; float field2; }
Column<MyColumnItemType[]> myColumn;

adamreeve · 2022-05-27T01:00:52Z

Collaborator

Hi @nchabra, this isn't currently possible with the row-oriented API, but we do want to improve support for nested data in the future. It can be done with the lower-level API, but it gets pretty complicated because you have to specify the repetition levels and definition levels yourself.

If you then want to read the data back with ParquetSharp, you can either use the low-level ColumnReader.ReadBatch method and reconstruct the data yourself from the definition and repetition levels, or use the LogicalColumnReader API, which lets you read the leaf-level column data (e.g. all field1 values as an array of arrays, and a separate column of field2 values). Note that there are some limitations to interpreting the data when you have multiple levels of nullability (see #221 (comment)).

Here's an example of how you could write data like this:

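// Requires: using System; using System.Linq; using ParquetSharp; using ParquetSharp.Schema;
// Assumes MyColumnItemType exposes float members Field1 and Field2 and a (float, float) constructor,
// and that filePath holds the output file path.

// Create column data, each row has an array of MyColumnItemType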
var values = new []
{
    new []
    {
        new MyColumnItemType(0.0f, 1.0f),
        new MyColumnItemType(2.0f, 3.0f),
    },
    Array.Empty<MyColumnItemType>(),
    new []
    {
        new MyColumnItemType(4.0f, 5.0f),
        new MyColumnItemType(6.0f, 7.0f),
        new MyColumnItemType(8.0f, 9.0f),
    }
};

// Construct Parquet file schema,
// see https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists for the required schema for lists.
var float1 = new PrimitiveNode("float1", Repetition.Required, LogicalType.None(), PhysicalType.Float);
var float2 = new PrimitiveNode("float2", Repetition.Required, LogicalType.None(), PhysicalType.Float);
var element = new GroupNode("element", Repetition.Required, new[] {float1, float2});
var list = new GroupNode("list", Repetition.Repeated, new[] {element});
var column1 = new GroupNode("column_name", Repetition.Optional, new[] {list}, LogicalType.List());
var schemaNode = new GroupNode("schema", Repetition.Required, new[] {column1});

using var builder = new WriterPropertiesBuilder();
using var fileWriter = new ParquetFileWriter(filePath, schemaNode, builder.Build());
using var rowGroupWriter = fileWriter.AppendRowGroup();

// numValues is the length of definition level and repetition level data,
// so we need to add an extra 1 for each empty array
var numValues = values.Select(innerValues => Math.Max(innerValues.Length, 1)).Sum();
// Repetition levels are 0 for the start of a new list or 1 for a subsequent value.
// We also need to include a 0 value for the start of an empty list.
var repetitionLevels = values.SelectMany(
    innerValues => innerValues.Length == 0
        ? new short[]{0}
        : innerValues.Select((_, idx) => idx == 0 ? (short) 0 : (short) 1)).ToArray();
// Definition levels are 1 if only the list is defined with no value (so the list is empty),
// otherwise 2 when we have a valid element value.
// Note: ParquetSharp currently only supports list columns that are optional, but here we're assuming all
// arrays are non-null. Null arrays should be represented with a 0 definition level.
var definitionLevels = values.SelectMany(
    innerValues => innerValues.Length == 0
        ? new short[]{1}
        : innerValues.Select(_ => (short) 2)).ToArray();

using (var columnWriter = (ColumnWriter<float>) rowGroupWriter.NextColumn())
{
    var leafValues = values.SelectMany(
        itemArray => itemArray.Select(item => item.Field1)).ToArray();
    columnWriter.WriteBatch(
        numValues, definitionLevels.AsSpan(), repetitionLevels.AsSpan(), leafValues.AsSpan());
}

using (var columnWriter = (ColumnWriter<float>) rowGroupWriter.NextColumn())
{
    var leafValues = values.SelectMany(
        itemArray => itemArray.Select(item => item.Field2)).ToArray();
    columnWriter.WriteBatch(
        numValues, definitionLevels.AsSpan(), repetitionLevels.AsSpan(), leafValues.AsSpan());
}

fileWriter.Close();
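
To make the level bookkeeping above easier to follow, and to show what reconstructing the data from levels could look like on the read side, here is a minimal sketch in plain C# with no ParquetSharp calls. ListLevels, Encode and Decode are hypothetical names made up for this illustration, and the code assumes the same schema shape as above (an optional list of required elements, so the maximum definition level is 2 and the maximum repetition level is 1) and well-formed input. Unlike the example above, it also emits a definition level of 0 for null arrays.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration; it is not part of ParquetSharp.
// Assumes an optional list of required elements:
// max definition level 2, max repetition level 1.
public static class ListLevels
{
    // Flattens rows into definition levels, repetition levels and leaf values.
    public static (short[] Def, short[] Rep, float[] Leaves) Encode(float[][] rows)
    {
        var def = new List<short>();
        var rep = new List<short>();
        var leaves = new List<float>();
        foreach (var row in rows)
        {
            if (row == null || row.Length == 0)
            {
                // Null list: definition level 0. Empty list: definition level 1.
                // Either way one level entry is written and no leaf value.
                def.Add((short) (row == null ? 0 : 1));
                rep.Add(0);
                continue;
            }
            for (var i = 0; i < row.Length; ++i)
            {
                def.Add(2);                          // a real element value
                rep.Add((short) (i == 0 ? 0 : 1));   // 0 starts a new row
                leaves.Add(row[i]);
            }
        }
        return (def.ToArray(), rep.ToArray(), leaves.ToArray());
    }

    // Rebuilds one array per row from the levels and leaf values.
    // Assumes the first repetition level is 0, as in any well-formed column.
    public static float[][] Decode(short[] def, short[] rep, float[] leaves)
    {
        var rows = new List<List<float>>();
        var leafIndex = 0;
        for (var i = 0; i < def.Length; ++i)
        {
            if (rep[i] == 0)
            {
                // Start of a new row: null when nothing is defined,
                // otherwise a list that may stay empty.
                rows.Add(def[i] == 0 ? null : new List<float>());
            }
            if (def[i] == 2)
            {
                // Only fully defined entries consume a leaf value.
                rows[rows.Count - 1].Add(leaves[leafIndex++]);
            }
        }
        return rows.Select(r => r?.ToArray()).ToArray();
    }
}
```

For the Field1 values in the example above (rows {0, 2}, {} and {4, 6, 8}), Encode produces definition levels 2, 2, 1, 2, 2, 2, repetition levels 0, 1, 0, 0, 1, 1 and leaf values 0, 2, 4, 6, 8, which matches numValues = 6 with only 5 leaf values written.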

nchabra May 27, 2022
Author

@adamreeve First, thank you very much for your elaborate answer. I am very grateful.

Following your example, I was able to implement what I needed. Great, thank you!🙏

PS: I don't mind using the lower level API, I actually do not use the row-oriented API for writing files.
