
[Go] How to handle large lists? #33875

Closed
minyoung opened this issue Jan 26, 2023 · 4 comments · Fixed by #33965

@minyoung (Contributor)

Describe the usage question you have. Please include as many useful details as possible.

I'm encountering a panic when trying to generate a Parquet file containing a very large list of (nested) strings:

panic: runtime error: slice bounds out of range [:-2147483636]

goroutine 22 [running]:
github.com/apache/arrow/go/v10/parquet/pqarrow.writeDenseArrow(0x1400059b4a0, {0x105684320, 0x14008189c80}, {0x105686490?, 0x14009402410?}, {0x14657c7c000, 0x4000b00, 0x8000000}, {0x14647c7a000, 0x4000b00, ...}, ...)
        /Users/minyoung/go/pkg/mod/github.com/apache/arrow/go/[email protected]/parquet/pqarrow/encode_arrow.go:442 +0x22b0
github.com/apache/arrow/go/v10/parquet/pqarrow.WriteArrowToColumn({0x10567d4a0, 0x140008667e0}, {0x105684320, 0x14008189c80}, {0x105686490?, 0x14009402410?}, {0x14657c7c000, 0x4000b00, 0x8000000}, {0x14647c7a000, ...}, ...)
        /Users/minyoung/go/pkg/mod/github.com/apache/arrow/go/[email protected]/parquet/pqarrow/encode_arrow.go:227 +0x1e8
github.com/apache/arrow/go/v10/parquet/pqarrow.(*ArrowColumnWriter).Write(0x14008879a08, {0x10567d4a0, 0x140008667e0})
        /Users/minyoung/go/pkg/mod/github.com/apache/arrow/go/[email protected]/parquet/pqarrow/encode_arrow.go:195 +0x420
github.com/apache/arrow/go/v10/parquet/pqarrow.(*FileWriter).WriteColumnChunked(0x14000976000, 0x1400166c840?, 0x14008879ae0?, 0x1?)
        /Users/minyoung/go/pkg/mod/github.com/apache/arrow/go/[email protected]/parquet/pqarrow/file_writer.go:282 +0xb8
github.com/apache/arrow/go/v10/parquet/pqarrow.(*FileWriter).WriteColumnData(0x14000976000?, {0x1056861f0, 0x1400059b4f0})
        /Users/minyoung/go/pkg/mod/github.com/apache/arrow/go/[email protected]/parquet/pqarrow/file_writer.go:291 +0xb0
github.com/apache/arrow/go/v10/parquet/pqarrow.(*FileWriter).Write(0x14000976000, {0x105685228, 0x14000964330})
        /Users/minyoung/go/pkg/mod/github.com/apache/arrow/go/[email protected]/parquet/pqarrow/file_writer.go:206 +0x1f4
main.generateParquet.func1()
        /Users/minyoung/repo/core/services/gorilla/cmd/adhoc/main.go:578 +0x3f8
golang.org/x/sync/errgroup.(*Group).Go.func1()
        /Users/minyoung/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:75 +0x5c
created by golang.org/x/sync/errgroup.(*Group).Go
        /Users/minyoung/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:72 +0xa4
exit status 2

I can recreate the error with:

func (ps *ParquetIOTestSuite) TestLargeListOfStrings() {
	bldr := array.NewStructBuilder(memory.DefaultAllocator, arrow.StructOf(
		arrow.Field{
			Nullable: true,
			Name:     "l",
			Type:     arrow.ListOf(arrow.BinaryTypes.String),
		},
	))
	defer bldr.Release()

	lBldr := bldr.FieldBuilder(0).(*array.ListBuilder)
	sBldr := lBldr.ValueBuilder().(*array.StringBuilder)

	for i := 0; i < 10; i++ {
		bldr.Append(true)
		lBldr.Append(true)
		for j := 0; j < 5; j++ {
			// each value is ~1GB, so the cumulative int32 offsets in the
			// string child overflow after only a couple of appends
			sBldr.Append(strings.Repeat(fmt.Sprintf("%d:%d|", i, j), 250000000))
		}
	}

	arr := bldr.NewArray()
	defer arr.Release()

	field := arrow.Field{Name: "x", Type: arr.DataType(), Nullable: true}
	expected := array.NewTable(arrow.NewSchema([]arrow.Field{field}, nil),
		[]arrow.Column{*arrow.NewColumn(field, arrow.NewChunked(field.Type, []arrow.Array{arr}))}, -1)
	defer expected.Release()

	ps.roundTripTable(expected, false)
}

From skimming through the code and some breakpointing, it seems like the int32 offsets used have overflowed, hence the error with a number around -2^31. It seems like I should use LargeListOf and/or LargeString in the schema, but when attempting to do so I get a "not implemented yet" error.
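
For reference, switching to the 64-bit-offset variants would look roughly like this (a sketch mirroring the repro above; this is the schema change that currently hits the "not implemented yet" error on write):

	// Same struct-of-list layout as the repro, but with int64 offsets
	// so a single column chunk can hold more than 2^31 bytes of values.
	largeType := arrow.StructOf(
		arrow.Field{
			Nullable: true,
			Name:     "l",
			Type:     arrow.LargeListOf(arrow.BinaryTypes.LargeString),
		},
	)
	bldr := array.NewStructBuilder(memory.DefaultAllocator, largeType)
	defer bldr.Release()
	lBldr := bldr.FieldBuilder(0).(*array.LargeListBuilder)
	sBldr := lBldr.ValueBuilder().(*array.LargeStringBuilder)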

How much work would it be to implement LargeListOf/LargeString? And if that were done, would it resolve the panic?

Writing out smaller row groups does work (presumably because the buffers and offsets are reset per row group), but we would like to keep the number of row groups small: we only access a small number of columns at a time, and our datasets can get very wide.
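
For anyone else hitting this, the workaround is just splitting the write into many row groups, e.g. (a sketch, assuming the usual parquet/pqarrow imports; the 64k-row chunk size is an arbitrary illustration, not a tuned value):

	// Write the table as many small row groups so the cumulative int32
	// offsets in each column chunk stay under 2^31.
	func writeSmallRowGroups(tbl arrow.Table, w io.Writer) error {
		return pqarrow.WriteTable(tbl, w,
			64*1024, // chunkSize: max rows per row group
			parquet.NewWriterProperties(),
			pqarrow.DefaultWriterProps())
	}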

Component(s)

Go

minyoung added the Type: usage (Issue is a user question) label Jan 26, 2023
@zeroshade (Member)

@minyoung You're absolutely correct that the solution here is to leverage LargeList and/or LargeString. There's probably more work than just modifying the schema generation (there are likely spots that check only for String or List and should also handle LargeString and LargeList, e.g. by utilizing the ListLike interface, and so on), but in theory it shouldn't be too much work to support them properly. It'd be a good exercise in test-driven development, honestly.
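
To be concrete, the kind of change I mean is (a rough sketch, not the actual code paths) replacing concrete-type checks with checks that also cover the 64-bit variants:

	// sketch: handle both offset widths instead of matching only String/List
	switch a := arr.(type) {
	case *array.String, *array.LargeString:
		// shared binary-like path; only the offset width differs
	case array.ListLike:
		// List, LargeList, etc. all satisfy ListLike, and
		// ValueOffsets reports int64 bounds either way
		start, end := a.ValueOffsets(0)
		_, _ = start, end
	}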

I'll add this to my list of stuff to do but if you want to give it a shot to implement, just tag me to review the PR if you file one!

@minyoung (Contributor, Author) commented Feb 1, 2023

@zeroshade I had a go at implementing support for LargeString/LargeBinary: #33965

A potential gotcha I stumbled on, though: while we can now write out array.LargeString, on read we get array.String back. If you're only reading a subset of the dataset at a time (e.g. using the RecordReader), things are fine. If you try to read the entire dataset (or column) at once, though, we're back to the original error in this issue.

Presumably writing with pqarrow.NewArrowWriterProperties(pqarrow.WithStoreSchema()) is supposed to fix this, but it just hasn't been implemented yet?
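
i.e. something like this on the write side (sketch; schema, f, and rec stand in for the Arrow schema, output file, and record from the surrounding code):

	// Store the original Arrow schema in the file metadata so a reader
	// can reconstruct LargeString instead of defaulting to String.
	props := pqarrow.NewArrowWriterProperties(pqarrow.WithStoreSchema())
	w, err := pqarrow.NewFileWriter(schema, f, parquet.NewWriterProperties(), props)
	if err != nil {
		return err
	}
	defer w.Close()
	return w.Write(rec) // rec contains array.LargeString columns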

Aside: I found it humorous that the C++ side of things also errors out.

@minyoung (Contributor, Author) commented Feb 1, 2023

Updated the PR to now read into array.LargeString/array.LargeBinary if the schema is stored.

If the schema is not stored, then we'll read into array.String/array.Binary. I'm not sure if we should try to do something fancy like looking at the column chunk sizes (the sum of them?) and automatically using LargeString/LargeBinary if they're >2GB?
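
Something along these lines (pure sketch; totalChunkBytes is a placeholder for whatever the column-chunk metadata actually exposes, and whether to sum compressed or uncompressed sizes is part of the question):

	// Hypothetical heuristic: fall back to 64-bit offsets when the summed
	// column chunk sizes could overflow int32 offsets.
	func stringTypeFor(totalChunkBytes int64) arrow.DataType {
		if totalChunkBytes > math.MaxInt32 {
			return arrow.BinaryTypes.LargeString
		}
		return arrow.BinaryTypes.String
	}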

@zeroshade (Member)

I think it's fine for now to store the schema and use that properly. If you feel like being fancy and automatically choosing based on the column sizes, feel free to add that to the PR, but I don't think it's necessary for a first pass here. Thanks for the work!

zeroshade pushed a commit that referenced this issue Feb 2, 2023
### Rationale for this change

Handle writing `array.LargeString` and `array.LargeBinary` data types. This allows Parquet files to contain more than 2GB of binary data in a single column chunk.

### Are these changes tested?

Unit tests included

### Are there any user-facing changes?

* Closes: #33875

Authored-by: Min-Young Wu <[email protected]>
Signed-off-by: Matt Topol <[email protected]>
zeroshade added this to the 12.0.0 milestone Feb 2, 2023
sjperkins pushed a commit to sjperkins/arrow that referenced this issue Feb 10, 2023
gringasalpastor pushed a commit to gringasalpastor/arrow that referenced this issue Feb 17, 2023
fatemehp pushed a commit to fatemehp/arrow that referenced this issue Feb 24, 2023