
[Go] How to handle large lists? #33875

Closed
minyoung opened this issue Jan 26, 2023 · 4 comments · Fixed by #33965

@minyoung (Contributor)

Describe the usage question you have. Please include as many useful details as possible.

I'm encountering a panic when trying to generate a Parquet file containing a very large list of (nested) strings:

panic: runtime error: slice bounds out of range [:-2147483636]

goroutine 22 [running]:
github.com/apache/arrow/go/v10/parquet/pqarrow.writeDenseArrow(0x1400059b4a0, {0x105684320, 0x14008189c80}, {0x105686490?, 0x14009402410?}, {0x14657c7c000, 0x4000b00, 0x8000000}, {0x14647c7a000, 0x4000b00, ...}, ...)
        /Users/minyoung/go/pkg/mod/github.com/apache/arrow/go/[email protected]/parquet/pqarrow/encode_arrow.go:442 +0x22b0
github.com/apache/arrow/go/v10/parquet/pqarrow.WriteArrowToColumn({0x10567d4a0, 0x140008667e0}, {0x105684320, 0x14008189c80}, {0x105686490?, 0x14009402410?}, {0x14657c7c000, 0x4000b00, 0x8000000}, {0x14647c7a000, ...}, ...)
        /Users/minyoung/go/pkg/mod/github.com/apache/arrow/go/[email protected]/parquet/pqarrow/encode_arrow.go:227 +0x1e8
github.com/apache/arrow/go/v10/parquet/pqarrow.(*ArrowColumnWriter).Write(0x14008879a08, {0x10567d4a0, 0x140008667e0})
        /Users/minyoung/go/pkg/mod/github.com/apache/arrow/go/[email protected]/parquet/pqarrow/encode_arrow.go:195 +0x420
github.com/apache/arrow/go/v10/parquet/pqarrow.(*FileWriter).WriteColumnChunked(0x14000976000, 0x1400166c840?, 0x14008879ae0?, 0x1?)
        /Users/minyoung/go/pkg/mod/github.com/apache/arrow/go/[email protected]/parquet/pqarrow/file_writer.go:282 +0xb8
github.com/apache/arrow/go/v10/parquet/pqarrow.(*FileWriter).WriteColumnData(0x14000976000?, {0x1056861f0, 0x1400059b4f0})
        /Users/minyoung/go/pkg/mod/github.com/apache/arrow/go/[email protected]/parquet/pqarrow/file_writer.go:291 +0xb0
github.com/apache/arrow/go/v10/parquet/pqarrow.(*FileWriter).Write(0x14000976000, {0x105685228, 0x14000964330})
        /Users/minyoung/go/pkg/mod/github.com/apache/arrow/go/[email protected]/parquet/pqarrow/file_writer.go:206 +0x1f4
main.generateParquet.func1()
        /Users/minyoung/repo/core/services/gorilla/cmd/adhoc/main.go:578 +0x3f8
golang.org/x/sync/errgroup.(*Group).Go.func1()
        /Users/minyoung/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:75 +0x5c
created by golang.org/x/sync/errgroup.(*Group).Go
        /Users/minyoung/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:72 +0xa4
exit status 2

I can recreate the error with:

func (ps *ParquetIOTestSuite) TestLargeListOfStrings() {
	bldr := array.NewStructBuilder(memory.DefaultAllocator, arrow.StructOf(
		arrow.Field{
			Nullable: true,
			Name:     "l",
			Type:     arrow.ListOf(arrow.BinaryTypes.String),
		},
	))
	defer bldr.Release()

	lBldr := bldr.FieldBuilder(0).(*array.ListBuilder)
	sBldr := lBldr.ValueBuilder().(*array.StringBuilder)

	for i := 0; i < 10; i++ {
		bldr.Append(true)
		lBldr.Append(true)
		for j := 0; j < 5; j++ {
			// each value is ~1GB, so the cumulative int32 offsets in the
			// string child overflow after only a couple of appends
			sBldr.Append(strings.Repeat(fmt.Sprintf("%d:%d|", i, j), 250000000))
		}
	}

	arr := bldr.NewArray()
	defer arr.Release()

	field := arrow.Field{Name: "x", Type: arr.DataType(), Nullable: true}
	expected := array.NewTable(arrow.NewSchema([]arrow.Field{field}, nil),
		[]arrow.Column{*arrow.NewColumn(field, arrow.NewChunked(field.Type, []arrow.Array{arr}))}, -1)
	defer expected.Release()

	ps.roundTripTable(expected, false)
}

From skimming through the code and some breakpointing, it seems like the int32 offsets used have overflowed, hence the error with a number around -2^31. It seems like I should use LargeListOf and/or LargeString in the schema, but when attempting to do so I get a "not implemented yet" error.
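
For reference, switching to the 64-bit-offset variants would look roughly like this (a sketch mirroring the repro above; this is the schema change that currently hits the "not implemented yet" error on write):

	// Same struct-of-list layout as the repro, but with int64 offsets
	// so a single column chunk can hold more than 2^31 bytes of values.
	largeType := arrow.StructOf(
		arrow.Field{
			Nullable: true,
			Name:     "l",
			Type:     arrow.LargeListOf(arrow.BinaryTypes.LargeString),
		},
	)
	bldr := array.NewStructBuilder(memory.DefaultAllocator, largeType)
	defer bldr.Release()
	lBldr := bldr.FieldBuilder(0).(*array.LargeListBuilder)
	sBldr := lBldr.ValueBuilder().(*array.LargeStringBuilder)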

How much work would it be to implement LargeListOf/LargeString? And if that were done, would it resolve the panic?

Writing out smaller row groups does work (presumably because the buffers and offsets are reset per row group), but we would like to keep the number of row groups small: we only access a small number of columns at a time, and our datasets can get very wide.
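
For anyone else hitting this, the workaround is just splitting the write into many row groups, e.g. (a sketch, assuming the usual parquet/pqarrow imports; the 64k-row chunk size is an arbitrary illustration, not a tuned value):

	// Write the table as many small row groups so the cumulative int32
	// offsets in each column chunk stay under 2^31.
	func writeSmallRowGroups(tbl arrow.Table, w io.Writer) error {
		return pqarrow.WriteTable(tbl, w,
			64*1024, // chunkSize: max rows per row group
			parquet.NewWriterProperties(),
			pqarrow.DefaultWriterProps())
	}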

Component(s)

Go

minyoung added the Type: usage (Issue is a user question) label Jan 26, 2023
@zeroshade (Member)

@minyoung You're absolutely correct that the solution here is to leverage LargeList and/or LargeString. There's probably more work than just modifying the schema generation (there are likely spots that check only for String or List and should also handle LargeString and LargeList, e.g. by utilizing the ListLike interface, and so on), but in theory it shouldn't be too much work to support them properly. It'd be a good exercise in test-driven development, honestly.
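
To be concrete, the kind of change I mean is (a rough sketch, not the actual code paths) replacing concrete-type checks with checks that also cover the 64-bit variants:

	// sketch: handle both offset widths instead of matching only String/List
	switch a := arr.(type) {
	case *array.String, *array.LargeString:
		// shared binary-like path; only the offset width differs
	case array.ListLike:
		// List, LargeList, etc. all satisfy ListLike, and
		// ValueOffsets reports int64 bounds either way
		start, end := a.ValueOffsets(0)
		_, _ = start, end
	}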

I'll add this to my list of stuff to do but if you want to give it a shot to implement, just tag me to review the PR if you file one!

@minyoung (Contributor, Author) commented Feb 1, 2023

@zeroshade I had a go at implementing support for LargeString/LargeBinary: #33965

A potential gotcha I stumbled on, though: while we can now write out array.LargeString, on read we get array.String back. If you're only reading a subset of the dataset at a time (e.g. using the RecordReader), things are fine. If you try to read the entire dataset (or column) at once, though, we're back to the original error in this issue.

Presumably writing with pqarrow.NewArrowWriterProperties(pqarrow.WithStoreSchema()) is supposed to fix this, but it just hasn't been implemented yet?
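
i.e. something like this on the write side (sketch; schema, f, and rec stand in for the Arrow schema, output file, and record from the surrounding code):

	// Store the original Arrow schema in the file metadata so a reader
	// can reconstruct LargeString instead of defaulting to String.
	props := pqarrow.NewArrowWriterProperties(pqarrow.WithStoreSchema())
	w, err := pqarrow.NewFileWriter(schema, f, parquet.NewWriterProperties(), props)
	if err != nil {
		return err
	}
	defer w.Close()
	return w.Write(rec) // rec contains array.LargeString columns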

Aside: I found it humorous that the C++ side of things also errors out.

@minyoung (Contributor, Author) commented Feb 1, 2023

Updated the PR to now read into array.LargeString/array.LargeBinary if the schema is stored.

If the schema is not stored, then we'll read into array.String/array.Binary. I'm not sure if we should try to do something fancy like looking at the column chunk sizes (the sum of them?) and automatically using LargeString/LargeBinary if they're >2GB?
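
Something along these lines (pure sketch; totalChunkBytes is a placeholder for whatever the column-chunk metadata actually exposes, and whether to sum compressed or uncompressed sizes is part of the question):

	// Hypothetical heuristic: fall back to 64-bit offsets when the summed
	// column chunk sizes could overflow int32 offsets.
	func stringTypeFor(totalChunkBytes int64) arrow.DataType {
		if totalChunkBytes > math.MaxInt32 {
			return arrow.BinaryTypes.LargeString
		}
		return arrow.BinaryTypes.String
	}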

@zeroshade (Member)

I think it's fine for now to store the schema and use that properly. If you feel like being fancy and automatically choosing based on the column sizes, feel free to add that to the PR, but I don't think it's necessary for a first pass here. Thanks for the work!

zeroshade pushed a commit that referenced this issue Feb 2, 2023
### Rationale for this change

Handle writing `array.LargeString` and `array.LargeBinary` data types. This allows Parquet files to contain more than 2GB of binary data in a single column chunk.

### Are these changes tested?

Unit tests included

### Are there any user-facing changes?

* Closes: #33875

Authored-by: Min-Young Wu <[email protected]>
Signed-off-by: Matt Topol <[email protected]>
zeroshade added this to the 12.0.0 milestone Feb 2, 2023
sjperkins pushed a commit to sjperkins/arrow that referenced this issue Feb 10, 2023
gringasalpastor pushed a commit to gringasalpastor/arrow that referenced this issue Feb 17, 2023
fatemehp pushed a commit to fatemehp/arrow that referenced this issue Feb 24, 2023