This repository has been archived by the owner on Nov 16, 2023. It is now read-only.

Panic in fileRowGroup.init when reading parquet file #382

Open
asubiotto opened this issue Oct 19, 2022 · 2 comments
Labels: bug, duplicate

Comments

@asubiotto
Contributor

To reproduce, download the parquet file with the data that causes this and run:

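package parquet_test

import (
	"os"
	"testing"

	"github.com/segmentio/parquet-go"
)

func TestPanic(t *testing.T) {
	// os.Open opens the file read-only, so no permission bits are needed.
	f, err := os.Open("panicdata.parquet")
	if err != nil {
		t.Fatal(err)
	}
	defer f.Close()
	info, err := f.Stat()
	if err != nil {
		t.Fatal(err)
	}
	// The panic is raised inside this call, before an error can be returned.
	_, err = parquet.OpenFile(f, info.Size())
	if err != nil {
		t.Fatal(err)
	}
}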

This causes:

panic: runtime error: index out of range [-425984] [recovered]
	panic: runtime error: index out of range [-425984]

goroutine 19 [running]:
testing.tRunner.func1.2({0x1010df6c0, 0x1401a25e000})
	/opt/homebrew/opt/go/libexec/src/testing/testing.go:1396 +0x1c8
testing.tRunner.func1()
	/opt/homebrew/opt/go/libexec/src/testing/testing.go:1399 +0x378
panic({0x1010df6c0, 0x1401a25e000})
	/opt/homebrew/opt/go/libexec/src/runtime/panic.go:884 +0x204
github.com/segmentio/parquet-go.(*fileRowGroup).init(0x14021d30000, 0x140001e0360, 0x1011306a0?, {0x1401c5d9960, 0xd, 0x101023fe0?}, 0x14002fc0000)
	/Users/asubiotto/Developer/github.com/segmentio/parquet-go/file.go:381 +0x4b0
github.com/segmentio/parquet-go.OpenFile({0x101120280?, 0x1400019a068}, 0xb4f6054, {0x0, 0x0, 0x0})
	/Users/asubiotto/Developer/github.com/segmentio/parquet-go/file.go:100 +0x760
github.com/segmentio/parquet-go_test.TestPanic(0x14000186ea0)
	/Users/asubiotto/Developer/github.com/segmentio/parquet-go/file_test.go:157 +0x10c
@asubiotto changed the title from "Panic when reading parquet file" to "Panic in fileRowGroup.init when reading parquet file" on Oct 19, 2022
@achille-roussel
Contributor

I believe that this issue might be another occurrence of #366.

The panic happens here: https://github.com/segmentio/parquet-go/blob/main/file.go#L380-L381. The value of j should only be negative if rowGroup.Ordinal is negative.

This would be caused by generating a parquet file with more than 32K row groups (the ordinal is stored as a 16-bit signed integer, so it wraps to a negative value past 32767), and should no longer be possible after #379.
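
The numbers in the report are consistent with this explanation. The sketch below is not taken from file.go: it assumes the index is computed as the ordinal times the number of columns plus the column position, and `wrapOrdinal` and `flatIndex` are hypothetical helpers introduced only for illustration. With the 13 columns listed in the file metadata, row group 32768 wraps to ordinal -32768, and -32768 × 13 is -425984, the index shown in the panic message.

```go
package main

import "fmt"

// wrapOrdinal mimics storing a row group position in a 16-bit signed field,
// which is how the parquet format declares the row group ordinal.
// Hypothetical helper, for illustration only.
func wrapOrdinal(position int) int16 {
	return int16(position) // truncating conversion: positions above 32767 wrap around
}

// flatIndex is an ASSUMED formula for locating a column chunk's entry in a
// flat, per-file list of column indexes. It is not copied from parquet-go.
func flatIndex(ordinal int16, numColumns, column int) int {
	return int(ordinal)*numColumns + column
}

func main() {
	const numColumns = 13 // number of columns listed in the file metadata

	for _, position := range []int{32767, 32768, 40790} {
		ordinal := wrapOrdinal(position)
		fmt.Printf("position=%d ordinal=%d index=%d\n",
			position, ordinal, flatIndex(ordinal, numColumns, 0))
	}
}
```

The first row group whose position exceeds 32767 is the one that produces the negative index, which would explain why the panic occurs while the file is being opened rather than while rows are being read.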

Looking at the file metadata, there are indeed more than 32K row groups:

parquet-tools meta panicdata.parquet
...
row group 40790:        RC:91 TS:6107 OFFSET:121651610
--------------------------------------------------------------------------------
duration:                INT64 UNCOMPRESSED DO:121651610 FPO:121652007 SZ:503/503/1.00 VC:91 ENC:PLAIN,RLE_DICTIONARY ST:[no stats for this column]
labels.instance:         BINARY UNCOMPRESSED DO:121652113 FPO:121652150 SZ:75/75/1.00 VC:91 ENC:RLE,PLAIN,RLE_DICTIONARY ST:[no stats for this column]
labels.job:              BINARY UNCOMPRESSED DO:121652188 FPO:121652218 SZ:68/68/1.00 VC:91 ENC:RLE,PLAIN,RLE_DICTIONARY ST:[no stats for this column]
name:                    BINARY UNCOMPRESSED DO:121652256 FPO:121652290 SZ:67/67/1.00 VC:91 ENC:PLAIN,RLE_DICTIONARY ST:[no stats for this column]
period:                  INT64 UNCOMPRESSED DO:121652323 FPO:121652350 SZ:60/60/1.00 VC:91 ENC:PLAIN,RLE_DICTIONARY ST:[no stats for this column]
period_type:             BINARY UNCOMPRESSED DO:121652383 FPO:121652409 SZ:59/59/1.00 VC:91 ENC:PLAIN,RLE_DICTIONARY ST:[no stats for this column]
period_unit:             BINARY UNCOMPRESSED DO:121652442 FPO:121652476 SZ:67/67/1.00 VC:91 ENC:PLAIN,RLE_DICTIONARY ST:[no stats for this column]
pprof_num_labels.bytes:  INT64 UNCOMPRESSED DO:121652509 FPO:121652522 SZ:51/51/1.00 VC:91 ENC:RLE,PLAIN,RLE_DICTIONARY ST:[no stats for this column]
sample_type:             BINARY UNCOMPRESSED DO:121652560 FPO:121652590 SZ:63/63/1.00 VC:91 ENC:PLAIN,RLE_DICTIONARY ST:[no stats for this column]
sample_unit:             BINARY UNCOMPRESSED DO:121652623 FPO:121652651 SZ:61/61/1.00 VC:91 ENC:PLAIN,RLE_DICTIONARY ST:[no stats for this column]
stacktrace:              BINARY ZSTD DO:121652684 FPO:121654026 SZ:1428/4405/3.08 VC:91 ENC:PLAIN,RLE_DICTIONARY ST:[no stats for this column]
timestamp:               INT64 ZSTD DO:0 FPO:121654112 SZ:531/544/1.02 VC:91 ENC:DELTA_BINARY_PACKED ST:[no stats for this column]
value:                   INT64 ZSTD DO:0 FPO:121654643 SZ:93/84/0.90 VC:91 ENC:DELTA_BINARY_PACKED ST:[no stats for this column]

@achille-roussel achille-roussel self-assigned this Oct 31, 2022
@achille-roussel added the bug and duplicate labels on Oct 31, 2022
@achille-roussel
Contributor

I believe that we addressed the root issue here by preventing parquet-go from generating files with more row groups than the parquet format technically supports. That said, we could handle this case better, either by accepting files that contain more row groups than the maximum ordinal value, or by returning an error instead of panicking.
