Skip to content
This repository has been archived by the owner on Nov 16, 2023. It is now read-only.

Add "WithError" versions of NewReader/NewGenericReader #305

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

SgtCoDFish
Copy link

It's currently difficult to use NewReader / NewGenericReader in a situation where an invalid file might be read, since they can panic when trying to read an invalid file.

This PR adds WithError versions of NewReader and NewGenericReader, which return an error if something goes wrong rather than panicing.

This PR also adds a testdata file which causes a New(Generic)Reader to panic, and adds tests ensuring that NewReaderWithError and NewGenericReaderWithError return an error when trying to read that file. It was that file which alerted me to this issue!

I figured it was easier to raise a PR for discussion than to create an issue for this - if this needs any work or if this isn't something you want to add to the library, please let me know!

Alternative Approach

Another approach would be to break the API and change the signatures to e.g. func NewReader(...) (*Reader, error) - given that this library isn't at v1 yet, that might be an option which some libraries would take.

I assumed that breaking existing code wouldn't be desirable here, though - if this library ever publishes a v2, that might be a thing to do then.

That said, if you'd be happy to break the API I'd gladly change this PR to do that instead!

@Pryz
Copy link
Contributor

Pryz commented Aug 19, 2022

Hi @SgtCoDFish thanks for your PR! This is certainly a change we can accept. I agree that it's better to not break the API just yet.

Something I'm not sure tho is the name of the function. Usually (in the stdlib for example) functions with WithError will expect the user to pass an error in parameter. Example: #313.

Wondering if we should use something else like NewReaderOrError. What do you think ? Have you seen WithError used in other package already ?

@Pryz Pryz self-requested a review August 19, 2022 00:36
New(Generic)Reader calls can fail at runtime if they encounter an
invalid parquet file, which means that this library is difficult to use
in a situation where an arbitrary or potentially untrusted parquet file
might need to be read.

This adds OrError versions of these instantiation functions, which
enables easier handling of errors without having to recover from a panic

Also add tests for creating a reader with a broken file, which illustrates
the need for New(Generic)ReaderOrError
@SgtCoDFish
Copy link
Author

Wondering if we should use something else like NewReaderOrError. What do you think ? Have you seen WithError used in other package already ?

Ah I much prefer OrError, yeah!

I thought about the name a bit and ended up just going with my first idea; I'm definitely happy to change it. I've made that change 😁

@achille-roussel
Copy link
Contributor

Something I had documented (but maybe not highlighted enough) is the pattern which is currently supported to handle configuration errors; here is an example https://pkg.go.dev/github.com/segmentio/parquet-go#NewBuffer

I wonder if this would be an acceptable approach, or whether we feel like the new APIs are worth introducing still?

@achille-roussel achille-roussel self-assigned this Aug 23, 2022
@achille-roussel achille-roussel self-requested a review August 23, 2022 21:08
@SgtCoDFish
Copy link
Author

I wonder if this would be an acceptable approach, or whether we feel like the new APIs are worth introducing still?

I think if configuration errors were the only way that creating a (Generic)Reader could fail that would be an acceptable tradeoff - the issue I encountered though was that creating the reader will panic on an invalid parquet file, which I think is demonstrated by the test that I added (not_enough_columns.invalid.parquet). I don't think the configuration approach you've shown there would solve that issue?

@achille-roussel
Copy link
Contributor

achille-roussel commented Aug 29, 2022

Thanks for sharing these details.

You could open the file ahead of time to handle errors caused by invalid files, for example:

f, err := parquet.OpenFile(...)
if err != nil {
    ...
}
r := parquet.NewGenericReader[Row](f)

Then you have the guarantee that the reader constructor won't panic.

Let me know if this is an acceptable solution for your use case.

@SgtCoDFish
Copy link
Author

You could open the file ahead of time to handle errors caused by invalid files, for example:
Let me know if this is an acceptable solution for your use case.

This would probably work in our use case in the short term but it's quite verbose and fiddly. Using OpenFile means I need to manually calculate the size of my file, and it's just more boilerplate overall.

I like that NewGenericReader does all of this for me, so it would still be nice to be able to simply use NewGenericReaderOrError instead.

Another issue with panics is that there's no easy way for me to know whether a new failure mode which could cause a panic might be added to NewGenericReader in the future, which could cause my application to start to panic after I upgrade to a newer version of parquet-go. If it instead returns errors, my existing error handling code will handle those gracefully.

@SgtCoDFish
Copy link
Author

@achille-roussel gentle bump - does my reasoning above seem reasonable?

@SgtCoDFish
Copy link
Author

@Pryz @achille-roussel Hi, another gentle bump - does this look like it might be acceptable? Just looking to tie off the open PRs I have but of course I recognise that people are busy 😁

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants