Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

COMPRESS-514: SevenZFile header buffers over 2G #98

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

akelday
Copy link

@akelday akelday commented May 11, 2020

This is a simple way to enable reading of SevenZFile headers over 2GiB. It also allows a much smaller memory footprint for even the largest headers. The CRC handling needs work/input, because it's only supported for headers held entirely in memory.

When the complete header is not entirely in memory, that leaves several options - for example read the header in two passes, read/parse the header fully while computing CRC (which sort of defeats the purpose) or simply ignore it.

@coveralls
Copy link

coveralls commented May 11, 2020

Coverage Status

Coverage decreased (-0.08%) to 87.162% when pulling 54be6c4 on akelday:COMPRESS-514-SevenZFile into a5ccbd6 on apache:master.

@PeterAlfredLee
Copy link
Member

What's preventing us from implementing the CRC part in HeaderChannelBuffer? Can we update the CRC while reading from the channel?
You can refer CRC32VerifyingInputStream which has a similar implemention.

@akelday
Copy link
Author

akelday commented May 12, 2020

What's preventing us from implementing the CRC part in HeaderChannelBuffer? Can we update the CRC while reading from the channel?
You can refer CRC32VerifyingInputStream which has a similar implemention.

I'll assume that is the preferred option and will have a look later then.
Two downsides I can think of:

  • It assumes the header will eventually be fully read. Is that always the case?
  • We're acting on the data as the CRC is computed. That means if the CRC was wrong, your code probably already broke by the time you get to the CRC. Perhaps marginally better than completely ignoring though.

@PeterAlfredLee
Copy link
Member

PeterAlfredLee commented May 12, 2020

It assumes the header will eventually be fully read. Is that always the case?

I believe so. The 7z header is fully parsed in the constructor of SevenZFile, which means the header is always fully read. Currently we do not provide any method that only read some part of 7z headers.

We're acting on the data as the CRC is computed.

What do you mean by acting?
Currently we only read from the ByteBuffer, not changing any of them.

From my view, a CRC check is a good way (and maybe the only way) to verify the corrupted 7z headers - especially for large 7z archives like yours.

@akelday
Copy link
Author

akelday commented May 12, 2020

What do you mean by acting?

Just that branching is decided by the content of the data (e.g. checking header ID bytes then deciding what to do). That implies something quite unexpected could occur long before you reach the CRC value.

I suppose I'm saying from a code safety point of view, you need to confirm CRC before you use the data.

@PeterAlfredLee
Copy link
Member

Just that branching is decided by the content of the data (e.g. checking header ID bytes then deciding what to do).

So you are talking about whether or not read the encoded header by the nid in header. Am I right?

That implies something quite unexpected could occur long before you reach the CRC value.

A little confused here. From my understanding, you are talking about when reading encodedHeader, the buf is changed before the original HeaderChannelBuffer variable buf is fully read - which may not check the original CRC of the whole header, as the HeaderChannelBuffer is not fully read out. Am I right here?

@akelday
Copy link
Author

akelday commented May 12, 2020

Run out of time for now so I'll be brief (will be back in some hours)!

So you are talking about whether or not read the encoded header by the nid in header

Yes, that's one possible place a decision is made based on the data of the header. There may be more; I haven't checked all possibilities.

you are talking about when reading encodedHeader

Actually it applies to a "normal" header too, because that's also swapped for a HeaderBuffer.

the buf is changed before the original HeaderChannelBuffer variable buf is fully read

I think there are more places than that (again, needs checking).

@PeterAlfredLee
Copy link
Member

PeterAlfredLee commented May 12, 2020

I see. You are talking about that in some cases the HeaderChannelBuffer may be changed before it's fully read, then the CRC check sum is not able to be computed anyway.

@akelday
Copy link
Author

akelday commented May 12, 2020

in some cases the HeaderChannelBuffer may be changed before it's fully read

Not really, no. After digging a little deeper into 7z format it seems that particular CRC only applies to the nextHeader which is most often (maybe always?) the header at the very end of the archive a.k.a the "End Header". That's only a 42 byte header in the 1.2TB archive I had trouble with.

If I understand correctly, we could retain existing capability (a CRC check for the end header) by simply enforcing a HeaderInMemoryBuffer for that part and still keep the ability to handle a much bigger Compressed Metadata Block.

I'm going to come back to this again later so I can think about it (and get some sleep). Comments welcome in the mean time of course!

Edit: That's done. CRC check is now exactly as before, but we still keep large header support.

@PeterAlfredLee
Copy link
Member

Hey @akelday
Could you please rebase the commits to squash some commits? Some commits are modifying the same code and should be squashed.

@akelday akelday force-pushed the COMPRESS-514-SevenZFile branch from 95f8e3d to 54be6c4 Compare May 17, 2020 21:51
@akelday
Copy link
Author

akelday commented May 17, 2020

Hi @PeterAlfredLee - squashed to a single commit from latest master, I hope that's OK. Not sure what's going on with coverage, I'll have to look later because it's late now...

Edit: possibly resource (inputstream) is no longer closed correctly so will need to investigate!

Update: definite problem with that (will put more detail in the jira), so this is not OK to merge.

@PeterAlfredLee
Copy link
Member

PeterAlfredLee commented Sep 10, 2020

Not sure what's going on with coverage, I'll have to look later because it's late now...

Take it easy. The coverage sometimes report a coverage rate that is not accurate enough.

@garydgregory
Copy link
Member

Please rebase on master. Recent changes should allow all builds to be green including Java 16 and 17-EA.

private boolean refilled(final int remainingBytes) throws IOException {
if (remainingBytes <= 0) {
return false;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix formatting please.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, do we think this PR is still appropriate? I currently have no files large enough to test the issue it resolved, and haven't found time to create one yet. No problem with closing it, but if it's still useful I'll try to make time for it again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants