
feat: Interpret data descriptors when reading zip file from (read, nonseek) stream #197

Open · wants to merge 23 commits into master
Conversation


@0xCCF4 0xCCF4 commented Jun 21, 2024

When reading a zip file nested inside another zip file, the inner zip file's stream is not seekable. If the files involved are large, it may be infeasible to buffer the whole nested file in RAM; it is better to stream the zip file instead, relying on the local file headers and data descriptors to determine the entries and ignoring the central directory at the end of the file entirely.

Current project state:

  • There is an existing method for streaming a non-seekable zip file stream.
  • This method does not handle data descriptors.
  • Zip files that use data descriptors cannot be processed by this function.

My contribution:

  • Changed this method to allow parsing a zip file with data descriptors
  • Added tests for this feature

Current state:

  • Prepare the library for reading without requiring a Take object
  • Change the CRC checker so that the expected CRC can also be set at the end of reading
  • Document security considerations regarding this zip reading approach
  • Add a custom reader that looks ahead for a data descriptor and then terminates the reading of a zip file entry's content

Security Considerations:
Reading a zip archive this way, without knowing the file sizes in advance, may result in parsing inconsistencies. Since the content of a single file may be attacker controlled, we must assume that an attacker can craft a file containing a byte sequence that looks like a data descriptor, followed by what looks like the header of a new zip file entry.
Parsing such a file in streaming fashion will return two files, while parsing via the central directory results in just one file read. This should be regarded as a security risk. Still, being able to read zip files with data descriptors in a streamed fashion is useful for some applications. I therefore introduced the types UntrustedValue and MaybeUntrusted. See https://github.com/0xCCF4/zip2/blob/f6b5da958028c5cb90d6de7b5b56a378de52fc95/src/result.rs#L110
A library user who wants to use this streaming functionality must explicitly accept this security risk.
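The wrapper type linked above forces callers to acknowledge the risk explicitly before touching a parsed value. A minimal standalone sketch of the idea (simplified and with hypothetical method names; the actual definition is in the linked result.rs):

```rust
// Sketch only: a value parsed from attacker-controlled input is wrapped so
// the caller must make an explicit, greppable opt-in before using it.

#[derive(Debug, Clone, Copy)]
pub struct UntrustedValue<T> {
    value: T,
}

impl<T> UntrustedValue<T> {
    pub fn wrap(value: T) -> Self {
        UntrustedValue { value }
    }

    /// The caller explicitly accepts that this value may be attacker-controlled.
    pub fn use_untrusted_value(self) -> T {
        self.value
    }
}

fn main() {
    // e.g. a file size parsed from a data descriptor found in the stream
    let size = UntrustedValue::wrap(1024u64);
    // the raw value is unreachable without the explicit opt-in call
    assert_eq!(size.use_untrusted_value(), 1024);
}
```

The point of the design is that accepting the risk shows up as a distinct method call at every use site, which code review and audits can search for.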

Related issues:
#162

…reader: &mut S) that reads the next zipfile entry from a stream by potentially parsing the data descriptor

This is an alternative method for reading a zip file. If possible, use the ZipArchive functions instead, as some information will be missing when reading in this manner.

This method extends the existing read_zipfile_from_stream method to streams that contain data descriptors.

This method is superior to ZipArchive in the special case where the underlying stream implementation would have to buffer all data it has seen in order to provide seek-back support, and memory consumption must be kept small. This is the case, for example, when reading a zip file B.zip nested within a zip file A.zip that is stored on disk.

Since A.zip is seekable when stored as a file, it can be read using ZipArchive. Let B.zip be a file stored inside A.zip; we can then read B.zip's content through a decompressing reader. The problem is that this decompressing reader is not seekable (due to the decompression). To read the contents of B.zip without extracting A.zip to disk, B.zip would normally have to be buffered in RAM: when using ZipArchive, the whole of B.zip must be held in memory because its central directory is located at the end of the file.

This method instead reads B.zip from the start of the file and returns the first file entry found. After this function has executed and the returned ZipFile has been dropped, the reader is positioned at the end of that file entry. Since this function never seeks back to before the position the stream had when the function was called, the underlying stream implementation may, after the ZipFile is dropped, discard all buffered data before the current position.

Summarizing: in the given scenario, this method does not need to buffer the whole of B.zip in RAM, only the first file entry.
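Streaming in this way starts from the local file headers rather than the central directory. As a rough, self-contained illustration of what the reader has to parse (not the PR's code, which lives in src/read.rs; field layout follows APPNOTE.TXT section 4.3.7):

```rust
use std::io::{Cursor, Read};

// Local file header signature "PK\x03\x04" as a little-endian u32.
const LOCAL_FILE_HEADER_SIGNATURE: u32 = 0x04034b50;

fn read_u16<R: Read>(r: &mut R) -> std::io::Result<u16> {
    let mut b = [0u8; 2];
    r.read_exact(&mut b)?;
    Ok(u16::from_le_bytes(b))
}

fn read_u32<R: Read>(r: &mut R) -> std::io::Result<u32> {
    let mut b = [0u8; 4];
    r.read_exact(&mut b)?;
    Ok(u32::from_le_bytes(b))
}

/// Returns (flags, compressed_size, file_name) of the next entry, or None
/// when the signature does not match (e.g. the central directory begins).
fn read_local_header<R: Read>(r: &mut R) -> std::io::Result<Option<(u16, u32, String)>> {
    if read_u32(r)? != LOCAL_FILE_HEADER_SIGNATURE {
        return Ok(None);
    }
    let _version = read_u16(r)?;
    let flags = read_u16(r)?; // bit 3 set => sizes live in a trailing data descriptor
    let _method = read_u16(r)?;
    let _mod_time = read_u16(r)?;
    let _mod_date = read_u16(r)?;
    let _crc32 = read_u32(r)?;
    let compressed_size = read_u32(r)?; // zero when bit 3 of flags is set
    let _uncompressed_size = read_u32(r)?;
    let name_len = read_u16(r)? as usize;
    let extra_len = read_u16(r)? as usize;
    let mut name = vec![0u8; name_len];
    r.read_exact(&mut name)?;
    // skip the extra field without seeking
    std::io::copy(&mut r.by_ref().take(extra_len as u64), &mut std::io::sink())?;
    Ok(Some((flags, compressed_size, String::from_utf8_lossy(&name).into_owned())))
}

fn main() -> std::io::Result<()> {
    // Build a minimal header for an entry "a.rs" whose sizes are deferred
    // to a data descriptor (general-purpose flag bit 3).
    let mut header = Vec::new();
    header.extend_from_slice(&LOCAL_FILE_HEADER_SIGNATURE.to_le_bytes());
    header.extend_from_slice(&20u16.to_le_bytes()); // version needed to extract
    header.extend_from_slice(&0x0008u16.to_le_bytes()); // flags: data descriptor
    header.extend_from_slice(&[0u8; 18]); // method, mod time/date, crc32, both sizes
    header.extend_from_slice(&4u16.to_le_bytes()); // file name length
    header.extend_from_slice(&0u16.to_le_bytes()); // extra field length
    header.extend_from_slice(b"a.rs");

    let mut cur = Cursor::new(header);
    let (flags, size, name) = read_local_header(&mut cur)?.expect("valid header");
    assert!(flags & 0x0008 != 0); // sizes deferred to the data descriptor
    assert_eq!(size, 0);
    assert_eq!(name, "a.rs");
    Ok(())
}
```

This also shows why the method needs the look-ahead machinery discussed below: with flag bit 3 set, the header alone does not tell the reader where the entry's data ends.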
@0xCCF4 0xCCF4 marked this pull request as draft June 21, 2024 07:42
@0xCCF4 0xCCF4 changed the title draft: Add means to read a zip file part by part draft: feat: Interpret data descriptors when reading zip file from (read, nonseek) stream Jun 24, 2024
Before: The CRC32 checksum had to be supplied before the stream was read, and was only checked when EOF occurred.
Now: The CRC32 checksum can be supplied either before reading starts or after reading from the stream has finished.
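The commit above decouples supplying the expected CRC from constructing the checking reader, which is what a data descriptor requires (the checksum only becomes known after the entry's data has been read). A self-contained sketch of such a reader, with hypothetical names (the PR's actual change is in src/crc32.rs and uses the crate's hasher, not this naive bitwise CRC-32):

```rust
use std::io::{Cursor, Read, Result};

// Naive bitwise CRC-32 (IEEE): reflected, poly 0xEDB88320, init/final 0xFFFFFFFF.
fn crc32_update(mut crc: u32, data: &[u8]) -> u32 {
    for &b in data {
        crc ^= b as u32;
        for _ in 0..8 {
            crc = if crc & 1 != 0 { (crc >> 1) ^ 0xEDB8_8320 } else { crc >> 1 };
        }
    }
    crc
}

struct Crc32Reader<R> {
    inner: R,
    state: u32,            // running CRC state (before the final xor)
    expected: Option<u32>, // may be set before or after reading
}

impl<R: Read> Crc32Reader<R> {
    fn new(inner: R, expected: Option<u32>) -> Self {
        Crc32Reader { inner, state: 0xFFFF_FFFF, expected }
    }

    /// Supply the expected checksum after the fact, e.g. once the
    /// trailing data descriptor has been parsed.
    fn set_expected(&mut self, crc: u32) {
        self.expected = Some(crc);
    }

    fn actual(&self) -> u32 {
        self.state ^ 0xFFFF_FFFF
    }

    fn check(&self) -> bool {
        self.expected == Some(self.actual())
    }
}

impl<R: Read> Read for Crc32Reader<R> {
    fn read(&mut self, buf: &mut [u8]) -> Result<usize> {
        let n = self.inner.read(buf)?;
        self.state = crc32_update(self.state, &buf[..n]);
        Ok(n)
    }
}

fn main() -> Result<()> {
    let mut r = Crc32Reader::new(Cursor::new(b"hello".to_vec()), None);
    let mut out = Vec::new();
    r.read_to_end(&mut out)?;
    // CRC-32 (IEEE) of "hello" is 0x3610A686
    r.set_expected(0x3610_A686);
    assert!(r.check());
    Ok(())
}
```

Making `expected` an `Option` keeps the old usage (supply up front, check at EOF) working while allowing the data-descriptor path to fill it in late.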
@0xCCF4 0xCCF4 marked this pull request as ready for review June 24, 2024 12:19
@0xCCF4 0xCCF4 changed the title draft: feat: Interpret data descriptors when reading zip file from (read, nonseek) stream feat: Interpret data descriptors when reading zip file from (read, nonseek) stream Jun 24, 2024
@0xCCF4
Author

0xCCF4 commented Jun 24, 2024

The changes kind of exploded 🙈 😅 - but now it finally works. Looking forward to a review.
Best regards

Member

@Pr0methean Pr0methean left a comment

Decompressing readers only exist when the feature flag for their compression method is enabled. I may have more comments later.

Resolved review threads on src/crc32.rs (×6) and src/read.rs (×1, outdated).
Member

@Pr0methean Pr0methean left a comment

Seems mostly functionally correct, but I have some performance concerns.

Resolved (outdated) review threads on src/read.rs (×2).
}

let limit_reader = (reader as &'a mut dyn Read).take(result.compressed_size);
fn read_a_byte(&mut self) -> io::Result<Option<u8>> {
Member

Reading one byte at a time like this won't be efficient. Instead, read a block into the look-ahead buffer and use memchr::memmem::find to check for a data descriptor signature. (Be sure to leave size_of::<Magic>() bytes in the buffer, in case the descriptor signature straddles two blocks!)
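The reviewer's block-scan suggestion can be sketched as follows. This stand-in is self-contained, so it uses the stdlib's `windows().position()` where the real code would use `memchr::memmem::find`, but it shows the block-sized reads and the carried-over overlap that catches a signature straddling two blocks:

```rust
use std::io::Read;

// Data descriptor signature "PK\x07\x08".
const DESCRIPTOR_SIG: [u8; 4] = [0x50, 0x4b, 0x07, 0x08];
const SIG_LEN: usize = DESCRIPTOR_SIG.len();

/// Scan a buffer for the descriptor signature; returns its offset.
/// (Slow stand-in for memchr::memmem::find.)
fn find_signature(haystack: &[u8]) -> Option<usize> {
    haystack.windows(SIG_LEN).position(|w| w == DESCRIPTOR_SIG)
}

/// Find the signature across consecutive blocks of a stream, keeping the
/// last SIG_LEN - 1 bytes of each block as overlap so a straddling
/// signature is still found. Returns the stream offset of the signature.
fn find_in_stream<R: Read>(mut r: R) -> std::io::Result<Option<u64>> {
    let mut buf = vec![0u8; 8192 + SIG_LEN - 1];
    let mut carried = 0usize; // bytes kept from the previous block
    let mut consumed = 0u64;  // stream offset corresponding to buf[0]
    loop {
        let n = r.read(&mut buf[carried..])?;
        if n == 0 {
            return Ok(None); // EOF without a match
        }
        let filled = carried + n;
        if let Some(pos) = find_signature(&buf[..filled]) {
            return Ok(Some(consumed + pos as u64));
        }
        // keep the tail so a straddling signature is found next round
        carried = filled.min(SIG_LEN - 1);
        let tail = filled - carried;
        buf.copy_within(tail..filled, 0);
        consumed += tail as u64;
    }
}

fn main() -> std::io::Result<()> {
    // Place the signature at offset 10_000, past the first 8 KiB block,
    // to exercise the carry-over between blocks.
    let mut data = vec![0u8; 20_000];
    data[10_000..10_004].copy_from_slice(&DESCRIPTOR_SIG);
    let pos = find_in_stream(std::io::Cursor::new(data))?;
    assert_eq!(pos, Some(10_000));
    Ok(())
}
```

An overlap of `SIG_LEN - 1` bytes is the minimum that guarantees a straddling match is seen; keeping a full `size_of::<Magic>()` bytes, as the comment suggests, works equally well.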

src/read.rs Outdated
Comment on lines 1771 to 1785
if spec::Magic::from_first_le_bytes(&self.look_ahead_buffer)
== spec::Magic::DATA_DESCRIPTOR_SIGNATURE
{
// potentially found a data descriptor
// check if size matches
let data_descriptor = match ZipDataDescriptor::interpret(&self.look_ahead_buffer) {
Ok(data_descriptor) => data_descriptor,
Err(_) => return None,
};
let data_descriptor_size = data_descriptor.compressed_size;

if data_descriptor_size == self.number_read_total_actual as u32 {
return Some(data_descriptor);
}
}
Member

interpret already checks whether the magic matches.

Suggested change
if spec::Magic::from_first_le_bytes(&self.look_ahead_buffer)
== spec::Magic::DATA_DESCRIPTOR_SIGNATURE
{
// potentially found a data descriptor
// check if size matches
let data_descriptor = match ZipDataDescriptor::interpret(&self.look_ahead_buffer) {
Ok(data_descriptor) => data_descriptor,
Err(_) => return None,
};
let data_descriptor_size = data_descriptor.compressed_size;
if data_descriptor_size == self.number_read_total_actual as u32 {
return Some(data_descriptor);
}
}
let Ok(data_descriptor) = ZipDataDescriptor::interpret(&self.look_ahead_buffer) else {
return None;
};
let data_descriptor_size = data_descriptor.compressed_size;
if data_descriptor_size == self.number_read_total_actual as u32 {
Some(data_descriptor)
} else {
None
}

Member

Check the CRC32 here as well.

Author

This probably needs some architectural changes. The function above operates on the compressed zip data, while the CRC is calculated over the uncompressed data, so with the current architecture checking the CRC in this function is not possible.

Currently I have no good idea how to solve this. Still, the CRC is checked eventually: if a byte sequence in the stream happens to look like a data descriptor (which only counts as a match if the file size in the fake descriptor is also correct), the CRC is likely wrong. In that case only the first part of the file is returned to the user, and an error is then thrown when the CRC check fails on reaching the stream's EOF.
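The failure mode being discussed, a byte sequence inside file content that masquerades as a data descriptor, can be illustrated with a standalone sketch. Names here are hypothetical, not the PR's ZipDataDescriptor; it shows why the size check alone filters most false positives, while the CRC must still be verified downstream:

```rust
#[derive(Debug, PartialEq)]
struct DataDescriptor {
    crc32: u32,
    compressed_size: u32,
    uncompressed_size: u32,
}

// Data descriptor signature "PK\x07\x08" as a little-endian u32.
const SIG: u32 = 0x08074b50;

/// Parse a 16-byte candidate; None if the magic does not match.
fn interpret(buf: &[u8; 16]) -> Option<DataDescriptor> {
    let le = |i: usize| u32::from_le_bytes(buf[i..i + 4].try_into().unwrap());
    if le(0) != SIG {
        return None;
    }
    Some(DataDescriptor {
        crc32: le(4),
        compressed_size: le(8),
        uncompressed_size: le(12),
    })
}

/// Accept a candidate descriptor only if its size field agrees with the
/// number of compressed bytes actually read so far.
fn plausible(buf: &[u8; 16], bytes_read_so_far: u32) -> bool {
    interpret(buf).map_or(false, |d| d.compressed_size == bytes_read_so_far)
}

fn main() {
    // Attacker-controlled content that happens to contain the magic and
    // claims a compressed size of 100 bytes.
    let mut fake = [0u8; 16];
    fake[..4].copy_from_slice(&SIG.to_le_bytes());
    fake[8..12].copy_from_slice(&100u32.to_le_bytes());

    // Size disagrees with what was actually read -> rejected immediately.
    assert!(!plausible(&fake, 42));
    // Size agrees -> accepted here, so the CRC check after decompression
    // is the remaining line of defense.
    assert!(plausible(&fake, 100));
}
```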

Resolved (outdated) review thread on src/read.rs.
@Pr0methean Pr0methean added Please address review comments Some review comments are still open. and removed review in progress labels Jul 6, 2024
@0xCCF4
Author

0xCCF4 commented Jul 8, 2024

Review todos:

  • Apply the simple review suggestions
  • Convert the Vec to a VecDeque
  • Optimize read_a_byte to read more than one byte at once 😄
  • Check the CRC once a potential data descriptor is found

let data_descriptor_size = data_descriptor.compressed_size;

if data_descriptor_size == self.number_read_total_actual as u32 {
// TODO: check CRC32 here as well

Check notice — code scanning (devskim): Suspicious comment. A "TODO" or similar was left in source code, possibly indicating incomplete functionality.

src/read.rs: fixed, resolved.
@Pr0methean Pr0methean added the Please fix rebase conflicts Please rebase this PR against master and fix the conflicts. label Jul 27, 2024