Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GzipReader: gets sometimes returns incomplete multi-byte characters #92

Open
GaspPredict opened this issue Jan 3, 2025 · 0 comments
Open

Comments

@GaspPredict
Copy link

GaspPredict commented Jan 3, 2025

Calling gets with limit in certain cases results in incomplete multi-byte characters as the returned string ends in the middle of a character.

To reproduce

Example file: example.txt.gz (uncompressed version example.txt)

require 'zlib'
# File example.txt.gz is attached to the issue
test_file = Zlib::GzipReader.open("example.txt.gz")
test_file.gets(3125)
test_file.gets(3125)
test_file.gets(3125)
test_file.gets(2513)
test_file.gets(2)
test_file.gets(100)
test_file.gets(8017)
test_file.gets(2)
test_file.gets(4579)
test_file.gets(3127)
test_file.gets(3127)
test_file.gets(3126)
test_file.gets(1)
result = test_file.gets(14020)
result.valid_encoding?
# => false
# Printing out the result shows that it contains two bytes at the end without the final byte to complete the character
# ...南部から、つづ\xE3\x81"

# Reading next chunk with gets and combining strings results in a string that is once again complete
next_chunk = test_file.gets(100)
(result + next_chunk).valid_encoding?
# => true

This issue can occur when reading compressed CSV files since CSV reader reads a chunk of IO and then checks whether returned string has valid encoding: https://github.com/ruby/csv/blob/f33c613ca94013db0667c4311311c98b6c20d5ea/lib/csv/parser.rb#L310-L312

Possible cause

This issue seems to occur when gets is called with limit, which is 1 byte smaller than the contents of currently loaded buffer.

In function gzreader_charboundary rb_enc_precise_mbclen is used to identify number of missing bytes at the end, but then gzfile_fill is called with n + MBCLEN_NEEDMORE_LEN(n_bytes) where n can be 1 byte smaller than current loaded buffer size, and as a result no additional data is read.

https://github.com/ruby/zlib/blob/master/ext/zlib/zlib.c#L4309

static long
gzreader_charboundary(struct gzfile *gz, long n)
{
    // EX: In case of the example above:
    // EX: n = 14020
    // EX: ZSTREAM_BUF_FILLED(&gz->z) = 14021
    // EX: last four bytes of s:
    // EX: ... |  0xA5 |  0xE3 |  0x81 |  0x00 | (byte)
    // EX: ... | 14019 | 14020 | 14021 | 14022 | (index)
    char *s = RSTRING_PTR(gz->z.buf);
    char *e = s + ZSTREAM_BUF_FILLED(&gz->z);
    // EX: p correctly points to 0xE3 (byte 14020 of s)
    char *p = rb_enc_left_char_head(s, s + n - 1, e, gz->enc);
    long l = p - s;
    if (l < n) {
        // EX: rb_enc_precise_mbclen correctly determines that 1 more byte is needed (after 0xE3 and 0x81)
	int n_bytes = rb_enc_precise_mbclen(p, e, gz->enc);
	if (MBCLEN_NEEDMORE_P(n_bytes)) {
            // EX: n + MBCLEN_NEEDMORE_LEN(n_bytes) = 14020 + 1 which is NOT correct:
            // EX: 14021 is already size of the buffer, the value should be 14022
	    if ((l = gzfile_fill(gz, n + MBCLEN_NEEDMORE_LEN(n_bytes), Qnil)) > 0) {
		return l;
	    }
	}
	else if (MBCLEN_CHARFOUND_P(n_bytes)) {
	    return l + MBCLEN_CHARFOUND_LEN(n_bytes);
	}
    }
    return n;
}

Possible solution

Since rb_enc_precise_mbclen determines how many bytes are needed based on the entire buffer (until e pointer), gzfile_fill should also request current buffer size + number of missing bytes to be read:

diff --git a/ext/zlib/zlib.c b/ext/zlib/zlib.c
index 0b9c4d6..5d7bd1b 100644
--- a/ext/zlib/zlib.c
+++ b/ext/zlib/zlib.c
@@ -4316,7 +4316,7 @@ gzreader_charboundary(struct gzfile *gz, long n)
     if (l < n) {
        int n_bytes = rb_enc_precise_mbclen(p, e, gz->enc);
        if (MBCLEN_NEEDMORE_P(n_bytes)) {
-           if ((l = gzfile_fill(gz, n + MBCLEN_NEEDMORE_LEN(n_bytes), Qnil)) > 0) {
+           if ((l = gzfile_fill(gz, ZSTREAM_BUF_FILLED(&gz->z) + MBCLEN_NEEDMORE_LEN(n_bytes), Qnil)) > 0) {
                return l;
            }
        }

As far as I tested this indeed seems to fix the issue

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

1 participant