GzipReader: gets sometimes returns incomplete multi-byte characters #92

GaspPredict · 2025-01-03T13:46:20Z

Calling gets with limit in certain cases results in incomplete multi-byte characters as the returned string ends in the middle of a character.

To reproduce

Example file: example.txt.gz (uncompressed version example.txt)

require 'zlib'
# File example.txt.gz is attached to the issue
test_file = Zlib::GzipReader.open("example.txt.gz")
test_file.gets(3125)
test_file.gets(3125)
test_file.gets(3125)
test_file.gets(2513)
test_file.gets(2)
test_file.gets(100)
test_file.gets(8017)
test_file.gets(2)
test_file.gets(4579)
test_file.gets(3127)
test_file.gets(3127)
test_file.gets(3126)
test_file.gets(1)
result = test_file.gets(14020)
result.valid_encoding?
# => false
# Printing out the result shows that it contains two bytes at the end without the final byte to complete the character
# ...南部から、つづ\xE3\x81"

# Reading next chunk with gets and combining strings results in a string that is once again complete
next_chunk = test_file.gets(100)
(result + next_chunk).valid_encoding?
# => true

This issue can occur when reading compressed CSV files since CSV reader reads a chunk of IO and then checks whether returned string has valid encoding: https://github.com/ruby/csv/blob/f33c613ca94013db0667c4311311c98b6c20d5ea/lib/csv/parser.rb#L310-L312

Possible cause

This issue seems to occur when gets is called with limit, which is 1 byte smaller than the contents of currently loaded buffer.

In function gzreader_charboundary rb_enc_precise_mbclen is used to identify number of missing bytes at the end, but then gzfile_fill is called with n + MBCLEN_NEEDMORE_LEN(n_bytes) where n can be 1 byte smaller than current loaded buffer size, and as a result no additional data is read.

https://github.com/ruby/zlib/blob/master/ext/zlib/zlib.c#L4309

static long
gzreader_charboundary(struct gzfile *gz, long n)
{
    // EX: In case of the example above:
    // EX: n = 14020
    // EX: ZSTREAM_BUF_FILLED(&gz->z) = 14021
    // EX: last four bytes of s:
    // EX: ... |  0xA5 |  0xE3 |  0x81 |  0x00 | (byte)
    // EX: ... | 14019 | 14020 | 14021 | 14022 | (index)
    char *s = RSTRING_PTR(gz->z.buf);
    char *e = s + ZSTREAM_BUF_FILLED(&gz->z);
    // EX: p correctly points to 0xE3 (byte 14020 of s)
    char *p = rb_enc_left_char_head(s, s + n - 1, e, gz->enc);
    long l = p - s;
    if (l < n) {
        // EX: rb_enc_precise_mbclen correctly determines that 1 more byte is needed (after 0xE3 and 0x81)
	int n_bytes = rb_enc_precise_mbclen(p, e, gz->enc);
	if (MBCLEN_NEEDMORE_P(n_bytes)) {
            // EX: n + MBCLEN_NEEDMORE_LEN(n_bytes) = 14020 + 1 which is NOT correct:
            // EX: 14021 is already size of the buffer, the value should be 14022
	    if ((l = gzfile_fill(gz, n + MBCLEN_NEEDMORE_LEN(n_bytes), Qnil)) > 0) {
		return l;
	    }
	}
	else if (MBCLEN_CHARFOUND_P(n_bytes)) {
	    return l + MBCLEN_CHARFOUND_LEN(n_bytes);
	}
    }
    return n;
}

Possible solution

Since rb_enc_precise_mbclen determines how many bytes are needed based on the entire buffer (until e pointer), gzfile_fill should also request current buffer size + number of missing bytes to be read:

diff --git a/ext/zlib/zlib.c b/ext/zlib/zlib.c
index 0b9c4d6..5d7bd1b 100644
--- a/ext/zlib/zlib.c
+++ b/ext/zlib/zlib.c
@@ -4316,7 +4316,7 @@ gzreader_charboundary(struct gzfile *gz, long n)
     if (l < n) {
        int n_bytes = rb_enc_precise_mbclen(p, e, gz->enc);
        if (MBCLEN_NEEDMORE_P(n_bytes)) {
-           if ((l = gzfile_fill(gz, n + MBCLEN_NEEDMORE_LEN(n_bytes), Qnil)) > 0) {
+           if ((l = gzfile_fill(gz, ZSTREAM_BUF_FILLED(&gz->z) + MBCLEN_NEEDMORE_LEN(n_bytes), Qnil)) > 0) {
                return l;
            }
        }

As far as I tested this indeed seems to fix the issue

The text was updated successfully, but these errors were encountered:

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GzipReader: gets sometimes returns incomplete multi-byte characters #92

GzipReader: gets sometimes returns incomplete multi-byte characters #92

GaspPredict commented Jan 3, 2025 •

edited

Loading

GzipReader: gets sometimes returns incomplete multi-byte characters #92

GzipReader: gets sometimes returns incomplete multi-byte characters #92

Comments

GaspPredict commented Jan 3, 2025 • edited Loading

To reproduce

Possible cause

Possible solution

GaspPredict commented Jan 3, 2025 •

edited

Loading