require 'zlib'

# File example.txt.gz is attached to the issue
test_file = Zlib::GzipReader.open("example.txt.gz")
test_file.gets(3125)
test_file.gets(3125)
test_file.gets(3125)
test_file.gets(2513)
test_file.gets(2)
test_file.gets(100)
test_file.gets(8017)
test_file.gets(2)
test_file.gets(4579)
test_file.gets(3127)
test_file.gets(3127)
test_file.gets(3126)
test_file.gets(1)
result = test_file.gets(14020)
result.valid_encoding? # => false

# Printing out the result shows that it contains two bytes at the end
# without the final byte to complete the character
# ...南部から、つづ\xE3\x81"

# Reading the next chunk with gets and combining the strings results in
# a string that is once again complete
next_chunk = test_file.gets(100)
(result + next_chunk).valid_encoding? # => true
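The symptom can also be shown without the attached file. The following is only a sketch of what a string cut in the middle of a character looks like. It assumes, purely for illustration, that the character following つづ is く (bytes E3 81 8F); the actual next character in example.txt is not shown in the output above.

```ruby
# Hypothetical sample text; only the E3 81 prefix is taken from the report.
text = "つづく"

# Drop the final byte, leaving the last character one byte short.
truncated = text.byteslice(0, text.bytesize - 1)
last_byte = text.byteslice(text.bytesize - 1, 1)

puts truncated.valid_encoding?               # the last character is incomplete
puts truncated.bytes.last(2).map { |b| format("0x%02X", b) }.join(" ")
puts (truncated + last_byte).valid_encoding? # complete again once the byte is appended
```

This mirrors the report: the truncated string ends in 0xE3 0x81, and it becomes valid again once the missing byte is appended.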
This issue seems to occur when gets is called with a limit that is 1 byte smaller than the contents of the currently loaded buffer.
In gzreader_charboundary, rb_enc_precise_mbclen is used to determine the number of missing bytes at the end, but gzfile_fill is then called with n + MBCLEN_NEEDMORE_LEN(n_bytes), where n can be 1 byte smaller than the currently loaded buffer size. As a result, no additional data is read.
static long
gzreader_charboundary(struct gzfile *gz, long n)
{
    // EX: In case of the example above:
    // EX: n = 14020
    // EX: ZSTREAM_BUF_FILLED(&gz->z) = 14021
    // EX: last four bytes of s:
    // EX: ... | 0xA5  | 0xE3  | 0x81  | 0x00  | (byte)
    // EX: ... | 14019 | 14020 | 14021 | 14022 | (index)
    char *s = RSTRING_PTR(gz->z.buf);
    char *e = s + ZSTREAM_BUF_FILLED(&gz->z);
    // EX: p correctly points to 0xE3 (byte 14020 of s)
    char *p = rb_enc_left_char_head(s, s + n - 1, e, gz->enc);
    long l = p - s;
    if (l < n) {
        // EX: rb_enc_precise_mbclen correctly determines that 1 more byte is needed (after 0xE3 and 0x81)
        int n_bytes = rb_enc_precise_mbclen(p, e, gz->enc);
        if (MBCLEN_NEEDMORE_P(n_bytes)) {
            // EX: n + MBCLEN_NEEDMORE_LEN(n_bytes) = 14020 + 1 which is NOT correct:
            // EX: 14021 is already size of the buffer, the value should be 14022
            if ((l = gzfile_fill(gz, n + MBCLEN_NEEDMORE_LEN(n_bytes), Qnil)) > 0) {
                return l;
            }
        }
        else if (MBCLEN_CHARFOUND_P(n_bytes)) {
            return l + MBCLEN_CHARFOUND_LEN(n_bytes);
        }
    }
    return n;
}
Possible solution
Since rb_enc_precise_mbclen determines how many bytes are needed based on the entire buffer (up to the e pointer), gzfile_fill should also be asked for the current buffer size plus the number of missing bytes:
diff --git a/ext/zlib/zlib.c b/ext/zlib/zlib.c
index 0b9c4d6..5d7bd1b 100644
--- a/ext/zlib/zlib.c
+++ b/ext/zlib/zlib.c
@@ -4316,7 +4316,7 @@ gzreader_charboundary(struct gzfile *gz, long n)
     if (l < n) {
         int n_bytes = rb_enc_precise_mbclen(p, e, gz->enc);
         if (MBCLEN_NEEDMORE_P(n_bytes)) {
-            if ((l = gzfile_fill(gz, n + MBCLEN_NEEDMORE_LEN(n_bytes), Qnil)) > 0) {
+            if ((l = gzfile_fill(gz, ZSTREAM_BUF_FILLED(&gz->z) + MBCLEN_NEEDMORE_LEN(n_bytes), Qnil)) > 0) {
                 return l;
             }
         }
As far as I have tested, this does seem to fix the issue.
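The length arithmetic behind the current call and the proposed one can be compared with a toy model. This is only a sketch: ToyBuffer is a hypothetical stand-in, and its fill method assumes that a fill request reads more data only when the requested length exceeds what is already buffered, which is the behaviour described in this report rather than a transcription of gzfile_fill.

```ruby
# Hypothetical stand-in for the decompressed buffer; not part of zlib.
class ToyBuffer
  attr_reader :filled

  def initialize(filled)
    @filled = filled
  end

  # Assumed behaviour: grow the buffer only if more bytes are requested
  # than are already buffered.
  def fill(requested)
    @filled = requested if requested > @filled
    @filled
  end
end

n = 14_020        # limit passed to gets in the example
need_more = 1     # bytes missing from the last character

buggy = ToyBuffer.new(14_021)
buggy.fill(n + need_more)            # requests 14021, already buffered

fixed = ToyBuffer.new(14_021)
fixed.fill(fixed.filled + need_more) # requests 14022, one more byte is read

puts "current call:  #{buggy.filled} bytes buffered"
puts "proposed call: #{fixed.filled} bytes buffered"
```

With the numbers from the example, the current call leaves the buffer at 14021 bytes, while the proposed call extends it to 14022, which is enough to complete the character.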
Calling gets with a limit in certain cases results in incomplete multi-byte characters, as the returned string ends in the middle of a character.
To reproduce
Example file: example.txt.gz (uncompressed version example.txt)
This issue can occur when reading compressed CSV files, since the CSV reader reads a chunk from the IO and then checks whether the returned string has a valid encoding: https://github.com/ruby/csv/blob/f33c613ca94013db0667c4311311c98b6c20d5ea/lib/csv/parser.rb#L310-L312
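Until the reader itself is fixed, a caller that reads chunks and checks their encoding could top up an incomplete chunk byte by byte. The following is a caller-side sketch only, not what the CSV parser does: gets_whole_chars and max_extra_bytes are hypothetical names, and the helper assumes the IO object responds to gets(limit) and getbyte, as Zlib::GzipReader does.

```ruby
# Hypothetical helper: read up to `limit` bytes, then pull extra bytes
# until the chunk has a valid encoding again (or the budget runs out).
def gets_whole_chars(io, limit, max_extra_bytes: 3)
  chunk = io.gets(limit)
  return chunk if chunk.nil? || chunk.valid_encoding?

  encoding = chunk.encoding
  bytes = chunk.b # binary copy, so raw bytes can be appended safely
  max_extra_bytes.times do
    byte = io.getbyte
    break if byte.nil? # end of stream
    bytes << [byte].pack("C")
    candidate = bytes.dup.force_encoding(encoding)
    return candidate if candidate.valid_encoding?
  end
  # Still invalid: hand back what was read, tagged with the original encoding.
  bytes.force_encoding(encoding)
end
```

A call such as gets_whole_chars(test_file, 14020) would then be expected to return a chunk that ends on a character boundary. The default of 3 extra bytes is an assumption that covers UTF-8, where a character is at most 4 bytes long; other encodings may need a different budget.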
Source of gzreader_charboundary: https://github.com/ruby/zlib/blob/master/ext/zlib/zlib.c#L4309