-
Notifications
You must be signed in to change notification settings - Fork 821
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Convert some panics that happen on invalid parquet files to error results #6738
base: main
Are you sure you want to change the base?
Changes from 7 commits
f481dff
a4f8286
a88bc81
3c97bbc
a7db494
78994df
55f3f64
2597c9a
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -556,7 +556,8 @@ impl<'a> PrimitiveTypeBuilder<'a> { | |
} | ||
} | ||
PhysicalType::FIXED_LEN_BYTE_ARRAY => { | ||
let max_precision = (2f64.powi(8 * self.length - 1) - 1f64).log10().floor() as i32; | ||
let length = self.length.checked_mul(8).unwrap_or(i32::MAX); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. should the overflow error instead of falling through to max precision? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree it would be better to return an error: I double checked that using i32::MAX results in
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. sg, changed to returning error, thanks! |
||
let max_precision = (2f64.powi(length - 1) - 1f64).log10().floor() as i32; | ||
|
||
if self.precision > max_precision { | ||
return Err(general_err!( | ||
|
@@ -1171,9 +1172,25 @@ pub fn from_thrift(elements: &[SchemaElement]) -> Result<TypePtr> { | |
)); | ||
} | ||
|
||
if !schema_nodes[0].is_group() { | ||
return Err(general_err!("Expected root node to be a group type")); | ||
} | ||
|
||
Ok(schema_nodes.remove(0)) | ||
} | ||
|
||
/// Checks if the logical type is valid. | ||
fn check_logical_type(logical_type: &Option<LogicalType>) -> Result<()> { | ||
if let Some(LogicalType::Integer { bit_width, .. }) = *logical_type { | ||
if bit_width != 8 && bit_width != 16 && bit_width != 32 && bit_width != 64 { | ||
return Err(general_err!( | ||
"Bit width must be 8, 16, 32, or 64 for Integer logical type" | ||
)); | ||
} | ||
} | ||
Ok(()) | ||
} | ||
|
||
/// Constructs a new Type from the `elements`, starting at index `index`. | ||
/// The first result is the starting index for the next Type after this one. If it is | ||
/// equal to `elements.len()`, then this Type is the last one. | ||
|
@@ -1198,6 +1215,9 @@ fn from_thrift_helper(elements: &[SchemaElement], index: usize) -> Result<(usize | |
.logical_type | ||
.as_ref() | ||
.map(|value| LogicalType::from(value.clone())); | ||
|
||
check_logical_type(&logical_type)?; | ||
|
||
let field_id = elements[index].field_id; | ||
match elements[index].num_children { | ||
// From parquet-format: | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -67,7 +67,7 @@ impl<'a> TCompactSliceInputProtocol<'a> { | |
let mut shift = 0; | ||
loop { | ||
let byte = self.read_byte()?; | ||
in_progress |= ((byte & 0x7F) as u64) << shift; | ||
in_progress |= ((byte & 0x7F) as u64).wrapping_shl(shift); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What is the purpose of this change? As I understand it It seems to me that this change now avoids panic'ing in debug builds too, which isn't obviously better to me There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. thanks for bringing this up! I'm not sure if There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Note the C++ code for this loop actually validates [total bytes decoded] (https://github.com/apache/thrift/blob/master/lib/cpp/src/thrift/protocol/TCompactProtocol.tcc#L759) which is probably a good idea (I think this prevents the panic that the original checked shift was meant to do? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. thanks for the pointer. iiuc the c++ one may still have an overflow on the 10th bytes where it shifts 63 bits? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I might be misunderstanding but I thoght shifting by 63 bits is well defined on u64? So overflow yes but I wouldn't expect a panic? Also I think silently passing here if there is an overflow would be data corrupt data? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe we can consider this change independently if we are concerned about perf impacts There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. overflow yes, i agree we can add check on # of bytes decoded separately. For this cl let's just solve the panic ? |
||
shift += 7; | ||
if byte & 0x80 == 0 { | ||
return Ok(in_progress); | ||
|
@@ -96,13 +96,22 @@ impl<'a> TCompactSliceInputProtocol<'a> { | |
} | ||
} | ||
|
||
macro_rules! thrift_unimplemented { | ||
() => { | ||
Err(thrift::Error::Protocol(thrift::ProtocolError { | ||
kind: thrift::ProtocolErrorKind::NotImplemented, | ||
message: "not implemented".to_string(), | ||
})) | ||
}; | ||
} | ||
|
||
impl TInputProtocol for TCompactSliceInputProtocol<'_> { | ||
fn read_message_begin(&mut self) -> thrift::Result<TMessageIdentifier> { | ||
unimplemented!() | ||
} | ||
|
||
fn read_message_end(&mut self) -> thrift::Result<()> { | ||
unimplemented!() | ||
thrift_unimplemented!() | ||
} | ||
|
||
fn read_struct_begin(&mut self) -> thrift::Result<Option<TStructIdentifier>> { | ||
|
@@ -147,7 +156,21 @@ impl TInputProtocol for TCompactSliceInputProtocol<'_> { | |
), | ||
_ => { | ||
if field_delta != 0 { | ||
self.last_read_field_id += field_delta as i16; | ||
self.last_read_field_id = self | ||
.last_read_field_id | ||
.checked_add(field_delta as i16) | ||
.map_or_else( | ||
|| { | ||
Err(thrift::Error::Protocol(thrift::ProtocolError { | ||
kind: thrift::ProtocolErrorKind::InvalidData, | ||
message: format!( | ||
"cannot add {} to {}", | ||
field_delta, self.last_read_field_id | ||
), | ||
})) | ||
}, | ||
Ok, | ||
)?; | ||
} else { | ||
self.last_read_field_id = self.read_i16()?; | ||
}; | ||
|
@@ -226,15 +249,15 @@ impl TInputProtocol for TCompactSliceInputProtocol<'_> { | |
} | ||
|
||
fn read_set_begin(&mut self) -> thrift::Result<TSetIdentifier> { | ||
unimplemented!() | ||
thrift_unimplemented!() | ||
} | ||
|
||
fn read_set_end(&mut self) -> thrift::Result<()> { | ||
unimplemented!() | ||
thrift_unimplemented!() | ||
} | ||
|
||
fn read_map_begin(&mut self) -> thrift::Result<TMapIdentifier> { | ||
unimplemented!() | ||
thrift_unimplemented!() | ||
} | ||
|
||
fn read_map_end(&mut self) -> thrift::Result<()> { | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realize this would currently panic, but would one ever prefer to just set
column_orders
toNone
and continue? The only impact AFAIK would be statistics being unusable, which would only matter if predicates were in use.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point! i agree with the
setting to None
idea. but i guess this worths a separate issue to discuss and fix.