Merge remote-tracking branch 'upstream/master' into provide-access-to-inner-parquet-writers
apache · Mar 8, 2024 · 4b5a5c5
2 parents 7609ed3 + 79634c0

Showing 33 changed files with 292 additions and 142 deletions.
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -92,19 +92,31 @@ export ARROW_TEST_DATA=$(cd ../testing/data; pwd)
 
 From here on, this is a pure Rust project and `cargo` can be used to run tests, benchmarks, docs and examples as usual.
 
-### Running the tests
+## Running the tests
 
 Run tests using the Rust standard `cargo test` command:
 
 ```bash
-# run all tests.
+# run all unit and integration tests
 cargo test
 
-
-# run only tests for the arrow crate
+# run tests for the arrow crate
 cargo test -p arrow
 ```
 
+For some changes, you may want to run additional tests. You can find up-to-date information on the current CI tests in [.github/workflows](https://github.com/apache/arrow-rs/tree/master/.github/workflows). Here are some examples of additional tests you may want to run:
+
+```bash
+# run tests for the parquet crate
+cargo test -p parquet
+
+# run arrow tests with all features enabled
+cargo test -p arrow --all-features
+
+# run the doc tests
+cargo test --doc
+```
+
 ## Code Formatting
 
 Our CI uses `rustfmt` to check code formatting. Before submitting a
@@ -118,10 +130,19 @@ cargo +stable fmt --all -- --check
 
 We recommend using `clippy` for checking lints during development. While we do not yet enforce `clippy` checks, we recommend not introducing new `clippy` errors or warnings.
 
-Run the following to check for clippy lints.
+Run the following to check for `clippy` lints:
 
 ```bash
+# run clippy with default settings
 cargo clippy
+
+```
+
+More comprehensive `clippy` checks can be run by adding flags:
+
+```bash
+# run clippy on the arrow crate with all features enabled, targeting all tests, examples, and benchmarks
+cargo clippy -p arrow --all-features --all-targets
 ```
 
 If you use Visual Studio Code with the `rust-analyzer` plugin, you can enable `clippy` to run each time you save a file. See https://users.rust-lang.org/t/how-to-use-clippy-in-vs-code-with-rust-analyzer/41881.
@@ -134,6 +155,33 @@ Search for `allow(clippy::` in the codebase to identify lints that are ignored/a
 - If you have several lints on a function or module, you may disable the lint on the function or module.
 - If a lint is pervasive across multiple modules, you may disable it at the crate level.
 
+## Running Benchmarks
+
+Running benchmarks is a good way to test the performance of a change. As benchmarks usually take a long time to run, we recommend running targeted benchmarks instead of the full suite.
+
+```bash
+# run all benchmarks
+cargo bench
+
+# run arrow benchmarks
+cargo bench -p arrow
+
+# run benchmark for the parse_time function within the arrow-cast crate
+cargo bench -p arrow-cast --bench parse_time
+```
+
+To compare a change against a baseline, save one with the `--save-baseline` flag on the base branch, then pass `--baseline` when running on your feature branch:
+
+```bash
+git checkout master
+
+cargo bench --bench parse_time -- --save-baseline master
+
+git checkout feature
+
+cargo bench --bench parse_time -- --baseline master
+```
+
 ## Git Pre-Commit Hook
 
 We can use [git pre-commit hook](https://git-scm.com/book/en/v2/Customizing-Git-Git-Hooks) to automate various kinds of git pre-commit checking/formatting.

diff --git a/arrow-array/src/array/primitive_array.rs b/arrow-array/src/array/primitive_array.rs
@@ -1557,7 +1557,10 @@ mod tests {
                 // roundtrip to and from datetime
                 assert_eq!(
                     1550902545147,
-                    arr.value_as_datetime(i).unwrap().timestamp_millis()
+                    arr.value_as_datetime(i)
+                        .unwrap()
+                        .and_utc()
+                        .timestamp_millis()
                 );
             } else {
                 assert!(arr.is_null(i));

diff --git a/arrow-array/src/record_batch.rs b/arrow-array/src/record_batch.rs
@@ -236,6 +236,11 @@ impl RecordBatch {
         self.schema.clone()
     }
 
+    /// Returns a reference to the [`SchemaRef`] of the record batch.
+    pub fn schema_ref(&self) -> &SchemaRef {
+        &self.schema
+    }
+
     /// Projects the schema onto the specified columns
     pub fn project(&self, indices: &[usize]) -> Result<RecordBatch, ArrowError> {
         let projected_schema = self.schema.project(indices)?;
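
The new `schema_ref` accessor sits next to `schema()`, which clones the `SchemaRef`. The following is a minimal sketch of that difference using only the standard library. `Schema` and `Batch` here are hypothetical stand-ins, not the arrow types, and the sketch assumes `SchemaRef` is an `Arc<Schema>` alias.

```rust
use std::sync::Arc;

// Hypothetical stand-in for arrow's `Schema`; only the ownership pattern matters here.
#[derive(Debug)]
struct Schema {
    field_names: Vec<String>,
}

// Assumption: mirrors an `Arc`-based `SchemaRef` alias.
type SchemaRef = Arc<Schema>;

// Hypothetical stand-in for `RecordBatch`.
struct Batch {
    schema: SchemaRef,
}

impl Batch {
    /// Returns a new handle to the schema, incrementing the `Arc` reference count.
    fn schema(&self) -> SchemaRef {
        self.schema.clone()
    }

    /// Borrows the existing handle without touching the reference count.
    fn schema_ref(&self) -> &SchemaRef {
        &self.schema
    }
}

fn main() {
    let batch = Batch {
        schema: Arc::new(Schema {
            field_names: vec!["id".to_string()],
        }),
    };

    // Borrowing leaves the reference count unchanged.
    let borrowed = batch.schema_ref();
    assert_eq!(Arc::strong_count(borrowed), 1);
    assert_eq!(borrowed.field_names.len(), 1);

    // Cloning creates a second owner of the same allocation.
    let owned = batch.schema();
    assert_eq!(Arc::strong_count(&owned), 2);
    assert!(Arc::ptr_eq(&owned, batch.schema_ref()));
}
```

A borrowing accessor is useful when the caller only needs to inspect the schema; a caller that needs ownership can still clone the returned reference.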

diff --git a/arrow-array/src/temporal_conversions.rs b/arrow-array/src/temporal_conversions.rs
@@ -43,20 +43,21 @@ pub const EPOCH_DAYS_FROM_CE: i32 = 719_163;
 /// converts a `i32` representing a `date32` to [`NaiveDateTime`]
 #[inline]
 pub fn date32_to_datetime(v: i32) -> Option<NaiveDateTime> {
-    NaiveDateTime::from_timestamp_opt(v as i64 * SECONDS_IN_DAY, 0)
+    Some(DateTime::from_timestamp(v as i64 * SECONDS_IN_DAY, 0)?.naive_utc())
 }
 
 /// converts a `i64` representing a `date64` to [`NaiveDateTime`]
 #[inline]
 pub fn date64_to_datetime(v: i64) -> Option<NaiveDateTime> {
     let (sec, milli_sec) = split_second(v, MILLISECONDS);
 
-    NaiveDateTime::from_timestamp_opt(
+    let datetime = DateTime::from_timestamp(
         // extract seconds from milliseconds
         sec,
         // discard extracted seconds and convert milliseconds to nanoseconds
         milli_sec * MICROSECONDS as u32,
-    )
+    )?;
+    Some(datetime.naive_utc())
 }
 
 /// converts a `i32` representing a `time32(s)` to [`NaiveDateTime`]
@@ -130,45 +131,48 @@ pub fn time_to_time64ns(v: NaiveTime) -> i64 {
 /// converts a `i64` representing a `timestamp(s)` to [`NaiveDateTime`]
 #[inline]
 pub fn timestamp_s_to_datetime(v: i64) -> Option<NaiveDateTime> {
-    NaiveDateTime::from_timestamp_opt(v, 0)
+    Some(DateTime::from_timestamp(v, 0)?.naive_utc())
 }
 
 /// converts a `i64` representing a `timestamp(ms)` to [`NaiveDateTime`]
 #[inline]
 pub fn timestamp_ms_to_datetime(v: i64) -> Option<NaiveDateTime> {
     let (sec, milli_sec) = split_second(v, MILLISECONDS);
 
-    NaiveDateTime::from_timestamp_opt(
+    let datetime = DateTime::from_timestamp(
         // extract seconds from milliseconds
         sec,
         // discard extracted seconds and convert milliseconds to nanoseconds
         milli_sec * MICROSECONDS as u32,
-    )
+    )?;
+    Some(datetime.naive_utc())
 }
 
 /// converts a `i64` representing a `timestamp(us)` to [`NaiveDateTime`]
 #[inline]
 pub fn timestamp_us_to_datetime(v: i64) -> Option<NaiveDateTime> {
     let (sec, micro_sec) = split_second(v, MICROSECONDS);
 
-    NaiveDateTime::from_timestamp_opt(
+    let datetime = DateTime::from_timestamp(
         // extract seconds from microseconds
         sec,
         // discard extracted seconds and convert microseconds to nanoseconds
         micro_sec * MILLISECONDS as u32,
-    )
+    )?;
+    Some(datetime.naive_utc())
 }
 
 /// converts a `i64` representing a `timestamp(ns)` to [`NaiveDateTime`]
 #[inline]
 pub fn timestamp_ns_to_datetime(v: i64) -> Option<NaiveDateTime> {
     let (sec, nano_sec) = split_second(v, NANOSECONDS);
 
-    NaiveDateTime::from_timestamp_opt(
+    let datetime = DateTime::from_timestamp(
         // extract seconds from nanoseconds
         sec,
         // discard extracted seconds
         nano_sec,
-    )
+    )?;
+    Some(datetime.naive_utc())
 }
 
 #[inline]
@@ -179,13 +183,13 @@ pub(crate) fn split_second(v: i64, base: i64) -> (i64, u32) {
 /// converts a `i64` representing a `duration(s)` to [`Duration`]
 #[inline]
 pub fn duration_s_to_duration(v: i64) -> Duration {
-    Duration::seconds(v)
+    Duration::try_seconds(v).unwrap()
 }
 
 /// converts a `i64` representing a `duration(ms)` to [`Duration`]
 #[inline]
 pub fn duration_ms_to_duration(v: i64) -> Duration {
-    Duration::milliseconds(v)
+    Duration::try_milliseconds(v).unwrap()
 }
 
 /// converts a `i64` representing a `duration(us)` to [`Duration`]
@@ -272,57 +276,57 @@ mod tests {
         date64_to_datetime, split_second, timestamp_ms_to_datetime, timestamp_ns_to_datetime,
         timestamp_us_to_datetime, NANOSECONDS,
     };
-    use chrono::NaiveDateTime;
+    use chrono::DateTime;
 
     #[test]
     fn negative_input_timestamp_ns_to_datetime() {
         assert_eq!(
             timestamp_ns_to_datetime(-1),
-            NaiveDateTime::from_timestamp_opt(-1, 999_999_999)
+            DateTime::from_timestamp(-1, 999_999_999).map(|x| x.naive_utc())
         );
 
         assert_eq!(
             timestamp_ns_to_datetime(-1_000_000_001),
-            NaiveDateTime::from_timestamp_opt(-2, 999_999_999)
+            DateTime::from_timestamp(-2, 999_999_999).map(|x| x.naive_utc())
         );
     }
 
     #[test]
     fn negative_input_timestamp_us_to_datetime() {
         assert_eq!(
             timestamp_us_to_datetime(-1),
-            NaiveDateTime::from_timestamp_opt(-1, 999_999_000)
+            DateTime::from_timestamp(-1, 999_999_000).map(|x| x.naive_utc())
         );
 
         assert_eq!(
             timestamp_us_to_datetime(-1_000_001),
-            NaiveDateTime::from_timestamp_opt(-2, 999_999_000)
+            DateTime::from_timestamp(-2, 999_999_000).map(|x| x.naive_utc())
         );
     }
 
     #[test]
     fn negative_input_timestamp_ms_to_datetime() {
         assert_eq!(
             timestamp_ms_to_datetime(-1),
-            NaiveDateTime::from_timestamp_opt(-1, 999_000_000)
+            DateTime::from_timestamp(-1, 999_000_000).map(|x| x.naive_utc())
         );
 
         assert_eq!(
             timestamp_ms_to_datetime(-1_001),
-            NaiveDateTime::from_timestamp_opt(-2, 999_000_000)
+            DateTime::from_timestamp(-2, 999_000_000).map(|x| x.naive_utc())
         );
     }
 
     #[test]
     fn negative_input_date64_to_datetime() {
         assert_eq!(
             date64_to_datetime(-1),
-            NaiveDateTime::from_timestamp_opt(-1, 999_000_000)
+            DateTime::from_timestamp(-1, 999_000_000).map(|x| x.naive_utc())
         );
 
         assert_eq!(
             date64_to_datetime(-1_001),
-            NaiveDateTime::from_timestamp_opt(-2, 999_000_000)
+            DateTime::from_timestamp(-2, 999_000_000).map(|x| x.naive_utc())
         );
     }
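
The negative-input tests above expect a negative timestamp to split into a floored seconds value plus a non-negative sub-second remainder. The body of `split_second` is not shown in this diff, so the following is only a sketch of one way to get that behaviour, assuming Euclidean division; the function below is an illustration, not the crate's implementation.

```rust
/// Splits `v`, a count of sub-second units, into whole seconds and a
/// non-negative remainder. `base` is the number of units per second.
/// Assumption: Euclidean division, so the remainder is never negative.
fn split_second(v: i64, base: i64) -> (i64, u32) {
    (v.div_euclid(base), v.rem_euclid(base) as u32)
}

fn main() {
    const NANOSECONDS: i64 = 1_000_000_000;
    const MILLISECONDS: i64 = 1_000;

    // One nanosecond before the epoch: second -1, plus 999_999_999 ns.
    assert_eq!(split_second(-1, NANOSECONDS), (-1, 999_999_999));
    assert_eq!(split_second(-1_000_000_001, NANOSECONDS), (-2, 999_999_999));

    // Milliseconds: the remainder is scaled to nanoseconds afterwards.
    let (sec, milli_sec) = split_second(-1_001, MILLISECONDS);
    assert_eq!((sec, milli_sec * 1_000_000), (-2, 999_000_000));
}
```

Plain `/` and `%` truncate toward zero, so `-1 % 1_000_000_000` would give a negative remainder that cannot be used as a nanosecond field. That is why the sketch uses `div_euclid` and `rem_euclid`.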