Merge branch 'master' into add-parquet-derive-to-readme
konjac authored May 23, 2024
2 parents 8d87ad2 + 5e9919f commit abf82ae
Showing 70 changed files with 2,479 additions and 2,592 deletions.
17 changes: 12 additions & 5 deletions .github/workflows/integration.yml
Original file line number Diff line number Diff line change
@@ -57,15 +57,17 @@ jobs:
env:
ARROW_USE_CCACHE: OFF
ARROW_CPP_EXE_PATH: /build/cpp/debug
ARROW_NANOARROW_PATH: /build/nanoarrow
ARROW_RUST_EXE_PATH: /build/rust/debug
BUILD_DOCS_CPP: OFF
ARROW_INTEGRATION_CPP: ON
ARROW_INTEGRATION_CSHARP: ON
ARROW_INTEGRATION_GO: ON
ARROW_INTEGRATION_JAVA: ON
ARROW_INTEGRATION_JS: ON
ARCHERY_INTEGRATION_WITH_NANOARROW: "1"
# https://github.com/apache/arrow/pull/38403/files#r1371281630
ARCHERY_INTEGRATION_WITH_RUST: ON
ARCHERY_INTEGRATION_WITH_RUST: "1"
# These are necessary because the github runner overrides $HOME
# https://github.com/actions/runner/issues/863
RUSTUP_HOME: /root/.rustup
@@ -95,11 +97,16 @@ jobs:
with:
path: rust
fetch-depth: 0
# Workaround https://github.com/rust-lang/jobserver-rs/issues/87
# Can be removed once https://github.com/rust-lang/jobserver-rs/pull/88 is released
- name: Downgrade jobserver
- name: Checkout Arrow nanoarrow
uses: actions/checkout@v4
with:
repository: apache/arrow-nanoarrow
path: nanoarrow
fetch-depth: 0
# Workaround https://github.com/rust-lang/rust/issues/125067
- name: Downgrade rust
working-directory: rust
run: cargo update -p cc --precise 1.0.94 && cargo update -p jobserver --precise 0.1.28
run: rustup override set 1.77
- name: Build
run: conda run --no-capture-output ci/scripts/integration_arrow_build.sh $PWD /build
- name: Run
72 changes: 49 additions & 23 deletions README.md
@@ -17,32 +17,41 @@
under the License.
-->

# Native Rust implementation of Apache Arrow and Parquet
# Native Rust implementation of Apache Arrow and Apache Parquet

[![Coverage Status](https://codecov.io/gh/apache/arrow-rs/rust/branch/master/graph/badge.svg)](https://codecov.io/gh/apache/arrow-rs?branch=master)

Welcome to the implementation of Arrow, the popular in-memory columnar format, in [Rust][rust].
Welcome to the [Rust][rust] implementation of [Apache Arrow], the popular in-memory columnar format.

This repo contains the following main components:

| Crate | Description | Latest API Docs | README |
| ------------- | --------------------------------------------------------------------------- | ----------------------------------------------- | ------------------------------- |
| arrow | Core functionality (memory layout, arrays, low level computations) | [docs.rs](https://docs.rs/arrow/latest) | [(README)][arrow-readme] |
| arrow-flight | Support for Arrow-Flight IPC protocol | [docs.rs](https://docs.rs/arrow-flight/latest) | [(README)][flight-readme] |
| object-store | Support for object store interactions (aws, azure, gcp, local, in-memory) | [docs.rs](https://docs.rs/object_store/latest) | [(README)][objectstore-readme] |
| parquet | Support for Parquet columnar file format | [docs.rs](https://docs.rs/parquet/latest) | [(README)][parquet-readme] |
| parquet_derive| A crate for deriving RecordWriter/RecordReader for arbitrary, simple structs| [docs.rs](https://docs.rs/parquet-derive/latest)| [(README)][parquet-derive-readme]|
| Crate | Description | Latest API Docs | README |
| ----------------- | --------------------------------------------------------------------------- | ----------------------------------------------- | -------------------------------- |
| [`arrow`] | Core functionality (memory layout, arrays, low level computations) | [docs.rs](https://docs.rs/arrow/latest) | [(README)][arrow-readme] |
| [`arrow-flight`] | Support for Arrow-Flight IPC protocol | [docs.rs](https://docs.rs/arrow-flight/latest) | [(README)][flight-readme] |
| [`object-store`] | Support for object store interactions (aws, azure, gcp, local, in-memory) | [docs.rs](https://docs.rs/object_store/latest) | [(README)][objectstore-readme] |
| [`parquet`] | Support for Parquet columnar file format | [docs.rs](https://docs.rs/parquet/latest) | [(README)][parquet-readme] |
| [`parquet_derive`]| A crate for deriving RecordWriter/RecordReader for arbitrary, simple structs| [docs.rs](https://docs.rs/parquet-derive/latest)| [(README)][parquet-derive-readme]|

The current development version the API documentation in this repo can be found [here](https://arrow.apache.org/rust).

[apache arrow]: https://arrow.apache.org/
[`arrow`]: https://crates.io/crates/arrow
[`parquet`]: https://crates.io/crates/parquet
[`parquet-derive`]: https://crates.io/crates/parquet-derive
[`arrow-flight`]: https://crates.io/crates/arrow-flight
[`object-store`]: https://crates.io/crates/object-store

## Release Versioning and Schedule

### `arrow` and `parquet` crates

The Arrow Rust project releases approximately monthly and follows [Semantic
Versioning](https://semver.org/).
Versioning].

Due to available maintainer and testing bandwidth, `arrow` crates (`arrow`,
`arrow-flight`, etc.) are released on the same schedule with the same versions
as the `parquet` and `parquet-derive` crates.
Due to available maintainer and testing bandwidth, [`arrow`] crates ([`arrow`],
[`arrow-flight`], etc.) are released on the same schedule with the same versions
as the [`parquet`] and [`parquet-derive`] crates.

Starting June 2024, we plan to release new major versions with potentially
breaking API changes at most once a quarter, and release incremental minor versions in
@@ -58,27 +67,44 @@ For example:
| Sep 2024 | `53.0.0` | Major, potentially breaking API changes |

[this ticket]: https://github.com/apache/arrow-rs/issues/5368
[semantic versioning]: https://semver.org/

### `object_store` crate

The [`object_store`] crate is released independently of the `arrow` and
`parquet` crates and follows [Semantic Versioning]. We aim to release new
versions approximately every 2 months.

[`object_store`]: https://crates.io/crates/object_store

## Related Projects

There are two related crates in different repositories

| Crate | Description | Documentation |
| ---------- | --------------------------------------- | ----------------------------- |
| DataFusion | In-memory query engine with SQL support | [(README)][datafusion-readme] |
| Ballista | Distributed query execution | [(README)][ballista-readme] |
| Crate | Description | Documentation |
| -------------- | --------------------------------------- | ----------------------------- |
| [`datafusion`] | In-memory query engine with SQL support | [(README)][datafusion-readme] |
| [`ballista`] | Distributed query execution | [(README)][ballista-readme] |

[`datafusion`]: https://crates.io/crates/datafusion
[`ballista`]: https://crates.io/crates/ballista

Collectively, these crates support a vast array of functionality for analytic computations in Rust.
Collectively, these crates support a wider array of functionality for analytic computations in Rust.

For example, you can write an SQL query or a `DataFrame` (using the `datafusion` crate), run it against a parquet file (using the `parquet` crate), evaluate it in-memory using Arrow's columnar format (using the `arrow` crate), and send to another process (using the `arrow-flight` crate).
For example, you can write SQL queries or a `DataFrame` (using the
[`datafusion`] crate) to read a parquet file (using the [`parquet`] crate),
evaluate it in-memory using Arrow's columnar format (using the [`arrow`] crate),
and send to another process (using the [`arrow-flight`] crate).

Generally speaking, the `arrow` crate offers functionality for using Arrow arrays, and `datafusion` offers most operations typically found in SQL, including `join`s and window functions.
Generally speaking, the [`arrow`] crate offers functionality for using Arrow
arrays, and [`datafusion`] offers most operations typically found in SQL,
including `join`s and window functions.

You can find more details about each crate in their respective READMEs.

## Arrow Rust Community

The `dev@arrow.apache.org` mailing list serves as the core communication channel for the Arrow community. Instructions for signing up and links to the archives can be found at the [Arrow Community](https://arrow.apache.org/community/) page. All major announcements and communications happen there.
The `dev@arrow.apache.org` mailing list serves as the core communication channel for the Arrow community. Instructions for signing up and links to the archives can be found on the [Arrow Community](https://arrow.apache.org/community/) page. All major announcements and communications happen there.

The Rust Arrow community also uses the official [ASF Slack](https://s.apache.org/slack-invite) for informal discussions and coordination. This is
a great place to meet other contributors and get guidance on where to contribute. Join us in the `#arrow-rust` channel and feel free to ask for an invite via:
@@ -99,8 +125,8 @@ There is more information in the [contributing] guide.
[contributing]: CONTRIBUTING.md
[parquet-readme]: parquet/README.md
[flight-readme]: arrow-flight/README.md
[datafusion-readme]: https://github.com/apache/arrow-datafusion/blob/main/README.md
[ballista-readme]: https://github.com/apache/arrow-ballista/blob/main/README.md
[datafusion-readme]: https://github.com/apache/datafusion/blob/main/README.md
[ballista-readme]: https://github.com/apache/datafusion-ballista/blob/main/README.md
[objectstore-readme]: object_store/README.md
[parquet-derive-readme]: parquet_derive/README.md
[issues]: https://github.com/apache/arrow-rs/issues
43 changes: 20 additions & 23 deletions arrow-arith/src/numeric.rs
@@ -25,7 +25,7 @@ use arrow_array::cast::AsArray;
use arrow_array::timezone::Tz;
use arrow_array::types::*;
use arrow_array::*;
use arrow_buffer::ArrowNativeType;
use arrow_buffer::{ArrowNativeType, IntervalDayTime, IntervalMonthDayNano};
use arrow_schema::{ArrowError, DataType, IntervalUnit, TimeUnit};

use crate::arity::{binary, try_binary};
@@ -343,12 +343,12 @@ trait TimestampOp: ArrowTimestampType {
type Duration: ArrowPrimitiveType<Native = i64>;

fn add_year_month(timestamp: i64, delta: i32, tz: Tz) -> Option<i64>;
fn add_day_time(timestamp: i64, delta: i64, tz: Tz) -> Option<i64>;
fn add_month_day_nano(timestamp: i64, delta: i128, tz: Tz) -> Option<i64>;
fn add_day_time(timestamp: i64, delta: IntervalDayTime, tz: Tz) -> Option<i64>;
fn add_month_day_nano(timestamp: i64, delta: IntervalMonthDayNano, tz: Tz) -> Option<i64>;

fn sub_year_month(timestamp: i64, delta: i32, tz: Tz) -> Option<i64>;
fn sub_day_time(timestamp: i64, delta: i64, tz: Tz) -> Option<i64>;
fn sub_month_day_nano(timestamp: i64, delta: i128, tz: Tz) -> Option<i64>;
fn sub_day_time(timestamp: i64, delta: IntervalDayTime, tz: Tz) -> Option<i64>;
fn sub_month_day_nano(timestamp: i64, delta: IntervalMonthDayNano, tz: Tz) -> Option<i64>;
}
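
The signatures above swap the raw `i64` / `i128` deltas for the `IntervalDayTime` and `IntervalMonthDayNano` types. As a rough, std-only sketch of why a named struct is easier to pass around than a packed integer: `DayTime`, `pack`, and `unpack` below are hypothetical names, and the bit layout (days in the upper 32 bits, milliseconds in the lower 32) is an assumption made for illustration, not a statement about the arrow-rs encoding.

```rust
/// Hypothetical stand-in for a day-time interval; not the arrow-buffer type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct DayTime {
    days: i32,
    milliseconds: i32,
}

/// Pack both fields into one i64 (assumed layout: days high, milliseconds low).
fn pack(v: DayTime) -> i64 {
    ((v.days as i64) << 32) | (v.milliseconds as u32 as i64)
}

/// Recover the fields from the packed representation.
fn unpack(raw: i64) -> DayTime {
    DayTime {
        // Arithmetic shift keeps the sign of the day count.
        days: (raw >> 32) as i32,
        // Truncation keeps only the low 32 bits.
        milliseconds: raw as i32,
    }
}
```

With the packed form every caller has to remember the layout and call something like `unpack` before doing arithmetic; with the struct the fields are simply `delta.days` and `delta.milliseconds`, and the compiler rejects a plain `i64` passed by mistake.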

macro_rules! timestamp {
@@ -360,23 +360,23 @@ macro_rules! timestamp {
Self::add_year_months(left, right, tz)
}

fn add_day_time(left: i64, right: i64, tz: Tz) -> Option<i64> {
fn add_day_time(left: i64, right: IntervalDayTime, tz: Tz) -> Option<i64> {
Self::add_day_time(left, right, tz)
}

fn add_month_day_nano(left: i64, right: i128, tz: Tz) -> Option<i64> {
fn add_month_day_nano(left: i64, right: IntervalMonthDayNano, tz: Tz) -> Option<i64> {
Self::add_month_day_nano(left, right, tz)
}

fn sub_year_month(left: i64, right: i32, tz: Tz) -> Option<i64> {
Self::subtract_year_months(left, right, tz)
}

fn sub_day_time(left: i64, right: i64, tz: Tz) -> Option<i64> {
fn sub_day_time(left: i64, right: IntervalDayTime, tz: Tz) -> Option<i64> {
Self::subtract_day_time(left, right, tz)
}

fn sub_month_day_nano(left: i64, right: i128, tz: Tz) -> Option<i64> {
fn sub_month_day_nano(left: i64, right: IntervalMonthDayNano, tz: Tz) -> Option<i64> {
Self::subtract_month_day_nano(left, right, tz)
}
}
@@ -506,12 +506,12 @@ fn timestamp_op<T: TimestampOp>(
/// Note: these should be fallible (#4456)
trait DateOp: ArrowTemporalType {
fn add_year_month(timestamp: Self::Native, delta: i32) -> Self::Native;
fn add_day_time(timestamp: Self::Native, delta: i64) -> Self::Native;
fn add_month_day_nano(timestamp: Self::Native, delta: i128) -> Self::Native;
fn add_day_time(timestamp: Self::Native, delta: IntervalDayTime) -> Self::Native;
fn add_month_day_nano(timestamp: Self::Native, delta: IntervalMonthDayNano) -> Self::Native;

fn sub_year_month(timestamp: Self::Native, delta: i32) -> Self::Native;
fn sub_day_time(timestamp: Self::Native, delta: i64) -> Self::Native;
fn sub_month_day_nano(timestamp: Self::Native, delta: i128) -> Self::Native;
fn sub_day_time(timestamp: Self::Native, delta: IntervalDayTime) -> Self::Native;
fn sub_month_day_nano(timestamp: Self::Native, delta: IntervalMonthDayNano) -> Self::Native;
}

macro_rules! date {
@@ -521,23 +521,23 @@ macro_rules! date {
Self::add_year_months(left, right)
}

fn add_day_time(left: Self::Native, right: i64) -> Self::Native {
fn add_day_time(left: Self::Native, right: IntervalDayTime) -> Self::Native {
Self::add_day_time(left, right)
}

fn add_month_day_nano(left: Self::Native, right: i128) -> Self::Native {
fn add_month_day_nano(left: Self::Native, right: IntervalMonthDayNano) -> Self::Native {
Self::add_month_day_nano(left, right)
}

fn sub_year_month(left: Self::Native, right: i32) -> Self::Native {
Self::subtract_year_months(left, right)
}

fn sub_day_time(left: Self::Native, right: i64) -> Self::Native {
fn sub_day_time(left: Self::Native, right: IntervalDayTime) -> Self::Native {
Self::subtract_day_time(left, right)
}

fn sub_month_day_nano(left: Self::Native, right: i128) -> Self::Native {
fn sub_month_day_nano(left: Self::Native, right: IntervalMonthDayNano) -> Self::Native {
Self::subtract_month_day_nano(left, right)
}
}
@@ -1346,13 +1346,10 @@ mod tests {
IntervalMonthDayNanoType::make_value(35, -19, 41899000000000000)
])
);
let a = IntervalMonthDayNanoArray::from(vec![i64::MAX as i128]);
let b = IntervalMonthDayNanoArray::from(vec![1]);
let a = IntervalMonthDayNanoArray::from(vec![IntervalMonthDayNano::MAX]);
let b = IntervalMonthDayNanoArray::from(vec![IntervalMonthDayNano::ONE]);
let err = add(&a, &b).unwrap_err().to_string();
assert_eq!(
err,
"Compute error: Overflow happened on: 9223372036854775807 + 1"
);
assert_eq!(err, "Compute error: Overflow happened on: 2147483647 + 1");
}
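
The expected message in this test changes from `9223372036854775807 + 1` to `2147483647 + 1`. One plausible reading is that addition is now checked field by field, so the first field to overflow is a 32-bit one (`i32::MAX` is 2147483647). The sketch below models that behaviour with std only; `MonthDayNano`, its field order, and `add_checked` are hypothetical stand-ins, not the arrow-buffer implementation.

```rust
/// Hypothetical stand-in for a month-day-nano interval; not the arrow-buffer type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct MonthDayNano {
    months: i32,
    days: i32,
    nanoseconds: i64,
}

impl MonthDayNano {
    const MAX: Self = Self { months: i32::MAX, days: i32::MAX, nanoseconds: i64::MAX };
    const ONE: Self = Self { months: 1, days: 1, nanoseconds: 1 };

    /// Add field by field, reporting the first field that overflows.
    fn add_checked(self, rhs: Self) -> Result<Self, String> {
        let overflow = |l: i128, r: i128| format!("Overflow happened on: {l} + {r}");
        Ok(Self {
            months: self
                .months
                .checked_add(rhs.months)
                .ok_or_else(|| overflow(self.months.into(), rhs.months.into()))?,
            days: self
                .days
                .checked_add(rhs.days)
                .ok_or_else(|| overflow(self.days.into(), rhs.days.into()))?,
            nanoseconds: self
                .nanoseconds
                .checked_add(rhs.nanoseconds)
                .ok_or_else(|| overflow(self.nanoseconds.into(), rhs.nanoseconds.into()))?,
        })
    }
}
```

Under these assumptions `MonthDayNano::MAX.add_checked(MonthDayNano::ONE)` fails on the `months` field, which is why the reported operands are 32-bit values rather than `i64::MAX`.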

fn test_duration_impl<T: ArrowPrimitiveType<Native = i64>>() {
14 changes: 12 additions & 2 deletions arrow-array/src/arithmetic.rs
@@ -15,7 +15,7 @@
// specific language governing permissions and limitations
// under the License.

use arrow_buffer::{i256, ArrowNativeType};
use arrow_buffer::{i256, ArrowNativeType, IntervalDayTime, IntervalMonthDayNano};
use arrow_schema::ArrowError;
use half::f16;
use num::complex::ComplexFloat;
@@ -139,7 +139,10 @@ pub trait ArrowNativeTypeOp: ArrowNativeType {

macro_rules! native_type_op {
($t:tt) => {
native_type_op!($t, 0, 1, $t::MIN, $t::MAX);
native_type_op!($t, 0, 1);
};
($t:tt, $zero:expr, $one: expr) => {
native_type_op!($t, $zero, $one, $t::MIN, $t::MAX);
};
($t:tt, $zero:expr, $one: expr, $min: expr, $max: expr) => {
impl ArrowNativeTypeOp for $t {
@@ -284,6 +287,13 @@ native_type_op!(u32);
native_type_op!(u64);
native_type_op!(i256, i256::ZERO, i256::ONE, i256::MIN, i256::MAX);

native_type_op!(IntervalDayTime, IntervalDayTime::ZERO, IntervalDayTime::ONE);
native_type_op!(
IntervalMonthDayNano,
IntervalMonthDayNano::ZERO,
IntervalMonthDayNano::ONE
);
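
The new three-argument arm lets a type supply its own zero and one while still defaulting `MIN` and `MAX` to the type's associated constants. A minimal, std-only sketch of the same delegation pattern follows; the trait `NativeOp`, the macro `native_op!`, and the `Millis` type are hypothetical and much smaller than the real `ArrowNativeTypeOp`.

```rust
/// Hypothetical, cut-down trait carrying only the four constants.
trait NativeOp: Sized {
    const ZERO: Self;
    const ONE: Self;
    const MIN_VALUE: Self;
    const MAX_VALUE: Self;
}

macro_rules! native_op {
    // One argument: default zero and one to the integer literals.
    ($t:ident) => {
        native_op!($t, 0, 1);
    };
    // Three arguments: caller supplies zero and one, bounds come from the type.
    ($t:ident, $zero:expr, $one:expr) => {
        native_op!($t, $zero, $one, $t::MIN, $t::MAX);
    };
    // Five arguments: everything is explicit; this arm emits the impl.
    ($t:ident, $zero:expr, $one:expr, $min:expr, $max:expr) => {
        impl NativeOp for $t {
            const ZERO: Self = $zero;
            const ONE: Self = $one;
            const MIN_VALUE: Self = $min;
            const MAX_VALUE: Self = $max;
        }
    };
}

/// Hypothetical struct type that cannot use the literals `0` and `1` directly.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Millis(i64);

impl Millis {
    const ZERO: Self = Millis(0);
    const ONE: Self = Millis(1);
    const MIN: Self = Millis(i64::MIN);
    const MAX: Self = Millis(i64::MAX);
}

native_op!(i32);
native_op!(Millis, Millis::ZERO, Millis::ONE);
```

The middle arm only works for types that expose `MIN` and `MAX` associated constants, which is the same requirement the diff places on `IntervalDayTime` and `IntervalMonthDayNano`.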

macro_rules! native_type_float_op {
($t:tt, $zero:expr, $one:expr, $min:expr, $max:expr) => {
impl ArrowNativeTypeOp for $t {
13 changes: 11 additions & 2 deletions arrow-array/src/array/dictionary_array.rs
@@ -20,7 +20,8 @@ use crate::cast::AsArray;
use crate::iterator::ArrayIter;
use crate::types::*;
use crate::{
make_array, Array, ArrayAccessor, ArrayRef, ArrowNativeTypeOp, PrimitiveArray, StringArray,
make_array, Array, ArrayAccessor, ArrayRef, ArrowNativeTypeOp, PrimitiveArray, Scalar,
StringArray,
};
use arrow_buffer::bit_util::set_bit;
use arrow_buffer::buffer::NullBuffer;
@@ -312,6 +313,14 @@ impl<K: ArrowDictionaryKeyType> DictionaryArray<K> {
})
}

/// Create a new [`Scalar`] from `value`
pub fn new_scalar<T: Array + 'static>(value: Scalar<T>) -> Scalar<Self> {
Scalar::new(Self::new(
PrimitiveArray::new(vec![K::Native::usize_as(0)].into(), None),
Arc::new(value.into_inner()),
))
}
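
The added `new_scalar` builds a one-row dictionary: a single key `0` pointing at the single wrapped value. The idea can be sketched without Arrow at all; `Dict` below is a hypothetical, simplified model (plain `Vec`s, no null handling), not the `DictionaryArray` API.

```rust
/// Hypothetical, simplified dictionary-encoded column: `keys[i]` indexes into `values`.
#[derive(Debug, PartialEq)]
struct Dict<T> {
    keys: Vec<usize>,
    values: Vec<T>,
}

impl<T> Dict<T> {
    /// Wrap a single value as a one-row dictionary whose only key is 0.
    fn new_scalar(value: T) -> Self {
        Dict { keys: vec![0], values: vec![value] }
    }

    /// Resolve row `i` through its key; `None` if the row or key is out of range.
    fn get(&self, i: usize) -> Option<&T> {
        self.keys.get(i).and_then(|&k| self.values.get(k))
    }
}
```

Calling `Dict::new_scalar("a")` yields exactly one logical row, so `get(0)` resolves to the wrapped value and any other index is out of range.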

/// Create a new [`DictionaryArray`] without performing validation
///
/// # Safety
@@ -937,7 +946,7 @@ where
/// return Ok(d.with_values(r));
/// }
/// downcast_primitive_array! {
/// a => Ok(Arc::new(a.iter().map(|x| x.map(|x| x.to_string())).collect::<StringArray>())),
/// a => Ok(Arc::new(a.iter().map(|x| x.map(|x| format!("{x:?}"))).collect::<StringArray>())),
/// d => Err(ArrowError::InvalidArgumentError(format!("{d:?} not supported")))
/// }
/// }
14 changes: 13 additions & 1 deletion arrow-array/src/array/fixed_size_binary_array.rs
@@ -17,7 +17,7 @@

use crate::array::print_long_array;
use crate::iterator::FixedSizeBinaryIter;
use crate::{Array, ArrayAccessor, ArrayRef, FixedSizeListArray};
use crate::{Array, ArrayAccessor, ArrayRef, FixedSizeListArray, Scalar};
use arrow_buffer::buffer::NullBuffer;
use arrow_buffer::{bit_util, ArrowNativeType, BooleanBuffer, Buffer, MutableBuffer};
use arrow_data::{ArrayData, ArrayDataBuilder};
@@ -68,6 +68,12 @@ impl FixedSizeBinaryArray {
Self::try_new(size, values, nulls).unwrap()
}

/// Create a new [`Scalar`] from `value`
pub fn new_scalar(value: impl AsRef<[u8]>) -> Scalar<Self> {
let v = value.as_ref();
Scalar::new(Self::new(v.len() as _, Buffer::from(v), None))
}

/// Create a new [`FixedSizeBinaryArray`] from the provided parts, returning an error on failure
///
/// # Errors
@@ -551,6 +557,12 @@ impl From<Vec<&[u8]>> for FixedSizeBinaryArray {
}
}

impl<const N: usize> From<Vec<&[u8; N]>> for FixedSizeBinaryArray {
fn from(v: Vec<&[u8; N]>) -> Self {
Self::try_from_iter(v.into_iter()).unwrap()
}
}

impl std::fmt::Debug for FixedSizeBinaryArray {
fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
write!(f, "FixedSizeBinaryArray<{}>\n[\n", self.value_length())?;
3 changes: 2 additions & 1 deletion arrow-array/src/array/fixed_size_list_array.rs
@@ -183,7 +183,8 @@ impl FixedSizeListArray {
|| nulls
.as_ref()
.map(|n| n.expand(size as _).contains(&a))
.unwrap_or_default();
.unwrap_or_default()
|| (nulls.is_none() && a.null_count() == 0);
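
The extra clause covers the case where the list has no null buffer of its own and the child values contain no nulls; with only `unwrap_or_default()`, a missing parent buffer evaluated to `false`. A std-only sketch of the resulting three-way check, under stated assumptions: `nulls_valid` is a hypothetical helper, validity masks are plain `bool` slices where `true` means valid, and the first operand is modelled as a single `field_nullable` flag because the hunk does not show it.

```rust
/// Hypothetical helper modelling the check; `true` in a slice means "valid".
/// Assumes `size > 0` and, when a parent mask is given,
/// `child.len() == parent.len() * size`.
fn nulls_valid(
    field_nullable: bool,
    parent: Option<&[bool]>,
    child: &[bool],
    size: usize,
) -> bool {
    // Every null child slot must sit under a null parent slot.
    let covered = |p: &[bool]| {
        child
            .iter()
            .enumerate()
            .all(|(i, &valid)| valid || p.get(i / size) == Some(&false))
    };
    let child_null_count = child.iter().filter(|&&valid| !valid).count();

    field_nullable
        || parent.map(covered).unwrap_or_default()
        // New clause: no parent mask and no child nulls is also fine.
        || (parent.is_none() && child_null_count == 0)
}
```

In this model a non-nullable field with `parent = None` and an all-valid child now passes, while a child that contains a null not covered by a null parent slot is still rejected.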

if !nulls_valid {
return Err(ArrowError::InvalidArgumentError(format!(