-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: Dummy PR to check maint-16.1.0 status #41437
Conversation
…cal extension types (#41053) ### Rationale for this change Two specifications and one implementation of canonical extension types were added and this should be documented. ### What changes are included in this PR? This represents current state of canonical extension types. ### Are these changes tested? No, docs only. ### Are there any user-facing changes? In so much as they read docs. * GitHub Issue: #40799 Authored-by: Rok Mihevc <[email protected]> Signed-off-by: Rok Mihevc <[email protected]>
…41048) ### Rationale for this change The motivation here is to address #41047. There is severe performance drawback in reading a StringArray as value array of a DictionaryArray, because of repeated and unnecessary UTF 8 string decoding. ### What changes are included in this PR? - Added a new function Materialize() to materialize the values to a list. When materialized, GetString() reads from the vector directly. - Added test coverage. ### Are these changes tested? Yes ### Are there any user-facing changes? No. This change maintains backwards compatibility on the API surface. It is up to the client application to decide whether to materialize the array and gain performance. * GitHub Issue: #41047 Authored-by: Keshuang Shen <[email protected]> Signed-off-by: Curt Hagenlocher <[email protected]>
…is now empty (#40682) ### What changes are included in this PR? `_normalize_slice` now relies on `slice.indices` (https://docs.python.org/3/reference/datamodel.html#slice.indices). ### Are these changes tested? Yes. ### Are there any user-facing changes? Fixing wrong data returned in an edge case. * GitHub Issue: #40642 * GitHub Issue: #38768 Lead-authored-by: LucasG0 <[email protected]> Co-authored-by: Joris Van den Bossche <[email protected]> Signed-off-by: Joris Van den Bossche <[email protected]>
### Rationale for this change nanoarrow now supports most types and reading IPC streams. As discussed in the issue, there is interest in adding it to the implementation status page for visibility. ### What changes are included in this PR? A column was added to the table with the appropriate values characterizing the C library implementation status. ### Are these changes tested? Not needed (docs!) ### Are there any user-facing changes? No (docs!) * GitHub Issue: #40689 Lead-authored-by: Dewey Dunnington <[email protected]> Co-authored-by: Dewey Dunnington <[email protected]> Signed-off-by: Dewey Dunnington <[email protected]>
…1091) ### Rationale for this change We can't use multiple top-level `env:` in workflow. GH-40949 introduced a top-level `env:` by `macros.github_header()`. It broke workflows that already have top-level `env:`. ### What changes are included in this PR? Omit top-level `env:` key and reuse the top-level `env:` key generated by `macros.github_header()` in workflows. ### Are these changes tested? Yes. ### Are there any user-facing changes? No. * GitHub Issue: #41088 Authored-by: Sutou Kouhei <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
… for Cython 2 (#41059) ### Rationale for this change `test_make_write_options_error` has been failing on Cython 2 crossbow build because in older versions of Cython the methods were "regular" C extension method had type check automatically built in. In Cython 3 that is not the case, see cython/cython#6127 and so the check for `ParquetFileFormat` was added in #40976. ### What changes are included in this PR? Checking the error raised for both messages, type check and the check for `ParquetFileFormat` added in #40976. ### Are these changes tested? Yes. ### Are there any user-facing changes? No. * GitHub Issue: #41043 Authored-by: AlenkaF <[email protected]> Signed-off-by: Raúl Cumplido <[email protected]>
… C Device data interface (#41101) ### Rationale for this change It is not explicit what the value of the `ArrowDeviceArray::device_id` should be when a given device type has no notion of a device identifier (e.g., there is always only one). ### What changes are included in this PR? The text was clarified to recommend a value of -1. This was the value already used by Arrow C++. ### Are these changes tested? No tests needed (documentation) ### Are there any user-facing changes? No * GitHub Issue: #40801 Authored-by: Dewey Dunnington <[email protected]> Signed-off-by: Raúl Cumplido <[email protected]>
…or - add support for row-major (#40867) ### Rationale for this change The conversion from `RecordBatch` to `Tensor` class now exists but it doesn't support row-major `Tensor` as an output. This PR adds support for an option to construct row-major `Tensor`. ### What changes are included in this PR? This PR adds a `row_major` option in `RecordBatch::ToTensor` so that row-major `Tensor` can be constructed. The default conversion will be row-major. This for example works: ```python >>> import pyarrow as pa >>> import numpy as np >>> arr1 = [1, 2, 3, 4, 5, 6, 7, 8, 9] >>> arr2 = [10, 20, 30, 40, 50, 60, 70, 80, 90] >>> batch = pa.RecordBatch.from_arrays( ... [ ... pa.array(arr1, type=pa.uint16()), ... pa.array(arr2, type=pa.int16()), ... ... ], ["a", "b"] ... ) # Row-major >>> batch.to_tensor() <pyarrow.Tensor> type: int32 shape: (9, 2) strides: (8, 4) >>> batch.to_tensor().to_numpy().flags C_CONTIGUOUS : True F_CONTIGUOUS : False OWNDATA : False WRITEABLE : True ALIGNED : True WRITEBACKIFCOPY : False # Column-major >>> batch.to_tensor(row_major=False) <pyarrow.Tensor> type: int32 shape: (9, 2) strides: (4, 36) >>> batch.to_tensor(row_major=False).to_numpy().flags C_CONTIGUOUS : False F_CONTIGUOUS : True OWNDATA : False WRITEABLE : True ALIGNED : True WRITEBACKIFCOPY : False ``` ### Are these changes tested? Yes, in C++ and Python. ### Are there any user-facing changes? No. * GitHub Issue: #40866 Lead-authored-by: AlenkaF <[email protected]> Co-authored-by: Alenka Frim <[email protected]> Co-authored-by: Joris Van den Bossche <[email protected]> Signed-off-by: Joris Van den Bossche <[email protected]>
…ker on Windows on archery (#41120) ### Rationale for this change Windows wheels are currently failing due to the change on `ARCHERY_DEBUG=1` by default. This uses `--progress` on `docker build` which is not supported on Windows. ### What changes are included in this PR? Do not use `--progress` on the Windows builds. ### Are these changes tested? Yes on CI via archery. ### Are there any user-facing changes? No * GitHub Issue: #41119 Lead-authored-by: Raúl Cumplido <[email protected]> Co-authored-by: Sutou Kouhei <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
…InfoSelector behaviors against Azure for generic filesystem tests (#41068) ### Rationale for this change They are failing: ```text [ FAILED ] TestAzureHierarchicalNSGeneric.DeleteDir [ FAILED ] TestAzureHierarchicalNSGeneric.DeleteDirContents [ FAILED ] TestAzureHierarchicalNSGeneric.GetFileInfoSelector ``` ### What changes are included in this PR? * `DeleteDir()`: Check not a directory case * `DeleteDirContents()`: Check not a directory case * `GetFileInfoSelector()`: * Add not a directory check for input * Add support for returning metadata for directory ### Are these changes tested? Yes. ### Are there any user-facing changes? Yes. * GitHub Issue: #41034 Authored-by: Sutou Kouhei <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
### Rationale for this change See #40695 ### What changes are included in this PR? This PR does a few things: * Substrait is upgraded to the latest version * Support is added for the parameterized timestamp type (but not literals due to substrait-io/substrait#611). * Support is added for the following arrow-specific types: * fp16 * date_millis * time_seconds * time_millis * time_nanos * large_string * large_binary When adding support for the new timestamp types I also relaxed the restrictions on the time zone column. Substrait puts time zone information in the function and not the type. In other words, to print the "America/New York" value of a column of instants one would do something like `to_char(my_timestamp, "America/New York")` instead of `to_char(cast(my_timestamp, timestamp("nanos", "America/New York")`. However, the current implementation makes it impossible to produce or consume a plan with `to_char(my_timestamp, "America/New York")` because it would reject the type because it has a non-UTC time zone. With this latest change, we treat any non-empty timezone as a timezone_tz type. In addition, I have enabled conversions from "encoded types" to their unencoded representation. E.g. a type of `DICTIONARY<INT32>` will convert to `INT32`. At a logical expression / plan perspective these encodings are irrelevant. If anything, they may belong in a more physical plan representation. Should a need for them arise we can dig into it more later. However, I believe it is better to err on the side of generating "something" rather than failing in these cases. I don't consider this last change critical and can back it out if need be. ### Are these changes tested? Yes, I added new unit tests ### Are there any user-facing changes? Yes, via the Substrait conversion. These changes should be backwards compatible in that they only add functionality in places that previously reported "Not Supported". * GitHub Issue: #40695 Lead-authored-by: Weston Pace <[email protected]> Co-authored-by: Benjamin Kietzman <[email protected]> Signed-off-by: Weston Pace <[email protected]>
### What changes are included in this PR? Use LLVM 15 on Ubuntu 24.04, as LLVM 14 packages seem not always available. ### Are these changes tested? Yes, on CI. ### Are there any user-facing changes? No. * GitHub Issue: #41147 Authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
### Rationale for this change An error is received installing R duckdb: ``` #15 18.13 > remotes::install_github('duckdb/duckdb-r', build = FALSE) #15 18.27 Error: Failed to install 'unknown package' from **GitHub:** #15 18.27 Line starting 'Roxyg ...' is malformed! ``` Some searching seems to suggest that this is because R cannot process UTF-8 characters in DESCRIPTION files if the `LANG` is set to `C`. ### What changes are included in this PR? The `LANG` is set to `C.UTF-8` in the dockerfile for this CI job ### Are these changes tested? The change only affects a test ### Are there any user-facing changes? No * GitHub Issue: #41145 Authored-by: Weston Pace <[email protected]> Signed-off-by: Raúl Cumplido <[email protected]>
…41155) ### Rationale for this change Only do the final conversion to float16 on success, to avoid conditional jump on uninitialized values. Note: this is a benign error. But we want our Valgrind CI job to be as successful as possible. ### Are these changes tested? Yes, on CI. ### Are there any user-facing changes? No. * GitHub Issue: #41154 Authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
…er-tests (#41153) We don't want to maintain multiple CI platforms to reduce maintenance cost. Use GitHub Actions for docker-tests. Yes. No. * GitHub Issue: #41127 Authored-by: Sutou Kouhei <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
…Valgrind (#41163) ### Rationale for this change `GetFileInfo()` with generator reports false positive memory leak in Azure SDK for C++. ### What changes are included in this PR? Don't run `TestGetFileInfoGenerator()` with Valgrind. ### Are these changes tested? Yes. ### Are there any user-facing changes? No. * GitHub Issue: #41004 Authored-by: Sutou Kouhei <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
### Rationale for this change vcpkg doesn't work with CMake 3.29.1. See also: microsoft/vcpkg#37968 ### What changes are included in this PR? Use CMake 3.29.0 temporary. ### Are these changes tested? Yes. ### Are there any user-facing changes? No. * GitHub Issue: #41124 Lead-authored-by: Sutou Kouhei <[email protected]> Co-authored-by: Sutou Kouhei <[email protected]> Co-authored-by: Jacob Wujciak-Jens <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
…41178) ### Rationale for this change #37821 changed to use `add_test()` usage from old style to new style: 1a1d2c8?diff=unified&w=1#diff-1ce47eec54afaee769086e1a720c5ed65bc347cd8fc60a233de67fd895dda329L763-R764 MSVC generators multi-config generators. With old style, all tests are run without specifying `--build-config` explicitly. With new style, we need to specify `--build-config` explicitly. See also: https://cmake.org/cmake/help/latest/command/add_test.html ### What changes are included in this PR? Specify `--build-config` explicitly. ### Are these changes tested? Yes. ### Are there any user-facing changes? No. * GitHub Issue: #41169 Authored-by: Sutou Kouhei <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
…8.1 (#41181) ### Rationale for this change GObject Introspection is merging into GLib. It's not completed yet. GLib related `.gir` files are moved to GLib from GObject Introspection since GLib/GObject Introspection 2.80.0. But glib 2.80.0 conda package doesn't support this merge yet: conda-forge/glib-feedstock#174 ### What changes are included in this PR? Pin gobject-introspection to 1.78.1 for now. ### Are these changes tested? Yes. ### Are there any user-facing changes? No. * GitHub Issue: #41167 Authored-by: Sutou Kouhei <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
…41177) ### Rationale for this change We already have `ARROW_VALGRIND`. ### What changes are included in this PR? Remove redundant macro. ### Are these changes tested? Yes. ### Are there any user-facing changes? No. * GitHub Issue: #41176 Authored-by: Sutou Kouhei <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
…rrow PyCapsule Protocol (#40818) ### Rationale for this change See #38010 (comment) for more context. Right now for _consuming_ ArrowSchema-compatible objects that implement the PyCapsule interface, we only have the private `_import_from_c_capsule` (on Schema, Field, DataType) and we check for the protocol in the public `pa.schema(..)`. But that means you currently can only consume objects that represent the schema of a batch (struct type), and not schemas of individual arrays. ### What changes are included in this PR? Expand the `pa.field(..)` constructor to accept objects implementing the protocol method. ### Are these changes tested? TODO * GitHub Issue: #38010 Authored-by: Joris Van den Bossche <[email protected]> Signed-off-by: Joris Van den Bossche <[email protected]>
…ead of nightly (#41097) ### Rationale for this change Now NumPy has released a first RC for 2.0, we should update our PyArrow wheels to build with this released version instead of with the nightly numpy packages, to ensure we don't build our release wheels with an unstable numpy version. ### What changes are included in this PR? Increased the version requirement for numpy for the installed packages at build time to `numpy>=2.0.0rc1`, to force installing this RC instead of numpy 1.26 ### Are these changes tested? The wheel tests ensure that those wheels still work with older versions of numpy * GitHub Issue: #39848 Authored-by: Joris Van den Bossche <[email protected]> Signed-off-by: Raúl Cumplido <[email protected]>
… compatibility (#41071) ### Rationale for this change Adapting for changes in numpy 2.0 as decribed at https://numpy.org/devdocs/numpy_2_0_migration_guide.html#adapting-to-changes-in-the-copy-keyword and future changes to pass copy=True (numpy/numpy#26208) ### What changes are included in this PR? Add a `copy=None` to the signatures of our `__array__` methods. This does have impact on the user facing behaviour, though. Questioning that upstream at numpy/numpy#25941 (comment) ### Are these changes tested? Yes ### Are there any user-facing changes? No (compared to usage with numpy<2) * GitHub Issue: #39532 * GitHub Issue: #41098 Authored-by: Joris Van den Bossche <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
…1070) ### Rationale for this change Loading the `null_count` attribute doesn't take into account the possible value of -1, leading to a code path where the validity buffer is accessed, but which is not necessarily present in that case. ### What changes are included in this PR? Use `data->MayHaveNulls()` instead of `data->null_count.load()` ### Are these changes tested? Yes * GitHub Issue: #41016 Authored-by: Joris Van den Bossche <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
### Rationale for this change Since the left anti filter implementation is based on the left semi filter, and an assertion error occurs when the left semi filter rows are empty, this problem should be fixed. ### What changes are included in this PR? swiss_join.cc and hash_join_node_test.cc ### Are these changes tested? Yes ### Are there any user-facing changes? No * GitHub Issue: #41121 Lead-authored-by: light-city <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
) ### Rationale for this change Upstream problem conda-forge/glib-feedstock#174 has been fixed. ### What changes are included in this PR? Revert pinning. ### Are these changes tested? Yes. ### Are there any user-facing changes? No. * GitHub Issue: #41227 Authored-by: Sutou Kouhei <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
…pandas` (#40897) ### Rationale for this change Avoiding using pandas internals to create Block objects ourselves, using a new API for pandas>=3 * GitHub Issue: #35081 Authored-by: Joris Van den Bossche <[email protected]> Signed-off-by: Joris Van den Bossche <[email protected]>
…:string to avoid compiler interpreting char* -> bool (#41202) * GitHub Issue: #41201 Authored-by: David Li <[email protected]> Signed-off-by: Raúl Cumplido <[email protected]>
### Rationale for this change We can't throw warnings on cran. ### What changes are included in this PR? Update function to match changes in libarrow added in GH-39864 ### Are these changes tested? CI ### Are there any user-facing changes? No * GitHub Issue: #41398 Authored-by: Jacob Wujciak-Jens <[email protected]> Signed-off-by: Jacob Wujciak-Jens <[email protected]>
…vent ub (#41421) ### Rationale for this change In #40237, I introduced scalar scratch space filling in concrete scalar sub-class constructor, in which there is a static down-casting of `this` to sub-class pointer. Though this is common in CRTP, it happens in base cast constructor. And this is reported in #41407 to be UB by UBSAN's "vptr" sanitizing. I'm not a language lawyer to tell if this is a true/false-positive. So I proposed two approaches: 1. The easy way: add suppression in [1], like we already did for `shared_ptr`. But apparently this won't be feasible if this is a true-positive (need some language lawyer's help to confirm). 2. The hard way: totally avoid this so-to-speak UB but may introduce more boilerplate code. This PR is the hard way. [1] https://github.com/apache/arrow/blob/main/r/tools/ubsan.supp ### What changes are included in this PR? Make `FillScratchSpace` static. ### Are these changes tested? The existing UT should cover it well. ### Are there any user-facing changes? None. * GitHub Issue: #41407 Lead-authored-by: Ruoxi Sun <[email protected]> Co-authored-by: Rossi Sun <[email protected]> Co-authored-by: Benjamin Kietzman <[email protected]> Signed-off-by: Benjamin Kietzman <[email protected]>
…ut threading support (#41461) ### Rationale for this change See #41463 and #40237 (comment) ### What changes are included in this PR? Skip test for platforms that have no threading support. ### Are these changes tested? Change is test. ### Are there any user-facing changes? None. * GitHub Issue: #41463 Authored-by: Ruoxi Sun <[email protected]> Signed-off-by: Joris Van den Bossche <[email protected]>
### Rationale for this change install_azurite.sh is failing to install the latest version of Azurite and azure tests were failing. ### What changes are included in this PR? Temporarily pin azurite to v3.29.0 to unblock 16.1.0 release. A follow up issue is tracked here: #41505 ### Are these changes tested? Yes via archery ### Are there any user-facing changes? No * GitHub Issue: #41462 Authored-by: Raúl Cumplido <[email protected]> Signed-off-by: Raúl Cumplido <[email protected]>
This comment was marked as outdated.
This comment was marked as outdated.
Revision: 3a54e68 Submitted crossbow builds: ursacomputing/crossbow @ maint-16.1.0-nightly-tests-1 |
Revision: 3a54e68 Submitted crossbow builds: ursacomputing/crossbow @ maint-16.1.0-nightly-packaging-1 |
…macOS arm64 (#41567) ### Rationale for this change Wheels for macOS catalina are failing ### What changes are included in this PR? Use macos-13 instead of (latest) ARM ### Are these changes tested? Yes, via archery ### Are there any user-facing changes? No * GitHub Issue: #41566 Authored-by: Raúl Cumplido <[email protected]> Signed-off-by: Raúl Cumplido <[email protected]>
…r changing runner on java-jars
Revision: b28633c Submitted crossbow builds: ursacomputing/crossbow @ maint-16.1.0-java-jars-0
|
### Rationale for this change See #41140. This makes a sliced union array behave as expected without having to manually account for the array offset unless accessing the underlying buffers. ### What changes are included in this PR? Accounts for the offset and length when getting type ids, value offsets and field arrays for sparse and dense union arrays. ### Are these changes tested? Yes, I've updated the union array tests to cover this. ### Are there any user-facing changes? Yes, this is a user facing bug fix. * GitHub Issue: #41140 Authored-by: Adam Reeve <[email protected]> Signed-off-by: Curt Hagenlocher <[email protected]>
…amSplitDecoder (#41565) ### Rationale for this change This problem is raised from #40094 . Original bug fixed here: #34140 , but this is corrupt in #40094 . ### What changes are included in this PR? Refine checking ### Are these changes tested? * [x] Will add ### Are there any user-facing changes? Bugfix * GitHub Issue: #41562 Authored-by: mwish <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
Revision: 6cfebb9 Submitted crossbow builds: ursacomputing/crossbow @ maint-16.1.0-nuget-0
|
Revision: 6cfebb9 Submitted crossbow builds: ursacomputing/crossbow @ maint-16.1.0-nightly-verification-1 |
…-opens=java.base/java.nio
Revision: be62f6c Submitted crossbow builds: ursacomputing/crossbow @ maint-16.1.0-java-jars-1
|
…set (#41550) ### Rationale for this change When parquet dataset is reused to create multiple scanners, `FileMetaData` objects are cached to avoid parsing them again. However, these caused issues on encrypted files since internal file decryptors were no longer created by cached `FileMetaData` objects. ### What changes are included in this PR? Expose file_decryptor from FileMetaData and set it properly. ### Are these changes tested? Yes, modify the test to reproduce the issue and assure fixed. ### Are there any user-facing changes? No. * GitHub Issue: #41431 Authored-by: Gang Wu <[email protected]> Signed-off-by: Gang Wu <[email protected]>
…-like types (#41595) This PR includes 2 fixes: 1. support reading `date64` columns (as write is supported) 2. properly validate list-like data types (list of unsupported is unsupported) ### Rationale for this change See #41594 ### What changes are included in this PR? 1. Added `date64` reading & conversion funcs similar to `date32` 2. Refactored date type validation ### Are these changes tested? a55cd53 ### Are there any user-facing changes? No. * GitHub Issue: #41594 Authored-by: candiduslynx <[email protected]> Signed-off-by: Matt Topol <[email protected]>
@ursabot please benchmark |
Benchmark runs are scheduled for commit 7dd1d34. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete. |
Thanks for your patience. Conbench analyzed the 6 benchmarking runs that have been run so far on PR commit 7dd1d34. There were 82 benchmark results indicating a performance regression:
The full Conbench report has more details. |
16.1.0 was released |
DO NOT MERGE.
This PR is to track some crossbow jobs to validate status of maintenance branch before creating the first RC.