-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-41885: [C#] Add initial GObject based Dataset library #41886
base: main
Are you sure you want to change the base?
Conversation
|
I've opened this draft PR to see if anyone has any feedback on the approach so far (@CurtHagenlocher or @kou?). There are still a few issues to sort out before this could be considered ready though. I've taken the approach of using gir.core to generate C# code based on the GIR files that are created in the There are two GObject based projects that get generated, one for the core Arrow library and one for the Dataset library, and I've created a higher level Dataset library that doesn't directly expose the GObject classes. This adds more work compared to just providing the GObject library and adding additional methods to the GObject based classes to add extra functionality like in the Ruby library, but I think it's nicer this way to avoid having a mix of GObject and non-GObject based classes in the same API, eg. one method to get a GObject I'm still on the fence about this approach though, and this is also a bit awkward for types in the core Arrow GLib library that don't have a corresponding type in the main .NET Arrow library, like the FileSystem class. I've created a FileSystem wrapper class in the high level Dataset library but this could cause issues if we later want to use a FileSystem in another GObject based library. Remaining issues:
|
0216c00
to
3dc03c5
Compare
It seems that using gir.core makes sense. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Building in CI: gir.core needs separate Linux, Windows and MacOS gir files. For local development I just generate a single file but for testing and releasing in CI we'll probably want to generate per-OS files in separate jobs and then combine them. gir.core is designed to work with gir files that are checked in to source control, but I don't think that makes sense for us when the c_glib and C# code live in the same repository.
It makes sense.
We already have CI jobs that use the approach:
- https://github.com/apache/arrow/blob/main/dev/tasks/java-jars/github.yml
- https://github.com/apache/arrow/blob/main/dev/tasks/matlab/github.yml
- Generating .gir files is failing with MSVC. I have the c_glib libraries building after [GLib] Add support for MSVC with vcpkg #41134, but generating the gir and typelib files doesn't work yet and fails with linker errors (eg. see https://github.com/adamreeve/arrow/actions/runs/9261584329/job/25479186007)
Could you try this?
diff --git a/c_glib/arrow-glib/enums.h.template b/c_glib/arrow-glib/enums.h.template
index b7d3c99c0b..e49b717fb3 100644
--- a/c_glib/arrow-glib/enums.h.template
+++ b/c_glib/arrow-glib/enums.h.template
@@ -22,6 +22,8 @@
#include <glib-object.h>
+#include <arrow-glib/version.h>
+
G_BEGIN_DECLS
/*** END file-header ***/
@@ -31,6 +33,7 @@ G_BEGIN_DECLS
/*** END file-production ***/
/*** BEGIN value-header ***/
+GARROW_AVAILABLE_IN_ALL
GType @enum_name@_get_type(void) G_GNUC_CONST;
#define @ENUMPREFIX@_TYPE_@ENUMSHORT@ (@enum_name@_get_type())
/*** END value-header ***/
- The gir.core GObject and GLib libraries are not signed/strong named. I've had to add
<SignAssembly>false</SignAssembly>
to the projects that depend on these for now. I'm not sure whether this should be a blocker, and we could ask for signed versions of the packages to be published or publish our own bindings to these that are signed if needed.
Sorry. I'm not familiar with C# packaging.
- A lot of properties and methods aren't generated with errors like "Did not generate property 'FileSystemDatasetFactory.file-system': The type ?? / Arrow.FileSystem has not been resolved."
I think this is caused by Arrow
-> Apache.Arrow.GLibBindings
namespace substitution by sed
.
Can we hide symbols generated by Gir.Core instead of renaming?
- gir.core doesn't have a way to generate methods with "new", "virtual" or "override" modifiers. There are a lot of methods in the c_glib libraries that exist in both parent and derived classes and I've patched the generated code to add modifiers to fix this. There are also a few "ToString" methods that hide
object.ToString
that I've had to add "override" to. I'm not sure whether this should be supported by gir.core or maybe we just want a more robust way to fix these with post processing.
Could you ask Gir.Core developers about this? (It seems that Gir.Core should support this case...)
- gir.core is used as a git submodule. It would be nice if this was available as a tool that can be installed via NuGet, but this is probably not a blocker.
Could you ask Gir.Core developers to release GirTool
as a NuGet package?
namespaces=(Apache.Arrow.GLibBindings Apache.Arrow.Dataset.GLibBindings) | ||
|
||
os_names=(linux macos windows) | ||
for os_name in ${os_names[@]}; do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, we should use "${ARRAY[@]}"
style to avoid unexpected.
for os_name in ${os_names[@]}; do | |
for os_name in "${os_names[@]}"; do |
e.g.:
a=(aaa "bbb ccc" ddd)
echo "quoted"
for x in "${a[@]}"; do
echo "<${x}>"
done
echo "not quoted"
for x in ${a[@]}; do
echo "<${x}>"
done
$ bash /tmp/xxx.sh
quoted
<aaa>
<bbb ccc>
<ddd>
not quoted
<aaa>
<bbb>
<ccc>
<ddd>
os_names=(linux macos windows) | ||
for os_name in ${os_names[@]}; do | ||
mkdir -p "${gir_dir}/${os_name}" | ||
for gir_dep in ${gir_dependencies[@]}; do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for gir_dep in ${gir_dependencies[@]}; do | |
for gir_dep in "${gir_dependencies[@]}"; do |
|
||
# TODO: Make this select the appropriate platform and merge all platforms in CI build | ||
os_name="linux" | ||
for index in ${!arrow_gobject_libs[*]}; do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for index in ${!arrow_gobject_libs[*]}; do | |
for index in "${!arrow_gobject_libs[@]}"; do |
cp "${arrow_gir_dir}/${arrow_lib}.gir" "${gir_dir}/${os_name}" | ||
namespace=${namespaces[$index]} | ||
sed -i.bak "s/<namespace name=\"\([^\"]*\)\"/<namespace name=\"${namespace}\"/" "${gir_dir}/${os_name}/${arrow_lib}.gir" | ||
rm -r "${gir_dir}/${os_name}/${arrow_lib}.gir.bak" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cp "${arrow_gir_dir}/${arrow_lib}.gir" "${gir_dir}/${os_name}" | |
namespace=${namespaces[$index]} | |
sed -i.bak "s/<namespace name=\"\([^\"]*\)\"/<namespace name=\"${namespace}\"/" "${gir_dir}/${os_name}/${arrow_lib}.gir" | |
rm -r "${gir_dir}/${os_name}/${arrow_lib}.gir.bak" | |
namespace=${namespaces[$index]} | |
sed -e "s/<namespace name=\"\([^\"]*\)\"/<namespace name=\"${namespace}\"/" \ | |
"${arrow_gir_dir}/${arrow_lib}.gir" > "${gir_dir}/${os_name}" |
+ if (keys == null) | ||
+ { | ||
+ throw new ArgumentNullException(nameof(keys)); | ||
+ } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this?
This is annotated as nullable
:
arrow/c_glib/arrow-glib/compute.cpp
Line 1380 in 255dbf9
* @keys: (nullable) (array length=n_keys): Group keys. |
If this is a current Gir.Core limitation, could you report this to Gir.Core?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've opened gircore/gir.core#1077 for this
return; | ||
} | ||
|
||
GObject.Module.Initialize(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we remove this because Apache.Arrow.GLibBindings.Module.Initialize()
calls this?
Thanks for the thorough feedback @kou! I will follow up on all of these points but it might take a while.
I meant that storing the .gir files checked in to git doesn't make sense, not that merging outputs from different platforms didn't make sense. It's helpful to see some examples of this though thank you. I guess we could store the .gir files in git to simplify local development but it seems redundant and we'd need checks to ensure they're kept updated so would still need to build them in CI anyway.
Yes you're correct, I debugged this issue and found that it failed to match the types with their references due to the namespaces not matching (in https://github.com/gircore/gir.core/blob/ca3866730435b989e73ce2a9a05a81386ce9e332/src/Generation/GirLoader/Output/Class.cs#L49). I don't think there's an easy way to keep the same names and hide the generated symbols but I'll have a look into this. |
Ah, sorry. I agree that we don't want to push generated BTW, we may be able to use |
Just leaving an update here as it's been a while with no progress. I'd still like to make this work but have been busy with other priorities. I made some changes to gir.core to work better with the Arrow GObject library and the maintainer is on board with the idea of releasing |
Rationale for this change
Provide dataset functionality to .NET users.
What changes are included in this PR?
Adds a GObject based Dataset library with minimal functionality.
Are these changes tested?
Yes
Are there any user-facing changes?
Not yet as this isn't published to NuGet