Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-41885: [C#] Add initial GObject based Dataset library #41886

Draft
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

adamreeve
Copy link
Contributor

@adamreeve adamreeve commented May 30, 2024

Rationale for this change

Provide dataset functionality to .NET users.

What changes are included in this PR?

Adds a GObject based Dataset library with minimal functionality.

Are these changes tested?

Yes

Are there any user-facing changes?

Not yet as this isn't published to NuGet

Copy link

⚠️ GitHub issue #41885 has been automatically assigned in GitHub to PR creator.

@adamreeve
Copy link
Contributor Author

adamreeve commented May 30, 2024

I've opened this draft PR to see if anyone has any feedback on the approach so far (@CurtHagenlocher or @kou?). There are still a few issues to sort out before this could be considered ready though.

I've taken the approach of using gir.core to generate C# code based on the GIR files that are created in the c_glib build. Alternatively we could use GtkSharp, which is more mature but less active now and only really focused on building Gtk bindings, whereas gir.core describes itself as "A C# binding generator for GObject based libraries providing a C# friendly API surface", and says it "Allows 3rd party developers to write bindings for other GObject-based libraries." GtkSharp also has its own custom XML format that's different to gir. I haven't had much luck using their parser with the c_glib code but it might be possible to convert to their format from gir files. Another possible approach would be to manually write code to call into the C libraries, which would be a lot more labour intensive.

There are two GObject based projects that get generated, one for the core Arrow library and one for the Dataset library, and I've created a higher level Dataset library that doesn't directly expose the GObject classes. This adds more work compared to just providing the GObject library and adding additional methods to the GObject based classes to add extra functionality like in the Ruby library, but I think it's nicer this way to avoid having a mix of GObject and non-GObject based classes in the same API, eg. one method to get a GObject RecordBatchReader and another to get an Apache.Arrow.IArrowArrayStream, or having to first get the GObject RecordBatchReader and then import it as an IArrowArrayStream.

I'm still on the fence about this approach though, and this is also a bit awkward for types in the core Arrow GLib library that don't have a corresponding type in the main .NET Arrow library, like the FileSystem class. I've created a FileSystem wrapper class in the high level Dataset library but this could cause issues if we later want to use a FileSystem in another GObject based library.

Remaining issues:

  • Building in CI: gir.core needs separate Linux, Windows and MacOS gir files. For local development I just generate a single file but for testing and releasing in CI we'll probably want to generate per-OS files in separate jobs and then combine them. gir.core is designed to work with gir files that are checked in to source control, but I don't think that makes sense for us when the c_glib and C# code live in the same repository.
  • Generating .gir files is failing with MSVC. I have the c_glib libraries building after [GLib] Add support for MSVC with vcpkg #41134, but generating the gir and typelib files doesn't work yet and fails with linker errors (eg. see https://github.com/adamreeve/arrow/actions/runs/9261584329/job/25479186007)
  • The gir.core GObject and GLib libraries are not signed/strong named. I've had to add <SignAssembly>false</SignAssembly> to the projects that depend on these for now. I'm not sure whether this should be a blocker, and we could ask for signed versions of the packages to be published or publish our own bindings to these that are signed if needed.
  • A lot of properties and methods aren't generated with errors like "Did not generate property 'FileSystemDatasetFactory.file-system': The type ?? / Arrow.FileSystem has not been resolved."
  • gir.core doesn't have a way to generate methods with "new", "virtual" or "override" modifiers. There are a lot of methods in the c_glib libraries that exist in both parent and derived classes and I've patched the generated code to add modifiers to fix this. There are also a few "ToString" methods that hide object.ToString that I've had to add "override" to. I'm not sure whether this should be supported by gir.core or maybe we just want a more robust way to fix these with post processing.
  • gir.core is used as a git submodule. It would be nice if this was available as a tool that can be installed via NuGet, but this is probably not a blocker.
  • The code generation is currently handled by a bash script. Integrating this into MSBuild or as a source generator would be nicer, although gir.core generates code for multiple projects at once so this could be a bit fiddly.

@adamreeve adamreeve force-pushed the dotnet-dataset-gir branch from 0216c00 to 3dc03c5 Compare May 30, 2024 04:14
@kou
Copy link
Member

kou commented May 30, 2024

It seems that using gir.core makes sense.
(I'll add more comments later...)

Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Building in CI: gir.core needs separate Linux, Windows and MacOS gir files. For local development I just generate a single file but for testing and releasing in CI we'll probably want to generate per-OS files in separate jobs and then combine them. gir.core is designed to work with gir files that are checked in to source control, but I don't think that makes sense for us when the c_glib and C# code live in the same repository.

It makes sense.

We already have CI jobs that use the approach:

Could you try this?

diff --git a/c_glib/arrow-glib/enums.h.template b/c_glib/arrow-glib/enums.h.template
index b7d3c99c0b..e49b717fb3 100644
--- a/c_glib/arrow-glib/enums.h.template
+++ b/c_glib/arrow-glib/enums.h.template
@@ -22,6 +22,8 @@
 
 #include <glib-object.h>
 
+#include <arrow-glib/version.h>
+
 G_BEGIN_DECLS
 /*** END file-header ***/
 
@@ -31,6 +33,7 @@ G_BEGIN_DECLS
 /*** END file-production ***/
 
 /*** BEGIN value-header ***/
+GARROW_AVAILABLE_IN_ALL
 GType @enum_name@_get_type(void) G_GNUC_CONST;
 #define @ENUMPREFIX@_TYPE_@ENUMSHORT@ (@enum_name@_get_type())
 /*** END value-header ***/
  • The gir.core GObject and GLib libraries are not signed/strong named. I've had to add <SignAssembly>false</SignAssembly> to the projects that depend on these for now. I'm not sure whether this should be a blocker, and we could ask for signed versions of the packages to be published or publish our own bindings to these that are signed if needed.

Sorry. I'm not familiar with C# packaging.

  • A lot of properties and methods aren't generated with errors like "Did not generate property 'FileSystemDatasetFactory.file-system': The type ?? / Arrow.FileSystem has not been resolved."

I think this is caused by Arrow -> Apache.Arrow.GLibBindings namespace substitution by sed.
Can we hide symbols generated by Gir.Core instead of renaming?

  • gir.core doesn't have a way to generate methods with "new", "virtual" or "override" modifiers. There are a lot of methods in the c_glib libraries that exist in both parent and derived classes and I've patched the generated code to add modifiers to fix this. There are also a few "ToString" methods that hide object.ToString that I've had to add "override" to. I'm not sure whether this should be supported by gir.core or maybe we just want a more robust way to fix these with post processing.

Could you ask Gir.Core developers about this? (It seems that Gir.Core should support this case...)

  • gir.core is used as a git submodule. It would be nice if this was available as a tool that can be installed via NuGet, but this is probably not a blocker.

Could you ask Gir.Core developers to release GirTool as a NuGet package?

namespaces=(Apache.Arrow.GLibBindings Apache.Arrow.Dataset.GLibBindings)

os_names=(linux macos windows)
for os_name in ${os_names[@]}; do
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, we should use "${ARRAY[@]}" style to avoid unexpected.

Suggested change
for os_name in ${os_names[@]}; do
for os_name in "${os_names[@]}"; do

e.g.:

a=(aaa "bbb ccc" ddd)

echo "quoted"
for x in "${a[@]}"; do
  echo "<${x}>"
done

echo "not quoted"
for x in ${a[@]}; do
  echo "<${x}>"
done
$ bash /tmp/xxx.sh
quoted
<aaa>
<bbb ccc>
<ddd>
not quoted
<aaa>
<bbb>
<ccc>
<ddd>

os_names=(linux macos windows)
for os_name in ${os_names[@]}; do
mkdir -p "${gir_dir}/${os_name}"
for gir_dep in ${gir_dependencies[@]}; do
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
for gir_dep in ${gir_dependencies[@]}; do
for gir_dep in "${gir_dependencies[@]}"; do


# TODO: Make this select the appropriate platform and merge all platforms in CI build
os_name="linux"
for index in ${!arrow_gobject_libs[*]}; do
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
for index in ${!arrow_gobject_libs[*]}; do
for index in "${!arrow_gobject_libs[@]}"; do

Comment on lines +34 to +37
cp "${arrow_gir_dir}/${arrow_lib}.gir" "${gir_dir}/${os_name}"
namespace=${namespaces[$index]}
sed -i.bak "s/<namespace name=\"\([^\"]*\)\"/<namespace name=\"${namespace}\"/" "${gir_dir}/${os_name}/${arrow_lib}.gir"
rm -r "${gir_dir}/${os_name}/${arrow_lib}.gir.bak"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
cp "${arrow_gir_dir}/${arrow_lib}.gir" "${gir_dir}/${os_name}"
namespace=${namespaces[$index]}
sed -i.bak "s/<namespace name=\"\([^\"]*\)\"/<namespace name=\"${namespace}\"/" "${gir_dir}/${os_name}/${arrow_lib}.gir"
rm -r "${gir_dir}/${os_name}/${arrow_lib}.gir.bak"
namespace=${namespaces[$index]}
sed -e "s/<namespace name=\"\([^\"]*\)\"/<namespace name=\"${namespace}\"/" \
"${arrow_gir_dir}/${arrow_lib}.gir" > "${gir_dir}/${os_name}"

Comment on lines +37 to +40
+ if (keys == null)
+ {
+ throw new ArgumentNullException(nameof(keys));
+ }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this?

This is annotated as nullable:

* @keys: (nullable) (array length=n_keys): Group keys.

If this is a current Gir.Core limitation, could you report this to Gir.Core?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've opened gircore/gir.core#1077 for this

return;
}

GObject.Module.Initialize();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove this because Apache.Arrow.GLibBindings.Module.Initialize() calls this?

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels Jun 1, 2024
@adamreeve
Copy link
Contributor Author

Thanks for the thorough feedback @kou! I will follow up on all of these points but it might take a while.

It makes sense.
We already have CI jobs that use the approach:

I meant that storing the .gir files checked in to git doesn't make sense, not that merging outputs from different platforms didn't make sense. It's helpful to see some examples of this though thank you. I guess we could store the .gir files in git to simplify local development but it seems redundant and we'd need checks to ensure they're kept updated so would still need to build them in CI anyway.

I think this is caused by Arrow -> Apache.Arrow.GLibBindings namespace substitution by sed.
Can we hide symbols generated by Gir.Core instead of renaming?

Yes you're correct, I debugged this issue and found that it failed to match the types with their references due to the namespaces not matching (in https://github.com/gircore/gir.core/blob/ca3866730435b989e73ce2a9a05a81386ce9e332/src/Generation/GirLoader/Output/Class.cs#L49).

I don't think there's an easy way to keep the same names and hide the generated symbols but I'll have a look into this.

@kou
Copy link
Member

kou commented Jun 2, 2024

Ah, sorry. I agree that we don't want to push generated .gir to this repository. For local development, we may improve our docker-compose.yaml or something instead of pushing generated .gir to this repository. We'll use CI to build NuGet packages for release.

BTW, we may be able to use .gir files generated on Linux for all platforms because we don't have platform dependent APIs.

@adamreeve
Copy link
Contributor Author

Just leaving an update here as it's been a while with no progress. I'd still like to make this work but have been busy with other priorities.

I made some changes to gir.core to work better with the Arrow GObject library and the maintainer is on board with the idea of releasing GirTool as a NuGet package, but I haven't got around to doing that yet. Being able to change the namespace was a bit of a sticking point though so we might have to deal with using a namespace like Arrow instead of Apache.Arrow in the generated libraries (gircore/gir.core#1090).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants