Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A simple plugin system #380

Merged
merged 10 commits into from
Nov 13, 2020
Merged

A simple plugin system #380

merged 10 commits into from
Nov 13, 2020

Conversation

roveo
Copy link
Contributor

@roveo roveo commented Nov 9, 2020

Right now, if I need to add my own custom stream nodes, I have to do this:

import mypackage.streamz_extras  # noqa: F401

It would be nice to have a way to distribute additional functionality as separate packages that can just be installed via pip. This can be done with entry_points, similar to the way it works in airflow. This is a super bare-bones implementation of this mechanism. Check out https://github.com/roveo/streamz_example_plugin for an example of a plugin.

Problems:

  • this should be tested, but I have no idea how, short of shipping the code for example plugin with tests
  • plugged-in classes should be checked for validity in some way. I can add a simple check e.g. isinstance(plugin, Stream), but there is probably something else I haven't thought of

@roveo roveo changed the title Proposal: A simple plugin system A simple plugin system Nov 9, 2020
@martindurant
Copy link
Member

Thank you. I totally agree that using entry_points is a great alternative or addition to the registration system.
@CJ-Wright @jsmaupin @chinmaychandak - anyone have time and interest to give this a once-over?

@jsmaupin
Copy link
Contributor

jsmaupin commented Nov 9, 2020

I think this is a great idea. I was hoping that we could figure out how do this in the future for sources and sinks where you could connect Streamz to external services using packages install-able via pip or conda because, these connections often have a lot of dependencies that you don't necessarily want in your core package. I hadn't considered it for functionality that operates in the middle of the pipeline. I think this is the step that gets us in that direction.

@CJ-Wright
Copy link
Member

I think this is great! How should we document all these plugins so people can discover them?

@martindurant
Copy link
Member

If we go with entry_points as a way of declaring plugins, then we do not document them, at least not exhaustively. Intake faces this: we have a page of known plugins, which people can submit PRs for changing, but there's nothing to stop new plugins appearing which we don't know about. It would be possible to have a configuration that only loads a select number of plugins, or instead to not load the plugins except upon explicit user invocation.

@chinmaychandak
Copy link
Contributor

chinmaychandak commented Nov 9, 2020

I think this is really nice!

I think for validating the plugins, we would definitely need isinstance(plugin, Stream) (similar to Airflow which tests against one of its subclasses called PluginManager which I don't think is necessary at this point).

Also, for testing an example plugin, having a separate repository/page called streamz-plugins could be an option? I think that's what Airflow does, too. The from_iterable node you wrote can be a good plugin for testing.

As for tests for future plugins, I think if we consider adding a plugin to setup.py in streamz, then we would need to verify the plugin behavior separately beforehand; if, however, we allow the plugin to be loaded lazily (only when the user explicitly invokes them, either through CLI or otherwise), then I think we don't need to worry about having tests in streamz itself (the plugin repo can have those tests, I mean). I'm just speaking my mind here, I may be missing something!

[EDIT] I see Martin beat me to it. :)

@roveo
Copy link
Contributor Author

roveo commented Nov 9, 2020

Also, for testing an example plugin, having a separate repository/page called streamz-plugins could be an option?

I think this is the way to go. Then I'll add tests, so that they are not bound to the repo in my account.

As for tests for future plugins, I think if we consider adding a plugin to setup.py in streamz, then we would need to verify the plugin behavior separately beforehand

So, plugins in streamz-plugins repo that are tested and "verified" can be provided via extras_require in setup.py, but nothing is stopping anyone from creating their own plugin package and installing it with pip.

@chinmaychandak
Copy link
Contributor

This seems reasonable to me, although I would defer to Martin and CJ to make the final call whether or not this is the best way to proceed.

@chinmaychandak
Copy link
Contributor

I see Travis builds timing out. Is this due to it being overloaded? Should we try something like install: travis_wait 30 mvn install temporarily?

@martindurant
Copy link
Member

We need to transfer our tests away from travis...

@martindurant
Copy link
Member

Yes, we should do whatever it takes to get travis to pass for now...

I am only now looking at the actual code. I wonder, is there any appetite for making this system lazy? As it is, streamz will import all packages that claim to have relevant plugins - but the node classes still need to be annotated with decorators as before.

I could imagine entry points spelled like:

streamz.plugins: [
  "transform=streamz_addon.module.TransformNode"
]

which adds "transform" to the dir() of Stream, but only imports and registers when the .transform is first used (i.e., via a __getattr__).

Furthermore, we may want to be more specific with our entrypoint group naming:

  • streamz.node
  • streamz.source
  • streamz.sink
  • ...

What do you think?

btw: does this require the package entrypoints as a dependency?

@CJ-Wright
Copy link
Member

I like having them lazy, since that allows us to not need the plugins to be installed until we actually need them

@CJ-Wright
Copy link
Member

@martindurant would it be worthwhile to setup the CI using conda-forge's CI provisioning system? That way if we need to move to something we just rerender?

@martindurant
Copy link
Member

Sorry @CJ-Wright , you're the expert there, I really don't know what's involved.

@CJ-Wright
Copy link
Member

I'll put up a PR if we're interested, but the rough outline is here: https://conda-forge.org/docs/user/ci-skeleton.html

@martindurant
Copy link
Member

Does it work for github actions CI?

@CJ-Wright
Copy link
Member

Not yet, since CF doesn't use GH actions as a CI (yet). The hope of this approach would be that we don't need to care about what particular CI we are using, since we could rerender and the CIs could move.

We could add GH actions to CF and then it would support that CI.

@martindurant
Copy link
Member

In that case I am hesitant, since the dask repos are all going to github. I don't have a circle or azure account myself - is it all free like Travis used to be?

@CJ-Wright
Copy link
Member

Sorry, can we move the CI conversation to an issue? I don't want to derail things here. #382

@roveo
Copy link
Contributor Author

roveo commented Nov 10, 2020

As it is, streamz will import all packages that claim to have relevant plugins - but the node classes still need to be annotated with decorators as before.

Do you mean that with lazy loading they won't have to be annotated?

adds "transform" to the dir() of Stream, but only imports and registers when the .transform is first used (i.e., via a getattr)

This can be done with a @property that looks it up in a class attribute dict (e.g. Stream._plugins) and loads the entrypoint if it's not already loaded.

Furthermore, we may want to be more specific with our entrypoint group naming

Do these types of nodes require different treatment when being loaded?

btw: does this require the package entrypoints as a dependency?

I don't think so, this is built into setuptools. entrypoints looks like a different package (with the last 2 releases in 2017 and 2019).

@martindurant
Copy link
Member

Do you mean that with lazy loading they won't have to be annotated?

Indeed, this would be an alternative to the annotation; although doing both should be allowed for backward compatibility.

This can be done with a @Property that looks it up in a class attribute dict (e.g. Stream._plugins) and loads the entrypoint if it's not already loaded.

Something like that, yes. There are many ways to achieve it. You would call register_api on the class once it's imported, so it gets added to the namespace and you don't need to import it again.

Do these types of nodes require different treatment when being loaded?

maybe. This gives us more flexibility for the future. We could argue that, for the time being, all possible plugins as "nodes" (i.e., they appear as attributes of Stream). My motivation here is, that in Intake we started off having things called "plugins", but then made other parts of the library pluggable and had to, painfully, rename the original plugins to "drivers".

I don't think so, this is built into setuptools.

Better check that setuptools has this functionality for all python versions of interest. Also, setuptools had better be included in the deps (yes, I know everyone already has it).

@roveo
Copy link
Contributor Author

roveo commented Nov 10, 2020

You would call register_api on the class once it's imported

Then register_api would have to allow us to specify which attribute the class will be registered at. Right now it's setattr(cls, func.__name__, modifier(wrapped)), so it's always the class name.

Better check that setuptools has this functionality for all python versions of interest. Also, setuptools had better be included in the deps (yes, I know everyone already has it).

OK, will do.

@martindurant
Copy link
Member

Correct. I imagine we'll have redundancy for now "name=package.module.name", but there might be a case for when we want the class name and method name to be different, which is not yet possible.

Actually, there is a difference between normal nodes and sources: the latter are registered as staticmethod.

@chinmaychandak
Copy link
Contributor

chinmaychandak commented Nov 11, 2020

@roveo, would you be interested to try something like install: travis_wait 30 mvn install in travis.yml, until we can switch from Travis to GitHub Actions / CF? I can't guarantee it will work (because the issue is due to Travis ending free builds), but I think it's worth a shot.

@roveo
Copy link
Contributor Author

roveo commented Nov 11, 2020

Better check that setuptools has this functionality for all python versions of interest.

Setuptools supported entry_points since 0.6a1 (year 2009), so I don't think this should be a problem.

@codecov
Copy link

codecov bot commented Nov 11, 2020

Codecov Report

Merging #380 (6e39c65) into master (dffe408) will increase coverage by 0.07%.
The diff coverage is 86.95%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #380      +/-   ##
==========================================
+ Coverage   95.77%   95.84%   +0.07%     
==========================================
  Files          16       17       +1     
  Lines        2508     2529      +21     
==========================================
+ Hits         2402     2424      +22     
+ Misses        106      105       -1     
Impacted Files Coverage Δ
streamz/dataframe/core.py 93.39% <ø> (+0.91%) ⬆️
streamz/plugins.py 62.50% <62.50%> (ø)
streamz/__init__.py 81.81% <100.00%> (+4.04%) ⬆️
streamz/core.py 95.81% <100.00%> (+0.05%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e9a2545...fe0f9d9. Read the comment docs.

@roveo roveo mentioned this pull request Nov 11, 2020
@roveo
Copy link
Contributor Author

roveo commented Nov 12, 2020

Added Plugins page to docs (I'm not a native speaker, so edits and additions are very welcome).

Note that there's a table commented out in Known plugins section. I think at this point we should think about which existing functionality should go into extras (the leading candidate being kafka).

Copy link
Member

@martindurant martindurant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like this PR

streamz/core.py Show resolved Hide resolved
streamz/tests/test_plugins.py Show resolved Hide resolved

Stream.register_plugin_entry_point(entry_point)

assert Stream.test_double.__name__ == "stub"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I'm not sure whether the method should have been overwritten - presumably Stream.test_double already existed before the call to register_plugin_entry_point.

In normal usage, the stubs always come first, since they happen at import time, and streamz should always be imported before derived classes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I guess this test actually doesn't test anything. I wanted to test that running Stream.register_api twice on the same class doesn't break anything, in case the implementation of register_api changes. We can just check for this in register_plugin_entry_point.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test is OK, but not comprehensive. The cases are:

  • register_api, then register entrypoint
  • register entrypoint then register_api
  • either of the to methods called twice

In every case, calling the method should result in the same test class.

We could check for whether the name is being overwritten, but I don't think that's essential.

Copy link
Contributor Author

@roveo roveo Nov 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In every case, calling the method should result in the same test class.

The class is stored in Stream.method.__wrapped__, but I don't see how anything else but the test class can end up there in any of these cases.

Btw, what should we do if a plugin wants to override a built-in method? On the one hand, this would allow people to create useful extensions of built-in functionality (my version of partition could be provided as a plugin long before it's merged into core), on the other — it could break things and lead to confusion.

Copy link
Member

@CJ-Wright CJ-Wright Nov 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I would have the system balk at a built-in override. My hope would be that names are cheap, so adding more text onto a would be partition describing how it is different would add less burden than issues around which partition was used. This could get particularly thorny as envs get updated, code that previously worked and relied on the built-in are now producing errors in subtle ways that could be difficult to find.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean we should have a clobber=False argument for the register methods?
I wonder in that case, if there is a sane way to tell when an entrypoint and register_api refer to the same thing, which would not be a problem even without clobber.

Copy link
Contributor Author

@roveo roveo Nov 12, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder in that case, if there is a sane way to tell when an entrypoint and register_api refer to the same thing, which would not be a problem even without clobber.

We can just check that hasattr(Stream, entry_point.name). But it would have to happen during plugin load (so that we're not overwriting built-in methods with stubs) and so importing streamz with a bad plugin would result in an error. Or it could be a warning that we skipped an entrypoint because of name collision.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but this will fail when someone adds the entrypoint to a class that already has register_api, no?

Note @CJ-Wright , that the current implementation does allow you to use register_api to overwrite methods at will. Of course this is python, so we can't ever disable someone who wants to do that anyway.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fair. To some extent this boils down to locality for me. If the author of the code is the user of the env then if they override things then they are in a better place to clean up. When installing 3rd party things that state may not hold as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the author of the code is the user of the env then if they override things then they are in a better place to clean up. When installing 3rd party things that state may not hold as well.

I agree. With register_api things are explicit and visible. A plugin is a black box.

@martindurant please look at the last commit to see what I mean

Copy link
Member

@martindurant martindurant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy with this. @CJ-Wright ?

cls.register_api(
modifier=modifier, attribute_name=entry_point.name
)(node)
if getattr(cls, entry_point.name).__name__ == "stub":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this can be False, but it doesn't hurt to check.

Copy link
Contributor Author

@roveo roveo Nov 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is false when the plugin class returned from entry_point.load() is decorated with @Stream.register_api. Then it is registered right away when loaded, and so at the moment of this check stub has already been overwritten with the actual class.

Copy link
Member

@CJ-Wright CJ-Wright left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! One thing to consider is scraping the conda-forge packages/data for packages who declare the streamz plugin, if we are interested in building a plugin registry.

@martindurant
Copy link
Member

Right, but there aren't any yet :)
I actually would expect any plugin developers to contact us, if they wanted to be listed on the streamz documentation somewhere.

@martindurant martindurant merged commit dca7370 into python-streamz:master Nov 13, 2020
@martindurant
Copy link
Member

@chinmaychandak : that kafka test is failing too often! (and giving an incorrect message, due to using positional rather than keyword arguments to wait)

@CJ-Wright
Copy link
Member

It was more of a general idea, since other plugin mediated systems could use that approach as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants