We currently have three formal levels of scoping, defined by two boundaries:
The boundary between 2 and 3 is the Python convention of leading underscores: if a module/function/class/method is not hidden behind an underscore at some level, we can assume that external users or developers will start using it and depending on it. If we want to change such a thing, we can either declare the old behavior a bug (never intended) or go through a deprecation cycle.

The boundary between 1 and 2 is less sharp:

It's important for the boundary between 2 and 3 to be enforced and based on a rule that we don't make up: we can't assume that anybody reads a warning we put in the documentation before they start depending on our public API. Even if they're conscientious and intend to read all the disclaimers, discovery is hard and they might not find it. We can assume that developers are aware of Python conventions (because otherwise, what can we assume?). The boundary between 1 and 2 is less important; it's more for organizing user experience, so that we can give powerful tools to developers and also not overwhelm interactive users.

I agree that it would be useful to go further and subdivide level 3. Encapsulation is useful for programming: it keeps a codebase from becoming spaghetti. Both are important, but encapsulation within a codebase is qualitatively different from encapsulation between codebases maintained by people who might not know each other. Within a codebase, we can create a convention and follow it, and even if we get PRs from new contributors, they'll be reviewed by people who know the conventions and we can make sure that the code adheres to them before it gets merged.

Dependencies between codebases, however, are much stickier: we don't review how others use our code, or even know about it (though I'm trying to address that with GitHub searches). They might end up building huge castles on it that would be hard to fix if we change anything in our API.
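To make the cost of changing a public name concrete, here is a minimal sketch of the kind of deprecation cycle mentioned above, using only the standard library's `warnings` module. The names `twice`, `double`, and `_new_impl` are hypothetical and are not part of Awkward; this is an illustration of the general Python pattern, not of how Awkward handles deprecations.

```python
import warnings


def _new_impl(x):
    # Hypothetical private helper: the leading underscore signals
    # "not part of the public API", so it can change without notice.
    return x * 2


def double(x):
    # Hypothetical new public name.
    return _new_impl(x)


def twice(x):
    # Hypothetical old public name, kept working for a deprecation cycle
    # because downstream code may already depend on it.
    warnings.warn(
        "twice is deprecated; use double instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return double(x)
```

The private helper can be renamed or removed at any time, whereas the public name has to keep working (with a warning) until downstream code has had a chance to migrate.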
And even when a fix is possible, version X of Awkward wouldn't work with version Y of their package (where Y is before the fix).

So yes, we should have good encapsulation between the organelles of our codebase, but it is more important to maintain good encapsulation between cells, that is, between different codebases/packages. The consequences of those mistakes are more dire. Internally refactoring our own codebase may be difficult, but we don't need to get anyone's permission to do it; we can do it at any time. Refactoring among codebases/packages requires coordination among people who might not even be in communication (yet), and we might have conflicting requirements that can't even be satisfied without someone giving up on their goals. The cell membranes are a different order of thing than the organelle boundaries because we don't know what other cells we're going to meet in our travels; we always know what organelles we have internally.

(I've rambled; sorry!)

About organizing within our codebase, breaking up level 3 into multiple levels, we already use some encapsulation practices. By convention (so far, and only 95% followed), a
Also (about 95% followed), all data attributes are private (start with an underscore) with read-only public properties. Those public properties (no underscore) are part of our guaranteed API (levels 1 & 2). I don't think we have any private properties, since we also prefer plain argument lists for our internal functions, without defaults, as we would have in the public API. A property is just a method without "

With the above rules, a deep call into some object's content of content of content would look like this:

    self._content.content.content  # only the first gets an underscore

and you see that pattern quite a bit.

One thing that we use a lot that violates the normal Python notion of public/private is that our

    def ak_whatever_impl(array, *args):
        layout = ak.operations.ak_to_layout._impl(array, ...)
        out = layout._start_recursion(*args)
        return ak._util.wrap(out, highlevel, behavior)

Python's standard convention has only one way of making a public-private distinction (the underscore), and we've already used it up in making that more-important distinction: communicating "do not use" to users and downstream developers. We'll need another marker if we're going to make a distinction between Perhaps they could be free functions in a hidden submodule? We have no qualms about calling functions in hidden submodules such as

It's probably clear from the above that I disagree with solution 1, "Don't hide things from external developers." Whatever we do with our internal organelles, the cell membranes are more important and must be maintained. You also suggested listing the public API in documentation, but we have to assume that users and downstream developers, even in good faith, won't find the relevant documentation. However we signify the boundary between "level 2" and "level 3," it has to be based on rules that are familiar to the whole Python community, not a rule we make up and post somewhere. The only public/private rule the Python community has is the underscore rule.

As for solution 2, "Use a custom naming convention," yes, it would have to be something we impose on ourselves like this, not something we impose on the world, but we have more options than just names. The hidden submodule is a thing that we can do that has more structure than a name. But even if we go with names, how about reversing the default, so that "level 4 private-to-class" are just underscores but "level 3 private-to-codebase" have special prefixes: underscore + some letter? Like "

    class BitMaskedArray(Content):
        def _something(self, ...):
            ...
        def _a_something(self, ...):
            ...

where

Let's not go crazy introducing many levels of privateness, but I see the value of having a level 3/level 4 distinction. Incidentally, Scala has this concept of private-to-class, private-to-module/namespace, and one of those nested namespaces is your whole codebase, so we're not the first to need something like this. (Though, of course, Scala actually enforces that encapsulation at compile-time, but we live in Python. Maybe, if we're happy with whatever rule we come up with, we can enforce it as a flake8 extension. There's already a placeholder for that, enforcing the AK001 exception-raising rule.)
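As a rough idea of what automated enforcement might involve, here is a minimal sketch built on the standard library's `ast` module. It is not a flake8 plugin and not the existing AK001 check; the function name `find_foreign_private_access` and the rule it applies (flag any single-underscore attribute accessed on something other than `self` or `cls`) are assumptions made purely for illustration.

```python
import ast


def find_foreign_private_access(source):
    """Return sorted (line, attribute) pairs where a single-underscore
    attribute is accessed on something other than `self` or `cls`."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Attribute):
            continue
        name = node.attr
        # Single leading underscore only; dunder names are left alone.
        is_private = name.startswith("_") and not name.startswith("__")
        base = node.value
        on_self = isinstance(base, ast.Name) and base.id in ("self", "cls")
        if is_private and not on_self:
            hits.append((node.lineno, name))
    return sorted(hits)


# Hypothetical input: one private-to-class access, one cross-object access.
example = """
class A:
    def f(self, other):
        return self._x + other._y + other.z
"""
```

Calling `find_foreign_private_access(example)` reports the `other._y` access and ignores `self._x` and `other.z`. A real rule would have to exempt whatever marker ends up designating private-to-codebase names, and would also need a policy for hidden submodules, since this sketch flags those accesses too.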
This discussion is inspired by #1968, but more generally is something I've been thinking about for a while.
Disclaimer: I have a strong opinion here, but I don't think there's necessarily a clear objective solution, so whatever comes out of this discussion is fine-by-me™
The Problem
Our current API strategy for the `ak.contents.Content` objects sees us breaking convention, and using private attributes of other classes, e.g. `layout._backend` or `layout._getitem_next()`. This brings with it some problems:

- `@property` is not conventionally added to private methods. Therefore, either we stick with this convention and cannot use these for our private impl, or we need to break convention.

I understand this convention to be solving the following problem: we don't want external developers to rely on the same set of API methods that we use to implement Awkward. We want a smaller development burden by reducing scope.
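For readers less familiar with the pattern under discussion, here is a minimal sketch of a private data attribute exposed through a read-only public property. `ListOffsetArraySketch` and `innermost_offsets` are hypothetical stand-ins, not the real `ak.contents` classes, and the sketch assumes nothing about Awkward beyond the underscore convention itself.

```python
class ListOffsetArraySketch:
    """Hypothetical stand-in, not the real ak.contents.ListOffsetArray."""

    def __init__(self, offsets, content):
        self._offsets = offsets  # private data attribute
        self._content = content  # private data attribute

    @property
    def offsets(self):
        # Public and read-only: no setter is defined, so assignment fails.
        return self._offsets

    @property
    def content(self):
        return self._content


def innermost_offsets(outer):
    # Code outside the class reads through the public properties only.
    return outer.content.offsets


inner = ListOffsetArraySketch([0, 2, 3], content=None)
outer = ListOffsetArraySketch([0, 1, 2], content=inner)
```

Here `outer.offsets = ...` raises `AttributeError`, while other code can still read the value. The tension described above arises when a method or property is meant to be called by sibling classes but not by external developers: the underscore is the only marker available for both meanings.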
I could write a lot of text here, but the short version is that it feels like we're compromising our internal API for the sake of an external one. I do feel like the public-private conventions outlined above have value, both for readability and coupling. Whilst we have highly-coupled layouts (all layouts need to maintain some awareness of the others), this dependence is increasingly limited to a number of well-defined interfaces such as `simplified` and `starts`, `stops`, etc. Although coupling is usually bad from a "can I introduce a new type here" perspective, it's also bad from a "how hard is it to change design assumptions" perspective. We can't avoid the former (it's a design decision, and there's really no other way to do this in OOP), but we can address the latter.

Usually, one doesn't have to make this compromise; either anything public can be called by users, or a private API is distinguished from the public interface (e.g. our `NumpyLike` object, if it weren't exposed via `backend.nplike`). We can't easily do the latter, because the layout objects are a fundamental type, and there's an overwhelming disadvantage to wrapping/unwrapping them in some other protected API for the sake of reducing API scope. We have elected not to do the former, for reasons that I do think have merit.
The Solution(s)?
Don't hide things from external developers
If it were exclusively my decision for a personal project, I'd favour the internal API following the "usual" conventions, and use e.g. documentation to make it clear to external developers that they should only use certain API methods. In my experience, developers will ignore a `_` prefix if the method works; they'll just more readily accept a breakage down the line. It would be nice if we could e.g. move all of the `_reduce_next` functions to `reduce_next_`, given that they're called by other classes. Whereas, `ListOffsetArray._offsets` should not be writable by other classes.
Use a custom naming convention
I think the outstanding argument against "Don't hide things from external developers" is the one outlined above: we don't want to maintain the entire layout API for N versions without strong motivation. Therefore, I wonder if we could consider defining our own "private-public" convention. One option is to explicitly enumerate the public methods in the documentation. This is the most elegant solution from a code perspective, but I suspect a valid criticism is that some developers will never open the documentation, and it's unreasonable to define that as "wrong". Thus, my compromise would be to define a naming convention for internal "public" attributes of layout classes.
I don't love this - it's slightly ugly. But, it makes it clear that these are public methods of the layout, just "internal" to Awkward.