Distinguishing public and public-internal APIs #1969

agoose77 · 2022-12-07T11:42:28Z

agoose77
Dec 7, 2022
Maintainer

This discussion is inspired by #1968, but more generally is something I've been thinking about for a while.

Disclaimer: I have a strong opinion here, but I don't think there's necessarily a clear objective solution, so whatever comes out of this discussion is fine-by-me™

The Problem

Our current API strategy for the ak.contents.Content objects sees us breaking convention, and using private attributes of other classes, e.g. layout._backend or layout._getitem_next(). This brings with it some problems:

- Conventionally, private attributes are not descriptors, e.g. @property is not conventionally added to private methods. Therefore, either we stick with this convention and cannot use descriptors in our private implementation, or we need to break convention.
- "Actually private" state is harder to see. It's hard to scope APIs when we have no clear scoping rules. We would need to resort to name mangling if we wanted to make this clear, and then add "private" descriptors, which incurs the above problem.
- We use private attributes as fast-paths (i.e. to avoid the public property), which means private is not private.
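
To make the fast-path point concrete, here is a minimal sketch. ToyLayout and both helper functions are hypothetical stand-ins invented for illustration, not the real ak.contents.Content API.

```python
class ToyLayout:
    """Toy stand-in for a layout class; not the real ak.contents.Content."""

    def __init__(self, backend):
        self._backend = backend  # "private" by Python convention

    @property
    def backend(self):
        # Public, read-only view of the same state.
        return self._backend


def same_backend_public(a, b):
    # Respects the convention: only public names of other objects are touched.
    return a.backend == b.backend


def same_backend_fast_path(a, b):
    # Skips the property call, but reaches into another object's private
    # state, so the underscore no longer signals "nobody else touches this".
    return a._backend == b._backend


x, y = ToyLayout("cpu"), ToyLayout("cpu")
print(same_backend_public(x, y), same_backend_fast_path(x, y))
```

Both functions give the same answer; the difference is only in which names a reader has to treat as part of the class's contract.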

I understand this convention to be solving the following problem: we don't want external developers to rely on the same set of API methods that we use to implement Awkward. We want a smaller development burden by reducing scope.

I could write a lot of text here, but the short version is that it feels like we're compromising our internal API for the sake of an external one. I do feel like the public-private conventions outlined above have value, both for readability and coupling. Whilst we have highly-coupled layouts (all layouts need to maintain some awareness of the others), this dependence is increasingly limited to a number of well-defined interfaces such as simplified and starts, stops, etc. Although coupling is usually bad from a "can I introduce a new type here" perspective, it's also bad from a "how hard is it to change design assumptions" perspective. We can't avoid the former (it's a design decision, and there's really no other way to do this in OOP), but we can address the latter.

Usually, one doesn't have to make this compromise; either anything public can be called by users, or a private API is distinguished from the public interface (e.g. our NumpyLike object, if it weren't exposed via backend.nplike). We can't easily do the latter, because the layout objects are a fundamental type, and there's an overwhelming disadvantage to wrapping/unwrapping them in some other protected API for the sake of reducing API scope. We have elected not to do the former, for reasons that I do think have merit.

The Solution(s)?

Don't hide things from external developers

If it were exclusively my decision for a personal project, I'd favour the internal API following the "usual" conventions, and use e.g. documentation to make it clear to external developers that they should only use certain API methods. In my experience, developers will ignore a _ prefix if the method works; they'll just more readily accept a breakage down the line. It would be nice if we could e.g. move all of the _reduce_next functions to reduce_next_, given that they're called by other classes. Whereas, ListOffsetArray._offsets should not be writable by other classes.

Use a custom naming convention

I think the outstanding argument against "Don't hide things from external developers" is the one outlined above: we don't want to maintain the entire layout API for N versions without strong motivation. Therefore, I wonder if we could consider defining our own "private-public" convention. One option is to explicitly enumerate the public methods in the documentation. This is the most elegant solution from a code perspective, but I suspect a valid criticism is that some developers will never open the documentation, and it's unreasonable to define that as "wrong". Thus, my compromise would be to define a naming convention for internal "public" attributes of layout classes.

class ListOffsetArray:

    @property
    def offsets(self):
        return self._offsets


    def internal_reduce_next(self):
        ...

    def internal_getitem_at(self):
        ...

I don't love this - it's slightly ugly. But, it makes it clear that these are public methods of the layout, just "internal" to Awkward.

jpivarski · 2022-12-07T17:52:18Z

jpivarski
Dec 7, 2022
Maintainer

We currently have three formal levels of scoping, defined by two boundaries:

1. public, high-level: for data analysts and interactive users
2. public, low-level: for downstream developers
3. internal: for our own use

The boundary between 2 and 3 is the Python convention of leading underscores: if a module/function/class/method is not hidden behind an underscore at some level, we can assume that external users or developers will start using it and depending on it. If we want to change such a thing, we can either declare the old behavior a bug (never intended) or go through a deprecation cycle.
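
As a rough illustration of what one step of a deprecation cycle can look like in Python, the sketch below keeps an old public name working while warning callers to migrate. The names old_name and new_name are hypothetical and do not correspond to anything in Awkward Array.

```python
import warnings


def new_name(x):
    # The replacement API that callers should migrate to.
    return x + 1


def old_name(x):
    # Hypothetical deprecated alias: still works, but warns the caller.
    warnings.warn(
        "old_name is deprecated; use new_name instead",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller's line
    )
    return new_name(x)
```

Nothing like this is needed for underscored names, which is the practical benefit of keeping them outside the guaranteed API.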

The boundary between 1 and 2 is less sharp: ak.Array, ak.Record, ak.ArrayBuilder, the ak.* functions, slicing, and ufuncs are high-level; anything behind the layout property and names in submodules (not the top ak namespace) are low-level but still public.

It's important for the boundary between 2 and 3 to be enforced and based on a rule that we don't make up: we can't assume that anybody reads a warning we put in the documentation before they start depending on our public API. Even if they're conscientious and intend to read all the disclaimers, discovery is hard and they might not find it. We can assume that developers are aware of Python conventions (because otherwise, what can we assume?).

The boundary between 1 and 2 is less important; it's more for organizing user experience, so that we can give powerful tools to developers and also not overwhelm interactive users.

I agree that it would be useful to go further and subdivide level 3. Encapsulation is useful for programming—it keeps a codebase from becoming spaghetti. Both are important, but encapsulation within a codebase is qualitatively different from encapsulation between codebases maintained by people who might not know each other. Within a codebase, we can create a convention and follow it, and even if we get PRs from new contributors, they'll be reviewed by people who know the conventions and we can make sure that the code adheres to them before it gets merged.

Dependencies between codebases, however, are much stickier: we don't review how others use our code, or even know about it (though I'm trying to address that with GitHub searches). They might end up building huge castles on it that would be hard to fix if we change anything in our API. And even when a fix is possible, version X of Awkward wouldn't work with version Y of their package (where Y is before the fix).

So yes, we should have good encapsulation between the organelles of our codebase, but it is more important to maintain good encapsulation between cells—different codebases/packages. The consequences of those mistakes are more dire. Internally refactoring our own codebase may be difficult, but we don't need to get anyone's permission to do it; we can do it at any time. Refactoring among codebases/packages requires coordination among people who might not even be in communication (yet), and we might have conflicting requirements that can't even be satisfied without someone giving up on their goals. The cell membranes are a different order of thing than the organelle boundaries because we don't know what other cells we're going to meet in our travels; we always know what organelles we have internally.

(I've rambled; sorry!)

About organizing within our codebase, breaking up level 3 into multiple levels, we already use some encapsulation practices. By convention (so far, and only 95% followed), a Content instance's method can only access a Content instance's attributes if

- it is the same object: self._whatever
- it is another object of the same class, after verifying that that is so: if isinstance(other, MyClass): other._whatever (this will make it hard to change Content attributes, but we don't plan to)
- a recursive function X is passing down calls to more X: def _something(self, *args): self._content._something(*args)
- the attribute is common to all Content subclasses: other._parameters, other._backend
- similarly for Form subclasses and Type subclasses, within their inheritance trees
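
The access rules listed above can be sketched with toy classes. Everything here (Content, ListLike, Leaf, and their methods) is a simplified, hypothetical model written for illustration; the real Content hierarchy is far richer.

```python
class Content:
    """Toy base class, not the real ak.contents.Content."""

    def __init__(self, parameters=None):
        self._parameters = parameters or {}

    @property
    def parameters(self):
        return self._parameters


class Leaf(Content):
    def _depth(self):
        return 1


class ListLike(Content):
    def __init__(self, offsets, content, parameters=None):
        super().__init__(parameters)
        self._offsets = offsets
        self._content = content

    def _same_offsets(self, other):
        # Same class, verified first: private access to `other` is allowed.
        if isinstance(other, ListLike):
            return self._offsets == other._offsets
        return False

    def _depth(self):
        # Recursive method passing the call down to the same method.
        return 1 + self._content._depth()

    def _same_parameters(self, other):
        # `_parameters` is common to every Content subclass.
        return self._parameters == other._parameters


nested = ListLike([0, 2], ListLike([0, 1, 2], Leaf()))
print(nested._depth())
```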

Also (about 95% followed), all data attributes are private (start with an underscore) with read-only public properties. Those public properties (no underscore) are part of our guaranteed API (levels 1 & 2).

I don't think we have any private properties, since we also prefer plain argument lists for our internal functions, without defaults, as we would have in the public API. A property is just a method without "()" in its call syntax; there's no point in syntactic sugar for internal functions: self._something() is as easy as self._something and it's easier for us to maintain if all internal calls have the same form.

With the above rules, a deep call into some object's content of content of content would look like this:

self._content.content.content   # only the first gets an underscore

and you see that pattern quite a bit.

One thing that we use a lot that violates the normal Python notion of public/private is that our ak.* implementations, which are not part of the Content hierarchy at all, call private Content methods:

def ak_whatever_impl(array, *args):
    layout = ak.operations.ak_to_layout._impl(array, ...)

    out = layout._start_recursion(*args)

    return ak._util.wrap(out, highlevel, behavior)

The _start_recursion method is private/underscored/hidden so that it's not part of our guaranteed API (levels 1 & 2) and thus we can change it at any time. It is part of our internal protocol: every Content subclass is guaranteed to have this method, and that is a contract with ourselves, not with the outside world. We can change our internal contract in a PR, including one submitted by an outside contributor, and not have to do a deprecation cycle to inform users and downstream developers about the change.

Python's standard convention has only one way of making a public-private distinction (the underscore), and we've already used it up in making that more-important distinction: communicating "do not use" to users and downstream developers. We'll need another marker if we're going to make a distinction between Content methods that ak.* functions can call and Content methods that only the Content can call on itself. Double-underscore is not what we want because it has a "private vs protected" meaning within Python.
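
For reference, this is what the double-underscore form does in practice: Python rewrites the name per class (name mangling), so it scopes an attribute to one class in the hierarchy rather than marking it "internal to the codebase". The classes below are purely illustrative.

```python
class Base:
    def __init__(self):
        self.__token = "base"  # stored on the instance as _Base__token


class Child(Base):
    def __init__(self):
        super().__init__()
        self.__token = "child"  # stored as _Child__token, no clash with Base


c = Child()
print(sorted(vars(c)))  # ['_Base__token', '_Child__token']
```

Code outside the class body that writes c.__token gets no mangling at all, so the attribute is simply not found under that name.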

Perhaps they could be free functions in a hidden submodule? We have no qualms about calling functions in hidden submodules such as _util, _broadcasting, etc. The usual pattern is one function that starts the recursion—setting everything up, not reentrant, followed by Content._whatever that can call any other Content._whatever until it reaches the leaves of the tree. The distinction between _start_whatever and _whatever is currently made only in the names; perhaps the _start_whatevers could be functions in a hidden submodule and the _whatevers could remain methods. That would introduce a useful distinction in addition to the "level 3 internal" vs "level 4 internal".
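
A minimal sketch of that split, assuming toy classes and a made-up operation: start_count_leaves stands in for a _start_whatever that would live in some hidden submodule, while _count_leaves stays a method that recurses through the tree. None of these names exist in Awkward Array.

```python
class Content:
    def _count_leaves(self):
        raise NotImplementedError


class Leaf(Content):
    def __init__(self, data):
        self._data = data

    def _count_leaves(self):
        # Bottom of the recursion.
        return len(self._data)


class Wrapper(Content):
    def __init__(self, content):
        self._content = content

    def _count_leaves(self):
        # Reentrant step: passes the call down to the nested content.
        return self._content._count_leaves()


# In a real package this function would sit in a hidden submodule
# (hypothetical), separate from the Content classes themselves.
def start_count_leaves(layout):
    # Non-reentrant entry point: validate once, then start the recursion.
    if not isinstance(layout, Content):
        raise TypeError("expected a Content instance")
    return layout._count_leaves()


print(start_count_leaves(Wrapper(Wrapper(Leaf([1, 2, 3])))))  # 3
```

The entry point still calls an underscored method on another object, so this separates "starts the recursion" from "is the recursion" without removing cross-object private access.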

It's probably clear from the above that I disagree with solution 1, "Don't hide things from external developers." Whatever we do with our internal organelles, the cell membranes are more important and must be maintained. You also suggested listing the public API in documentation, but we have to assume that users and downstream developers—even in good faith—won't find the relevant documentation. However we signify the boundary between "level 2" and "level 3," it has to be based on rules that are familiar to the whole Python community, not a rule we make up and post somewhere. The only public/private rule the Python community has is the underscore rule.

As for solution 2, "Use a custom naming convention," yes, it would have to be something we impose on ourselves like this, not something we impose on the world, but we have more options than just names. The hidden submodule is a thing that we can do that has more structure than a name.

But even if we go with names, how about reversing the default, so that "level 4 private-to-class" are just underscores but "level 3 private-to-codebase" have special prefixes: underscore + some letter? Like "_a_" for Awkward Array?

class BitMaskedArray(Content):
    def _something(self, *args):
        ...

    def _a_something(self, *args):
        ...

where _something is "level 4 private-to-class" and can only be used by the class, while _a_something is "level 3 private-to-codebase" and can be used anywhere in Awkward Array, but not outside of Awkward Array. Outside users and developers will know to avoid it because it starts with an underscore.

Let's not go crazy introducing many levels of privateness, but I see the value of having a level 3/level 4 distinction. Incidentally, Scala has this concept of private-to-class, private-to-module/namespace, and one of those nested namespaces is your whole codebase, so we're not the first to need something like this. (Though, of course, Scala actually enforces that encapsulation at compile-time, but we live in Python. Maybe, if we're happy with whatever rule we come up with, we can enforce it as a flake8 extension. There's already a placeholder for that, enforcing the AK001 exception-raising rule.)


agoose77 Dec 8, 2022
Maintainer Author

Long reply, thank you for your thoughtful response.

One of my concerns here is that doing nothing compromises our ability to encapsulate, but also that any solution we come up with might also do that.

With that in mind, you've provided a comprehensive solution in #1972. This is particularly useful in that you enumerate the near-final public API for our content classes (and caught some bugs along the way).

Introducing a formal convention for a restricted set of _ prefixed names moves the burden of remembering these onto us, the developers. The benefit of the Python convention is that seeing XXX._YYY is always a mistake unless XXX is self or YYY is a magic method. My reservation with making this our solution is that now we have no visual discriminator of what is correct and what is encapsulation-violating.
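
The "XXX._YYY is a mistake unless XXX is self or YYY is a magic method" rule is simple enough to check mechanically. Below is a rough sketch using the standard ast module; find_private_access is a hypothetical helper, not an existing flake8 plugin, and it deliberately ignores the finer exceptions discussed earlier (it would also flag chains such as self._content._x).

```python
import ast


def find_private_access(source):
    """Return (line, expression) pairs for `X._name` where X is not self/cls."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Attribute):
            continue
        name = node.attr
        # Underscored, but not a magic method such as __len__.
        is_private = name.startswith("_") and not (
            name.startswith("__") and name.endswith("__")
        )
        on_self = isinstance(node.value, ast.Name) and node.value.id in ("self", "cls")
        if is_private and not on_self:
            hits.append((node.lineno, ast.unparse(node)))  # Python 3.9+
    return sorted(hits)


example = '''
class A:
    def f(self, other):
        return self._x + other._x + other.__len__()
'''
print(find_private_access(example))  # [(4, 'other._x')]
```

Once a restricted set of underscored names becomes legal on other objects, a checker like this needs an allow-list, which is the loss of a purely visual discriminator described above.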

Memorising L3 rules also works better with L3 code that is a protocol, e.g. _getitem_next, than something that is unique to a particular class/set of classes, e.g. if we have isinstance(self._content, ListOffsetArray) followed by some code that expects a public member of ListOffsetArray. I can't think of any examples, but I wouldn't want to discount this case up-front.

My contention is that a non-prefixed naming convention should be fairly obvious. My original suggestion was e.g.

class Content:
    ...

    def private_getitem_range(self, where):
        ...

If we don't think the word "private" is sufficient, we could also use

class Content:
    ...

    def restricted_getitem_range(self, where):
        ...

or

class Content:
    ...

    def internal_getitem_range(self, where):
        ...

If these still don't feel strong enough, what about an inverse naming convention:

class Content:
    ...

    def _pub_getitem_range(self, where):
        ...

where "_pub" denotes internal methods / attributes that non-self can call?

It does mean that we would have funky-looking descriptors:

class Content:
    ...

    @property
    def _pub_backend(self) -> ak._backends.Backend:
        ...

But I also don't hate that.

I started an example in #1975

> there's no point in syntactic sugar for internal functions

I don't find myself agreeing with this point - I think syntactic sugar is nearly always an improvement. Properties make it possible to replace a trivial attribute with a computed value on a per-object basis, for example. Again, we do less of that because we don't accept arbitrary objects in many places of the layout handling code (a new place would be the backend system, which is powered by protocols rather than direct types).

That said, I don't think we need many L3 public properties, and in the code that we're designing this convention for, we are unlikely to need many properties; most of our existing properties are L2 public, e.g. _starts doesn't exist for some layouts, but starts does. For e.g. ListOffsetArray it computes starts from _offsets, whilst for ListArray it just returns _starts. If these properties were L3, that would be a problem that we'd want to tackle, but fortunately these are L2.
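
The starts example can be sketched with plain Python lists. The two Toy classes below are simplified stand-ins, and the stops property and list_lengths helper are additions made for illustration; only the idea that starts is derived from _offsets in one class and stored directly in the other comes from the discussion above.

```python
class ToyListOffsetArray:
    def __init__(self, offsets):
        self._offsets = list(offsets)

    @property
    def starts(self):
        # Computed: every offset except the last.
        return self._offsets[:-1]

    @property
    def stops(self):
        # Computed: every offset except the first.
        return self._offsets[1:]


class ToyListArray:
    def __init__(self, starts, stops):
        self._starts = list(starts)
        self._stops = list(stops)

    @property
    def starts(self):
        # Stored directly.
        return self._starts

    @property
    def stops(self):
        return self._stops


def list_lengths(layout):
    # Works for either class because it only touches the public properties.
    return [stop - start for start, stop in zip(layout.starts, layout.stops)]


print(list_lengths(ToyListOffsetArray([0, 2, 2, 5])))  # [2, 0, 3]
print(list_lengths(ToyListArray([0, 2, 2], [2, 2, 5])))  # [2, 0, 3]
```

Code written against _starts would work for only one of the two classes, which is why the property is the useful interface here.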

agoose77 Dec 8, 2022
Maintainer Author

Even though I'm in favour of an in-class solution that makes it clear that an attribute is truly private, or internally private, I don't like the fact that it would warrant this kind of treatment for any class that we expose to users. I'm thinking particularly of Backend here. To stop users probing the nplike mechanism, we either need to make the nplike and index_nplike attributes L3, e.g.

class NumpyBackend:
    @property
    def _pub_nplike(self) -> Numpy:
        return Numpy.instance()

    ...

or we need to change the Content API so that .backend is an opaque object, e.g. a string. Then e.g. RecordArray would need to regularise its backend argument, which is not ideal - I'd prefer for the high-level backend abstraction to remain at the high level.

agoose77 Dec 8, 2022
Maintainer Author

After our awkward-uproot meeting, we discussed the problem-at-hand, and settled upon a solution. There is no perfect solution to this, so we're settling upon Python "private" meaning L3/L4. To distinguish between L3 and L4, we will use a set of rules outlined here which will be enforced by a linter.

agoose77 Jan 31, 2023
Maintainer Author

Further ideas were noted down here: #2108
