Add hook for manifest loading #44

Open
wants to merge 9 commits into
base: main


arcanis
Contributor

@arcanis arcanis commented Oct 13, 2021

During the September meeting I mentioned that we would benefit from a way to tell Node how to load package.json files, since in our case they don't necessarily exist on disk. I think it was reasonably well received, so I'm opening this PR to see what the next step would be (cc @bmeck, who raised some points around security).

@GeoffreyBooth GeoffreyBooth changed the title from "Adds a loader for manifest loading" to "Add hook for manifest loading" Oct 20, 2021
@GeoffreyBooth
Member

I think in general the issue was that we needed to keep the total number of hooks minimal in order to make chaining work, hence the big PR to collapse what we had before to resolve/load/globalPreload. I think rather than creating more hooks to override things that happen within resolve or load, we can instead create lots of helper functions so that you can write your own resolve and pull in helpers to reuse Node code for all the logic other than the part you want to override.
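To make the helper idea concrete, here is one shape it could take. This is a sketch under heavy assumptions: Node.js exposes no such helper API, and `makeDefaultHelpers`, `readPackageJson`, `resolveExports`, and `virtualManifests` are all names invented for illustration. The hook is written as a plain function so the snippet runs on its own; in a real loader file it would be exported from an ES module.

```javascript
// Hypothetical helpers: Node.js exposes no such API; every name is invented here.
function makeDefaultHelpers() {
  return {
    // Stand-in for Node's internal package.json reader (would hit the disk).
    readPackageJson: (dir) => { throw new Error(`no package.json in ${dir}`); },
    // Stand-in for the "exports" lookup; it reuses whichever reader it is given.
    resolveExports(dir, subpath, helpers) {
      const manifest = helpers.readPackageJson(dir);
      const target = manifest.exports[subpath];
      return new URL(target, `file://${dir}/`).href;
    },
  };
}

// The loader overrides only the manifest reader and reuses everything else.
const virtualManifests = new Map([
  ['/virtual/pkg', { name: 'pkg', exports: { './hash': './index-sha512.mjs' } }],
]);
const helpers = {
  ...makeDefaultHelpers(),
  readPackageJson: (dir) => virtualManifests.get(dir),
};

function resolve(specifier, context, next) {
  if (specifier === 'pkg/hash') {
    return { url: helpers.resolveExports('/virtual/pkg', './hash', helpers) };
  }
  return next(specifier, context);
}

const resolved = resolve('pkg/hash', {}, () => { throw new Error('unreachable'); });
console.log(resolved.url); // file:///virtual/pkg/index-sha512.mjs
```

Note that every loader wanting virtual manifests would have to wire its own reader into its own copy of the helpers, so two such loaders would not see each other's overrides.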

@arcanis
Contributor Author

arcanis commented Oct 20, 2021

I think I'd need to see what this helper API would look like. I'm worried that if every loader has to reimplement the whole resolve, they'll quickly start conflicting (or, the other way around, not integrating with each other), but perhaps it'd be clearer with an example.

@cspotcode
Contributor

It sounds like we're describing sub-hooks of resolve. And yeah, if node doesn't implement the sub-hooks, then they'll need to be implemented in user-space. So we'll need some sort of a community standard for sub-hooks, and a standard sub-hooking library that's responsible for composing multiple sub-hooks into a single resolve hook for node. I imagine that will get messy.

@GeoffreyBooth
Member

I’m not saying the decision has been made; it’s just that that’s my assumption of the direction we’re going based on the last big PR and on #26. I think there are arguments for the “lots of little hooks” approach too, but we would need a way to make it work with chaining. What would it mean to chain a theoretical resolvePackageMetadata hook when there’s also a chained resolve hook? Et cetera.

@arcanis
Contributor Author

arcanis commented Nov 10, 2021

Updated this PR to be mentioned in the chaining proposals, as per #48 (comment). Should be ready for review.

@arcanis arcanis marked this pull request as ready for review November 10, 2021 09:49
@arcanis arcanis added the loaders-agenda label (Issues and PRs to discuss during the meetings of the Loaders team) Nov 12, 2021
Member

@JakobJingleheimer JakobJingleheimer left a comment

I think a bit of preamble is needed, describing the somewhat unique challenges this attempts to solve. Without it, these just seem like extra hooks.

Co-authored-by: Antoine du Hamel
Member

@JakobJingleheimer JakobJingleheimer left a comment

Great, thank you!

@GeoffreyBooth should there be an example of each of these and how they might work together in a chain? (resolve and load have them)

@arcanis
Contributor Author

arcanis commented Nov 18, 2021

should there be an example of each of these and how they might work together in a chain? (resolve and load have them)

Speaking of that, I wonder if there's some redundancy in the way the chaining doc is written. Given that all hooks share the same "pattern" within a given proposal, whether it's the middleware or the iterative approach, shouldn't the design overview list the general input/output of each hook (without taking chaining into account), and the chaining docs describe how hooks are composed on a generic level?

In its current state, it feels like there aren't many significant differences between the resolve and load sections of the chaining proposals, except for their input/output (which are roughly the same in both proposals).

@JakobJingleheimer
Member

I think the reason we did that for resolve and load is because their innards are substantially different due to the middleware vs iterative design. If that would be more of the same for the fs hooks, then maybe not. I think one of the original reasons to have examples (back when there was just the 1 proposal from which the current middleware-style one is derived) was to show how one loader in the chain interacts with the next (no pun intended).

For instance, what if package foo was remote and its remote is a zip and subsequently that points elsewhere also remote—I don't know if that's actually possible, but if it is, it might make a good example to demonstrate the use of the hook and the need for it to be a hook rather than just a utility.

@GeoffreyBooth
Member

The chaining doc was written to just have two full examples, of a chain of resolve hooks and a chain of load hooks, so that there was enough detail that everyone could understand it. Arguably we should have yet another large example, of chaining loaders where it’s not just a loader with one type of hook but a loader with multiple hooks (like a TypeScript loader that might define both a resolve hook, to say how to resolve imports with TypeScript-specific things like path mappings, and a load hook that transpiles those files’ sources into JavaScript).

One thing we do have to spec out is how hooks connect to each other. I was recently working on import assertions, which currently happen inside the load hook. This means you can’t define a custom assertion validation behavior without overriding/defining an entire custom load hook. We can’t really chain things that happen inside hooks; like if a load hook calls fs.readFile, it can’t call a chain of readFile hooks because then you have code from all registered loaders executing while you’re still in the load hook of the first loader. However we can add additional hooks elsewhere in the pipeline; like resolve returns a url that’s the input of load, which could return source and assertions that are the input of a new hook validate, and then this new validate does the assertion validation and returns source and format. Because validate is fully after load in the pipeline, it’s a new hook that we could add without disrupting the ability of load to be chained. Or put another way, resolve runs all registered resolve hooks, passing the output of the last one into load which runs all registered load hooks, which passes its final output into validate which runs all registered validate hooks. Because they happen in sequence, each of the hooks in this pipeline is chainable.

This is one way to add a new hook without breaking the ability for hooks to be chainable. Being completely separate from this pipeline, the way globalPreload is, is another way. That’s the challenge for new hooks: defining how they can be chainable without breaking the ability of the current resolving/loading pipeline to be chainable.
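The sequencing described above can be sketched as follows. This is only a model: `runChain`, `runPipeline`, and the two loaders are invented for illustration, `validate` is the hook floated in this comment rather than an existing one, and the chaining shown is a simplified iterative style in which each hook receives the previous hook's output.

```javascript
// Hypothetical sketch: none of these names are Node.js APIs.
function runChain(hooks, input) {
  // Simplified iterative chaining: each hook receives the previous hook's output.
  return hooks.reduce((value, hook) => hook(value), input);
}

// Each stage runs every registered hook of its kind, then hands its final
// output to the next stage, so no stage ever runs inside another.
function runPipeline(loaders, specifier) {
  const pick = (name) => loaders.map((l) => l[name]).filter(Boolean);
  const resolved = runChain(pick('resolve'), { specifier });
  const loaded = runChain(pick('load'), resolved);
  return runChain(pick('validate'), loaded);
}

const trace = [];
const loaderA = {
  resolve: (r) => { trace.push('A.resolve'); return { ...r, url: `virtual:${r.specifier}` }; },
  load: (r) => { trace.push('A.load'); return { ...r, source: '{}', assertions: { type: 'json' } }; },
};
const loaderB = {
  load: (r) => { trace.push('B.load'); return r; },
  validate: (r) => {
    trace.push('B.validate');
    if (r.assertions.type !== 'json') throw new Error('unsupported assertion');
    return { source: r.source, format: 'json' };
  },
};

const result = runPipeline([loaderA, loaderB], 'data.json');
console.log(trace.join(' -> ')); // A.resolve -> A.load -> B.load -> B.validate
console.log(result.format);      // json
```

Because all load hooks finish before any validate hook starts, adding `validate` here does not change how the load chain itself is ordered.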

@arcanis
Contributor Author

arcanis commented Nov 18, 2021

if a load hook calls fs.readFile, it can’t call a chain of readFile hooks because then you have code from all registered loaders executing while you’re still in the load hook of the first loader

Can you detail why that would be a bad thing? In my mind, hooks like readFile would be called by the Node helpers. Given that you've suggested a few times to leave it up to the loaders' implementations to call those helpers, my understanding is that hooks can necessarily call each other (indirectly, through an abstracted interface). What would be the problem with that?

@GeoffreyBooth
Member

Can you detail why that would be a bad thing?

Maybe it wouldn’t be, but it would certainly complicate users’ ability to order their loaders. Say you have loaders A, B, and C, where A is first. When A’s load hook starts running and it calls readFile, the registered readFile hooks for all of A, B, and C all run. Then when B’s load hook starts running and it calls readFile, the registered readFile hooks for all of A, B, and C all run all over again.
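A small model of the A/B/C scenario makes the repetition visible. It assumes a hypothetical `readFile` hook and a hypothetical helper, `readFileThroughHooks`, that dispatches to every registered `readFile` hook; neither exists in the loaders API.

```javascript
// Hypothetical model: `readFile` hooks are not part of the loaders API.
const calls = [];
const loaders = ['A', 'B', 'C'].map((name) => ({
  name,
  readFile: (path, caller) => {
    calls.push(`${caller}.load -> ${name}.readFile`);
    return path;
  },
}));

// Stand-in for a Node helper that dispatches to every registered readFile hook.
function readFileThroughHooks(path, caller) {
  return loaders.reduce((value, l) => l.readFile(value, caller), path);
}

// Each loader's load hook reads the file once through that helper.
for (const l of loaders) {
  l.load = (url) => readFileThroughHooks(url, l.name);
}

// Running the load chain triggers the full readFile chain once per load hook.
loaders.forEach((l) => l.load('/tmp/index.mjs'));
console.log(calls.length); // 9: three load hooks times three readFile hooks
console.log(calls[3]);     // B.load -> A.readFile
```

In this model, A's `readFile` hook runs even while B's and C's load hooks are executing, which is the ordering complication described above.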

@arcanis
Contributor Author

arcanis commented Dec 3, 2021

I opened an implementation draft on the Node repository: nodejs/node#41076

@GeoffreyBooth GeoffreyBooth removed the loaders-agenda label (Issues and PRs to discuss during the meetings of the Loaders team) Dec 7, 2021
@GeoffreyBooth
Member

@arcanis I finally reviewed nodejs/node#41076. Sorry for the delay, and I look forward to discussing it tomorrow. A few general notes:

  • The index-sha512.mjs example in the PR feels like something that should be achievable with the current hooks. resolve could find the sha512 suffix, and load could generate custom source in response. Is there something lacking in the current loaders API that prevents this example from being achieved?

  • Along those lines, I think we need an example/use case that isn’t achievable with the current loaders API. Especially a common/core use case like “instrumentation” or “mocking” that is a general need of the community.

  • I think there are some current prominent projects that monkey-patch CommonJS fs; do you mind listing them and what they do, and why the monkey-patching is necessary? Like is Yarn Plug-and-Play one of these, for example? An instrumentation package? Having this written down somewhere, ideally as a file in this repo, would be a great resource for considering use cases that we need to support.

As I wrote in nodejs/node#41076 (comment), I think the next step would be a PR against this repo with some more Markdown files: some background about monkey-patching fs (if you don’t mind), and a design doc for filesystem hooks that can be its own file or part of https://github.com/nodejs/loaders/blob/main/doc/design/overview.md. nodejs/node#41076 proves that an implementation is possible, not that I think anyone would have doubted its achievability; so now we need to work out exactly what the API should be and how it fits together.

@arcanis
Contributor Author

arcanis commented Jan 4, 2022

The index-sha512.mjs example in the PR feels like something that should be achievable with the current hooks. resolve could find the sha512 suffix, and load could generate custom source in response. Is there something lacking in the current loaders API that prevents this example from being achieved?

Can you write one such loader, so that we have a baseline for comparison? As far as I know, the resolve return value must be an existing file, which isn't the case here. As a result the stat calls will crash, making this solution non-viable.

I think there are some current prominent projects that monkey-patch CommonJS fs; do you mind listing them and what they do, and why the monkey-patching is necessary? Like is Yarn Plug-and-Play one of these, for example? An instrumentation package? Having this written down somewhere, ideally as a file in this repo, would be a great resource for considering use cases that we need to support.

Isn't it documented in this very PR? There aren't many other examples, due to the lack of simple primitives (Electron & PnP are the main ones I have in mind, because our projects are among the few with the bandwidth to maintain their own virtual fs implementations), but those capabilities are foundational in both cases.

@aduh95
Contributor

aduh95 commented Jan 4, 2022

This should work:

export function resolve(specifier, context, next) {
  if(specifier.endsWith('?sha512') || specifier.endsWith('-sha512.mjs')) {
    const hash = calculateHash(specifier);
    return { url:`data:text/javascript,export%20default${encodeURI(JSON.stringify(hash))}`, format:'module' };
  }
  return next(specifier, context);
}

Maybe a loader that supports resolving inside a .tar archive would be a better example?

@arcanis
Contributor Author

arcanis commented Jan 10, 2022

@aduh95 as far as I can tell this loader isn't sufficient; since the result is a data-url, Node won't treat it the same as regular files. For instance, if you have an exports field pointing to it, Node will crash:

import hash from 'pkg/hash';

{
  "name": "pkg",
  "exports": {
    "./hash": "./index-sha512.mjs"
  }
}

Error [ERR_MODULE_NOT_FOUND]: Cannot find module '/tmp/index-sha512.mjs' imported from /tmp/index.mjs

Basically, the use case is that, from Node's perspective, nothing should separate virtual files from true files: they should have the exact same semantics and go through the exact same code path. Failing that, they're guaranteed to have diverging resolution behaviors and edge cases.

@aduh95
Contributor

aduh95 commented Jan 10, 2022

I would like for this to work:

export function resolve(specifier, context, defaultResolve) {
  const nextResult = defaultResolve(specifier, context);
  if (nextResult.url.endsWith('?sha512') || nextResult.url.endsWith('-sha512.mjs')) {
    const hash = calculateHash(nextResult.url);
    return {
      url: `data:text/javascript,export%20default${encodeURI(JSON.stringify(hash))}`,
      format: 'module',
    };
  }

  return nextResult;
}

But unfortunately defaultResolve throws if the file doesn't exist. It feels weird to me that this would fail at resolve rather than load; since there is no extension searching in ESM loader, checking if the file exists at this step seems like unnecessary work 🤔 Anyway, that's really not what this thread is about.

since the result is a data-url, Node won't treat it the same as regular files.

Not sure what you mean; it's still a module with a default export that contains the information you seek (Node.js should treat data-url modules the same as regular files, imo).

Basically, the use case is that, from Node's perspective, nothing should separate virtual files from true files: they should have the exact same semantics and go through the exact same code path. Failing that, they're guaranteed to have diverging resolution behaviors and edge cases.

I believe that. I'm still not convinced a hashing loader is the best example for the use case, though.

@GeoffreyBooth
Member

@aduh95 as far as I can tell this loader isn’t sufficient; since the result is a data-url, Node won’t treat it the same as regular files. For instance, if you have an exports field pointing to it, Node will crash:

The https loader example is one case where a resolved URL isn’t a real file, and the load hook supplies its contents: https://github.com/nodejs/loaders/blob/main/doc/design/proposal-chaining-middleware.md#https-loader. If that works for regular imports but not for "exports", then the issue is just that we aren’t sending the "exports" paths through the full loaders code path that import specifiers get (though maybe we shouldn’t?). But it should already work that resolve doesn’t need to return a valid file URL, as long as an accompanying load hook handles that specifier and supplies some source for it.
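A minimal sketch of that pairing, with resolve claiming a URL that is not a file on disk and load supplying its source. The `virtual:` scheme and the in-memory `files` map are made up for illustration, and the `(specifier, context, next)` signature follows the snippets earlier in this thread rather than any finalized API. The hooks are plain functions here so the snippet runs standalone; a real loader would export them from an ES module.

```javascript
// Sketch only: `virtual:` is a made-up scheme and `files` a made-up in-memory store.
const files = new Map([
  ['virtual:/pkg/index.mjs', 'export default 42;'],
]);

// resolve: claim virtual specifiers without touching the disk.
function resolve(specifier, context, next) {
  if (specifier.startsWith('virtual:')) {
    return { url: specifier, format: 'module' };
  }
  return next(specifier, context);
}

// load: supply source for the URLs that resolve claimed above.
function load(url, context, next) {
  if (files.has(url)) {
    return { format: 'module', source: files.get(url) };
  }
  return next(url, context);
}

// Exercise the pair with a stub `next` standing in for the rest of the chain.
const notHandled = () => { throw new Error('not handled by this loader'); };
const { url } = resolve('virtual:/pkg/index.mjs', {}, notHandled);
const { source } = load(url, {}, notHandled);
console.log(url, source);
```

This covers direct imports of a virtual specifier; it does not by itself address paths that Node derives from an "exports" field, which is the case raised earlier in the thread.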

@arcanis
Contributor Author

arcanis commented Jan 11, 2022

it's still a module with a default export that contains the information you seek (Node.js should treat data-url modules same as regular files imo).

But it's not: turning a path into a data url is a lossy process, since you go from location + data to just data. Here's another easy way to break it: imagine we return a virtual file that contains import statements [1]. Where should they be resolved from? Where should Node look for a package.json, to check if there's any exports field covering these imports [2]? If we deal with a virtual file path, it's easy, there's nothing special: we use our virtual file path as importer. But data urls don't have a physical location on disk.
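The loss of location can be seen with the WHATWG `URL` class alone, independently of any loader. The paths below are invented for illustration.

```javascript
// A file: URL keeps its location, so relative specifiers resolve against it.
const fromFile = new URL('./dep.mjs', 'file:///virtual/pkg/index.mjs');
console.log(fromFile.href); // file:///virtual/pkg/dep.mjs

// A data: URL has no hierarchical location, so it cannot serve as a base.
let error;
try {
  new URL('./dep.mjs', 'data:text/javascript,export%20default%201');
} catch (e) {
  error = e;
}
console.log(error instanceof TypeError); // true
```

The same gap applies to looking up an enclosing package.json: there is no directory to walk up from.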

Perhaps it could be worked around by encoding special types of URL (so that instead of returning a data url with the file content, we return a data url with a special payload containing the missing information), but it'd be very fragile and prone to breaking.

Footnotes

  1. Which is what happens in Yarn with our zip layer: we return the packages' source files, so they contain everything a package contains: package.json files, directories, relative imports, bare identifier imports, etc.

  2. Remember that in Yarn's case, this package.json would itself be virtual, inside a virtual directory. Hence why my PR implements hooks for the stat and readJson calls: to allow Node to traverse this virtual hierarchy.

@JakobJingleheimer
Member

As far as I know, the resolve return value must be an existing file, which isn't the case here.

@arcanis that is not correct 🙂 See test-esm-loader.mjs → virtual file (which uses this loader).

6 participants