Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: use papaya instead of DashMap for workspace documents #4624

Merged
merged 1 commit into from
Nov 29, 2024

Conversation

arendjr
Copy link
Contributor

@arendjr arendjr commented Nov 24, 2024

Summary

This PR is a proof-of-concept to show that we can use papaya instead of DashMap for workspace documents. It doesn't need immediate merging, although if all tests are green and the team is on board, I'd be happy to merge this as a stepping stone.

Why papaya?

While papaya offers great read performance, the main reason why I want to make this change is because papaya is lock-free. Because of this, it cannot create dead locks. If we want to implement our own caching and file watching in the workspace, I would not want to run the risk of creating dead locks. They can be hard to catch, hard to debug, and generally drain contributor's energy. Let's try to avoid that :)

Tradeoffs

Because papaya is lock-free, it cannot offer a get_mut() method, meaning we cannot update values within a map in-place. This leads to the following trade-offs:

  • It is no longer easy to reuse a node cache that's already in the map. If we really care about reusing node caches, we can work around this using Cell<NodeCache>, but for this PR I've taken an even simpler route: Just don't persist the NodeCache anymore. If we implement long-term persistence and caching, node caches can theoretically grow without bounds over time, effectively creating a memory leak. Maybe in practice this isn't much of an issue, so we need to decide how much we care either way. Let me know which way you are leaning :)
  • Persisting parsed syntax separately from the documents means we need coordination to protect against the situation where separate threads try to access the syntax simultaneously. Previously this resulted in lock contention, which is already not ideal for performance, but in a lock-free environment we need to resolve this situation on our own. I think the simplest solution is simply to store the syntax with the documents and parse optimistically: After all, there are few situations in which we put documents in the workspace if we don't intend to parse them later anyway.

Future Work

  • I suspect we can also get rid of our DenseSlotMap in favor of papaya, in order to save on dependencies.
  • If we decide that the node cache doesn't need persistence, we can simplify our parser APIs a bit further.
  • Continue with file caching and watching.

Test Plan

CI should remain green.

@arendjr arendjr requested a review from a team November 24, 2024 10:06
@github-actions github-actions bot added the A-Project Area: project label Nov 24, 2024
Copy link

codspeed-hq bot commented Nov 24, 2024

CodSpeed Performance Report

Merging #4624 will not alter performance

Comparing arendjr:papaya (1c0b8ae) with next (c1b2e7b)

Summary

✅ 97 untouched benchmarks

@arendjr arendjr changed the title Use papaya instead of DashMap for workspace documents refactor: use papaya instead of DashMap for workspace documents Nov 24, 2024
@ematipico
Copy link
Member

It seems that one of the dependencies of papaya doesn't support WASM :(

https://github.com/biomejs/biome/actions/runs/11998204835/job/33444689330?pr=4624#step:10:253

@arendjr
Copy link
Contributor Author

arendjr commented Nov 24, 2024

That's unfortunate. I've opened an issue about it: ibraheemdev/papaya#32

I'm afraid for now I need to think of a different approach. Using papaya was never a requirement for the rest of the plan, although it would've been nice if we could guarantee the absence of deadlocks. Oh well...

@arendjr arendjr closed this Nov 24, 2024
@arendjr
Copy link
Contributor Author

arendjr commented Nov 26, 2024

Let's see if it works with a WASM shim...

Copy link
Member

@ematipico ematipico left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The refactor showed a shortcoming in how we parse files that I didn't notice before. There's no need to block the PR, but we would need to fix it somehow

/// Returns an error if no file exists in the workspace with this path.
fn get_parse(&self, biome_path: &BiomePath) -> Result<AnyParse, WorkspaceError> {
self.documents
.pin()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to use pin with every papaya hashmap?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is the equivalent of calling .read() or .write() on an RwLock. The pin is like a guard, but it’s called differently because it doesn’t lock the map. What it does do is that it gives you mutable access to the map and it prevents any references you take from getting cleaned up as long as you keep it pinned.

Comment on lines +444 to +448
let parsed = self.parse(&params.path, &params.content, index)?;

if let Some(language) = parsed.language {
index = self.set_source(language);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure we should go in this direction... we are now parsing any file regardless. This means that we parse even files that might be ignored. Before, get_parse was called by functions like format and pull_actions, but now we call it every time we open a file.

But I suppose also the previous logic was doing the same, but now it's more evident that we are making a mistake 🤔

Copy link
Contributor Author

@arendjr arendjr Nov 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I agree we need to carefully think this through. In fact, with multi-file analysis we need to decide what “ignoring” a file even means. Does it mean that we should not show any diagnostics about the file, or does it mean we cannot even extract any information from it? For example:

  • A generated file may import non-generated files. If in turn a non-generated file imports the generated one, that may lead to a cycle. If we don’t analyze the generated file, we would fail to detect the cycle and cannot show the diagnostic on the non-generated file.
  • node_modules will typically be ignored even though we’ll want to extract type information from it.

These use cases make me think we should probably parse even the ignored files.

Note this doesn’t imply we need to parse every file inside node_modules. I would propose we start traversing from the included files and expand traversal to those that get imported from them. For those we have good reason to parse them, even if we never need to get diagnostics from them.

Finally, of course there are other reasons why users may want to exclude files (I’m intentionally avoided the word “ignore” here). A file may be too big, or it might contain invalid syntax. That’s a very different reason, and indeed we would want to skip parsing altogether in such cases. I’m not sure yet what’s the best way to distinguish between these cases however…

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, maybe one more point that we should consider. This code is inside the open_file() function. If the file should truly be excluded, I think we should not even call open_file() to begin with. So the intention should be, if we call open_file() we want to at least get some information out of it, meaning that parsing wouldn't be unnecessary.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great. The CLI already does that, but not the LSP. So we should update the LSP code to check if the file is ignored

})
.ok_or_else(WorkspaceError::not_found)?;

let parsed = self.parse(&params.path, &params.content, index)?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here. If the file is ignored, we should not parse it.

@arendjr
Copy link
Contributor Author

arendjr commented Nov 28, 2024

I removed the shim again, since the latest Git version of papaya works on WASM too. (I confirmed locally, but CI should remain green.)

@arendjr arendjr merged commit ad84856 into biomejs:next Nov 29, 2024
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-Project Area: project
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants