-
-
Notifications
You must be signed in to change notification settings - Fork 514
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor: use papaya
instead of DashMap for workspace documents
#4624
Conversation
CodSpeed Performance ReportMerging #4624 will not alter performanceComparing Summary
|
papaya
instead of DashMap for workspace documentspapaya
instead of DashMap for workspace documents
It seems that one of the dependencies of https://github.com/biomejs/biome/actions/runs/11998204835/job/33444689330?pr=4624#step:10:253 |
That's unfortunate. I've opened an issue about it: ibraheemdev/papaya#32 I'm afraid for now I need to think of a different approach. Using |
Let's see if it works with a WASM shim... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The refactor showed a shortcoming in how we parse files that I didn't notice before. There's no need to block the PR, but we would need to fix it somehow
/// Returns an error if no file exists in the workspace with this path. | ||
fn get_parse(&self, biome_path: &BiomePath) -> Result<AnyParse, WorkspaceError> { | ||
self.documents | ||
.pin() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to use pin
with every papaya hashmap?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it is the equivalent of calling .read()
or .write()
on an RwLock
. The pin is like a guard, but it’s called differently because it doesn’t lock the map. What it does do is that it gives you mutable access to the map and it prevents any references you take from getting cleaned up as long as you keep it pinned.
let parsed = self.parse(¶ms.path, ¶ms.content, index)?; | ||
|
||
if let Some(language) = parsed.language { | ||
index = self.set_source(language); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure we should go in this direction... we are now parsing any file regardless. This means that we parse even files that might be ignored. Before, get_parse
was called by functions like format
and pull_actions
, but now we call it every time we open a file.
But I suppose also the previous logic was doing the same, but now it's more evident that we are making a mistake 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I agree we need to carefully think this through. In fact, with multi-file analysis we need to decide what “ignoring” a file even means. Does it mean that we should not show any diagnostics about the file, or does it mean we cannot even extract any information from it? For example:
- A generated file may import non-generated files. If in turn a non-generated file imports the generated one, that may lead to a cycle. If we don’t analyze the generated file, we would fail to detect the cycle and cannot show the diagnostic on the non-generated file.
node_modules
will typically be ignored even though we’ll want to extract type information from it.
These use cases make me think we should probably parse even the ignored files.
Note this doesn’t imply we need to parse every file inside node_modules
. I would propose we start traversing from the included
files and expand traversal to those that get imported from them. For those we have good reason to parse them, even if we never need to get diagnostics from them.
Finally, of course there are other reasons why users may want to exclude files (I’m intentionally avoided the word “ignore” here). A file may be too big, or it might contain invalid syntax. That’s a very different reason, and indeed we would want to skip parsing altogether in such cases. I’m not sure yet what’s the best way to distinguish between these cases however…
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, maybe one more point that we should consider. This code is inside the open_file()
function. If the file should truly be excluded, I think we should not even call open_file()
to begin with. So the intention should be, if we call open_file()
we want to at least get some information out of it, meaning that parsing wouldn't be unnecessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great. The CLI already does that, but not the LSP. So we should update the LSP code to check if the file is ignored
}) | ||
.ok_or_else(WorkspaceError::not_found)?; | ||
|
||
let parsed = self.parse(¶ms.path, ¶ms.content, index)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here. If the file is ignored, we should not parse it.
3e350f9
to
2917a35
Compare
I removed the shim again, since the latest Git version of papaya works on WASM too. (I confirmed locally, but CI should remain green.) |
Summary
This PR is a proof-of-concept to show that we can use papaya instead of DashMap for workspace documents. It doesn't need immediate merging, although if all tests are green and the team is on board, I'd be happy to merge this as a stepping stone.
Why papaya?
While papaya offers great read performance, the main reason why I want to make this change is because papaya is lock-free. Because of this, it cannot create dead locks. If we want to implement our own caching and file watching in the workspace, I would not want to run the risk of creating dead locks. They can be hard to catch, hard to debug, and generally drain contributor's energy. Let's try to avoid that :)
Tradeoffs
Because papaya is lock-free, it cannot offer a
get_mut()
method, meaning we cannot update values within a map in-place. This leads to the following trade-offs:Cell<NodeCache>
, but for this PR I've taken an even simpler route: Just don't persist theNodeCache
anymore. If we implement long-term persistence and caching, node caches can theoretically grow without bounds over time, effectively creating a memory leak. Maybe in practice this isn't much of an issue, so we need to decide how much we care either way. Let me know which way you are leaning :)Future Work
Test Plan
CI should remain green.