Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

languages: copy over latest version from sourcegraph #828

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

keegancsmith
Copy link
Member

I realised we haven't been updating this package as we updated the package in the sourcegraph repo.

We don't need all the functionality it has, but its easier to just copy paste everything.

Test Plan: go test

I realised we haven't been updating this package as we updated the
package in the sourcegraph repo.

We don't need all the functionality it has, but its easier to just copy
paste everything.

Test Plan: go test
@keegancsmith keegancsmith requested review from mmanela, varungandhi-src and a team September 17, 2024 10:20
@cla-bot cla-bot bot added the cla-signed label Sep 17, 2024
Copy link
Contributor

Fuzz test failed on commit 78a7ea4. To troubleshoot locally, use the GitHub CLI to download the seed corpus with

gh run download 10901106986 -n testdata

@keegancsmith keegancsmith marked this pull request as draft September 17, 2024 13:47
@keegancsmith
Copy link
Member Author

Ok there are issues here. The issue is boils down to how we changed normalization between the two versions of the package.

I actually have bigger concerns around this now. We just lookup by query.Language.Language and do no normalization on it. We expect normalization to be done before. This then relies on the normalization being the same at index time as the time we construct the query. I think this is quite easy to mess up, so I need to read more code to see how this can be made more robust.

But essentially I feel like when we read in index data from disk we should normalize by the rules of the running process. Then we do the same normilization when executing a query. There is one other issue where we directly look into the go-enry maps to get out extensions. That code also seems very fragile.

cc @jtibshirani who I think looked at all this code recently.

Copy link
Contributor

@varungandhi-src varungandhi-src left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's move the lib/codeintel/languages package to a public repo and add a dependency on that? Otherwise, having to maintain this copy separately is going to get tedious quickly, and reviewing changes is also not that simple.

Wrt the normalization, I don't fully understand the requirements -- do we need to maintain some kind of compatibility between successive versions of the package?

@keegancsmith
Copy link
Member Author

Let's move the lib/codeintel/languages package to a public repo and add a dependency on that? Otherwise, having to maintain this copy separately is going to get tedious quickly, and reviewing changes is also not that simple.

Agreed, this makes sense to be shared. However, there is a tricky requirement around testing since the implementation of this package is tied directly to the version of go-enry. Maybe what we can do is expose a package which contains the tests, then as part of CI for sourcegraph and zoekt we run those tests to ensure we are compatible with the resolved version of go-enry?

Wrt the normalization, I don't fully understand the requirements -- do we need to maintain some kind of compatibility between successive versions of the package?

We do case-sensitive string comparisons on the language value. I noticed that while doing this upgrade the casing we chose for some stuff changed. I didn't exactly track it down, but I believe it has something to do with things migrating between being inside of go-enry vs our custom support (I believe the code made different decisions around deciding to ToLower a string). I need to look into it more to give you a definite answer, but I saw enough to mark this PR as draft :)

@jtibshirani
Copy link
Member

Let's move the lib/codeintel/languages package to a public repo and add a dependency on that?

+1 let's do that! Could we commit to languages being our only interface for language info (including the IsVendor, etc. methods we added)? Then we could completely remove a direct dependency on go-enry.

@mmanela mmanela removed their request for review January 7, 2025 14:04
@jtibshirani
Copy link
Member

@varungandhi-src @keegancsmith I'd like to revisit this PR. What's your thinking on moving the lib/codeintel/languages package to a public repo so we can add a dependency on that?

@varungandhi-src
Copy link
Contributor

varungandhi-src commented Jan 25, 2025

@jtibshirani Yes, let's do it.

I'm OOO for Lunar New Year next week.

If you want to do this in the fastest way possible, feel free to copy the code to a separate public repo. (I can work on merging it with other code later.)

If you want to keep it with other code, I recommend creating a bindings/go/langur directory in the langur repo (see brief dev docs, fixing the CI (I think that needs an extra flag to Bazel CLI or temporarily pin the Bazel version to something older), copying the code there, and updating the callers.

If you don't have time the coming week, I can follow up on it after coming back.

You should be able to get repo maintainer/admin level access through Entitle.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants