docs: extend troubleshooting for very large repositories #329
Conversation
Looks good to me, just left some minor suggestions.
```diff
@@ -39,3 +39,6 @@ next-env.d.ts

 # search index file generated on build
 /public/search.json
+
+# IDEs
+.idea
```
I see you also have great taste in IDEs 😊
You can use Code Search to test the query against a particular timestamp in a given repository.
Since Code Insights computes twelve data points over the given time range,
Unindexed search usually takes longer the further back you go in history. For older commits, more files differ from HEAD, so the searcher needs to perform more brute-force file searches.
Could we suggest a specific time to target, like a worst case? I'm not sure how far back Code Insights goes by default. We could even suggest `rev:at.time(...)` as a convenience.
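To make that suggestion concrete, a sketch of what such a query could look like (the date and dependency name are purely illustrative, and `rev:at.time(...)` availability depends on your Sourcegraph version):

```
my_library file:package.json rev:at.time(2021-01-30)
```

Running this in Code Search shows how long the query takes against the repository state at that timestamp, which approximates the worst-case data point of an insight.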
Thank you! I was hoping for some info like that, since I had a suspicion about older commits taking longer to search. I'll update some text above to work your suggestion in. Let me know if that sounds good :)
For example, if you want to track the version of an NPM dependency in your code base, searching for `my_library file:package.json` will compute much faster because there are fewer files to look at and fewer results to return.
We recommend making your query as precise as possible (even omitting results that may be relevant) until you reach a query that computes fast enough.
Tiny suggestions:
- Great that you mentioned file filters; maybe we could mention `lang` too
- We could also mention the importance of quotes `"..."` if your search string contains whitespace
- Maybe we shouldn't say "and even omit results that may be relevant", since we do really want these queries to be relevant :)
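As an illustration, the tips above could be combined into queries like these (the identifiers are hypothetical; the first narrows by file name, the second by language, the third quotes a search string containing whitespace):

```
my_library file:package.json
myFunction lang:go
"my search string" file:docs/
```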
Great ideas! I wasn't so sure about the "omit" part. Dropped that now. I've added some more tips, but added a disclaimer to the lang filter. From what I've seen in Language Stats Insights, this filter needs to load the file content and read it, and can therefore be a bit slower.
The search language filter is implemented differently, and tends to run very quickly. (If you're interested in the technical details, the search `lang` filter first consults the file name, and avoids loading and analyzing content in the vast majority of cases.)
For https://github.com/sourcegraph/sourcegraph/issues/62295
This PR updates the documentation with more tips for very large repositories.
There are difficulties with Code Insights where an insight may run for a while and then tell the user that there were incomplete data points. This likely comes from very large repositories that cannot compute results reasonably fast.
In addition to this documentation update, I'm working on giving users more information about which repositories lead to incomplete data points: https://github.com/sourcegraph/sourcegraph/issues/62578
@sourcegraph/search-platform I poked a bit at the search backend when gathering this info, and would like to get your input if it's accurate, and if there may be other improvements to make complex queries run faster on very large repos :)
@mike-r-mclaughlin Could you review if this new info would be helpful for customers? I'm planning to expose the repositories that caused incomplete datapoints with https://github.com/sourcegraph/sourcegraph/issues/62578. Then a customer can see which repository didn't compute, pick that one, optimize the query, and then run the big Code Insight again.