Can the indexed data be imported in an OpenGrok that another OpenGrok indexed? #3965

jetm · 2021-11-01T21:35:24Z

This is not a bug report; it is more of a question.

There are huge repositories (double-digit GB of code) where OpenGrok takes hours to finish indexing because reindexing always fails. Before OpenGrok indexes the new code fetched by Git, the indexed data is removed, so the indexing must be done from scratch. During all these hours, OpenGrok cannot be used.

Is it possible to index the new code in an OpenGrok running on a different machine, copy the newly indexed data, and reuse it in the production OpenGrok? Aside from the source code and the indexed data, what else needs to be copied? The configuration file?

ahornace · 2021-11-01T22:08:43Z

I don't see a problem as long as you use the same major version and configuration (history generation etc.). You should also be able to add the new project without restart (using https://opengrok.docs.apiary.io/#reference/0/projects/add-project and https://opengrok.docs.apiary.io/#reference/0/project-metadata-management/marks-project-as-indexed).
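
A minimal sketch of those two REST calls, assuming the endpoint paths `POST /api/v1/projects` (project name as a plain-text body) and `PUT /api/v1/projects/{project}/indexed` as I read them from the API documentation linked above. The base URL, the project name, and the optional bearer token are hypothetical placeholders; check the paths and authentication against the documentation for the version you run.

```python
import urllib.parse
import urllib.request


def build_requests(base_url, project, token=None):
    """Build the two REST calls: add the project, then mark it as indexed."""
    # Assumption: the web app exposes its API under <base_url>/api/v1.
    api = base_url.rstrip("/") + "/api/v1/projects"
    headers = {"Authorization": "Bearer " + token} if token else {}

    # POST /projects with the project name as a plain-text body.
    add = urllib.request.Request(
        api,
        data=project.encode("utf-8"),
        headers={**headers, "Content-Type": "text/plain"},
        method="POST",
    )
    # PUT /projects/{project}/indexed marks the project as indexed.
    quoted = urllib.parse.quote(project, safe="")
    mark = urllib.request.Request(
        api + "/" + quoted + "/indexed", headers=headers, method="PUT"
    )
    return [add, mark]


def register_project(base_url, project, token=None):
    """Send both requests in order; urlopen raises on an HTTP error status."""
    for request in build_requests(base_url, project, token):
        with urllib.request.urlopen(request) as response:
            print(request.get_method(), request.full_url, response.status)
```

Something like `register_project("http://localhost:8080/source", "foo")` would be run on the production host once the copied index data is in place.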

However, what do you mean by "because reindexing always fails"? Is there a problem with OpenGrok that should be fixed rather than worked around like this?

jetm (Author) · 2021-11-02T21:07:21Z

@ahornace, thank you for your quick response.

By "because reindexing always fails" I meant that OpenGrok reindexing of our huge repositories is unreliable because of the following issues:

* Sometimes the reindexing process is very slow, slower than indexing from scratch. The logs are not helpful because they show the same messages as the previous reindexing, only slower. Indexing one repo can take a day to finish when the previous reindexing took one hour.
* Sometimes it silently fails to reindex new code. The reindex finishes without errors, but later somebody reports that their most recent change is not showing up in OpenGrok.
* OpenGrok reindexing is slow in our huge repositories even when Git pulled a single file change. If only one file changed, I expect it to finish quicker, but it takes double-digit minutes to complete. Compared to indexing from scratch, there is not much difference.

Those problems continue with the most recent OpenGrok releases.

vladak (Maintainer) · 2021-11-03T08:42:29Z

> @ahornace, thank you for your quick response.
>
> By "because reindexing always fails" I meant that OpenGrok reindexing of our huge repositories is unreliable because of the following issues:
> * Sometimes the reindexing process is very slow, slower than indexing from scratch. The logs are not helpful because they show the same messages as the previous reindexing, only slower. Indexing one repo can take a day to finish when the previous reindexing took one hour.

Time to bump the log level up to see what is going on. Maybe use --progress as well.

> * Sometimes it silently fails to reindex new code. The reindex finishes without errors, but later somebody reports that their most recent change is not showing up in OpenGrok.

These need to be debugged case by case. What exactly is missing? Xref? History? Something else?

> * OpenGrok reindexing is slow in our huge repositories even when Git pulled a single file change. If only one file changed, I expect it to finish quicker, but it takes double-digit minutes to complete. Compared to indexing from scratch, there is not much difference.

This is caused by the directory traversal that happens for every reindex. The fix is tracked by #3077.

jetm (Author) · 2021-11-03T22:51:18Z

It's using --progress, but that is not helpful either: I see the same output as in the previous run, only with slower progress.

Usually, it is the indexed data that is missing. Git fetches the new code change, OpenGrok reindexes the repo, and the indexer says it finished successfully. The OpenGrok web page shows the date when indexing finished, but the change does not show up in OpenGrok when you search or open the file directly. It's difficult to debug because there are no errors related to the missing indexed data in the logs.

Yes, I am aware of #3077. Thank you for sharing. Because of that issue, I assumed reindexing was, or might still be, broken, and that problems like the ones I have experienced would be expected in big repositories. #3077 made me switch from reindexing to always indexing from scratch.

I have another question, because I tried different combinations without success and the documentation is not clear. What is the workflow to add one project at a time? Is it supported? I mean: Git pulls the change for the foo repository, and OpenGrok is then told to index only the foo repository.

vladak (Maintainer) · 2021-11-04T19:27:33Z

> It's using --progress, but that is not helpful either: I see the same output as in the previous run, only with slower progress.

Grabbing stack traces of the indexer process with jstack at a moment when no progress is being reported in the logs might shed more light.
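
As a rough sketch of how that could be scripted, the helper below runs `jstack <pid>` a few times and saves each dump to a file, so that consecutive dumps can be compared to see which threads are stuck. The function names, the output file naming, and the default count and interval are hypothetical choices; only the `jstack <pid>` invocation comes from the JDK.

```python
import subprocess
import time
from pathlib import Path


def jstack_command(pid):
    """Command line for a single thread dump of the given JVM process."""
    return ["jstack", str(pid)]


def dump_stacks(pid, out_dir, count=3, interval=10.0, run=subprocess.run):
    """Take `count` thread dumps, `interval` seconds apart, one file per dump."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(count):
        # check=True raises if jstack fails, e.g. when the PID is wrong.
        result = run(jstack_command(pid), capture_output=True, text=True, check=True)
        path = out / f"jstack-{pid}-{i}.txt"
        path.write_text(result.stdout)
        paths.append(path)
        if i + 1 < count:
            time.sleep(interval)
    return paths
```

Threads that sit in the same frames across several dumps are the ones worth reporting.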

> Usually, it is the indexed data that is missing. Git fetches the new code change, OpenGrok reindexes the repo, and the indexer says it finished successfully. The OpenGrok web page shows the date when indexing finished, but the change does not show up in OpenGrok when you search or open the file directly. It's difficult to debug because there are no errors related to the missing indexed data in the logs.

It would be nice to get to the bottom of this because this is the first time I have heard of such a problem. I mean a functional problem, not a performance one. For each file reported in the logs with DefaultIndexChangedListener.fileAdd (reported at the FINE log level), there should be a corresponding xref/index refresh. Were the files for which the problem happened reported in the logs?

The indexer traverses the whole directory tree of the given project (in IndexDatabase#indexDown()) and, for each file present, checks its last modified time stamp against the time stamp of the document corresponding to that file in the index. If the time stamp of the file on the file system is greater, the document is refreshed. So either there was a problem with identifying that the file had changed, or something failed during the document refresh. It also would not hurt to check that the file was indeed updated on the file system, especially its last modified time.
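
The check described above can be sketched roughly as follows. This is an illustration of the time stamp comparison, not OpenGrok's actual code: the `indexed_mtimes` mapping stands in for the time stamps stored in the index, and treating files that have no index entry as needing a refresh is an assumption.

```python
import os


def files_needing_refresh(project_root, indexed_mtimes):
    """Return relative paths whose on-disk mtime is newer than the indexed one.

    indexed_mtimes maps a path relative to project_root to the time stamp
    recorded for it in the index (a stand-in for the real index lookup).
    """
    stale = []
    # The whole tree is walked even if only one file changed.
    for dirpath, _dirnames, filenames in os.walk(project_root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, project_root)
            indexed = indexed_mtimes.get(rel)
            # Assumption: a file that is not in the index yet also needs indexing.
            if indexed is None or os.stat(full).st_mtime > indexed:
                stale.append(rel)
    return sorted(stale)
```

A file whose modification time did not move forward after the Git update would be skipped by a comparison like this, which is why checking the time stamp on the file system is worthwhile.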

> Yes, I am aware of #3077. Thank you for sharing. Because of that issue, I assumed reindexing was, or might still be, broken, and that problems like the ones I have experienced would be expected in big repositories. #3077 made me switch from reindexing to always indexing from scratch.

#3077 is merely a performance enhancement. What is the structure of the repositories in your big project in terms of repository types?

> I have another question, because I tried different combinations without success and the documentation is not clear. What is the workflow to add one project at a time? Is it supported? I mean: Git pulls the change for the foo repository, and OpenGrok is then told to index only the foo repository.

The indexing granularity is per project, i.e. it is not possible to index just one repository of a project.
There has been some discussion related to the per-project workflow in #3728 recently.

jetm (Author) · 2021-11-05T00:46:06Z

> It would be nice to get to the bottom of this because this is the first time I have heard of such a problem. I mean a functional problem, not a performance one. For each file reported in the logs with DefaultIndexChangedListener.fileAdd (reported at the FINE log level), there should be a corresponding xref/index refresh. Were the files for which the problem happened reported in the logs?

Sadly, I don't have the logs to show because I changed everything to index from scratch. I am building a new OpenGrok setup; I could switch it to reindexing, wait until the problem happens, look in the logs, and report back. It could take a while. Sorry.

> #3077 is merely a performance enhancement. What is the structure of the repositories in your big project in terms of repository types?

Let me try to reply with as much information as I am legally allowed to share.

OpenGrok runs in a dedicated Ubuntu 18.04 VM with 64 GB RAM; with less RAM than this, it fails with OOM. That gives an idea of how big the repositories are.

We have around six big Git repositories. Each of them has around 20 GB of source code plus Git history. Each repo is treated as an OpenGrok project and is indexed with its Git history. The setup has a lot of OpenGrok indexing filters to optimize the indexing time and avoid Universal Ctags crashes.

It's a standalone OpenGrok setup; it does not use Docker because the documentation says Docker should not be used for big repositories. It would need to be adjusted, but I don't want to experiment as this is critical for many devs.

vladak (Maintainer) · 2021-11-05T08:54:25Z

Okay, this means that the changes for #3077 should help with lowering the indexing time in your environment.

