Full recrawl and reindex in Solr #18

Open
OkkeKlein opened this issue Apr 22, 2020 · 4 comments

Comments

@OkkeKlein

How would one go about doing a full recrawl of content (filesystemcrawler) and then doing a (hard) commit only after all content has been indexed?

So basically do a fresh update on a live system.

@essiembre
Contributor

To perform a "clean" crawl (without sending only modifications, deletions, etc.) you simply have to delete your "workdir". More precisely, the crawlstore.

To not commit until you are done, you can set the "solrCommitDisabled" option to "true" in your Solr committer section. This means the committer will never send a Solr "commit" request, thus relying on your Solr configuration to decide when to commit, or your manual commit.
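
With commits disabled on the committer, the final hard commit can be sent by hand once the crawl has finished. Below is a minimal Python sketch of that step, assuming Solr's standard JSON update handler at `/update`. The base URL, the collection name, and the function names are hypothetical and not part of the Solr committer.

```python
import json
import urllib.request


def build_commit_request(solr_base_url, collection):
    """Return (url, body) for a JSON 'commit' command on the update handler."""
    # solr_base_url and collection are placeholders for your own setup.
    url = f"{solr_base_url.rstrip('/')}/{collection}/update"
    body = json.dumps({"commit": {}}).encode("utf-8")
    return url, body


def send_hard_commit(solr_base_url, collection, timeout=30):
    """POST the commit command to Solr and return the HTTP status code."""
    url, body = build_commit_request(solr_base_url, collection)
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Calling `send_hard_commit("http://localhost:8983/solr", "mycollection")` after the crawler exits would make the newly indexed documents visible in one step, provided no auto-commit in the Solr configuration fired earlier.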

@OkkeKlein
Author

Thank you!

@OkkeKlein reopened this Apr 23, 2020
@OkkeKlein
Author

How should deleted docs be dealt with? A manual delete all? Or maybe a parameter sent with the Solr committer?

@essiembre
Contributor

A simple approach would be to add two fields in your collection. One that identifies the source crawler, the second that identifies the crawl date. In your config, you can use the ConstantTagger to populate the first one, and for the second, you can use the CurrentDateTagger.

With this, you can use the "delete by query" approach on Solr. You would issue a query that deletes anything older than the date of your full recrawl, for the given crawler.
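
The delete-by-query step could be sketched as follows in Python. This is only an illustration under several assumptions: the field names `crawler` and `crawl_date`, the crawler id, and the Solr URL are hypothetical, and the crawl date is assumed to be indexed as a Solr date field (so the tagger would need to write a format Solr accepts as a date). The payload uses Solr's JSON `delete` command with a range query.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_delete_stale_payload(crawler_id, recrawl_start,
                               crawler_field="crawler",
                               date_field="crawl_date"):
    """Build a JSON delete-by-query command for docs older than the recrawl.

    recrawl_start must be a timezone-aware datetime marking when the full
    recrawl began. Field names are hypothetical defaults.
    """
    # Solr date fields expect UTC ISO-8601 timestamps ending in 'Z'.
    cutoff = recrawl_start.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    # The closing '}' makes the upper bound exclusive, so documents stamped
    # exactly at the cutoff are kept. crawler_id must not contain quotes.
    query = f'{crawler_field}:"{crawler_id}" AND {date_field}:[* TO {cutoff}}}'
    return json.dumps({"delete": {"query": query}})


def delete_stale_docs(update_url, crawler_id, recrawl_start, timeout=30):
    """POST the delete command to a Solr update URL; returns the HTTP status."""
    request = urllib.request.Request(
        update_url,
        data=build_delete_stale_payload(crawler_id, recrawl_start).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Restricting the query to one crawler id leaves documents from other crawlers in the same collection untouched. The delete only becomes visible after the next commit, so it can be issued just before the final hard commit.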
