Multizim (suggestions) does not work at all #479

Open
kelson42 opened this issue Mar 19, 2021 · 21 comments

Comments

@kelson42
Collaborator

If I search for a suggestion on the welcome page, nothing is displayed.

I would like to see the results, and it would be great to have the logo of the ZIM beside each result to show which ZIM file the content is available in.

See kiwix/kiwix-tools#385 for the lack of scalability of multizim fulltext search.

@maneeshpm
Collaborator

maneeshpm commented Mar 25, 2021

I tried to recreate this bug with a single ZIM file. In this case, the error occurs because the request carries an empty content argument, which makes the getIdForName() method fail. Hence we get a 404 page via the catch block.
https://github.com/kiwix/kiwix-lib/blob/803cb1c2c5b6c99b53bcc540bf6719b69d3552ad/src/server/internalServer.cpp#L395-L402
This is the generated request: http://localhost:8080/suggest?content=&term=berlin
The solution is to fix the faulty request so that it includes suitable content from which bookName can be extracted.

@kelson42
Collaborator Author

@maneeshpm Sounds good but we need to think about the scalability as well. How can we secure a proper response, on time, with 2000 ZIM files?

@JensKorte

This reminds me a little of a meta search engine. A meta search engine queries several search engines and doesn't know when they will finish. In the past, some meta search engines provided an interface with a user-selectable timeout and a list from which search engines could be chosen, grouped by category or language.

If the concern is a timeout between the HTTP server and the browser, the server could send a line containing a space every once in a while until the search is finished. If the search result page URL carries an anchor, the blank lines could be skipped by placing the anchor at the beginning of the results.

Caching could help when several people run the same search, e.g. a school class searching during a lesson. It could also help a single user: the first search gets a short timeout, and when the search is repeated the cache serves the full response. Maybe a line with the timeout-avoiding spaces could be placed at the end of a fast search, and when the server finishes the search the user gets a link saying "Reload to see all results".
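
A rough sketch of the "short timeout first, cached full answer on repeat" idea, for illustration only. The names CachedMultiSearch, CachedAnswer and SearchOne are hypothetical and not part of kiwix-lib. The sketch also assumes the search can be resumed book by book on the next request, whereas a real server would more likely keep searching in a background thread.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Results = std::vector<std::string>;

// Hypothetical per-ZIM search callback: (book name, term) -> matching titles.
using SearchOne =
    std::function<Results(const std::string&, const std::string&)>;

struct CachedAnswer {
  Results results;           // everything found so far for this term
  std::size_t nextBook = 0;  // index of the first book not searched yet
  bool complete = false;     // true once every book has been searched
};

class CachedMultiSearch {
 public:
  CachedMultiSearch(std::vector<std::string> books, SearchOne searchOne)
      : books_(std::move(books)), searchOne_(std::move(searchOne)) {}

  // Searches book after book until done or until `budget` is used up.
  // Progress is cached per term, so repeating the query resumes where the
  // previous call stopped instead of starting over.
  const CachedAnswer& query(const std::string& term,
                            std::chrono::milliseconds budget) {
    const auto deadline = std::chrono::steady_clock::now() + budget;
    CachedAnswer& answer = cache_[term];
    while (answer.nextBook < books_.size() &&
           std::chrono::steady_clock::now() < deadline) {
      Results found = searchOne_(books_[answer.nextBook], term);
      answer.results.insert(answer.results.end(), found.begin(), found.end());
      ++answer.nextBook;
    }
    answer.complete = (answer.nextBook == books_.size());
    return answer;
  }

 private:
  std::vector<std::string> books_;
  SearchOne searchOne_;
  std::map<std::string, CachedAnswer> cache_;
};
```

When `complete` is false, the page could show the partial results together with the "Reload to see all results" link; a second user issuing the same term benefits from whatever the first request already found.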

On the first browser request, the server could respond with a "dynamic" start page that preselects the languages the user has enabled in the browser, e.g. "DE(-ch), EN(-us)". The user could then enter the search phrase and adjust the languages.

@maneeshpm
Collaborator

maneeshpm commented Mar 26, 2021

According to this thread on Xapian, Xapian can handle a search over multiple databases with very little overhead compared to a single-database search. For that, all the databases should be added together using the Xapian::Database::add_database() method. This is already implemented in libzim. IMO the real bottleneck is retrieving the indexes from the ZIM files. An improvement here would be to go async and load all the title indexes using multiple threads. This way, we might be able to set up a Xapian::Enquire object faster and let it handle the search. This is limited by the CPU of the host machine, but it is largely a general solution. It must be done as soon as the library is loaded, though, since we can assume that the user is going to use search.
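
To make the multi-threaded loading idea a bit more concrete, here is a minimal sketch with a bounded number of worker threads. TitleIndex, Loader and loadAllIndexes are hypothetical names; the loader callback stands in for whatever actually extracts an index from a ZIM file, and it is assumed to be thread-safe and not to throw.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for whatever is extracted from one ZIM file.
struct TitleIndex {
  std::string zimPath;
};

// Hypothetical loader; assumed thread-safe and non-throwing.
using Loader = std::function<TitleIndex(const std::string&)>;

// Loads one index per path using up to `workerCount` threads. Each worker
// repeatedly claims the next unprocessed path through an atomic counter, so
// the number of threads stays bounded even with thousands of ZIM files.
std::vector<TitleIndex> loadAllIndexes(const std::vector<std::string>& paths,
                                       const Loader& load,
                                       unsigned workerCount) {
  std::vector<TitleIndex> indexes(paths.size());
  std::atomic<std::size_t> next{0};

  auto work = [&]() {
    for (std::size_t i = next.fetch_add(1); i < paths.size();
         i = next.fetch_add(1)) {
      indexes[i] = load(paths[i]);  // each slot is written by one thread only
    }
  };

  workerCount = std::max(1u, workerCount);
  std::vector<std::thread> workers;
  for (unsigned w = 0; w < workerCount; ++w) {
    workers.emplace_back(work);
  }
  for (std::thread& t : workers) {
    t.join();
  }
  return indexes;
}
```

A caller could pass std::thread::hardware_concurrency() as workerCount (it may return 0, which the function clamps to 1). The results keep the order of the input paths, so they could then be added to the combined database one after another.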

PS: I guess this ticket openzim/libzim#418 is well written and captures the issue very well.
As far as suggestions not working is concerned, I believe we need to fix that piece of code in kiwix-lib.

@kelson42
Collaborator Author

kelson42 commented Mar 26, 2021

retrieving the indexes from the zim

What exactly do you mean here? The IO overhead? Or simply what is reported in openzim/libzim#418?

@maneeshpm
Collaborator

I meant the net cost of (reading a zim + getting the index + adding it to databases object)

@maneeshpm
Collaborator

maneeshpm commented Mar 26, 2021

I think this issue is more suited for kiwix-lib instead of kiwix-tools since the bug is there.

handle_search() and handle_suggest() are somewhat similar routines. Both initially try to get a bookName from the request object inside a try/catch block. When searching from the input box on the welcome page, both functions rely on the content argument of the request to load a bookName, which raises an error and enters the catch block. handle_search() does nothing in the catch block; it has a fallback that gets all open local ZIM files using mp_library->filter(kiwix::Filter().local(true).valid(true)) and does not raise any error. handle_suggest(), on the other hand, returns a 404 in the catch block, hence this behavior. We can implement the same fallback in handle_suggest() to fix this issue.

I think that until the scaling issue is sorted out, we should hide this feature from the main page, as it hurts the user experience with a high number of ZIM files.
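
Roughly, the fallback could look like the sketch below. This is not the actual kiwix-lib code: FakeLibrary and booksForSuggest are hypothetical, simplified stand-ins, and the exception type thrown by the real getIdForName() is an assumption.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for the kiwix library object.
struct FakeLibrary {
  std::map<std::string, std::string> nameToId;

  // Mirrors the described behaviour: an unknown (or empty) name throws.
  // The exception type is an assumption made for this sketch.
  std::string getIdForName(const std::string& name) const {
    auto it = nameToId.find(name);
    if (it == nameToId.end()) {
      throw std::out_of_range("unknown book name: '" + name + "'");
    }
    return it->second;
  }

  // Stands in for mp_library->filter(kiwix::Filter().local(true).valid(true)).
  std::vector<std::string> localValidBookIds() const {
    std::vector<std::string> ids;
    for (const auto& entry : nameToId) {
      ids.push_back(entry.second);
    }
    return ids;
  }
};

// Picks the books a suggestion request should run against: the single book
// named by `content` when it resolves, otherwise every local valid book
// (the fallback handle_search() is described as using) instead of a 404.
std::vector<std::string> booksForSuggest(const FakeLibrary& library,
                                         const std::string& content) {
  try {
    return {library.getIdForName(content)};
  } catch (const std::out_of_range&) {
    return library.localValidBookIds();
  }
}
```

With this shape, a request like /suggest?content=&term=berlin would no longer end in a 404 but would run against all local books, which is exactly where the scalability question comes in.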

@kelson42 kelson42 transferred this issue from kiwix/kiwix-tools Mar 26, 2021
@stale

stale bot commented Jun 5, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

@maneeshpm
Collaborator

maneeshpm commented Jul 16, 2021

@kelson42 we can state as a fact that once a Xapian database is ready, searching it is quick (even on a huge Xapian DB), and that is something we cannot improve on our side. Our main concern now is how to make the DB ready the first time and how to keep it ready for further searches.

The answer to how to keep it ready is caching, which we have already started looking into in #509.

Answering how to make it ready quickly the first time is a bit more complicated. Currently, on the libkiwix side, we create a zim::Searcher only after receiving a query (we create one on each query, hence the slowness). We could prepare a zim::Searcher as soon as the user opens a multizim, because we can expect them to do at least one search on it.

Now, what to do while the zim::Searcher is being created? Extracting the Xapian entry from every ZIM takes time in the multizim case. We could show a message "Searcher is preparing" and offer a simpler, stripped-down search using the ZIM index (which is quick) until the searcher is ready.
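
A minimal sketch of that "degraded until ready" behaviour, under stated assumptions: WarmingSearcher and SearchAnswer are hypothetical names, the quick path is modelled as a plain title-prefix match, and the slow preparation step is reduced to a copy. In practice prepare() would be called from a background thread right after the library is loaded.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

struct SearchAnswer {
  std::vector<std::string> hits;
  bool degraded;  // true while the full searcher is still being prepared
};

class WarmingSearcher {
 public:
  explicit WarmingSearcher(std::vector<std::string> titles)
      : titles_(std::move(titles)) {}

  // Meant to run once, e.g. on a background thread. Here it only copies the
  // titles; in reality this would be the slow step of building the searcher.
  void prepare() {
    fullIndex_ = titles_;
    // Release pairs with the acquire in search(): a thread that sees
    // ready_ == true also sees the completed fullIndex_.
    ready_.store(true, std::memory_order_release);
  }

  SearchAnswer search(const std::string& term) const {
    if (ready_.load(std::memory_order_acquire)) {
      return {match(fullIndex_, term, /*prefixOnly=*/false), false};
    }
    // Not ready yet: fall back to the cheap prefix match on titles.
    return {match(titles_, term, /*prefixOnly=*/true), true};
  }

 private:
  static std::vector<std::string> match(const std::vector<std::string>& pool,
                                        const std::string& term,
                                        bool prefixOnly) {
    std::vector<std::string> hits;
    for (const std::string& title : pool) {
      const std::size_t pos = title.find(term);
      if (pos == 0 || (!prefixOnly && pos != std::string::npos)) {
        hits.push_back(title);
      }
    }
    return hits;
  }

  std::vector<std::string> titles_;
  std::vector<std::string> fullIndex_;
  std::atomic<bool> ready_{false};
};
```

The `degraded` flag is what the UI would use to decide whether to show the "Searcher is preparing" message next to the results.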

@stale stale bot removed the stale label Jul 16, 2021
@kelson42
Collaborator Author

The cold-start topic is already covered in openzim/libzim#418, so I would keep it out of this ticket. That said, I still believe that if kiwix-serve has 2000 ZIM files open, a multizim search won't give an answer in a reasonable time or with reasonable memory consumption. This is IMO mostly what this ticket is about.

@kelson42
Collaborator Author

Here is how I would propose to proceed. First of all, this is quite a large ticket, so I would propose to split it into the following tasks:

  • Multizim search ABI design should be agreed/confirmed, and automated tests should be written to test it.
  • The high-load situation should be discussed and a solution provided to (1) give users proper feedback in a reasonable amount of time and (2) keep the whole software from crashing or timing out for lack of CPU/memory.
  • Kiwix-serve multizim REST API design should be adapted/checked/tested.
  • Kiwix-serve multizim search should be re-introduced (it was the default on the welcome page taskbar, but for a few weeks now we have not had a taskbar on the welcome page... it was not working most of the time anyway).

@maneeshpm @mgautierfr Do you agree? Do you have any comments?

@kelson42
Collaborator Author

Depends on #509

@kelson42 kelson42 pinned this issue Sep 12, 2021
@kelson42 kelson42 moved this from To do to In progress in Improved search (ft/suggestions) Dec 3, 2021
@kelson42
Collaborator Author

kelson42 commented Dec 26, 2021

@maneeshpm Would you mind tackling the multizim problem until we fix the last details of #509? Maybe you have feedback about my last comment?

@kelson42 kelson42 modified the milestone: 10.2.0 Jan 10, 2022
@kelson42 kelson42 modified the milestones: 10.1.0, 10.2.0 Mar 24, 2022
@kelson42 kelson42 moved this from In progress to To do in Improved search (ft/suggestions) Mar 30, 2022
@kelson42 kelson42 modified the milestones: 10.2.0, 10.3.0 Apr 7, 2022
@kelson42 kelson42 modified the milestones: 10.2.0, 10.3.0 Apr 23, 2022
@kelson42
Collaborator Author

Fulltext multizim search is fixed with #731. The multizim suggestion work is left to do.

@kelson42 kelson42 moved this from To do to In progress in Improved search (ft/suggestions) May 7, 2022
@kelson42 kelson42 moved this from In progress to To do in Improved search (ft/suggestions) Jun 26, 2022
@stale

stale bot commented Jul 10, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

@stale stale bot added the stale label Jul 10, 2022
@kelson42
Collaborator Author

I guess this is a ticket for openzim/libzim by now.

We should fix openzim/libzim#734 first IMO.
