---
title: "Re-ranking search results on the client side"
date: "2024-11-03"
author: ["Daoud Clarke"]
---

By many measures, [Mwmbl](https://mwmbl.org) is doing great. We have
indexed over half a billion pages, we have over 4,000 registered
users, and over 30,000 curations from those users. Our volunteers are
crawling around 5 million pages a day.

But the score that I care about most right now is
[NDCG](https://en.wikipedia.org/wiki/Discounted_cumulative_gain). This
measures the quality of our search results against a "gold standard"
which is just Bing search results for the same query. Obviously, we
are not ultimately aiming to be like Bing, so eventually we will stop
using Bing and start using our curated data, once we have enough and
the quality is high enough. But we are far enough away from being good
that moving in a Bing-like direction is great, for now.

Because our NDCG score is pretty poor. A score of 1 would be "matches
Bing exactly", while a score of 0 would be "nothing in common with
Bing". We are scoring 0.336 on our test set. However most of that
comes from sticking Wikipedia results at the top, by querying
Wikipedia with their [excellent
API](https://www.mediawiki.org/wiki/API:Search) and using their
ranking. Without that, we were scoring 0.059. Using Wikipedia alone,
we would score 0.297.
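
To make that concrete, here's a minimal sketch of the calculation; the relevance grades here are derived from a result's position in the Bing ranking, which is an assumption for illustration rather than exactly what our evaluation code does:

```python
import math

def dcg(relevances):
    # Discounted cumulative gain: each relevance grade is discounted by
    # the log of its rank (rank 1 -> log2(2), rank 2 -> log2(3), ...).
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(our_urls, gold_urls):
    # Grade each gold result by position: the top Bing result is the most
    # relevant, and anything not in the gold list scores zero.
    gold_relevance = {url: len(gold_urls) - i for i, url in enumerate(gold_urls)}
    relevances = [gold_relevance.get(url, 0) for url in our_urls]
    ideal = dcg(sorted(gold_relevance.values(), reverse=True))
    return dcg(relevances) / ideal if ideal else 0.0

# Matching Bing exactly scores 1, sharing nothing with Bing scores 0.
print(ndcg(["a", "b", "c"], ["a", "b", "c"]))  # 1.0
print(ndcg(["x", "y", "z"], ["a", "b", "c"]))  # 0.0
```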

I've experimented off and on with [learning to
rank](https://en.wikipedia.org/wiki/Learning_to_rank). This was the
industry standard for ranking before large language models, so it seemed
like an obvious place to start. Implementing a lot of [standard
features](https://www.microsoft.com/en-us/research/project/mslr/)
improved the results slightly over my original intuitively defined
features, but still only gave us an NDCG score of 0.365.
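
The setup looks roughly like the sketch below, using XGBoost's ranking API; the features and labels are toy placeholders rather than our real feature set:

```python
import numpy as np
import xgboost as xgb

# One row per (query, document) pair. Real features would be things like
# BM25-style term match scores against the title, URL and body.
X = np.random.rand(12, 5)
# Graded relevance labels, e.g. derived from the gold ranking.
y = np.array([3, 2, 1, 0, 2, 1, 0, 0, 3, 1, 0, 0])
# Group sizes: the first four rows belong to one query, the next four
# to the next query, and so on.
groups = [4, 4, 4]

ranker = xgb.XGBRanker(objective="rank:ndcg", n_estimators=100)
ranker.fit(X, y, group=groups)

# At query time, score the candidate documents for one query and sort.
scores = ranker.predict(X[:4])
best_first = np.argsort(-scores)
```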

Nevertheless, after trying for a while to improve on this further, I
decided to deploy this ranking instead of the heuristic that we were
using before. It didn't seem to make the results worse, and it seemed
to be fast enough, at least on my local machine.

I didn't realise it at the time, but this was when things started
breaking. You see, we only have one server. We don't do actual
crawling on the server since that is done by volunteers. But we have
to process the crawl data to find new links, prioritise those links
for crawling, preprocess the data for indexing, and index it, as well
as serving search results. All on one server.

And it turned out that adding a ton of features and using XGBoost for
every search query added enough load to the server that things started
to slow down. Even loading the main page could take three or four
seconds. At the time, I didn't realise that this was causing the
problem, since a bunch of other stuff was happening. We had some
enthusiastic volunteers who were crawling many millions of pages a
day.

Search got so slow that I decided we had to turn off the crawling. I
made the server that sends batches to crawl return an empty list. I
turned off the scripts that update the crawl queue and index new
pages. And things got better.

But they still weren't as good as they were before. It took me a long
time to realise that it was the change in ranking that was the
problem.

In retrospect, we should have had better monitoring around everything
so that we could more easily identify the cause of the slowdown. But
this is a project that I do for fun, so I've put off doing this kind
of thing, because I find it boring. Boring is not fun. But a broken
website is also not fun. You live and learn.

Now I've changed the ranking back to the old heuristic and turned on
the crawling again. I still think learning to rank has potential. But
we can't afford it with our current resources. So I've started looking
at alternatives.

What would make ranking almost infinitely scalable? Not doing it
ourselves, but getting our users to do it, on the client side. This works
really well for us, because we don't rank like a normal search
engine. Our results are already ranked for each unigram and bigram
query. We pull out the pre-ranked results for each unigram and bigram
in a user's query, then re-rank them using our heuristic. This is
perfectly feasible to do on the client side, since we don't have more
than around 30 results per unigram or bigram on average.
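
As a rough Python sketch (the real implementation is the Rust described below), the client-side work amounts to something like this, where `fetch_term_results` is a placeholder for however the client fetches the stored list for a term, and the scoring function is much cruder than our actual heuristic:

```python
def query_ngrams(query):
    # "rust search engine" -> ["rust", "search", "engine",
    #                          "rust search", "search engine"]
    terms = query.lower().split()
    return terms + [" ".join(pair) for pair in zip(terms, terms[1:])]

def heuristic_score(result, terms):
    # Crude placeholder heuristic: reward query terms that appear in the
    # title or URL of the result.
    text = (result["title"] + " " + result["url"]).lower()
    return sum(1.0 for term in terms if term in text)

def rerank(query, fetch_term_results):
    # Pull the pre-ranked list for every unigram and bigram, merge and
    # deduplicate by URL, then sort by the heuristic score.
    terms = query_ngrams(query)
    candidates = {}
    for term in terms:
        for result in fetch_term_results(term):
            candidates.setdefault(result["url"], result)
    return sorted(candidates.values(),
                  key=lambda r: heuristic_score(r, terms),
                  reverse=True)
```

With at most around 30 results per term, even a long query only gives a few hundred candidates to score, which is trivial work for a browser.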

It also gave me the final push I needed to implement ranking in
Rust. I know that Python is a bad choice for a search engine, but I
chose it because I knew I could build something quickly, and that I
would probably give up at some point if I tried to do it in Rust. Yes,
I know Java and C++, which would have been fine choices, but they
would not be fun. This project has to be fun or it will not happen.

I am a beginner in Rust, and doing two hard things at the same time -
building a search engine and learning Rust - seemed like a recipe for
disaster. But I can build small bits in Rust, especially if I have
already built them in Python. So over the last few days, I've [rebuilt
our heuristic ranking in Rust](https://github.com/mwmbl/rankeval/).
The Rust compiles to WebAssembly, which will eventually be called from
our [excellent new front end](https://alpha.mwmbl.org/). Now all the
back end needs to do is pull out pre-ranked results for each unigram
and bigram. These can be easily cached, and potentially even kept in
cloud storage. If we are ever going to serve millions of users on our
shoe-string budget, then this is how we will have to do it.
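
On the back end, serving those per-term lists could be as simple as the sketch below; the in-memory index and the cache are illustrative stand-ins rather than how our index is actually stored:

```python
from functools import lru_cache

# Illustrative stand-in for the index: a mapping from unigram or bigram
# to its pre-ranked results. In practice this would be read from the
# index on disk, or from objects in cloud storage.
FAKE_INDEX = {
    "rust": [("https://www.rust-lang.org/", "Rust Programming Language")],
    "search engine": [("https://mwmbl.org/", "Mwmbl")],
}

@lru_cache(maxsize=100_000)
def term_results(term):
    # Pre-ranked results for a single term, cached in-process because
    # they only change when the index is rebuilt.
    return tuple(FAKE_INDEX.get(term, ()))

def results_for_query(query):
    # All the back end returns: one pre-ranked list per unigram and
    # bigram. The re-ranking happens in the client's WebAssembly.
    terms = query.lower().split()
    bigrams = [" ".join(pair) for pair in zip(terms, terms[1:])]
    return {term: list(term_results(term)) for term in terms + bigrams}

print(results_for_query("rust search engine"))
```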

If you would like to get involved, join us on our
[Matrix server](https://matrix.to/#/#mwmbl:matrix.org), or send a pull
request to fix my rubbish Rust code. Thank you for reading!
