Show similar mods on mod page #438

HebaruSan · 2021-12-17T01:51:16Z

Motivation

As described in #424, a user who lands on a mod page from off-site has few links to other SpaceDock pages. We might have several more mods they'd enjoy, but to find them the user would have to click the header to view an overall list or perform a text search that's currently hard to use. Many websites have "related" links that show the user items similar to the current one to help them explore.

Changes

Now the bottom of the mod page hosts a modestly named "Similar-ish Mods" list containing up to 6 mods that are similar to the current mod.

(I decided against using the term "related" because to me, "related" mods would be from the same family of mods like Near Future, or would be designed specifically to extend or work with one another. Rather than such a close relationship, we are just talking about mods that might be about some of the same things, like planet packs or parts.)

To get this list, the mod page template inspects Mod.similar_mods, which is an association proxy based on Mod.similarities, which is a one-to-many relationship to a new ModSimilarity (mod_similarity) table:

| Column | Purpose |
| --- | --- |
| `main_mod_id` | Stores `Mod.id` of one of the mods being compared |
| `other_mod_id` | Stores `Mod.id` of the other mod being compared |
| `similarity` | A number that is larger for more similar mods and smaller for less similar mods |

An index of (main_mod_id, other_mod_id) allows quick access to known pairs of mods, and an index of (main_mod_id, similarity.desc()) ensures that Mod.similarities can be generated quickly.
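To make the table and its two indexes concrete, here is a minimal sketch. The PR's terminology (association proxy, relationship, `similarity.desc()`) suggests the real model is defined with SQLAlchemy; this sketch swaps in the standard library's `sqlite3` purely so it is self-contained. The index names, the sample rows, and the `similar_mod_ids` helper are assumptions for illustration, not names from the PR.

```python
import sqlite3

# In-memory database standing in for the real one (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mod_similarity (
        main_mod_id  INTEGER NOT NULL,
        other_mod_id INTEGER NOT NULL,
        similarity   REAL    NOT NULL
    );
    -- Hypothetical index names; the column lists follow the PR description.
    CREATE INDEX ix_mod_similarity_pair
        ON mod_similarity (main_mod_id, other_mod_id);
    CREATE INDEX ix_mod_similarity_ranked
        ON mod_similarity (main_mod_id, similarity DESC);
""")

# Made-up sample rows: (main_mod_id, other_mod_id, similarity).
conn.executemany(
    "INSERT INTO mod_similarity VALUES (?, ?, ?)",
    [(1, 2, 0.5), (1, 3, 1.25), (1, 4, 0.1), (2, 1, 0.5)],
)


def similar_mod_ids(conn, main_mod_id, limit=6):
    """Return the ids of the most similar mods, most similar first."""
    rows = conn.execute(
        "SELECT other_mod_id FROM mod_similarity "
        "WHERE main_mod_id = ? ORDER BY similarity DESC LIMIT ?",
        (main_mod_id, limit),
    )
    return [row[0] for row in rows]


print(similar_mod_ids(conn, 1))
```

The second index matches the shape of this query (filter on `main_mod_id`, order by descending `similarity`), which is the access pattern the mod page needs.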

ModSimilarity.similarity (a single precision float, because the values will all be between 0 and 4) is calculated in ModSimilarity's constructor by summing the similarities of the two mods' authors, names, short descriptions, and descriptions. The author similarity is divided by 10 to prevent wildly different mods by the same author from dominating mods that share meaningful keywords (call it "the @linuxgurugamer factor"). Each of these individual similarities is calculated as:

$$ \frac{|A\cap B|}{|A\cup B|} $$

... where A and B are the words from each string. For authors, each author name is treated as a word, but for the other strings:

- Blocks of non-alphanumeric characters are treated as word delimiters
- Numbers and single letters are ignored
- StudlyCapsWords are considered both in whole ("StudlyCapsWords") and split ("Studly Caps Words")
- "Meaningless" words are ignored (e.g., "the", "an", "this", and so on for a long list based on currently available texts)
- Letters are converted to lowercase to facilitate case insensitive matching
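The word rules and the intersection-over-union ratio above can be sketched roughly as follows. The names `meaningful_words` and `words_similarity` appear in this PR, but the signatures here are guesses. `split_studly`, `total_similarity`, the tiny `MEANINGLESS` list, the dict layout of a mod, the ASCII-only tokenizing, and returning 0 when both word sets are empty are all assumptions made for illustration.

```python
import re

# Hypothetical, deliberately tiny stop-word list; the real list is much longer.
MEANINGLESS = frozenset({"the", "an", "this", "and", "of", "for", "to", "in"})


def split_studly(token):
    """Split 'StudlyCapsWords' into ['Studly', 'Caps', 'Words']."""
    return re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", token)


def meaningful_words(text):
    """Return the set of lowercase words that count toward similarity."""
    words = set()
    # Blocks of non-alphanumeric characters act as delimiters (ASCII only here).
    for token in re.split(r"[^A-Za-z0-9]+", text):
        # Consider each token both whole and split at capital letters.
        for candidate in [token, *split_studly(token)]:
            word = candidate.lower()
            # Skip empty strings, single letters, numbers, and stop words.
            if len(word) > 1 and not word.isdigit() and word not in MEANINGLESS:
                words.add(word)
    return words


def words_similarity(a, b):
    """|A intersect B| / |A union B|, defined as 0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def total_similarity(main, other):
    """Sum the per-field similarities, down-weighting authors by 10."""
    score = words_similarity(main["authors"], other["authors"]) / 10
    for field in ("name", "short_description", "description"):
        score += words_similarity(
            meaningful_words(main[field]), meaningful_words(other[field])
        )
    return score


print(sorted(meaningful_words("The StudlyCapsWords pack, v2 of 3")))
```

In this sketch two mods with identical, non-empty fields score 0.1 + 1 + 1 + 1 = 3.1, which is consistent with the stated range of 0 to 4.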

All routes that modify these inputs are updated to trigger a recalculation of the affected mod's similar mods in the background via a new Celery task. For this to work, we now also db.commit() before those calls. The same Celery task also populates the initial values in the mod_similarity table for all existing mods in the migration.

The Celery task works by iterating over all other mods published for the same game as the given mod and comparing each of them to it, keeping only the 6 pairs with the highest similarity. Pairs with 0 similarity are never included. Once the most similar pairs are known, the list in Mod.similarities is updated to match. According to https://spacedock.info/api/browse there are 2252 mods on SpaceDock right now, so if each has 6 rows in mod_similarity, that would be 13512 new rows.
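The selection step described above can be sketched as follows. This is not the actual Celery task: the `Mod` tuple, the `jaccard` scorer, and `most_similar` are hypothetical stand-ins, and the sketch uses `heapq.nlargest` as one plausible way to keep only the top 6 non-zero pairs.

```python
import heapq
from typing import NamedTuple


class Mod(NamedTuple):
    """Hypothetical stand-in for the real model."""
    id: int
    game_id: int
    words: frozenset


def jaccard(a, b):
    """Placeholder scorer: intersection over union of the two word sets."""
    union = a.words | b.words
    return len(a.words & b.words) / len(union) if union else 0.0


def most_similar(main, all_mods, similarity=jaccard, limit=6):
    """Return up to `limit` (score, other_id) pairs, highest score first."""
    scored = (
        (similarity(main, other), other.id)
        for other in all_mods
        # Skip the mod itself and mods published for other games.
        if other.id != main.id and other.game_id == main.game_id
    )
    # Pairs with 0 similarity are never kept.
    return heapq.nlargest(limit, (pair for pair in scored if pair[0] > 0))


main = Mod(1, 10, frozenset({"planet", "pack"}))
mods = [
    main,
    Mod(2, 10, frozenset({"planet", "pack"})),
    Mod(3, 10, frozenset({"planet"})),
    Mod(4, 10, frozenset({"engine"})),          # no shared words
    Mod(5, 11, frozenset({"planet", "pack"})),  # different game
]
print(most_similar(main, mods))
```

With at most 6 rows kept per mod, the row count quoted above is an upper bound: 2252 × 6 = 13512.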

To reduce redundant work during repeated similarity calculations for the same mod, Mod._author_names and Mod._words now cache the words in the input properties on-demand via Mod.get_author_names() and Mod.get_words().
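As a rough illustration of the on-demand caching, here is a minimal sketch. The attribute `_words` and the method `get_words()` are named in this PR, but everything else is an assumption: the class below is a plain Python stand-in rather than the real model, the cache layout (one word set per text field) is a guess, and the tokenizer is a simplified placeholder.

```python
import re


class Mod:
    """Hypothetical stand-in, not the real model class."""

    def __init__(self, name, short_description, description):
        self.name = name
        self.short_description = short_description
        self.description = description
        self._words = None  # filled in on first use

    def get_words(self):
        """Compute the word sets once and reuse them on later calls."""
        if self._words is None:
            # Simplified placeholder tokenizer (assumption for this sketch).
            self._words = tuple(
                frozenset(w.lower() for w in re.findall(r"[A-Za-z0-9]+", text))
                for text in (self.name, self.short_description, self.description)
            )
        return self._words


mod = Mod("Planet Pack", "More planets", "Adds several new planets")
first = mod.get_words()
print(mod.get_words() is first)
```

When one mod is compared against every other mod for the same game, each mod's text is tokenized once instead of once per comparison. This sketch does not invalidate the cache when a field changes; how the real code handles that is not stated here.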

To make it easier to experiment with production's large, authentic data set, I have included a utility in tests/test_mod_similarity.py that displays the results of the same algorithm applied to data from the production server's API. It's not a test, but I found it very useful for development, and putting it with the tests seemed better than putting it with production code or on a wiki.

I have tried to isolate the similarity logic to the similarity.py and str_similarity.py modules in case we find a machine learning package that we want to use later; switching over should be possible as long as we can write replacements for ModSimilarity.__init__, words_similarity and meaningful_words.

Fixes #424.

HebaruSan added the labels Area: Backend (Related to the Python code that runs inside gunicorn), Priority: Normal, Type: Feature, Status: Ready, Area: Migration (Related to Alembic database migrations), and Scope: Large (Complex changes requiring a lot of effort to develop and review) on Dec 17, 2021

HebaruSan requested a review from DasSkelett December 17, 2021 01:51

HebaruSan linked an issue Dec 17, 2021 that may be closed by this pull request

[Feature] Show similar mods on mod page #424

Open

Show similar mods on mod page

03acf91

HebaruSan force-pushed the feature/related branch from f4072a3 to 03acf91 Compare March 12, 2022 20:28
