Show similar mods on mod page #438

Open
wants to merge 1 commit into
base: alpha
from

Conversation

HebaruSan
Contributor

@HebaruSan HebaruSan commented Dec 17, 2021

Motivation

See #424: a user who lands on a mod page from off-site has few links to other SpaceDock pages. We might have several more mods they'd enjoy, but to find them the user would have to click the header to view an overall list or perform a text search that's currently hard to use. Many websites famously have "related" links that show the user items similar to the current item to help them explore.

Changes

Now the bottom of the mod page hosts a modestly named "Similar-ish Mods" list containing up to 6 mods that are similar to the current mod:

image

(I decided against using the term "related" because to me, "related" mods would be from the same family of mods like Near Future, or would be designed specifically to extend or work with one another. Rather than such a close relationship, we are just talking about mods that might be about some of the same things, like planet packs or parts.)

To get this list, the mod page template inspects Mod.similar_mods, which is an association proxy based on Mod.similarities, which is a one-to-many relationship to a new ModSimilarity (mod_similarity) table:

| Column | Purpose |
| --- | --- |
| main_mod_id | Stores Mod.id of one of the mods being compared |
| other_mod_id | Stores Mod.id of the other mod being compared |
| similarity | A number that is larger for more similar mods and smaller for less similar mods |

An index of (main_mod_id, other_mod_id) allows quick access to known pairs of mods, and an index of (main_mod_id, similarity.desc()) ensures that Mod.similarities can be generated quickly.
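
The table and its two indexes can be illustrated with a minimal sketch. It uses the standard library's sqlite3 module rather than the SQLAlchemy models in this PR, and the index names and the similar_mod_ids helper are hypothetical, chosen only for this example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mod_similarity (
        main_mod_id  INTEGER NOT NULL,
        other_mod_id INTEGER NOT NULL,
        similarity   REAL    NOT NULL
    );
    -- Quick lookup of a known pair of mods (index name is hypothetical)
    CREATE INDEX ix_mod_similarity_pair
        ON mod_similarity (main_mod_id, other_mod_id);
    -- Quick retrieval of one mod's rows, most similar first
    CREATE INDEX ix_mod_similarity_ranked
        ON mod_similarity (main_mod_id, similarity DESC);
""")
conn.executemany(
    "INSERT INTO mod_similarity VALUES (?, ?, ?)",
    [(1, 2, 0.25), (1, 3, 1.5), (1, 4, 0.75), (2, 1, 0.25)],
)


def similar_mod_ids(conn, mod_id):
    """Return the ids of the mods similar to mod_id, most similar first."""
    rows = conn.execute(
        "SELECT other_mod_id FROM mod_similarity"
        " WHERE main_mod_id = ? ORDER BY similarity DESC",
        (mod_id,),
    )
    return [row[0] for row in rows]


print(similar_mod_ids(conn, 1))
```

The second index matches the query's filter and sort order, which is the access pattern the mod page needs.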

ModSimilarity.similarity is a single-precision float because its values will all be between 0 and 4. It is calculated in ModSimilarity's constructor by summing the similarities of the two mods' authors, names, short descriptions, and descriptions. The author similarity is divided by 10 to prevent wildly different mods by the same author from dominating mods that share meaningful keywords (call it "the @linuxgurugamer factor"). Each of these similarities is in turn calculated as:

$$ \frac{|A\cap B|}{|A\cup B|} $$

... where A and B are the words from each string. For authors, each author name is treated as a word, but for the other strings:

  • Blocks of non-alphanumeric characters are treated as word delimiters
  • Numbers and single letters are ignored
  • StudlyCapsWords are considered both in whole ("StudlyCapsWords") and split ("Studly Caps Words")
  • "Meaningless" words are ignored (e.g., "the", "an", "this", and so on for a long list based on currently available texts)
  • Letters are converted to lowercase to facilitate case insensitive matching
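
The rules above can be sketched roughly as follows. This is not the code from the PR: the stop list is a tiny hypothetical stand-in for the long one described, the function signatures and the dict-based mods are assumptions, and treating only ASCII letters and digits as alphanumeric is a simplification.

```python
import re

# Hypothetical, heavily abbreviated stop list; the real one is much longer.
MEANINGLESS = frozenset({"the", "an", "this", "and", "of", "to", "for", "with"})


def split_studly(token):
    """Split 'StudlyCapsWords' into ['Studly', 'Caps', 'Words']."""
    return re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", token)


def meaningful_words(text):
    """Return the set of lowercase words that count toward similarity."""
    words = set()
    # Blocks of non-alphanumeric characters delimit tokens (ASCII only here)
    for token in re.split(r"[^0-9A-Za-z]+", text):
        # Consider each token both whole and split at capital letters
        for candidate in [token, *split_studly(token)]:
            word = candidate.lower()
            # Skip empty strings, single letters, numbers, and stop words
            if len(word) < 2 or word.isdigit() or word in MEANINGLESS:
                continue
            words.add(word)
    return words


def words_similarity(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two sets; 0.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity(mod_a, mod_b):
    """Sum the per-field scores, dividing the author score by 10."""
    authors = words_similarity(set(mod_a["authors"]), set(mod_b["authors"]))
    text = sum(
        words_similarity(meaningful_words(mod_a[f]), meaningful_words(mod_b[f]))
        for f in ("name", "short_description", "description")
    )
    return authors / 10 + text


a = meaningful_words("The NearFuture Propulsion pack")
b = meaningful_words("Near Future electrical parts")
print(sorted(a & b), words_similarity(a, b))
```

In this example the two names share "near" and "future" out of seven distinct words, so their similarity is 2/7.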

All routes that modify these inputs are updated to trigger a recalculation of the affected mod's similar mods in the background via a new Celery task. For this to work, we now also db.commit() before those calls. The same Celery task also populates the initial values in the mod_similarity table for all existing mods in the migration.

The Celery task works by iterating over all mods other than the given one published for the same game and comparing them to the given mod, keeping only the 6 pairs with the highest similarity. Pairs with 0 similarity are never included. Once the most similar pairs are known, the list in Mod.similarities is updated to match. According to https://spacedock.info/api/browse there are 2252 mods on SpaceDock right now, so if each has 6 rows in mod_similarity, that would be 13512 new rows.
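
As an illustration of the selection step only (not the Celery task itself), here is a minimal sketch. The dict-based mods, the most_similar helper, and the tag_overlap scorer are hypothetical stand-ins; the real task compares ModSimilarity objects built from the database.

```python
import heapq

MAX_SIMILAR = 6  # rows kept per mod; 2252 mods * 6 rows = 13512 rows at most


def most_similar(given, all_mods, score):
    """Return up to MAX_SIMILAR (similarity, mod id) pairs, best first."""
    candidates = (
        (score(given, other), other["id"])
        for other in all_mods
        # Only other mods published for the same game are compared
        if other["id"] != given["id"] and other["game_id"] == given["game_id"]
    )
    # Pairs with 0 similarity are never included
    nonzero = (pair for pair in candidates if pair[0] > 0)
    return heapq.nlargest(MAX_SIMILAR, nonzero)


def tag_overlap(a, b):
    """Stand-in scorer for the demo: Jaccard index of two tag sets."""
    union = a["tags"] | b["tags"]
    return len(a["tags"] & b["tags"]) / len(union) if union else 0.0


mods = [
    {"id": 1, "game_id": 1, "tags": {"planet", "pack"}},
    {"id": 2, "game_id": 1, "tags": {"planet", "moons"}},
    {"id": 3, "game_id": 1, "tags": {"engine"}},
    {"id": 4, "game_id": 2, "tags": {"planet", "pack"}},
    {"id": 5, "game_id": 1, "tags": {"planet", "pack", "clouds"}},
]
result = most_similar(mods[0], mods, tag_overlap)
print([mod_id for _, mod_id in result])
```

Mod 3 is dropped because it shares nothing with mod 1, and mod 4 is dropped because it belongs to a different game.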

To reduce redundant work during repeated similarity calculations for the same mod, Mod._author_names and Mod._words now cache the words in the input properties on-demand via Mod.get_author_names() and Mod.get_words().
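
The on-demand caching pattern looks roughly like this. CachedMod is a hypothetical stand-in for the real Mod model, and splitting the description on whitespace is a placeholder for the actual word extraction:

```python
class CachedMod:
    """Hypothetical stand-in for Mod, showing only the lazy word cache."""

    def __init__(self, description):
        self.description = description
        self._words = None  # filled in the first time get_words() is called
        self.extractions = 0  # demo-only counter of how often we recompute

    def get_words(self):
        if self._words is None:
            self.extractions += 1
            # Placeholder extraction; the real code applies the word rules
            self._words = frozenset(self.description.lower().split())
        return self._words


mod = CachedMod("Planet pack with new moons")
for _ in range(1000):  # e.g. one comparison per other mod
    mod.get_words()
print(mod.extractions)
```

However many comparisons are made against the same mod, its words are extracted once.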

To make it easier to experiment with production's large, authentic data set, I have included a utility in tests/test_mod_similarity.py that displays the results of the same algorithm applied to data from the production server's API. It's not a test, but I found it very useful for development, and putting it with the tests seemed better than putting it with production code or on a wiki.

I have tried to isolate the similarity logic to the similarity.py and str_similiarity.py modules in case we find a machine learning package that we want to use later; switching over should be possible as long as we can write replacements for ModSimilarity.__init__, words_similarity and meaningful_words.

Fixes #424.

@HebaruSan HebaruSan added the labels Area: Backend (related to the Python code that runs inside gunicorn), Priority: Normal, Type: Feature, Status: Ready, Area: Migration (related to Alembic database migrations), and Scope: Large (complex changes requiring a lot of effort to develop and review) on Dec 17, 2021
@HebaruSan HebaruSan requested a review from DasSkelett December 17, 2021 01:51
@HebaruSan HebaruSan linked an issue Dec 17, 2021 that may be closed by this pull request
Development

Successfully merging this pull request may close these issues.

[Feature] Show similar mods on mod page
1 participant