MoodMuse

License

An app for discovering poetry using embedding-based semantic retrieval

What is semantic search and why do we want it

Demo

MoodMuse: Demo app

Features

  • Open-ended discovery of poetry based on emotions, themes, objects or settings.
  • Efficient Approximate Nearest Neighbors (ANN) search using NGT

Overview

The app happened because I wanted to understand semantic search. I figured out the basics using the millawell/wikipedia_field_of_science dataset, but wanted to make something that would be fun to use myself and maybe share with friends. So I decided to make something that helps me find better poetry.

Data

~16,000 English poems scraped from poetryfoundation.org

Embeddings and data can be accessed on Google Drive

Modelling notes

I used the MTEB leaderboard and the models listed in the Sentence-Transformers documentation to pick candidates, and tested about 10-15 different models.

Contrary to intuition, larger language models didn't necessarily have better embeddings. This worked out great because the larger models also take much longer to embed and create much larger embeddings.

Embedding-as-a-service platforms like OpenAI are fast, but those embeddings were not great. The larger models tend to have a much vaguer connection to the query than is ideal. Some vagueness is good, too much isn't. And embedding large swaths of text and holding it in a vector db somewhere is much tougher with these services.

The models that are trained for asymmetric retrieval were inferior to the ones trained on symmetric search. This too is counter-intuitive. all-mpnet-base-v2 was the best sentence-transformer model, although BAAI/bge-base-en and thenlper/gte-base were also good.

The main problem with these models is that the max_seq_length is generally much smaller than the text that needs to be embedded. This gives a great representation of the first 300-500 or so characters and no representation of the rest of the text. To solve this, I tried chunking the text and max-pooling the chunk embeddings, which definitely improved the results, but I wanted more.
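For illustration, here is a minimal sketch of the chunk-and-max-pool idea, not the exact code used for the app. It assumes a hypothetical `embed` callable that maps a string to a list of floats (any sentence-embedding model could sit behind it), and it splits on whitespace words as a rough stand-in for the model's token limit. The names `chunk_words`, `max_pool` and `embed_long_text` are made up for this example.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def chunk_words(text: str, chunk_size: int, overlap: int = 0) -> List[str]:
    """Split text into chunks of at most `chunk_size` whitespace-separated words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once a chunk has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


def max_pool(vectors: Sequence[Vector]) -> Vector:
    """Element-wise maximum across a set of equal-length vectors."""
    if not vectors:
        raise ValueError("no vectors to pool")
    return [max(dims) for dims in zip(*vectors)]


def embed_long_text(
    text: str,
    embed: Callable[[str], Vector],  # hypothetical: str -> embedding
    chunk_size: int = 200,
    overlap: int = 20,
) -> Vector:
    """Embed each chunk separately, then max-pool into one vector."""
    chunks = chunk_words(text, chunk_size, overlap) or [""]
    return max_pool([embed(chunk) for chunk in chunks])
```

Max-pooling keeps the strongest signal per dimension from any chunk, so a theme that only shows up late in a long poem can still influence the final vector. Mean-pooling is the other obvious choice and may behave differently.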

Further searching led to jinaai/jina-embeddings-v2-base-en, which was the best-performing embedding model. These guys have figured out a way to ingest up to 8192 tokens using ALiBi. They also have a fine-tuning library that looks very interesting, and they seem like a good alternative to the OpenAIs/Anthropics of the world.

Sentence-Transformers recommends using a reranking model. I tried a few, and while they do marginally improve the results, the improvements were not enough to justify the extra work.

Indexing and retrieval

Following the guide at pinecone and ANN benchmarks, I tried out Neighborhood Graph and Tree (NGT), FAISS and HNSW extensively on multiple datasets. I found that on smaller datasets, NGT and FAISS work the best, and on larger datasets the difference between the three is negligible. This could be because I didn't try out large enough datasets. The differences are small and some hyperparameter tuning could improve things. I implemented NGT in the app because I like Japan and I don't like Facebook.
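One common way to compare ANN indexes is recall@k against an exact brute-force search. The sketch below shows that measurement in plain Python; it is not the benchmark code behind the comparison above, and `exact_top_k` and `recall_at_k` are names invented for this example. The ids returned by NGT, FAISS or HNSW would be passed in as `approx_ids`.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def exact_top_k(query: Vector, corpus: Sequence[Vector], k: int) -> List[int]:
    """Ground truth: ids of the k most similar corpus vectors, by full scan."""
    scored = [(cosine_similarity(query, vec), idx) for idx, vec in enumerate(corpus)]
    # Highest similarity first; ties broken by lower id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]


def recall_at_k(approx_ids: Sequence[int], exact_ids: Sequence[int]) -> float:
    """Fraction of the exact top-k that the approximate index also returned."""
    if not exact_ids:
        return 0.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)
```

Averaging `recall_at_k` over a set of queries, alongside query latency, gives the kind of recall/speed trade-off that ANN benchmarks report.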

Tech stack/Process

  1. Embed corpus on jina-embeddings-v2-base-en
  2. Index embedding using NGT
  3. Embed query using the same model
  4. Search NGT index using query embedding, retrieving based on cosine similarity
  5. Look up top results in a pandas dataframe that has the text of the poems (don't judge me, it's just 50MB and a db is too much work)
  6. Serve the top 5 hits using an Anvil app
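The steps above can be sketched roughly as follows. This is an illustration under loud assumptions, not the app's code: a brute-force scan stands in for the NGT index, a plain dict stands in for the pandas dataframe, and `embed` is a hypothetical callable where the real app would call jina-embeddings-v2-base-en. `PoemSearch` is a name made up for this example.

```python
import math
from typing import Callable, Dict, List, Tuple

Vector = List[float]


def normalize(vec: Vector) -> Vector:
    """Scale to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


class PoemSearch:
    """Brute-force stand-in for the embed -> index -> search -> lookup flow."""

    def __init__(self, poems: Dict[str, str], embed: Callable[[str], Vector]):
        self.embed = embed              # hypothetical: str -> embedding
        self.poems = poems              # title -> text (stand-in for the dataframe)
        self.titles = list(poems)
        # Steps 1-2: embed the corpus once and keep unit-length vectors.
        self.index = [normalize(embed(poems[title])) for title in self.titles]

    def search(self, query: str, top_k: int = 5) -> List[Tuple[str, float]]:
        # Step 3: embed the query with the same model as the corpus.
        q = normalize(self.embed(query))
        # Step 4: score every poem by cosine similarity (an ANN index
        # such as NGT would replace this full scan).
        scored = [
            (sum(a * b for a, b in zip(q, vec)), title)
            for vec, title in zip(self.index, self.titles)
        ]
        scored.sort(key=lambda pair: -pair[0])
        # Steps 5-6: return the best titles; texts are in self.poems.
        return [(title, score) for score, title in scored[:top_k]]
```

Normalizing at indexing time means each query only needs dot products, which is also how many ANN libraries handle cosine distance. Any toy `embed` (for example, word counts over a tiny vocabulary) is enough to try the class out.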

Resources

The app takes great inspiration from the excellent Vicki Boykis, who was puttering around with semantic search around the same time I began and shared her findings in great detail. Her app for finding books by vibes, Viberary, is excellent, and her research on this subject was a major source of information.

Pinecone has a great online book on NLP for semantic search

The Sentence-Transformers documentation and GitHub repo are filled with great instructions and examples on how to train, embed, retrieve etc. This site was open all the time for the last few months.

Wattenberg, et al., "How to Use t-SNE Effectively", Distill, 2016. http://doi.org/10.23915/distill.00002

Interesting papers

Mengzhao Wang, Xiaoliang Xu, Qiang Yue, and Yuxiang Wang. 2021. A comprehensive survey and experimental comparison of graph-based approximate nearest neighbor search. Proc. VLDB Endow. 14, 11 (July 2021), 1964–1978. https://doi.org/10.14778/3476249.3476255

Pretrained Transformers for Text Ranking: BERT and Beyond (Yates et al., NAACL 2021)