NLP Pipeline for Multi-Document Search

NLP Video

Introduction

This repo combines several feature engineering techniques to retrieve, as accurately as possible, the documents in a corpus that are most relevant to a user's query. The retrieval itself runs in the backend (a private repository).

Pipeline

[Image: pipeline diagram]

The model first pre-processes both the query and the documents into bag-of-words representations, using scikit-learn tools.

It then computes the cosine similarity between the query and each document under two feature representations (LSI and TF-IDF). The two similarities are weighted (evenly, for now) to form a combined cosine similarity score.
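A minimal sketch of the combined score, assuming LSI is approximated by a truncated SVD of the TF-IDF matrix (a common scikit-learn approach, not necessarily what the backend does). The function name `combined_scores`, its parameters, and the toy data are hypothetical.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def combined_scores(documents, query, w_tfidf=0.5, w_lsi=0.5, n_topics=2):
    """Return one weighted cosine similarity score per document."""
    # TF-IDF vectors for the documents and the query (shared vocabulary).
    tfidf = TfidfVectorizer(stop_words="english")
    doc_tfidf = tfidf.fit_transform(documents)
    query_tfidf = tfidf.transform([query])
    sim_tfidf = cosine_similarity(query_tfidf, doc_tfidf).ravel()

    # Assumption: LSI is taken to be a truncated SVD of the TF-IDF matrix.
    svd = TruncatedSVD(n_components=n_topics, random_state=0)
    doc_lsi = svd.fit_transform(doc_tfidf)
    query_lsi = svd.transform(query_tfidf)
    sim_lsi = cosine_similarity(query_lsi, doc_lsi).ravel()

    # Weighted sum of the two similarities (even weights by default).
    return w_tfidf * sim_tfidf + w_lsi * sim_lsi


# Hypothetical toy data, for illustration only.
documents = [
    "The cat sat on the mat.",
    "Dogs chase cats in the garden.",
    "Stock markets fell sharply today.",
    "The garden has a cat and a dog.",
]
query = "cat in the garden"

scores = combined_scores(documents, query)
ranking = np.argsort(scores)[::-1]  # document indices, best match first
print(scores)
print(ranking)
```

Keeping the weights as parameters makes it straightforward to try values other than the even split later.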

Afterwards, it applies pointwise ranking, based on historical data, to re-rank the scored documents and return a set of best matches. (WIP)
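Since this stage is still a work in progress, the following is only one possible shape for it. Pointwise ranking treats each (query, document) pair independently and predicts its relevance. The sketch assumes the two similarity scores are the features and uses logistic regression as the model; every number and label here is made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical historical data: one row per (query, document) pair.
# Columns: TF-IDF cosine similarity, LSI cosine similarity.
X_hist = np.array([
    [0.82, 0.90],
    [0.10, 0.25],
    [0.55, 0.70],
    [0.05, 0.10],
    [0.40, 0.15],
    [0.75, 0.60],
])
# Made-up labels: 1 = judged relevant, 0 = judged not relevant.
y_hist = np.array([1, 0, 1, 0, 0, 1])

# Pointwise approach: fit a relevance model on individual pairs.
model = LogisticRegression()
model.fit(X_hist, y_hist)

# Score the candidate documents for a new query (hypothetical features)
# and sort them by predicted probability of relevance.
X_new = np.array([
    [0.20, 0.30],
    [0.70, 0.80],
    [0.45, 0.50],
])
relevance = model.predict_proba(X_new)[:, 1]
ranking = np.argsort(relevance)[::-1]  # candidate indices, best first
print(ranking)
```

A regression model on graded relevance labels would fit the same pointwise framing if the historical data carries more than a binary signal.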

The website is integrated with OpenAI's Completion API and uses prompt engineering to extract the most relevant result from the top-ranked file.
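The actual prompt is not shown in this repo, so the sketch below only illustrates the general idea: wrap the top-ranked document and the user's query in instructions that keep the answer grounded in that document. The helper `build_prompt`, its wording, and the `max_chars` budget are all hypothetical; the resulting string would then be sent to the completion endpoint.

```python
def build_prompt(query: str, top_document: str, max_chars: int = 2000) -> str:
    """Assemble a completion prompt asking for an answer grounded in one document."""
    # Truncate the document so the prompt stays within a rough size budget.
    excerpt = top_document[:max_chars]
    return (
        "Answer the question using only the document below. "
        "If the document does not contain the answer, say so.\n\n"
        f"Document:\n{excerpt}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )


# Hypothetical usage with a toy document.
prompt = build_prompt(
    query="Where does the cat sit?",
    top_document="The cat sat on the mat.",
)
print(prompt)
```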

To-dos

  • Find the best weights for cosine similarity of TF-IDF and LSI algorithms
  • Do more pre-processing for the user's query
  • Use natural language generation to shape answers into sentences that address the query

Credits

  • Practical Natural Language Processing - Sowmya Vajjala et al. (O'Reilly Media)