neozhixuan/internship-nlpfrontend

NLP Pipeline for Multi-Document Search

NLP Video

Introduction

This repo uses several feature engineering techniques to retrieve, as accurately as possible, the documents in a corpus that are most relevant to a user's query. The retrieval itself runs in the backend (private repository).

Pipeline

The model first pre-processes both the query and the documents into bag-of-words representations, using scikit-learn tools.

It then computes the cosine similarity between the query and each document under two feature engineering algorithms (LSI and TF-IDF). The two scores are weighted (evenly, at the moment) to form a combined cosine similarity score.
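A minimal sketch of the combined score, assuming LSI is implemented as a truncated SVD over the TF-IDF matrix (a common scikit-learn approach, not necessarily what the private backend does). The function name `combined_scores`, its parameters, and the toy corpus are hypothetical.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def combined_scores(documents, query, w_tfidf=0.5, w_lsi=0.5, n_components=2):
    """Return one combined similarity score per document (hypothetical helper)."""
    # TF-IDF similarity between the query and every document.
    tfidf = TfidfVectorizer(stop_words="english")
    doc_tfidf = tfidf.fit_transform(documents)
    query_tfidf = tfidf.transform([query])
    tfidf_sim = cosine_similarity(query_tfidf, doc_tfidf).ravel()

    # LSI: project the TF-IDF vectors into a low-rank latent space (assumption).
    svd = TruncatedSVD(n_components=n_components, random_state=0)
    doc_lsi = svd.fit_transform(doc_tfidf)
    query_lsi = svd.transform(query_tfidf)
    lsi_sim = cosine_similarity(query_lsi, doc_lsi).ravel()

    # Weighted sum of the two similarity scores (even weights by default).
    return w_tfidf * tfidf_sim + w_lsi * lsi_sim


# Toy corpus, made up for illustration only.
documents = [
    "the cat sat on the mat",
    "a cat and a dog played on the mat",
    "stock markets fell sharply today",
    "investors sold stock as markets fell",
]
scores = combined_scores(documents, "cat mat")
ranking = np.argsort(-scores)  # document indices, best match first
print(scores, ranking)
```

Keeping the two weights as parameters makes it easy to move away from the even split later.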

Afterwards, it attempts to apply pointwise ranking, which re-ranks the scored documents based on historical data to give a set of best results. (WIP)
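Since this stage is still a work in progress, the following is only one possible shape for it. In pointwise ranking, a model is trained to predict the relevance of each (query, document) pair independently, and candidates are then sorted by that prediction. The sketch assumes the two similarity scores are used as features and a logistic regression as the model; the historical rows and labels are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical historical data: one row per past (query, document) pair,
# with features [tfidf_similarity, lsi_similarity] and a 0/1 relevance label.
X_hist = np.array([
    [0.90, 0.80], [0.75, 0.85], [0.60, 0.70], [0.80, 0.60],
    [0.10, 0.20], [0.05, 0.30], [0.20, 0.10], [0.15, 0.25],
])
y_hist = np.array([1, 1, 1, 1, 0, 0, 0, 0])

# Pointwise model: predicts relevance for a single pair at a time.
ranker = LogisticRegression()
ranker.fit(X_hist, y_hist)


def rank_candidates(features):
    """Sort candidate documents by predicted relevance, best first."""
    relevance = ranker.predict_proba(features)[:, 1]
    order = np.argsort(-relevance)
    return order, relevance[order]


# Feature rows for three candidate documents of a new query (made up).
candidates = np.array([[0.20, 0.15], [0.85, 0.90], [0.50, 0.55]])
order, relevance = rank_candidates(candidates)
print(order, relevance)
```

Any regressor or classifier that outputs a per-document score could take the place of the logistic regression here.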

The website is integrated with OpenAI's Completion API and uses prompt engineering to extract the most relevant result from the top-ranked file.
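The prompt used by the website is not published here, so the snippet below is only a guess at what such a prompt could look like. `build_prompt`, its wording, and the `max_chars` limit are all hypothetical, and the call to OpenAI is deliberately left out.

```python
def build_prompt(query: str, top_document: str, max_chars: int = 2000) -> str:
    """Assemble a completion prompt from the query and the top-ranked file.

    Hypothetical helper: the real prompt wording is an assumption.
    """
    # Truncate long files so the prompt stays within a bounded size.
    excerpt = top_document[:max_chars]
    return (
        "Answer the question using only the document below. "
        "If the document does not contain the answer, say so.\n\n"
        f"Document:\n{excerpt}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )


prompt = build_prompt(
    query="Where did the cat sit?",
    top_document="The cat sat on the mat.",
)
print(prompt)
```

The resulting string would then be sent as the prompt of a completion request, with the model's continuation after `Answer:` shown to the user.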

To-dos

  • Find the best weights for cosine similarity of TF-IDF and LSI algorithms
  • Do more pre-processing for the user's query
  • Use natural language generation to phrase the returned answers as sentences that respond to the query
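For the first to-do, one simple option would be a grid search over the TF-IDF weight, scored by how often the known relevant document ends up ranked first. This is a sketch under that assumption; `best_tfidf_weight`, the evaluation metric, and the toy similarity matrices are hypothetical.

```python
import numpy as np


def best_tfidf_weight(tfidf_sims, lsi_sims, relevant_idx, grid=None):
    """Pick the TF-IDF weight w (LSI gets 1 - w) with the best top-1 accuracy.

    tfidf_sims, lsi_sims: arrays of shape (n_queries, n_docs).
    relevant_idx: index of the relevant document for each query.
    """
    if grid is None:
        grid = np.linspace(0.0, 1.0, 11)
    best_w, best_acc = 0.0, -1.0
    for w in grid:
        combined = w * tfidf_sims + (1.0 - w) * lsi_sims
        # Fraction of queries whose top-scoring document is the relevant one.
        acc = float(np.mean(np.argmax(combined, axis=1) == relevant_idx))
        if acc > best_acc:
            best_w, best_acc = float(w), acc
    return best_w, best_acc


# Made-up similarity scores for three labelled queries over three documents.
tfidf_sims = np.array([[0.9, 0.1, 0.2], [0.2, 0.6, 0.5], [0.3, 0.4, 0.35]])
lsi_sims = np.array([[0.5, 0.6, 0.1], [0.1, 0.3, 0.8], [0.2, 0.1, 0.7]])
relevant_idx = np.array([0, 1, 2])

w, acc = best_tfidf_weight(tfidf_sims, lsi_sims, relevant_idx)
print(w, acc)
```

With real labelled queries, the search should be evaluated on held-out data so the chosen weight is not tuned to a handful of examples.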

Credits

  • Practical Natural Language Processing - Sowmya Vajjala et al. (O'Reilly Publications)

About

[sklearn] NLP Pipeline for Information Retrieval
