NLP Pipeline for Multi-Document Search

Introduction

This repo combines several feature engineering techniques to retrieve, as accurately as possible, the documents in a corpus that are most relevant to a user's query. The retrieval itself runs in the backend, which lives in a private repository.

Pipeline

The model first pre-processes both the query and the documents into bag-of-words representations, using scikit-learn tools.
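As a rough illustration of this step, the sketch below builds bag-of-words vectors with scikit-learn's CountVectorizer. The toy corpus, the query, and the vectoriser settings (lowercasing, English stop words) are assumptions made for the example, not the settings used in the private backend.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus and query, for illustration only.
documents = [
    "The cat sat on the mat.",
    "Dogs chase cats in the park.",
    "Stock markets fell sharply today.",
]
query = "cat in the park"

# Assumed settings: lowercase everything and drop English stop words.
vectorizer = CountVectorizer(lowercase=True, stop_words="english")

# Learn the vocabulary from the documents, then reuse it for the query
# so that both end up in the same feature space.
doc_bow = vectorizer.fit_transform(documents)
query_bow = vectorizer.transform([query])

print(sorted(vectorizer.vocabulary_))  # the learned vocabulary
print(doc_bow.toarray())               # one row of term counts per document
print(query_bow.toarray())             # the query in the same columns
```

Fitting on the documents only and then transforming the query means query words that never occur in the corpus are simply ignored.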

It then computes the cosine similarity between the query and each document under two feature engineering algorithms (LSI and TF-IDF). The two similarities are weighted (evenly, for now) and combined into a single cosine similarity score.
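One way the combined score could be computed is sketched below. It assumes LSI is approximated by a truncated SVD of the TF-IDF matrix (scikit-learn's TruncatedSVD); the function name combined_scores, the number of topics, and the toy data are hypothetical, and the backend may do this differently.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def combined_scores(documents, query, w_tfidf=0.5, w_lsi=0.5, n_topics=2):
    """Return one blended cosine-similarity score per document."""
    # TF-IDF space: fit on the documents, project the query into it.
    tfidf = TfidfVectorizer(stop_words="english")
    doc_tfidf = tfidf.fit_transform(documents)
    query_tfidf = tfidf.transform([query])
    tfidf_sim = cosine_similarity(query_tfidf, doc_tfidf).ravel()

    # LSI space (assumption: truncated SVD over the TF-IDF matrix).
    svd = TruncatedSVD(n_components=n_topics, random_state=0)
    doc_lsi = svd.fit_transform(doc_tfidf)
    query_lsi = svd.transform(query_tfidf)
    lsi_sim = cosine_similarity(query_lsi, doc_lsi).ravel()

    # Weighted combination; equal weights mirror the current even split.
    return w_tfidf * tfidf_sim + w_lsi * lsi_sim


# Hypothetical toy corpus and query.
documents = [
    "the cat sat on the mat",
    "dogs chase cats in the park",
    "stock markets fell sharply today",
    "investors sold stock as markets dropped",
]
scores = combined_scores(documents, "stock markets today")
best_first = np.argsort(-scores)  # document indices, best match first
print(scores, best_first)
```

Keeping the two weights as parameters makes it straightforward to tune them later without touching the similarity code.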

Afterwards, it applies pointwise ranking, using historic data to re-rank the scored documents and return a set of best results. (WIP)
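Since this stage is still a work in progress, the following is only a minimal sketch of what pointwise ranking could look like: each (query, document) pair is scored independently by a model trained on past relevance labels, and documents are sorted by that score. The use of logistic regression, the two similarity features, and all data values are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical historic data: one row per (query, document) pair.
# Columns: [tfidf_similarity, lsi_similarity]; label 1 = judged relevant.
X_hist = np.array([
    [0.82, 0.91], [0.75, 0.60], [0.10, 0.35], [0.05, 0.12],
    [0.66, 0.88], [0.20, 0.15], [0.71, 0.79], [0.08, 0.40],
])
y_hist = np.array([1, 1, 0, 0, 1, 0, 1, 0])

# Pointwise approach: learn relevance for each pair on its own.
model = LogisticRegression()
model.fit(X_hist, y_hist)


def rank_documents(features, doc_ids):
    """Sort documents by predicted probability of relevance, best first."""
    probs = model.predict_proba(features)[:, 1]
    order = np.argsort(-probs)
    return [(doc_ids[i], float(probs[i])) for i in order]


# Hypothetical candidates for a new query.
candidates = np.array([[0.15, 0.30], [0.78, 0.85], [0.40, 0.55]])
ranking = rank_documents(candidates, ["doc_a", "doc_b", "doc_c"])
print(ranking)
```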

The website integrates OpenAI's Completion API and uses prompt engineering to surface the most relevant result from the top-ranked file.
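The prompt engineering step might look something like the sketch below, which only assembles the prompt text and makes no API call. The helper name build_prompt, the instruction wording, and the character limit are all hypothetical; the actual prompt used by the website is not shown in this repo.

```python
def build_prompt(query, passages, max_chars=1500):
    """Assemble a completion prompt from the query and top-ranked passages."""
    # Number the passages so the model can refer to them; truncate the
    # context to keep the prompt within an assumed size budget.
    context = "\n\n".join(
        f"[{i}] {text.strip()}" for i, text in enumerate(passages, start=1)
    )[:max_chars]
    return (
        "Answer the question using only the numbered passages below. "
        "If the answer is not in the passages, say so.\n\n"
        f"Passages:\n{context}\n\n"
        f"Question: {query.strip()}\n"
        "Answer:"
    )


# Hypothetical usage with a passage taken from the top-ranked file.
prompt = build_prompt(
    "What is LSI?",
    ["LSI stands for latent semantic indexing."],
)
print(prompt)
```

The resulting string would then be sent as the prompt of a completion request.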

To-dos

Find the best weights for cosine similarity of TF-IDF and LSI algorithms
Do more pre-processing for the user's query
Use natural language to shape the answers into sentences that respond to the query
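For the first to-do, one possible approach is a simple grid search over the TF-IDF weight, scored by mean reciprocal rank on labelled queries. The sketch below assumes such labels exist; the similarity matrices, the relevant-document indices, and the helper names are made up for illustration.

```python
import numpy as np

# Hypothetical validation data: one row per query, one column per document.
tfidf_sims = np.array([[0.9, 0.2, 0.1], [0.3, 0.4, 0.2], [0.1, 0.5, 0.6]])
lsi_sims = np.array([[0.7, 0.6, 0.2], [0.8, 0.3, 0.1], [0.2, 0.3, 0.9]])
relevant = np.array([0, 0, 2])  # index of the relevant document per query


def mean_reciprocal_rank(scores, relevant):
    """Average of 1/rank of the relevant document across queries."""
    rel_scores = scores[np.arange(len(relevant)), relevant]
    # Rank = 1 + number of documents scoring strictly higher.
    ranks = 1 + (scores > rel_scores[:, None]).sum(axis=1)
    return float(np.mean(1.0 / ranks))


def best_tfidf_weight(tfidf_sims, lsi_sims, relevant, steps=11):
    """Try weights w in [0, 1] for TF-IDF (LSI gets 1 - w); keep the best."""
    best_w, best_mrr = 0.0, -1.0
    for w in np.linspace(0.0, 1.0, steps):
        combined = w * tfidf_sims + (1.0 - w) * lsi_sims
        mrr = mean_reciprocal_rank(combined, relevant)
        if mrr > best_mrr:
            best_w, best_mrr = float(w), mrr
    return best_w, best_mrr


best_w, best_mrr = best_tfidf_weight(tfidf_sims, lsi_sims, relevant)
print(best_w, best_mrr)
```

When several weights tie, this version keeps the first one it tries; a larger validation set would usually be needed to tell them apart.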

Credits

Practical Natural Language Processing - Sowmya Vajjala et al. (O'Reilly Media)
