This repo uses different feature engineering techniques to retrieve, as accurately as possible, the documents in a corpus that are most relevant to a user's query. The retrieval itself runs in the backend (a private repository).
The model first pre-processes both the query and the documents into a bag-of-words representation, using scikit-learn tools.
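
The backend is private, so the following is only a minimal sketch of what bag-of-words pre-processing with scikit-learn can look like. The toy corpus, the query, and the `CountVectorizer` settings (lowercasing, English stop words) are assumptions made for illustration, not the repo's actual configuration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus and query, for illustration only.
documents = [
    "The cat sat on the mat.",
    "Dogs chase cats in the park.",
    "Stock markets fell sharply today.",
]
query = "cat in the park"

# Assumed settings: lowercase the text and drop common English stop words.
vectorizer = CountVectorizer(lowercase=True, stop_words="english")

# Learn the vocabulary from the documents, then reuse it for the query so
# that both end up in the same bag-of-words feature space.
doc_bow = vectorizer.fit_transform(documents)
query_bow = vectorizer.transform([query])

# Sorted list of the terms that survived pre-processing.
vocabulary = sorted(vectorizer.vocabulary_)
```

Fitting on the documents and only transforming the query keeps the query vector aligned with the corpus vocabulary; query terms that never occur in the corpus are simply ignored.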
It then computes the cosine similarity between the query and each document using two feature engineering algorithms (LSI and TF-IDF). The two scores are weighted (evenly, for now) to form a combined cosine similarity score.
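
As a rough sketch of the combined score, assuming LSI is implemented as a truncated SVD over the TF-IDF matrix (a common choice, but an assumption here), the blending could look like this. The function name `combined_similarity`, its parameters, and the toy data are hypothetical.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def combined_similarity(documents, query, w_tfidf=0.5, w_lsi=0.5, n_components=2):
    """Return one blended cosine similarity score per document."""
    # TF-IDF: weight terms by how distinctive they are across the corpus.
    tfidf = TfidfVectorizer(stop_words="english")
    doc_tfidf = tfidf.fit_transform(documents)
    query_tfidf = tfidf.transform([query])
    tfidf_scores = cosine_similarity(query_tfidf, doc_tfidf).ravel()

    # LSI (assumed here to be truncated SVD on the TF-IDF matrix): project
    # documents and query into a low-rank latent "topic" space.
    svd = TruncatedSVD(n_components=n_components, random_state=0)
    doc_lsi = svd.fit_transform(doc_tfidf)
    query_lsi = svd.transform(query_tfidf)
    lsi_scores = cosine_similarity(query_lsi, doc_lsi).ravel()

    # Even weighting by default, matching the current behaviour described above.
    return w_tfidf * tfidf_scores + w_lsi * lsi_scores


# Hypothetical toy data, for illustration only.
documents = [
    "cats and dogs are popular pets",
    "dogs chase cats in the park",
    "stock markets fell sharply today",
    "investors sold stock as markets dropped",
]
scores = combined_similarity(documents, "cats in the park")
best_index = int(np.argmax(scores))
```

Exposing `w_tfidf` and `w_lsi` as parameters makes it straightforward to try weightings other than the even split.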
Afterwards, it attempts to apply pointwise ranking, which re-ranks the scored documents based on historic data to give a set of best solutions. (WIP)
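
Since this step is still a work in progress, the following is only one possible shape for it. In pointwise ranking, each (query, document) pair is scored independently by a model trained on past relevance judgements, and documents are then sorted by that score. The sketch assumes the two cosine scores are the input features and uses logistic regression as the scorer; the historic data, the labels, and the `rank_documents` helper are all hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical historic data: one row per past (query, document) pair.
# Columns: [tfidf_cosine, lsi_cosine]; label 1 means the document was relevant.
X_hist = np.array([
    [0.82, 0.91], [0.75, 0.60], [0.64, 0.88], [0.70, 0.72],
    [0.10, 0.35], [0.05, 0.12], [0.30, 0.20], [0.15, 0.40],
])
y_hist = np.array([1, 1, 1, 1, 0, 0, 0, 0])

# Pointwise approach: learn relevance for individual pairs, ignoring the
# other documents retrieved for the same query.
model = LogisticRegression()
model.fit(X_hist, y_hist)


def rank_documents(features):
    """Score each candidate independently, then sort best-first."""
    relevance = model.predict_proba(features)[:, 1]
    order = np.argsort(-relevance)
    return order, relevance[order]


# Hypothetical candidates for a new query, as [tfidf_cosine, lsi_cosine] rows.
candidates = np.array([[0.20, 0.30], [0.78, 0.85], [0.50, 0.55]])
order, ranked_scores = rank_documents(candidates)
```

Here `order` holds the candidate indices from most to least relevant, and `ranked_scores` holds the matching predicted relevance probabilities.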
The website is integrated with OpenAI's Completion API and uses prompt engineering to give you the most relevant result from the top-ranked file.
- Find the best weights for combining the TF-IDF and LSI cosine similarity scores
- Do more pre-processing for the user's query
- Use natural language to shape the answer sentences so that they respond directly to the query
- Practical Natural Language Processing, Sowmya Vajjala et al. (O'Reilly Media)