Author: Zheng Liu
Date: 07/2022 - 09/2022
Summary: 2022 MSc graduation project at UoN, researching NLP with a focus on Topic Modelling.
This is the repository for the MSc project researched by Zheng Liu and supervised by Kai Xu in summer 2022. The file structure is described as follows:
- Instruction note (this file, aka, readme.md);
- Document (such as literature review and final report);
- Code (versioning)
Project Introduction:
The whole project is a Sensemaking project that develops a Google Chrome extension for enhancing the user's retrieval experience. The extension captures the links the user explores and provides services such as Visualization, Analysis and Recommendation based on those history links. The research of the whole Sensemaking project can be separated into a front-end part and a back-end part. The front-end part investigates different ways of visualizing the user's exploration in real time after the extension is opened. The back-end part plays an essential role: it analyses the exploration history from the moment the user opens the extension until exploration stops, and in the end the user clicks a button to request the analysis of the search history. Looking ahead, relevant recommendations could be made by accumulating analysis results over time and labelling links.
Tech Details:
The whole project was developed in Python, using Jupyter Notebook as the development tool. The pipeline is as follows (minimal sketches of the main steps are given after this list):
1. Collect a large dataset from the Google search history the user explored: the Chrome browser history file is read through SQLite to obtain the visited links, which are then written to a CSV file via file I/O operations.
2. Crawl the text content of the relevant HTML tags with the Beautiful Soup library to build the dataset.
3. Apply an English NLP model to normalize the words into a corpus: libraries such as spaCy and NLTK are used for part-of-speech (POS) filtering, stemming, lowercasing, lemmatization and tokenization, yielding a corpus of almost 300,000 tokens.
4. Train topic models, whose job is to cluster a set of topics from the corpus; LDA, NMF and LSI models were trained in this project.
5. Evaluate the models: their performance was visualized with matplotlib and measured with the coherence score.
6. Expose the results through a RESTful API, which was tested by sending HTTP requests from Postman.
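As a minimal sketch of step 1, assuming Chrome's default `History` SQLite file location on Windows and its standard `urls` table (the path, table and column names are assumptions, not the project's exact code):

```python
# Sketch: read visited URLs from a local copy of Chrome's History database
# and write them to a CSV file. Chrome locks the live file, so work on a copy.
import csv
import os
import shutil
import sqlite3

# Typical Windows location of the Chrome history database (assumption).
history_path = os.path.expanduser(
    "~/AppData/Local/Google/Chrome/User Data/Default/History"
)
shutil.copy2(history_path, "History_copy")

conn = sqlite3.connect("History_copy")
rows = conn.execute("SELECT url, title, last_visit_time FROM urls").fetchall()
conn.close()

with open("history.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["url", "title", "last_visit_time"])
    writer.writerows(rows)
```

The `browserhistory` package listed in the dependency table below wraps the same idea of extracting local browser history and writing it to CSV files.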
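Steps 2 and 3 could look roughly like this; the spaCy model name (`en_core_web_sm`), the chosen HTML tags and the POS filter are illustrative assumptions rather than the project's exact settings:

```python
# Sketch: fetch a page with requests, pull text from common tags with
# Beautiful Soup, then lowercase, lemmatise and filter tokens with spaCy.
import requests
import spacy
from bs4 import BeautifulSoup

nlp = spacy.load("en_core_web_sm")  # requires: python -m spacy download en_core_web_sm

def page_tokens(url):
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "lxml")
    # Keep only text from headings and paragraphs.
    text = " ".join(tag.get_text(" ", strip=True)
                    for tag in soup.find_all(["h1", "h2", "h3", "p"]))
    doc = nlp(text)
    # Lowercased lemmas, dropping stop words, non-alphabetic tokens and some POS classes.
    return [tok.lemma_.lower() for tok in doc
            if tok.is_alpha and not tok.is_stop
            and tok.pos_ not in {"ADP", "AUX", "PRON", "DET"}]

# One token list per visited page, e.g. for the URLs read from history.csv.
corpus_tokens = [page_tokens(url) for url in ["https://example.com"]]
```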
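Steps 4 and 5 map naturally onto gensim's LdaModel, Nmf, LsiModel and CoherenceModel classes; the topic count and the c_v coherence measure below are illustrative assumptions:

```python
# Sketch: build a dictionary/bag-of-words corpus from token lists, fit LDA,
# NMF and LSI, and compare them with the c_v coherence score.
from gensim.corpora import Dictionary
from gensim.models import LdaModel, LsiModel
from gensim.models.coherencemodel import CoherenceModel
from gensim.models.nmf import Nmf

# Stand-in token lists; in the project this is the ~300,000-token corpus.
texts = [
    ["topic", "model", "cluster", "corpus", "token"],
    ["topic", "model", "latent", "dirichlet", "allocation"],
    ["matrix", "factorization", "corpus", "token", "cluster"],
] * 10

dictionary = Dictionary(texts)
bow = [dictionary.doc2bow(t) for t in texts]

models = {
    "LDA": LdaModel(bow, num_topics=3, id2word=dictionary, random_state=42),
    "NMF": Nmf(bow, num_topics=3, id2word=dictionary, random_state=42),
    "LSI": LsiModel(bow, num_topics=3, id2word=dictionary),
}

for name, model in models.items():
    score = CoherenceModel(model=model, texts=texts, dictionary=dictionary,
                           coherence="c_v").get_coherence()
    print(f"{name}: coherence = {score:.3f}")
```

The coherence scores per model (and per topic count) can then be plotted with matplotlib to compare configurations.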
Research Report:
The research report was written in LaTeX, a markup language.
Library Usage:
The project was developed on a conda kernel (Python 3.9.16); all dependencies are listed below.
Package | Version | Use |
---|---|---|
browserhistory | 0.1.2 | extract browser history from a user's local computer and write the data to csv files. |
sys | built-in | manipulate different parts of the Python runtime environment. |
sqlite3 | 3.40.1 | integrate the SQLite database with Python. |
csv | built-in | implements classes to read and write tabular data in CSV format. |
os | built-in | interacting with the operating system. |
pandas | 1.24.2 | working with relational or labeled data easily and intuitively. |
warnings | built-in | issue warning messages for situations that aren't necessarily exceptions. |
requests | 2.28.2 | Requests allows you to send HTTP/1.1 requests extremely easily. |
BeautifulSoup | 4.11.2 | scrape information from web pages. |
lxml | 4.9.2 | easy handling of XML and HTML files, and can also be used for web scraping. |
urllib3 | 1.26.15 | urllib3 is a powerful, user-friendly HTTP client for Python. |
re | built-in | regular expression matching operations. |
gensim | 4.3.1 | Python library for topic modelling, document indexing and similarity retrieval with large corpora. |
spacy | 3.5.1 | advanced natural language processing. |
matplotlib | 3.7.1 | creating static, animated, and interactive visualizations. |