The aim of this project is to extract biographical information from the biographical notes published on the MacTutor website and, at the same time, to experiment with different NLP approaches to achieve this goal.
These texts are published under a Creative Commons BY-SA 4.0 license (cf. Copyright Information on the original website) and can thus be used for the present project.
The aim is first to identify named entities and link them to LOD resources such as DBpedia and Wikidata, then to retrieve the temporal relationships and biographical information expressed in the texts in the form of relations among entities, and to store them as Linked Open Data using the SDHSS ontology ecosystem.
Explore the chronological list of mathematicians and prepare data acquisition
Import the texts into a PostgreSQL database
Then produce valid XML in order to be able to operate on the different parts and tags.
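A minimal sketch of this step: wrap a raw MacTutor-style text into well-formed XML so that individual parts can be addressed via tags. The element names (`biography`, `p`) and the paragraph-splitting heuristic are illustrative assumptions, not the project's actual schema.

```python
# Sketch: turn raw text into well-formed XML so parts can be addressed by tag.
# Element names are assumptions, not the project's real schema.
import xml.etree.ElementTree as ET

def text_to_xml(name: str, raw_text: str) -> str:
    root = ET.Element("biography", attrib={"person": name})
    for chunk in raw_text.split("\n\n"):  # naive paragraph split
        p = ET.SubElement(root, "p")
        p.text = chunk.strip()
    return ET.tostring(root, encoding="unicode")

xml_doc = text_to_xml("Abel", "Niels Henrik Abel was born in 1802.\n\nHe died in 1829.")
ET.fromstring(xml_doc)  # raises ParseError if the output were not valid XML
```

Re-parsing the result is a cheap validity check before the XML is used downstream.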
Explore the imported textual data: length, distribution, etc.
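The kind of exploration meant here can be sketched with token counts per text plus simple distribution statistics; the sample texts below are placeholders.

```python
# Sketch of length exploration: word counts per biography and basic stats.
# The texts are placeholders standing in for the imported corpus.
from statistics import mean, median

texts = {
    "Abel": "Niels Henrik Abel was a Norwegian mathematician.",
    "Galois": "Évariste Galois died at the age of twenty.",
}
lengths = {name: len(t.split()) for name, t in texts.items()}
print(lengths)
print("mean:", mean(lengths.values()), "median:", median(lengths.values()))
```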
Extract summaries in view of experimenting with topic modeling and clustering
Link the existing persons to DBpedia to obtain their URIs
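One way to obtain a DBpedia URI for a person is an exact-label SPARQL query against the public endpoint. The sketch below is hedged: `build_query` is a pure helper, the exact-label match is a simplification of real disambiguation needs, and `lookup_uri` assumes network access to `https://dbpedia.org/sparql`.

```python
# Sketch of DBpedia URI lookup via SPARQL. Exact-label matching is a
# simplification; real linking needs disambiguation.
import json
import urllib.parse
import urllib.request

def build_query(name: str) -> str:
    # dbo: and rdfs: are predefined prefixes on the DBpedia endpoint
    return (
        "SELECT ?s WHERE { "
        '?s a dbo:Person ; rdfs:label "%s"@en . } LIMIT 1' % name
    )

def lookup_uri(name: str):
    params = urllib.parse.urlencode(
        {"query": build_query(name), "format": "application/sparql-results+json"}
    )
    with urllib.request.urlopen("https://dbpedia.org/sparql?" + params) as resp:
        data = json.load(resp)
    bindings = data["results"]["bindings"]
    return bindings[0]["s"]["value"] if bindings else None
```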
Explore the functionality of the main library and its many extensions
NLP treatment with spaCy, with the results stored in dedicated tables of the database (to be improved by adding vectors)
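A sketch of what such storage can look like: one row per token with its linguistic annotations. The table and column names are assumptions, not the project's actual schema, and the commented-out part requires psycopg2 and a running PostgreSQL instance.

```python
# Sketch: persist spaCy token annotations, one row per token.
# Table/column names are assumptions, not the project's real schema.
CREATE_TOKENS = """
CREATE TABLE IF NOT EXISTS tokens (
    doc_id   integer,
    idx      integer,
    text     text,
    lemma    text,
    pos      text,
    dep      text,
    head_idx integer
);
"""

INSERT_TOKEN = (
    "INSERT INTO tokens (doc_id, idx, text, lemma, pos, dep, head_idx) "
    "VALUES (%s, %s, %s, %s, %s, %s, %s)"
)

def token_rows(doc_id, doc):
    # doc is a spaCy Doc; only standard token attributes are used
    return [
        (doc_id, t.i, t.text, t.lemma_, t.pos_, t.dep_, t.head.i)
        for t in doc
    ]

# with psycopg2.connect(...) as conn, conn.cursor() as cur:
#     cur.execute(CREATE_TOKENS)
#     cur.executemany(INSERT_TOKEN, token_rows(1, nlp(text)))
```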
Tested and not adopted
This notebook explores spaCy's own coreference resolver.
Create a data model using spaCy and store the result in a PostgreSQL database
Add coreference-resolved texts produced with Coreferee to the database
Link named entities to Wikidata using spaCy plugins
First exploration of frequent term cooccurrences (to be improved)
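The core of such an exploration can be sketched as counting pairs of terms that appear within a fixed-size sliding window; a real run would feed in lemmatized, stopword-filtered tokens from spaCy rather than this toy list.

```python
# Sketch: count term pairs cooccurring within a sliding window.
# Overlapping windows count the same adjacent pair more than once,
# which weights pairs by proximity.
from collections import Counter
from itertools import combinations

def cooccurrences(tokens, window=3):
    pairs = Counter()
    for i in range(len(tokens)):
        for a, b in combinations(tokens[i:i + window], 2):
            if a != b:
                pairs[tuple(sorted((a, b)))] += 1
    return pairs

toks = ["study", "university", "mathematics", "study", "university"]
counts = cooccurrences(toks, window=3)
```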
Basic exploration of the NLP features in order to leverage them for the extraction of relationships between entities
More specific analysis of named entity and verb frequencies, and of the semantic structure of specific relationships, with a focus on the pattern "study at University of ..."
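As a deliberately simplified illustration of this pattern, a regex can pull out the university name; the notebook itself relies on spaCy's dependency parse, which handles far more linguistic variation than this sketch.

```python
# Simplified regex stand-in for the "studied at the University of ..." pattern.
# The real analysis uses the dependency parse, not surface matching.
import re

PATTERN = re.compile(r"stud(?:y|ied|ies) at (?:the )?University of (\w+)")

def find_universities(text):
    return PATTERN.findall(text)

hits = find_universities("He studied at the University of Oslo and later taught there.")
```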
Link main persons to DBPaedia URIs
Explore queries using vector similarities and distances (PostgreSQL extension pgvector)
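A typical pgvector similarity query orders rows by the cosine distance operator `<=>`. In the sketch below the table and column names (`summaries`, `embedding`) are assumptions; running it requires PostgreSQL with the pgvector extension and a driver such as psycopg2.

```python
# Sketch of a pgvector nearest-neighbour query using the cosine
# distance operator <=>. Table/column names are assumptions.
NEAREST = (
    "SELECT person, summary "
    "FROM summaries "
    "ORDER BY embedding <=> %s::vector "
    "LIMIT 5"
)

def as_vector_literal(vec):
    # pgvector accepts a '[x,y,...]' text literal cast to ::vector
    return "[" + ",".join(str(x) for x in vec) + "]"

# cur.execute(NEAREST, (as_vector_literal(query_embedding),))
```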
Initial results are promising, but the diversity of linguistic expressions for the same semantic content requires the construction of overly complex algorithms. Other methods, e.g. using LLMs, should be tried first.
Two ways of using the OpenAI API for information extraction were tested:
- produce simplified sentences, then apply the spaCy model and extract relationships
- use ChatGPT to extract triples (and thus relationships)
In both cases the results are not yet satisfactory and new approaches need to be sought, either by creating a paid account on OpenAI or by using Hugging Face models
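The second approach can be sketched as follows: ask the model for subject-predicate-object triples as JSON and parse the reply. The prompt wording and the JSON shape are assumptions, the model reply shown is hand-written for illustration, and the actual API call is only indicated.

```python
# Sketch of triple extraction via an LLM: build a prompt asking for JSON
# triples, then parse the reply. The reply below is a hand-written example,
# not real model output; the API call itself is omitted.
import json

def build_prompt(text: str) -> str:
    return (
        "Extract biographical facts from the text below as a JSON list of "
        "[subject, predicate, object] triples.\n\n" + text
    )

def parse_triples(reply: str):
    triples = json.loads(reply)
    return [tuple(t) for t in triples if len(t) == 3]

reply = '[["Abel", "born_in", "1802"], ["Abel", "died_in", "1829"]]'
triples = parse_triples(reply)
```

Keeping the parsing step separate from the API call makes it easy to validate model output and to discard malformed triples before they are stored as Linked Open Data.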