The aim of this project is to extract biographical information from the biographical notes published on the MacTutor website and, at the same time, to experiment with different NLP approaches to achieve this goal.
These texts are published under a Creative Commons BY-SA 4.0 license (cf. Copyright Information on the original website) and can thus be used for the present project.
The aim is first to identify named entities and link them to LOD resources such as DBpedia and Wikidata, then to retrieve the temporal relationships and biographical information expressed in the texts in the form of relations among entities, and to store them as Linked Open Data using the SDHSS ontology ecosystem.
Explore the chronological list of mathematicians and prepare data acquisition
Import the texts into a PostgreSQL database
Then produce valid XML in order to be able to operate on the different parts and tags.
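A minimal sketch of this step: wrap a raw MacTutor-style text into well-formed XML so that individual parts can be addressed via tags. The element names (`biography`, `p`) and the paragraph-splitting heuristic are illustrative assumptions, not the project's actual schema.

```python
# Sketch: turn raw text into well-formed XML so parts can be addressed by tag.
# Element names are assumptions, not the project's real schema.
import xml.etree.ElementTree as ET

def text_to_xml(name: str, raw_text: str) -> str:
    root = ET.Element("biography", attrib={"person": name})
    for chunk in raw_text.split("\n\n"):  # naive paragraph split
        p = ET.SubElement(root, "p")
        p.text = chunk.strip()
    return ET.tostring(root, encoding="unicode")

xml_doc = text_to_xml("Abel", "Niels Henrik Abel was born in 1802.\n\nHe died in 1829.")
ET.fromstring(xml_doc)  # raises ParseError if the output were not valid XML
```

Re-parsing the result is a cheap validity check before the XML is used downstream.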
Explore the imported textual data: length, distribution, etc.
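The kind of exploration meant here can be sketched with token counts per text plus simple distribution statistics; the sample texts below are placeholders.

```python
# Sketch of length exploration: word counts per biography and basic stats.
# The texts are placeholders standing in for the imported corpus.
from statistics import mean, median

texts = {
    "Abel": "Niels Henrik Abel was a Norwegian mathematician.",
    "Galois": "Évariste Galois died at the age of twenty.",
}
lengths = {name: len(t.split()) for name, t in texts.items()}
print(lengths)
print("mean:", mean(lengths.values()), "median:", median(lengths.values()))
```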
Extract summaries in view of experimenting with topic modeling and clustering
Link the existing persons to DBpedia to obtain their URIs
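One way to obtain a DBpedia URI for a person is an exact-label SPARQL query against the public endpoint. The sketch below is hedged: `build_query` is a pure helper, the exact-label match is a simplification of real disambiguation needs, and `lookup_uri` assumes network access to `https://dbpedia.org/sparql`.

```python
# Sketch of DBpedia URI lookup via SPARQL. Exact-label matching is a
# simplification; real linking needs disambiguation.
import json
import urllib.parse
import urllib.request

def build_query(name: str) -> str:
    # dbo: and rdfs: are predefined prefixes on the DBpedia endpoint
    return (
        "SELECT ?s WHERE { "
        '?s a dbo:Person ; rdfs:label "%s"@en . } LIMIT 1' % name
    )

def lookup_uri(name: str):
    params = urllib.parse.urlencode(
        {"query": build_query(name), "format": "application/sparql-results+json"}
    )
    with urllib.request.urlopen("https://dbpedia.org/sparql?" + params) as resp:
        data = json.load(resp)
    bindings = data["results"]["bindings"]
    return bindings[0]["s"]["value"] if bindings else None
```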
Explore the functionality of the main library and its many extensions
NLP treatment with spaCy, with the results stored in dedicated tables of the database (to be improved by adding vectors)
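A sketch of what such storage can look like: one row per token with its linguistic annotations. The table and column names are assumptions, not the project's actual schema, and the commented-out part requires psycopg2 and a running PostgreSQL instance.

```python
# Sketch: persist spaCy token annotations, one row per token.
# Table/column names are assumptions, not the project's real schema.
CREATE_TOKENS = """
CREATE TABLE IF NOT EXISTS tokens (
    doc_id   integer,
    idx      integer,
    text     text,
    lemma    text,
    pos      text,
    dep      text,
    head_idx integer
);
"""

INSERT_TOKEN = (
    "INSERT INTO tokens (doc_id, idx, text, lemma, pos, dep, head_idx) "
    "VALUES (%s, %s, %s, %s, %s, %s, %s)"
)

def token_rows(doc_id, doc):
    # doc is a spaCy Doc; only standard token attributes are used
    return [
        (doc_id, t.i, t.text, t.lemma_, t.pos_, t.dep_, t.head.i)
        for t in doc
    ]

# with psycopg2.connect(...) as conn, conn.cursor() as cur:
#     cur.execute(CREATE_TOKENS)
#     cur.executemany(INSERT_TOKEN, token_rows(1, nlp(text)))
```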
Tested and not adopted
This notebook explores spaCy's own coreference resolver.
Create a data model using spaCy and store the result in a PostgreSQL database
Add coreference-resolved texts produced with Coreferee to the database
Link named entities to Wikidata using spaCy plugins
First exploration of frequent term cooccurrences (to be improved)
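The core of such an exploration can be sketched as counting pairs of terms that appear within a fixed-size sliding window; a real run would feed in lemmatized, stopword-filtered tokens from spaCy rather than this toy list.

```python
# Sketch: count term pairs cooccurring within a sliding window.
# Overlapping windows count the same adjacent pair more than once,
# which weights pairs by proximity.
from collections import Counter
from itertools import combinations

def cooccurrences(tokens, window=3):
    pairs = Counter()
    for i in range(len(tokens)):
        for a, b in combinations(tokens[i:i + window], 2):
            if a != b:
                pairs[tuple(sorted((a, b)))] += 1
    return pairs

toks = ["study", "university", "mathematics", "study", "university"]
counts = cooccurrences(toks, window=3)
```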
Basic exploration of the NLP features in order to leverage them for the extraction of relationships between entities
More specific analysis of named entity and verb frequencies, and of the semantic structure of specific relationships, with a focus on the pattern "study at University of ..."
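As a deliberately simplified illustration of this pattern, a regex can pull out the university name; the notebook itself relies on spaCy's dependency parse, which handles far more linguistic variation than this sketch.

```python
# Simplified regex stand-in for the "studied at the University of ..." pattern.
# The real analysis uses the dependency parse, not surface matching.
import re

PATTERN = re.compile(r"stud(?:y|ied|ies) at (?:the )?University of (\w+)")

def find_universities(text):
    return PATTERN.findall(text)

hits = find_universities("He studied at the University of Oslo and later taught there.")
```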
Link main persons to DBPaedia URIs
Explore queries using vector similarities and distances (PostgreSQL extension pgvector)
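A typical pgvector similarity query orders rows by the cosine distance operator `<=>`. In the sketch below the table and column names (`summaries`, `embedding`) are assumptions; running it requires PostgreSQL with the pgvector extension and a driver such as psycopg2.

```python
# Sketch of a pgvector nearest-neighbour query using the cosine
# distance operator <=>. Table/column names are assumptions.
NEAREST = (
    "SELECT person, summary "
    "FROM summaries "
    "ORDER BY embedding <=> %s::vector "
    "LIMIT 5"
)

def as_vector_literal(vec):
    # pgvector accepts a '[x,y,...]' text literal cast to ::vector
    return "[" + ",".join(str(x) for x in vec) + "]"

# cur.execute(NEAREST, (as_vector_literal(query_embedding),))
```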
Initial results are promising, but the diversity of linguistic expressions for the same semantic content requires the construction of overly complex algorithms. Other methods, e.g. using LLMs, should be tried first.
Two ways of using the OpenAI API for information extraction were tested:
- produce simplified sentences, then apply the spaCy model and extract relationships
- use ChatGPT to extract triples (and thus relationships)
In both cases the results are not yet satisfactory and new approaches need to be sought, either by creating a paid account on OpenAI or by using Hugging Face models
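The second approach can be sketched as follows: ask the model for subject-predicate-object triples as JSON and parse the reply. The prompt wording and the JSON shape are assumptions, the model reply shown is hand-written for illustration, and the actual API call is only indicated.

```python
# Sketch of triple extraction via an LLM: build a prompt asking for JSON
# triples, then parse the reply. The reply below is a hand-written example,
# not real model output; the API call itself is omitted.
import json

def build_prompt(text: str) -> str:
    return (
        "Extract biographical facts from the text below as a JSON list of "
        "[subject, predicate, object] triples.\n\n" + text
    )

def parse_triples(reply: str):
    triples = json.loads(reply)
    return [tuple(t) for t in triples if len(t) == 3]

reply = '[["Abel", "born_in", "1802"], ["Abel", "died_in", "1829"]]'
triples = parse_triples(reply)
```

Keeping the parsing step separate from the API call makes it easy to validate model output and to discard malformed triples before they are stored as Linked Open Data.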