Merge pull request #567 from dice-group/thesis_ml
Retrieval Augmented Generation over KGE
Demirrr authored Nov 15, 2024
2 parents be93d52 + c55377b commit d6eef36
Showing 3 changed files with 63 additions and 33 deletions.
24 changes: 24 additions & 0 deletions pages/theses/BPE_KGE.mdx
@@ -0,0 +1,24 @@
---
date: '2024-11-15'
title: 'Byte pair encoding for Knowledge Graph Embeddings'
type: 'Bachelor'
supervisor: dice:CaglarDemir
contact: dice:CaglarDemir
---

# Topic
A knowledge graph embedding (KGE) model assigns a unique embedding row to each unique entity/node and relation/edge.
As the number of unique entities or relations grows, so does the memory usage of the KGE model.
Therefore, the memory required to train a KGE model, or to deploy a trained one, grows with the size of the data.
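
As a back-of-the-envelope sketch (the entity count, embedding dimension, and precision below are illustrative assumptions, not measurements), the embedding tables alone quickly reach tens of gigabytes:

```python
# Rough memory estimate for a conventional KGE model;
# all numbers are illustrative assumptions.
num_entities = 10_000_000   # unique entities/nodes
num_relations = 1_000       # unique relations/edges
dim = 512                   # embedding vector dimension
bytes_per_float = 4         # float32

memory_gb = (num_entities + num_relations) * dim * bytes_per_float / 1e9
print(f"{memory_gb:.1f} GB")  # ~20.5 GB for the embedding tables alone
```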

LLMs use byte pair encoding to learn to represent sequences of characters as subword units.
Therefore, LLM embeddings correspond to subword units rather than unique words.
Recently, we showed that the byte pair encoding scheme developed for LLMs can also be used for KGEs (see
[Inference over Unseen Entities, Relations and Literals on Knowledge Graphs](https://arxiv.org/pdf/2410.06742)).
In this thesis, the student will design a byte pair encoding scheme based on a given knowledge graph.
The student will work closely with [dice-embeddings](https://github.com/dice-group/dice-embeddings).
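
To illustrate the idea (using the GPT-2 vocabulary via `tiktoken` purely as a stand-in; in the thesis, the vocabulary would be learned from the knowledge graph itself), entity labels decompose into subword units, so the embedding table is bounded by the vocabulary size rather than the number of entities:

```python
import tiktoken  # any BPE tokenizer would do; tiktoken is an assumption

enc = tiktoken.get_encoding("gpt2")
for label in ["ComputerScientist", "Scientist", "CaglarDemir"]:
    token_ids = enc.encode(label)
    # The exact split depends on the learned vocabulary; labels that
    # overlap share subword units and hence embedding rows.
    print(label, "->", [enc.decode([t]) for t in token_ids])
```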


#### Question & Answer Session

In case you have further questions, feel free to contact [Caglar Demir](https://dice-research.org/CaglarDemir).
39 changes: 39 additions & 0 deletions pages/theses/RAG_KGE.mdx
@@ -0,0 +1,39 @@
---
date: '2024-11-15'
title: 'RAG over Neural Triple Stores'
type: 'Bachelor'
supervisor: dice:CaglarDemir
contact: dice:CaglarDemir
---

# Topic
Most knowledge graphs are incomplete.
Neural link predictors (which most knowledge graph embedding models are) can accurately infer missing knowledge, even when multi-hop reasoning is required.
For instance, from (CaglarDemir, type, ComputerScientist) and (ComputerScientist, subclass, Scientist), such a predictor can infer that CaglarDemir is a Scientist, although no single triple states this.

In this thesis, the student will focus on techniques that combine LLMs and neural link predictors in the context of retrieval-augmented generation (RAG).
By designing a novel and effective model, we aim to realise the following workflow:

1. A user asks a question.
2. An LLM renders the question into a first-order logic expression via prompt engineering.
3. The first-order logic expression is passed to a neural link predictor to perform [multi-hop query answering](https://github.com/dice-group/dice-embeddings?tab=readme-ov-file#answering-complex-queries).
4. The result (e.g., an ordered sequence of nodes/entities) is preprocessed and given to the LLM, which generates a fluent response to the user.


The student will work closely with [dice-embeddings](https://github.com/dice-group/dice-embeddings) and an LLM provided by us.

A simple working example:
```python
graph = {("ComputerScientist", "subclass", "Scientist"),
         ("Scientist", "subclass", "Person"),
         ("CaglarDemir", "type", "ComputerScientist")}
trained_kge = KGE().train(graph)  # train a KGE model on the toy graph
user_query = "What is the occupation of Caglar?"
llm_endpoint = ""  # endpoint of the LLM provided by us
response = students_work(user_query, trained_kge, llm_endpoint)
"""
response ~ Caglar Demir is a Computer Scientist.
"""
```
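
A minimal sketch of how `students_work` could be structured internally. The HTTP request/response format of the LLM endpoint and the `answer_multi_hop_query` method name (standing in for the multi-hop query answering interface of dice-embeddings linked above) are assumptions, not the actual APIs:

```python
import requests  # assumed HTTP client for the LLM endpoint

def students_work(user_query, trained_kge, llm_endpoint):
    """Sketch of the intended RAG pipeline; all interfaces are assumptions."""

    def llm(prompt):
        # Hypothetical request/response format for the LLM provided by us.
        return requests.post(llm_endpoint, json={"prompt": prompt}).json()["text"]

    # Step 2: render the question into a first-order logic expression.
    fol_query = llm(f"Translate into a first-order logic expression: {user_query}")

    # Step 3: multi-hop query answering with the neural link predictor
    # (placeholder method name for the dice-embeddings query-answering API).
    ranked_entities = trained_kge.answer_multi_hop_query(fol_query)

    # Step 4: verbalise the ranked entities as a fluent answer.
    return llm(f"Question: {user_query}\n"
               f"Retrieved entities: {ranked_entities}\n"
               "Answer in one fluent sentence:")
```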


#### Question & Answer Session

In case you have further questions, feel free to contact [Caglar Demir](https://dice-research.org/CaglarDemir).
33 changes: 0 additions & 33 deletions pages/theses/RobostEmbeddings.mdx

This file was deleted.
