Merge pull request #567 from dice-group/thesis_ml
Retrieval Augmented Generation over KGE
Demirrr authored Nov 15, 2024
2 parents be93d52 + c55377b commit d6eef36
Showing 3 changed files with 63 additions and 33 deletions.
24 changes: 24 additions & 0 deletions pages/theses/BPE_KGE.mdx
@@ -0,0 +1,24 @@
---
date: '2024-11-15'
title: 'Byte pair encoding for Knowledge Graph Embeddings'
type: 'Bachelor'
supervisor: dice:CaglarDemir
contact: dice:CaglarDemir
---

# Topic
A knowledge graph embedding (KGE) model assigns a unique embedding row to each unique entity/node and relation/edge.
As the number of unique entities or relations grows, so does the memory usage of the KGE model.
Therefore, the memory required to train a KGE model, or to deploy a trained one, grows with the size of the data.
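
As a back-of-the-envelope sketch (the entity count, embedding dimension, and precision below are illustrative assumptions, not measurements), the embedding tables alone quickly reach tens of gigabytes:

```python
# Rough memory estimate for a conventional KGE model;
# all numbers are illustrative assumptions.
num_entities = 10_000_000   # unique entities/nodes
num_relations = 1_000       # unique relations/edges
dim = 512                   # embedding vector dimension
bytes_per_float = 4         # float32

memory_gb = (num_entities + num_relations) * dim * bytes_per_float / 1e9
print(f"{memory_gb:.1f} GB")  # ~20.5 GB for the embedding tables alone
```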

LLMs use byte pair encoding to learn to represent sequences of characters as subword units.
Therefore, LLM embeddings correspond to subword units rather than unique words.
Recently, we showed that the byte pair encoding scheme developed for LLMs can also be used for KGEs (see
[Inference over Unseen Entities, Relations and Literals on Knowledge Graphs](https://arxiv.org/pdf/2410.06742)).
In this thesis, the student will design a byte pair encoding scheme based on a given knowledge graph.
The student will work closely with [dice-embeddings](https://github.com/dice-group/dice-embeddings).
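
To illustrate the idea (using the GPT-2 vocabulary via `tiktoken` purely as a stand-in; in the thesis, the vocabulary would be learned from the knowledge graph itself), entity labels decompose into subword units, so the embedding table is bounded by the vocabulary size rather than the number of entities:

```python
import tiktoken  # any BPE tokenizer would do; tiktoken is an assumption

enc = tiktoken.get_encoding("gpt2")
for label in ["ComputerScientist", "Scientist", "CaglarDemir"]:
    token_ids = enc.encode(label)
    # The exact split depends on the learned vocabulary; labels that
    # overlap share subword units and hence embedding rows.
    print(label, "->", [enc.decode([t]) for t in token_ids])
```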


#### Question & Answer Session

In case you have further questions, feel free to contact [Caglar Demir](https://dice-research.org/CaglarDemir).
39 changes: 39 additions & 0 deletions pages/theses/RAG_KGE.mdx
@@ -0,0 +1,39 @@
---
date: '2024-11-15'
title: 'RAG over Neural Triple Stores'
type: 'Bachelor'
supervisor: dice:CaglarDemir
contact: dice:CaglarDemir
---

# Topic
Most knowledge graphs are incomplete.
Neural link predictors (which most knowledge graph embedding models are) can accurately infer missing knowledge, even when multi-hop reasoning is required.
For instance, from (CaglarDemir, type, ComputerScientist) and (ComputerScientist, subclass, Scientist), such a predictor can infer that CaglarDemir is a Scientist, although no single triple states this.

In this thesis, the student will focus on techniques that combine LLMs and neural link predictors in the context of retrieval-augmented generation (RAG).
By designing a novel and effective model, we aim to realise the following workflow:

1. A user asks a question.
2. An LLM renders the question into a first-order logic expression via prompt engineering.
3. The first-order logic expression is passed to a neural link predictor to perform [multi-hop query answering](https://github.com/dice-group/dice-embeddings?tab=readme-ov-file#answering-complex-queries).
4. The result (e.g., an ordered sequence of nodes/entities) is preprocessed and given to the LLM, which generates a fluent response to the user.


The student will work closely with [dice-embeddings](https://github.com/dice-group/dice-embeddings) and an LLM provided by us.

A simple working example:
```python
graph = {("ComputerScientist", "subclass", "Scientist"),
         ("Scientist", "subclass", "Person"),
         ("CaglarDemir", "type", "ComputerScientist")}
trained_kge = KGE().train(graph)  # train a KGE model on the toy graph
user_query = "What is the occupation of Caglar?"
llm_endpoint = ""  # endpoint of the LLM provided by us
response = students_work(user_query, trained_kge, llm_endpoint)
"""
response ~ Caglar Demir is a Computer Scientist.
"""
```
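
A minimal sketch of how `students_work` could be structured internally. The HTTP request/response format of the LLM endpoint and the `answer_multi_hop_query` method name (standing in for the multi-hop query answering interface of dice-embeddings linked above) are assumptions, not the actual APIs:

```python
import requests  # assumed HTTP client for the LLM endpoint

def students_work(user_query, trained_kge, llm_endpoint):
    """Sketch of the intended RAG pipeline; all interfaces are assumptions."""

    def llm(prompt):
        # Hypothetical request/response format for the LLM provided by us.
        return requests.post(llm_endpoint, json={"prompt": prompt}).json()["text"]

    # Step 2: render the question into a first-order logic expression.
    fol_query = llm(f"Translate into a first-order logic expression: {user_query}")

    # Step 3: multi-hop query answering with the neural link predictor
    # (placeholder method name for the dice-embeddings query-answering API).
    ranked_entities = trained_kge.answer_multi_hop_query(fol_query)

    # Step 4: verbalise the ranked entities as a fluent answer.
    return llm(f"Question: {user_query}\n"
               f"Retrieved entities: {ranked_entities}\n"
               "Answer in one fluent sentence:")
```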


#### Question & Answer Session

In case you have further questions, feel free to contact [Caglar Demir](https://dice-research.org/CaglarDemir).
33 changes: 0 additions & 33 deletions pages/theses/RobostEmbeddings.mdx

This file was deleted.
