Merge pull request #248 from dice-group/develop

Prep for the new release

Demirrr authored Jun 26, 2024
2 parents dae330e + 3eebbac commit 4bc42ae
Showing 16 changed files with 400 additions and 1,508 deletions.
52 changes: 22 additions & 30 deletions README.md
@@ -35,7 +35,7 @@ Deploy a pre-trained embedding model without writing a single line of code.
### Installation from Source
``` bash
git clone https://github.com/dice-group/dice-embeddings.git
-conda create -n dice python=3.10.13 --no-default-packages && conda activate dice && cd dice-embeddings &&
+conda create -n dice python=3.10.13 --no-default-packages && conda activate dice
pip3 install -e .
```
or
@@ -48,7 +48,7 @@ wget https://files.dice-research.org/datasets/dice-embeddings/KGs.zip --no-check
```
To test the installation
```bash
-python -m pytest -p no:warnings -x # Runs >114 tests leading to > 15 mins
+python -m pytest -p no:warnings -x # Runs >119 tests leading to > 15 mins
python -m pytest -p no:warnings --lf # run only the last failed test
python -m pytest -p no:warnings --ff # to run the failures first and then the rest of the tests.
```
@@ -95,45 +95,26 @@ A KGE model can also be trained from the command line
```bash
dicee --dataset_dir "KGs/UMLS" --model Keci --eval_model "train_val_test"
```
-dicee automaticaly detects available GPUs and trains a model with distributed data parallels technique. Under the hood, dicee uses lighning as a default trainer.
+dicee automatically detects available GPUs and trains a model with the distributed data parallel technique.
```bash
# Train a model using only GPU 0
CUDA_VISIBLE_DEVICES=0 dicee --dataset_dir "KGs/UMLS" --model Keci --eval_model "train_val_test"
# Train a model using only GPU 1
CUDA_VISIBLE_DEVICES=1 dicee --dataset_dir "KGs/UMLS" --model Keci --eval_model "train_val_test"
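# Train a model on GPUs 0 and 1 with the lightning (PL) trainer, disabling NCCL peer-to-peer transport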
NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES=0,1 python dicee/scripts/run.py --trainer PL --dataset_dir "KGs/UMLS" --model Keci --eval_model "train_val_test"
# Train a model by using all available GPUs
dicee --dataset_dir "KGs/UMLS" --model Keci --eval_model "train_val_test"
```
-Under the hood, dicee executes run.py script and uses lighning as a default trainer
+Under the hood, dicee executes the run.py script and uses [lightning](https://lightning.ai/) as the default trainer.
```bash
# Two equivalent executions
# (1)
dicee --dataset_dir "KGs/UMLS" --model Keci --eval_model "train_val_test"
# Evaluate Keci on Train set
# {'H@1': 0.9518788343558282, 'H@3': 0.9988496932515337, 'H@10': 1.0, 'MRR': 0.9753123402351737}
# Evaluate Keci on Validation set
# {'H@1': 0.6932515337423313, 'H@3': 0.9041411042944786, 'H@10': 0.9754601226993865, 'MRR': 0.8072362996241839}
# Evaluate Keci on Test set
# {'H@1': 0.6951588502269289, 'H@3': 0.9039334341906202, 'H@10': 0.9750378214826021, 'MRR': 0.8064032293278861}

# (2)
CUDA_VISIBLE_DEVICES=0,1 python dicee/scripts/run.py --trainer PL --dataset_dir "KGs/UMLS" --model Keci --eval_model "train_val_test"
# Evaluate Keci on Train set
# {'H@1': 0.9518788343558282, 'H@3': 0.9988496932515337, 'H@10': 1.0, 'MRR': 0.9753123402351737}
# Evaluate Keci on Validation set
# {'H@1': 0.6932515337423313, 'H@3': 0.9041411042944786, 'H@10': 0.9754601226993865, 'MRR': 0.8072362996241839}
# Evaluate Keci on Test set
# {'H@1': 0.6951588502269289, 'H@3': 0.9039334341906202, 'H@10': 0.9750378214826021, 'MRR': 0.8064032293278861}
```
Similarly, models can be easily trained with torchrun
```bash
torchrun --standalone --nnodes=1 --nproc_per_node=gpu dicee/scripts/run.py --trainer torchDDP --dataset_dir "KGs/UMLS" --model Keci --eval_model "train_val_test"
# Evaluate Keci on Train set
# {'H@1': 0.9518788343558282, 'H@3': 0.9988496932515337, 'H@10': 1.0, 'MRR': 0.9753123402351737}
# Evaluate Keci on Validation set
# {'H@1': 0.6932515337423313, 'H@3': 0.9041411042944786, 'H@10': 0.9754601226993865, 'MRR': 0.8072499937521418}
# Evaluate Keci on Test set
# {'H@1': 0.6951588502269289, 'H@3': 0.9039334341906202, 'H@10': 0.9750378214826021, 'MRR': 0.8064032293278861}
```
You can also train a model in a multi-node, multi-GPU setting.
```bash
@@ -143,7 +124,7 @@ torchrun --nnodes 2 --nproc_per_node=gpu --node_rank 1 --rdzv_id 455 --rdzv_bac
Train a KGE model by providing the path of a single file and store all parameters under a newly created directory called `KeciFamilyRun`.
```bash
-dicee --path_single_kg "KGs/Family/family-benchmark_rich_background.owl" --model Keci --path_to_store_single_run KeciFamilyRun --backend rdflib
+dicee --path_single_kg "KGs/Family/family-benchmark_rich_background.owl" --model Keci --path_to_store_single_run KeciFamilyRun --backend rdflib --eval_model None
```
where the data is in the following form
```bash
@@ -152,6 +133,11 @@ _:1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07
<http://www.benchmark.org/family#hasChild> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#ObjectProperty> .
<http://www.benchmark.org/family#hasParent> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#ObjectProperty> .
```
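As a quick sanity check, the input file can be parsed before training (a minimal sketch, assuming `rdflib` is installed, which the `--backend rdflib` option requires; `Graph.parse` infers the RDF/XML serialization from the `.owl` extension):
```python
from rdflib import Graph

# Parse the same file that --path_single_kg points to above.
g = Graph().parse("KGs/Family/family-benchmark_rich_background.owl")
print(len(g), "triples parsed")
```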
**Continual Training:** The training of a pretrained model can be resumed.
```bash
dicee --continual_learning KeciFamilyRun --path_single_kg "KGs/Family/family-benchmark_rich_background.owl" --model Keci --path_to_store_single_run KeciFamilyRun --backend rdflib --eval_model None
```
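After the resumed run finishes, the updated model can be reloaded from the run directory and queried (a minimal sketch; `KGE(path=...)` mirrors the pretrained-model snippets later in this README, and the entity/relation IRIs are illustrative placeholders):
```python
from dicee import KGE

# Reload the run directory created/updated above.
pre_trained_kge = KGE(path="KeciFamilyRun")
# Illustrative query: head and relation must be IRIs that occur in the input KG.
pre_trained_kge.predict_topk(h=["http://www.benchmark.org/family#F9M167"],
                             r=["http://www.benchmark.org/family#hasChild"],
                             topk=3)
```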

**Apart from n-triples or standard link prediction dataset formats, we support ["owl", "nt", "turtle", "rdf/xml", "n3"].**
Moreover, a KGE model can also be trained by providing **an endpoint of a triple store**.
```bash
@@ -285,16 +271,22 @@ pre_trained_kge.predict_topk(r=[".."],t=[".."],topk=10)

## Downloading Pretrained Models

We provide plenty of pretrained knowledge graph embedding models at [dice-research.org/projects/DiceEmbeddings/](https://files.dice-research.org/projects/DiceEmbeddings/).
<details> <summary> To see a code snippet </summary>

```python
from dicee import KGE
# (1) Load a pretrained Keci on KINSHIP
model = KGE(url="https://files.dice-research.org/projects/DiceEmbeddings/KINSHIP-Keci-dim128-epoch256-KvsAll")
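# (2) Load pretrained MuRE, QuatE, and Keci on YAGO3-10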
mure = KGE(url="https://files.dice-research.org/projects/DiceEmbeddings/YAGO3-10-Pykeen_MuRE-dim128-epoch256-KvsAll")
quate = KGE(url="https://files.dice-research.org/projects/DiceEmbeddings/YAGO3-10-Pykeen_QuatE-dim128-epoch256-KvsAll")
keci = KGE(url="https://files.dice-research.org/projects/DiceEmbeddings/YAGO3-10-Keci-dim128-epoch256-KvsAll")
quate.predict_topk(h=["Mongolia"],r=["isLocatedIn"],topk=3)
# [('Asia', 0.9894362688064575), ('Europe', 0.01575559377670288), ('Tadanari_Lee', 0.012544365599751472)]
keci.predict_topk(h=["Mongolia"],r=["isLocatedIn"],topk=3)
# [('Asia', 0.6522021293640137), ('Chinggis_Khaan_International_Airport', 0.36563414335250854), ('Democratic_Party_(Mongolia)', 0.19600993394851685)]
mure.predict_topk(h=["Mongolia"],r=["isLocatedIn"],topk=3)
# [('Asia', 0.9996906518936157), ('Ulan_Bator', 0.0009907372295856476), ('Philippines', 0.0003116439620498568)]
```

- For more, please look at [dice-research.org/projects/DiceEmbeddings/](https://files.dice-research.org/projects/DiceEmbeddings/)

</details>

## How to Deploy
2 changes: 2 additions & 0 deletions dicee/config.py
@@ -133,6 +133,8 @@ def __init__(self, **kwargs):
self.block_size: int = None
"block size of LLM"

+self.continual_learning = None
+"Path of a pretrained model directory"

def __iter__(self):
# Iterate
2 changes: 1 addition & 1 deletion dicee/evaluator.py
@@ -456,7 +456,7 @@ def dummy_eval(self, trained_model, form_of_labelling: str):
valid_set=valid_set,
test_set=test_set,
trained_model=trained_model)
-elif self.args.scoring_technique in ['KvsAll', 'KvsSample', '1vsAll', 'PvsAll', 'CCvsAll']:
+elif self.args.scoring_technique in ["AllvsAll", 'KvsAll', 'KvsSample', '1vsAll']:
self.eval_with_vs_all(train_set=train_set,
valid_set=valid_set,
test_set=test_set,
31 changes: 16 additions & 15 deletions dicee/executer.py
@@ -234,31 +234,32 @@ class ContinuousExecute(Execute):
(1) Loading & Preprocessing & Serializing input data.
(2) Training & Validation & Testing
(3) Storing all necessary info
During continual learning, only the *** num_epochs *** parameter can be modified.
The trained model is stored in the same folder as the seed model and is tagged with the current time.
"""

def __init__(self, args):
-    assert os.path.exists(args.path_experiment_folder)
-    assert os.path.isfile(args.path_experiment_folder + '/configuration.json')
-    # (1) Load Previous input configuration
-    previous_args = load_json(args.path_experiment_folder + '/configuration.json')
-    dargs = vars(args)
-    del args
-    for k in list(dargs.keys()):
-        if dargs[k] is None:
-            del dargs[k]
-    # (2) Update (1) with new input
-    previous_args.update(dargs)
+    # (1) Check the current input configuration.
+    assert os.path.exists(args.continual_learning)
+    assert os.path.isfile(args.continual_learning + '/configuration.json')
+    # (2) Load the previous input configuration.
+    previous_args = load_json(args.continual_learning + '/configuration.json')
+    args = vars(args)
+    # (3) Carry over the new number of epochs and the pretrained-model path.
+    previous_args["num_epochs"] = args["num_epochs"]
+    previous_args["continual_learning"] = args["continual_learning"]
+    print("Updated configuration:", previous_args)
    try:
-        report = load_json(dargs['path_experiment_folder'] + '/report.json')
+        report = load_json(args['continual_learning'] + '/report.json')
        previous_args['num_entities'] = report['num_entities']
        previous_args['num_relations'] = report['num_relations']
    except AssertionError:
        print("Couldn't find report.json.")
    previous_args = SimpleNamespace(**previous_args)
-    previous_args.full_storage_path = previous_args.path_experiment_folder
    print('ContinuousExecute starting...')
    print(previous_args)
+    # TODO: can we remove continuous_training from Execute?
    super().__init__(previous_args, continuous_training=True)

def continual_start(self) -> dict:
@@ -279,7 +280,7 @@ def continual_start(self) -> dict:
"""
# (1)
self.trainer = DICE_Trainer(args=self.args, is_continual_training=True,
-                            storage_path=self.args.path_experiment_folder)
+                            storage_path=self.args.continual_learning)
# (2)
self.trained_model, form_of_labelling = self.trainer.continual_start()

1 change: 1 addition & 0 deletions dicee/models/__init__.py
@@ -6,3 +6,4 @@
from .clifford import Keci, KeciBase, CMult, DeCaL # noqa
from .pykeen_models import * # noqa
from .function_space import * # noqa
+from .dualE import DualE
2 changes: 2 additions & 0 deletions dicee/models/base_model.py
@@ -431,6 +431,8 @@ class IdentityClass(torch.nn.Module):
def __init__(self, args=None):
    super().__init__()
    self.args = args

+def __call__(self, x):
+    return x

@staticmethod
def forward(x):
88 changes: 60 additions & 28 deletions dicee/models/clifford.py
@@ -764,7 +764,7 @@ def forward_triples(self, x: torch.Tensor) -> torch.FloatTensor:
Parameter
---------
-x: torch.LongTensor with (n,3) shape
+x: torch.LongTensor with (n, ) shape
Returns
-------
@@ -844,9 +844,9 @@ def forward_triples(self, x: torch.Tensor) -> torch.FloatTensor:
sigma_qr = 0
return h0r0t0 + score_p + score_q + score_r + sigma_pp + sigma_qq + sigma_rr + sigma_pq + sigma_qr + sigma_pr

-def cl_pqr(self, a):
+def cl_pqr(self, a: torch.tensor) -> torch.tensor:

-''' Input: tensor(batch_size, emb_dim) ----> output: tensor with 1+p+q+r components with size (batch_size, emb_dim/(1+p+q+r)) each.
+''' Input: tensor(batch_size, emb_dim) ---> output: tensor with 1+p+q+r components with size (batch_size, emb_dim/(1+p+q+r)) each.
1) takes a tensor of size (batch_size, emb_dim) and splits it into 1 + p + q + r components; hence 1 + p + q + r must be a divisor of emb_dim.
@@ -861,17 +861,25 @@ def compute_sigmas_single(self, list_h_emb, list_r_emb, list_t_emb):
def compute_sigmas_single(self, list_h_emb, list_r_emb, list_t_emb):

'''here we compute all the sums with no interaction with other base vectors, taken with the scalar product with t, that is,

-1) s0 = h_0r_0t_0
-2) s1 = \sum_{i=1}^{p}h_ir_it_0
-3) s2 = \sum_{j=p+1}^{p+q}h_jr_jt_0
-4) s3 = \sum_{i=1}^{q}(h_0r_it_i + h_ir_0t_i)
-5) s4 = \sum_{i=p+1}^{p+q}(h_0r_it_i + h_ir_0t_i)
-5) s5 = \sum_{i=p+q+1}^{p+q+r}(h_0r_it_i + h_ir_0t_i)
+.. math::
+
+    s_0 = h_0r_0t_0
+    s_1 = \sum_{i=1}^{p}h_ir_it_0
+    s_2 = \sum_{j=p+1}^{p+q}h_jr_jt_0
+    s_3 = \sum_{i=1}^{q}(h_0r_it_i + h_ir_0t_i)
+    s_4 = \sum_{i=p+1}^{p+q}(h_0r_it_i + h_ir_0t_i)
+    s_5 = \sum_{i=p+q+1}^{p+q+r}(h_0r_it_i + h_ir_0t_i)

and return:

-*) sigma_0t = \sigma_0 \cdot t_0 = s0 + s1 -s2
-*) s3, s4 and s5
+.. math::
+
+    \sigma_{0t} = \sigma_0 \cdot t_0 = s_0 + s_1 - s_2
+
+together with s_3, s_4 and s_5.
'''

p = self.p
q = self.q
@@ -906,15 +914,19 @@ def compute_sigmas_multivect(self, list_h_emb, list_r_emb):
For same-basis vector interactions we have

-1) \sigma_pp = \sum_{i=1}^{p-1}\sum_{i'=i+1}^{p}(h_ir_{i'}-h_{i'}r_i) (models the interactions between e_i and e_i' for 1 <= i, i' <= p)
-2) \sigma_qq = \sum_{j=p+1}^{p+q-1}\sum_{j'=j+1}^{p+q}(h_jr_{j'}-h_{j'} (models the interactions between e_j and e_j' for p+1 <= j, j' <= p+q)
-3) \sigma_rr = \sum_{k=p+q+1}^{p+q+r-1}\sum_{k'=k+1}^{p}(h_kr_{k'}-h_{k'}r_k) (models the interactions between e_k and e_k' for p+q+1 <= k, k' <= p+q+r)
+.. math::
+
+    \sigma_{pp} = \sum_{i=1}^{p-1}\sum_{i'=i+1}^{p}(h_ir_{i'}-h_{i'}r_i) (models the interactions between e_i and e_i' for 1 <= i, i' <= p)
+    \sigma_{qq} = \sum_{j=p+1}^{p+q-1}\sum_{j'=j+1}^{p+q}(h_jr_{j'}-h_{j'}r_j) (models the interactions between e_j and e_j' for p+1 <= j, j' <= p+q)
+    \sigma_{rr} = \sum_{k=p+q+1}^{p+q+r-1}\sum_{k'=k+1}^{p+q+r}(h_kr_{k'}-h_{k'}r_k) (models the interactions between e_k and e_k' for p+q+1 <= k, k' <= p+q+r)

For different-basis vector interactions, we have

-4) \sigma_pq = \sum_{i=1}^{p}\sum_{j=p+1}^{p+q}(h_ir_j - h_jr_i) (interactionsn between e_i and e_j for 1<=i <=p and p+1<= j <= p+q)
-5) \sigma_pr = \sum_{i=1}^{p}\sum_{k=p+q+1}^{p+q+r}(h_ir_k - h_kr_i) (interactionsn between e_i and e_k for 1<=i <=p and p+q+1<= k <= p+q+r)
-6) \sigma_qr = \sum_{j=p+1}^{p+q}\sum_{j=p+q+1}^{p+q+r}(h_jr_k - h_kr_j) (interactionsn between e_j and e_k for p+1 <= j <=p+q and p+q+1<= j <= p+q+r)
+.. math::
+
+    \sigma_{pq} = \sum_{i=1}^{p}\sum_{j=p+1}^{p+q}(h_ir_j - h_jr_i) (interactions between e_i and e_j for 1 <= i <= p and p+1 <= j <= p+q)
+    \sigma_{pr} = \sum_{i=1}^{p}\sum_{k=p+q+1}^{p+q+r}(h_ir_k - h_kr_i) (interactions between e_i and e_k for 1 <= i <= p and p+q+1 <= k <= p+q+r)
+    \sigma_{qr} = \sum_{j=p+1}^{p+q}\sum_{k=p+q+1}^{p+q+r}(h_jr_k - h_kr_j) (interactions between e_j and e_k for p+1 <= j <= p+q and p+q+1 <= k <= p+q+r)
'''

@@ -958,15 +970,15 @@ def forward_k_vs_all(self, x: torch.Tensor) -> torch.FloatTensor:
"""
KvsAll training

-(1) Retrieve real-valued embedding vectors for heads and relations \mathbb{R}^d .
-(2) Construct head entity and relation embeddings according to Cl_{p,q}(\mathbb{R}^d) .
+(1) Retrieve real-valued embedding vectors for heads and relations.
+(2) Construct head entity and relation embeddings according to Cl_{p,q,r}(\mathbb{R}^d).
(3) Perform Cl multiplication
(4) Inner product of (3) and all entity embeddings

forward_k_vs_with_explicit and this function are identical.

Parameter
---------
-x: torch.LongTensor with (n,2) shape
+x: torch.LongTensor with (n, ) shape
Returns
-------
torch.FloatTensor with (n, |E|) shape
@@ -1097,9 +1109,12 @@ def construct_cl_multivector(self, x: torch.FloatTensor, re: int, p: int, q: int

def compute_sigma_pp(self, hp, rp):
"""
-\sigma_{p,p}^* = \sum_{i=1}^{p-1}\sum_{i'=i+1}^{p}(x_iy_{i'}-x_{i'}y_i)
+Compute
+
+.. math::
+
+    \sigma_{p,p}^* = \sum_{i=1}^{p-1}\sum_{i'=i+1}^{p}(x_iy_{i'}-x_{i'}y_i)

-sigma_{pp} captures the interactions between along p bases
+\sigma_{pp} captures the interactions among the p bases.
For instance, for p = 3 with bases e_1, e_2, e_3, we compute the interactions between e_1e_2, e_1e_3, and e_2e_3.
This can be implemented with two nested for loops
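Equivalently, the two loops can be vectorized. A minimal illustrative sketch (not the repository's actual implementation; it assumes the p components sit in the last tensor dimension):
```python
import torch

def sigma_pp_sketch(hp: torch.Tensor, rp: torch.Tensor) -> torch.Tensor:
    """Pairwise antisymmetric products (x_i y_{i'} - x_{i'} y_i) for all i < i'."""
    p = hp.shape[-1]
    # Index pairs (i, i') from the upper triangle of the p x p interaction matrix.
    i, i_prime = torch.triu_indices(p, p, offset=1)
    return hp[..., i] * rp[..., i_prime] - hp[..., i_prime] * rp[..., i]
```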
@@ -1125,7 +1140,12 @@ def compute_sigma_qq(self, hq, rq):

def compute_sigma_qq(self, hq, rq):
"""
-Compute \sigma_{q,q}^* = \sum_{j=p+1}^{p+q-1}\sum_{j'=j+1}^{p+q}(x_jy_{j'}-x_{j'}y_j) Eq. 16
+Compute (Eq. 16)
+
+.. math::
+
+    \sigma_{q,q}^* = \sum_{j=p+1}^{p+q-1}\sum_{j'=j+1}^{p+q}(x_jy_{j'}-x_{j'}y_j)

\sigma_{qq} captures the interactions among the q bases.
For instance, for q = 3 with bases e_1, e_2, e_3, we compute the interactions between e_1e_2, e_1e_3, and e_2e_3.
This can be implemented with two nested for loops
@@ -1157,7 +1177,9 @@ def compute_sigma_qq(self, hq, rq):

def compute_sigma_rr(self, hk, rk):
"""
-\sigma_{r,r}^* = \sum_{k=p+q+1}^{p+q+r-1}\sum_{k'=k+1}^{p}(x_ky_{k'}-x_{k'}y_k)
+.. math::
+
+    \sigma_{r,r}^* = \sum_{k=p+q+1}^{p+q+r-1}\sum_{k'=k+1}^{p+q+r}(x_ky_{k'}-x_{k'}y_k)
"""
# Compute indexes for the upper triangle of p by p matrix
Expand All @@ -1173,7 +1195,11 @@ def compute_sigma_rr(self, hk, rk):

def compute_sigma_pq(self, *, hp, hq, rp, rq):
"""
-\sum_{i=1}^{p} \sum_{j=p+1}^{p+q} (h_i r_j - h_j r_i) e_i e_j
+Compute
+
+.. math::
+
+    \sum_{i=1}^{p} \sum_{j=p+1}^{p+q} (h_i r_j - h_j r_i) e_i e_j
results = []
sigma_pq = torch.zeros(b, r, p, q)
@@ -1189,7 +1215,11 @@ def compute_sigma_pr(self, *, hp, hk, rp, rk):

def compute_sigma_pr(self, *, hp, hk, rp, rk):
"""
-\sum_{i=1}^{p} \sum_{j=p+1}^{p+q} (h_i r_j - h_j r_i) e_i e_j
+Compute
+
+.. math::
+
+    \sum_{i=1}^{p} \sum_{k=p+q+1}^{p+q+r} (h_i r_k - h_k r_i) e_i e_k
results = []
sigma_pq = torch.zeros(b, r, p, q)
@@ -1205,7 +1235,9 @@ def compute_sigma_qr(self, *, hq, hk, rq, rk):

def compute_sigma_qr(self, *, hq, hk, rq, rk):
"""
-\sum_{i=1}^{p} \sum_{j=p+1}^{p+q} (h_i r_j - h_j r_i) e_i e_j
+.. math::
+
+    \sum_{j=p+1}^{p+q} \sum_{k=p+q+1}^{p+q+r} (h_j r_k - h_k r_j) e_j e_k
results = []
sigma_pq = torch.zeros(b, r, p, q)