Cannot generate clusters with learnnrnoisy #15

Open
fburdet opened this issue Jun 22, 2022 · 2 comments

fburdet commented Jun 22, 2022

Hello,

We were able to run SAFRAN applymax successfully and explore the results with LinkExplorer, thanks again for your help!

Now we would like to run the non-redundant algorithms.

First, we ran calcjacc with this config:
PATH_TRAINING = Adsicore_V04.tsv
PATH_TEST = DB05419.test.tsv
PATH_VALID = valid.txt

PATH_RULES = rules/alpha-1000

WORKER_THREADS = 30

VERBOSE = 1

PATH_JACCARD = jaccard.V02

This takes quite some time. We then ran learnnrnoisy with this config:

PATH_TRAINING = Adsicore_V04.tsv
PATH_TEST = DB05419.test.tsv
PATH_VALID = valid.txt

PATH_JACCARD = jaccard.V02
PATH_RULES = rules/alpha-1000

PATH_OUTPUT = predictions.learnnrnoisy.V02

WORKER_THREADS = 15

PATH_CLUSTER = cluster.adsicore.V02.txt

This runs pretty quickly. Unfortunately, cluster.adsicore.V02.txt is empty.

Note that the test set only contains triples about one molecule (54 lines). Could it be that no clusters were found at all? (Adsicore_V04.tsv contains 6411664 lines.)

nomisto (Contributor) commented Jun 22, 2022

Ah, I remember now: valid.txt is empty, right? The quality of the clustering is evaluated on a validation set (hyperparameter tuning), so you would have to split your training set into a training set and a validation set, e.g. by holding out 10% of the triples for validation.

If you don't mind the evaluation being somewhat unfair, you could use the split only for learnnrnoisy and use the whole training set again for applynrnoisy. However, please be aware that the training set in applynrnoisy would then contain triples that were used for hyperparameter tuning in learnnrnoisy.
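A minimal sketch of such a split, in case it helps. It assumes one triple per line in the training file; the function names, the 10% default, and the fixed seed are illustrative choices and not part of SAFRAN.

```python
import random


def split_triples(lines, valid_fraction=0.1, seed=42):
    """Shuffle the triples and hold out a fraction of them for validation.

    Returns (train, valid). Blank lines are dropped.
    """
    triples = [ln for ln in lines if ln.strip()]
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(triples)
    n_valid = int(len(triples) * valid_fraction)
    return triples[n_valid:], triples[:n_valid]


def split_file(train_path, out_train_path, out_valid_path,
               valid_fraction=0.1, seed=42):
    """Read a triples file and write separate train and validation files."""
    with open(train_path, encoding="utf-8") as f:
        lines = f.read().splitlines()
    train, valid = split_triples(lines, valid_fraction, seed)
    for path, part in ((out_train_path, train), (out_valid_path, valid)):
        with open(path, "w", encoding="utf-8") as f:
            f.write("\n".join(part) + "\n")


# Example call (the output file names are hypothetical):
# split_file("Adsicore_V04.tsv", "train.split.tsv", "valid.split.tsv")
```

PATH_TRAINING and PATH_VALID in the learnnrnoisy config would then point to the two output files.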

fburdet (Author) commented Jun 24, 2022

Hello,

Thanks for the answer!

Indeed, specifying a non-empty valid.txt seems to launch many more calculations.

I'm afraid it's going to take days, though. Is the runtime proportional to the number of triples in valid.txt? There are 641186 in mine, it seems to take about 24 hours per rule relation, and there are 95 of them. Any idea how to speed that up?
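One thing that could be tried is shrinking valid.txt by capping the number of validation triples per relation. This is only a sketch: whether the runtime really scales with the size of valid.txt has not been confirmed in this thread. It assumes tab-separated lines of the form head, relation, tail; the function name and the cap of 1000 are hypothetical. A smaller validation set will also make the hyperparameter tuning noisier.

```python
import random
from collections import defaultdict


def subsample_per_relation(lines, max_per_relation=1000, seed=0):
    """Keep at most max_per_relation validation triples for each relation.

    Assumes each line is "head<TAB>relation<TAB>tail"; other lines are skipped.
    """
    by_relation = defaultdict(list)
    for ln in lines:
        ln = ln.rstrip("\n")
        parts = ln.split("\t")
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        by_relation[parts[1]].append(ln)

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sampled = []
    for relation in sorted(by_relation):
        triples = by_relation[relation]
        if len(triples) > max_per_relation:
            triples = rng.sample(triples, max_per_relation)
        sampled.extend(triples)
    return sampled
```

Relations with fewer triples than the cap are kept whole, so rare relations still have validation data.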
