This open-source project contains the Python implementation of our approach, HybridFC. It is designed to ease real-world fact-checking over knowledge graphs and to produce better results. To this end, we rely on:
- PyTorch Lightning to perform training via multiple CPUs, GPUs, TPUs, or computing clusters,
- Pre-trained KG embeddings for the knowledge-graph-based component,
- Elasticsearch to load the text corpus (Wikipedia) for the text-based component, and
- A path-based approach to calculate the output score for the path-based component.
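The scores of these three components are fused into a single veracity score by a trained classifier. As a minimal illustration only (a fixed weighted average with hypothetical weights; HybridFC itself learns the fusion end-to-end):

```python
# Minimal sketch, NOT the exact architecture from the paper: each component
# yields a score in [0, 1] for a triple, and a final classifier fuses them.
# Here the fusion is a fixed weighted average purely for illustration.
def hybrid_score(kge_score, text_score, path_score,
                 weights=(0.4, 0.4, 0.2)):  # hypothetical weights
    scores = (kge_score, text_score, path_score)
    return sum(w * s for w, s in zip(weights, scores))

print(round(hybrid_score(0.9, 0.8, 0.3), 2))  # 0.74
```

In the actual model, the component outputs are concatenated and passed through a trained neural classifier rather than averaged with fixed weights.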
First, clone the repository:
git clone https://github.com/factcheckerr/HybridFC.git
cd HybridFC
There are two options to reproduce the results:
- use pre-generated data, or
- regenerate the data from scratch.

Please choose one of these two options.
Download and unzip the data and embedding files in the root folder of the project:
pip install gdown
wget https://files.dice-research.org/datasets/ISWC2022_HybridFC/data.zip
unzip data.zip
Note: if you get a "permission denied" error, try running the commands with "sudo".
If you do not want to use the pre-generated data, follow these steps:
- Run KGV to collect results from FactCheck and COPAAL.
- Run FactCheck on the FactBench, FaVel, and BPDP datasets using Wikipedia as the reference corpus.

As input, the user needs the output of FactCheck in JSON format. The format is as follows:
-=-=-=-=-=-=-==-=-==-=-=-=-==-=-=-=-=-=-=-==-=-=-=-
10
/factbench/test/correct/death/death_00053.ttl
defactoScore: 0.98 setProofSentences : [ComplexProofs{website='https://en.wikipedia.org/wiki/Reba White Williams', proofPhrase='In 1999 , White Williams ran unsuccessfully for the New York City City Council in District 4 .', trustworthinessScore='0.997908778988452'}, ComplexProofs{website='https://en.wikipedia.org/wiki/James Leo Herlihy', proofPhrase='Like Williams , Herlihy had lived in New York City .', trustworthinessScore='0.9975670565782072'}, ComplexProofs{website='https://en.wikipedia.org/wiki/Charles Williams (musician)', proofPhrase='Charles Isaac Williams -LRB- born July 18 , 1932 -RRB- is an alto saxophonist based in New York City .', trustworthinessScore='0.9991775993927828'}] subject : Tennessee Williams object : New York City predicate deathPlace
-=-=-=-=-=-=-==-=-==-=-=-=-==-=-=-=-=-=-=-==-=-=-=-
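The `defactoScore` and the proof sentences can be pulled out of such a line with a small parser. The sketch below is a hypothetical helper (not part of the repository) whose regular expressions follow the field names in the example above:

```python
import re

# Hypothetical helper: parse one FactCheck result line into a score and a
# list of proofs. Field names follow the example output shown above.
def parse_factcheck_line(line):
    score = float(re.search(r"defactoScore:\s*([0-9.]+)", line).group(1))
    proofs = [
        {"website": w, "proofPhrase": p, "trustworthinessScore": float(t)}
        for w, p, t in re.findall(
            r"ComplexProofs\{website='(.*?)', proofPhrase='(.*?)', "
            r"trustworthinessScore='(.*?)'\}", line)
    ]
    return score, proofs

line = ("defactoScore: 0.98 setProofSentences : ["
        "ComplexProofs{website='https://en.wikipedia.org/wiki/Example', "
        "proofPhrase='An example sentence .', "
        "trustworthinessScore='0.99'}] subject : A object : B predicate p")
score, proofs = parse_factcheck_line(line)
print(score, len(proofs))  # 0.98 1
```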
Put the resulting JSON file in the data folder.
Further details are in the README file in the overall_process folder.
Install dependencies via conda:
#setting up the environment
#creating and activating the conda environment
conda env create -f environment.yml
conda activate hfc2
#If the conda command is not found: download Miniconda from (https://docs.conda.io/en/latest/miniconda.html#linux-installers) and set the path:
#export PATH=/path-to-conda/miniconda3/bin:$PATH
Start generating results:
# Start the training process with the required hyperparameters. Details about other hyperparameters are in the main.py file.
python main.py --emb_type ConEx --model full-Hybrid --num_workers 32 --min_num_epochs 100 --max_num_epochs 1000 --check_val_every_n_epochs 10 --eval_dataset FactBench
# Compute evaluation files from the saved model in the "dataset/HYBRID_Storage" directory
python evaluate_checkpoint_model.py --emb_type TransE --model full-Hybrid --num_workers 32 --min_num_epochs 100 --max_num_epochs 1000 --check_val_every_n_epochs 10 --eval_dataset FactBench
- To reproduce similar results, you have to use the exact parameters listed above.
- For other datasets, change the parameter passed to --eval_dataset.
- Use a GPU for fast processing. The default parameter is set to the 2 GPUs that we used to generate the results.
- For a different embedding type (--emb_type) or model type (--model), you just need to change the corresponding parameters.
Available embedding types: ConEx, TransE. The following can also be added: ComplEx, RDF2Vec (only for the BPDP dataset), QMult.
Available models: full-Hybrid, KGE-only, text-only, text-KGE-Hybrid, path-only, text-path-Hybrid, KGE-path-Hybrid
Note: model names are case-sensitive. So please use exact names.
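To sweep several model/embedding combinations, the training command can be scripted. The sketch below only assembles the command lines; the model and embedding names are taken from the example commands and lists above, and the launch call is left commented out so you can adapt it to your setup:

```python
from itertools import product

# Sketch: enumerate training commands for several model/embedding
# combinations. Adjust --eval_dataset and the parameter values as needed.
models = ["full-Hybrid", "KGE-only", "text-only"]
emb_types = ["ConEx", "TransE"]

commands = []
for model, emb in product(models, emb_types):
    commands.append(
        ["python", "main.py",
         "--emb_type", emb, "--model", model,
         "--num_workers", "32",
         "--min_num_epochs", "100", "--max_num_epochs", "1000",
         "--check_val_every_n_epochs", "10",
         "--eval_dataset", "FactBench"])

for cmd in commands:
    print(" ".join(cmd))
    # import subprocess; subprocess.run(cmd, check=True)  # uncomment to launch
```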
After computing the evaluation results, the prediction files are saved in the "dataset/HYBRID_Storage" folder along with the ground-truth files. These files can be uploaded to a live instance of the GERBIL framework (by Röder et al.) to produce AUROC scores.
In future work, we will exploit the modularity of HybridFC by integrating rule-based approaches and path embeddings. We also plan to explore other ways to select the best evidence sentences.
This work has been supported by the EU H2020 Marie Skłodowska-Curie project KnowGraphs (no. 860801).
- Umair Qudus (DICE, Paderborn University)
- Michael Röder (DICE, Paderborn University)
- Muhammad Saleem (DICE, Paderborn University)
- Axel-Cyrille Ngonga Ngomo (DICE, Paderborn University)
@InProceedings{qudus2022hybridfc,
  author    = {Qudus, Umair and Röder, Michael and Saleem, Muhammad and Ngomo, Axel-Cyrille Ngonga},
  editor    = {Sattler, Ulrike and Hogan, Aidan and Keet, Maria and Presutti, Valentina and Almeida, Jo{\~a}o Paulo A. and Takeda, Hideaki and Monnin, Pierre and Pirr{\`o}, Giuseppe and d'Amato, Claudia},
  title     = {HybridFC: A Hybrid Fact-Checking Approach for Knowledge Graphs},
  booktitle = {The Semantic Web -- ISWC 2022},
  year      = {2022},
  doi       = {10.1007/978-3-031-19433-7\_27},
  isbn      = {978-3-031-19433-7},
  pages     = {462--480},
  address   = {Cham},
  publisher = {Springer International Publishing},
  biburl    = {https://www.bibsonomy.org/bibtex/2ec2f0b9ee7ca0c1c6ef1d8fbcd7262e4/dice-research},
  keywords  = {knowgraphs frockg raki 3dfed dice ngonga saleem roeder qudus},
  url       = {https://papers.dice-research.org/2022/ISWC_HybridFC/public.pdf},
}