ReCG is the first bottom-up JSON schema discovery algorithm.
We introduce ReCG, our novel algorithm for JSON schema discovery. ReCG is designed to address the limitations of traditional top-down methods by operating in a bottom-up manner. Here are the key features of ReCG:
- Utilizing a bottom-up approach for JSON schema discovery, which builds tree-structured JSON schemas from leaf elements upwards, ensuring more informed decisions about what type of schema node to derive
- Implementing a repetitive cluster-and-generalize framework that systematically explores candidate schema sets
- Applying the Minimum Description Length (MDL) principle to select the most concise and precise schemas, balancing generality and specificity
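To make the MDL trade-off above concrete, here is a toy sketch of a generic two-part description length: the bits needed to write down a schema plus the bits needed to identify each document among the instances that schema admits. This is not ReCG's actual cost model; the function name, the candidate names, and all numbers are hypothetical and chosen only for illustration.

```python
import math


def description_length(schema_bits, instances_admitted, num_docs):
    """Generic two-part MDL score (illustrative, not ReCG's cost model).

    schema_bits: hypothetical cost of writing down the schema itself.
    instances_admitted: hypothetical number of distinct documents the
        schema accepts; picking one of them costs log2(n) bits.
    """
    data_bits = num_docs * math.log2(instances_admitted)
    return schema_bits + data_bits


NUM_DOCS = 100

# (schema_bits, instances_admitted) for three made-up candidate schemas.
candidates = {
    "too_general": (10, 1024),   # tiny schema, but accepts almost anything
    "balanced": (40, 4),         # moderate schema, fairly tight fit
    "too_specific": (400, 1),    # huge schema that spells out every case
}

costs = {
    name: description_length(bits, admitted, NUM_DOCS)
    for name, (bits, admitted) in candidates.items()
}
best = min(costs, key=costs.get)

print(costs)
print(best)
```

In this made-up setting the overly general schema pays heavily for describing the data and the overly specific one pays heavily for describing itself, so the middle candidate has the lowest total.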
Evaluations show ReCG improves recall and precision by up to 47%, resulting in a 46% better F1 score and over twice the speed compared to state-of-the-art techniques.
This page explains how to reproduce the results reported in the paper "ReCG: Bottom-Up JSON Schema Discovery Using a Repetitive Cluster-and-Generalize Framework".
Please refer to the instructions below.
You need Docker to download our image from Docker Hub. Please refer to the Docker Docs to install Docker.
We provide a Docker image of our environment.
- Download our image from Docker Hub
docker pull joohyungyun/vldb2024-recg:1.0
Create a docker container using the downloaded image.
- Docker run
docker run -itd --name vldb2024-recg joohyungyun/vldb2024-recg:1.0 /bin/bash
- Docker start
docker start vldb2024-recg
- Docker init
docker init vldb2024-recg
The whole reproduction process can be run with a single command:
./runAll.sh
The anticipated runtime of the whole process is over 4 full days, so we recommend running it inside tmux.
For a detailed explanation or a more fine-grained run, jump to Quick Overview.
- ReCG
This directory contains the C++ implementation of ReCG.
Refer to README of this directory for more information.
- Dataset
This directory contains all 20 datasets used in the paper "ReCG: Bottom-Up JSON Schema Discovery Using a Repetitive Cluster-and-Generalize Framework". Due to their file sizes, the datasets are not uploaded to the GitHub repository; they are included only in our Docker image. The reproduction will therefore not succeed from a clone of the GitHub repository alone.
Refer to README of this directory for more information.
- Experiment
This directory contains the Python implementations for the four experiments conducted in "ReCG: Bottom-Up JSON Schema Discovery Using a Repetitive Cluster-and-Generalize Framework".
Refer to README of this directory for more information.
- ExperimentVisualization
This directory contains the Python implementations that visualize (either printing in consoles or drawing plots) experiments conducted in "ReCG: Bottom-Up JSON Schema Discovery Using a Repetitive Cluster-and-Generalize Framework".
Refer to README of this directory for more information.
We explain our code in a fine-grained manner.
The ./runAll.sh script consists of three steps.
(a) Build C++ Implementations
cd ReCG
./compile.sh
cd ..
cd ReCG_TopDown
./compile.sh
cd ..
cd Frozza
./compile.sh
cd ..
cd Klettke
./compile.sh
cd ..
(b) Build Jxplain
./buildJxplain.sh
(c) Build KReduce
./buildKReduce.sh
Run all experiments and return to this directory.
cd Experiment
./runAllExperiments.sh
cd ..
For a detailed explanation of each experiment, refer to the README.md of the Experiment directory.
Run all experiment visualizations and return to this directory.
cd ExperimentVisualization
./runAllExperimentVisualizations.sh
cd ..
For a detailed explanation of each visualization, refer to the README.md of the ExperimentVisualization directory.
If you only want to run ReCG, please follow the instructions below.
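ReCG takes its input as a .jsonl file (the --in_path option), so it can save time to confirm that the file parses before starting a run. The sketch below assumes the usual JSON Lines convention of one JSON value per line; check_jsonl is a hypothetical helper, not part of the ReCG repository, and how ReCG itself treats blank or malformed lines is not stated here.

```python
import json
from pathlib import Path


def check_jsonl(path, max_errors=5):
    """Return (num_documents, errors) for a JSON Lines file.

    errors is a list of (line_number, message) pairs. Scanning stops
    once max_errors problems have been collected.
    """
    count = 0
    errors = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            if not line.strip():
                # Blank lines are skipped by this check only; whether
                # ReCG tolerates them is an open assumption.
                continue
            try:
                json.loads(line)
                count += 1
            except json.JSONDecodeError as exc:
                errors.append((line_no, exc.msg))
                if len(errors) >= max_errors:
                    break
    return count, errors
```

Calling check_jsonl on an input file returns the number of parsed documents and the line numbers of any lines that failed to parse.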
You can run ReCG in release mode with the following command:
~/VLDB2024_ReCG/ReCG/build/ReCG \
--in_path [pathToInputFile (.jsonl)] \
--out_path [pathToOutputSchema (.json)] \
--search_alg kbeam \
--beam_width [int] \
--epsilon [float | 0 < x && x <= 1] \
--min_pts_perc [int | 0 < x && x <= 100] \
--sample_size [int | x > 0] \
--src_weight [float | 0 <= src_weight && src_weight <= 1.0 && src_weight + drc_weight == 1] \
--drc_weight [float | 0 <= drc_weight && drc_weight <= 1.0 && src_weight + drc_weight == 1] \
--cost_model [{mdl, kse}]
You may also run it in debug mode with the following command:
~/VLDB2024_ReCG/ReCG/build-debug/ReCG \
--in_path [pathToInputFile (.jsonl)] \
--out_path [pathToOutputSchema (.json)] \
--search_alg kbeam \
--beam_width [int] \
--epsilon [float | 0 < x && x <= 1] \
--min_pts_perc [int | 0 < x && x <= 100] \
--sample_size [int | x > 0] \
--src_weight [float | 0 <= src_weight && src_weight <= 1.0 && src_weight + drc_weight == 1] \
--drc_weight [float | 0 <= drc_weight && drc_weight <= 1.0 && src_weight + drc_weight == 1] \
--cost_model [{mdl, kse}]
Example command to try:
~/VLDB2024_ReCG/ReCG/build/ReCG \
--in_path ~/VLDB2024_ReCG/ReCG/test_data/ckg_node_Amino_acid_sequence.jsonl \
--out_path something.json \
--search_alg kbeam \
--beam_width 3 \
--sample_size 1000 \
--epsilon 0.5 \
--src_weight 0.5 \
--drc_weight 0.5
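When launching ReCG from a script, the option ranges listed above can be checked before a long run starts. The following is a minimal sketch: build_recg_command is a hypothetical helper, not part of the repository, and it uses only the flags and constraints shown above. It assumes that --min_pts_perc and --cost_model may be omitted, as they are in the example command.

```python
import math
from pathlib import Path

# Release binary path as given in the commands above.
RECG_BINARY = Path.home() / "VLDB2024_ReCG" / "ReCG" / "build" / "ReCG"


def build_recg_command(in_path, out_path, *, beam_width, epsilon, sample_size,
                       src_weight, drc_weight, min_pts_perc=None,
                       cost_model=None, binary=RECG_BINARY):
    """Validate the documented ranges and return an argv list."""
    if not 0 < epsilon <= 1:
        raise ValueError("epsilon must satisfy 0 < x <= 1")
    if sample_size <= 0:
        raise ValueError("sample_size must be > 0")
    for name, weight in (("src_weight", src_weight), ("drc_weight", drc_weight)):
        if not 0 <= weight <= 1:
            raise ValueError(f"{name} must satisfy 0 <= x <= 1")
    if not math.isclose(src_weight + drc_weight, 1.0):
        raise ValueError("src_weight + drc_weight must equal 1")
    if min_pts_perc is not None and not 0 < min_pts_perc <= 100:
        raise ValueError("min_pts_perc must satisfy 0 < x <= 100")
    if cost_model is not None and cost_model not in ("mdl", "kse"):
        raise ValueError("cost_model must be 'mdl' or 'kse'")

    cmd = [
        str(binary),
        "--in_path", str(in_path),
        "--out_path", str(out_path),
        "--search_alg", "kbeam",
        "--beam_width", str(beam_width),
        "--epsilon", str(epsilon),
        "--sample_size", str(sample_size),
        "--src_weight", str(src_weight),
        "--drc_weight", str(drc_weight),
    ]
    # Optional flags are appended only when given (assumption: the
    # binary falls back to defaults otherwise, as in the example run).
    if min_pts_perc is not None:
        cmd += ["--min_pts_perc", str(min_pts_perc)]
    if cost_model is not None:
        cmd += ["--cost_model", cost_model]
    return cmd
```

The returned list can be passed to subprocess.run(cmd, check=True) inside the container; building a list rather than a single string avoids shell quoting problems with file paths.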