This is a private repository storing the official implementation of "ReCG: Bottom-Up JSON Schema Discovery Using a Repetitive Cluster-and-Generalize Framework"

joohyung00/recg_vldb_2024

ReCG: Bottom-Up JSON Schema Discovery Using a Repetitive Cluster-and-Generalize Framework

ReCG is the first bottom-up JSON schema discovery algorithm.

Table of Contents
  1. About ReCG
  2. Getting Started
  3. Single-Command Reproduction
  4. Explanation About Directories
  5. Quick Overview

About ReCG

We introduce ReCG, our novel algorithm for JSON schema discovery. ReCG is designed to address the limitations of traditional top-down methods by operating in a bottom-up manner. Here are the key features of ReCG:

  • Utilizing a bottom-up approach for JSON schema discovery, which builds tree-structured JSON schemas from leaf elements upwards, ensuring more informed decisions about what type of schema node to derive
  • Implementing a repetitive cluster-and-generalize framework that systematically explores candidate schema sets
  • Applying the Minimum Description Length (MDL) principle to select the most concise and precise schemas, balancing generality and specificity
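The MDL-based selection in the last bullet can be illustrated with a toy example. This is only a sketch under loud assumptions: the function names, the candidate schemas, and the bit counts are all invented for illustration, and the weighted sum is a generic MDL-style cost, not ReCG's actual cost model. The weight names merely echo the `--src_weight` and `--drc_weight` flags described later; this README does not specify how ReCG applies them.

```python
# Toy MDL-style selection. All names and numbers are hypothetical and
# serve only to illustrate the trade-off between generality and specificity.

def description_length(schema_bits, data_bits, src_weight=0.5, drc_weight=0.5):
    """Weighted sum of the cost of the schema and the cost of the data given it."""
    return src_weight * schema_bits + drc_weight * data_bits


def pick_best(candidates, src_weight=0.5, drc_weight=0.5):
    """Return the name of the candidate with the smallest total cost."""
    return min(
        candidates,
        key=lambda name: description_length(
            *candidates[name], src_weight, drc_weight
        ),
    )


# (schema_bits, data_bits) per candidate; invented values.
CANDIDATES = {
    "too_general": (10.0, 900.0),   # tiny schema, but data is expensive to encode
    "balanced": (60.0, 200.0),
    "too_specific": (800.0, 20.0),  # huge schema that memorizes the data
}

if __name__ == "__main__":
    print(pick_best(CANDIDATES))
```

With equal weights the balanced candidate has the lowest total cost. Shifting all weight to the schema term favors the most general candidate, and shifting it to the data term favors the most specific one.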

Evaluations show ReCG improves recall and precision by up to 47%, resulting in a 46% better F1 score and over twice the speed compared to state-of-the-art techniques.

ReCG is implemented in C++.

(back to top)

Getting Started

This page explains how to reproduce the results reported in the paper "ReCG: Bottom-Up JSON Schema Discovery Using a Repetitive Cluster-and-Generalize Framework".

Please refer to the instructions below.

Prerequisites

Docker

You must be able to pull our Docker image from Docker Hub. Please refer to the Docker Docs to install Docker.

Download Docker Image

We provide a Docker image of our environment. Please pull it from Docker Hub.

  1. Pull our image from Docker Hub
    docker pull joohyungyun/vldb2024-recg:1.0

Create Docker Container

Create a docker container using the downloaded image.

  1. Docker run
    docker run -itd --name vldb2024-recg joohyungyun/vldb2024-recg:1.0 /bin/bash
  2. Docker start
    docker start vldb2024-recg
  3. Docker exec (open a shell inside the running container)
    docker exec -it vldb2024-recg /bin/bash

(back to top)

Single-Command Reproduction

The whole reproduction process can be run with a single command:

./runAll.sh

The anticipated runtime of the whole process is over 4 full days, so we recommend running it inside tmux so that it survives a closed terminal.

For a detailed explanation or a more fine-grained run, jump to Quick Overview.

Explanation About Directories

  • ReCG

This directory contains the C++ implementation of ReCG.

Refer to the README in this directory for more information.

  • Dataset

This directory contains all 20 datasets used in the paper "ReCG: Bottom-Up JSON Schema Discovery Using a Repetitive Cluster-and-Generalize Framework". Due to their file sizes, the datasets are not uploaded to the GitHub repository; they are included only in our Docker image. Reproduction will therefore fail if you only clone the GitHub repository.

Refer to the README in this directory for more information.

  • Experiment

This directory contains the Python implementations for the four experiments conducted in "ReCG: Bottom-Up JSON Schema Discovery Using a Repetitive Cluster-and-Generalize Framework".

Refer to the README in this directory for more information.

  • ExperimentVisualization

This directory contains the Python scripts that visualize the results (either by printing to the console or by drawing plots) of the experiments conducted in "ReCG: Bottom-Up JSON Schema Discovery Using a Repetitive Cluster-and-Generalize Framework".

Refer to the README in this directory for more information.

(back to top)

Quick Overview

This section walks through our code step by step. The ./runAll.sh script consists of three steps.

1. Build or Compile Algorithms

(a) Build C++ Implementations

cd ReCG
./compile.sh
cd ..
cd ReCG_TopDown
./compile.sh
cd ..
cd Frozza
./compile.sh
cd ..
cd Klettke
./compile.sh
cd ..

(b) Build Jxplain

./buildJxplain.sh

(c) Build KReduce

./buildKReduce.sh

2. Run Experiments

Run all experiments and return to this directory.

cd Experiment
./runAllExperiments.sh
cd ..

For a detailed explanation of each experiment, refer to the README.md of the Experiment directory.

3. Visualize Experiment Results

Run all experiment visualizations and return to this directory.

cd ExperimentVisualization
./runAllExperimentVisualizations.sh
cd ..

For a detailed explanation of each visualization, refer to the README.md of the ExperimentVisualization directory.

(A) Run ReCG

If you only want to run ReCG, please follow the instructions below.

You can run ReCG in release mode with the following command:

~/VLDB2024_ReCG/ReCG/build/ReCG \
    --in_path [pathToInputFile (.jsonl)] \
    --out_path [pathToOutputSchema (.json)] \
    --search_alg kbeam \
    --beam_width [int] \
    --epsilon [float | 0 < x && x <= 1] \
    --min_pts_perc [int | 0 < x && x <= 100] \
    --sample_size [int | x > 0] \
    --src_weight [float | 0 <= src_weight && src_weight <= 1.0 && src_weight + drc_weight == 1] \
    --drc_weight [float | 0 <= drc_weight && drc_weight <= 1.0 && src_weight + drc_weight == 1] \
    --cost_model [{mdl, kse}]

You may also run it in debug mode with the following command:

~/VLDB2024_ReCG/ReCG/build-debug/ReCG \
    --in_path [pathToInputFile (.jsonl)] \
    --out_path [pathToOutputSchema (.json)] \
    --search_alg kbeam \
    --beam_width [int] \
    --epsilon [float | 0 < x && x <= 1] \
    --min_pts_perc [int | 0 < x && x <= 100] \
    --sample_size [int | x > 0] \
    --src_weight [float | 0 <= src_weight && src_weight <= 1.0 && src_weight + drc_weight == 1] \
    --drc_weight [float | 0 <= drc_weight && drc_weight <= 1.0 && src_weight + drc_weight == 1] \
    --cost_model [{mdl, kse}]
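The parameter ranges listed in the command templates can be checked before a long run is launched. The following is a minimal sketch, not part of the ReCG code base: `build_recg_command` is a hypothetical helper, and it enforces only the constraints stated in the templates (no range is documented for `--beam_width`, so none is checked). How ReCG itself reacts to out-of-range values is not covered here.

```python
import math
import os

ALLOWED_COST_MODELS = {"mdl", "kse"}


def build_recg_command(binary, in_path, out_path, *, beam_width, epsilon,
                       min_pts_perc, sample_size, src_weight, drc_weight,
                       cost_model):
    """Validate the documented ranges and return the argument list.

    Hypothetical helper: it mirrors the flag names from the templates
    and raises ValueError when a documented constraint is violated.
    """
    if not 0 < epsilon <= 1:
        raise ValueError("epsilon must satisfy 0 < epsilon <= 1")
    if not (isinstance(min_pts_perc, int) and 0 < min_pts_perc <= 100):
        raise ValueError("min_pts_perc must be an int with 0 < x <= 100")
    if not (isinstance(sample_size, int) and sample_size > 0):
        raise ValueError("sample_size must be a positive int")
    for name, weight in (("src_weight", src_weight), ("drc_weight", drc_weight)):
        if not 0 <= weight <= 1:
            raise ValueError(f"{name} must lie in [0, 1]")
    # Compare with a tolerance because the weights are floats.
    if not math.isclose(src_weight + drc_weight, 1.0, abs_tol=1e-9):
        raise ValueError("src_weight + drc_weight must equal 1")
    if cost_model not in ALLOWED_COST_MODELS:
        raise ValueError("cost_model must be 'mdl' or 'kse'")

    return [
        os.path.expanduser(binary),
        "--in_path", os.path.expanduser(in_path),
        "--out_path", os.path.expanduser(out_path),
        "--search_alg", "kbeam",
        "--beam_width", str(beam_width),
        "--epsilon", str(epsilon),
        "--min_pts_perc", str(min_pts_perc),
        "--sample_size", str(sample_size),
        "--src_weight", str(src_weight),
        "--drc_weight", str(drc_weight),
        "--cost_model", cost_model,
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Building an argument list instead of a shell string avoids quoting problems with paths that contain spaces.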

Example code: try out this one!

~/VLDB2024_ReCG/ReCG/build/ReCG \
    --in_path ~/VLDB2024_ReCG/ReCG/test_data/ckg_node_Amino_acid_sequence.jsonl \
    --out_path something.json \
    --search_alg kbeam \
    --beam_width 3 \
    --sample_size 1000 \
    --epsilon 0.5 \
    --src_weight 0.5 \
    --drc_weight 0.5
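Before pointing `--in_path` at your own data, it can help to confirm that the file is well-formed. The sketch below assumes the usual JSON Lines convention of one JSON value per line; this README states only the `.jsonl` extension, so whether ReCG tolerates blank lines or other deviations is an assumption here. `find_invalid_jsonl_lines` is a hypothetical helper, not part of ReCG.

```python
import json


def find_invalid_jsonl_lines(lines):
    """Return the 1-based numbers of non-blank lines that are not valid JSON.

    `lines` is any iterable of strings, for example an open text file.
    Blank lines are skipped (an assumption; ReCG may treat them differently).
    """
    bad = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        try:
            json.loads(line)
        except json.JSONDecodeError:
            bad.append(number)
    return bad


if __name__ == "__main__":
    sample = ['{"id": 1}\n', '{"id": \n', '[1, 2, 3]\n']
    print(find_invalid_jsonl_lines(sample))
```

To check a real file, open it in text mode and pass the file object directly, so the file is streamed line by line rather than loaded into memory at once.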

(back to top)
