Dataset & code for DestT5 (NLP for ConvAI, ACL 2023)
If you use this dataset or repository, please cite the following paper:
```bibtex
@inproceedings{glenn2023correcting,
    author    = {Parker Glenn and Parag Pravin Dakle and Preethi Raghavan},
    title     = {Correcting Semantic Parses with Natural Language through Dynamic Schema Encoding},
    booktitle = {Proceedings of the 5th Workshop on NLP for Conversational AI},
    publisher = {Association for Computational Linguistics},
    year      = {2023}
}
```
Below we report the exact-match accuracy (EM%) and execution accuracy (EX%) of DestT5 on the SPLASH dataset, as well as on the auxiliary test sets available in the NLEdit codebase.
| Model | Metric | Seq2Struct (SPLASH) | EditSQL | TaBERT | RAT-SQL | T5-Large |
|---|---|---|---|---|---|---|
| DestT5 (`parkervg/destt5-schema-prediction` with `parkervg/destt5-text2sql`) | EM% | 53.43 | 31.82 | 31.47 | 28.37 | 26.1 |
| | EX% | 56.86 | 40.3 | 28.84 | 36.53 | 30.43 |
The file `data/splash-t5-3vnuv1vf.json` contains 112 annotations for interactive semantic parsing. Each annotation pairs an erroneous parse, produced by `tscholak/3vnuv1vf` on randomly selected Spider examples, with natural language feedback describing how to correct it.
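As a quick sanity check, the annotations can be inspected directly. Below is a minimal sketch; the field names follow the SPLASH release and are an assumption, so check the file itself for the exact schema.

```python
# Minimal sketch for inspecting the feedback annotations.
# The keys below (question, predicted_parse, feedback, gold_parse) are
# assumed SPLASH-style fields; the actual file may differ.
import json

with open("data/splash-t5-3vnuv1vf.json") as f:
    annotations = json.load(f)

print(len(annotations))  # expected: 112
example = annotations[0]
for key in ("question", "predicted_parse", "feedback", "gold_parse"):
    print(f"{key}: {example.get(key)}")
```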
Our codebase is based on the great implementation of Picard. Specifically, we add the following fields to the `DataTrainingArguments` dataclass in `seq2seq/utils/dataset.py` to re-create the experiments described in the paper.
```python
# Excerpt: new fields added to DataTrainingArguments in seq2seq/utils/dataset.py
use_gold_concepts: bool = field(
    default=False,
    metadata={
        "help": "Whether or not to serialize input only with columns/tables/values present in the gold query."
    },
)
use_serialization_file: Optional[List[str]] = field(
    default=None,
    metadata={
        "help": "If specified, points to the output of a T5 concept prediction model. Uses predictions as the serialization for the current text-to-sql model."
    },
)
include_explanation: Optional[bool] = field(
    default=False,
    metadata={
        "help": "Whether to serialize the explanation in SPLASH training."
    },
)
include_question: Optional[bool] = field(
    default=False,
    metadata={
        "help": "Whether to serialize the question in SPLASH training."
    },
)
splash_train_with_spider: Optional[bool] = field(
    default=False,
    metadata={
        "help": "Whether to interleave the Spider train set with the SPLASH train set."
    },
)
shuffle_splash_feedback: Optional[bool] = field(
    default=False,
    metadata={
        "help": "Sanity check that the model is actually using feedback, by running evaluation on the test set with shuffled feedback."
    },
)
shuffle_splash_question: Optional[bool] = field(
    default=False,
    metadata={
        "help": "Sanity check that the model is actually using the question, by running evaluation on the test set with shuffled questions."
    },
)
task_type: Optional[str] = field(
    default="text2sql",
    metadata={"help": "One of text2sql, schema_prediction"},
)
spider_eval_on_splash: Optional[bool] = field(
    default=False,
    metadata={"help": "Whether we're running a Spider model on SPLASH. In that case, only the question is used."},
)
```
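These are plain `dataclasses` fields, so they surface as command-line flags and JSON config keys via HuggingFace's `HfArgumentParser`, the mechanism Picard-style codebases use for configuration. The following is a self-contained sketch of that mechanism, not code from this repo; the two fields shown are copied from above, everything else is illustrative.

```python
# Standalone sketch of how HfArgumentParser exposes dataclass fields as
# CLI flags / config keys. Run e.g. with:
#   python demo_args.py --task_type schema_prediction --use_gold_concepts true
from dataclasses import dataclass, field
from typing import Optional

from transformers import HfArgumentParser


@dataclass
class DataTrainingArguments:
    task_type: Optional[str] = field(
        default="text2sql",
        metadata={"help": "One of text2sql, schema_prediction"},
    )
    use_gold_concepts: bool = field(
        default=False,
        metadata={"help": "Serialize input only with gold-query concepts."},
    )


parser = HfArgumentParser(DataTrainingArguments)
(data_args,) = parser.parse_args_into_dataclasses()
print(data_args.task_type, data_args.use_gold_concepts)
```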
First, clone the repo. This repo uses submodules; initialize and fetch them with the following commands.

```bash
git submodule init
git submodule update
```

Then, create a `destt5` conda env with the following command.

```bash
conda env create --file env.yml
```
This work requires both the Spider and SPLASH datasets. First, download Spider.zip here and place it in `seq2seq/datasets/spider`.
Then, to train DestT5, run the following command.

```bash
python -m seq2seq.run_seq2seq ./seq2seq/configs/question/text2sql-t5-base-schema-generator.json
```
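For quick experimentation outside the training harness, the released checkpoints from the results table can also be loaded directly with `transformers`. A minimal sketch follows; note that the input string shown is an assumed serialization (question, feedback, schema), not the canonical format, which is produced by the dataset code above.

```python
# Hedged inference sketch using the released text-to-sql checkpoint.
# The source string below is an illustrative assumption; consult
# seq2seq/utils/dataset.py for the exact serialization DestT5 was trained on.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("parkervg/destt5-text2sql")
model = AutoModelForSeq2SeqLM.from_pretrained("parkervg/destt5-text2sql")

source = "How many singers are there? | count the rows in singer | singer : singer_id, name"
inputs = tokenizer(source, return_tensors="pt")
outputs = model.generate(**inputs, max_length=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```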