A large-scale dataset for answer verbalization for simple natural language questions.

AskNowQA/VANiLLa

VANiLLa : Verbalized Answers in Natural Language at Large scale

Introduction

In recent years, there have been significant developments in Question Answering over Knowledge Graphs (KGQA). Despite these advancements, current KGQA datasets provide answers only as the direct output of the formal query, rather than as full sentences that incorporate the question context. To produce coherent answer sentences that reuse the question's vocabulary, template-based verbalizations are usually employed, which in turn require extensive expert intervention. This makes way for machine learning approaches; however, there is a scarcity of datasets that empower machine learning models in this area. Hence, we provide the VANiLLa dataset, which aims to reduce this gap by offering answers in natural language sentences. The answer sentences in this dataset are syntactically and semantically closer to the question than to the triple fact. Our dataset consists of over 100k simple questions adapted from the CSQA and SimpleQuestionsWikidata datasets and generated using a semi-automatic framework. We also present results of training multiple baseline models, adapted from current state-of-the-art Natural Language Generation (NLG) architectures, on our dataset. We believe that this dataset will allow researchers to focus on finding suitable methodologies and architectures for answer verbalization.

Dataset

The dataset is available here under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.

Our dataset contains over 100k examples with an 80% (train) - 20% (test) split. Each instance of the dataset consists of:

{
    "question_id": "a unique identification number for a dataset instance",
    "question": "question",
    "answer": "retrieved answer",
    "answer_sentence": "verbalized answer in natural language"
}
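
As a minimal sketch of how such instances might be consumed, the Python snippet below parses a JSON array of records, checks that each one carries the four fields listed above, and builds a (source, target) pair for a sequence-to-sequence model. The assumption that the data arrives as a JSON array, the names `VanillaInstance`, `parse_instances`, and `to_seq2seq_pair`, the `<sep>` separator, and the sample record are all hypothetical and not taken from the repository.

```python
import json
from dataclasses import dataclass

# The four fields documented for each dataset instance.
REQUIRED_FIELDS = ("question_id", "question", "answer", "answer_sentence")


@dataclass
class VanillaInstance:
    """Hypothetical container for one dataset instance."""
    question_id: int
    question: str
    answer: str
    answer_sentence: str


def parse_instances(json_text):
    """Parse a JSON array of instances and check the documented fields are present."""
    instances = []
    for record in json.loads(json_text):
        missing = [f for f in REQUIRED_FIELDS if f not in record]
        if missing:
            raise ValueError(f"instance is missing fields: {missing}")
        instances.append(VanillaInstance(**{f: record[f] for f in REQUIRED_FIELDS}))
    return instances


def to_seq2seq_pair(instance, sep=" <sep> "):
    """Build a (source, target) pair; the separator token is an assumed choice."""
    return instance.question + sep + instance.answer, instance.answer_sentence


# Made-up record for illustration only, not taken from the dataset.
sample = json.dumps([{
    "question_id": 1,
    "question": "Who wrote Hamlet?",
    "answer": "William Shakespeare",
    "answer_sentence": "William Shakespeare wrote Hamlet.",
}])
instances = parse_instances(sample)
source, target = to_seq2seq_pair(instances[0])
```

Here the question and the retrieved answer form the model input and the verbalized answer sentence is the target; other input encodings are equally possible.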

Paper

Please see https://arxiv.org/abs/2105.11407 for details about the dataset.

Experimental Results

Baseline Models

We use three conventional sequence-to-sequence models that follow the underlying Encoder-Decoder pipeline:

  • Sequence-to-Sequence model with attention mechanism
  • Convolution-based Encoder-Decoder model
  • Transformer

Baseline Model            PPL     Precision   BLEU
Seq2Seq with Attention    27.91   19.84       16.66
CNN Enc-Dec               87.67   70.50       15.42
Transformer               12.10   76.00       30.80
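
When reading the table, lower perplexity (PPL) is better, while higher Precision and BLEU are better. The README does not state how PPL was computed; the sketch below shows the common definition, the exponential of the mean negative log-likelihood per token, and is an assumption rather than the authors' evaluation code. The function name `perplexity` and the toy input are hypothetical.

```python
import math


def perplexity(token_log_probs):
    """Return exp(mean negative log-likelihood) for natural-log token probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Toy input: a model that assigns probability 0.25 to each of 8 reference tokens.
uniform_log_probs = [math.log(0.25)] * 8
ppl = perplexity(uniform_log_probs)
```

A model that is uniformly uncertain among four choices at every step has a perplexity of about 4 under this definition, which gives an intuition for the scale of the reported values.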
