HiTab is a dataset for question answering and data-to-text generation over hierarchical tables. It contains 10,672 samples and 3,597 tables from statistical reports (StatCan, NSF) and Wikipedia (ToTTo). 98.1% of the tables in HiTab have hierarchies. You can find more details in our paper.
During annotation, annotators first manually collect tables from statistical websites, together with closely related descriptive sentences written by professional analysts. These descriptions are then revised into questions in a way that preserves their original meanings and analyses.
We hope HiTab can serve as a useful benchmark for table understanding on hierarchical tables.
In the latest version of the dataset, we have improved the hierarchy extraction algorithm and fixed some unreliable question-answer pairs, so the qa and data2text performance will be slightly higher than the results reported in the paper. We show more details in the qa and data2text descriptions.
- 2024-11-12: “Encoding Spreadsheets for Large Language Models” at EMNLP 2024.
- 2024-7-15: A tutorial on “Large Language Models for Tabular Data” at SIGIR 2024.
- 2022-7-23: A survey on “Table Pretraining: A Survey on Model Architectures, Pretraining Objectives, and Downstream Tasks” at IJCAI 2022.
- 2022-5-28: Code for data-to-text experiments is now available.
- 2022-3-8: HiTab was accepted to the ACL 2022 main conference.
- 2022-2-7: We released the final version of HiTab data. Please feel free to explore it!
- 2021-12-6: We released the question answering code and a new version of the HiTab data. Modifications to the data: (1) more precise hierarchies are derived for ~3% of tables with new heuristic algorithms; (2) fixed the problem that ~0.6% of table ranges were not correctly extracted from the original Excel files; (3) temporarily set aside ~1.5% of samples containing unreliable answers or aggregations for further checking, which hopefully won't affect the evaluation of new methods given the small proportion. We'll release the final version of HiTab after checking. Thank you for your patience.
- 2021-9-2: We released the full HiTab data, including (1) question answering and data2text samples, (2) tables with parsed hierarchies.
The HiTab dataset consists of three .jsonl files for train/dev/test samples and a directory of .json files for tables.
{
"id": "7392822961051524760",
"table_id": "1028",
"table_source": "statcan",
"sentence_id": "5895",
"sub_sentence_id": "1",
"sub_sentence": "in 2013/2014, on any given day, there were on average 139,337 adult offenders being supervised in either provincial/territorial or federal correctional services",
"question": "in 2013/2014, on any given day, how many adult offenders are being supervised in either provincial/territorial or federal correctional services?",
"answer": [
139337
],
"aggregation": [
"sum"
],
"linked_cells": {
"entity_link": {
"top": {
"correctional services": {
"(0, 7)": "total correctional services"
}
},
"left": {
"provincial/territorial": {
"(14, 0)": "provinces and territories - total"
},
"federal": {
"(15, 0)": "federal"
}
},
"top_left_corner": {}
},
"quantity_link": {
"[ANSWER]": {
"(15, 7)": 22895.0,
"(14, 7)": 116442.0
}
}
},
"answer_formulas": [
"=H17+H18"
],
"reference_cells_map": {
"H17": "(14, 7)",
"H18": "(15, 7)"
}
}
- Meta Data: id is the unique id of each sample. The other ids describe detailed information about the annotations, and table_source shows which source the table comes from.
- Task Data: sub_sentence is the "text" in the data2text task. question and answer are for the question answering task.
- Links and Compositions: aggregation is the aggregation(s) used to derive the answer. linked_cells are the cells regarded in both tasks. answer_formulas are formulas describing how cells are composed to derive the answer. reference_cells_map maps each cell referenced in the formulas (e.g. "H17") to its cell coordinate in the table matrix (e.g. "(14, 7)").
  - Linked Cells: linked_cells are divided into entity_link (not in data region) and quantity_link (cells in data region). entity_link are further classified into top (top header), left (left header) and top_left_corner (on the top-left corner of table). The key of each link is the phrase in the sub-sentence, like "correctional services". The value contains key-value pairs in the format cell coordinate - cell string in table, like "(0, 7)": "total correctional services". [ANSWER] is a special key that stands for the cells that are composed to derive the answer. Usually [ANSWER] appears in quantity_link, but sometimes it can be in entity_link if the answer is a header.
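As a minimal sketch of how these fields fit together (not the repository's own loader), the snippet below parses the coordinate strings and checks that the cells referenced by the formula add up to the annotated answer. The helper names parse_coord and formula_values are illustrative, the sample dict is trimmed from the example above, and the regular expression for spreadsheet references is an assumption that only covers simple references such as H17.

```python
import ast
import re

# Trimmed copy of the sample shown above; only the fields used here are kept.
sample = {
    "answer": [139337],
    "aggregation": ["sum"],
    "linked_cells": {
        "quantity_link": {
            "[ANSWER]": {"(15, 7)": 22895.0, "(14, 7)": 116442.0}
        }
    },
    "answer_formulas": ["=H17+H18"],
    "reference_cells_map": {"H17": "(14, 7)", "H18": "(15, 7)"},
}


def parse_coord(text):
    """Turn a coordinate string such as "(14, 7)" into a (row, column) tuple."""
    row, col = ast.literal_eval(text)
    return int(row), int(col)


def formula_values(sample, formula):
    """Look up the value of every cell referenced in a formula.

    Assumes each referenced cell also appears under the [ANSWER] key of
    quantity_link, which holds for this sample but is not guaranteed in general.
    """
    answer_cells = {
        parse_coord(coord): value
        for coord, value in sample["linked_cells"]["quantity_link"]["[ANSWER]"].items()
    }
    refs = re.findall(r"[A-Z]+[0-9]+", formula)  # e.g. ["H17", "H18"]
    return [
        answer_cells[parse_coord(sample["reference_cells_map"][ref])]
        for ref in refs
    ]


values = formula_values(sample, sample["answer_formulas"][0])
total = sum(values)  # the annotated aggregation for this sample is "sum"
print(values, total)
```

For this sample the two referenced cells, 116442 and 22895, add up to 139337, which matches the answer field.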
The cell coordinates above use the coordinate system of the table matrix provided in the following table format.
{
"top_root": {
"row_index": -1,
"column_index": -1,
"children": [
{
"row_index": 0,
"column_index": 1,
"children": [
{
"row_index": 1,
"column_index": 1,
"children": []
},
{
"row_index": 1,
"column_index": 2,
"children": []
}
]
},...
]
},
"left_root": {
"row_index": -1,
"column_index": -1,
"children": [
{
"row_index": 2,
"column_index": 0,
"children": [
{
"row_index": 3,
"column_index": 0,
"children": []
},
{
"row_index": 4,
"column_index": 0,
"children": []
},...
]
},
...
]
},
"top_header_rows_num": 3,
"left_header_columns_num": 1
}
top_root and left_root are the parsed tree hierarchies of top headers and left headers. row_index and column_index are the row and column index of the current header node in the table matrix. -1 stands for the virtual root. top_header_rows_num and left_header_columns_num are the number of rows/columns of headers in the table matrix.
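One way to work with these trees is to walk them from the virtual root and collect the path of header cells leading to each leaf. The sketch below is illustrative only: header_paths is a hypothetical helper, not part of the HiTab codebase, and top_root is a cut-down copy of the structure shown above with the elided nodes left out.

```python
def header_paths(root):
    """Yield, for every leaf header, the list of (row, column) coordinates
    from its top-level ancestor down to the leaf itself."""

    def walk(node, path):
        coord = (node["row_index"], node["column_index"])
        if coord != (-1, -1):  # -1 marks the virtual root, which has no cell
            path = path + [coord]
        if not node["children"]:
            if path:  # an empty tree yields nothing
                yield path
            return
        for child in node["children"]:
            yield from walk(child, path)

    yield from walk(root, [])


# Cut-down copy of the top header tree from the example above.
top_root = {
    "row_index": -1,
    "column_index": -1,
    "children": [
        {
            "row_index": 0,
            "column_index": 1,
            "children": [
                {"row_index": 1, "column_index": 1, "children": []},
                {"row_index": 1, "column_index": 2, "children": []},
            ],
        }
    ],
}

paths = list(header_paths(top_root))
for path in paths:
    print(path)
```

Each path can then be used to look up the header strings in the table matrix, for example to build a full column name from its parent headers.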
{
"texts": [
[
"",
"total beverages",
"",
"skim, 1% or 2% milk",
"",
"whole milk and flavoured milk",
"",
"fruit juice",
"",
"soft drinks",
"",
"fruit drinks",
""
],...
],
"merged_regions": [
{
"first_row": 0,
"last_row": 0,
"first_column": 5,
"last_column": 6
},
{
"first_row": 0,
"last_row": 0,
"first_column": 3,
"last_column": 4
}, ...
],
}
texts is the complete table matrix consisting of M rows and N columns. merged_regions lists all the merged cells. If a cell is a merged cell, only its core cell (the top-left position in the merged cell) will have content in texts, and the others will be empty.
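If a downstream model needs every position of a merged cell to carry the header text, the core cell's content can be copied across the region. The following is a minimal sketch under the assumption that last_row and last_column are inclusive (which is consistent with the two-column regions in the example). fill_merged_regions is a hypothetical helper, not a function from this repository.

```python
import copy


def fill_merged_regions(texts, merged_regions):
    """Return a copy of the table matrix in which every cell of a merged
    region holds the text of the region's core (top-left) cell."""
    filled = copy.deepcopy(texts)  # leave the original matrix untouched
    for region in merged_regions:
        core = texts[region["first_row"]][region["first_column"]]
        # Assumption: last_row and last_column are inclusive bounds.
        for row in range(region["first_row"], region["last_row"] + 1):
            for col in range(region["first_column"], region["last_column"] + 1):
                filled[row][col] = core
    return filled


# First row and merged regions taken from the example above.
texts = [[
    "", "total beverages", "", "skim, 1% or 2% milk", "",
    "whole milk and flavoured milk", "", "fruit juice", "",
    "soft drinks", "", "fruit drinks", "",
]]
merged_regions = [
    {"first_row": 0, "last_row": 0, "first_column": 5, "last_column": 6},
    {"first_row": 0, "last_row": 0, "first_column": 3, "last_column": 4},
]

filled = fill_merged_regions(texts, merged_regions)
print(filled[0][3:7])
```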
The tables in the tables/hmt/ directory are a version adapted to the hierarchical matrix table data structure customized for the hierarchy-aware logical form. They contain essentially the same information as the data format above.
The question answering codebase references the PyTorch version of MAPO and TaBERT. Many thanks to PengCheng Yin for the great work!
Weakly supervised Table QA usually requires consistent programs for warm start, and alignments between the question and table schemas or headers as input features. We already provide these as data/explore/saved_programs.json and data/processed_input/.
Users can also start from the raw data format, i.e. data/*_samples.jsonl, by searching programs with qa/table/random_explore.py and extracting question-table alignments with qa/datadump/process_input.py. The detailed usage of the console arguments can be found in the code files.
Here is a very quick start script for the "MAPO with hierarchical-aware logical form" method in the HiTab paper, using our processed data.
# unzip table files
unzip -d data/ data/tables.zip
# set 'MY_PATH_TO' in config as the path to the project (similarly for partial supervision)
vim qa/config/config.vanilla_bert.json
# train
bash train_hmtqa.sh
# test
bash test_hmtqa.sh
The training phase takes ~10 hours on 4 V100 GPUs.
If needed, we provide the baseline "MAPO with hierarchical-aware logical form" model checkpoint, which achieves 45.5% on the dev set and 42.3% on the test set. Both are slightly higher than the results in the paper due to the updated dataset. We also find that disabling trigger words in training may increase accuracy at the cost of a much higher spurious program rate, so we choose to retain the trigger words.
We explore four baseline models to generate meaningful text from hierarchical tables in HiTab. Three of them are transformer-based models: T5, BART, and BERT-to-BERT. The other is a Pointer-Generator Network based on the LSTM architecture.
To start with, make sure to install the following requirements:
pip install openpyxl
pip install datasets
pip install transformers
Read in train_samples.jsonl, dev_samples.jsonl, and test_samples.jsonl from the ./data/ directory.
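Reading these files needs only the standard library, since each line holds one JSON object. The snippet below is a small sketch rather than the loader used in do_preprocess.py. read_jsonl is a hypothetical helper, and the loop simply skips any split whose file is not present.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Read a .jsonl file into a list of dicts, one per non-empty line."""
    samples = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines, e.g. a trailing newline
                samples.append(json.loads(line))
    return samples


data_dir = Path("data")
splits = {}
for split in ("train", "dev", "test"):
    path = data_dir / f"{split}_samples.jsonl"
    if path.exists():
        splits[split] = read_jsonl(path)

for split, samples in splits.items():
    print(split, len(samples))
```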
Process each sample in one of two settings: (1) with highlighted/linked table cells only, or (2) with additional operations and answer(s).
- The generation target label is the annotated sub_sentence.
- To create a serialized table data input, we need to: (1) find all linked entity/quantity cells, (2) find all of their ascendants, then linearize their cell contents following a top-down left-to-right order. If extra operational information is required, we will then append the answer formula and answer string to the source as the final model input.
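The serialization steps above can be sketched roughly as follows. This is one possible reading of the description, not the logic of do_preprocess.py. The helper names, the " | " separator, the toy table, the formula string, and the rule that a data cell inherits the leaf top header of its column and the left header of its row are all assumptions made for illustration.

```python
def path_to(node, is_target, path=()):
    """Depth-first search that returns the header coordinates from the top
    level down to the first node accepted by is_target, or [] if none is."""
    coord = (node["row_index"], node["column_index"])
    if coord != (-1, -1):  # the virtual root has no cell of its own
        path = path + (coord,)
        if is_target(node):
            return list(path)
    for child in node["children"]:
        found = path_to(child, is_target, path)
        if found:
            return found
    return []


def header_chain(table, cell):
    """Headers above a cell; for a header cell the chain ends at the cell."""
    row, col = cell

    def is_cell(node):
        return (node["row_index"], node["column_index"]) == cell

    if row < table["top_header_rows_num"]:
        return path_to(table["top_root"], is_cell)
    if col < table["left_header_columns_num"]:
        return path_to(table["left_root"], is_cell)
    # Data cell (assumption): leaf top header of its column, left header of its row.
    top = path_to(table["top_root"],
                  lambda n: n["column_index"] == col and not n["children"])
    left = path_to(table["left_root"], lambda n: n["row_index"] == row)
    return top + left


def linearize(table, linked_cells, formula=None, answer=None, sep=" | "):
    """Join linked cells and their header chains in row-major order."""
    coords = set()
    for cell in linked_cells:
        coords.add(cell)
        coords.update(header_chain(table, cell))
    # Sorting (row, column) tuples gives a top-down, left-to-right order.
    parts = [table["texts"][row][col] for row, col in sorted(coords)]
    if formula is not None:
        parts.append(formula)
    if answer is not None:
        parts.append(str(answer))
    return sep.join(parts)


def leaf(row, col):
    return {"row_index": row, "column_index": col, "children": []}


# Hypothetical toy table, not taken from HiTab.
table = {
    "texts": [
        ["", "beverages", ""],
        ["", "milk", "juice"],
        ["children", "10", "20"],
        ["adults", "30", "40"],
    ],
    "top_root": {"row_index": -1, "column_index": -1, "children": [
        {"row_index": 0, "column_index": 1, "children": [leaf(1, 1), leaf(1, 2)]},
    ]},
    "left_root": {"row_index": -1, "column_index": -1,
                  "children": [leaf(2, 0), leaf(3, 0)]},
    "top_header_rows_num": 2,
    "left_header_columns_num": 1,
}

source = linearize(table, [(3, 2)])
source_with_aggr = linearize(table, [(3, 2)], formula="=C4", answer=40)
print(source)
print(source_with_aggr)
```

In this toy case the first call corresponds to the cell highlight setting and the second to the setting that also appends the formula and answer.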
This process creates source-target pairs for the train/dev/test sets. To perform data pre-processing for the cell highlight setting, simply run:
python do_preprocess.py
Or to enable the cell & calculation setting, specify the additional argument by:
python do_preprocess.py --add_aggr
Both will load the data from the hitab/data/ directory and generate a processed version in hitab/data2text/data/.
Note that the input samples require another layer of tokenization, using hitab/data2text/experiment/pointer_generator/parse_sample.py.
The experiment directory contains the code for training (train_d2t.py) and evaluation (eval_d2t.py).
T5, BART, and BERT-to-BERT directly call the training process from the installed transformers library.
The Pointer-Generator Network (PGN) requires additional code modules, specifically in the pointer_generator directory.
To follow the training pipeline, taking BART as an example, run:
python run_experiment.py --expr_name bart --do_train --do_eval --do_test
Alter the expr_name argument among t5/bart/b2b/pgn to explore different models.
If you find the HiTab dataset useful in your work, please consider citing the paper:
@article{cheng2021hitab,
title={HiTab: A Hierarchical Table Dataset for Question Answering and Natural Language Generation},
author={Cheng, Zhoujun and Dong, Haoyu and Wang, Zhiruo and Jia, Ran and Guo, Jiaqi and Gao, Yan and Han, Shi and Lou, Jian-Guang and Zhang, Dongmei},
journal={arXiv preprint arXiv:2108.06712},
year={2021}
}
This dataset follows the Computational Use of Data Agreement v1.0.
If you have any questions regarding the HiTab dataset or publication, please create an issue in this repository. You can also reach us at the e-mail addresses in the paper.