
Inconsistencies between database schema and paper #38

Open
jeffeben21 opened this issue Dec 19, 2024 · 5 comments

Comments

@jeffeben21

Hi there, I've been checking out the data provided in Spider-lite and Spider-snow under Spider2/tree/main/spider2-lite/resource/databases and Spider2/tree/main/spider2-snow/resource/databases, and I'm having trouble determining which schema is the correct one to use. Based on the paper's Table 2, both Lite and Snow should have close to 800 columns per database on average. However, when I check the schemas, the mean number of columns per database is over 3000, primarily because databases like fec contain sub-databases such as Spider2/spider2-lite/resource/databases/bigquery/fec/bigquery-public-data.census_bureau_acs with 60K+ columns. The median is also much lower than the paper's figure: I see a median of 164 columns/database in Spider-lite when using the provided JSON files to extract column names per table.
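
For reference, here is roughly how I computed these numbers (a minimal sketch; it assumes each <table_name>.json carries a "column_names" list, which may not match the exact field names in every file):

```python
import json
from pathlib import Path
from statistics import mean, median

# Count total columns per database by summing column names across each
# database directory's table JSONs ("column_names" is my assumption about
# the schema field; adjust to match the actual files).
def columns_per_database(root: str) -> dict:
    counts = {}
    for db_dir in Path(root).iterdir():
        if not db_dir.is_dir():
            continue
        total = 0
        for table_json in db_dir.rglob("*.json"):
            schema = json.loads(table_json.read_text())
            total += len(schema.get("column_names", []))
        counts[db_dir.name] = total
    return counts

counts = columns_per_database("spider2-lite/resource/databases/bigquery")
print("mean:", mean(counts.values()), "| median:", median(counts.values()))
```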

Additionally, for many databases the DDL.csv contains more tables than have corresponding <table_name>.json files. Is this expected, and should the true full dataset be taken as the tables listed in the DDL.csv files, or only those for which .json files exist?

I'm trying to extract the full database schemas, along with all related context, to better understand the dataset. If there's a better way to do this, such as using the code originally used in the paper to compute these statistics, please let me know.

@thatmee

thatmee commented Dec 20, 2024

I have the same issue. I also noticed that the DDL column is missing from some of the DDL.csv files, such as GITHUB_REPOS_DATE.DAY and MLB.BASEBALL, among others.

@lfy79001
Collaborator

Hello, thank you for your interest in Spider2.

@jeffeben21 Regarding the average-column statistics, we made the following adjustments: for time-sharded tables (tables split by time period), we exclude them from the statistics when their total column count exceeds 10k, and keep them when the total is in the 3k-10k range. This is also why the median is lower; in fact, many databases have fewer than 800 columns in total. We count tables from different time periods separately mainly because Snowflake SQL has no function like BigQuery's _TABLE_SUFFIX to aggregate time-sharded tables, so we used this calculation method instead.
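
In rough pseudocode, the rule is something like the following (a simplified illustration of the description above, not the exact statistics script):

```python
# Simplified illustration of the filtering rule (not the exact script):
# time-sharded table families with more than 10k total columns are
# excluded from the column statistics; families in the 3k-10k range stay.
def keep_in_statistics(is_time_sharded: bool, total_columns: int) -> bool:
    if is_time_sharded and total_columns > 10_000:
        return False
    return True
```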

@lfy79001
Collaborator

@thatmee Thank you for pointing out our mistake. We have updated the DDL for mlb, github_repos_date, and deps_dev_v1.

@jeffeben21
Author

@lfy79001 Thank you, this helps a lot. I also found one of the pre-processing scripts at spider2-lite/baselines/dailsql/preprocessed_data/spider2_preprocess.py, which can be run by passing --dev spider2-lite, although it does require modifying one line of a source file to include an additional ../ in a path for compatibility with the current package layout. Similar scripts appear to exist for the other baseline methods. I think this may help others as well, since it combines all the input data context nicely.

With this I was able to get the processed output, which, after filtering out all databases with > 10K columns, gave me a mean of 804 columns per database, matching the paper. Based on this, I believe you are dropping all 'databases' whose total column count exceeds 10K, not just individual tables within those databases, correct?
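
Concretely, the check I ran looks like this (reusing the counts dict from my earlier sketch; my reconstruction of the filtering, not the paper's actual script):

```python
from statistics import mean

# Drop every database whose total column count exceeds 10K, then average
# the rest (my reconstruction of the filtering, not the paper's script).
filtered = {db: n for db, n in counts.items() if n <= 10_000}
print("mean columns/database:", mean(filtered.values()))  # ~804 in my run
```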

@lfy79001
Copy link
Collaborator

lfy79001 commented Dec 20, 2024

@jeffeben21 Yes, correct. We will clarify this in the next version of the paper. Thanks for pointing out the missing qualification.
