
Inconsistencies between database schema and paper #38

Open
jeffeben21 opened this issue Dec 19, 2024 · 5 comments

Comments

@jeffeben21

Hi there, I've been checking out the data provided in Spider-lite and Spider-snow under Spider2/tree/main/spider2-lite/resource/databases and Spider2/tree/main/spider2-snow/resource/databases, and I'm having trouble determining which schema is the correct one to use. Based on the paper's Table 2, both Lite and Snow should have close to 800 columns per database on average. However, when I check the schemas, the mean number of columns per database is over 3000, primarily because databases like fec contain sub-databases such as Spider2/spider2-lite/resource/databases/bigquery/fec/bigquery-public-data.census_bureau_acs with 60K+ columns. The median is also much lower than the paper's figure: I see a median of 164 columns/database in Spider-lite when using the provided JSON files to extract column names per table.
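
For reference, here is roughly how I computed these numbers (a minimal sketch; it assumes each <table_name>.json carries a "column_names" list, which may not match the exact field names in every file):

```python
import json
from pathlib import Path
from statistics import mean, median

# Count total columns per database by summing column names across each
# database directory's table JSONs ("column_names" is my assumption about
# the schema field; adjust to match the actual files).
def columns_per_database(root: str) -> dict:
    counts = {}
    for db_dir in Path(root).iterdir():
        if not db_dir.is_dir():
            continue
        total = 0
        for table_json in db_dir.rglob("*.json"):
            schema = json.loads(table_json.read_text())
            total += len(schema.get("column_names", []))
        counts[db_dir.name] = total
    return counts

counts = columns_per_database("spider2-lite/resource/databases/bigquery")
print("mean:", mean(counts.values()), "| median:", median(counts.values()))
```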

Additionally, for many databases the DDL.csv contains more tables than have corresponding <table_name>.json files. Is this expected, and should the true full dataset be taken as the tables listed in the DDL.csv files, or only those for which .json files exist?

I'm trying to extract the full database schemas, along with all related context, to better understand the dataset. If there's a better way to do this, such as using the code originally used in the paper to compute these statistics, please let me know.

@thatmee

thatmee commented Dec 20, 2024

I have the same issue. I also noticed that the DDL column is missing from some of the DDL.csv files, such as GITHUB_REPOS_DATE.DAY and MLB.BASEBALL, among others.

@lfy79001
Collaborator

Hello, thank you for your interest in Spider2.

@jeffeben21 Regarding the average-column statistics, we made the following adjustments: for time-sharded tables (tables split by time period), we exclude them from the statistics when their total column count exceeds 10k, and keep them when the total is in the 3k-10k range. This is also why the median is lower; in fact, many databases have fewer than 800 columns in total. We count tables from different time periods separately mainly because Snowflake SQL has no function like BigQuery's _TABLE_SUFFIX to aggregate time-sharded tables, so we used this calculation method instead.
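
In rough pseudocode, the rule is something like the following (a simplified illustration of the description above, not the exact statistics script):

```python
# Simplified illustration of the filtering rule (not the exact script):
# time-sharded table families with more than 10k total columns are
# excluded from the column statistics; families in the 3k-10k range stay.
def keep_in_statistics(is_time_sharded: bool, total_columns: int) -> bool:
    if is_time_sharded and total_columns > 10_000:
        return False
    return True
```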

@lfy79001
Collaborator

@thatmee Thank you for pointing out our mistake. We have updated the DDL for mlb, github_repos_date, and deps_dev_v1.

@jeffeben21
Author

@lfy79001 Thank you, this helps a lot. I also found one of the pre-processing scripts at spider2-lite/baselines/dailsql/preprocessed_data/spider2_preprocess.py, which can be run by passing --dev spider2-lite, although it does require modifying one line of a source file to include an additional ../ in a path for compatibility with the current package layout. Similar scripts appear to exist for the other baseline methods. I think this may help others as well, since it combines all the input data context nicely.

With this I was able to get the processed output, which, after filtering out all databases with > 10K columns, gave me a mean of 804 columns per database, matching the paper. Based on this, I believe you are dropping all 'databases' whose total column count exceeds 10K, not just individual tables within those databases, correct?
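
Concretely, the check I ran looks like this (reusing the counts dict from my earlier sketch; my reconstruction of the filtering, not the paper's actual script):

```python
from statistics import mean

# Drop every database whose total column count exceeds 10K, then average
# the rest (my reconstruction of the filtering, not the paper's script).
filtered = {db: n for db, n in counts.items() if n <= 10_000}
print("mean columns/database:", mean(filtered.values()))  # ~804 in my run
```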

@lfy79001
Copy link
Collaborator

lfy79001 commented Dec 20, 2024

@jeffeben21 Yes, correct. We will clarify this in the next version of the paper. Thanks for pointing out the missing qualification.
