Inconsistencies between database schema and paper #38
Comments
Have the same issue. I also noticed that the …
Hello, thank you for your attention to Spider2. Regarding the average-number-of-columns statistic, we made the following adjustment: for tables with time as their only dimension, we ignore them if the total number of columns exceeds 10k, but we keep those with 3k-10k columns in the statistics. This is also the reason for the lower median; in fact, many databases have fewer than 800 columns in total. We include tables from different time periods in the statistics mainly because Snowflake SQL has no function like TABLE_SUFFIX to aggregate time-based tables, so we used this calculation method instead.
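One way to read this filter is per table. A minimal sketch under that reading (the function name and data layout below are illustrative only, not our actual pre-processing script; the follow-up comment below suggests the filter may instead be applied per database):

```python
# Illustrative sketch only: per-table column-count filter.
# Assumes a database is a dict mapping table names to lists of column names.
def filtered_column_count(tables, drop_threshold=10_000):
    """Sum column counts, skipping any single table wider than drop_threshold.

    Tables with 3k-10k columns are kept; wider time-sharded tables are ignored.
    """
    total = 0
    for name, columns in tables.items():
        if len(columns) > drop_threshold:
            continue  # ignore extremely wide time-partitioned tables
        total += len(columns)
    return total
```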
Thank you for pointing out our mistake. We have updated the DDL for mlb, github_repos_date, and deps_dev_v1.
@lfy79001 Thank you, this helps a lot. I also found one of the pre-processing scripts at … With this I was able to get the processed output, which, after filtering out all databases with > 10K columns, gave me a mean of 804 columns per database, matching the paper. Based on this, I believe you are dropping all 'databases' whose total column count exceeds 10K, not just the tables within those databases, correct?
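For what it's worth, the per-database reading I'm describing looks roughly like this (the names and data layout are my own illustration, not the repo's script):

```python
# Illustrative sketch: drop whole databases whose total column count
# exceeds 10K, then average the rest.
from statistics import mean

def mean_columns_per_database(databases, drop_threshold=10_000):
    """databases: dict mapping database name -> {table name: [column names]}."""
    totals = [
        sum(len(cols) for cols in tables.values())
        for tables in databases.values()
    ]
    kept = [t for t in totals if t <= drop_threshold]
    return mean(kept)
```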
@jeffeben21 |
Hi there, I've been checking out the data provided in Spider-lite and Spider-snow under `Spider2/tree/main/spider2-lite/resource/databases` and `Spider2/tree/main/spider2-snow/resource/databases`, and I'm having trouble determining which schema is the correct one to use. Based on the paper's Table 2, both Lite and Snow should have close to 800 columns per database on average. However, when I check the schemas, the mean number of columns per database is actually over 3000, primarily due to databases like `fec` having sub-databases such as `Spider2/spider2-lite/resource/databases/bigquery/fec/bigquery-public-data.census_bureau_acs` with 60K+ columns. The median, on the other hand, is much lower (I see a median of 164 columns per database when checking Spider-lite, using the provided JSON files to extract column names per table).

Additionally, for many databases the `DDL.csv` contains more tables than have corresponding `<table_name>.json` files. Is this expected, and is the true full dataset the set of tables in the `DDL.csv` files, or only the ones for which `.json` files exist?

I'm trying to extract the full database schemas, and all related context, to better understand the dataset. If there's a better way to do this, such as using the code originally used in the paper to compute these metrics, please let me know.
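For reference, the kind of tally I'm doing from the JSON files looks roughly like the sketch below. The `column_names` field and the exact directory layout are assumptions on my part; the real `<table_name>.json` structure may differ.

```python
# Rough sketch of a per-database column tally from the provided JSON files.
import json
from pathlib import Path
from statistics import mean, median

def column_counts(root="spider2-lite/resource/databases"):
    counts = {}
    for db_dir in sorted(p for p in Path(root).rglob("*") if p.is_dir()):
        json_files = list(db_dir.glob("*.json"))
        if not json_files:
            continue  # not a leaf database directory
        total = 0
        for f in json_files:
            table = json.loads(f.read_text())
            total += len(table.get("column_names", []))  # assumed field name
        counts[db_dir.name] = total
    return counts

if __name__ == "__main__":
    counts = column_counts()
    print("databases:", len(counts))
    print("mean columns:", mean(counts.values()))
    print("median columns:", median(counts.values()))
```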