
Update collect_data() to not bind all files into single giant .csv #25

Open
theamarks opened this issue Sep 20, 2024 · 0 comments

@theamarks (Member)

print(paste("adding", df_files[i]))
curr_df <- read.csv(df_files[i])
df <- df %>%
  bind_rows(curr_df)

This function currently duplicates all consumption and trade data in the AWS S3 bucket by combining every file into "all_hs_all_years". We do not need to do this on AWS, and it increases our I/O by many GB.

The code is also slow because bind_rows() holds the full combined data frame in local memory. It could be replaced with an append-only loop, e.g.:

library(data.table)

# Initialize the output file by writing the first file's header
fwrite(fread(df_files[1]), file = output_file)

# Loop over the remaining files and append them to the output file
# (seq_along(...)[-1] avoids the 2:length() bug when there is only one file)
for (i in seq_along(df_files)[-1]) {
  curr_df <- fread(df_files[i])
  fwrite(curr_df, file = output_file, append = TRUE)
  message(paste0("Adding ", i, "/", length(df_files), " files"))
}
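For comparison, here is a minimal sketch of the same append-only merge done outside of R with standard shell tools, assuming all the CSVs share an identical header row (the file names `part1.csv`, `part2.csv`, and `all_hs_all_years.csv` below are stand-ins, not the real bucket paths):

```shell
# Create two small CSVs with the same header, standing in for df_files
printf 'hs,year,value\n01,2020,10\n' > part1.csv
printf 'hs,year,value\n02,2021,20\n' > part2.csv

# Copy the first file whole (keeping its header), then append the
# remaining files with their header rows stripped (tail -n +2 skips line 1)
cp part1.csv all_hs_all_years.csv
tail -n +2 part2.csv >> all_hs_all_years.csv

cat all_hs_all_years.csv
```

Like the data.table version, this never loads more than one file's worth of data at a time, so memory use stays flat no matter how many files are merged.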
@theamarks theamarks added the 🪄 enhancement New functionality or feature request label Sep 20, 2024
@theamarks theamarks self-assigned this Sep 20, 2024