
Update collect_data() to not bind all files into single giant .csv #25

Open
theamarks opened this issue Sep 20, 2024 · 0 comments

@theamarks (Member)

print(paste("adding", df_files[i]))
curr_df <- read.csv(df_files[i])
df <- df %>%
  bind_rows(curr_df)

This function currently duplicates all consumption and trade data in the AWS S3 bucket by combining every file into "all_hs_all_years". We do not need to do this on AWS, and it increases our I/O by many GB.

The code is also slow because bind_rows() holds the full combined data frame in local memory. It could be replaced with an append-only loop, e.g.:

library(data.table)

# Initialize the output file by writing the first file's header
fwrite(fread(df_files[1]), file = output_file)

# Loop over the remaining files and append them to the output file
# (seq_along(...)[-1] avoids the 2:length() bug when there is only one file)
for (i in seq_along(df_files)[-1]) {
  curr_df <- fread(df_files[i])
  fwrite(curr_df, file = output_file, append = TRUE)
  message(paste0("Adding ", i, "/", length(df_files), " files"))
}
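For comparison, here is a minimal sketch of the same append-only merge done outside of R with standard shell tools, assuming all the CSVs share an identical header row (the file names `part1.csv`, `part2.csv`, and `all_hs_all_years.csv` below are stand-ins, not the real bucket paths):

```shell
# Create two small CSVs with the same header, standing in for df_files
printf 'hs,year,value\n01,2020,10\n' > part1.csv
printf 'hs,year,value\n02,2021,20\n' > part2.csv

# Copy the first file whole (keeping its header), then append the
# remaining files with their header rows stripped (tail -n +2 skips line 1)
cp part1.csv all_hs_all_years.csv
tail -n +2 part2.csv >> all_hs_all_years.csv

cat all_hs_all_years.csv
```

Like the data.table version, this never loads more than one file's worth of data at a time, so memory use stays flat no matter how many files are merged.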
@theamarks theamarks added the 🪄 enhancement New functionality or feature request label Sep 20, 2024
@theamarks theamarks self-assigned this Sep 20, 2024