Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parquet support #10

Merged
merged 9 commits into from
Jan 24, 2025
Merged

Parquet support #10

merged 9 commits into from
Jan 24, 2025

Conversation

vkhodygo
Copy link
Member

@vkhodygo vkhodygo commented Oct 2, 2023

This should close #5.

I'd like to know what you think about this.

@vkhodygo vkhodygo added the enhancement New feature or request label Oct 2, 2023
@andrewbaxter439
Copy link
Member

This looks great! I'll aim to try testing it out ASAP and can likely be pulled. In previous tries of converting to parquet the three problems faced were:

  • Integer values with "null" being wrongly coded as "character"
  • Specifying schema for such cols as integer causing problems with "null" character values in column
  • Trailing commas being interpreted as empty column

As far as I can see your code seems to address all the above.

@andrewbaxter439
Copy link
Member

andrewbaxter439 commented Oct 4, 2023

Suggestions:

f.base <- scenario_id |>
paste("output", sep = "/") |>
paste(path_string, sep = "/") |>
paste("csv", sep = "/")

  • replace with file.path(scenario_id, "output", path_string, "csv") as a safer, system-agnostic, path specification?

paths_pattern = "^2023\\d+\\_([1-9]|[1234][0-9]|50)00$") {

  • This will stop working in 2024!
  • A less specific file path of say "^20\\d+_\\d+$" will still catch all directories of runs (until we run it in the year 2100) without any danger of missing runs with either higher seeds or seeds not ending in "00"

run = run + (batch_id - 1) * 20,

  • This makes sense, and if I'm following it correctly it will renumber the runs 1-1000 (rather than my old system of having 100-119, 200-219, ... 5000-5019 etc.)
  • Minor future compatibility issue will be if the batch size -b passed to do_full_run.sh changes from 20 at all this will break the hard-coded numbering system. Might be too niche a problem to worry about for now

map(\(x) convert.single_simulation(x, sid), .progress = TRUE)

  • furrr::future_map might be useful here to parallelise? Haven't tried running this yet - my own implementations used the annoying read_csv_arrow which loaded whole dataset (but didn't need schemas), whereas open_csv_arrow I presume should be faster/smoother - so parallelisation might not be possible/needed.

source("./schemas/schemas.R")

  • will check how this runs but I would have thought it should be source("R/schemas/schemas.R") if running from project root?

I've noted some of the above suggestions as tick-boxes. Check off any you don't think need actioned and we can discuss/patch others as needed.

@vkhodygo
Copy link
Member Author

vkhodygo commented Oct 31, 2023

I've fixed some of the issues, I also replaced a hardcoded postfix with a parameter, however, other methods do not incorporate this change yet.

  • TODO: finish this bit

We could employ furrr::future_map, but I have a few concerns. arrow is inherently parallel itself, and I don't want to create a cluster of threads that spawn threads of their own. This approach leads to potential bottlenecks and crashes, better safe than sorry. Besides, this code is reasonably performant as it is, at the moment it's about an hour per scenario.

Regarding your last point, I'm not sure how it works in R. Any suggestions?

P.S. I also would like you to take a look at convert.single_simulation. It's in need of code consolidation (switch?), could you please give it a go?

@andrewbaxter439
Copy link
Member

Hi Vlad, thanks for update. I was working on this the other week and hit some barriers to progress at the same time as my simpaths parallel runs stopped working all over again as I attempted to generate a test dataset to work on. Spent hours watching it repeatedly crash after a tiny number of runs, but then noticed that the home drive is 99% full?? Ran out of time to give this to solve that week. Will pick this up again in the coming weeks.

@vkhodygo
Copy link
Member Author

vkhodygo commented Nov 1, 2023

but then noticed that the home drive is 99% full

That's my fault, I apologize. For some reason quotas don't work, I also need a lot of space to store ML models and produced synthetic results. Meanwhile, try using /storage for that.

@andrewbaxter439
Copy link
Member

Finally got round to testing this after a lot of problems. I think with the tweaks in #12 it works perfectly in giving useable parquet files for baseline/reform runs.

Updated schemas for current SimPaths outputs
@andrewbaxter439 andrewbaxter439 marked this pull request as ready for review January 22, 2025 16:11
@vkhodygo vkhodygo merged commit 7b2e678 into main Jan 24, 2025
@vkhodygo vkhodygo deleted the parquet branch January 24, 2025 07:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

UPD: add missing parts
2 participants