
posts/2019-06-04-using-awk-and-r-to-parse-25tb/index #10

Open
utterances-bot opened this issue Oct 11, 2023 · 1 comment
Comments

@utterances-bot

Using AWK and R to parse 25tb

Recently I was tasked with parsing 25tb of raw genotype data. This is the story of how I brought the query time and cost down from 8 minutes and $20 to a tenth of a second and less than a penny, plus the lessons learned along the way.

https://livefreeordichotomize.com/posts/2019-06-04-using-awk-and-r-to-parse-25tb/index.html


ercbk commented Oct 11, 2023

Very much enjoyed reading about your journey, especially the AWK stuff. Thanks for writing it.

I've never used Spark on a job remotely like this, but the unbalanced-partition problem sounds like data skew. I thought this was handled by Adaptive Query Execution (AQE), but AQE is only enabled by default from Apache Spark 3.2 onward, so I wonder if you were perhaps using an older version of Spark.

I was also surprised that dplyr::filter beat index-based filtering for your use case. You might want to try vctrs::vec_slice, as it's supposed to be the fastest of the three.

I'm sure you aren't currently interested in refactoring, given the performance of your solution and the amount of work you put into it, but I have some notes in my notebook on Spark optimization and skew that may help with a few of these issues if you ever decide to revisit it:
- Partitioning
- Data Skew
- Optimization
- Tidyverse Replacements
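
A minimal sketch of the three row-subsetting approaches mentioned above, timed with the bench package; the `snp_data` tibble and its columns are made-up stand-ins for the genotype data, not code from the post:

```r
library(dplyr)
library(vctrs)

n <- 1e6
snp_data <- tibble(
  chr      = sample(1:22, n, replace = TRUE),
  pos      = sample(1e7, n, replace = TRUE),
  genotype = sample(0:2, n, replace = TRUE)
)

# filter() evaluates the condition itself; the other two reuse a precomputed mask
keep <- snp_data$chr == 7 & snp_data$pos < 5e6

bench::mark(
  dplyr_filter = filter(snp_data, chr == 7, pos < 5e6),
  base_index   = snp_data[keep, ],
  vec_slice    = vec_slice(snp_data, keep),
  check = FALSE  # all three return the same rows; skip strict attribute checks
)
```

vec_slice is supposed to win because it skips most of the S3 dispatch and argument checking that `[` and filter() go through, though the gap matters only when the subsetting itself is the hot path.

And a small sparklyr sketch of switching on AQE and its skew-join handling explicitly; the `"local"` master is a placeholder, and these `spark.sql.adaptive.*` settings exist in Spark 3.x (on by default from 3.2):

```r
library(sparklyr)

config <- spark_config()
config$spark.sql.adaptive.enabled          <- "true"
config$spark.sql.adaptive.skewJoin.enabled <- "true"  # split oversized partitions at join time

sc <- spark_connect(master = "local", config = config)
```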
