big_fread1 function and the possibility to parallelize it #11

Open
laurentGithub13 opened this issue Jan 20, 2023 · 8 comments
laurentGithub13 commented Jan 20, 2023

Hello,
I have a very large file to process and little RAM, so I decided to use the big_fread1 function. I use a window of 200 lines, and on every window I do the calculation I need on only three columns of my data frame. The script below works fine on a file of moderate size, but on my actual file it takes many, many hours. Could you please tell me whether it would be easy to parallelize this script? The calculation is independent in every window, and afterwards I could sort the results on the time variable to put them back together in the right order.

Thank you
Best Regards
Laurent

# my text file containing the data frame
csv <- "./txt/my_very_huge_file.txt"

my_results <- big_fread1(csv,
                         every_nlines = 200,
                         skip = 0,
                         header = TRUE,
                         .transform = function(df) {
                           df %>%
                             dplyr::select(tindex, param1, param2, param3) %>%
                             dplyr::summarise(time   = sum(range(tindex)) / 2,
                                              param1 = my_function(param1),
                                              param2 = my_function(param2),
                                              param3 = my_function(param3))
                         })
privefl (Owner) commented Jan 20, 2023

What are the file size, the number of lines, and the number of columns?

laurentGithub13 (Author)

File size: 6 GB
4 columns
81,210,693 lines
(Initially the file had 19 variables, which is why I do a dplyr::select in my script, but I have since found a way to reduce the number of columns.)

privefl (Owner) commented Jan 21, 2023

If you have 81M lines and only 4 columns, you should use a much larger every_nlines, maybe up to 1M.

laurentGithub13 (Author) commented Jan 22, 2023

Yes, I will try that. In fact, for physical reasons, I have to do my calculation on every 200 lines, which is why I chose every_nlines = 200. So I modified my script to do the calculation on every 200 lines within every chunk of 1e+6 lines, as below:

my_results <- big_fread1(csv,
                         every_nlines = 1e+6,
                         skip = 0,
                         header = TRUE,
                         .transform = function(df) {
                           cat("Current process: data between", min(df$tindex), "and", max(df$tindex), "\n")
                           df <- as.data.frame(df)
                           # split the chunk into windows of 200 lines (cut_number() comes from ggplot2)
                           df$group <- ordered(as.numeric(cut_number(df$tindex, n = 1e+6 / 200)))
                           df %>%
                             dplyr::select(tindex, param1, param2, param3, group) %>%
                             dplyr::group_by(group) %>%
                             dplyr::summarise(time   = sum(range(tindex)) / 2,
                                              param1 = my_function(param1),
                                              param2 = my_function(param2),
                                              param3 = my_function(param3))
                         })
I have launched the script...

laurentGithub13 (Author)

The splitting is faster, but I still have the same problem with the overall time. In fact, it is my calculation that takes a long time. Would it be possible to parallelize the work done on every chunk of lines?

privefl (Owner) commented Jan 24, 2023

Yes, it should be possible.
You can just adapt the source code of the function by replacing lapply() with foreach().
You should not need to redo the splitting of the file (the first step).
If you do not know how to use foreach for parallelism, have a look at this tutorial.
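
Roughly, it could look like the sketch below (untested; it assumes a doParallel backend with 4 cores, replaces the cut_number() grouping with a plain row-index grouping, and reuses the per-chunk summary from your last comment). split_file() and get_split_files() are the bigreadr helpers that take care of the splitting step:

library(bigreadr)
library(dplyr)
library(foreach)
library(doParallel)

# 1) Split the file once (sequential); each part keeps the header line.
infos      <- split_file(csv, every_nlines = 1e+6, repeat_header = TRUE)
file_parts <- get_split_files(infos)

# 2) Process the parts in parallel instead of with lapply().
registerDoParallel(cores = 4)
my_results <- foreach(part = file_parts, .combine = rbind,
                      .packages = c("bigreadr", "dplyr"),
                      .export = "my_function") %dopar% {
  df <- fread2(part)
  # 200-line windows within the chunk (simple row index instead of cut_number())
  df$group <- (seq_len(nrow(df)) - 1) %/% 200
  # optionally file.remove(part) here to free disk space as you go
  df %>%
    group_by(group) %>%
    summarise(time   = sum(range(tindex)) / 2,
              param1 = my_function(param1),
              param2 = my_function(param2),
              param3 = my_function(param3)) %>%
    select(-group)
}
stopImplicitCluster()

# 3) Restore the global order using the time variable.
my_results <- my_results[order(my_results$time), ]

foreach() returns the chunk results in the order of file_parts, so the final sort on time is mostly a safety net.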

laurentGithub13 (Author)

Thank you for the link. I will look at it and try to do something.
Thanks

privefl (Owner) commented Aug 17, 2023

Any update on this?
