big_fread1 function and the possibility to parallelize it #11

Open
laurentGithub13 opened this issue Jan 20, 2023 · 8 comments
laurentGithub13 commented Jan 20, 2023

Hello,
I have a very large file to process and little RAM, so I decided to use the big_fread1 function. I use a window of 200 lines, and on every window I do the calculation I need on only three columns of my data frame. The script below works fine on a file of moderate size, but on my actual file it takes many, many hours. Could you please tell me whether it would be easy to parallelize this script? The calculation is independent in every window, and afterwards I could sort the results on the time variable to put them back together in the right order.

Thank you
Best Regards
Laurent

# my text file containing the data frame
csv <- "./txt/my_very_huge_file.txt"

my_results <- big_fread1(csv,
                         every_nlines = 200,
                         skip = 0,
                         header = TRUE,
                         .transform = function(df) {
                           df %>%
                             dplyr::select(tindex, param1, param2, param3) %>%
                             dplyr::summarise(time   = sum(range(tindex)) / 2,
                                              param1 = my_function(param1),
                                              param2 = my_function(param2),
                                              param3 = my_function(param3))
                         })
privefl (Owner) commented Jan 20, 2023

What are the file size, the number of lines, and the number of columns?

laurentGithub13 (Author)

File size: 6 GB
4 columns
81,210,693 lines
(Initially the file had 19 variables, which is why I do a dplyr::select in my script, but I have since found a way to reduce the number of columns.)

privefl (Owner) commented Jan 21, 2023

If you have 81M lines and only 4 columns, you should use a much larger every_nlines, maybe up to 1M.

laurentGithub13 (Author) commented Jan 22, 2023

Yes, I will try that. In fact, for physical reasons, I have to do my calculation on every 200 lines, which is why I chose every_nlines = 200. So I modified my script to do the calculation on every 200 lines within every chunk of 1e+6 lines, as below:

my_results <- big_fread1(csv,
                         every_nlines = 1e+6,
                         skip = 0,
                         header = TRUE,
                         .transform = function(df) {
                           cat("Current process: data between", min(df$tindex), "and", max(df$tindex), "\n")
                           df <- as.data.frame(df)
                           # split the chunk into windows of 200 lines (cut_number() comes from ggplot2)
                           df$group <- ordered(as.numeric(cut_number(df$tindex, n = 1e+6 / 200)))
                           df %>%
                             dplyr::select(tindex, param1, param2, param3, group) %>%
                             dplyr::group_by(group) %>%
                             dplyr::summarise(time   = sum(range(tindex)) / 2,
                                              param1 = my_function(param1),
                                              param2 = my_function(param2),
                                              param3 = my_function(param3))
                         })
I have launched the script...

laurentGithub13 (Author)

The splitting is faster, but I still have the same problem with the overall time. In fact, it is my calculation that takes a long time. Would it be possible to parallelize the work done on every chunk of lines?

privefl (Owner) commented Jan 24, 2023

Yes, it should be possible.
You can just adapt the source code of the function by replacing lapply() with foreach().
You should not need to redo the splitting of the file (the first step).
If you do not know how to use foreach for parallelism, have a look at this tutorial.
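
Roughly, it could look like the sketch below (untested; it assumes a doParallel backend with 4 cores, replaces the cut_number() grouping with a plain row-index grouping, and reuses the per-chunk summary from your last comment). split_file() and get_split_files() are the bigreadr helpers that take care of the splitting step:

library(bigreadr)
library(dplyr)
library(foreach)
library(doParallel)

# 1) Split the file once (sequential); each part keeps the header line.
infos      <- split_file(csv, every_nlines = 1e+6, repeat_header = TRUE)
file_parts <- get_split_files(infos)

# 2) Process the parts in parallel instead of with lapply().
registerDoParallel(cores = 4)
my_results <- foreach(part = file_parts, .combine = rbind,
                      .packages = c("bigreadr", "dplyr"),
                      .export = "my_function") %dopar% {
  df <- fread2(part)
  # 200-line windows within the chunk (simple row index instead of cut_number())
  df$group <- (seq_len(nrow(df)) - 1) %/% 200
  # optionally file.remove(part) here to free disk space as you go
  df %>%
    group_by(group) %>%
    summarise(time   = sum(range(tindex)) / 2,
              param1 = my_function(param1),
              param2 = my_function(param2),
              param3 = my_function(param3)) %>%
    select(-group)
}
stopImplicitCluster()

# 3) Restore the global order using the time variable.
my_results <- my_results[order(my_results$time), ]

foreach() returns the chunk results in the order of file_parts, so the final sort on time is mostly a safety net.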

laurentGithub13 (Author)

Thank you for the link. I will look at it and try to do something.
Thanks

privefl (Owner) commented Aug 17, 2023

Any update on this?
