read.bismark errors and warnings for large number of files #93

shawpa commented Jul 9, 2020

I am having issues with the read.bismark command. I have a large number of files that need to be analyzed: right now I am working with 24, but I'd like to be able to run up to 30 or more at any given time. I realize the problem is likely with R and the amount of memory it can allocate itself. I am working on a Linux cluster, and I have been trying to adhere to the "best practices" for this command as outlined in the vignette. I am using cytosine reports, so each file is about 57M lines, and I have 24 files, which I understand is a lot of data.

I wrote a loop so that read.bismark only deals with 1 chromosome at a time, but I am still running into errors. It is very strange: R will work for a chromosome or 2 and then fail. I then have to close out R, close out my terminal, and restart; if I just try to restart my script in R without shutting everything off, it generally just keeps failing. I haven't tried any of the multicore settings outlined in the vignette because I don't really understand what to do. The following is the read.bismark command I am running.

meth = bsseq::read.bismark(
    files = files,
    colData = data.frame(row.names = c("G1138","G1774","G642","G1641","G443","G965","G2043","G1342","G1354","G1451","G1299","G1462","G2241","G946","G1500","G2256","G1927","G1533","G2024","G2092","G335","G1787","G709","G1631")),
    rmZeroCov = FALSE,
    strandCollapse = FALSE,
    verbose = 2,
    loci = lociTemp
)

Please note that lociTemp is only the loci from 1 chromosome; the data files themselves contain all chromosomes, and I really don't want to have to split those up if I don't have to. In the rest of my script I've done other things, like removing files once they were finished, in order to preserve memory, and this seems to have gotten rid of the "forking errors".

Even when it works I get the following warning/error: "Error in mcexit(0L) : ignoring SIGPIPE signal". This is usually repeated about 20 times, but everything seems to work, so I have just been ignoring it. When it stops, I see the warning and then some verbiage about allocating an index less than a value. I wish I could provide the exact error, but of course when I wanted it to give me the error, it didn't. If I see it again, I will definitely post it.
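
In case it is useful, the loop I am running is roughly the following (simplified; allLoci and sampleIDs stand in for my real objects, and the output file names are just examples):

library(bsseq)
library(GenomicRanges)

for (chr in seqlevels(allLoci)) {
    lociTemp <- allLoci[seqnames(allLoci) == chr]   # candidate loci for one chromosome
    meth <- bsseq::read.bismark(
        files = files,
        colData = data.frame(row.names = sampleIDs),
        rmZeroCov = FALSE,
        strandCollapse = FALSE,
        verbose = 2,
        loci = lociTemp
    )
    saveRDS(meth, file = paste0("meth_", chr, ".rds"))  # save the per-chromosome BSseq object
    rm(meth, lociTemp)
    gc()   # free memory before the next chromosome
}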

Is there any other guidance you can provide to me to deal with a large number of files?

Annie

@PeteHaitch (Contributor)

Hi Annie,

It looks like you are using the (default) in-memory backend (equivalent to read.bismark(..., BACKEND = NULL)).
This may be causing you to run out of memory, because it tries to store all of the data in memory.
Have you tried using the HDF5Array backend (read.bismark(..., BACKEND = "HDF5Array"))?

It looks like you are also using the (default) parallelisation strategy (equivalent to read.bismark(..., BPPARAM = bpparam())).
This may be causing you to run out of memory, or your job to be killed by the cluster, if the default parallelisation strategy requests too many workers.
What is the value of bpparam() when you run it in the same R session as the one you use to run read.bismark()?
Have you tried using (slower) serial processing (read.bismark(..., BPPARAM = SerialParam()))?
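
Putting the two together, something like the following is what I have in mind (an untested sketch; the colData row names and the HDF5 directory are placeholders, and 'files'/'lociTemp' are the objects from your post):

library(bsseq)
library(BiocParallel)

meth <- read.bismark(
    files = files,
    colData = data.frame(row.names = basename(files)),  # or your own sample IDs
    rmZeroCov = FALSE,
    strandCollapse = FALSE,
    verbose = 2,
    loci = lociTemp,
    BACKEND = "HDF5Array",                     # write the 'M' and 'Cov' matrices to disk rather than RAM
    dir = file.path(tempdir(), "BSseq_HDF5"),  # directory where the HDF5 files are written
    BPPARAM = SerialParam()                    # or MulticoreParam(workers = 2) if serial is too slow
)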

Let me know if either of these help.

Cheers,
Pete

PS Could you also please include the output of BiocManager::valid()?

shawpa commented Jul 20, 2020

Thank you for your reply. Sorry it took me so long to respond, but I had to work on another project. I modified my code to use the HDF5Array backend, and it did seem to be "dumping data" into the specified temp folder, but I still got an error. Honestly, I am very confused by the documentation on BPPARAM and how to change it; I don't really have any idea what I should try. I am attaching the BiocManager::valid() output you requested.
biocmanager.txt

My new code with HDF5 array option is:

meth = bsseq::read.bismark(
    files = files,
    colData = data.frame(row.names = c("PL1142","PL1973","PL232","PL722","PL837","PL1103","PL171","PL2102","PL274","PL1523","PL1746","PL230","PL891","PL1487","PL1616","PL1814","PL449","PL865","PL875","PL1342","PL1500","PL2043","PL2241","PL2256","PL443","PL946","PL965","PL1177","PL1553","PL373","PL874","PL899","PL1274","PL1457","PL1540","PL2137","PL1138","PL1299","PL1354","PL1451","PL1462","PL1641","PL1774","PL642","PL1027","PL1085","PL721","PL816","PL862","PL1303","PL1545","PL1549","PL1674","PL2091","PL796B","PL921")),
    rmZeroCov = FALSE,
    strandCollapse = FALSE,
    verbose = 2,
    loci = lociTemp,
    BACKEND = "HDF5Array",
    dir = "/mnt/DATA/Cores/hiseq2000/annie/misc_methylation_analysis/PE_july20/temp"
)

The error message that I am still getting is:
Loading required package: rhdf5
[read.bismark] Using 'loci' as candidate loci.
[read.bismark] Parsing files and constructing 'M' and 'Cov' matrices ...
Error in result[[njob]] <- value :
  attempt to select less than one element in OneIndex
In addition: Warning message:
In parallel::mccollect(wait = FALSE, timeout = 1) :
  1 parallel job did not deliver a result

I am not sure if I did this right, but I think my value for BPPARAM is as follows:

> BPPARAM
class: MulticoreParam
  bpisup: FALSE; bpnworkers: 22; bptasks: 0; bpjobname: BPJOB
  bplog: FALSE; bpthreshold: INFO; bpstopOnError: TRUE
  bpRNGseed: ; bptimeout: 2592000; bpprogressbar: FALSE
  bpexportglobals: TRUE
  bplogdir: NA
  bpresultdir: NA
  cluster type: FORK
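
Based on your earlier suggestion, I am guessing the way to change this is to build the parameter myself and pass it in, something like the following (just my reading of the BiocParallel docs; the worker count is arbitrary):

library(BiocParallel)

bp <- MulticoreParam(workers = 4)   # fewer forked workers than the default 22
# bp <- SerialParam()               # or serial processing, to rule out forking problems
## then call read.bismark(..., BPPARAM = bp) as in the code above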

Thank you for your assistance.
