read.bismark errors and warnings for large number of files #93

shawpa commented Jul 9, 2020

I am having issues with the read.bismark command. I have a large number of files that need to be analyzed: right now I am working with 24, but I'd like to be able to run up to 30 or more at any given time. I realize the problem is likely with R and the amount of memory it can allocate itself. I am working on a Linux cluster, and I have been trying to adhere to the "best practices" for this command as outlined in the vignette. I am using cytosine reports, so each file is about 57M lines, and I have 24 files, which I understand is a lot of data.

I wrote a loop so that read.bismark only deals with 1 chromosome at a time, but I am still running into errors. It is very strange: R will work for a chromosome or 2 and then fail. I then have to close out R, close out my terminal, and restart; if I just try to restart my script in R without shutting everything off, it generally just keeps failing. I haven't tried any of the multicore settings outlined in the vignette because I don't really understand what to do. The following is the read.bismark command I am running.

meth = bsseq::read.bismark(
    files = files,
    colData = data.frame(row.names = c("G1138","G1774","G642","G1641","G443","G965","G2043","G1342","G1354","G1451","G1299","G1462","G2241","G946","G1500","G2256","G1927","G1533","G2024","G2092","G335","G1787","G709","G1631")),
    rmZeroCov = FALSE,
    strandCollapse = FALSE,
    verbose = 2,
    loci = lociTemp
)

Please note that lociTemp is only the loci from 1 chromosome; the data files themselves contain all chromosomes, and I really don't want to have to split those up if I don't have to. In the rest of my script I've done other things, like removing files once they were finished, in order to preserve memory, and this seems to have gotten rid of the "forking errors".

Even when it works I get the following warning/error: "Error in mcexit(0L) : ignoring SIGPIPE signal". This is usually repeated about 20 times, but everything seems to work, so I have just been ignoring it. When it stops, I see the warning and then some verbiage about allocating an index less than a value. I wish I could provide the exact error, but of course when I wanted it to give me the error, it didn't. If I see it again, I will definitely post it.
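
In case it is useful, the loop I am running is roughly the following (simplified; allLoci and sampleIDs stand in for my real objects, and the output file names are just examples):

library(bsseq)
library(GenomicRanges)

for (chr in seqlevels(allLoci)) {
    lociTemp <- allLoci[seqnames(allLoci) == chr]   # candidate loci for one chromosome
    meth <- bsseq::read.bismark(
        files = files,
        colData = data.frame(row.names = sampleIDs),
        rmZeroCov = FALSE,
        strandCollapse = FALSE,
        verbose = 2,
        loci = lociTemp
    )
    saveRDS(meth, file = paste0("meth_", chr, ".rds"))  # save the per-chromosome BSseq object
    rm(meth, lociTemp)
    gc()   # free memory before the next chromosome
}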

Is there any other guidance you can provide to me to deal with a large number of files?

Annie

@PeteHaitch (Contributor)

Hi Annie,

It looks like you are using the (default) in-memory backend (equivalent to read.bismark(..., BACKEND = NULL)).
This may be causing you to run out of memory, because it tries to store all of the data in memory.
Have you tried using the HDF5Array backend (read.bismark(..., BACKEND = "HDF5Array"))?

It looks like you are also using the (default) parallelisation strategy (equivalent to read.bismark(..., BPPARAM = bpparam())).
This may be causing you to run out of memory, or your job to be killed by the cluster, if the default parallelisation strategy requests too many workers.
What is the value of bpparam() when you run it in the same R session as the one you use to run read.bismark()?
Have you tried using (slower) serial processing (read.bismark(..., BPPARAM = SerialParam()))?
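
Putting the two together, something like the following is what I have in mind (an untested sketch; the colData row names and the HDF5 directory are placeholders, and 'files'/'lociTemp' are the objects from your post):

library(bsseq)
library(BiocParallel)

meth <- read.bismark(
    files = files,
    colData = data.frame(row.names = basename(files)),  # or your own sample IDs
    rmZeroCov = FALSE,
    strandCollapse = FALSE,
    verbose = 2,
    loci = lociTemp,
    BACKEND = "HDF5Array",                     # write the 'M' and 'Cov' matrices to disk rather than RAM
    dir = file.path(tempdir(), "BSseq_HDF5"),  # directory where the HDF5 files are written
    BPPARAM = SerialParam()                    # or MulticoreParam(workers = 2) if serial is too slow
)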

Let me know if either of these help.

Cheers,
Pete

PS Could you also please include the output of BiocManager::valid()?

shawpa commented Jul 20, 2020

Thank you for your reply. Sorry it took me so long to respond, but I had to work on another project. I modified my code to use the HDF5Array backend, and it did seem to be "dumping data" into the specified temp folder, but I still got an error. Honestly, I am very confused by the documentation on BPPARAM and how to change it; I don't really have any idea what I should try. I am attaching the BiocManager::valid() output you requested.
biocmanager.txt

My new code with HDF5 array option is:

meth = bsseq::read.bismark(
    files = files,
    colData = data.frame(row.names = c("PL1142","PL1973","PL232","PL722","PL837","PL1103","PL171","PL2102","PL274","PL1523","PL1746","PL230","PL891","PL1487","PL1616","PL1814","PL449","PL865","PL875","PL1342","PL1500","PL2043","PL2241","PL2256","PL443","PL946","PL965","PL1177","PL1553","PL373","PL874","PL899","PL1274","PL1457","PL1540","PL2137","PL1138","PL1299","PL1354","PL1451","PL1462","PL1641","PL1774","PL642","PL1027","PL1085","PL721","PL816","PL862","PL1303","PL1545","PL1549","PL1674","PL2091","PL796B","PL921")),
    rmZeroCov = FALSE,
    strandCollapse = FALSE,
    verbose = 2,
    loci = lociTemp,
    BACKEND = "HDF5Array",
    dir = "/mnt/DATA/Cores/hiseq2000/annie/misc_methylation_analysis/PE_july20/temp"
)

The error message that I am still getting is:
Loading required package: rhdf5
[read.bismark] Using 'loci' as candidate loci.
[read.bismark] Parsing files and constructing 'M' and 'Cov' matrices ...
Error in result[[njob]] <- value :
  attempt to select less than one element in OneIndex
In addition: Warning message:
In parallel::mccollect(wait = FALSE, timeout = 1) :
  1 parallel job did not deliver a result

I am not sure if I did this right, but I think my value for BPPARAM is as follows:

> BPPARAM
class: MulticoreParam
  bpisup: FALSE; bpnworkers: 22; bptasks: 0; bpjobname: BPJOB
  bplog: FALSE; bpthreshold: INFO; bpstopOnError: TRUE
  bpRNGseed: ; bptimeout: 2592000; bpprogressbar: FALSE
  bpexportglobals: TRUE
  bplogdir: NA
  bpresultdir: NA
  cluster type: FORK
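
Based on your earlier suggestion, I am guessing the way to change this is to build the parameter myself and pass it in, something like the following (just my reading of the BiocParallel docs; the worker count is arbitrary):

library(BiocParallel)

bp <- MulticoreParam(workers = 4)   # fewer forked workers than the default 22
# bp <- SerialParam()               # or serial processing, to rule out forking problems
## then call read.bismark(..., BPPARAM = bp) as in the code above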

Thank you for your assistance.
