Range of bins in bed file needs to exactly match all bins in matrix #21

Open
rikrdo89 opened this issue Oct 17, 2019 · 5 comments

@rikrdo89

rikrdo89 commented Oct 17, 2019

When I run TADtool on an iced matrix file and its corresponding BED file, both generated by the HiC-Pro pipeline, I get the following error. When I use the sparse matrix and BED file from the examples folder instead, TADtool works as expected.

2019-10-16 23:30:52,814 INFO Loading regions...
2019-10-16 23:30:53,000 INFO Checking plotting region in matrix...
2019-10-16 23:30:53,005 INFO Loading matrix...
Traceback (most recent call last):
  File "/home/smithlinares/anaconda2/envs/py3/lib/python3.7/site-packages/tadtool/tad.py", line 245, in load_matrix
    m = np.load(file_name)
  File "/home/smithlinares/anaconda2/envs/py3/lib/python3.7/site-packages/numpy/lib/npyio.py", line 457, in load
    raise ValueError("Cannot load file containing pickled data "
ValueError: Cannot load file containing pickled data when allow_pickle=False

On closer inspection, the matrices from the examples folder and the ones I got from HiC-Pro are similar. The BED files differ, however: HiC-Pro creates a BED file with 4 columns, the last of which contains the bin number.

$dmso_chr19_test.bed 
chr19	3160000	3200000	30127
chr19	3200000	3240000	30128
chr19	3240000	3280000	30129
chr19	3280000	3320000	30130
chr19	3320000	3360000	30131

$ chr12_20-35Mb_regions.bed 
chr12	19975000	20000000
chr12	20000000	20025000
chr12	20025000	20050000
chr12	20050000	20075000

So when I use the HiC-Pro matrix, which almost always contains a subset of all possible bins, TADtool fails... unless I also subset the BED file so that it begins and ends at the exact bins used in the matrix.
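
The workaround described above can be sketched in Python. This is only an illustration, not part of TADtool: it assumes a sparse matrix with one `bin_i<TAB>bin_j<TAB>value` entry per line and a 4-column BED file whose last column holds integer bin IDs. The function names `matrix_bin_range` and `subset_bed` are hypothetical, and the matrix values in the usage example are made up.

```python
def matrix_bin_range(matrix_lines):
    """Return (lowest, highest) bin ID found in a sparse matrix, or (None, None)."""
    lo = hi = None
    for line in matrix_lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        for bin_id in (int(fields[0]), int(fields[1])):
            lo = bin_id if lo is None else min(lo, bin_id)
            hi = bin_id if hi is None else max(hi, bin_id)
    return lo, hi


def subset_bed(bed_lines, lo, hi):
    """Keep 4-column BED lines whose bin ID (column 4) lies within [lo, hi]."""
    if lo is None or hi is None:
        return []  # empty matrix: nothing to keep
    kept = []
    for line in bed_lines:
        fields = line.split()
        if len(fields) >= 4 and lo <= int(fields[3]) <= hi:
            kept.append(line.rstrip("\n"))
    return kept


if __name__ == "__main__":
    bed = [
        "chr19\t3160000\t3200000\t30127",
        "chr19\t3200000\t3240000\t30128",
        "chr19\t3240000\t3280000\t30129",
        "chr19\t3280000\t3320000\t30130",
        "chr19\t3320000\t3360000\t30131",
    ]
    # Hypothetical matrix entries that only touch bins 30128 to 30130.
    matrix = ["30128\t30128\t12.5", "30128\t30130\t3.1", "30129\t30130\t7.0"]
    lo, hi = matrix_bin_range(matrix)
    for row in subset_bed(bed, lo, hi):
        print(row)
```

Keeping the whole range between the lowest and highest bin, rather than only the bins that appear in the matrix, preserves regions that have no contacts but still lie inside the plotted window.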

I am using tadtool v0.81 installed using conda with python3.

@kaukrise
Collaborator

kaukrise commented Oct 17, 2019

I'm a bit puzzled by the error you are seeing. Since version 0.81, the ValueError should be caught at this point in the function. Can you confirm that you are running the latest version by giving me the output of pip freeze | grep tadtool?

Edit: Is that the complete error you are seeing? Or is this followed by something else?

@rikrdo89
Author

I am using tadtool v0.81 installed using conda with python3.
conda install -c bioconda tadtool
When using the sample files from your repo, tadtool works without any issues. I can see the gui and set different thresholds.

There were a few more lines in the error. Here is the complete error message. Keep in mind that I was able to work around this error by subsetting the BED file to contain only the bins used in the matrix file.

> 2019-10-17 08:57:45,500 INFO Loading regions...
> 2019-10-17 08:57:45,502 INFO Checking plotting region in matrix...
> 2019-10-17 08:57:45,503 INFO Loading matrix...
> Traceback (most recent call last):
>   File "/home/smithlinares/anaconda2/envs/py3/lib/python3.7/site-packages/tadtool/tad.py", line 245, in load_matrix
>     m = np.load(file_name)
>   File "/home/smithlinares/anaconda2/envs/py3/lib/python3.7/site-packages/numpy/lib/npyio.py", line 457, in load
>     raise ValueError("Cannot load file containing pickled data "
> ValueError: Cannot load file containing pickled data when allow_pickle=False
> 
> During handling of the above exception, another exception occurred:
> 
> Traceback (most recent call last):
>   File "/home/smithlinares/anaconda2/envs/py3/bin/tadtool", line 447, in <module>
>     TADtool()
>   File "/home/smithlinares/anaconda2/envs/py3/bin/tadtool", line 44, in __init__
>     getattr(self, args.command)()
>   File "/home/smithlinares/anaconda2/envs/py3/bin/tadtool", line 142, in plot
>     matrix = tad.load_matrix(matrix_file, len(regions), ix_converter=ix_converter)
>   File "/home/smithlinares/anaconda2/envs/py3/lib/python3.7/site-packages/tadtool/tad.py", line 282, in load_matrix
>     source = ix_converter[fields[0]]
> KeyError: '6'

@kaukrise
Collaborator

Ah, that makes a lot more sense now. In general, the BED entries need to match the matrix bins exactly, just as you said. Because we know that people will most often work with matrix subsets, we have included the tadtool subset command, which takes care of the subsetting for you, starting from the complete Hi-C matrix.
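
The `KeyError: '6'` in the traceback suggests that a bin ID read from the matrix was not found in the lookup built from the regions file. A mismatch like this can be spotted before running TADtool with a small check such as the one below. It is a sketch under assumptions, not TADtool code: it assumes a 4-column BED file with integer bin IDs in the last column and a sparse `bin_i<TAB>bin_j<TAB>value` matrix, and `find_mismatches` is a hypothetical name.

```python
def find_mismatches(bed_lines, matrix_lines):
    """Compare bin IDs in a 4-column BED file with those used in a sparse matrix.

    Returns two sorted lists of bin IDs (as strings):
    IDs used in the matrix but absent from the BED file, and
    IDs listed in the BED file but never used in the matrix.
    """
    bed_ids = set()
    for line in bed_lines:
        fields = line.split()
        if len(fields) >= 4:
            bed_ids.add(fields[3])

    matrix_ids = set()
    for line in matrix_lines:
        fields = line.split()
        if len(fields) >= 3:
            matrix_ids.update(fields[:2])

    missing_from_bed = sorted(matrix_ids - bed_ids, key=int)
    unused_in_matrix = sorted(bed_ids - matrix_ids, key=int)
    return missing_from_bed, unused_in_matrix


if __name__ == "__main__":
    # Made-up toy data: the matrix refers to bin 6, which the BED file lacks.
    bed = ["chr1\t0\t10\t7", "chr1\t10\t20\t8", "chr1\t20\t30\t9"]
    matrix = ["6\t7\t1.0", "7\t8\t2.0"]
    missing, unused = find_mismatches(bed, matrix)
    print("in matrix but not in BED:", missing)
    print("in BED but not in matrix:", unused)
```

Only the first list points at a likely lookup failure. Bins that appear in the BED file but not in the matrix may simply have no recorded contacts.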

@rikrdo89
Author

I tried tadtool subset, but again it gave me an error message unless I started with the exact number of bins. I will subset using awk for now, but it would be nice to have this function implemented right in the tool.

Unrelated question: is there support for multi-threading or any other sort of parallelization? I notice that the tool uses very few resources, which results in long processing times for large files.

@kaukrise
Collaborator

If you can provide me with (small) sample files that throw the error when using mismatched BED and matrix files, I will see how simple it would be to implement this in TADtool. The requirement of matching region and matrix files also serves as an input sanity check, so we would probably issue a warning when a mismatch is detected.

We are currently not planning to implement a multi-threaded version of the insulation score calculation. Our recommendation, as outlined in the README, is to run TADtool on individual chromosomes (you can easily parallelise manually with tadtool tads on each chromosome that way). If a chromosome still takes too much time, you can further subset the chromosome to a smaller region, use that to identify suitable parameters, and then run tadtool tads on those parameters alone, which should be feasible.
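
The manual per-chromosome parallelisation mentioned above could look roughly like this in Python. It is a generic sketch, not something shipped with TADtool: `run_all` is a hypothetical helper that runs a list of commands as separate processes, and because the exact arguments of `tadtool tads` are not shown in this thread, the caller supplies each command in full. The demo uses stand-in commands so that it runs anywhere.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_all(commands, max_workers=4):
    """Run each command (a list of arguments) as its own process.

    Returns the exit codes in the same order as the commands.
    Threads are enough here because the work happens in the child processes.
    """
    def run(cmd):
        return subprocess.run(cmd, check=False).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run, commands))


if __name__ == "__main__":
    # Stand-in commands; in practice each entry would be one complete
    # `tadtool tads` invocation for a single chromosome.
    demo = [
        [sys.executable, "-c", "print('chr1 done')"],
        [sys.executable, "-c", "print('chr2 done')"],
    ]
    print(run_all(demo))
```

A non-zero exit code in the returned list marks a chromosome whose run failed and needs to be repeated.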

2 participants