An error occurred while generating random regions. #14

Open
jinqiyuan1 opened this issue Jan 9, 2025 · 3 comments
@jinqiyuan1

  • Does this software have any other requirements for BAM files? I provided a sorted BAM file and ran the following command:
    python /public/home/yjq/tools/cfDNA_GCcorrection/cfDNA_GCcorrection/computeGCBias_background.py
    -b /public/home/yjq/projects/PA_projects/data/NBT_WGS/bamfilter/SRR17478154_filter.sorted.bam
    -g /public/home/yjq/genome_anno/hg19/hg19_UCSC.2bit
    -p 2
    -i
    --output /public/home/yjq/projects/PA_projects/data/NBT_WGS/GC_correction/background/
    --debug

  • The following error occurs:
    Traceback (most recent call last):
      File "/public/home/yjq/tools/cfDNA_GCcorrection/cfDNA_GCcorrection/computeGCBias_background.py", line 596, in <module>
        main()
      File "/public/home/yjq/.local/lib/python3.8/site-packages/click/core.py", line 1161, in __call__
        return self.main(*args, **kwargs)
      File "/public/home/yjq/.local/lib/python3.8/site-packages/click/core.py", line 1082, in main
        rv = self.invoke(ctx)
      File "/public/home/yjq/.local/lib/python3.8/site-packages/click/core.py", line 1443, in invoke
        return ctx.invoke(self.callback, **ctx.params)
      File "/public/home/yjq/.local/lib/python3.8/site-packages/click/core.py", line 788, in invoke
        return __callback(*args, **kwargs)
      File "/public/home/yjq/tools/cfDNA_GCcorrection/cfDNA_GCcorrection/computeGCBias_background.py", line 549, in main
        regions = get_regions(
      File "/public/home/yjq/tools/cfDNA_GCcorrection/cfDNA_GCcorrection/computeGCBias_background.py", line 125, in get_regions
        random_regions.to_dataframe(
      File "/public/home/yjq/miniconda3/envs/celfeer_env/lib/python3.8/site-packages/pybedtools/bedtool.py", line 3762, in to_dataframe
        return pandas.read_csv(self.fn, *args, sep="\t", **kwargs) # type: ignore
      File "/public/home/yjq/miniconda3/envs/celfeer_env/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 912, in read_csv
        return _read(filepath_or_buffer, kwds)
      File "/public/home/yjq/miniconda3/envs/celfeer_env/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 577, in _read
        parser = TextFileReader(filepath_or_buffer, **kwds)
      File "/public/home/yjq/miniconda3/envs/celfeer_env/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1407, in __init__
        self._engine = self._make_engine(f, self.engine)
      File "/public/home/yjq/miniconda3/envs/celfeer_env/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1679, in _make_engine
        return mapping[engine](f, **self.options)
      File "/public/home/yjq/miniconda3/envs/celfeer_env/lib/python3.8/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 93, in __init__
        self._reader = parsers.TextReader(src, **kwds)
      File "pandas/_libs/parsers.pyx", line 557, in pandas._libs.parsers.TextReader.__cinit__
    pandas.errors.EmptyDataError: No columns to parse from file.

  • I am sure that I have successfully installed pandas and pybedtools. My deeptools version is 3.5.5, pandas version is 2.0.3, bedtools version is v2.31.1, pybedtools version is 0.11.0, and the Python version is 3.8.19.
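The `EmptyDataError` above is what pandas raises when it is asked to parse a file with no content, which here is the temporary regions file. As a rough illustration only (this is not the tool's code; `read_bed_regions` and `write_temp_bed` are hypothetical helpers), a reader can check for the empty case itself and fail with a message that names the actual problem:

```python
import csv
import os
import tempfile
from typing import List, Tuple


def read_bed_regions(path: str) -> List[Tuple[str, int, int]]:
    """Read (chrom, start, end) rows from a tab-separated BED file.

    Hypothetical helper: raises a descriptive ValueError when the file
    holds no regions instead of failing later inside a generic parser.
    """
    regions = []
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if not row or row[0].startswith("#"):
                continue  # skip blank lines and comments
            regions.append((row[0], int(row[1]), int(row[2])))
    if not regions:
        raise ValueError(f"{path}: no regions found, nothing to sample from")
    return regions


def write_temp_bed(text: str) -> str:
    """Write `text` to a temporary .bed file and return its path."""
    handle = tempfile.NamedTemporaryFile("w", suffix=".bed", delete=False)
    with handle:
        handle.write(text)
    return handle.name


good = write_temp_bed("chr1\t100\t200\nchr2\t50\t75\n")
empty = write_temp_bed("")
regions = read_bed_regions(good)
try:
    read_bed_regions(empty)
    empty_message = None
except ValueError as err:
    empty_message = str(err)
finally:
    os.remove(good)
    os.remove(empty)
```

With a check like this, an empty regions file surfaces as a single clear error rather than a long parser traceback.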

@sroener sroener self-assigned this Jan 9, 2025
@sroener
Collaborator

sroener commented Jan 9, 2025

Hi, from the error traceback it looks like no regions were selected.

Could you please run the command again with the --debug flag and share the resulting output?

Please make sure that you provide the same reference version (e.g. hg19/hg38) that the bam file was mapped to.

Additionally, if possible, it would be great if you could share your BAM file and your whole software environment, so that I can see why no regions were created. One possible cause is that your reference 2bit file and your BAM file do not share common chromosome names. This should normally be handled by an automatically created mapping between the two files.
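To check the chromosome-name hypothesis by hand, the two name lists can be compared directly. The sketch below is only an illustration and not the mapping logic used by the tool: `strip_chr` and `shared_chromosomes` are hypothetical helpers, and the name lists are made-up examples. In practice the names could be read, for instance, with pysam's `AlignmentFile.references` for the BAM file and py2bit's `chroms()` for the 2bit file.

```python
from typing import Dict, Iterable


def strip_chr(name: str) -> str:
    """Drop a leading 'chr' so that 'chr1' and '1' compare equal."""
    return name[3:] if name.startswith("chr") else name


def shared_chromosomes(reference_names: Iterable[str],
                       bam_names: Iterable[str]) -> Dict[str, str]:
    """Map each reference name to the BAM name that matches it.

    Names match when they are equal after removing a 'chr' prefix.
    Other naming differences (e.g. 'chrM' vs 'MT') are not handled here.
    """
    bam_by_key = {strip_chr(name): name for name in bam_names}
    return {
        ref: bam_by_key[strip_chr(ref)]
        for ref in reference_names
        if strip_chr(ref) in bam_by_key
    }


# Made-up example name lists, not taken from the reported files.
reference = ["chr1", "chr2", "chrX"]
bam_prefix_free = ["1", "2", "X", "MT"]
bam_accessions = ["NC_000001.10", "NC_000002.11"]

mapping_ok = shared_chromosomes(reference, bam_prefix_free)
mapping_empty = shared_chromosomes(reference, bam_accessions)
```

An empty mapping, as in the second case, would mean that no region on the reference can be matched to reads in the BAM file.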

@jinqiyuan1
Author

Thank you for your enthusiastic response. Your reply provided me with ideas on how to solve the problem, and I successfully ran the computeGCBias_background.py script by replacing the reference .2bit file as you suggested. However, I encountered the following issues in subsequent tests.

  1. When I run the computeGCBias_readlen script with the -i parameter, it executes successfully and produces the result file. Without this parameter, it fails with the error below. According to --help, the purpose of this parameter is to reduce precision but improve speed, so omitting it should probably not result in an error.
    /public/home/yjq/miniconda3/envs/celfeer_env/bin/computeGCBias_readlen:4: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
      __import__('pkg_resources').require('cfDNA-GCcorrection==0.1')
    Traceback (most recent call last):
      File "/public/home/yjq/miniconda3/envs/celfeer_env/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3653, in get_loc
        return self._engine.get_loc(casted_key)
      File "pandas/_libs/index.pyx", line 147, in pandas._libs.index.IndexEngine.get_loc
      File "pandas/_libs/index.pyx", line 176, in pandas._libs.index.IndexEngine.get_loc
      File "pandas/_libs/hashtable_class_helper.pxi", line 2606, in pandas._libs.hashtable.Int64HashTable.get_item
      File "pandas/_libs/hashtable_class_helper.pxi", line 2630, in pandas._libs.hashtable.Int64HashTable.get_item
    KeyError: 31

    The above exception was the direct cause of the following exception:

    Traceback (most recent call last):
      File "/public/home/yjq/miniconda3/envs/celfeer_env/bin/computeGCBias_readlen", line 7, in <module>
        exec(compile(f.read(), __file__, 'exec'))
      File "/public/home/yjq/tools/cfDNA_GCcorrection/bin/computeGCBias_readlen", line 12, in <module>
        main(args)
      File "/public/home/yjq/.local/lib/python3.8/site-packages/click/core.py", line 1161, in __call__
        return self.main(*args, **kwargs)
      File "/public/home/yjq/.local/lib/python3.8/site-packages/click/core.py", line 1082, in main
        rv = self.invoke(ctx)
      File "/public/home/yjq/.local/lib/python3.8/site-packages/click/core.py", line 1443, in invoke
        return ctx.invoke(self.callback, **ctx.params)
      File "/public/home/yjq/.local/lib/python3.8/site-packages/click/core.py", line 788, in invoke
        return __callback(*args, **kwargs)
      File "/public/home/yjq/tools/cfDNA_GCcorrection/cfDNA_GCcorrection/computeGCBias_readlen.py", line 966, in main
        r_data = get_ratio(data)
      File "/public/home/yjq/tools/cfDNA_GCcorrection/cfDNA_GCcorrection/computeGCBias_readlen.py", line 576, in get_ratio
        f_tmp = F_GC.loc[i].to_numpy()
      File "/public/home/yjq/miniconda3/envs/celfeer_env/lib/python3.8/site-packages/pandas/core/indexing.py", line 1103, in __getitem__
        return self._getitem_axis(maybe_callable, axis=axis)
      File "/public/home/yjq/miniconda3/envs/celfeer_env/lib/python3.8/site-packages/pandas/core/indexing.py", line 1343, in _getitem_axis
        return self._get_label(key, axis=axis)
      File "/public/home/yjq/miniconda3/envs/celfeer_env/lib/python3.8/site-packages/pandas/core/indexing.py", line 1293, in _get_label
        return self.obj.xs(label, axis=axis)
      File "/public/home/yjq/miniconda3/envs/celfeer_env/lib/python3.8/site-packages/pandas/core/generic.py", line 4095, in xs
        loc = index.get_loc(key)
      File "/public/home/yjq/miniconda3/envs/celfeer_env/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3655, in get_loc
        raise KeyError(key) from err
    KeyError: 31

  1. When I run the computeGCBias_readlen script with the --precomputed_background parameter and pass the file generated by computeGCBias_background.py, the following error occurs. The precomputed background file loses much of its usefulness if chromosome names differ between the background file and the data being processed.
    /public/home/yjq/miniconda3/envs/celfeer_env/bin/computeGCBias_readlen:4: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
      __import__('pkg_resources').require('cfDNA-GCcorrection==0.1')
    Traceback (most recent call last):
      File "/public/home/yjq/miniconda3/envs/celfeer_env/bin/computeGCBias_readlen", line 7, in <module>
        exec(compile(f.read(), __file__, 'exec'))
      File "/public/home/yjq/tools/cfDNA_GCcorrection/bin/computeGCBias_readlen", line 12, in <module>
        main(args)
      File "/public/home/yjq/.local/lib/python3.8/site-packages/click/core.py", line 1161, in __call__
        return self.main(*args, **kwargs)
      File "/public/home/yjq/.local/lib/python3.8/site-packages/click/core.py", line 1082, in main
        rv = self.invoke(ctx)
      File "/public/home/yjq/.local/lib/python3.8/site-packages/click/core.py", line 1443, in invoke
        return ctx.invoke(self.callback, **ctx.params)
      File "/public/home/yjq/.local/lib/python3.8/site-packages/click/core.py", line 788, in invoke
        return __callback(*args, **kwargs)
      File "/public/home/yjq/tools/cfDNA_GCcorrection/cfDNA_GCcorrection/computeGCBias_readlen.py", line 906, in main
        chrom_dict = {
      File "/public/home/yjq/tools/cfDNA_GCcorrection/cfDNA_GCcorrection/computeGCBias_readlen.py", line 907, in <dictcomp>
        precomputed_chrom_mapping[key]: tuple(value)
    KeyError: 'chr1_gl000191_random'

  2. When I run the correctGCBias_readlen script, it reports that the -w option does not exist. You may want to check whether there is an error in your README file.
    Error: No such option: -w
    /public/home/yjq/miniconda3/envs/celfeer_env/bin/correctGCBias_readlen:4: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
      __import__('pkg_resources').require('cfDNA-GCcorrection==0.1')
    Usage: correctGCBias_readlen [OPTIONS]
    Try 'correctGCBias_readlen --help' for help.

  3. If the BAM files that require GC correction are large, the software demands a correspondingly large amount of memory, and it cannot run when memory is insufficient. I suspect that the use of bedtools might be the cause of this issue.

  4. Finally, thank you sincerely for your reply and assistance. Thank you very much~
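The `KeyError: 'chr1_gl000191_random'` reported above comes from a dictionary lookup on a contig name that has no entry in the chromosome mapping. A minimal sketch of that failure mode, and of one way to avoid it by restricting to chromosomes 1-22, X and Y, is shown below. It does not reproduce the tool's code: `remap_lengths`, the `STANDARD` pattern, the mapping, and the lengths are all hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Matches 1-22, X and Y, with or without a 'chr' prefix (assumed naming).
STANDARD = re.compile(r"^(chr)?([1-9]|1[0-9]|2[0-2]|X|Y)$")


def remap_lengths(chrom_lengths: Dict[str, int],
                  mapping: Dict[str, str]) -> Tuple[Dict[str, int], List[str]]:
    """Translate contig names through `mapping`, keeping standard ones only.

    Returns the remapped lengths plus the list of skipped contig names,
    so that nothing is dropped silently.
    """
    remapped: Dict[str, int] = {}
    skipped: List[str] = []
    for name, length in chrom_lengths.items():
        if STANDARD.match(name) and name in mapping:
            remapped[mapping[name]] = length
        else:
            skipped.append(name)
    return remapped, skipped


# Hypothetical inputs: the lengths are placeholders, not real contig sizes.
lengths = {"chr1": 1000, "chrX": 800, "chr1_gl000191_random": 50}
name_mapping = {"chr1": "1", "chrX": "X"}

# An unguarded lookup fails on the contig without a mapping entry.
try:
    unguarded = {name_mapping[key]: value for key, value in lengths.items()}
    failed_key = None
except KeyError as err:
    failed_key = err.args[0]

remapped, skipped = remap_lengths(lengths, name_mapping)
```

Returning the skipped names alongside the result keeps the filtering visible to the caller, which matters when a background is shared across several files.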

@sroener sroener mentioned this issue Jan 23, 2025
@sroener
Collaborator

sroener commented Jan 23, 2025

Hi,

thank you for using the software and reporting the issues you ran into. I'll try to answer your questions as well as possible.

  1. Thank you for pointing out the error. This seems to be an edge case that should be fixed. First, I want to describe how you can proceed with your work. The help message for the -i option was outdated and has now been updated. If activated, the option uses splines to interpolate missing values and smooths existing values by taking neighbouring bins into account. I would now recommend using this option to get better correction values. Runtime should not be impacted too much.

Related to the issue, could you please run the script on one of your BAM files with the options -i --MeasurementOutput and send me the resulting table? These flags will interpolate the values but also save a copy of the raw measured values, which would be the input for the function causing the error you reported. It is a simple table containing counts for measured and expected reads, binned by their fragment length and GC content.
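The `KeyError: 31` in the reported traceback is raised by a label lookup (`F_GC.loc[i]`), which means the table has no row with the label 31. The sketch below illustrates that situation with a plain dictionary and shows how filling gaps removes it. It uses simple linear interpolation between neighbouring rows as a stand-in, not the spline-based interpolation and smoothing described above, and `fill_missing_rows` and the counts are hypothetical.

```python
from typing import Dict, List


def fill_missing_rows(table: Dict[int, List[float]],
                      lengths: range) -> Dict[int, List[float]]:
    """Return a copy of `table` with a row for every length in `lengths`.

    Missing rows are linearly interpolated between the nearest measured
    rows; lengths outside the measured range copy the nearest measured row.
    """
    measured = sorted(table)
    if not measured:
        raise ValueError("no measured rows to interpolate from")
    filled: Dict[int, List[float]] = {}
    for length in lengths:
        if length in table:
            filled[length] = list(table[length])
            continue
        lower = [m for m in measured if m < length]
        upper = [m for m in measured if m > length]
        if lower and upper:
            lo, hi = lower[-1], upper[0]
            weight = (length - lo) / (hi - lo)
            filled[length] = [
                (1 - weight) * a + weight * b
                for a, b in zip(table[lo], table[hi])
            ]
        else:
            nearest = lower[-1] if lower else upper[0]
            filled[length] = list(table[nearest])
    return filled


# Hypothetical counts per GC bin, keyed by fragment length.
# Fragment length 31 was never observed, so it has no row.
raw = {30: [2.0, 4.0, 6.0], 32: [4.0, 8.0, 10.0]}

try:
    raw[31]
    missing_key = None
except KeyError as err:
    missing_key = err.args[0]

filled = fill_missing_rows(raw, range(30, 33))
```

After filling, every fragment length in the requested range has a row, so a loop over that range no longer hits a missing label.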

  1. I'm sorry to hear that you expected different behavior. The idea of the script is to create a background file that is representative of multiple files aligned to the same reference genome. That way, the computation does not have to be repeated. The pitfall is that all files need to have the same chromosomes, which can sometimes be tricky with non-standard chromosomes. If you are only interested in the standard chromosomes (1-22, X, and Y), an easy fix for your issue would be the --standard_chroms option.

  2. You are right. The -w shorthand for --weights was deprecated but was still in the example code. I updated the documentation. In theory, you could just omit the option, because it is now the default.

  3. Could you give me a bit more information? Which script is demanding a significant amount of memory? How much memory would that be?

Otherwise, it's hard to determine what causes the memory demand. I spent some time optimizing the resource requirements and had no problems on reasonable hardware. One possible factor is the number of cores. You can think of it as spawning workers that read lots of chunks from your BAM file. If many of them are open at the same time, the memory footprint increases. If you are comfortable doing so, you could profile the memory usage with a memory profiler (I have had good experiences with scalene).
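As a toy illustration of the point about workers, the sketch below measures peak memory while a varying number of chunks is held at once. It uses the standard-library `tracemalloc` module rather than scalene, and it is not a model of the tool's actual workers: `peak_bytes_holding_chunks` and the chunk size are assumptions made for the example.

```python
import tracemalloc


def peak_bytes_holding_chunks(n_chunks: int, chunk_size: int) -> int:
    """Return the peak traced memory while `n_chunks` buffers are alive.

    Each buffer stands in for a chunk of data held by one worker.
    """
    tracemalloc.start()
    chunks = [bytearray(chunk_size) for _ in range(n_chunks)]
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    del chunks
    return peak


CHUNK_SIZE = 1_000_000  # 1 MB per chunk, an arbitrary example value

peak_two = peak_bytes_holding_chunks(2, CHUNK_SIZE)
peak_eight = peak_bytes_holding_chunks(8, CHUNK_SIZE)
print(f"2 chunks: {peak_two} bytes, 8 chunks: {peak_eight} bytes")
```

The peak grows with the number of chunks alive at the same time, which is why reducing the number of cores is a reasonable first experiment when memory is tight.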

I hope this helps you in any way.
