Input files for KAPAC #4

Open
khandarius opened this issue May 27, 2019 · 6 comments

@khandarius

Hello,

I'm interested in using PAQR and KAPAC on my own samples, but I'm unsure about the KAPAC input files. Specifically, I would like to know how to obtain the site count file for KAPAC (corresponding to kmer_counts.tsv of the test data). I don't see a suitable file among the output of PAQR.

Best regards,
Darius

@koljaLanger
Member

Hi Darius,
the site count matrix is specific to the set of poly(A) sites that you use, which is why it is hard for us to provide a universal input file. If you use the same poly(A) sites that we use, you can of course also use the site count file that we provide.
However, it is not too difficult to create the file yourself: you need the genomic coordinates of your poly(A) sites and the corresponding FASTA file of the genome of interest. Then you scan over the region around each poly(A) site (with an extension up- and downstream of the site) and simply count every k-mer that you encounter.

I hope this helps. Let us know if you have further questions.

Best regards,
Ralf

@koljaLanger
Member

Hi Darius,
you may also want to have a look at the model pipeline we uploaded to Zenodo: https://doi.org/10.5281/zenodo.1147433
If you are familiar with Snakemake, this is the best way to see how we create the site count matrices. It also includes the script we use.

Best regards,
Ralf

@xflicsu

xflicsu commented Jun 14, 2019

Hello @koljaLanger,
I am also trying to use KAPAC in my project.
I found the link you provided, but the file is huge.
Could you provide test data and a script that are smaller than 100 MB?

@koljaLanger
Member

Hi,
I am really sorry that the model pipeline archive became that big. It is this large because it lets you reproduce the results from our paper, which was only possible by including the BAM files we used in the archive.

Of course, I don't know what prevents you from downloading the archive. In case it is disk space, my suggestion would be: download and unpack the archive, then use samtools to create small random samples of the BAM files. This would massively reduce the space needed.
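
One possible way to do the down-sampling suggested above is `samtools view` with its `-s` option, where the integer part of the value is the random seed and the fractional part is the proportion of alignments to keep. The sketch below only wraps that call; the function names, the file names, the default seed, and the fraction are placeholders and not taken from the model pipeline.

```python
import subprocess


def subsample_command(in_bam, out_bam, fraction, seed=42):
    """Build a samtools command that keeps a random fraction of alignments.

    samtools view -s takes SEED.FRACTION: the integer part is the seed,
    the fractional part is the proportion of alignments to keep.
    """
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be strictly between 0 and 1")
    if seed < 0:
        raise ValueError("seed must be a non-negative integer")
    seed_and_fraction = f"{seed + fraction:.6f}"
    return ["samtools", "view", "-b", "-s", seed_and_fraction,
            "-o", out_bam, in_bam]


def subsample_bam(in_bam, out_bam, fraction, seed=42):
    """Run the down-sampling; requires samtools on the PATH."""
    subprocess.run(subsample_command(in_bam, out_bam, fraction, seed),
                   check=True)
```

For example, `subsample_bam("sample.bam", "sample.small.bam", 0.01)` would keep roughly 1% of the alignments. Using the same seed for every file keeps the sampling reproducible.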

Please let us know whether this would be of any value to you. Otherwise, we can look for another solution.

Best,
Ralf

@xflicsu

xflicsu commented Jun 17, 2019

Thanks for your response!
I want to prepare the KAPAC input files from the PAQR output.
As you suggested, small BAM files can be created with samtools.
So, could you provide just a small example?
A small example might be more useful for new KAPAC users.

@koljaLanger
Member

Hi,
I have two options in mind for how best to solve this problem:
1. I down-sampled the BAM input files (each one is now 12 MB in size) and created a new Snakemake archive. It is still 1.3 GB, but it has the big advantage that it is self-contained and can run on Linux as an independent Snakemake pipeline. Please let me know if this would help you. In that case I would consider uploading this pipeline to Zenodo, too.
2. I created another git repo that contains only the scripts and auxiliary files of the model pipeline. If you're simply interested in using our scripts, clone the following repo: https://git.scicore.unibas.ch/zavolan_public/paqr_kapac_modelpipeline_only_scripts

I hope this allows you to use KAPAC and PAQR.

Kind regards,

Ralf
