Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Per-cell bams or fastqs from the mega-bam? #11

Open
olgabot opened this issue Feb 11, 2021 · 1 comment
Open

Per-cell bams or fastqs from the mega-bam? #11

olgabot opened this issue Feb 11, 2021 · 1 comment

Comments

@olgabot
Copy link

olgabot commented Feb 11, 2021

Hello,
Would it be possible to create 1000s of per-cell fast{q,a}s (preferred) or per-cell bams from the mega-bam of the 10x channel run? We wrote a tool to do this in Python, but it is slow and takes a lot of memory: http://github.com/czbiohub/bam2fasta In the end, performing 1000s of grep jobs was faster than doing it in one go with Python.
Thank you!
Warmest,
Olga

@olgabot
Copy link
Author

olgabot commented Mar 8, 2021

Hello! Any updates on this? This is my current workaround in the kmermaid pipeline:

Extract aligned reads to fastq:

    samtools view -ub -F 4 ${bam} \\
        | samtools fastq --threads ${task.cpus} -T ${tenx_tags} - \\
        | gzip -c - \\
          > ${reads}

Extract unaligned reads to fastq:

    samtools view -f4 ${bam} \\
      | grep -E '${tenx_cell_barcode_pattern}' \\
      | samtools fastq --threads ${task.cpus} -T ${tenx_tags} - \\
      | gzip -c - \\
        > ${reads} \\
      || touch ${reads}

Then running bam2fasta count_umis_per_cell:

        bam2fasta count_umis_percell \\
            --filename ${reads} \\
            --min-umi-per-barcode ${tenx_min_umi_per_cell} \\
            --cell-barcode-pattern '${tenx_cell_barcode_pattern}' \\
            --molecular-barcode-pattern '${tenx_molecular_barcode_pattern}' \\
            --write-barcode-meta-csv ${umis_per_cell} \\
            --barcodes-significant-umis-file ${good_barcodes}

Then using ripgrep to search for each cell barcode in the aligned and unaligned fastqs which has UMIs above a certain threshold:

    rg \\
      --search-zip \\
      --after-context 3 \\
      --threads ${task.cpus} \\
      '${this_cell_barcode}' \\
      ${reads} \\
      | gzip -c - \\
      > ${this_cell_fastq_gz} || touch ${this_cell_fastq_gz}

A one-stop-shop solution that would use Rust and be 10-100x faster would be much preferred :)

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant