Merge pull request #50 from sbslee/0.29.0-dev

0.29.0 dev
sbslee · Dec 19, 2021 · f4eb5f6
2 parents 54c07e2 + ab764b0

Showing 9 changed files with 329 additions and 110 deletions.
diff --git a/CHANGELOG.rst b/CHANGELOG.rst
@@ -1,6 +1,20 @@
 Changelog
 *********
 
+0.29.0 (2021-12-19)
+-------------------
+
+* Add new property ``pyvcf.VcfFrame.phased``.
+* Update :meth:`pyvcf.VcfFrame.slice` method to automatically handle the 'chr' string.
+* Add new argument ``--thread`` to :command:`ngs-hc` command. This argument will be used to set ``--native-pair-hmm-threads`` for GATK's :command:`HaplotypeCaller` command, ``--reader-threads`` for GATK's :command:`GenomicsDBImport` command, and ``-XX:ParallelGCThreads`` and ``-XX:ConcGCThreads`` for Java.
+* Add new argument ``--batch`` to :command:`ngs-hc` command. This argument will be used to set ``--batch-size`` for GATK's :command:`GenomicsDBImport` command.
+* Update :command:`ngs-bam2fq` command to fix the SGE issue that outputs an error like ``Unable to run job: denied: "XXXXX" is not a valid object name (cannot start with a digit)``.
+* Update :command:`ngs-hc` command so that when ``--posix`` is set, it will use the ``--genomicsdb-shared-posixfs-optimizations`` argument from GATK's :command:`GenomicsDBImport` command in addition to exporting the relevant shell variable (i.e. ``export TILEDB_DISABLE_FILE_LOCKING=1``).
+* Add new argument ``--job`` to :command:`ngs-fq2bam` command.
+* Update :command:`ngs-fq2bam` command so that the BAM creation and BAM processing steps are now combined into one step.
+* Update :command:`ngs-fq2bam` command so that ``--thread`` is now also used to set ``-XX:ParallelGCThreads`` and ``-XX:ConcGCThreads`` for Java.
+* Add new method :meth:`common.parse_list_or_file`.
+
 0.28.0 (2021-12-05)
 -------------------
 

diff --git a/docs/cli.rst b/docs/cli.rst
@@ -210,7 +210,7 @@ bam-slice
                     provide a BED file (compressed or uncompressed) to specify 
                     regions. Note that the 'chr' prefix in contig names (e.g. 
                     'chr1' vs. '1') will be automatically added or removed as 
-                    necessary to match the input VCF's contig names.
+                    necessary to match the input BED's contig names.
    
    Optional arguments:
      -h, --help     Show this help message and exit.
@@ -773,8 +773,8 @@ ngs-fq2bam
 
    $ fuc ngs-fq2bam -h
    usage: fuc ngs-fq2bam [-h] [--bed PATH] [--thread INT] [--platform TEXT]
-                         [--force] [--keep]
-                         manifest fasta output qsub1 qsub2 java vcf [vcf ...]
+                         [--job TEXT] [--force] [--keep]
+                         manifest fasta output qsub java vcf [vcf ...]
    
    Pipeline for converting FASTQ files to analysis-ready BAM files.
    
@@ -798,12 +798,7 @@ ngs-fq2bam
      manifest         Sample manifest CSV file.
      fasta            Reference FASTA file.
      output           Output directory.
-     qsub1            SGE resoruce to request with qsub for read alignment 
-                      and sorting. Since both tasks support multithreading, 
-                      it is recommended to speicfy a parallel environment (PE) 
-                      to speed up the process (also see --thread).
-     qsub2            SGE resoruce to request with qsub for the rest of the 
-                      tasks, which do not support multithreading.
+     qsub             SGE resource to request for qsub.
      java             Java resource to request for GATK.
      vcf              One or more reference VCF files containing known variant 
                       sites (e.g. 1000 Genomes Project).
@@ -813,6 +808,7 @@ ngs-fq2bam
      --bed PATH       BED file.
      --thread INT     Number of threads to use (default: 1).
      --platform TEXT  Sequencing platform (default: 'Illumina').
+     --job TEXT       Job submission ID for SGE.
      --force          Overwrite the output directory if it already exists.
      --keep           Keep temporary files.
    
@@ -822,7 +818,6 @@ ngs-fq2bam
      ref.fa \
      output_dir \
      "-q queue_name -pe pe_name 10" \
-     "-q queue_name" \
      "-Xmx15g -Xms15g" \
      1.vcf 2.vcf 3.vcf \
      --thread 10
@@ -833,7 +828,6 @@ ngs-fq2bam
      ref.fa \
      output_dir \
      "-l h='node_A|node_B' -pe pe_name 10" \
-     "-l h='node_A|node_B'" \
      "-Xmx15g -Xms15g" \
      1.vcf 2.vcf 3.vcf \
      --thread 10
@@ -844,8 +838,8 @@ ngs-hc
 .. code-block:: text
 
    $ fuc ngs-hc -h
-   usage: fuc ngs-hc [-h] [--bed PATH] [--dbsnp PATH] [--job TEXT] [--force]
-                     [--keep] [--posix]
+   usage: fuc ngs-hc [-h] [--bed PATH] [--dbsnp PATH] [--thread INT]
+                     [--batch INT] [--job TEXT] [--force] [--keep] [--posix]
                      manifest fasta output qsub java1 java2
    
    Pipeline for germline short variant discovery.
@@ -869,10 +863,22 @@ ngs-hc
      -h, --help    Show this help message and exit.
      --bed PATH    BED file.
      --dbsnp PATH  VCF file from dbSNP.
+     --thread INT  Number of threads to use (default: 1).
+     --batch INT   Batch size used for GenomicsDBImport (default: 0). This 
+                   controls the number of samples for which readers are 
+                   open at once and therefore provides a way to minimize 
+                   memory consumption. The size of 0 means no batching (i.e. 
+                   readers for all samples will be opened at once).
      --job TEXT    Job submission ID for SGE.
      --force       Overwrite the output directory if it already exists.
      --keep        Keep temporary files.
-     --posix       Optimize for a POSIX filesystem.
+     --posix       Set GenomicsDBImport to allow for optimizations to improve 
+                   the usability and performance for shared Posix Filesystems 
+                   (e.g. NFS, Lustre). If set, file level locking is disabled 
+                   and file system writes are minimized by keeping a higher 
+                   number of file descriptors open for longer periods of time. 
+                   Use with --batch if keeping a large number of file 
+                   descriptors open is an issue.
    
    [Example] Specify queue:
      $ fuc ngs-hc \

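The changelog and help text above describe how ``--thread``, ``--batch``, and ``--posix`` are forwarded to GATK and Java. The sketch below shows one way that mapping could be expressed. The helper name ``build_gatk_args`` and the layout of the returned dictionary are hypothetical and are not taken from fuc; only the flag names and the environment variable come from the changelog entries.

```python
def build_gatk_args(thread=1, batch=0, posix=False):
    """Hypothetical helper: map ngs-hc options onto GATK/Java settings."""
    # Java garbage-collector thread options named in the changelog.
    java = [
        f'-XX:ParallelGCThreads={thread}',
        f'-XX:ConcGCThreads={thread}',
    ]

    # HaplotypeCaller receives the thread count for the pair-HMM step.
    haplotype_caller = ['--native-pair-hmm-threads', str(thread)]

    # GenomicsDBImport receives reader threads and the batch size
    # (0 means no batching, per the help text).
    genomicsdb_import = [
        '--reader-threads', str(thread),
        '--batch-size', str(batch),
    ]

    # Shell variables to export before running GenomicsDBImport.
    env = {}

    if posix:
        genomicsdb_import.append('--genomicsdb-shared-posixfs-optimizations')
        env['TILEDB_DISABLE_FILE_LOCKING'] = '1'

    return {
        'java': java,
        'HaplotypeCaller': haplotype_caller,
        'GenomicsDBImport': genomicsdb_import,
        'env': env,
    }


args = build_gatk_args(thread=10, batch=50, posix=True)
for name, value in args.items():
    print(name, value)
```

Keeping the mapping in one place makes it easy to see that a single ``--thread`` value drives four separate settings, which is the behavior the changelog describes.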
diff --git a/fuc/api/common.py b/fuc/api/common.py
@@ -1333,6 +1333,9 @@ def update_chr_prefix(regions, mode='remove'):
     """
     Add or remove the (annoying) 'chr' string from specified regions.
 
+    The method will automatically detect regions that don't need to be
+    updated and will return them unchanged.
+
     Parameters
     ----------
     regions : str or list
@@ -1349,10 +1352,18 @@ def update_chr_prefix(regions, mode='remove'):
     -------
 
     >>> from fuc import common
-    >>> common.update_chr_prefix(['chr1:100-200', '1:300-400'], mode='remove')
-    ['1:100-200', '1:300-400']
-    >>> common.update_chr_prefix(['chr1:100-200', '1:300-400'], mode='add')
-    ['chr1:100-200', 'chr1:300-400']
+    >>> common.update_chr_prefix(['chr1:100-200', '2:300-400'], mode='remove')
+    ['1:100-200', '2:300-400']
+    >>> common.update_chr_prefix(['chr1:100-200', '2:300-400'], mode='add')
+    ['chr1:100-200', 'chr2:300-400']
+    >>> common.update_chr_prefix('chr1:100-200', mode='remove')
+    '1:100-200'
+    >>> common.update_chr_prefix('chr1:100-200', mode='add')
+    'chr1:100-200'
+    >>> common.update_chr_prefix('2:300-400', mode='add')
+    'chr2:300-400'
+    >>> common.update_chr_prefix('2:300-400', mode='remove')
+    '2:300-400'
     """
     def remove(x):
         return x.replace('chr', '')
@@ -1368,3 +1379,54 @@ def add(x):
         return modes[mode](regions)
 
     return [modes[mode](x) for x in regions]
+
+def parse_list_or_file(obj, extensions=['txt', 'tsv', 'csv', 'list']):
+    """
+    Parse the input variable and then return a list of items.
+
+    This method is useful when parsing a command line argument that accepts
+    either a list of items or a text file containing one item per line.
+
+    Parameters
+    ----------
+    obj : str or list
+        Object to be tested. Must be non-empty.
+    extensions : list, default: ['txt', 'tsv', 'csv', 'list']
+        Recognized file extensions.
+
+    Returns
+    -------
+    list
+        List of items.
+
+    Examples
+    --------
+
+    >>> from fuc import common
+    >>> common.parse_list_or_file(['A', 'B', 'C'])
+    ['A', 'B', 'C']
+    >>> common.parse_list_or_file('A')
+    ['A']
+    >>> common.parse_list_or_file('example.txt')
+    ['A', 'B', 'C']
+    >>> common.parse_list_or_file(['example.txt'])
+    ['A', 'B', 'C']
+    """
+    if not isinstance(obj, str) and not isinstance(obj, list):
+        raise TypeError(
+            f'Input must be str or list, not {type(obj).__name__}')
+
+    if not obj:
+        raise ValueError('Input is empty')
+
+    if isinstance(obj, str):
+        obj = [obj]
+
+    if len(obj) > 1:
+        return obj
+
+    for extension in extensions:
+        if obj[0].endswith(f'.{extension}'):
+            return convert_file2list(obj[0])
+
+    return obj
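
To try the list-or-file behavior outside of fuc, the following self-contained sketch mirrors the logic of ``parse_list_or_file`` added above. The ``read_lines`` function is a hypothetical stand-in for ``convert_file2list`` and assumes that helper returns one stripped item per non-empty line; the real implementation may differ.

```python
import os
import tempfile


def read_lines(path):
    """Hypothetical stand-in for convert_file2list: one item per line."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]


def parse_list_or_file(obj, extensions=('txt', 'tsv', 'csv', 'list')):
    """Return a list of items from a string, a list, or a one-file list."""
    if not isinstance(obj, (str, list)):
        raise TypeError(
            f'Input must be str or list, not {type(obj).__name__}')

    if not obj:
        raise ValueError('Input is empty')

    # Normalize a bare string to a single-element list.
    if isinstance(obj, str):
        obj = [obj]

    # Two or more elements are always treated as literal items.
    if len(obj) > 1:
        return obj

    # A single element with a recognized extension is read as a file.
    suffixes = tuple(f'.{extension}' for extension in extensions)
    if obj[0].endswith(suffixes):
        return read_lines(obj[0])

    return obj


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'example.txt')
    with open(path, 'w') as f:
        f.write('A\nB\nC\n')
    from_string = parse_list_or_file(path)
    from_list = parse_list_or_file([path])

print(from_string)
print(from_list)
print(parse_list_or_file('A'))
print(parse_list_or_file(['A', 'B', 'C']))
```

One consequence of this design is that a single literal item whose name ends in a recognized extension, such as ``'notes.txt'``, is always opened as a file, so callers who need such a value as a plain item would have to narrow ``extensions``.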