feat: a `validate` subcommand to check whether a `.hap` file is valid #47

aryarm · 2022-06-10T00:17:30Z

Important: Please familiarize yourself with the .hap file specification before reading this issue!

Originating from item 2 in the "Future Work" section of PR #43:

It would be useful to have a validate command that simply validates the .hap file, ensuring it follows the specification. An optional parameter to this command could turn on messages about best practices.

At first, this command should reject unsorted .hap files, but at some point we should also add a --no-sort parameter to support unsorted files, since those are also technically valid input.

For each violation of the standard, it would be nice if the validate subcommand reported the exact line that contains the issue, and ideally, it would quote the problematic part of the line, as well.
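
One way to picture the reporting described above is a small record that stores the line number and the column span of the problem, then prints the line with a marker underneath. This is only a sketch: the `Violation` class, its fields, and the `render()` method are hypothetical names invented for illustration and are not part of haptools.

```python
from dataclasses import dataclass


@dataclass
class Violation:
    """One problem found in a file (hypothetical structure, not part of haptools)."""

    line_no: int  # 1-based line number in the file
    line: str  # full text of the offending line, without the newline
    start: int  # 0-based column where the problematic part begins
    end: int  # 0-based column just past the problematic part
    message: str  # human-readable description of the broken rule

    def render(self) -> str:
        # show each tab as a single space so the caret marker lines up
        shown = self.line.replace("\t", " ")
        marker = " " * self.start + "^" * max(1, self.end - self.start)
        return f"line {self.line_no}: {self.message}\n  {shown}\n  {marker}"


# made-up example line whose third field is not an integer
v = Violation(3, "H\tchr1\tabc\t200\thap1", 7, 10, "start position is not an integer")
print(v.render())
```

Keeping the span separate from the message would let the command decide later how much of the line to quote, for instance only the offending field.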

We should probably also add an optional argument that specifies the subcommand that this .hap file will be used as input for. That way, we can import its custom Haplotype class and acquire the expected extra field types from there.

Here are some rules it should check the .hap file follows:

And here are some rules for the header of the .hap file:


aryarm · 2022-06-17T15:49:23Z

we could also use this subcommand to validate other kinds of files like .map or .dat files, @mlamkin7

In that case, maybe the best way to structure this code would be to create a Validator class with methods inside of it for each of the file formats? Or a Validator abstract class that gets implemented for each of the file formats?

Update: After some discussion, we decided not to use this to validate other kinds of files, after all.

aryarm · 2022-09-06T14:44:00Z

Probably the best way to start on this PR would be to create a validate.py module with a single validate_haps() function, which can simply call a new Haplotypes.validate() method that we create.

Inside that function, you can iterate over the .hap file line by line and verify that everything looks correct. As an example of how to do this, you can take a look at what I did in the Haplotypes.__iter__() method:

haptools/haptools/data/haplotypes.py

Lines 894 to 915 in d74633a

    
    with hook_compressed(self.fname, mode="rt") as haps:
        self.log.info("Not taking advantage of indexing.")
        header_lines = []
        for line in haps:
            line = line.rstrip("\n")
            line_type = self._line_type(line)
            if line[0] == "#":
                # store header for later
                try:
                    header_lines.append(line)
                except AttributeError:
                    # this happens when we encounter a line beginning with a #
                    # after already having seen a valid line type (like H or V)
                    # These are usually just comment lines, so we can ignore it
                    pass
            else:
                if header_lines:
                    metas, extras = self.check_header(header_lines)
                    types = self._get_field_types(extras, metas.get("order"))
                    header_lines = None
                    self.log.info("Finished reading header.")
                if line_type == "H":
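
To make the line-by-line idea concrete, here is a rough, stand-alone sketch of a validation loop that collects problems instead of raising on the first one. It does not use the Haplotypes class at all. The function name `validate_lines`, the `MIN_FIELDS` table, the field counts in it, and the treatment of blank lines are all assumptions made for illustration and should be checked against the .hap specification.

```python
from typing import Iterable, List, Tuple

# Assumed minimum number of tab-separated fields per line type.
# These counts are illustrative; confirm them against the .hap specification.
MIN_FIELDS = {"H": 5, "R": 5, "V": 6}


def validate_lines(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line number, message) pairs, one for each problem found."""
    problems = []
    for line_no, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line:
            # assumption: blank lines are reported rather than silently skipped
            problems.append((line_no, "empty line"))
            continue
        if line[0] == "#":
            # header and comment lines are not checked in this sketch
            continue
        fields = line.split("\t")
        line_type = fields[0]
        if line_type not in MIN_FIELDS:
            problems.append((line_no, f"unknown line type {line_type!r}"))
            continue
        expected = MIN_FIELDS[line_type]
        if len(fields) < expected:
            problems.append(
                (line_no, f"expected at least {expected} fields, found {len(fields)}")
            )
            continue
        # assumption: start and end are the third and fourth fields of every line
        for name, value in (("start", fields[2]), ("end", fields[3])):
            if not value.isdigit():
                problems.append(
                    (line_no, f"{name} {value!r} is not a non-negative integer")
                )
    return problems
```

Returning a list of problems, instead of stopping at the first one, would let the command report every violation in a single pass over the file.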

aryarm · 2023-07-12T17:37:02Z

Also note:
We could add an option that converts runs of whitespace (i.e. consecutive spaces) in the .hap file to tabs (the field delimiter), or another option that treats whitespace as the field delimiter when reading the .hap file.

That way, we can account for situations where the user's text editor automatically inserts spaces when the tab key is pressed.

Update: `sed 's/^[ \t]*//;s/[ \t]*$//;s/[[:blank:]]\{2,\}/\t/g'` can also be used to remove leading/trailing whitespace and convert consecutive whitespace to a tab.
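
For reference, the same cleanup could be done in Python when reading the file. The sketch below mirrors the three substitutions in the sed command; the function name `normalize_whitespace` is hypothetical and does not exist in haptools.

```python
import re


def normalize_whitespace(line: str) -> str:
    """Strip leading/trailing blanks and turn runs of 2+ blanks into one tab."""
    # drop the newline, then any spaces or tabs at either end of the line
    line = line.rstrip("\n").strip(" \t")
    # replace each run of two or more spaces/tabs with a single tab
    return re.sub(r"[ \t]{2,}", "\t", line)


# made-up example line that was typed with spaces instead of tabs
print(repr(normalize_whitespace("  H  chr1    100  200  hap1  \n")))
```

Like the sed command, this leaves a single space between two values untouched, and it collapses two consecutive tabs into one, so it would not be safe for files in which a field is intentionally left empty.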


aryarm changed the title ~~create a validate subcommand to check whether a .hap file is valid~~ feat: a validate subcommand to check whether a .hap file is valid Jun 10, 2022

aryarm added the enhancement New feature or request label Jun 17, 2022

aryarm mentioned this issue Jun 17, 2022

try cases where .hap file does not follow format exactly #51

Closed

4 tasks

ayimany pushed a commit that referenced this issue Jul 14, 2023

Create base for hapfile validation

2b9f708

This base is still missing some features. Still not implemented as a CLI switch. Refer to #47.

ayimany linked a pull request Jul 27, 2023 that will close this issue

feat: implement a new validate command #220

Open

3 tasks
