Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

synchronize tests across langs, add helper makefile #767

Merged
merged 6 commits into from
Oct 19, 2024

Conversation

schristley
Copy link
Member

Start building the base so we only need to edit the OpenAPI v3 spec, and the v2 spec can be generated from v3. Also clean up the test suites so there is one common set of test data files used by all languages.

Restarted as I messed up the master merge with the old PR #758

@schristley
Copy link
Member Author

$ make

Helper commands for AIRR Standards repository

make gen-v2       -- Generate OpenAPI V2 spec from the V3 spec
make build-docs   -- Build documentation
make spec-copy    -- Copy spec files to language directories
make data-copy    -- Copy test data files to language directories
make checks       -- Run consistency checks on spec files
make tests        -- Run all language test suites
make python-tests -- Run Python test suite
make r-tests      -- Run R test suite
make js-tests     -- Run Javascript test suite

@schristley
Copy link
Member Author

@javh @bussec Are we allowing NA to be in the rearrangement TSV? I'm reconciling the test data and for the bad_rearrangement.tsv file, the R version has an NA while the python version does not. If I try to use the R version with the NA then python crashes:

Traceback (most recent call last):
  File "/work/lang/python/tests/test_interface.py", line 59, in test_load_rearrangement
    result = airr.load_rearrangement(self.rearrangement_bad)
  File "/work/lang/python/airr/interface.py", line 103, in load_rearrangement
    df = pd.read_csv(filename, sep='\t', header=0, index_col=None,
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers/readers.py", line 912, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers/readers.py", line 583, in _read
    return parser.read(nrows)
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers/readers.py", line 1704, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/parsers/c_parser_wrapper.py", line 234, in read
    chunks = self._reader.read_low_memory(nrows)
  File "pandas/_libs/parsers.pyx", line 814, in pandas._libs.parsers.TextReader.read_low_memory
  File "pandas/_libs/parsers.pyx", line 891, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 1036, in pandas._libs.parsers.TextReader._convert_column_data
  File "pandas/_libs/parsers.pyx", line 1075, in pandas._libs.parsers.TextReader._convert_tokens
  File "pandas/_libs/parsers.pyx", line 1220, in pandas._libs.parsers.TextReader._convert_with_dtype
ValueError: Bool column has NA values in column 4

it seems even if we don't accept NA, this is maybe a python bug?

@schristley
Copy link
Member Author

Catching the exception prevents the crash, but I'm not sure if that is what we want, or if we want pandas to allow NA and transform to None.

diff --git a/lang/python/airr/interface.py b/lang/python/airr/interface.py
index 590c3a7..07b194b 100644
--- a/lang/python/airr/interface.py
+++ b/lang/python/airr/interface.py
@@ -100,10 +100,15 @@ def load_rearrangement(filename, validate=False, debug=False):
     # TODO: test pandas.DataFrame.read_csv with converters argument as an alterative
     schema = RearrangementSchema
 
-    df = pd.read_csv(filename, sep='\t', header=0, index_col=None,
-                     dtype=schema.pandas_types(), true_values=schema.true_values,
-                     false_values=schema.false_values)
-    # added to use RearrangementReader without modifying it:
+    try:
+        df = pd.read_csv(filename, sep='\t', header=0, index_col=None,
+                         dtype=schema.pandas_types(), true_values=schema.true_values,
+                         false_values=schema.false_values)
+        # added to use RearrangementReader without modifying it:
+    except Exception as e:
+        sys.stderr.write('Error occurred while loading AIRR rearrangement file: %s\n' % e)
+        return None
+
     buffer = StringIO()  # create an empty buffer
     df.to_csv(buffer, sep='\t', index=False)  # fill buffer
     buffer.seek(0)  # set to the start of the stream

@javh
Copy link
Contributor

javh commented Mar 6, 2024

The R library will accept "" (empty string), NA, or None for null values. Though, the spec officially only recognizes an empty string as a null value.

I think allowing NA to equate to None in python would be fine (though NA is valid amino acid sequence), but I think it's less of a python bug and more of an invalid bad_rearrangement.tsv file... It is supposed to be "bad", I guess. But, how bad?

@schristley
Copy link
Member Author

It is supposed to be "bad", I guess. But, how bad?

bad, but no so bad it causes a crash!

It isn't so much about what to test in the "bad" file. I'm more worried that in a "good" file there is an NA and R accepts it, but python doesn't, and/or we get incompatibility where an R output file cannot be fed into python because it has NAs.

@bcorrie
Copy link
Contributor

bcorrie commented Mar 6, 2024

The R library will accept "" (empty string), NA, or None for null values. Though, the spec officially only recognizes an empty string as a null value.

I would suggest that on input from an AIRR file, NA/None should not be interpreted as null and this should be rejected. It is non compliant if the data has NA/None for null, no?

Also on output, it should never output NA/None for null, it should always output an empty string.

@javh
Copy link
Contributor

javh commented Mar 6, 2024

Yeah, it's true that the files are non-compliant if they include NA/None. And the R and python libraries do output empty string for NA/None values.

But, NA/None tend to be the default outputs from TSV writers outside the airr reference libraries. So, it's a compromise to deal with typical TSV output.

@schristley
Copy link
Member Author

Sounds like there are two things here. 1) change the test so it works for both R and python. That should be easy then. 2) python and R need more support for null-like values, for cross-language interoperability of AIRR TSV. That should probably be it's own issue, as it's new code to write, with the task to write additional tests to handle the null-like values.

@javh javh force-pushed the issue-739-openapi3 branch from 38234fd to ac5e1e1 Compare April 8, 2024 17:07
@javh
Copy link
Contributor

javh commented Aug 12, 2024

@schristley Can you remind me, did we decide to back out of the V2 conversion script, just make the V3 spec the default, and manually maintain the V2 spec?

@javh javh added this to the AIRR 2.0 milestone Aug 12, 2024
@schristley
Copy link
Member Author

@schristley Can you remind me, did we decide to back out of the V2 conversion script, just make the V3 spec the default, and manually maintain the V2 spec?

yes, I believe so

@schristley schristley changed the title Initial moves to OpenAPI v3 synchronize tests across langs, add helper makefile Oct 19, 2024
@schristley
Copy link
Member Author

@javh I changed this PR to be primarily to synchronize the tests, and add the Makefile.

@schristley schristley merged commit 5c6f228 into master Oct 19, 2024
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

3 participants