Skip to content

Commit

Permalink
Migrate regression tests into the main test suite (explosion#9655)
Browse files Browse the repository at this point in the history
* Migrate regressions 1-1000

* Move serialize test to correct file

* Remove tests that won't work in v3

* Migrate regressions 1000-1500

Removed regression test 1250 because v3 doesn't support the old LEX
scheme anymore.

* Add missing imports in serializer tests

* Migrate tests 1500-2000

* Migrate regressions from 2000-2500

* Migrate regressions from 2501-3000

* Migrate regressions from 3000-3501

* Migrate regressions from 3501-4000

* Migrate regressions from 4001-4500

* Migrate regressions from 4501-5000

* Migrate regressions from 5001-5501

* Migrate regressions from 5501 to 7000

* Migrate regressions from 7001 to 8000

* Migrate remaining regression tests

* Fixing missing imports

* Update docs with new system [ci skip]

* Update CONTRIBUTING.md

- Fix formatting
- Update wording

* Remove lemmatizer tests in el lang

* Move a few tests into the general tokenizer

* Separate Doc and DocBin tests
  • Loading branch information
ljvmiranda921 authored Dec 4, 2021
1 parent 72f7f4e commit 7d50804
Show file tree
Hide file tree
Showing 60 changed files with 3,778 additions and 4,011 deletions.
24 changes: 17 additions & 7 deletions CONTRIBUTING.md
Original file line number Diff line number Diff line change
Expand Up @@ -143,15 +143,25 @@ Changes to `.py` files will be effective immediately.
### Fixing bugs

When fixing a bug, first create an
[issue](https://github.com/explosion/spaCy/issues) if one does not already exist.
The description text can be very short – we don't want to make this too
[issue](https://github.com/explosion/spaCy/issues) if one does not already
exist. The description text can be very short – we don't want to make this too
bureaucratic.

Next, create a test file named `test_issue[ISSUE NUMBER].py` in the
[`spacy/tests/regression`](spacy/tests/regression) folder. Test for the bug
you're fixing, and make sure the test fails. Next, add and commit your test file
referencing the issue number in the commit message. Finally, fix the bug, make
sure your test passes and reference the issue in your commit message.
Next, add a test to the relevant file in the
[`spacy/tests`](spacy/tests)folder. Then add a [pytest
mark](https://docs.pytest.org/en/6.2.x/example/markers.html#working-with-custom-markers),
`@pytest.mark.issue(NUMBER)`, to reference the issue number.

```python
# Assume you're fixing Issue #1234
@pytest.mark.issue(1234)
def test_issue1234():
...
```

Test for the bug you're fixing, and make sure the test fails. Next, add and
commit your test file. Finally, fix the bug, make sure your test passes and
reference the issue number in your pull request description.

📖 **For more information on how to add tests, check out the [tests README](spacy/tests/README.md).**

Expand Down
2 changes: 1 addition & 1 deletion extra/DEVELOPER_DOCS/Code Conventions.md
Original file line number Diff line number Diff line change
Expand Up @@ -444,7 +444,7 @@ spaCy uses the [`pytest`](http://doc.pytest.org/) framework for testing. Tests f

When adding tests, make sure to use descriptive names and only test for one behavior at a time. Tests should be grouped into modules dedicated to the same type of functionality and some test modules are organized as directories of test files related to the same larger area of the library, e.g. `matcher` or `tokenizer`.

Regression tests are tests that refer to bugs reported in specific issues. They should live in the `regression` module and are named according to the issue number (e.g. `test_issue1234.py`). This system allows us to relate tests for specific bugs back to the original reported issue, which is especially useful if we introduce a regression and a previously passing regression tests suddenly fails again. When fixing a bug, it's often useful to create a regression test for it first. Every once in a while, we go through the `regression` module and group tests together into larger files by issue number, in groups of 500 to 1000 numbers. This prevents us from ending up with too many individual files over time.
Regression tests are tests that refer to bugs reported in specific issues. They should live in the relevant module of the test suite, named according to the issue number (e.g., `test_issue1234.py`), and [marked](https://docs.pytest.org/en/6.2.x/example/markers.html#working-with-custom-markers) appropriately (e.g. `@pytest.mark.issue(1234)`). This system allows us to relate tests for specific bugs back to the original reported issue, which is especially useful if we introduce a regression and a previously passing regression tests suddenly fails again. When fixing a bug, it's often useful to create a regression test for it first.

The test suite also provides [fixtures](https://github.com/explosion/spaCy/blob/master/spacy/tests/conftest.py) for different language tokenizers that can be used as function arguments of the same name and will be passed in automatically. Those should only be used for tests related to those specific languages. We also have [test utility functions](https://github.com/explosion/spaCy/blob/master/spacy/tests/util.py) for common operations, like creating a temporary file.

Expand Down
23 changes: 23 additions & 0 deletions spacy/tests/doc/test_array.py
Original file line number Diff line number Diff line change
@@ -1,8 +1,31 @@
import numpy
import pytest

from spacy.tokens import Doc
from spacy.attrs import ORTH, SHAPE, POS, DEP, MORPH


@pytest.mark.issue(2203)
def test_issue2203(en_vocab):
"""Test that lemmas are set correctly in doc.from_array."""
words = ["I", "'ll", "survive"]
tags = ["PRP", "MD", "VB"]
lemmas = ["-PRON-", "will", "survive"]
tag_ids = [en_vocab.strings.add(tag) for tag in tags]
lemma_ids = [en_vocab.strings.add(lemma) for lemma in lemmas]
doc = Doc(en_vocab, words=words)
# Work around lemma corruption problem and set lemmas after tags
doc.from_array("TAG", numpy.array(tag_ids, dtype="uint64"))
doc.from_array("LEMMA", numpy.array(lemma_ids, dtype="uint64"))
assert [t.tag_ for t in doc] == tags
assert [t.lemma_ for t in doc] == lemmas
# We need to serialize both tag and lemma, since this is what causes the bug
doc_array = doc.to_array(["TAG", "LEMMA"])
new_doc = Doc(doc.vocab, words=words).from_array(["TAG", "LEMMA"], doc_array)
assert [t.tag_ for t in new_doc] == tags
assert [t.lemma_ for t in new_doc] == lemmas


def test_doc_array_attr_of_token(en_vocab):
doc = Doc(en_vocab, words=["An", "example", "sentence"])
example = doc.vocab["example"]
Expand Down
225 changes: 221 additions & 4 deletions spacy/tests/doc/test_doc_api.py
Original file line number Diff line number Diff line change
@@ -1,14 +1,17 @@
import weakref

import pytest
import numpy
import pytest
from thinc.api import NumpyOps, get_current_ops

from spacy.attrs import DEP, ENT_IOB, ENT_TYPE, HEAD, IS_ALPHA, MORPH, POS
from spacy.attrs import SENT_START, TAG
from spacy.lang.en import English
from spacy.lang.xx import MultiLanguage
from spacy.language import Language
from spacy.lexeme import Lexeme
from spacy.tokens import Doc, Span, Token
from spacy.vocab import Vocab
from spacy.lexeme import Lexeme
from spacy.lang.en import English
from spacy.attrs import ENT_TYPE, ENT_IOB, SENT_START, HEAD, DEP, MORPH

from .test_underscore import clean_underscore # noqa: F401

Expand All @@ -30,6 +33,220 @@ def test_doc_api_init(en_vocab):
assert [t.is_sent_start for t in doc] == [True, False, True, False]


@pytest.mark.issue(1547)
def test_issue1547():
"""Test that entity labels still match after merging tokens."""
words = ["\n", "worda", ".", "\n", "wordb", "-", "Biosphere", "2", "-", " \n"]
doc = Doc(Vocab(), words=words)
doc.ents = [Span(doc, 6, 8, label=doc.vocab.strings["PRODUCT"])]
with doc.retokenize() as retokenizer:
retokenizer.merge(doc[5:7])
assert [ent.text for ent in doc.ents]


@pytest.mark.issue(1757)
def test_issue1757():
"""Test comparison against None doesn't cause segfault."""
doc = Doc(Vocab(), words=["a", "b", "c"])
assert not doc[0] < None
assert not doc[0] is None
assert doc[0] >= None
assert not doc[:2] < None
assert not doc[:2] is None
assert doc[:2] >= None
assert not doc.vocab["a"] is None
assert not doc.vocab["a"] < None


@pytest.mark.issue(2396)
def test_issue2396(en_vocab):
words = ["She", "created", "a", "test", "for", "spacy"]
heads = [1, 1, 3, 1, 3, 4]
deps = ["dep"] * len(heads)
matrix = numpy.array(
[
[0, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 1, 1],
[1, 1, 2, 3, 3, 3],
[1, 1, 3, 3, 3, 3],
[1, 1, 3, 3, 4, 4],
[1, 1, 3, 3, 4, 5],
],
dtype=numpy.int32,
)
doc = Doc(en_vocab, words=words, heads=heads, deps=deps)
span = doc[:]
assert (doc.get_lca_matrix() == matrix).all()
assert (span.get_lca_matrix() == matrix).all()


@pytest.mark.parametrize("text", ["-0.23", "+123,456", "±1"])
@pytest.mark.parametrize("lang_cls", [English, MultiLanguage])
@pytest.mark.issue(2782)
def test_issue2782(text, lang_cls):
"""Check that like_num handles + and - before number."""
nlp = lang_cls()
doc = nlp(text)
assert len(doc) == 1
assert doc[0].like_num


@pytest.mark.parametrize(
"sentence",
[
"The story was to the effect that a young American student recently called on Professor Christlieb with a letter of introduction.",
"The next month Barry Siddall joined Stoke City on a free transfer, after Chris Pearce had established himself as the Vale's #1.",
"The next month Barry Siddall joined Stoke City on a free transfer, after Chris Pearce had established himself as the Vale's number one",
"Indeed, making the one who remains do all the work has installed him into a position of such insolent tyranny, it will take a month at least to reduce him to his proper proportions.",
"It was a missed assignment, but it shouldn't have resulted in a turnover ...",
],
)
@pytest.mark.issue(3869)
def test_issue3869(sentence):
"""Test that the Doc's count_by function works consistently"""
nlp = English()
doc = nlp(sentence)
count = 0
for token in doc:
count += token.is_alpha
assert count == doc.count_by(IS_ALPHA).get(1, 0)


@pytest.mark.issue(3962)
def test_issue3962(en_vocab):
"""Ensure that as_doc does not result in out-of-bound access of tokens.
This is achieved by setting the head to itself if it would lie out of the span otherwise."""
# fmt: off
words = ["He", "jests", "at", "scars", ",", "that", "never", "felt", "a", "wound", "."]
heads = [1, 7, 1, 2, 7, 7, 7, 7, 9, 7, 7]
deps = ["nsubj", "ccomp", "prep", "pobj", "punct", "nsubj", "neg", "ROOT", "det", "dobj", "punct"]
# fmt: on
doc = Doc(en_vocab, words=words, heads=heads, deps=deps)
span2 = doc[1:5] # "jests at scars ,"
doc2 = span2.as_doc()
doc2_json = doc2.to_json()
assert doc2_json
# head set to itself, being the new artificial root
assert doc2[0].head.text == "jests"
assert doc2[0].dep_ == "dep"
assert doc2[1].head.text == "jests"
assert doc2[1].dep_ == "prep"
assert doc2[2].head.text == "at"
assert doc2[2].dep_ == "pobj"
assert doc2[3].head.text == "jests" # head set to the new artificial root
assert doc2[3].dep_ == "dep"
# We should still have 1 sentence
assert len(list(doc2.sents)) == 1
span3 = doc[6:9] # "never felt a"
doc3 = span3.as_doc()
doc3_json = doc3.to_json()
assert doc3_json
assert doc3[0].head.text == "felt"
assert doc3[0].dep_ == "neg"
assert doc3[1].head.text == "felt"
assert doc3[1].dep_ == "ROOT"
assert doc3[2].head.text == "felt" # head set to ancestor
assert doc3[2].dep_ == "dep"
# We should still have 1 sentence as "a" can be attached to "felt" instead of "wound"
assert len(list(doc3.sents)) == 1


@pytest.mark.issue(3962)
def test_issue3962_long(en_vocab):
"""Ensure that as_doc does not result in out-of-bound access of tokens.
This is achieved by setting the head to itself if it would lie out of the span otherwise."""
# fmt: off
words = ["He", "jests", "at", "scars", ".", "They", "never", "felt", "a", "wound", "."]
heads = [1, 1, 1, 2, 1, 7, 7, 7, 9, 7, 7]
deps = ["nsubj", "ROOT", "prep", "pobj", "punct", "nsubj", "neg", "ROOT", "det", "dobj", "punct"]
# fmt: on
two_sent_doc = Doc(en_vocab, words=words, heads=heads, deps=deps)
span2 = two_sent_doc[1:7] # "jests at scars. They never"
doc2 = span2.as_doc()
doc2_json = doc2.to_json()
assert doc2_json
# head set to itself, being the new artificial root (in sentence 1)
assert doc2[0].head.text == "jests"
assert doc2[0].dep_ == "ROOT"
assert doc2[1].head.text == "jests"
assert doc2[1].dep_ == "prep"
assert doc2[2].head.text == "at"
assert doc2[2].dep_ == "pobj"
assert doc2[3].head.text == "jests"
assert doc2[3].dep_ == "punct"
# head set to itself, being the new artificial root (in sentence 2)
assert doc2[4].head.text == "They"
assert doc2[4].dep_ == "dep"
# head set to the new artificial head (in sentence 2)
assert doc2[4].head.text == "They"
assert doc2[4].dep_ == "dep"
# We should still have 2 sentences
sents = list(doc2.sents)
assert len(sents) == 2
assert sents[0].text == "jests at scars ."
assert sents[1].text == "They never"


@Language.factory("my_pipe")
class CustomPipe:
def __init__(self, nlp, name="my_pipe"):
self.name = name
Span.set_extension("my_ext", getter=self._get_my_ext)
Doc.set_extension("my_ext", default=None)

def __call__(self, doc):
gathered_ext = []
for sent in doc.sents:
sent_ext = self._get_my_ext(sent)
sent._.set("my_ext", sent_ext)
gathered_ext.append(sent_ext)

doc._.set("my_ext", "\n".join(gathered_ext))
return doc

@staticmethod
def _get_my_ext(span):
return str(span.end)


@pytest.mark.issue(4903)
def test_issue4903():
"""Ensure that this runs correctly and doesn't hang or crash on Windows /
macOS."""
nlp = English()
nlp.add_pipe("sentencizer")
nlp.add_pipe("my_pipe", after="sentencizer")
text = ["I like bananas.", "Do you like them?", "No, I prefer wasabi."]
if isinstance(get_current_ops(), NumpyOps):
docs = list(nlp.pipe(text, n_process=2))
assert docs[0].text == "I like bananas."
assert docs[1].text == "Do you like them?"
assert docs[2].text == "No, I prefer wasabi."


@pytest.mark.issue(5048)
def test_issue5048(en_vocab):
words = ["This", "is", "a", "sentence"]
pos_s = ["DET", "VERB", "DET", "NOUN"]
spaces = [" ", " ", " ", ""]
deps_s = ["dep", "adj", "nn", "atm"]
tags_s = ["DT", "VBZ", "DT", "NN"]
strings = en_vocab.strings
for w in words:
strings.add(w)
deps = [strings.add(d) for d in deps_s]
pos = [strings.add(p) for p in pos_s]
tags = [strings.add(t) for t in tags_s]
attrs = [POS, DEP, TAG]
array = numpy.array(list(zip(pos, deps, tags)), dtype="uint64")
doc = Doc(en_vocab, words=words, spaces=spaces)
doc.from_array(attrs, array)
v1 = [(token.text, token.pos_, token.tag_) for token in doc]
doc2 = Doc(en_vocab, words=words, pos=pos_s, deps=deps_s, tags=tags_s)
v2 = [(token.text, token.pos_, token.tag_) for token in doc2]
assert v1 == v2


@pytest.mark.parametrize("text", [["one", "two", "three"]])
def test_doc_api_compare_by_string_position(en_vocab, text):
doc = Doc(en_vocab, words=text)
Expand Down
42 changes: 42 additions & 0 deletions spacy/tests/doc/test_retokenize_split.py
Original file line number Diff line number Diff line change
@@ -1,8 +1,50 @@
import numpy
import pytest

from spacy.vocab import Vocab
from spacy.tokens import Doc, Token


@pytest.mark.issue(3540)
def test_issue3540(en_vocab):
words = ["I", "live", "in", "NewYork", "right", "now"]
tensor = numpy.asarray(
[[1.0, 1.1], [2.0, 2.1], [3.0, 3.1], [4.0, 4.1], [5.0, 5.1], [6.0, 6.1]],
dtype="f",
)
doc = Doc(en_vocab, words=words)
doc.tensor = tensor
gold_text = ["I", "live", "in", "NewYork", "right", "now"]
assert [token.text for token in doc] == gold_text
gold_lemma = ["I", "live", "in", "NewYork", "right", "now"]
for i, lemma in enumerate(gold_lemma):
doc[i].lemma_ = lemma
assert [token.lemma_ for token in doc] == gold_lemma
vectors_1 = [token.vector for token in doc]
assert len(vectors_1) == len(doc)

with doc.retokenize() as retokenizer:
heads = [(doc[3], 1), doc[2]]
attrs = {
"POS": ["PROPN", "PROPN"],
"LEMMA": ["New", "York"],
"DEP": ["pobj", "compound"],
}
retokenizer.split(doc[3], ["New", "York"], heads=heads, attrs=attrs)

gold_text = ["I", "live", "in", "New", "York", "right", "now"]
assert [token.text for token in doc] == gold_text
gold_lemma = ["I", "live", "in", "New", "York", "right", "now"]
assert [token.lemma_ for token in doc] == gold_lemma
vectors_2 = [token.vector for token in doc]
assert len(vectors_2) == len(doc)
assert vectors_1[0].tolist() == vectors_2[0].tolist()
assert vectors_1[1].tolist() == vectors_2[1].tolist()
assert vectors_1[2].tolist() == vectors_2[2].tolist()
assert vectors_1[4].tolist() == vectors_2[5].tolist()
assert vectors_1[5].tolist() == vectors_2[6].tolist()


def test_doc_retokenize_split(en_vocab):
words = ["LosAngeles", "start", "."]
heads = [1, 2, 2]
Expand Down
Loading

0 comments on commit 7d50804

Please sign in to comment.