Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: Add test cases for arrays and objects, and introduce verify_schema #250

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

amotl
Copy link
Contributor

@amotl amotl commented Dec 13, 2023

About

This patch just aims to improve the test cases by exercising a few more variants when using container types (array/object resp. list/dictionary).

Rationale

While PostgreSQL just uses the jsonb[] type for storing them, the details may be slightly different with other databases which are otherwise largely compatible with PostgreSQL. In particular, CrateDB uses dedicated ARRAY() and OBJECT() types for achieving the same thing what PostgreSQL does with JSON(B).

However, it is not exclusively about other databases, where target-postgres would not need to be concerned about. It is also about setting the stage for other ARRAY types, which may not be converged into JSONB, but into VECTOR(N) instead, when taking PostgreSQL database extensions into account, which provide other data types conceptionally matching built-in data types. In particular, with VECTOR(N), I am looking at pgvector here.

Details

In PostgreSQL, all boils down to the jsonb[] type, but arrays are reflected as sqlalchemy.dialects.postgresql.ARRAY instead of sqlalchemy.dialects.postgresql.JSONB. The new test cases both verify and outline that fact.

In order to prepare for more advanced type mangling & validation, and to better support databases pretending to be compatible with PostgreSQL, the new test cases exercise arrays with different kinds of inner values, because, on other databases, ARRAYs may need to have uniform content.

References

Along the lines, it adds a verify_schema utility function in the spirit of the verify_data function, refactored and generalized from the test_anyof test case, building upon the recent improvements by @sebastianswms.

In PostgreSQL, all boils down to the `jsonb[]` type, but arrays are
reflected as `sqlalchemy.dialects.postgresql.ARRAY` instead of
`sqlalchemy.dialects.postgresql.JSONB`.

In order to prepare for more advanced type mangling & validation, and to
better support databases pretending to be compatible with PostgreSQL,
the new test cases exercise arrays with different kinds of inner values,
because, on other databases, ARRAYs may need to have uniform content.

Along the lines, it adds a `verify_schema` utility function in the
spirit of the `verify_data` function, refactored and generalized from
the `test_anyof` test case.
@amotl amotl force-pushed the test-array-object branch from e70dd58 to c0d4a89 Compare December 19, 2023 15:16
@amotl amotl changed the title test: Add test cases for arrays and objects test: Add test cases for arrays and objects, and introduce verify_schema Dec 19, 2023
@visch
Copy link
Member

visch commented Dec 19, 2023

From first glance at this from the ticket you mentioned @amotl I wonder if just adding 5-10 more tests would trigger this or if it's specefic to how your test is written. I haven't dove into the code I'm just offering some ideas to help others. If just adding* 5-10 duplicate tests doesn't cause the issue, then there's something up with the specific test here and I'd look into https://github.com/MeltanoLabs/target-postgres/pull/250/files#diff-992edbd5794275b6bacd3fae33d85cc8dc11591daea563b2e65c8fa7b9b5bbf5R155 but that's a wild guess

@amotl
Copy link
Contributor Author

amotl commented Dec 19, 2023

@visch: Thank you for quickly looking into this.

I wonder if just adding 5-10 more tests would trigger this or if it's specefic to how your test is written. I haven't dove into the code I'm just offering some ideas to help others. If just adding* 5-10 duplicate tests doesn't cause the issue, then there's something up with the specific test here.

I think you are right on the spot, and your question sparked my motivation to also have a more closer look. By trying to make progress through trial-and-error, adding 8ec2a15 decreases the number of corresponding FATAL: sorry, too many clients already errors from 4 1 to 1 2.

Footnotes

  1. https://github.com/MeltanoLabs/target-postgres/actions/runs/7263812193/job/19789949506

  2. https://github.com/MeltanoLabs/target-postgres/actions/runs/7265785385/job/19796243399

@amotl amotl force-pushed the test-array-object branch 2 times, most recently from 223a64c to e66085c Compare December 19, 2023 19:18
Dispose the SQLAlchemy engine object after use within test utility functions.
Within `BasePostgresSDKTests`, new database connections via SQLAlchemy
haven't been closed, and started filling up the connection pool,
eventually saturating it.
@amotl amotl force-pushed the test-array-object branch 3 times, most recently from 5a70027 to 1722b80 Compare December 19, 2023 20:34
Dispose the SQLAlchemy engine object after use within
`PostgresConnector`.
@amotl amotl force-pushed the test-array-object branch from 1722b80 to 723e1fa Compare December 19, 2023 20:41
@amotl
Copy link
Contributor Author

amotl commented Dec 19, 2023

By trying to make progress through trial-and-error...

Apparently, all of 8b3ea4f, a9d1796, 723e1fa are needed to remedy the flaw 1. I am not sure if they are correct, but they resolve the problem at hand: The patch passes CI successfully now.

Footnotes

  1. 2eede7e was also tried, but didn't make a difference.

Comment on lines 181 to +186
@contextmanager
def _connect(self) -> t.Iterator[sqlalchemy.engine.Connection]:
with self._engine.connect().execution_options() as conn:
engine = self._engine
with engine.connect().execution_options() as conn:
yield conn
engine.dispose()
Copy link
Contributor Author

@amotl amotl Dec 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This spot relates to an issue we observed after adding more test cases. Other than any other related changes to the test suite, this one affects the connector implementation. Without that change, there seems to be a flaw which apparently qualifies as a resource leak, manifesting in connection pool overflow.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would want to understand this fully before merging. What is the root cause of this problem? Are we confident that this is all that's required to close the leak?

Copy link
Contributor Author

@amotl amotl Jan 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Sebastian. Thanks for your review. I can only say that this spot has been found out by educated guessing and a bit more trial-and-error, so it was kind of empirically determined, including the solution.

Apologies that I can't come up with a better reasoning/rationale, but I can offer to submit another repro PR without all the other changes here, in order to better demonstrate the issue.

"array_boolean",
check_columns={
"id": {"type": BIGINT},
"value": {"type": ARRAY},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arrays have inner data types such as ARRAY[BOOLEAN] (not sure if that's the right syntax). We should additionally check for that.

Copy link
Contributor Author

@amotl amotl Jan 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point. Let me check how verify_schema could be improved accordingly.

We are not looking at multiple dimensions here, or eventual nestedness, right? It would just be about »verifying the type of the first element within the array against its specification, when the array is not empty«, correct?

Comment on lines 181 to +186
@contextmanager
def _connect(self) -> t.Iterator[sqlalchemy.engine.Connection]:
with self._engine.connect().execution_options() as conn:
engine = self._engine
with engine.connect().execution_options() as conn:
yield conn
engine.dispose()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would want to understand this fully before merging. What is the root cause of this problem? Are we confident that this is all that's required to close the leak?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants