
Memray tests for memory leaks for head() and tail() #2199

Open · wants to merge 5 commits into base: master

Conversation

@grusev (Collaborator) commented Feb 24, 2025

Reference Issues/PRs

The test has a special fixture to set up the needed environment (kept out of the test body, as setup slows execution extremely under memray):

  • creates a library with library options predefined by the test, then adds a symbol and creates many versions plus a snapshot for each version. The dataframes grow in size when the library has a dynamic schema.

The test is executed twice with different library options for segment sizes, dynamic schema on/off, and encoding version.

It covers head() and tail() executions with different parameters over different types of versions/snapshots of a symbol.
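A hedged sketch of the setup loop the fixture performs (generate_wide_dataframe, lib.write and lib.snapshot follow the snippets quoted later in this review; the exact signatures are assumptions):

    # One growing dataframe per iteration: each write creates a new version,
    # and each version gets its own snapshot for later as_of reads.
    for rows in num_rows_list:
        df = generate_wide_dataframe(num_rows=rows, start_time=pd.Timestamp(0), seed=64578)
        lib.write(symbol, df)
        snap = f"{symbol}_{rows}"
        lib.snapshot(snap)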

NOTE: utils.py is not part of this PR. It belongs to #2185 but is included here to reuse code, as that PR is not yet merged.

Additional notes:
Why the Linux threshold is so high, see:
3.11 - https://github.com/man-group/ArcticDB/actions/runs/13517989453/job/37771476992?pr=2199
3.9 - https://github.com/man-group/ArcticDB/actions/runs/13517989453/job/37771214446
What can be done to address massive "leaks" that should not be considered leaks is filtering, see https://github.com/man-group/ArcticDB/actions/runs/13517989379/job/37770659554?pr=2199
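A hedged sketch of what such a filter could look like (filter_fn matches the limit_leaks marker quoted later in this review; the callback signature and stack attributes are assumptions about the memray integration, and the frame name is hypothetical):

    def is_relevant(stack) -> bool:
        # Illustrative only: drop allocations whose stack contains a frame
        # we have decided is benign; keep everything else in the report.
        for frame in stack.frames:
            if "known_benign_function" in frame.function:  # hypothetical benign frame
                return False
        return True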

Overall, I raised an issue to address flakiness and to have stress tests run on a debug build, perhaps with only a single Python version on only 3 runners with 3 different OSes. That would make it possible to filter out frames we know are OK, as in this example, and reduce the time for other functional tests.

What does this implement or fix?

Change Type (Required)

  • Patch (Bug fix or non-breaking improvement)
  • Minor (New feature, but backward compatible)
  • Major (Breaking changes)
  • Cherry pick

Any other comments?

Checklist

Checklist for code changes...
  • Have you updated the relevant docstrings, documentation and copyright notice?
  • Is this contribution tested against all ArcticDB's features?
  • Do all exceptions introduced raise appropriate error messages?
  • Are API changes highlighted in the PR description?
  • Is the PR labelled as enhancement or bug so it appears in autogenerated release notes?

        def test_my_test(lmdb_library_any):
            .....
    """
    params = request.param if hasattr(request, 'param') else {}
Collaborator:

According to the example above this shouldn't be param but library_options

Collaborator (Author):

Nope, param will contain all parameters: library_options and others, if given.
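A minimal sketch of the indirect-parametrization flow being discussed (the fixture body follows the quoted snippet; the parametrize values and the create_library call are illustrative):

    @pytest.fixture
    def lmdb_library_any(lmdb_storage, lib_name, request) -> Library:
        # request.param carries the whole dict passed via indirect parametrization
        params = request.param if hasattr(request, 'param') else {}
        options = params.get('library_options')  # other keys may ride along too
        return lmdb_storage.create_arctic().create_library(lib_name, library_options=options)

    @pytest.mark.parametrize("lmdb_library_any", [
        {'library_options': LibraryOptions(dynamic_schema=True, encoding_version=EncodingVersion.V2)},
    ], indirect=True)
    def test_my_test(lmdb_library_any):
        ...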

@@ -128,6 +128,22 @@ def lmdb_storage(tmp_path) -> Generator[LmdbStorageFixture, None, None]:
def lmdb_library(lmdb_storage, lib_name) -> Library:
    return lmdb_storage.create_arctic().create_library(lib_name)

@pytest.fixture
def lmdb_library_any(lmdb_storage, lib_name, request) -> Library:
Collaborator:

What does any mean in the name? Shouldn't it be lmdb_library_with_options or something like that?

Collaborator (Author):

see next comment

@@ -128,6 +128,22 @@ def lmdb_storage(tmp_path) -> Generator[LmdbStorageFixture, None, None]:
def lmdb_library(lmdb_storage, lib_name) -> Library:
    return lmdb_storage.create_arctic().create_library(lib_name)

@pytest.fixture
def lmdb_library_any(lmdb_storage, lib_name, request) -> Library:
Collaborator:

We already have a fixture named version_store_factory (the naming is not great though) which does the same job. We should reuse that and make it return a V2 library instead.

Collaborator:

Yeah, at least have a similar design and API to version_store_factory, but perhaps a new fixture that returns a Library, like library_factory. There's no reason for them to be meaningfully different to each other.

Collaborator (Author):

lmdb_library is now the only one remaining. I figured out that if no params are passed, lmdb_library will continue to function like it used to.

Collaborator (Author):

As this is a slow test it must run only once, hence version_store_factory is not optimal.

@@ -723,7 +750,106 @@ def test_mem_leak_read_all_arctic_lib_memray(library_with_big_symbol_):
    logger.info("Test starting")
    st = time.time()
    data: pd.DataFrame = lib.read(symbol).data
    lib.head
Collaborator:

This is not calling the head function

Collaborator (Author):

Some refactoring cleanup mess; removed, there should not be any line there. Thanks for spotting this.


    all_columns = df.columns.to_list()
    yield (lib, symbol, num_rows_list, snapshot_names, all_columns)
    lib.delete(symbol=symbol)
Collaborator:

I doubt that this will be called at all

Collaborator:

Why, Vasil?

Collaborator (Author):

It is always called. That is actually one of the core functionalities of fixtures: they provide a way to do something before (setup) and after (cleanup) the test, so that you isolate the test logic in the test method only. That is the reason they yield. The cleanup phase is also protected from any problems that might arise during test execution, i.e. it will always execute.
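A minimal standalone sketch of the guarantee being described, using nothing beyond stock pytest:

    import pytest

    @pytest.fixture
    def resource():
        res = {"handle": "open"}   # setup: runs before the test body
        yield res                  # the test executes at this point
        res.clear()                # teardown: runs even if the test body raises

    def test_uses_resource(resource):
        assert resource["handle"] == "open"

One caveat: if the setup code before the yield itself raises, the teardown line is never reached; only failures in the test body are covered.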

df3: pd.DataFrame = store.head(n=rows, as_of=ver, symbol=symbol, columns=number_columns).data
difference = list(set(df3.columns.to_list()).difference(set(number_columns)))
assert len(difference) == 0, f"Columns not included : {difference}"
df4 = store.tail(n=rows, as_of=ver, symbol=symbol, columns=number_columns).data
Collaborator:

I don't like how some of the declarations have type annotations and some don't and it's completely arbitrary which ones do have annotations. We should be consistent in our code style otherwise it's hard to follow the code.

Collaborator (Author):

I am OK with that, but not all of us will use type hints, thus our code is already hard to follow. IntelliSense in IDEs uses type hints, so they help in many cases to write and maintain the code better even without linters.

Thus an approach that works for both is the intersection of needs: add hints where there is value (complex objects); no need to add them where there is no value.

We will not have a linter anytime soon, so annotating everything is not adding value.

# constructing a random list of values for versions names for each iteration
versions_list: List[int] = np.random.randint(0, len(num_rows_list) - 1, iterations)
# constructing a random list of values for column selection for each iteration
number_columns_list: List[int] = np.random.randint(0, len(all_columns)-1, iterations)
Collaborator:

The columns parameter takes a list of column names, not a list of indexes.

Collaborator (Author) @grusev, Feb 28, 2025:

Renamed to number_columns_for_selection_list and edited the comment as well.
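A hedged sketch of how such a count can be turned into actual column names before the call (variable names follow the rename; the sampling is illustrative):

    # number_columns_for_selection_list holds counts, not names;
    # sample that many real column names for this iteration
    n = number_columns_for_selection_list[count]
    selected_columns = list(np.random.choice(all_columns, size=n, replace=False))
    df3 = store.head(n=rows, as_of=ver, symbol=symbol, columns=selected_columns).data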

Comment on lines +839 to +840
difference = list(set(df3.columns.to_list()).difference(set(number_columns)))
assert len(difference) == 0, f"Columns not included : {difference}"
Collaborator:

I don't think we need this assert here. I have two arguments:

  1. It's obviously not helping and not checking any correctness, as it's not doing anything and the test passes. You are passing number_columns_list, which is a list of ints, while all columns are named something like "int8_0" according to generate_wide_dataframe. Thus the read returns only the index column, the returned column list is empty, and the difference of the empty set and anything else always has length 0.
  2. Separation of concerns. We can add unit tests to check the correctness of the columns parameter (and in fact we have such tests), but this test is a memory leak test. If we try checking this, why not go all the way and check that head/tail return the correct data, etc.? This leaves more questions than answers.
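A standalone illustration of the vacuous pass described in point 1 (all values made up):

    requested = {3, 7}            # ints, but real columns are named like "int8_0"
    returned_columns = []         # nothing matched, so only the index came back
    difference = set(returned_columns).difference(requested)
    assert len(difference) == 0   # passes vacuously: the empty set minus anything is empty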

Collaborator (Author):

It was and is working, but a poor choice of names led to the assumption that no columns are selected; they are selected ... now the names make their meaning clear.

As for item 2, with tests this is not true. With tests you can (and perhaps should) assume that a problem is lurking everywhere. Thus adding checks at the places that catch a problem earliest is always advised, because otherwise long tests continue running with the problem present and you get a PASS instead of a fail.

The only exception to this principle is performance tests: there, minimal checks are added for the obvious reason that checks carry a perf penalty.

(store, symbol, num_rows_list, snapshot_names, all_columns) = prepare_head_tails_symbol

start_test = time.time()
min_rows = min(num_rows_list)
Collaborator:

This is not used in the code

Collaborator:

I do have some remarks on the code here but won't comment and will leave them for the other review.

Collaborator (Author):

You can use this PR for those: #2185

@grusev marked this pull request as ready for review February 27, 2025 09:09
start_time=pd.Timestamp(0),seed=64578)
lib.write(symbol,df)
snap = f"{symbol}_{rows}"
lib.snapshot(snap)
Collaborator:

Confused. Why is there a snapshot?

Collaborator (Author):

head and tail have an 'as_of' parameter. Later the snapshots are used for that purpose:

    store.head(n=rows, as_of=snap, symbol=symbol

and the versions likewise:

    store.tail(n=rows, as_of=ver, symbol=symbol, columns=selected_columns)

Thus I believe this covers the maximum number of code paths.

{'library_options': LibraryOptions(rows_per_segment=233, columns_per_segment=197, dynamic_schema=True, encoding_version=EncodingVersion.V2)},
{'library_options': LibraryOptions(rows_per_segment=99, columns_per_segment=99, dynamic_schema=False, encoding_version=EncodingVersion.V1)}
], indirect=True)
@pytest.mark.limit_leaks(location_limit="52 KB" if not LINUX else "380 KB", filter_fn=is_relevant)
Collaborator:

Invert the if statement to avoid the double negative.
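For instance, a sketch of the inverted condition (same thresholds as the quoted marker, negation removed):

    @pytest.mark.limit_leaks(location_limit="380 KB" if LINUX else "52 KB", filter_fn=is_relevant)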


del store, symbol, num_rows_list, snapshot_names, all_columns
del num_rows_to_select, important_values, snapshots_list, versions_list, number_columns_list
gc.collect()
Collaborator:

Why do we need the del and the collect()?

Collaborator (Author):

Just making sure nothing is left over from the Python side of this test; thus any remaining leaks should be searched for outside of it.

number_columns_list: List[int] = np.random.randint(0, len(all_columns)-1, iterations)

count = 0
for rows in num_rows_to_select:
Collaborator:

I don't really understand what this loop is trying to do, and as Vasil says below, it looks like it is actually incorrect.

Collaborator (Author):

Added a comment:

    # We will execute all head/tail operations several times with a specific number of columns.
    # The number of columns consists of random columns plus the boundary cases, see the definition above.

and in the code above:

    # constructing a list of head and tail rows to be selected
    num_rows_to_select = []
    important_values = [0, 1, -1, 2, -2, max_rows, -max_rows]
    num_rows_to_select.extend(important_values)
    num_rows_to_select.extend(np.random.randint(low=5, high=99, size=7))  # add 7 more random values

@poodlewars (Collaborator) left a comment.

Labels: patch (Small change, should increase patch version)