Compare memory needs for dataframe vs read/write dataframe in lmdb and aws s3 #2135

Open
wants to merge 2 commits into base: master
Conversation

grusev
Collaborator

@grusev grusev commented Jan 23, 2025

Reference Issues/PRs

The purpose of this test is to compare the memory efficiency of read and write operations over large dataframes, against plain pandas and between storages.
    comparisons
        pandas_dataframe to read_lmdb
        pandas_dataframe to read_aws
        pandas_dataframe to write_lmdb
        pandas_dataframe to write_aws
        write_lmdb to write_aws
        read_lmdb to read_aws

During the experiments it was determined that the memory cost of creating a dataframe can be related to the cost of the read/write processes against LMDB and AWS S3 storage.

If constructing a dataframe takes X amount of memory, the read/write processes may take from 1.x to 2.5 times X. The bigger the number of rows, the bigger the multiplier.
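As a worked example of this multiplier check, using the figures from the failed run in the log below (the 1.8 factor and the MB measurements are taken directly from that log):

```python
# Threshold check: measured memory must stay under factor * base memory.
base_mb = 1820.5390625            # create_dataframe peak memory (from the log)
factor = 1.8                      # assumed efficiency factor
threshold_mb = factor * base_mb   # 3276.9703125 MB

read_lmdb_mb = 3388.60546875      # mem_read_dataframe_arctic_lmdb peak memory
actual_factor = read_lmdb_mb / base_mb  # ~1.8613, matching the logged value

# The check fails because 3388.6 MB exceeds the 3276.97 MB threshold.
print(f"threshold: {threshold_mb} MB, actual factor: {actual_factor}")
```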

The test showcases an approach for writing tests that could address such concerns in the future, and the example below shows an actual case.

NOTE: my Wi-Fi connection to S3, as reported by Windows, is approx. 1 MByte/s.
NOTE 2: The approach uses memory_profiler, which I added to the required libraries.
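The soft-assertion pattern used by the test (record each failed comparison instead of raising, so every check runs, then fail once at the end) can be sketched as follows. The helper name and figures here are illustrative, taken from the failed run in the log below:

```python
# Minimal sketch of the soft-assertion pattern: each check records its
# failure instead of raising, so all comparisons run before the test fails.
errors = []

def check_threshold(name, base_mb, factor, measured_mb):
    """Record (rather than raise) when measured memory exceeds factor * base."""
    threshold_mb = factor * base_mb
    if measured_mb > threshold_mb:
        errors.append(f"Too big memory for {name} [{measured_mb}] MB "
                      f"compared to calculated threshold {threshold_mb} MB")

# Figures taken from the failed run below.
check_threshold("mem_write_dataframe_arctic_lmdb", 1820.5390625, 1.8, 2754.63671875)
check_threshold("mem_read_dataframe_arctic_lmdb", 1820.5390625, 1.8, 3388.60546875)

print(errors)  # only the read_lmdb check exceeded its threshold
# The real test ends with a single hard assertion:
#     assert len(errors) == 0, f"Errors {errors}"
```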

LINK TO GITHUB EXECUTION: https://github.com/man-group/ArcticDB/actions/runs/12946758440/job/36112967427

Log from a failed execution, which perhaps explains more than words:

========================================================================================================= test session starts =========================================================================================================
platform linux -- Python 3.10.12, pytest-8.3.4, pluggy-1.5.0
rootdir: /home/grusev/source/dependencies_fix
configfile: pyproject.toml
plugins: memray-1.7.0, cpp-2.6.0, xdist-3.6.1, hypothesis-6.72.4, timeout-2.3.1
collected 1 item

python/tests/stress/arcticdb/version_store/test_mem_comparison.py INFO:MemCompare:Creating dataframe with 2000000 rows
INFO:MemCompare:START: create_dataframe
INFO:MemCompare:Time took: 1.589540958404541
INFO:MemCompare:create_dataframe 1820.5390625MB
INFO:MemCompare:START: mem_write_dataframe_arctic_lmdb
INFO:MemCompare:Time took: 2.4767887592315674
INFO:MemCompare:mem_write_dataframe_arctic_lmdb 2754.63671875MB
INFO:MemCompare:START: mem_read_dataframe_arctic_lmdb
INFO:MemCompare:Time took: 2.5483663082122803
INFO:MemCompare:mem_read_dataframe_arctic_lmdb 3388.60546875MB
INFO:MemCompare:START: mem_write_dataframe_arctic_aws_s3
20250123 14:26:45.704482 89818 W arcticdb | Failed to find segment for key 'r:test_dataframe' : No response body.
20250123 14:26:45.772885 89818 W arcticdb | Failed to find segment for key 'V:test_dataframe' : No response body.
20250123 14:28:59.478910 89818 W arcticdb | Failed to find segment for key 'r:test_dataframe' : No response body.
20250123 14:28:59.544956 89818 W arcticdb | Failed to find segment for key 'V:test_dataframe' : No response body.
INFO:MemCompare:Time took: 134.48248195648193
INFO:MemCompare:mem_write_dataframe_arctic_aws_s3 3769.94140625MB
INFO:MemCompare:START: mem_read_dataframe_arctic_aws_s3
INFO:MemCompare:Time took: 136.6903235912323
INFO:MemCompare:mem_read_dataframe_arctic_aws_s3 4069.9921875MB
INFO:MemCompare:REPORTED MEM USAGE: 1042000128 .. NOTE: This is not reliable
INFO:MemCompare:{<function test_read_write_memory_compare..create_dataframe at 0x7f8e33435120>: 1820.5390625, <function test_read_write_memory_compare..mem_write_dataframe_arctic_lmdb at 0x7f8e334351b0>: 2754.63671875, <function test_read_write_memory_compare..mem_read_dataframe_arctic_lmdb at 0x7f8e33435240>: 3388.60546875, <function test_read_write_memory_compare..mem_write_dataframe_arctic_aws_s3 at 0x7f8e334352d0>: 3769.94140625, <function test_read_write_memory_compare..mem_read_dataframe_arctic_aws_s3 at 0x7f8e33435360>: 4069.9921875}
INFO:MemCompare:We assume <function test_read_write_memory_compare..create_dataframe at 0x7f8e33435120> is 1.8 times more efficient than <function test_read_write_memory_compare..mem_write_dataframe_arctic_lmdb at 0x7f8e334351b0>
INFO:MemCompare:ACTUAL Efficiency factor is : 1.5130884997146279
INFO:MemCompare:Check OK for: mem_write_dataframe_arctic_lmdb
INFO:MemCompare:We assume <function test_read_write_memory_compare..create_dataframe at 0x7f8e33435120> is 1.8 times more efficient than <function test_read_write_memory_compare..mem_read_dataframe_arctic_lmdb at 0x7f8e33435240>
INFO:MemCompare:ACTUAL Efficiency factor is : 1.8613198357286003
ERROR:MemCompare:Too big memory for mem_read_dataframe_arctic_lmdb [3388.60546875] MB compared to calculated threshold 3276.9703125 MB
[base was 1820.5390625],
File: /home/grusev/source/dependencies_fix/python/tests/stress/arcticdb/version_store/test_mem_comparison.py:139

INFO:MemCompare:We assume <function test_read_write_memory_compare..create_dataframe at 0x7f8e33435120> is 1.8 times more efficient than <function test_read_write_memory_compare..mem_write_dataframe_arctic_aws_s3 at 0x7f8e334352d0>
INFO:MemCompare:ACTUAL Efficiency factor is : 2.070783035587845
ERROR:MemCompare:Too big memory for mem_write_dataframe_arctic_aws_s3 [3769.94140625] MB compared to calculated threshold 3276.9703125 MB
[base was 1820.5390625],
File: /home/grusev/source/dependencies_fix/python/tests/stress/arcticdb/version_store/test_mem_comparison.py:140

INFO:MemCompare:We assume <function test_read_write_memory_compare..create_dataframe at 0x7f8e33435120> is 1.8 times more efficient than <function test_read_write_memory_compare..mem_read_dataframe_arctic_aws_s3 at 0x7f8e33435360>
INFO:MemCompare:ACTUAL Efficiency factor is : 2.235597286174682
ERROR:MemCompare:Too big memory for mem_read_dataframe_arctic_aws_s3 [4069.9921875] MB compared to calculated threshold 3276.9703125 MB
[base was 1820.5390625],
File: /home/grusev/source/dependencies_fix/python/tests/stress/arcticdb/version_store/test_mem_comparison.py:141

INFO:MemCompare:We assume <function test_read_write_memory_compare..mem_write_dataframe_arctic_lmdb at 0x7f8e334351b0> is 1.4 times more efficient than <function test_read_write_memory_compare..mem_write_dataframe_arctic_aws_s3 at 0x7f8e334352d0>
INFO:MemCompare:ACTUAL Efficiency factor is : 1.3685802489268803
INFO:MemCompare:Check OK for: mem_write_dataframe_arctic_aws_s3
INFO:MemCompare:We assume <function test_read_write_memory_compare..mem_read_dataframe_arctic_lmdb at 0x7f8e33435240> is 1.4 times more efficient than <function test_read_write_memory_compare..mem_read_dataframe_arctic_aws_s3 at 0x7f8e33435360>
INFO:MemCompare:ACTUAL Efficiency factor is : 1.2010817503051934
INFO:MemCompare:Check OK for: mem_read_dataframe_arctic_aws_s3
FDelete library : test_read_write_memory_compare.413_2025-01-23T12_25_43_362884

============================================================================================================== FAILURES ===============================================================================================================
___________________________________________________________________________________________________ test_read_write_memory_compare ____________________________________________________________________________________________________

lmdb_library = Library(Arctic(config=LMDB(path=/tmp/pytest-of-grusev/pytest-25/test_read_write_memory_compare0)), path=test_read_write_memory_compare.413_2025-01-23T12_25_43_362884, storage=lmdb_storage)
real_s3_library = Library(Arctic(config=S3(endpoint=s3.eu-west-1.amazonaws.com, bucket=arcticdb-ci-test-bucket-02)), path=test_read_write_memory_compare.413_2025-01-23T12_25_43_362884, storage=s3_storage)

@REAL_S3_TESTS_MARK
@SLOW_TESTS_MARK
def test_read_write_memory_compare(lmdb_library, real_s3_library):
    '''
    Purpose of this test is to compare read and write operations' memory
    efficiency over large dataframes with pandas and between storages.
        comparisons
            pandas_dataframe to read_lmdb
            pandas_dataframe to read_aws
            pandas_dataframe to write_lmdb
            pandas_dataframe to write_aws
            write_lmdb write_aws
            read_lmdb read_aws
    '''

    lmdb = lmdb_library
    s3 = real_s3_library
    size = 2000000
    logger.info(f"Creating dataframe with {size} rows")
    frame = create_in_memory_frame(size)
    symbol = "test_dataframe"
    df: pd.DataFrame = None

    errors = []
    mem_allocs = {}

    def get_caller_line():
        """
        This function provides the caller's line and file.
        To be used for soft-assertion style checks so that
        when an error arises the actual line is recorded but execution proceeds
        """
        current_frame = inspect.currentframe()
        caller_frame = current_frame.f_back.f_back # two frames up: the caller of the checking function

        file_name = caller_frame.f_code.co_filename
        line_number = caller_frame.f_lineno

        return f"File: {file_name}:{line_number}"

    def check_effectiveness_of_2similar_operations(timesEff,
                                                   arctic_function_1st_storage,
                                                   arctic_function_2nd_storage):
        nonlocal errors
        nonlocal mem_allocs
        mem_threshold = timesEff * mem_allocs[arctic_function_1st_storage]
        func_mem = mem_allocs[arctic_function_2nd_storage]
        logger.info(f"We assume {arctic_function_1st_storage} is {timesEff} times more efficient than {arctic_function_2nd_storage}")
        logger.info(f"ACTUAL Efficiency factor is : {func_mem / mem_allocs[arctic_function_1st_storage]}")
        if (mem_threshold < func_mem):
            err = f"Too big memory for {arctic_function_2nd_storage.__name__} [{func_mem}] "
            err += f" MB compared to calculated threshold  {mem_threshold} MB \n"
            err += f" [base was {mem_allocs[arctic_function_1st_storage]}],\n {get_caller_line()}\n\n"
            errors.append(err)
            logger.error(err)
        else:
            logger.info(f"Check OK for: {arctic_function_2nd_storage.__name__}")

    def create_dataframe():
        nonlocal df
        df = pd.DataFrame(frame)

    def mem_write_dataframe_arctic_lmdb():
        lmdb.write(symbol, df)

    def mem_read_dataframe_arctic_lmdb():
        lmdb.read(symbol)

    def mem_write_dataframe_arctic_aws_s3():
        s3.write(symbol, df)

    def mem_read_dataframe_arctic_aws_s3():
        s3.read(symbol)

    funcs = [create_dataframe,
             mem_write_dataframe_arctic_lmdb,
             mem_read_dataframe_arctic_lmdb,
             mem_write_dataframe_arctic_aws_s3,
             mem_read_dataframe_arctic_aws_s3]

    for func in funcs:
        logger.info(f"START: {func.__name__} ")
        st = time.time()
        mem_usage = memory_profiler.memory_usage(func, interval=0.1, max_iterations=1)
        logger.info(f"Time took: {time.time() - st} ")
        max_memory = max(mem_usage)
        mem_allocs[func] = max_memory
        logger.info(f"{func.__name__} {max_memory}MB")

    memory_usage_per_column = df.memory_usage(index=True,deep=True)
    total_memory_usage = memory_usage_per_column.sum()

    logger.info(f"REPORTED MEM USAGE: {total_memory_usage} .. NOTE: This is not reliable")

    logger.info(f"{mem_allocs}")

    check_effectiveness_of_2similar_operations(1.8, create_dataframe,mem_write_dataframe_arctic_lmdb)
    check_effectiveness_of_2similar_operations(1.8, create_dataframe,mem_read_dataframe_arctic_lmdb)
    check_effectiveness_of_2similar_operations(1.8, create_dataframe,mem_write_dataframe_arctic_aws_s3)
    check_effectiveness_of_2similar_operations(1.8, create_dataframe,mem_read_dataframe_arctic_aws_s3)
    check_effectiveness_of_2similar_operations(1.4, mem_write_dataframe_arctic_lmdb, mem_write_dataframe_arctic_aws_s3)
    check_effectiveness_of_2similar_operations(1.4, mem_read_dataframe_arctic_lmdb, mem_read_dataframe_arctic_aws_s3)
  if len(errors) > 0:

E AssertionError: Errors ['Too big memory for mem_read_dataframe_arctic_lmdb [3388.60546875] MB compared to calculated threshold 3276.9703125 MB \n [base was 1820.5390625],\n File: /home/grusev/source/dependencies_fix/python/tests/stress/arcticdb/version_store/test_mem_comparison.py:139\n\n', 'Too big memory for mem_write_dataframe_arctic_aws_s3 [3769.94140625] MB compared to calculated threshold 3276.9703125 MB \n [base was 1820.5390625],\n File: /home/grusev/source/dependencies_fix/python/tests/stress/arcticdb/version_store/test_mem_comparison.py:140\n\n', 'Too big memory for mem_read_dataframe_arctic_aws_s3 [4069.9921875] MB compared to calculated threshold 3276.9703125 MB \n [base was 1820.5390625],\n File: /home/grusev/source/dependencies_fix/python/tests/stress/arcticdb/version_store/test_mem_comparison.py:141\n\n']
E assert False

python/tests/stress/arcticdb/version_store/test_mem_comparison.py:146: AssertionError
---------------------------------------------------------------------------------------------------------- Captured log call ----------------------------------------------------------------------------------------------------------
INFO MemCompare:test_mem_comparison.py:60 Creating dataframe with 2000000 rows
INFO MemCompare:test_mem_comparison.py:123 START: create_dataframe
INFO MemCompare:test_mem_comparison.py:126 Time took: 1.589540958404541
INFO MemCompare:test_mem_comparison.py:129 create_dataframe 1820.5390625MB
INFO MemCompare:test_mem_comparison.py:123 START: mem_write_dataframe_arctic_lmdb
INFO MemCompare:test_mem_comparison.py:126 Time took: 2.4767887592315674
INFO MemCompare:test_mem_comparison.py:129 mem_write_dataframe_arctic_lmdb 2754.63671875MB
INFO MemCompare:test_mem_comparison.py:123 START: mem_read_dataframe_arctic_lmdb
INFO MemCompare:test_mem_comparison.py:126 Time took: 2.5483663082122803
INFO MemCompare:test_mem_comparison.py:129 mem_read_dataframe_arctic_lmdb 3388.60546875MB
INFO MemCompare:test_mem_comparison.py:123 START: mem_write_dataframe_arctic_aws_s3
INFO MemCompare:test_mem_comparison.py:126 Time took: 134.48248195648193
INFO MemCompare:test_mem_comparison.py:129 mem_write_dataframe_arctic_aws_s3 3769.94140625MB
INFO MemCompare:test_mem_comparison.py:123 START: mem_read_dataframe_arctic_aws_s3
INFO MemCompare:test_mem_comparison.py:126 Time took: 136.6903235912323
INFO MemCompare:test_mem_comparison.py:129 mem_read_dataframe_arctic_aws_s3 4069.9921875MB
INFO MemCompare:test_mem_comparison.py:134 REPORTED MEM USAGE: 1042000128 .. NOTE: This is not reliable
INFO MemCompare:test_mem_comparison.py:136 {<function test_read_write_memory_compare..create_dataframe at 0x7f8e33435120>: 1820.5390625, <function test_read_write_memory_compare..mem_write_dataframe_arctic_lmdb at 0x7f8e334351b0>: 2754.63671875, <function test_read_write_memory_compare..mem_read_dataframe_arctic_lmdb at 0x7f8e33435240>: 3388.60546875, <function test_read_write_memory_compare..mem_write_dataframe_arctic_aws_s3 at 0x7f8e334352d0>: 3769.94140625, <function test_read_write_memory_compare..mem_read_dataframe_arctic_aws_s3 at 0x7f8e33435360>: 4069.9921875}
INFO MemCompare:test_mem_comparison.py:89 We assume <function test_read_write_memory_compare..create_dataframe at 0x7f8e33435120> is 1.8 times more efficient than <function test_read_write_memory_compare..mem_write_dataframe_arctic_lmdb at 0x7f8e334351b0>
INFO MemCompare:test_mem_comparison.py:90 ACTUAL Efficiency factor is : 1.5130884997146279
INFO MemCompare:test_mem_comparison.py:98 Check OK for: mem_write_dataframe_arctic_lmdb
INFO MemCompare:test_mem_comparison.py:89 We assume <function test_read_write_memory_compare..create_dataframe at 0x7f8e33435120> is 1.8 times more efficient than <function test_read_write_memory_compare..mem_read_dataframe_arctic_lmdb at 0x7f8e33435240>
INFO MemCompare:test_mem_comparison.py:90 ACTUAL Efficiency factor is : 1.8613198357286003
ERROR MemCompare:test_mem_comparison.py:96 Too big memory for mem_read_dataframe_arctic_lmdb [3388.60546875] MB compared to calculated threshold 3276.9703125 MB
[base was 1820.5390625],
File: /home/grusev/source/dependencies_fix/python/tests/stress/arcticdb/version_store/test_mem_comparison.py:139

INFO MemCompare:test_mem_comparison.py:89 We assume <function test_read_write_memory_compare..create_dataframe at 0x7f8e33435120> is 1.8 times more efficient than <function test_read_write_memory_compare..mem_write_dataframe_arctic_aws_s3 at 0x7f8e334352d0>
INFO MemCompare:test_mem_comparison.py:90 ACTUAL Efficiency factor is : 2.070783035587845
ERROR MemCompare:test_mem_comparison.py:96 Too big memory for mem_write_dataframe_arctic_aws_s3 [3769.94140625] MB compared to calculated threshold 3276.9703125 MB
[base was 1820.5390625],
File: /home/grusev/source/dependencies_fix/python/tests/stress/arcticdb/version_store/test_mem_comparison.py:140

INFO MemCompare:test_mem_comparison.py:89 We assume <function test_read_write_memory_compare..create_dataframe at 0x7f8e33435120> is 1.8 times more efficient than <function test_read_write_memory_compare..mem_read_dataframe_arctic_aws_s3 at 0x7f8e33435360>
INFO MemCompare:test_mem_comparison.py:90 ACTUAL Efficiency factor is : 2.235597286174682
ERROR MemCompare:test_mem_comparison.py:96 Too big memory for mem_read_dataframe_arctic_aws_s3 [4069.9921875] MB compared to calculated threshold 3276.9703125 MB
[base was 1820.5390625],
File: /home/grusev/source/dependencies_fix/python/tests/stress/arcticdb/version_store/test_mem_comparison.py:141

INFO MemCompare:test_mem_comparison.py:89 We assume <function test_read_write_memory_compare..mem_write_dataframe_arctic_lmdb at 0x7f8e334351b0> is 1.4 times more efficient than <function test_read_write_memory_compare..mem_write_dataframe_arctic_aws_s3 at 0x7f8e334352d0>
INFO MemCompare:test_mem_comparison.py:90 ACTUAL Efficiency factor is : 1.3685802489268803
INFO MemCompare:test_mem_comparison.py:98 Check OK for: mem_write_dataframe_arctic_aws_s3
INFO MemCompare:test_mem_comparison.py:89 We assume <function test_read_write_memory_compare..mem_read_dataframe_arctic_lmdb at 0x7f8e33435240> is 1.4 times more efficient than <function test_read_write_memory_compare..mem_read_dataframe_arctic_aws_s3 at 0x7f8e33435360>
INFO MemCompare:test_mem_comparison.py:90 ACTUAL Efficiency factor is : 1.2010817503051934
INFO MemCompare:test_mem_comparison.py:98 Check OK for: mem_read_dataframe_arctic_aws_s3
========================================================================================================== warnings summary ===========================================================================================================
python/tests/stress/arcticdb/version_store/test_mem_comparison.py::test_read_write_memory_compare
python/tests/stress/arcticdb/version_store/test_mem_comparison.py::test_read_write_memory_compare
/home/grusev/venvs/310/lib/python3.10/site-packages/pandas/core/frame.py:717: DeprecationWarning: Passing a BlockManagerUnconsolidated to DataFrame is deprecated and will raise in a future version. Use public APIs instead.
warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
======================================================================================================= short test summary info =======================================================================================================
FAILED python/tests/stress/arcticdb/version_store/test_mem_comparison.py::test_read_write_memory_compare - AssertionError: Errors ['Too big memory for mem_read_dataframe_arctic_lmdb [3388.60546875] MB compared to calculated threshold 3276.9703125 MB \n [base was 1820.5390625],\n File: /home/grusev/source/dependencies_fix/python/t...
============================================================================================== 1 failed, 2 warnings in 338.61s (0:05:38) ==============================================================================================

What does this implement or fix?

Any other comments?

Checklist

Checklist for code changes...
  • Have you updated the relevant docstrings, documentation and copyright notice?
  • Is this contribution tested against all ArcticDB's features?
  • Do all exceptions introduced raise appropriate error messages?
  • Are API changes highlighted in the PR description?
  • Is the PR labelled as enhancement or bug so it appears in autogenerated release notes?
