Append creates very slow node to read #2093

Open
giuse88 opened this issue Dec 26, 2024 · 4 comments
Labels
bug Something isn't working

Comments


giuse88 commented Dec 26, 2024

Describe the bug

Hi,

I noticed that when I append a dataframe to a node, reading that node becomes very slow. To clarify, here is the code which reproduces the problem:

import time
from datetime import datetime

import numpy as np
import pandas as pd

# `ac` is an existing Arctic instance connected to the S3 bucket
lib = ac['test_lib']
lib.write('test', pd.Series([1, 1, 1]))
print(lib.read('test').data)

cols = ['COL_%d' % i for i in range(50)]
df = pd.DataFrame(np.random.randint(0, 50, size=(1000, 50)), columns=cols)
df.index = pd.date_range(datetime(2000, 1, 1, 5), periods=1000, freq="h")
print(lib.write('test', df), 'Data Written')

start = time.time()
print(lib.read('test').data)
end = time.time()
print(end - start)

# Append the same data one row at a time to a second symbol
lib.delete('test_ap')
for idx in range(len(df)):
    lib.append('test_ap', df.iloc[[idx]])
    print(idx)

print('append done')
start = time.time()
print(lib.read('test_ap').data)
end = time.time()
print(end - start)

output:

Lib available
0    1
1    1
2    1
dtype: int64
VersionedItem(symbol='test', library='test_lib', data=n/a, version=15, metadata=None, host='S3(endpoint=s3.eu-west-2.amazonaws.com, bucket=crypto-data-s3)', timestamp=1735175412730032201) Data Written
                     COL_0  COL_1  COL_2  COL_3  COL_4  COL_5  COL_6  COL_7  COL_8  COL_9  COL_10  COL_11  ...  COL_38  COL_39  COL_40  COL_41  COL_42  COL_43  COL_44  COL_45  COL_46  COL_47  COL_48  COL_49
2000-01-01 05:00:00     19     49     34     13     40      1     32     36     32     14      38       2  ...      44       2      23       0      33       9       3      22      33      20      11      26
2000-01-01 06:00:00      8     25     41     26     33     48     32     36      1      5      21      45  ...      25      21       6      16      12      47       6      11      48      37      23      48
2000-01-01 07:00:00     21     14     45     21     10      7      5     22     24     27      49       8  ...       3      10      22      29      33       4      44      12       4      27      43      26
2000-01-01 08:00:00     33      7     49     19     40     47     26     32      7     20      28      30  ...      21       1      23      45      22      31      18      12      43      11       2       3
2000-01-01 09:00:00     38      6     15     13     17      7     11     22     12     39      35       1  ...       3       6      13      46       6       1       6      12       0      25       7      18
...                    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...     ...     ...  ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...
2001-02-20 16:00:00     32      0     25     15     17     35     15     34     19     46      32      29  ...       1      18      17       5      42      26      24      16       5      31      48      40
2001-02-20 17:00:00     31      9     32     18     46     12      5     13      0     49      28      31  ...      33       5      49      45      33       4      22      38      41      15      23      25
2001-02-20 18:00:00      0     24     13     35     13     43     34     41     35      0      19       8  ...      28       7      31      14      47      10      17      38       5      41      32      47
2001-02-20 19:00:00     29     45     21     40     19     12     13     25     39     38      16      10  ...      36      26      49      23       8       2      18      46      42      39      27      29
2001-02-20 20:00:00     46      5     37     41     14     25     17     37      0     15       0       6  ...      39      19      28       2       2      25      28      48       7      13      35      42

[10000 rows x 50 columns]
0.3140294551849365
append done
                     COL_0  COL_1  COL_2  COL_3  COL_4  COL_5  COL_6  COL_7  COL_8  COL_9  COL_10  COL_11  ...  COL_38  COL_39  COL_40  COL_41  COL_42  COL_43  COL_44  COL_45  COL_46  COL_47  COL_48  COL_49
2000-01-01 05:00:00     11     18     21     42     42     28     48     35      5     35      35      37  ...       2      44      44      46       5      49      13      26      35      49       6       7
2000-01-01 06:00:00      9     48     16     14     21     39     33     27     21      0      40      31  ...      28      13      23      24      39      44      25      26      43      40      16       7
2000-01-01 07:00:00     11     25      1     35     29     18     19     32     31     10      29      21  ...      31      49       8      19      17       7      35      32      35      31      43       4
2000-01-01 08:00:00      2     22      6     12      6     34     12     42     21     49       6      43  ...      47       6      46      46      45      15      16       7      14      37      29       5
2000-01-01 09:00:00     28      8      3     34     39      9      2     32     34      1      29       1  ...      19      32      22      43      24       8      19      43      41      15       4      47
...                    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...     ...     ...  ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...
2000-04-26 21:00:00     44     10      9     15     41     11     12     38     31      7      22      47  ...      40       4      12      31      23      38      30      25      38      20      42      39
2000-04-26 22:00:00      9     28     31     18     10     21     21      5     36     33      18      45  ...      30      33       8      19      42      24      24      29      18      19       8      34
2000-04-26 23:00:00      4     46     42     26     10     36     23      0     46     23       6      26  ...      31      42      27      32      13      32      35      36      35      10      10       5
2000-04-27 00:00:00     35     44     46      3     37     42      3      5     41     31      44      13  ...       5      19       7      15      44      26      46      33       2      17      25      49
2000-04-27 01:00:00      5     40     40     37     28     16      5     48     36     37      43      38  ...      49      45      18      46       3      45      13      14      40      29      35      49

[2805 rows x 50 columns]
47.270530462265015

You can see that reading the data from the appended symbol takes 47s.

The library is exactly the same; the problem is the append function.

Expected Results

Reading the appended symbol takes roughly the same amount of time as reading the symbol created with a single write.

OS, Python Version and ArcticDB Version

Python: 3.11.11 (main, Dec 4 2024, 08:55:08) [GCC 13.2.0]
OS: Linux-6.8.0-1018-aws-x86_64-with-glibc2.39
ArcticDB: 5.1.2

Backend storage used

AWS S3

Additional Context

None

giuse88 added the bug label on Dec 26, 2024
G-D-Petrov (Collaborator) commented Dec 27, 2024

Hi @giuse88,
This is more-or-less expected, as appends create a separate version in storage.
When reading, all of the individual versions need to be read, and in the repro above this means that 1000 individual IO operations need to be performed, which can be expensive over the network.
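
A quick way to confirm this, assuming the standard arcticdb Library API and the 'test_ap' symbol from the repro above:

# Each lib.append(...) call in the loop created one new version of 'test_ap'
versions = lib.list_versions('test_ap')
print(len(versions))  # roughly one version per append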

It does seem slower than expected, though, so I will investigate further.
In the meantime, a workaround for your use case is to read the data after the appends and rewrite it to the symbol, e.g. something like lib.write("test_ap", lib.read("test_ap").data).
This will make subsequent reads much faster, i.e. as fast as the read after the initial write in your repro.
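
A minimal sketch of that workaround, assuming the lib and 'test_ap' symbol from the repro:

import time

# Compact the symbol: read the many appended segments once, then
# rewrite them as a single new version with far fewer storage objects.
data = lib.read('test_ap').data
lib.write('test_ap', data)

start = time.time()
lib.read('test_ap').data  # reads now hit the compacted version
print(time.time() - start)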

giuse88 (Author) commented Dec 27, 2024

This is a bit weird; I thought the data structure would be the same between a write and an append! Apart from what you suggested, isn’t there a possibility to turn off this behaviour?

The other thing I've noticed is that I am creating a version for each append. Is it possible to turn it off?

giuse88 (Author) commented Dec 27, 2024

Does ArcticDB support write with an index?

G-D-Petrov (Collaborator) commented:

isn’t there a possibility to turn off this behaviour?

The other thing I've noticed is that I am creating a version for each append. Is it possible to turn it off?

It is not possible to turn off this behaviour for append; it is intentional that every write-style operation (write, append, update) creates a new version.
That is what the architecture of ArcticDB is based around.

Does ArcticDB support write with an index?

I am not sure what you mean by this; we support writing data frames to symbols.
You can (see the sketch after this list):

  • write a whole symbol/data frame with write
  • add to an existing symbol/data frame with append
  • change an existing symbol/data frame with update
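
A minimal sketch of these three operations, assuming an arcticdb Library lib and a hypothetical symbol name 'sym':

import pandas as pd

idx = pd.date_range('2024-01-01', periods=4, freq='D')
df = pd.DataFrame({'a': [1, 2, 3, 4]}, index=idx)
lib.write('sym', df)        # create (or replace) the whole symbol

later = pd.DataFrame({'a': [5, 6]},
                     index=pd.date_range('2024-01-05', periods=2, freq='D'))
lib.append('sym', later)    # add rows after the existing index

patch = pd.DataFrame({'a': [30]}, index=[pd.Timestamp('2024-01-03')])
lib.update('sym', patch)    # replace rows in the overlapping date range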
