Restructure Matrix read operations #314
base: main
Conversation
This pull request has been linked to Shortcut Story #43913: Restructure Matrix read operations.
Thanks for doing this!
(Optional)(Non-blocking) It would be really quite nice if we could unit test this, since if it breaks right now we'll have no way to know. Maybe we could set readBatchSize in the constructor, defaulting to READ_BATCH_SIZE, and then set it to a smaller number in some unit tests. A sketch of that idea follows.
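A minimal sketch of that suggestion. The class name MatrixReader and the member name are hypothetical placeholders, not names taken from the PR:

```cpp
#include <cstddef>

// From the PR: default cap on cells read per TileDB request.
constexpr size_t DEFAULT_READ_BATCH_SIZE_CELLS = 100000000;

// Hypothetical class name; the PR's actual reader class may differ.
class MatrixReader {
 public:
  explicit MatrixReader(
      size_t read_batch_size_cells = DEFAULT_READ_BATCH_SIZE_CELLS)
      : read_batch_size_cells_(read_batch_size_cells) {}

 private:
  size_t read_batch_size_cells_;  // cap used when sizing each read request
};

// In a unit test, force the batching path with a tiny cap:
// MatrixReader reader(/*read_batch_size_cells=*/16);
```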
@@ -36,6 +36,28 @@
#include <string>
#include <tiledb/tiledb>

// Default batch size for all TileDB read operations.
// This is expressed in number of array cells read per request.
constexpr size_t DEFAULT_READ_BATCH_SIZE_CELLS = 100000000;
I suggest using 100'000'000 for legibility.
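That would look like the following (the apostrophes are C++14 digit separators; the value is unchanged):

```cpp
// Same value as 100000000, grouped for readability.
constexpr size_t DEFAULT_READ_BATCH_SIZE_CELLS = 100'000'000;
```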
It looks like the reads are done in batches until the total size of the array is reached. I don't think this is a good way to do it. The initial request should be the total_size we want to read, since that will complete in the vast, vast majority of cases. If it doesn't, then we should continue the query and get more.
There is no reason to batch things this way; that is a pattern for a different use case.
Especially if there is a fixed overhead per query, we want to minimize the number of queries actually submitted.
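A sketch of the pattern being suggested, following the standard TileDB incomplete-query loop. It reuses query, subarray, layout_order, and total_size from the surrounding code; the attribute name "a" and the int32_t element type are placeholders:

```cpp
#include <vector>
#include <tiledb/tiledb>

// Request everything up front; iterate only if TileDB reports INCOMPLETE.
std::vector<int32_t> data(total_size);
query.set_subarray(subarray)
     .set_layout(layout_order)
     .set_data_buffer("a", data);

tiledb::Query::Status status;
do {
  query.submit();
  status = query.query_status();
  // On INCOMPLETE, consume what arrived (see query.result_buffer_elements())
  // and resubmit; in the common case this loop runs exactly once.
} while (status == tiledb::Query::Status::INCOMPLETE);
```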
tiledb_helpers::submit_query(tdb_func__, uri_, query);
query.set_subarray(subarray).set_layout(layout_order);
tiledb::Query::Status status;
do {
Does this result in incomplete queries whenever read_batch_size_cells is less than total_size? I don't think we should batch reads unless we actually need to. That is, we should always try to read total_size and then iterate if we don't get it.
Having incomplete queries here is a very rare edge case. Reading in batches may or may not solve that issue, but it adds latency to the common case.
The point of read_batch_size_cells is to set up a cap on the individual request size to TileDB. The cap is configurable, so we could set it to a large value in order to perform all requests in one go.
To explain why I expect this to reduce our memory footprint without a large latency overhead, let's use an example (a worked version of the arithmetic follows the list):
- Assume we want to read 15 GB of data from a tiledb:// array
- Assume that TileDB requires some extra memory (20%) to perform the filter pipeline (decompression, etc.)
- Assume read I/O throughput and network speed are 200 MB/s and all operations are I/O bound (no CPU overhead)
- Assume each extra call to core adds 500 ms of latency overhead
- One go
  - Peak allocated memory on client: 15 GB
  - Peak allocated memory on server: 15 GB + 3 GB = 18 GB
  - Peak allocated memory (REST + client when collocated during a realtime UDF): 33 GB
  - Latency: 500 ms + 2 * 15 GB / 200 MB/s = 150.5 s
- With batch size 1 GB
  - Peak allocated memory on client: 15 GB
  - Peak allocated memory on server: 1 GB + 0.3 GB = 1.3 GB
  - Peak allocated memory (REST + client when collocated during a realtime UDF): 16.3 GB
  - Latency: 15 * 500 ms + 2 * 15 GB / 200 MB/s = 157.5 s
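A quick back-of-the-envelope check of those latency figures, using only the numbers stated above (the factor of two covers the storage read plus the network transfer, each at 200 MB/s, per the assumptions):

```cpp
#include <cstdio>

int main() {
  const double total_gb = 15.0;        // data to read
  const double throughput_gb_s = 0.2;  // 200 MB/s
  const double per_query_s = 0.5;      // fixed latency per extra call to core

  // Storage read + network transfer, each at 200 MB/s: 150 s total.
  const double transfer_s = 2.0 * total_gb / throughput_gb_s;

  std::printf("one go:        %.1f s\n", 1 * per_query_s + transfer_s);   // 150.5 s
  std::printf("1 GB batches:  %.1f s\n", 15 * per_query_s + transfer_s);  // 157.5 s
}
```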
One rule in software is "don't pay for what you don't need." It is true that there is only a 5% penalty with this particular set of assumptions. However, I expect the server is going to deal with its memory footprint in its own way, which is why we might get incomplete reads. I would say just request what is needed and let the server say how much memory it wants to devote to that or not.
Do other apps try to be parsimonious this way?
I guess, on the other hand, since this is configurable, we can always just make it really large.
See in-line comment.
Restructure Matrix read operations to handle incomplete queries and batching