Memory estimate fixes #720
Conversation
magdyksaleh commented on Dec 16, 2024 (edited)
- Update the memory wiggle room to 0.9 (from 0.8)
- Ensure `free_memory` is never negative (see the sketch after this list)
- Lazily load graphs for larger compile batch sizes
- Add a new argument, `compile_batch_size` (default 32), which is the batch size we will initially use for CUDA compilation of models
- Make the pre-commit hooks use ruff instead of flake8
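A minimal sketch of the intended free-memory estimate (the constant and helper names here, `MEMORY_WIGGLE_ROOM` and `estimate_free_memory`, are illustrative and may not match the actual code in the server):

```python
import torch

# Illustrative constant: this PR bumps the wiggle-room factor from 0.8 to 0.9,
# i.e. 90% of the measured free memory is now treated as safely usable.
MEMORY_WIGGLE_ROOM = 0.9


def estimate_free_memory(reserved_memory: int = 0) -> int:
    """Estimate usable free CUDA memory, clamped so it is never negative."""
    free_memory, _total = torch.cuda.mem_get_info()
    usable = int(free_memory * MEMORY_WIGGLE_ROOM) - reserved_memory
    # Guard against a negative estimate when reservations exceed free memory.
    return max(usable, 0)
```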
or not graph.input_state.traced_adapter_layer_names.issuperset(adapter_data.layer_names())
# This is the case where COMPILE_BATCH_SIZE < batch_size <= MAX_BATCH_SIZE so
# we just retrace the graph for that new size
or batch_size > self.batch_size
lorax/server/lorax_server/utils/graph.py
Lines 478 to 498 in 12e530a
def can_use_graph(
    self,
    batch: "FlashCausalLMBatch",
    adapter_data: AdapterBatchData,
) -> bool:
    ranks = adapter_data.ranks()
    nranks = len(ranks)
    max_rank = max(ranks) if len(ranks) > 0 else 0

    batch_size = batch.input_ids.shape[0]
    max_s = batch.max_current_length

    # TODO(travis): allow using CUDA graphs with multi-rank batches
    return (
        torch.cuda.is_available()
        and batch_size <= MAX_BATCH_SIZE
        and max_s <= self.max_total_tokens
        and max_rank <= MAX_RANK
        and nranks <= 1
        and max_rank in _allowed_ranks
    )
This ensures that we stay within the range COMPILE_BATCH_SIZE < batch_size <= MAX_BATCH_SIZE when deciding whether to retrace the graph.
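A hypothetical sketch of the lazy-retrace path described above; the actual `GraphWrapper`/graph-cache logic in `graph.py` differs in its details, and the `LazyGraphCache` name and the MAX_BATCH_SIZE value below are illustrative only:

```python
COMPILE_BATCH_SIZE = 32   # default of the new compile_batch_size argument
MAX_BATCH_SIZE = 256      # illustrative upper bound


class LazyGraphCache:
    def __init__(self):
        # Graphs are initially traced only up to COMPILE_BATCH_SIZE.
        self.batch_size = COMPILE_BATCH_SIZE
        self.graph = None

    def get(self, batch_size: int):
        if batch_size > MAX_BATCH_SIZE:
            return None  # too large for CUDA graphs, fall back to eager
        if self.graph is None or batch_size > self.batch_size:
            # COMPILE_BATCH_SIZE < batch_size <= MAX_BATCH_SIZE:
            # retrace the graph for the larger size on first use.
            self.batch_size = max(batch_size, self.batch_size)
            self.graph = self._trace(self.batch_size)
        return self.graph

    def _trace(self, batch_size: int):
        ...  # CUDA graph capture for the given batch size
```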
LGTM!
Actually, I think we need changes to warmup.
LGTM!