Expanded sharded support for alternative sharding mechanisms #680

Open
wants to merge 3 commits into base: main

Conversation

rsuderman
Contributor

Single-logical-multi-physical sharding allows tensor access between
different devices and tighter synchronization on execution. This means
sharding needs to support not only differing device ordinals but
also configuring multiple queues on the same device. Sharded tensor types
are reworked to track both the device a shard is assigned to AND the queue
it is enqueued on.

To support this, each sharded tensor now tracks the DeviceAffinity it is
associated with, and affinities can be reassigned after construction.
This allows pre-sharded models to have their affinities updated to use an
alternative transfer mechanism.

If a device affinity is not specified, the default arrangement assumes
separate device ordinals for each shard.
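
As a rough illustration of the placement model described above, here is a minimal Python sketch; DeviceAffinity and ShardedTensorSketch are hypothetical stand-ins used for illustration, not the sharktank API.

from dataclasses import dataclass, field


@dataclass(frozen=True)
class DeviceAffinity:
    """Hypothetical stand-in: a physical device ordinal plus the queues to enqueue on."""

    device_ordinal: int
    queues: tuple[int, ...] = ()


@dataclass
class ShardedTensorSketch:
    """Illustrative only: tracks one affinity per shard."""

    shard_count: int
    affinities: list[DeviceAffinity] = field(default_factory=list)

    def __post_init__(self):
        # Default arrangement: a separate device ordinal for each shard.
        if not self.affinities:
            self.affinities = [DeviceAffinity(i) for i in range(self.shard_count)]

    def reassign_affinities(self, affinities: list[DeviceAffinity]) -> None:
        # Lets a pre-sharded model be retargeted after construction.
        assert len(affinities) == self.shard_count
        self.affinities = list(affinities)


t = ShardedTensorSketch(shard_count=2)  # shard 0 -> device 0, shard 1 -> device 1
# Single-logical-multi-physical: both shards on device 0, on separate queues.
t.reassign_affinities([DeviceAffinity(0, (0,)), DeviceAffinity(0, (1,))])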

@@ -279,16 +283,15 @@ def main():
        tensor_parallelism_size=args.tensor_parallelism_size,
        fake_quant=args.fake_quant,
    )
    if config.tensor_parallelism_size > 1:
        dataset.root_theta = shard_theta(dataset.root_theta, config)
Collaborator

Can remove shard_theta import if unused.

@stbaione
Contributor

I saw the following error this morning when attempting to validate toy_llama_tp2 from iree-test-suites by exporting and compiling, with the intent to then verify with iree-run-module, just to make sure the patch worked. This was with shortfin/sharktank and a locally built IREE at HEAD:

  1. Obtain assets from iree-test-suites, specifically toy_llama_tp2.irpa, toy_llama_tp2.rank0.irpa, toy_llama_tp2.rank1.irpa
  2. Export to MLIR:
python -m sharktank.examples.export_paged_llm_v1 --bs=1 --irpa-file assets/toy_llama_tp2.irpa --output-mlir=llama.mlir --output-config=config.json --use-queue-affinities
  3. Attempt to compile to vmfb. Started with compiling the sharded llama for a single device for the simplest validation:
iree-compile llama.mlir -o llama.vmfb --iree-hip-target=gfx942 --iree-hal-target-device=hip[0]

Received the following error:

/toy_new/llama.mlir:4027:12: error: op affinity #hal.device.affinity<@__device_0> is not compatible with the partition affinity #hal.device.affinity<@__device_0, [0]>
    %153 = torch.prims.convert_element_type %1, %int5_87 : !torch.vtensor<[256,256],f32>, !torch.int -> !torch.vtensor<[256,256],f16>
           ^
./toy_new/llama.mlir:4027:12: note: see current operation: %190 = "stream.async.transfer"(%189, %10, %10) <{result_affinity = #hal.device.affinity<@__device_0>, source_affinity = #hal.device.affinity<@__device_0, [1]>}> : (!stream.resource<constant>, index, index) -> !stream.resource<constant>

Feedback from Rob this morning before sync:

Hmmm, see if you can figure out where the wrong affinity is.
Looks like something is not placed correctly.
Given it's an async transfer I would guess we need to strip the transfers in sharded_impls.py
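
The gist of that suggestion, as a hypothetical sketch (reusing the illustrative DeviceAffinity stand-in from the description above, not the actual sharded_impls.py code): a transfer whose source and result affinities share a device ordinal and differ only in queue needs no cross-device copy, so it is a candidate for stripping.

def can_elide_transfer(src: DeviceAffinity, dst: DeviceAffinity) -> bool:
    # Same physical device, possibly different queues: no cross-device copy is
    # required, though queue-level synchronization may still be needed.
    return src.device_ordinal == dst.device_ordinal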

self.rope_dimension_count = rope_dimension_count
self.max_seqlen = max_seqlen
self.use_hf = use_hf
self.static_tables = static_tables
self.use_table = use_table

self.rope_freq_base = rope_freq_base if rope_freq_base is not None else 10000.0
self.tensor_parallelism_size = tensor_parallelism_size
self.devices = devices
Collaborator

This is redundant with L34, can be removed.
