Fix global linear indexing (fill!) #496

Merged
merged 1 commit into from
Dec 16, 2024

christiangnrd
Contributor

Implementation borrowed from the CUDA.jl version.

Close #466

Contributor

@github-actions github-actions bot left a comment

Metal Benchmarks

| Benchmark suite | Current: 28fe952 | Previous: 634cec7 | Ratio |
|---|---|---|---|
| private array/construct | 27387 ns | 26887 ns | 1.02 |
| private array/broadcast | 460417 ns | 462333 ns | 1.00 |
| private array/random/randn/Float32 | 806479.5 ns | 865083 ns | 0.93 |
| private array/random/randn!/Float32 | 631125 ns | 658875 ns | 0.96 |
| private array/random/rand!/Int64 | 566333 ns | 548542 ns | 1.03 |
| private array/random/rand!/Float32 | 598083.5 ns | 584292 ns | 1.02 |
| private array/random/rand/Int64 | 750791 ns | 756500 ns | 0.99 |
| private array/random/rand/Float32 | 617292 ns | 605542 ns | 1.02 |
| private array/copyto!/gpu_to_gpu | 683708 ns | 658041 ns | 1.04 |
| private array/copyto!/cpu_to_gpu | 651791.5 ns | 701083 ns | 0.93 |
| private array/copyto!/gpu_to_cpu | 822896 ns | 824854.5 ns | 1.00 |
| private array/accumulate/1d | 1313500 ns | 1325958.5 ns | 0.99 |
| private array/accumulate/2d | 1381083 ns | 1382792 ns | 1.00 |
| private array/iteration/findall/int | 2062187 ns | 2067791 ns | 1.00 |
| private array/iteration/findall/bool | 1816000 ns | 1825375 ns | 0.99 |
| private array/iteration/findfirst/int | 1688250 ns | 1674750 ns | 1.01 |
| private array/iteration/findfirst/bool | 1645416 ns | 1646437.5 ns | 1.00 |
| private array/iteration/scalar | 3873833 ns | 3884646 ns | 1.00 |
| private array/iteration/logical | 3163875 ns | 3164458 ns | 1.00 |
| private array/iteration/findmin/1d | 1734833.5 ns | 1740125 ns | 1.00 |
| private array/iteration/findmin/2d | 1348875 ns | 1346458 ns | 1.00 |
| private array/reductions/reduce/1d | 1034000 ns | 1020729 ns | 1.01 |
| private array/reductions/reduce/2d | 651208.5 ns | 664250 ns | 0.98 |
| private array/reductions/mapreduce/1d | 1033791 ns | 1032125 ns | 1.00 |
| private array/reductions/mapreduce/2d | 658334 ns | 659542 ns | 1.00 |
| private array/permutedims/4d | 2540500 ns | 2720542 ns | 0.93 |
| private array/permutedims/2d | 1011000 ns | 1011208 ns | 1.00 |
| private array/permutedims/3d | 1579959 ns | 1574854 ns | 1.00 |
| private array/copy | 603542 ns | 557250 ns | 1.08 |
| latency/precompile | 5146918500 ns | 5138248250 ns | 1.00 |
| latency/ttfp | 6634638146 ns | 6754936625 ns | 0.98 |
| latency/import | 1162510500 ns | 1151697916.5 ns | 1.01 |
| integration/metaldevrt | 712750 ns | 719667 ns | 0.99 |
| integration/byval/slices=1 | 1567645.5 ns | 1560604.5 ns | 1.00 |
| integration/byval/slices=3 | 10250000 ns | 10389833 ns | 0.99 |
| integration/byval/reference | 1546208 ns | 1566416.5 ns | 0.99 |
| integration/byval/slices=2 | 2583708 ns | 2542084 ns | 1.02 |
| kernel/indexing | 459667 ns | 487187.5 ns | 0.94 |
| kernel/indexing_checked | 451250 ns | 469791.5 ns | 0.96 |
| kernel/launch | 9895.833333333332 ns | 8042 ns | 1.23 |
| metal/synchronization/stream | 14708 ns | 14125 ns | 1.04 |
| metal/synchronization/context | 15250 ns | 14542 ns | 1.05 |
| shared array/construct | 26496.583333333336 ns | 26607.14285714286 ns | 1.00 |
| shared array/broadcast | 470917 ns | 453208 ns | 1.04 |
| shared array/random/randn/Float32 | 820708 ns | 790125 ns | 1.04 |
| shared array/random/randn!/Float32 | 666750 ns | 668750 ns | 1.00 |
| shared array/random/rand!/Int64 | 564875 ns | 573042 ns | 0.99 |
| shared array/random/rand!/Float32 | 590042 ns | 596500 ns | 0.99 |
| shared array/random/rand/Int64 | 771042 ns | 780209 ns | 0.99 |
| shared array/random/rand/Float32 | 590292 ns | 618625 ns | 0.95 |
| shared array/copyto!/gpu_to_gpu | 86583 ns | 87334 ns | 0.99 |
| shared array/copyto!/cpu_to_gpu | 88292 ns | 98084 ns | 0.90 |
| shared array/copyto!/gpu_to_cpu | 82375 ns | 77041 ns | 1.07 |
| shared array/accumulate/1d | 1325583.5 ns | 1330458 ns | 1.00 |
| shared array/accumulate/2d | 1384459 ns | 1388291 ns | 1.00 |
| shared array/iteration/findall/int | 1801250 ns | 1768750 ns | 1.02 |
| shared array/iteration/findall/bool | 1564583 ns | 1573208 ns | 0.99 |
| shared array/iteration/findfirst/int | 1384375 ns | 1392333 ns | 0.99 |
| shared array/iteration/findfirst/bool | 1364167 ns | 1363479.5 ns | 1.00 |
| shared array/iteration/scalar | 158270.5 ns | 152500 ns | 1.04 |
| shared array/iteration/logical | 2951000 ns | 2956500 ns | 1.00 |
| shared array/iteration/findmin/1d | 1471166.5 ns | 1459687.5 ns | 1.01 |
| shared array/iteration/findmin/2d | 1370666.5 ns | 1350209 ns | 1.02 |
| shared array/reductions/reduce/1d | 732833 ns | 718916 ns | 1.02 |
| shared array/reductions/reduce/2d | 653000 ns | 670875 ns | 0.97 |
| shared array/reductions/mapreduce/1d | 732583 ns | 722583 ns | 1.01 |
| shared array/reductions/mapreduce/2d | 667542 ns | 669292 ns | 1.00 |
| shared array/permutedims/4d | 2555500 ns | 2722021 ns | 0.94 |
| shared array/permutedims/2d | 1021812.5 ns | 1005625 ns | 1.02 |
| shared array/permutedims/3d | 1586041 ns | 1573479 ns | 1.01 |
| shared array/copy | 242416 ns | 248062.5 ns | 0.98 |

This comment was automatically generated by workflow using github-action-benchmark.
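
For readers checking the numbers by hand: the Ratio column appears to be the current timing divided by the previous one, rounded to two decimals, so values above 1.00 mean the current commit was slower. The sketch below reproduces that arithmetic for three rows copied from the benchmark results. The helper names and the 1.10 cutoff are assumptions made for illustration, not settings taken from the benchmark workflow.

```python
# Rows copied from the benchmark results: (name, current ns, previous ns).
ROWS = [
    ("private array/construct", 27387, 26887),
    ("private array/copy", 603542, 557250),
    ("kernel/launch", 9895.833333333332, 8042),
]


def ratio(current_ns, previous_ns):
    """Current time over previous time, rounded to two decimals."""
    return round(current_ns / previous_ns, 2)


def slower_than(rows, cutoff):
    """Names of benchmarks whose ratio exceeds the (hypothetical) cutoff."""
    return [name for name, cur, prev in rows if ratio(cur, prev) > cutoff]


if __name__ == "__main__":
    for name, cur, prev in ROWS:
        print(f"{name}: {ratio(cur, prev):.2f}")
    print("above 1.10:", slower_than(ROWS, 1.10))
```

With this cutoff only `kernel/launch` is flagged; single-run timings at the microsecond scale are noisy, so a ratio like this is a prompt to re-measure rather than proof of a regression.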

@maleadt
Member

maleadt commented Dec 14, 2024

What is the underlying issue here? The blame on CUDA.jl's implementation goes back to when we imported that code from KA.jl, so cc @vchuravy.

@christiangnrd
Contributor Author

christiangnrd commented Dec 14, 2024

I think the real fix may be to switch from dispatchThreadgroups to dispatchThreads by default, since every device we support allows nonuniform threadgroup sizes.

However, this would be a very big (potentially breaking) change.
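
To make the difference between the two dispatch styles concrete, here is a small CPU-only Python simulation. It is not Metal.jl code and not the change made in this PR. It only models the index arithmetic, assuming that dispatching whole threadgroups rounds the launch up to a multiple of the threadgroup size, while dispatching an exact thread count lets the last threadgroup be smaller. All function names are hypothetical.

```python
import math


def threadgroup_dispatch_indices(n, threads_per_group):
    """Global linear indices launched when only whole threadgroups are dispatched."""
    groups = math.ceil(n / threads_per_group)
    return [g * threads_per_group + t
            for g in range(groups)
            for t in range(threads_per_group)]


def thread_dispatch_indices(n, threads_per_group):
    """Global linear indices launched when exactly n threads are dispatched.

    The last threadgroup may be smaller (nonuniform), so no index exceeds n - 1.
    """
    return list(range(n))


def simulated_fill(dest, value, indices, guard):
    """Write value at every launched index, optionally skipping out-of-range ones.

    On a real GPU an unguarded out-of-range write would not raise; the IndexError
    here just makes the problem visible in the simulation.
    """
    for i in indices:
        if guard and i >= len(dest):
            continue
        dest[i] = value
    return dest


if __name__ == "__main__":
    n, group = 10, 4
    rounded = threadgroup_dispatch_indices(n, group)
    print(len(rounded), max(rounded))  # 12 threads launched, highest index 11
    print(simulated_fill([0] * n, 7, rounded, guard=True))
    print(simulated_fill([0] * n, 7, thread_dispatch_indices(n, group), guard=False))
```

With 10 elements and threadgroups of 4, the threadgroup-based launch produces indices 10 and 11 that fall outside the array, so the kernel needs a bounds guard. The exact-count launch never produces them.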

@christiangnrd
Contributor Author

I opened #497 for discussion. In the meantime, I think this PR should be merged as-is (assuming the code is sound).

@maleadt maleadt merged commit 8c119cf into main Dec 16, 2024
2 checks passed
@maleadt maleadt deleted the kafill branch December 16, 2024 08:18
Successfully merging this pull request may close these issues.

fill broken after KA integration