VectorizationBase

This library provides some basic functionality meant to aid in in vectorizing code on x86 architectures (most laptops, desktops, and servers). Building it depends on CpuID.jl, which creates a file defining a fewconstants important to vectorization, such as the SIMD widths. For example:

julia> using VectorizationBase

julia> VectorizationBase.pick_vector_width(Float64)
8

julia> VectorizationBase.pick_vector_width(Float32)
16

julia> VectorizationBase.pick_vector(Float64)
NTuple{8,VecElement{Float64}}

This means that on the computer that ran this code, a single vector register can hold 8 elements of type Float64 or 16 Float32.

For more information on VecElements, please see Julia's documentation or Kristoffer Carlson's blog post SIMD and SIMD-intrinsics in Julia. In short, they are lowered as LLVM vectors, which (while housing multiple elements) are treated as single values, to be operated on in parallel with SIMD instructions. This is contrast with an array, which is an aggregate collection of values.

The libraries SIMDPirates and SLEEFPirates (which are themselves forks of SIMD.jl and SLEEF.jl) define many functions operating on tuples of VecElements, for writing explicit SIMD code.

Perhaps used more often than VectorizationBase.pick_vector_width is VectorizationBase.pick_vector_width_shift:

julia> W, Wshift = VectorizationBase.pick_vector_width_shift(Float64)
(8, 3)

This function returns the ideal vector width for the type, as well as log2(width). This is useful for calculating the number of loop iterations. For example, if we wish to iterate over a vector of length 117:

julia> 117 >> Wshift
14

julia> 117 & (W - 1)
5

julia> 14W + 5
117

We would want 14 iterations of 8, leaving us a remainder of 5. We could either loop over this remainder one at a time, or use masked instructions. VectorizationBase.mask provides a convenience function for producing a mask:

julia> VectorizationBase.mask(Float64, 5)
0x1f

julia> bitstring(ans)
"00011111"

Masks are to be read from right to left. For example, we could define a masked load as follows:

@inline function vload8d(ptr::Ptr{Float64}, mask::UInt8)
	Base.llvmcall(
		("declare <8 x double> @llvm.masked.load.v8f64(<8 x double>*, i32, <8 x i1>, <8 x double>)",
		"""%ptr = inttoptr i64 %0 to <8 x double>*
		%mask = bitcast i8 %1 to <8 x i1>
		%res = call <8 x double> @llvm.masked.load.v8f64(<8 x double>* %ptr, i32 8, <8 x i1> %mask, <8 x double> zeroinitializer)
		ret <8 x double> %res"""),
		Vec{8, Float64}, Tuple{Ptr{Float64}, UInt8}, ptr, mask
	)
end

This allows us to load just 5 elements (and therefore avoid segmentation faults):

julia> x = rand(5); x'
1×5 LinearAlgebra.Adjoint{Float64,Array{Float64,1}}:
 0.928247  0.889502  0.533114  0.285248  0.275795

julia> y = rand(8); y'
1×8 LinearAlgebra.Adjoint{Float64,Array{Float64,1}}:
 0.402572  0.771600  0.242454  0.699283  0.618579  0.804612  0.904894  0.234704
 
julia> vload8d(pointer(x), 0x1f)
(VecElement{Float64}(0.9282465954844357), VecElement{Float64}(0.8895020822839887), VecElement{Float64}(0.5331136178366147), VecElement{Float64}(0.28524793374254176), VecElement{Float64}(0.2757945162086832), VecElement{Float64}(0.0), VecElement{Float64}(0.0), VecElement{Float64}(0.0))

julia> vload8d(pointer(y), 0x1f)
(VecElement{Float64}(0.4025719724148169), VecElement{Float64}(0.7715998492280507), VecElement{Float64}(0.242453946944301), VecElement{Float64}(0.6992828239389028), VecElement{Float64}(0.6185788376359711), VecElement{Float64}(0.0), VecElement{Float64}(0.0), VecElement{Float64}(0.0))

julia> vload8d(pointer(y) + 3sizeof(Float64), 0x1f)
(VecElement{Float64}(0.6992828239389028), VecElement{Float64}(0.6185788376359711), VecElement{Float64}(0.8046118255195078), VecElement{Float64}(0.904893953223624), VecElement{Float64}(0.23470368695369492), VecElement{Float64}(0.0), VecElement{Float64}(0.0), VecElement{Float64}(0.0))

This compiles to efficient native code on the host machine:

julia> @code_native debuginfo=:none vload8d(pointer(x), 0x1f)
	.text
	kmovd	%esi, %k1
	vmovupd	(%rdi), %zmm0 {%k1} {z}
	retq
	nopl	(%rax,%rax)

You don't need to know LLVM to use this library, SIMDPirates, or SLEEFPirates; they should abstract everything away. If you have any feature requests, or would like any helper functions such as those mentioned here, please don't hesitate to file an issue or open a pull request.

Name		Name	Last commit message	Last commit date
Latest commit History 389 Commits
.github/workflows		.github/workflows
docs		docs
src		src
test		test
.appveyor.yml		.appveyor.yml
.gitignore		.gitignore
LICENSE		LICENSE
Manifest.toml		Manifest.toml
Project.toml		Project.toml
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

VectorizationBase

About

Releases

Packages

Languages

License

turingtest37/VectorizationBase.jl

Folders and files

Latest commit

History

Repository files navigation

VectorizationBase

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages