SIMD Array-of-Structures-of-Arrays in nalgebra and comparison with ultraviolet

Hello everyone!

In this post I'd like to introduce the next major change to be released in nalgebra at the end of this month (March 2020): support for SIMD AoSoA. I'll explain what I mean by SIMD AoSoA (Array-of-Structures-of-Arrays with explicit SIMD) and how it relates to SoA (Structure-of-Arrays) and AoS (Array-of-Structures). To give you an idea, SIMD AoSoA is what the recent ultraviolet crate has been using to achieve its amazing performance.

Here is a benchmark including the next (to-be-released) version of nalgebra. The best times within 2.5% of each row's minimum are shown in bold. Here, the ultraviolet column uses the non-wide types of ultraviolet (Vec2, Mat2, Isometry3, etc.) while the ultraviolet_f32x4 column uses its wide types (Wec2, Wat2, WIsometry3, etc.):

| benchmark | nalgebra_f32x4 | ultraviolet_f32x4 | nalgebra | ultraviolet | glam | vek | cgmath | euclid |
|---|---|---|---|---|---|---|---|---|
| euler 2d x10000 | **2.992 us** | **3.014 us** | 9.028 us | 5.28 us | 5.166 us | 5.258 us | 5.259 us | 8.631 us |
| euler 3d x10000 | **4.546 us** | **4.587 us** | 17.84 us | 18.34 us | 6.311 us | 17.57 us | 18.04 us | 17.97 us |
| isometry transform point2 x1 | **8.1024 ns** | 8.6412 ns | 23.4637 ns | 54.5840 ns | N/A | N/A | N/A | N/A |
| isometry transform point2 x100 | **2.801 us** | **2.837 us** | 3.109 us | 4.918 us | N/A | N/A | N/A | N/A |
| isometry transform point3 x1 | **16.1052 ns** | 20.6723 ns | 61.8824 ns | 330.1143 ns | N/A | N/A | N/A | N/A |
| isometry transform point3 x100 | **4.515 us** | 4.794 us | 6.546 us | 18.19 us | N/A | N/A | N/A | N/A |
| isometry2 inverse | **7.3004 ns** | 7.7692 ns | 24.8090 ns | 59.3838 ns | N/A | N/A | N/A | N/A |
| isometry2 mul isometry2 | **12.3278 ns** | 12.7816 ns | 34.5714 ns | 110.7837 ns | N/A | N/A | N/A | N/A |
| isometry3 inverse | **17.1929 ns** | 21.4199 ns | 63.9200 ns | 212.2193 ns | N/A | N/A | N/A | N/A |
| isometry3 mul isometry3 | **27.3814 ns** | 40.3955 ns | 100.4628 ns | 447.4780 ns | N/A | N/A | N/A | N/A |
| matrix2 determinant | N/A | N/A | 16.0198 ns | N/A | **11.2070 ns** | 15.8621 ns | 16.0906 ns | N/A |
| matrix2 inverse | N/A | N/A | 26.5165 ns | N/A | **18.9069 ns** | N/A | 26.2478 ns | N/A |
| matrix2 mul matrix2 | 12.5352 ns | **10.7532 ns** | 25.4533 ns | 25.7148 ns | 16.3060 ns | 301.3775 ns | 25.6988 ns | N/A |
| matrix2 mul vector2 x1 | **7.1695 ns** | 8.0907 ns | 23.0932 ns | 24.4506 ns | 16.4141 ns | 93.3601 ns | 25.0774 ns | N/A |
| matrix2 mul vector2 x100 | **2.684 us** | **2.698 us** | 3.137 us | 3.174 us | 2.828 us | 9.558 us | 3.122 us | N/A |
| matrix2 transpose | **6.4984 ns** | N/A | 11.0205 ns | N/A | 10.9890 ns | 10.8180 ns | 10.3812 ns | N/A |
| matrix3 determinant | N/A | N/A | 33.1252 ns | N/A | **22.2592 ns** | 31.8128 ns | 31.7086 ns | N/A |
| matrix3 inverse | N/A | **26.7493 ns** | 92.8270 ns | 284.1032 ns | 93.5328 ns | N/A | 82.3820 ns | N/A |
| matrix3 mul matrix3 | 0.02883 us | **0.02638 us** | 0.09077 us | 0.08728 us | 0.0474 us | 1.005 us | 0.08579 us | N/A |
| matrix3 mul vector3 x1 | **13.7548 ns** | **13.6309 ns** | 46.4205 ns | 46.5016 ns | 21.5466 ns | 317.8168 ns | 47.3542 ns | N/A |
| matrix3 mul vector3 x100 | **4.835 us** | **4.917 us** | 6.138 us | 6.261 us | 6.633 us | 32.93 us | 6.297 us | N/A |
| matrix3 transpose | **15.7501 ns** | **15.4925 ns** | 52.7300 ns | 53.0845 ns | 24.1975 ns | 55.4793 ns | 52.0874 ns | N/A |
| matrix4 determinant | N/A | N/A | 690.7027 ns | N/A | **85.5778 ns** | 176.1770 ns | 110.6990 ns | 174.6754 ns |
| matrix4 inverse | N/A | **0.08342 us** | 0.5282 us | 0.3953 us | 0.257 us | 3.792 us | 0.5096 us | 0.651 us |
| matrix4 mul matrix4 | **0.06897 us** | **0.07068 us** | 0.1285 us | 0.1493 us | 0.07653 us | 2.024 us | 0.1185 us | 0.1152 us |
| matrix4 mul vector4 x1 | **21.1243 ns** | **20.6690 ns** | 31.5352 ns | 30.3570 ns | 24.7876 ns | 512.0570 ns | 30.9776 ns | N/A |
| matrix4 mul vector4 x100 | **7.595 us** | **7.662 us** | 8.472 us | 8.379 us | 7.823 us | 51.42 us | 8.366 us | N/A |
| matrix4 transpose | **26.2948 ns** | 27.1962 ns | 103.0829 ns | 101.3104 ns | 29.2236 ns | 105.6003 ns | 98.1151 ns | N/A |
| quaternion conjugate | **6.6130 ns** | **6.6179 ns** | 24.5468 ns | 25.5657 ns | 11.1447 ns | 24.2229 ns | 25.7246 ns | 24.9413 ns |
| quaternion mul quaternion | **14.6380 ns** | 16.6299 ns | 39.3908 ns | 274.4676 ns | 27.3533 ns | 62.0697 ns | 35.0299 ns | 59.5286 ns |
| quaternion mul vector3 | **14.7579 ns** | 18.5064 ns | 52.7303 ns | 170.8328 ns | 44.1336 ns | 81.1243 ns | 52.5011 ns | 54.1958 ns |
| transform point2 x1 | N/A | N/A | 36.6350 ns | N/A | **25.6516 ns** | 337.3442 ns | 30.5781 ns | 28.5532 ns |
| transform point2 x100 | N/A | N/A | 5.475 us | N/A | 5.732 us | 33.81 us | 5.118 us | **4.003 us** |
| transform point3 x1 | N/A | N/A | 66.3694 ns | N/A | **24.0601 ns** | 517.9224 ns | 67.8205 ns | 68.0836 ns |
| transform point3 x100 | N/A | N/A | 9.403 us | N/A | **7.824 us** | 52.5 us | 9.559 us | 10.07 us |
| transform vector2 x1 | N/A | N/A | 32.5667 ns | N/A | **21.3971 ns** | 330.5713 ns | N/A | 23.5693 ns |
| transform vector2 x100 | N/A | N/A | 5.392 us | N/A | 5.579 us | 33.45 us | N/A | **3.909 us** |
| transform vector3 x1 | N/A | N/A | 55.3600 ns | N/A | **24.3628 ns** | 515.1730 ns | 59.1129 ns | 47.9383 ns |
| transform vector3 x100 | N/A | N/A | 9.185 us | N/A | **7.958 us** | 52.39 us | 8.992 us | 8.665 us |
| transform2 inverse | N/A | N/A | 87.2060 ns | N/A | N/A | N/A | N/A | **45.0272 ns** |
| transform2 mul transform2 | N/A | N/A | 87.5702 ns | N/A | N/A | N/A | N/A | **40.1995 ns** |
| transform3 inverse | N/A | N/A | **542.8401 ns** | N/A | N/A | N/A | N/A | 748.2648 ns |
| transform3 mul transform3d | N/A | N/A | **118.0659 ns** | N/A | N/A | N/A | N/A | **115.3294 ns** |
| vector3 cross | 6.4232 ns | **6.2395 ns** | 25.0006 ns | 36.5961 ns | 15.0303 ns | 28.6082 ns | 28.6343 ns | 29.5467 ns |
| vector3 dot | 6.6820 ns | **6.3391 ns** | 25.5593 ns | 26.1824 ns | 13.6831 ns | 25.4740 ns | 25.9099 ns | 26.1271 ns |
| vector3 length | 5.3741 ns | **5.2332 ns** | 20.5369 ns | 34.2988 ns | 20.5652 ns | 20.6259 ns | 20.9281 ns | 20.6052 ns |
| vector3 normalize | **15.5892 ns** | **15.6585 ns** | 59.1804 ns | 60.9510 ns | 35.7763 ns | 61.3666 ns | 36.7304 ns | 61.3199 ns |

Please see the last section of this post for a more comprehensive benchmark (including the use of f32x8 and f32x16 with nalgebra) and details about the benchmark conditions.

What is AoSoA with explicit SIMD?

The data layout I call here AoSoA with explicit SIMD (or SIMD AoSoA for short) is a straightforward variant of the AoSoA (Array-of-Structures-of-Arrays) layout, which is itself a combination of two more common layouts: SoA (Structure-of-Arrays) and AoS (Array-of-Structures). So let's look at how all those layouts differ, using a simple example based on 3D vectors (vectors with three components x, y, z): given two arrays of 1024 3D vectors each, we want to compute the sum of each pair of vectors with the same index.

Note: The explanations here are merely a superficial introduction to the AoS vs. SoA vs. AoSoA concepts. I just want to explain some of the differences and some of their advantages/disadvantages. I won't give any detailed analysis of the assembly generated by compiling the provided examples. The benchmarks at the end of this post will show the performance difference between AoS and SIMD AoSoA. You may be interested in that article too.

Note 2: For iterating through the arrays, I'll be using explicit indices for i in 0..1024 instead of iterators. This is on purpose, to make the number of iterations explicit and to make the code easier to understand for readers who are not used to Rust iterators.

Array-of-Structures (AoS)

The Array-of-Structures layout is the most common and intuitive one: you define a structure as the aggregation of all its fields, and multiple structures of the same type are simply stored one after the other inside an array. Here is what our 3D vector looks like:

struct Vector3 {
    x: f32,
    y: f32,
    z: f32,
}

/// An array containing 1024 vectors.
type SetOfVector3 = [Vector3; 1024];

fn add_arrays_of_vectors(a: &mut SetOfVector3, b: &SetOfVector3) {
    for i in 0..1024 {
        a[i].x += b[i].x;
        a[i].y += b[i].y;
        a[i].z += b[i].z;
    }
}

Here, we need to iterate through each pair of vectors, one from each set, and execute our sum. This is arguably the most intuitive way of doing this, but not necessarily the most efficient. All Rust linear algebra crates from this benchmark can work with this layout.

Pros of AoS

  • It is easy to read/write/modify each vector individually.
  • AoS are generally easier to reason with when designing algorithms.

Cons of AoS

  • Not as efficient as other layouts when working on a large number of elements of the same type.

Structure-of-Arrays (SoA)

The Structure-of-Arrays layout is less intuitive to work with because it stores each field of the struct in its own array. Thus, our set of 3D vectors would look like this:

struct SetOfVector3 {
    x: [f32; 1024],
    y: [f32; 1024],
    z: [f32; 1024],
}

There is no explicit Vector3 structure here because the vectors are all packed into the three arrays. Accessing the components of the i-th vector of the set means accessing set.x[i], set.y[i] and set.z[i]. With this structure, our vector sum becomes the following:

fn add_arrays_of_vectors(a: &mut SetOfVector3, b: &SetOfVector3) {
    for i in 0..1024 {
        a.x[i] += b.x[i];
    }

    for i in 0..1024 {
        a.y[i] += b.y[i];
    }

    for i in 0..1024 {
        a.z[i] += b.z[i];
    }
}

There is more code here than in our AoS example because we need to iterate over each component individually. However, this will be much more efficient than our AoS approach because it is extremely vectorization-friendly: the compiler will be able to group consecutive vector entries for each component and use SIMD instructions to perform 2, 4, 8, or even 16 additions at a time instead of a single one.
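
To make this concrete, here is a sketch of what a good optimizer effectively does to the x loop, written by hand with the packed_simd crate. You wouldn't write this yourself in the SoA approach; it only illustrates the vectorized result:

use packed_simd::f32x4;

fn add_x_components(a: &mut SetOfVector3, b: &SetOfVector3) {
    // Four additions per iteration: 256 SIMD additions instead of
    // 1024 scalar ones.
    for i in (0..1024).step_by(4) {
        let va = f32x4::from_slice_unaligned(&a.x[i..i + 4]);
        let vb = f32x4::from_slice_unaligned(&b.x[i..i + 4]);
        (va + vb).write_to_slice_unaligned(&mut a.x[i..i + 4]);
    }
}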

Pros of SoA

  • Extremely fast.

Cons of SoA

  • Algorithms need to be explicitly designed with SoA in mind to be efficient.
  • Manipulating vectors individually is more difficult and can be less efficient than with AoS (see the sketch after this list).
  • It is more difficult to think in terms of SoA than in terms of AoS.
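
To illustrate the second point, here is a sketch of how one would read and write the i-th vector of the set with this layout: its components have to be gathered from, and scattered to, three separate arrays.

fn read_vector(set: &SetOfVector3, i: usize) -> [f32; 3] {
    // The components of the i-th vector live in three different arrays.
    [set.x[i], set.y[i], set.z[i]]
}

fn write_vector(set: &mut SetOfVector3, i: usize, v: [f32; 3]) {
    set.x[i] = v[0];
    set.y[i] = v[1];
    set.z[i] = v[2];
}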

Array-of-Structures-of-Arrays (AoSoA)

Now let's combine both SoA and AoS to obtain: Array-of-Structures-of-Arrays (AoSoA). The idea is to first define a wide 3D vector, i.e., we pack several vectors into a single struct:

struct WideVector3 {
    x: [f32; 4],
    y: [f32; 4],
    z: [f32; 4],
}

The term wide 3D vector is inspired by the terminology used in the ultraviolet crate's documentation. Here, our WideVector3 actually represents 4 vectors laid out in an SoA fashion. If we want our set of vectors to contain more than just 4 vectors, we can define an array of WideVector3 (so we end up with an array of structures of arrays) with only 1024 / 4 elements, because each element already represents 4 vectors:

type SetOfWideVector3 = [WideVector3; 1024 / 4];

Our vector sum then becomes quite similar to the AoS approach, except that the inner loop handles the four lanes of each component in an SoA fashion:

fn add_arrays_of_vectors(a: &mut SetOfWideVector3, b: &SetOfWideVector3) {
    for i in 0..1024 / 4 {
        // NOTE: each of those groups of four sums can be executed as
        // a single SIMD instruction.
        a[i].x[0] += b[i].x[0];
        a[i].x[1] += b[i].x[1];
        a[i].x[2] += b[i].x[2];
        a[i].x[3] += b[i].x[3];

        a[i].y[0] += b[i].y[0];
        a[i].y[1] += b[i].y[1];
        a[i].y[2] += b[i].y[2];
        a[i].y[3] += b[i].y[3];

        a[i].z[0] += b[i].z[0];
        a[i].z[1] += b[i].z[1];
        a[i].z[2] += b[i].z[2];
        a[i].z[3] += b[i].z[3];
    }
}

So with this data layout we still achieve great performance because components are grouped by 4, and thus their sum can easily be auto-vectorized into a single SIMD instruction. To manipulate an individual vector, we first have to access the corresponding wide vector, and then pick its components from the x, y, z arrays.
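
Concretely, the i-th vector of the set lives in lane i % 4 of the wide vector i / 4. Here is a sketch of how one would read it back:

fn read_vector(set: &SetOfWideVector3, i: usize) -> [f32; 3] {
    // First locate the wide vector, then extract the right lane.
    let wide = &set[i / 4];
    let lane = i % 4;
    [wide.x[lane], wide.y[lane], wide.z[lane]]
}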

Pros of AoSoA

  • Great performance compared to AoS.
  • Can even beat the performance of SoA because AoSoA can be more cache-friendly on modern architectures.
  • Designing algorithms around AoSoA is somewhat easier than with plain SoA.
  • Using AoSoA, most linear-algebra operations can be easily vectorized as long as they don't rely too much on branching.

Cons of AoSoA

  • We still need to design our algorithms carefully to use AoSoA efficiently.

Array-of-Structures-of-arrays with explicit SIMD (SIMD AoSoA)

Finally, let's talk about what I call here SIMD AoSoA. This layout is exactly the same as the AoSoA presented before, except that we use explicit SIMD primitives, e.g., the types from the packed_simd crate, instead of plain arrays. For example, we can use 4-lane SIMD floats packed_simd::f32x4 instead of [f32; 4]:

struct WideVector3 {
    x: packed_simd::f32x4,
    y: packed_simd::f32x4,
    z: packed_simd::f32x4,
}

type SetOfWideVector3 = [WideVector3; 1024 / 4];

Then our sum will be exactly the same as in the AoSoA approach, except that we don't have to handle each of the 4 lanes explicitly because packed_simd::f32x4 implements the Add trait:

fn add_arrays_of_vectors(a: &mut SetOfWideVector3, b: &SetOfWideVector3) {
    for i in 0..1024 / 4 {
        // NOTE: each 4-lane sum will be executed as a single SIMD operation.
        a[i].x += b[i].x;
        a[i].y += b[i].y;
        a[i].z += b[i].z;
    }
}

If we wanted to use 16-lane vectors here (which map to AVX-512 instructions when available, and to several narrower SIMD instructions otherwise), we could simply do:

struct WideVector3 {
    x: packed_simd::f32x16,
    y: packed_simd::f32x16,
    z: packed_simd::f32x16,
}

type SetOfWideVector3 = [WideVector3; 1024 / 16];

Another benefit of using SIMD types explicitly is that we no longer rely on the compiler's auto-vectorization: we get SIMD instructions even in debug mode, and in situations where the compiler would otherwise fail to auto-vectorize properly.

Using SIMD AoSoA for linear-algebra in Rust: ultraviolet and nalgebra

As far as I know, the first crate that implemented this concept for (gamedev) linear algebra in Rust is ultraviolet. At the end of this month (March 2020), you will also be able to use this approach with nalgebra, in its upcoming version 0.21.0.

With ultraviolet, you have the choice between two families of types: regular types (Vec2, Mat3, Isometry3, etc.) and wide types (Wec2, Wat3, WIsometry3, etc.). The regular types are designed to be used with the common AoS layout, with vector/matrix components set to f32. The wide types, on the other hand, are designed to be used with the SIMD AoSoA layout: each wide type packs four of the corresponding non-wide type into a single structure (e.g. one Wec3 represents four Vec3), and the wide vector/matrix components are of type f32x4 from the wide crate. As of today, ultraviolet is limited to 32-bit floats and 4-lane 32-bit SIMD floats. You can't use it with SIMD integers, with 64-bit floats, or with 32-bit SIMD floats with a number of lanes other than 4.
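
For illustration, here is roughly what adding four pairs of 3D vectors at once looks like with the wide types. Take this as a sketch only: I am assuming a Wec3::new constructor taking one f32x4 per component and the From<[f32; 4]> conversion of the wide crate; refer to ultraviolet's documentation for the exact API.

use ultraviolet::Wec3;
use wide::f32x4;

fn main() {
    // One Wec3 packs four Vec3 in SoA fashion: the four x components
    // first, then the four y components, then the four z components.
    let a = Wec3::new(
        f32x4::from([1.0, 2.0, 3.0, 4.0]),
        f32x4::from([0.0, 0.0, 0.0, 0.0]),
        f32x4::from([9.0, 8.0, 7.0, 6.0]),
    );
    let b = Wec3::new(
        f32x4::from([0.5, 0.5, 0.5, 0.5]),
        f32x4::from([1.0, 1.0, 1.0, 1.0]),
        f32x4::from([0.0, 0.0, 0.0, 0.0]),
    );
    let sum = a + b; // Four Vec3 additions in a single wide addition.
    println!("{:?}", sum);
}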

With nalgebra, all vector/matrix/transformation types are generic with respect to their component type. Therefore, for an AoS layout, you can use, e.g., the Vector2<f32> type. And if you want to leverage SIMD AoSoA, you can use Vector2<f32x4> instead, and that will give you four 2D vectors for the price of one. Note that the f32x4 type here comes from the new simba crate and is a newtype for the f32x4 from packed_simd (the upcoming standard safe SIMD implementation in Rust). A newtype was required because of the orphan rules and the need to implement some traits from the num crate for the SIMD types. Simba is not limited to f32x4 though: it defines a newtype for every single numeric type of packed_simd. Therefore nalgebra will also support all the integer and float SIMD types, with 2, 4, 8, or 16 lanes. You can, for example, write and use Isometry2<simba::f32x16> as an SoA set of 16 Isometry2<f32>. Finally, nalgebra supports SIMD on all the platforms supported by packed_simd itself.
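
Here is the same four-at-a-time addition with the upcoming nalgebra 0.21.0. This is a sketch as well: I am assuming the simba newtype is exposed as simba::simd::f32x4 and can be built by wrapping a packed_simd value directly, as the description above suggests; the exact construction API may differ.

use nalgebra::Vector2;
use simba::simd::f32x4;

fn main() {
    // One Vector2<f32x4> represents four 2D vectors: its x component
    // holds the four x values, its y component the four y values.
    let a = Vector2::new(
        f32x4(packed_simd::f32x4::new(1.0, 2.0, 3.0, 4.0)),
        f32x4(packed_simd::f32x4::new(10.0, 20.0, 30.0, 40.0)),
    );
    let b = Vector2::new(
        f32x4(packed_simd::f32x4::splat(0.5)),
        f32x4(packed_simd::f32x4::splat(1.5)),
    );
    let sum = a + b; // Four 2D vector additions for the price of one.
    println!("{:?}", sum);
}

Because the simba types implement the arithmetic traits nalgebra is generic over, most operations defined for Vector2<f32> are available for Vector2<f32x4> as well.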

Here are some (subjective) pros and cons for ultraviolet and nalgebra:

  • Pros of ultraviolet: no generics, so it is simple to use and efficient even without compiler optimizations enabled. Works on stable Rust.
  • Pros of nalgebra: generics allow the use of all SIMD types. More feature-complete and tested than ultraviolet. Based on packed_simd, which has great platform support (though we could make it work with the wide crate too).
  • Cons of ultraviolet: cannot use 64-bit floats nor SIMD types other than 4-lane 32-bit floats. It has a more limited feature set (which may still be enough for gamedev).
  • Cons of nalgebra: generics make it harder to use and make the docs harder to browse. Also, nalgebra is much less efficient without compiler optimizations enabled. And because its SIMD AoSoA support is based on packed_simd, the use of SIMD types only works with the nightly compiler.

Benchmarks of Rust linear-algebra crates

Now is the time to see whether SIMD AoSoA is useful at all. The benchmarks I run here are a modified version of the mathbench-rs benchmark suite you may have already heard of. For example, it was used when glam and ultraviolet were published.

If you want to run the benchmarks on your machine, you can clone my fork and either run cargo bench or RUSTFLAGS='-C target-cpu=native' cargo bench.

The modifications I made to the benchmark were the following:

  • I added codegen-units = 1 to the [profile.bench] section. This lets us get as much optimization as we can from the compiler (this is typically what you want before releasing a binary). It turns out this can affect performance significantly for nalgebra, which benefits greatly from compiler optimizations.
  • I added support for ultraviolet regular types (identified by the column ultraviolet in the benchmarks) as well as its wide types (identified by the column ultraviolet_f32x4).
  • Because ultraviolet uses the concept of rotors instead of quaternions for its rotations, I used its Rotor/WRotor types for all the quaternion benchmark rows.
  • I added support for using nalgebra with f32 SIMD with 4, 8, and 16 lanes identified by nalgebra_f32x4, nalgebra_f32x8 and nalgebra_f32x16.
  • I added benchmarks for 3D isometries of both ultraviolet and nalgebra.
  • I modified the benchmarks of unary and binary operations so they measure the time to process 16 elements. For example, when benchmarking 2D matrix transposition, we are actually measuring the time to transpose 16 matrices (instead of just one). This allows us to measure the gain of SIMD AoSoA without penalizing either non-AoSoA or AoSoA crates.

This last modification follows the same principle as the one introduced for the benchmarks presented by ultraviolet in their README.

Benchmark with -C target-cpu=native

In this benchmark I am compiling with the RUSTFLAGS='-C target-cpu=native' option so that the compiler emits AVX instructions for f32x8 and f32x16. It appears that some crates (namely glam and vek) are not as efficient as they could be when this option is enabled, so you will find a second benchmark without this option in the next section.

Benchmark options:

  • command line: RUSTFLAGS='-C target-cpu=native' cargo bench
  • profiling option in Cargo.toml: codegen-units = 1
  • CPU: AMD Ryzen 9 3900X 12-Core Processor

| benchmark | nalgebra_f32x16 | nalgebra_f32x8 | nalgebra_f32x4 | ultraviolet_f32x4 | nalgebra | ultraviolet | glam | vek | cgmath | euclid |
|---|---|---|---|---|---|---|---|---|---|---|
| euler 2d x10000 | 2.076 us | 2.224 us | 3.05 us | 3.05 us | 9.674 us | 12.0 us | 12.42 us | 11.85 us | 11.92 us | 9.585 us |
| euler 3d x10000 | 3.014 us | 2.809 us | 4.791 us | 4.639 us | 18.18 us | 17.58 us | 6.27 us | 18.12 us | 18.05 us | 17.91 us |
| isometry transform point2 x1 | 5.7179 ns | 5.6563 ns | 7.8197 ns | 8.1874 ns | 22.8197 ns | 24.4878 ns | N/A | N/A | N/A | N/A |
| isometry transform point2 x100 | 2.582 us | 2.587 us | 2.739 us | 2.787 us | 3.007 us | 3.169 us | N/A | N/A | N/A | N/A |
| isometry transform point3 x1 | 10.5417 ns | 10.1237 ns | 15.4410 ns | 19.5679 ns | 60.0877 ns | 64.7755 ns | N/A | N/A | N/A | N/A |
| isometry transform point3 x100 | 4.704 us | 4.377 us | 4.64 us | 4.941 us | 6.567 us | 6.68 us | N/A | N/A | N/A | N/A |
| isometry2 inverse | 6.0317 ns | 6.4494 ns | 6.9687 ns | 7.4126 ns | 24.2412 ns | 24.8876 ns | N/A | N/A | N/A | N/A |
| isometry2 mul isometry2 | 8.1413 ns | 9.2351 ns | 10.1867 ns | 10.2618 ns | 34.5250 ns | 33.1397 ns | N/A | N/A | N/A | N/A |
| isometry3 inverse | 12.3065 ns | 11.2699 ns | 16.2452 ns | 21.6418 ns | 62.2947 ns | 76.3882 ns | N/A | N/A | N/A | N/A |
| isometry3 mul isometry3 | 29.0822 ns | 16.5287 ns | 26.0439 ns | 31.5684 ns | 97.8058 ns | 109.7477 ns | N/A | N/A | N/A | N/A |
| matrix2 determinant | N/A | N/A | N/A | N/A | 15.4800 ns | N/A | 9.9942 ns | 15.5422 ns | 15.6876 ns | N/A |
| matrix2 inverse | N/A | N/A | N/A | N/A | 25.0731 ns | N/A | 15.7486 ns | N/A | 25.8362 ns | N/A |
| matrix2 mul matrix2 | 10.6500 ns | 9.0379 ns | 10.3309 ns | 10.4283 ns | 24.7601 ns | 24.9938 ns | 16.1426 ns | 291.7833 ns | 25.0133 ns | N/A |
| matrix2 mul vector2 x1 | 5.7680 ns | 5.2159 ns | 6.7758 ns | 6.9242 ns | 22.9934 ns | 24.4428 ns | 15.5422 ns | 85.4171 ns | 23.7398 ns | N/A |
| matrix2 mul vector2 x100 | 2.641 us | 2.663 us | 2.647 us | 2.617 us | 3.14 us | 3.182 us | 2.953 us | 8.998 us | 3.232 us | N/A |
| matrix2 transpose | 5.8718 ns | 6.8760 ns | 6.6906 ns | N/A | 10.6439 ns | N/A | 10.9130 ns | 10.6631 ns | 10.5080 ns | N/A |
| matrix3 determinant | N/A | N/A | N/A | N/A | 31.9111 ns | N/A | 19.7343 ns | 31.3051 ns | 32.2809 ns | N/A |
| matrix3 inverse | N/A | N/A | N/A | 25.6965 ns | 83.4324 ns | 82.1388 ns | 120.2562 ns | N/A | 79.4379 ns | N/A |
| matrix3 mul matrix3 | 70.6877 ns | 19.6932 ns | 27.7722 ns | 26.8799 ns | 83.2946 ns | 83.9021 ns | 44.3916 ns | 944.3034 ns | 80.8441 ns | N/A |
| matrix3 mul vector3 x1 | 10.6031 ns | 10.7913 ns | 13.6117 ns | 13.7169 ns | 46.1231 ns | 47.5651 ns | 20.6856 ns | 315.4715 ns | 45.3403 ns | N/A |
| matrix3 mul vector3 x100 | 4.7 us | 4.807 us | 4.845 us | 4.738 us | 6.163 us | 6.142 us | 6.701 us | 32.47 us | 6.091 us | N/A |
| matrix3 transpose | 14.3912 ns | 12.8725 ns | 15.0385 ns | 15.9171 ns | 19.9624 ns | 19.7458 ns | 22.7494 ns | 19.5610 ns | 20.0838 ns | N/A |
| matrix4 determinant | N/A | N/A | N/A | N/A | 1.489 us | N/A | 0.083 us | 0.1515 us | 0.1021 us | 0.1565 us |
| matrix4 inverse | N/A | N/A | N/A | 0.09051 us | 0.5339 us | 0.3744 us | 0.2844 us | 4.203 us | 0.4745 us | 0.5851 us |
| matrix4 mul matrix4 | 0.2354 us | 0.06835 us | 0.07657 us | 0.07581 us | 0.1247 us | 0.135 us | 0.07736 us | 1.916 us | 0.1146 us | 0.1089 us |
| matrix4 mul vector4 x1 | 39.0102 ns | 15.4389 ns | 20.7324 ns | 22.1028 ns | 30.6785 ns | 30.8711 ns | 25.6006 ns | 487.1873 ns | 30.3248 ns | N/A |
| matrix4 mul vector4 x100 | 8.339 us | 7.419 us | 7.437 us | 7.428 us | 8.14 us | 8.209 us | 8.078 us | 47.98 us | 8.332 us | N/A |
| matrix4 transpose | 95.1337 ns | 23.2408 ns | 25.8665 ns | 25.7914 ns | 97.9916 ns | 101.3651 ns | 30.0239 ns | 103.5180 ns | 102.0192 ns | N/A |
| quaternion conjugate | 6.0223 ns | 6.3760 ns | 7.5628 ns | 6.9775 ns | 23.2724 ns | 23.0902 ns | 10.3465 ns | 23.2522 ns | 23.2696 ns | 23.0664 ns |
| quaternion mul quaternion | 9.3504 ns | 9.4567 ns | 12.3669 ns | 12.6805 ns | 30.0095 ns | 38.4430 ns | 25.8066 ns | 42.2113 ns | 42.6575 ns | 40.9668 ns |
| quaternion mul vector3 | 8.5309 ns | 7.7526 ns | 13.3755 ns | 16.7836 ns | 49.3278 ns | 56.7010 ns | 41.2790 ns | 71.9719 ns | 48.9911 ns | 48.2763 ns |
| transform point2 x1 | N/A | N/A | N/A | N/A | 35.6464 ns | N/A | 23.5642 ns | 325.1071 ns | 29.7918 ns | 28.4816 ns |
| transform point2 x100 | N/A | N/A | N/A | N/A | 5.386 us | N/A | 5.612 us | 33.66 us | 5.255 us | 4.066 us |
| transform point3 x1 | N/A | N/A | N/A | N/A | 68.6997 ns | N/A | 23.9545 ns | 504.0400 ns | 67.0988 ns | 68.1284 ns |
| transform point3 x100 | N/A | N/A | N/A | N/A | 9.393 us | N/A | 7.898 us | 49.31 us | 9.282 us | 10.0 us |
| transform vector2 x1 | N/A | N/A | N/A | N/A | 32.5603 ns | N/A | 24.0791 ns | 327.3927 ns | N/A | 24.8445 ns |
| transform vector2 x100 | N/A | N/A | N/A | N/A | 5.385 us | N/A | 5.649 us | 33.51 us | N/A | 3.891 us |
| transform vector3 x1 | N/A | N/A | N/A | N/A | 52.9997 ns | N/A | 25.2423 ns | 487.6624 ns | 55.1891 ns | 46.3865 ns |
| transform vector3 x100 | N/A | N/A | N/A | N/A | 9.083 us | N/A | 7.963 us | 49.47 us | 8.994 us | 8.771 us |
| transform2 inverse | N/A | N/A | N/A | N/A | 83.1814 ns | N/A | N/A | N/A | N/A | 38.3216 ns |
| transform2 mul transform2 | N/A | N/A | N/A | N/A | 79.8949 ns | N/A | N/A | N/A | N/A | 53.0397 ns |
| transform3 inverse | N/A | N/A | N/A | N/A | 509.3638 ns | N/A | N/A | N/A | N/A | 573.8688 ns |
| transform3 mul transform3d | N/A | N/A | N/A | N/A | 116.3846 ns | N/A | N/A | N/A | N/A | 112.3300 ns |
| vector3 cross | 4.4663 ns | 4.5852 ns | 6.3479 ns | 6.1689 ns | 26.0655 ns | 24.9554 ns | 15.6013 ns | 27.9000 ns | 27.6733 ns | 28.1612 ns |
| vector3 dot | 4.4944 ns | 4.5232 ns | 6.4285 ns | 6.2531 ns | 25.3941 ns | 24.9869 ns | 13.6267 ns | 24.5923 ns | 25.9651 ns | 25.8382 ns |
| vector3 length | 3.5110 ns | 3.5936 ns | 5.4675 ns | 5.4127 ns | 20.7343 ns | 21.0470 ns | 20.4590 ns | 21.0161 ns | 20.7430 ns | 20.5710 ns |
| vector3 normalize | 7.8013 ns | 8.3383 ns | 15.4976 ns | 15.3918 ns | 61.5877 ns | 61.3314 ns | 35.6648 ns | 59.7074 ns | 35.7934 ns | 61.2340 ns |

To get a better comparative view of the performance of AoSoA, here are the benchmarks restricted to ultraviolet's wide types (Wec2, Wat2, WIsometry2, etc.) and nalgebra's types parameterized by SIMD types (Vector2<f32x4>, Matrix2<f32x8>, Isometry2<f32x16>, etc.):

| benchmark | ultraviolet_f32x4 | nalgebra_f32x4 | nalgebra_f32x8 | nalgebra_f32x16 |
|---|---|---|---|---|
| euler 2d x10000 | 3.05 us | 3.05 us | 2.224 us | 2.076 us |
| euler 3d x10000 | 4.639 us | 4.791 us | 2.809 us | 3.014 us |
| isometry transform point2 x1 | 8.1874 ns | 7.8197 ns | 5.6563 ns | 5.7179 ns |
| isometry transform point2 x100 | 2.787 us | 2.739 us | 2.587 us | 2.582 us |
| isometry transform point3 x1 | 19.5679 ns | 15.4410 ns | 10.1237 ns | 10.5417 ns |
| isometry transform point3 x100 | 4.941 us | 4.64 us | 4.377 us | 4.704 us |
| isometry2 inverse | 7.4126 ns | 6.9687 ns | 6.4494 ns | 6.0317 ns |
| isometry2 mul isometry2 | 10.2618 ns | 10.1867 ns | 9.2351 ns | 8.1413 ns |
| isometry3 inverse | 21.6418 ns | 16.2452 ns | 11.2699 ns | 12.3065 ns |
| isometry3 mul isometry3 | 31.5684 ns | 26.0439 ns | 16.5287 ns | 29.0822 ns |
| matrix2 mul matrix2 | 10.4283 ns | 10.3309 ns | 9.0379 ns | 10.6500 ns |
| matrix2 mul vector2 x1 | 6.9242 ns | 6.7758 ns | 5.2159 ns | 5.7680 ns |
| matrix2 mul vector2 x100 | 2.617 us | 2.647 us | 2.663 us | 2.641 us |
| matrix2 transpose | N/A | 6.6906 ns | 6.8760 ns | 5.8718 ns |
| matrix3 inverse | 25.6965 ns | N/A | N/A | N/A |
| matrix3 mul matrix3 | 26.8799 ns | 27.7722 ns | 19.6932 ns | 70.6877 ns |
| matrix3 mul vector3 x1 | 13.7169 ns | 13.6117 ns | 10.7913 ns | 10.6031 ns |
| matrix3 mul vector3 x100 | 4.738 us | 4.845 us | 4.807 us | 4.7 us |
| matrix3 transpose | 15.9171 ns | 15.0385 ns | 12.8725 ns | 14.3912 ns |
| matrix4 inverse | 90.5105 ns | N/A | N/A | N/A |
| matrix4 mul matrix4 | 75.8094 ns | 76.5675 ns | 68.3461 ns | 235.3853 ns |
| matrix4 mul vector4 x1 | 22.1028 ns | 20.7324 ns | 15.4389 ns | 39.0102 ns |
| matrix4 mul vector4 x100 | 7.428 us | 7.437 us | 7.419 us | 8.339 us |
| matrix4 transpose | 25.7914 ns | 25.8665 ns | 23.2408 ns | 95.1337 ns |
| quaternion conjugate | 6.9775 ns | 7.5628 ns | 6.3760 ns | 6.0223 ns |
| quaternion mul quaternion | 12.6805 ns | 12.3669 ns | 9.4567 ns | 9.3504 ns |
| quaternion mul vector3 | 16.7836 ns | 13.3755 ns | 7.7526 ns | 8.5309 ns |
| vector3 cross | 6.1689 ns | 6.3479 ns | 4.5852 ns | 4.4663 ns |
| vector3 dot | 6.2531 ns | 6.4285 ns | 4.5232 ns | 4.4944 ns |
| vector3 length | 5.4127 ns | 5.4675 ns | 3.5936 ns | 3.5110 ns |
| vector3 normalize | 15.3918 ns | 15.4976 ns | 8.3383 ns | 7.8013 ns |

As we can see, ultraviolet_f32x4 and nalgebra_f32x4 yield roughly the same computation times. The most efficient option for my processor is nalgebra with f32x8: it often reaches the best performance, while f32x16 brings little additional gain, and even significant regressions on some tests.

Benchmark without -C target-cpu=native

It appears that some Rust linear algebra crates do not perform as well as they could when -C target-cpu=native is passed to the compiler. This is, for example, the case for glam and vek. Conversely, not using -C target-cpu=native makes some methods of the non-wide ultraviolet types much less efficient.

Because we don't use -C target-cpu=native here, f32x8 and f32x16 won't emit 8- and 16-lane AVX instructions, making them only about as efficient as f32x4.

Here is the same benchmark with:

  • command line: cargo bench
  • options in Cargo.toml: codegen-units = 1.
  • CPU: AMD Ryzen 9 3900X 12-Core Processor

| benchmark | nalgebra_f32x16 | nalgebra_f32x8 | nalgebra_f32x4 | ultraviolet_f32x4 | nalgebra | ultraviolet | glam | vek | cgmath | euclid |
|---|---|---|---|---|---|---|---|---|---|---|
| euler 2d x10000 | 3.057 us | 3.069 us | 2.992 us | 3.014 us | 9.028 us | 5.28 us | 5.166 us | 5.258 us | 5.259 us | 8.631 us |
| euler 3d x10000 | 4.585 us | 4.587 us | 4.546 us | 4.587 us | 17.84 us | 18.34 us | 6.311 us | 17.57 us | 18.04 us | 17.97 us |
| isometry transform point2 x1 | 7.7226 ns | 6.7828 ns | 8.1024 ns | 8.6412 ns | 23.4637 ns | 54.5840 ns | N/A | N/A | N/A | N/A |
| isometry transform point2 x100 | 2.625 us | 2.701 us | 2.801 us | 2.837 us | 3.109 us | 4.918 us | N/A | N/A | N/A | N/A |
| isometry transform point3 x1 | 21.0932 ns | 16.1782 ns | 16.1052 ns | 20.6723 ns | 61.8824 ns | 330.1143 ns | N/A | N/A | N/A | N/A |
| isometry transform point3 x100 | 6.466 us | 4.598 us | 4.515 us | 4.794 us | 6.546 us | 18.19 us | N/A | N/A | N/A | N/A |
| isometry2 inverse | 7.4879 ns | 7.0432 ns | 7.3004 ns | 7.7692 ns | 24.8090 ns | 59.3838 ns | N/A | N/A | N/A | N/A |
| isometry2 mul isometry2 | 13.3355 ns | 10.5486 ns | 12.3278 ns | 12.7816 ns | 34.5714 ns | 110.7837 ns | N/A | N/A | N/A | N/A |
| isometry3 inverse | 29.6336 ns | 19.3869 ns | 17.1929 ns | 21.4199 ns | 63.9200 ns | 212.2193 ns | N/A | N/A | N/A | N/A |
| isometry3 mul isometry3 | 51.0123 ns | 39.2781 ns | 28.5785 ns | 37.5971 ns | 98.0279 ns | 459.4930 ns | N/A | N/A | N/A | N/A |
| matrix2 determinant | N/A | N/A | N/A | N/A | 16.0198 ns | N/A | 11.2070 ns | 15.8621 ns | 16.0906 ns | N/A |
| matrix2 inverse | N/A | N/A | N/A | N/A | 26.5165 ns | N/A | 18.9069 ns | N/A | 26.2478 ns | N/A |
| matrix2 mul matrix2 | 60.2594 ns | 10.6371 ns | 12.5352 ns | 10.7532 ns | 25.4533 ns | 25.7148 ns | 16.3060 ns | 301.3775 ns | 25.6988 ns | N/A |
| matrix2 mul vector2 x1 | 29.6210 ns | 8.0023 ns | 7.1695 ns | 8.0907 ns | 23.0932 ns | 24.4506 ns | 16.4141 ns | 93.3601 ns | 25.0774 ns | N/A |
| matrix2 mul vector2 x100 | 3.5 us | 2.591 us | 2.684 us | 2.698 us | 3.137 us | 3.174 us | 2.828 us | 9.558 us | 3.122 us | N/A |
| matrix2 transpose | 6.8152 ns | 7.4080 ns | 6.4984 ns | N/A | 11.0205 ns | N/A | 10.9890 ns | 10.8180 ns | 10.3812 ns | N/A |
| matrix3 determinant | N/A | N/A | N/A | N/A | 33.1252 ns | N/A | 22.2592 ns | 31.8128 ns | 31.7086 ns | N/A |
| matrix3 inverse | N/A | N/A | N/A | 26.7493 ns | 92.8270 ns | 284.1032 ns | 93.5328 ns | N/A | 82.3820 ns | N/A |
| matrix3 mul matrix3 | 0.1494 us | 0.03354 us | 0.02883 us | 0.02638 us | 0.09077 us | 0.08728 us | 0.0474 us | 1.005 us | 0.08579 us | N/A |
| matrix3 mul vector3 x1 | 52.3315 ns | 14.0998 ns | 13.7548 ns | 13.6309 ns | 46.4205 ns | 46.5016 ns | 21.5466 ns | 317.8168 ns | 47.3542 ns | N/A |
| matrix3 mul vector3 x100 | 10.68 us | 5.033 us | 4.835 us | 4.917 us | 6.138 us | 6.261 us | 6.633 us | 32.93 us | 6.297 us | N/A |
| matrix3 transpose | 20.9653 ns | 15.7305 ns | 15.7501 ns | 15.4925 ns | 52.7300 ns | 53.0845 ns | 24.1975 ns | 55.4793 ns | 52.0874 ns | N/A |
| matrix4 determinant | N/A | N/A | N/A | N/A | 690.7027 ns | N/A | 85.5778 ns | 176.1770 ns | 110.6990 ns | 174.6754 ns |
| matrix4 inverse | N/A | N/A | N/A | 0.08342 us | 0.5282 us | 0.3953 us | 0.257 us | 3.792 us | 0.5096 us | 0.651 us |
| matrix4 mul matrix4 | 0.4467 us | 0.08671 us | 0.06897 us | 0.07068 us | 0.1285 us | 0.1493 us | 0.07653 us | 2.024 us | 0.1185 us | 0.1152 us |
| matrix4 mul vector4 x1 | 84.1744 ns | 20.9291 ns | 21.1243 ns | 20.6690 ns | 31.5352 ns | 30.3570 ns | 24.7876 ns | 512.0570 ns | 30.9776 ns | N/A |
| matrix4 mul vector4 x100 | 10.32 us | 7.592 us | 7.595 us | 7.662 us | 8.472 us | 8.379 us | 7.823 us | 51.42 us | 8.366 us | N/A |
| matrix4 transpose | 103.5080 ns | 32.5330 ns | 26.2948 ns | 27.1962 ns | 103.0829 ns | 101.3104 ns | 29.2236 ns | 105.6003 ns | 98.1151 ns | N/A |
| quaternion conjugate | 6.7367 ns | 7.6066 ns | 6.6130 ns | 6.6179 ns | 24.5468 ns | 25.5657 ns | 11.1447 ns | 24.2229 ns | 25.7246 ns | 24.9413 ns |
| quaternion mul quaternion | 20.0365 ns | 16.6831 ns | 14.6380 ns | 16.6299 ns | 39.3908 ns | 274.4676 ns | 27.3533 ns | 62.0697 ns | 35.0299 ns | 59.5286 ns |
| quaternion mul vector3 | 19.3629 ns | 14.7722 ns | 14.7579 ns | 18.5064 ns | 52.7303 ns | 170.8328 ns | 44.1336 ns | 81.1243 ns | 52.5011 ns | 54.1958 ns |
| transform point2 x1 | N/A | N/A | N/A | N/A | 36.6350 ns | N/A | 25.6516 ns | 337.3442 ns | 30.5781 ns | 28.5532 ns |
| transform point2 x100 | N/A | N/A | N/A | N/A | 5.475 us | N/A | 5.732 us | 33.81 us | 5.118 us | 4.003 us |
| transform point3 x1 | N/A | N/A | N/A | N/A | 66.3694 ns | N/A | 24.0601 ns | 517.9224 ns | 67.8205 ns | 68.0836 ns |
| transform point3 x100 | N/A | N/A | N/A | N/A | 9.403 us | N/A | 7.824 us | 52.5 us | 9.559 us | 10.07 us |
| transform vector2 x1 | N/A | N/A | N/A | N/A | 32.5667 ns | N/A | 21.3971 ns | 330.5713 ns | N/A | 23.5693 ns |
| transform vector2 x100 | N/A | N/A | N/A | N/A | 5.392 us | N/A | 5.579 us | 33.45 us | N/A | 3.909 us |
| transform vector3 x1 | N/A | N/A | N/A | N/A | 55.3600 ns | N/A | 24.3628 ns | 515.1730 ns | 59.1129 ns | 47.9383 ns |
| transform vector3 x100 | N/A | N/A | N/A | N/A | 9.185 us | N/A | 7.958 us | 52.39 us | 8.992 us | 8.665 us |
| transform2 inverse | N/A | N/A | N/A | N/A | 87.2060 ns | N/A | N/A | N/A | N/A | 45.0272 ns |
| transform2 mul transform2 | N/A | N/A | N/A | N/A | 87.5702 ns | N/A | N/A | N/A | N/A | 40.1995 ns |
| transform3 inverse | N/A | N/A | N/A | N/A | 542.8401 ns | N/A | N/A | N/A | N/A | 748.2648 ns |
| transform3 mul transform3d | N/A | N/A | N/A | N/A | 118.0659 ns | N/A | N/A | N/A | N/A | 115.3294 ns |
| vector3 cross | 5.9052 ns | 5.7307 ns | 6.4232 ns | 6.2395 ns | 25.0006 ns | 36.5961 ns | 15.0303 ns | 28.6082 ns | 28.6343 ns | 29.5467 ns |
| vector3 dot | 5.7524 ns | 5.8113 ns | 6.6820 ns | 6.3391 ns | 25.5593 ns | 26.1824 ns | 13.6831 ns | 25.4740 ns | 25.9099 ns | 26.1271 ns |
| vector3 length | 5.3441 ns | 5.4417 ns | 5.3741 ns | 5.2332 ns | 20.5369 ns | 34.2988 ns | 20.5652 ns | 20.6259 ns | 20.9281 ns | 20.6052 ns |
| vector3 normalize | 15.6224 ns | 15.3854 ns | 15.5892 ns | 15.6585 ns | 59.1804 ns | 60.9510 ns | 35.7763 ns | 61.3666 ns | 36.7304 ns | 61.3199 ns |

Conclusion

The use of SIMD AoSoA is extremely promising for efficiently performing lots of linear algebra on large arrays of entities of the same type. This will be possible with the next version of nalgebra, 0.21.0. It will perform as efficiently as the current implementation in ultraviolet when using f32x4, and it will also be compatible with other SIMD types (e.g. f32x16, f64x2, i32x4, u8x4, etc.). Those types are newtypes wrapping the types from packed_simd, and they come from a new crate named simba, which I will present in more depth in the next edition of the This month in Rustsim blog posts.

Finally, here are two things to keep in mind:

  • Compilation flags matter a lot when measuring performance. -C target-cpu or codegen-units=1 can have significant impacts on your release performance.
  • SIMD AoSoA comes with a price: algorithms must be carefully designed to obtain as much gain as possible compared to an AoS approach.

Thank you all for reading, and for your support on Patreon or GitHub Sponsors!

Don't hesitate to share your thoughts on this topic, or to correct me if something seems off to you.