
Implemented suggestions on `core:simd` helpers.

Adjusted documentation, and renamed the reduce_*_split procs to
reduce_*_bisect.
Barinzaya committed 4 months ago
Parent
Current commit
b0f53a6eaf
1 changed file with 24 additions and 20 deletions

+ 24 - 20
core/simd/simd.odin

@@ -2512,7 +2512,7 @@ recip :: #force_inline proc "contextless" (v: $T/#simd[$LANES]$E) -> T where int
 }

 /*
-Creates a vector where each lane contains the index of that lane.
+Create a vector where each lane contains the index of that lane.

 Inputs:
 - `V`: The type of the vector to create.
@@ -2558,10 +2558,10 @@ indices :: #force_inline proc "contextless" ($V: typeid/#simd[$N]$E) -> V where
 Reduce a vector to a scalar by adding up all the lanes in a pairwise fashion.

 This procedure returns a scalar that is the sum of all lanes, calculated by
-adding each even-numbered element with the following odd-numbered element. This
-is repeated until only a single element remains. This order is supported by
-hardware instructions for some types/architectures (e.g. i16/i32/f32/f64 on x86
-SSE, i8/i16/i32/f32 on ARM NEON).
+adding each even-indexed element with the following odd-indexed element to
+produce N/2 values. This is repeated until only a single element remains. This
+order is supported by hardware instructions for some types/architectures (e.g.
+i16/i32/f32/f64 on x86 SSE, i8/i16/i32/f32 on ARM NEON).

 The order of the sum may be important for accounting for precision errors in
 floating-point computation, as floating-point addition is not associative, that
@@ -2657,13 +2657,14 @@ reduce_add_pairs :: #force_inline proc "contextless" (v: #simd[$N]$E) -> E
 }

 /*
-Reduce a vector to a scalar by adding up all the lanes in a binary fashion.
+Reduce a vector to a scalar by adding up all the lanes in a bisecting fashion.

 This procedure returns a scalar that is the sum of all lanes, calculated by
-splitting the vector in two parts and adding the two halves together
-element-wise. This is repeated until only a single element remains. This order
-will typically be faster to compute than the ordered sum for floats, as it can
-be better parallelized.
+bisecting the vector into two parts, where the first contains lanes [0, N/2)
+and the second contains lanes [N/2, N), and adding the two halves element-wise
+to produce N/2 values. This is repeated until only a single element remains.
+This order may be faster to compute than the ordered sum for floats, as it can
+often be better parallelized.

 The order of the sum may be important for accounting for precision errors in
 floating-point computation, as floating-point addition is not associative, that
@@ -2701,7 +2702,7 @@ Graphical representation of the operation for N=4:
 	result: | y0  |
 	        +-----+
 */
-reduce_add_split :: #force_inline proc "contextless" (v: #simd[$N]$E) -> E
+reduce_add_bisect :: #force_inline proc "contextless" (v: #simd[$N]$E) -> E
 	where intrinsics.type_is_numeric(E) {
 	when N == 64 { v64 := v }
 	when N == 32 { v32 := v }
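The two summation orders documented above can be illustrated with a scalar sketch. This is Python for brevity, not the Odin implementation; the function names mirror the procs in this diff but are otherwise hypothetical:

```python
def reduce_add_pairs(v):
    # Pairwise order: add each even-indexed lane to the following
    # odd-indexed lane, producing N/2 values; repeat until one remains.
    while len(v) > 1:
        v = [v[i] + v[i + 1] for i in range(0, len(v), 2)]
    return v[0]

def reduce_add_bisect(v):
    # Bisecting order: add lanes [0, N/2) to lanes [N/2, N)
    # element-wise, producing N/2 values; repeat until one remains.
    while len(v) > 1:
        h = len(v) // 2
        v = [v[i] + v[h + i] for i in range(h)]
    return v[0]

# Both orders agree in exact arithmetic:
print(reduce_add_pairs([1, 2, 3, 4]))   # (1+2) + (3+4)
print(reduce_add_bisect([1, 2, 3, 4]))  # (1+3) + (2+4)

# But they can differ for floats, since float addition is not associative:
xs = [1e16, 1.0, -1e16, 1.0]
print(reduce_add_pairs(xs))   # groups as (1e16 + 1.0) + (-1e16 + 1.0)
print(reduce_add_bisect(xs))  # groups as (1e16 + -1e16) + (1.0 + 1.0)
```

The float example shows why the docs stress that the order of the sum matters: the pairwise grouping loses both `1.0` terms to rounding, while the bisecting grouping cancels the large terms exactly.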
@@ -2763,10 +2764,12 @@ reduce_add_split :: #force_inline proc "contextless" (v: #simd[$N]$E) -> E
 Reduce a vector to a scalar by multiplying all the lanes in a pairwise fashion.

 This procedure returns a scalar that is the product of all lanes, calculated by
-multiplying each even-numbered element with the following odd-numbered element.
-This is repeated until only a single element remains. This order may be faster
-to compute than the ordered product for floats, as it can be better
-parallelized.
+multiplying each even-indexed element with the following odd-indexed element to
+produce N/2 values. This is repeated until only a single element remains. This
+order may be faster to compute than the ordered product for floats, as it can
+often be better parallelized.

 The order of the product may be important for accounting for precision errors
 in floating-point computation, as floating-point multiplication is not
@@ -2862,13 +2865,14 @@ reduce_mul_pairs :: #force_inline proc "contextless" (v: #simd[$N]$E) -> E
 }

 /*
-Reduce a vector to a scalar by multiplying up all the lanes in a binary fashion.
+Reduce a vector to a scalar by multiplying up all the lanes in a bisecting fashion.

 This procedure returns a scalar that is the product of all lanes, calculated by
-splitting the vector in two parts and multiplying the two halves together
-element-wise until only a single element remains. This is repeated until only a
+bisecting the vector into two parts, where the first contains indices [0, N/2)
+and the second contains indices [N/2, N), and multiplying the two halves
+together element-wise to produce N/2 values. This is repeated until only a
 single element remains. This order may be faster to compute than the ordered
-product for floats, as it can be better parallelized.
+product for floats, as it can often be better parallelized.

 The order of the product may be important for accounting for precision errors
 in floating-point computation, as floating-point multiplication is not
@@ -2906,7 +2910,7 @@ Graphical representation of the operation for N=4:
 	result: | y0  |
 	        +-----+
 */
-reduce_mul_split :: #force_inline proc "contextless" (v: #simd[$N]$E) -> E
+reduce_mul_bisect :: #force_inline proc "contextless" (v: #simd[$N]$E) -> E
 	where intrinsics.type_is_numeric(E) {
 	when N == 64 { v64 := v }
 	when N == 32 { v32 := v }
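The product reductions follow the same two orders. A scalar sketch of both, again in Python with hypothetical names mirroring the renamed procs:

```python
def reduce_mul_pairs(v):
    # Pairwise order: multiply each even-indexed lane with the
    # following odd-indexed lane, producing N/2 values per step.
    while len(v) > 1:
        v = [v[i] * v[i + 1] for i in range(0, len(v), 2)]
    return v[0]

def reduce_mul_bisect(v):
    # Bisecting order: multiply lanes [0, N/2) with lanes [N/2, N)
    # element-wise, producing N/2 values per step.
    while len(v) > 1:
        h = len(v) // 2
        v = [v[i] * v[h + i] for i in range(h)]
    return v[0]

print(reduce_mul_pairs([1, 2, 3, 4]))   # (1*2) * (3*4) = 24
print(reduce_mul_bisect([1, 2, 3, 4]))  # (1*3) * (2*4) = 24
```

As with the sums, the two groupings produce identical results for integers but may round differently for floats.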