
Made member functions in simd.h into global functions to work with templates.

David Piuva 11 months ago
parent
commit
85666a9cd1

+ 14 - 10
Source/DFPSR/History.txt

@@ -30,18 +30,19 @@ Changes from version 0.1.0 to version 0.2.0 (Bug fixes)
 	* If you used a custom theme before the system was finished, you will now have to add the assignment "filter = 1" for components where rounded edges became black from adding the filter setting.
 		Because one can not let default values depend on which component is used when theme classes are shared freely between components.
 
 
-Changes from version 0.2.0 to version 0.3.0 (Performance and safety improvements)
+Changes from version 0.2.0 to version 0.3.0 (Performance, safety and template improvements)
 	* To make SafePointer fully typesafe so that one can't accidentally give write access to write protected data, the recursive constness had to be removed.
 		Replace 'const SafePointer<' with 'SafePointer<const '
 		Replace 'const dsr::SafePointer<' with 'dsr::SafePointer<const '
 	* The function given to image_dangerous_replaceDestructor no longer frees the allocation itself, only external resources associated with the data.
 		Because heap_free is called automatically after the destructor in the new memory allocator.
 	* simd.h has moved into the dsr namespace because it was getting too big for the global namespace.
-		gather has been renamed into gather_U32, gather_I32 and gather_F32.
+		* gather has been renamed into gather_U32, gather_I32 and gather_F32.
 			This avoids potential ambiguity.
-		The 'a == b' and 'a != b' operators have been replaced with 'allLanesEqual(a, b)' and '!allLanesEqual(a, b)'.
+		* The 'a == b' and 'a != b' operators have been replaced with 'allLanesEqual(a, b)' and '!allLanesEqual(a, b)'.
 			This reserves the comparison operators for future use with multiple boolean results.
-		Immediate bit shifting now use the bitShiftLeftImmediate and bitShiftRightImmediate functions with a template argument for the number of bits to shift.
+		* Immediate bit shifting now uses the bitShiftLeftImmediate and bitShiftRightImmediate functions with a template argument for the number of bits to shift.
+			Because it was very easy to forget that the offset had to be constant with some SIMD instructions.
 			Replace any << or >> operator that takes a constant offset with the new functions to prevent slowing down.
 				Replace a << 3 with bitShiftLeftImmediate<3>(a).
 				Replace a >> 5 with bitShiftRightImmediate<5>(a).
@@ -49,11 +50,14 @@ Changes from version 0.2.0 to version 0.3.0 (Performance and safety improvements
 				Replace a << b with a << U32x4(b), a << U16x8(b), a << U8x16(b), a << U32x8(b), a << U16x16(b), a << U8x32(b), a << U32xX(b), a << U16xX(b) or a << U8xX(b).
 				Replace a >> b with a >> U32x4(b), a >> U16x8(b), a >> U8x16(b), a >> U32x8(b), a >> U16x16(b), a >> U8x32(b), a >> U32xX(b), a >> U16xX(b) or a >> U8xX(b).
 				The more lanes you use, the slower it becomes when not available in SIMD hardware, so try to use at least 32-bit integers for faster fallback implementations.
-			If you know that the offset is always evenly divisible by 8, you can use byteShiftLeft and byteShiftRight instead.
-				Replace a << 8 with byteShiftLeft(a, 8).
-				Replace a >> 16 with byteShiftRight(a, 16).
-			This makes sure that one does not accidentally use an immediate bit shift with a variable offset.
-				Using a template argument for the offset also allow detecting offsets outside of the deterministic range in compile time.
+		* clamp, clampLower and clampUpper are global methods instead of member methods, to work the same for scalar operations in template functions.
+			Replace myVector.clamp(min, max) with clamp(VectorType(min), myVector, VectorType(max)).
+			Replace myVector.clampLower(min) with clampLower(VectorType(min), myVector).
+			Replace myVector.clampUpper(max) with clampUpper(myVector, VectorType(max)).
+		* reciprocal, reciprocalSquareRoot and squareRoot are now global functions, to work the same for scalar operations in template functions.
+			Replace myVector.reciprocal() with reciprocal(myVector).
+			Replace myVector.reciprocalSquareRoot() with reciprocalSquareRoot(myVector).
+			Replace myVector.squareRoot() with squareRoot(myVector).
 	* Textures have been separated from images to allow using them as separate value types.
 		Because it was very difficult to re-use internal texture sampling methods for custom rendering pipelines.
 		  Now images and textures have immutable value allocated heads and all side-effects are in the pixel buffers.
@@ -62,7 +66,7 @@ Changes from version 0.2.0 to version 0.3.0 (Performance and safety improvements
 		Replace 'image_generatePyramid' with 'texture_generatePyramid'.
 		Create a texture from the image using texture_create_RgbaU8 with the image and the number of resolutions.
 		  Then assign the texture instead of the image.
-s	* PackOrder.h has a new packOrder_ prefix for global functions to prevent naming conflicts.
+	* PackOrder.h has a new packOrder_ prefix for global functions to prevent naming conflicts.
 		Replace 'getRed' with 'packOrder_getRed'.
 		Replace 'getGreen' with 'packOrder_getGreen'.
 		Replace 'getBlue' with 'packOrder_getBlue'.
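
The move from member functions to global functions described in the notes above can be illustrated with a minimal sketch. `Vec4` and `limitToByteRange` are hypothetical names invented for this example and are not part of DFPSR; the point is only that an overloaded free function lets one template body accept both scalars and vector types, which a member call such as `value.clampUpper(max)` cannot do for `float`.

```cpp
#include <cassert>

// Hypothetical stand-in for a 4-lane float vector. This is NOT the real dsr::F32x4.
struct Vec4 {
	float lanes[4];
};

// Scalar overload.
inline float clampUpper(float value, float maxValue) {
	return (value < maxValue) ? value : maxValue;
}

// Vector overload with the same name and argument order as the scalar one.
inline Vec4 clampUpper(const Vec4 &value, const Vec4 &maxValue) {
	Vec4 result;
	for (int i = 0; i < 4; i++) {
		result.lanes[i] = clampUpper(value.lanes[i], maxValue.lanes[i]);
	}
	return result;
}

// One template body serves both types, because the call is resolved by overloading
// instead of requiring T to have a clampUpper member function.
template <typename T>
inline T limitToByteRange(const T &value, const T &limit) {
	return clampUpper(value, limit);
}
```

Calling `limitToByteRange(300.0f, 255.0f)` and `limitToByteRange(someVec4, limitVec4)` then go through the same template, which is the property the changelog entry relies on.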

+ 12 - 0
Source/DFPSR/base/DsrTraits.h

@@ -90,6 +90,18 @@
 		DSR_DECLARE_PROPERTY(DsrTrait_Any_F32)
 		DSR_APPLY_PROPERTY(DsrTrait_Any_F32, float)
 
 
+		DSR_DECLARE_PROPERTY(DsrTrait_Any)
+		DSR_APPLY_PROPERTY(DsrTrait_Any,   int8_t)
+		DSR_APPLY_PROPERTY(DsrTrait_Any,  int16_t)
+		DSR_APPLY_PROPERTY(DsrTrait_Any,  int32_t)
+		DSR_APPLY_PROPERTY(DsrTrait_Any,  int64_t)
+		DSR_APPLY_PROPERTY(DsrTrait_Any,  uint8_t)
+		DSR_APPLY_PROPERTY(DsrTrait_Any, uint16_t)
+		DSR_APPLY_PROPERTY(DsrTrait_Any, uint32_t)
+		DSR_APPLY_PROPERTY(DsrTrait_Any, uint64_t)
+		DSR_APPLY_PROPERTY(DsrTrait_Any,    float)
+		DSR_APPLY_PROPERTY(DsrTrait_Any,   double)
+
 		DSR_DECLARE_PROPERTY(DsrTrait_Scalar_SignedInteger)
 		DSR_APPLY_PROPERTY(DsrTrait_Scalar_SignedInteger,  int8_t)
 		DSR_APPLY_PROPERTY(DsrTrait_Scalar_SignedInteger, int16_t)
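
The definitions of DSR_DECLARE_PROPERTY, DSR_APPLY_PROPERTY and DSR_ENABLE_IF are not visible in this diff, so the following is only a sketch of one common way such a property system can be built, using a class template that defaults to false and is specialized to true per type. All `SKETCH_` macros and the `twice` function are hypothetical names made up for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Hypothetical equivalents of the DSR property macros; the real definitions may differ.
#define SKETCH_DECLARE_PROPERTY(NAME) \
	template <typename T> struct NAME : std::false_type {};
#define SKETCH_APPLY_PROPERTY(NAME, TYPE) \
	template <> struct NAME<TYPE> : std::true_type {};
#define SKETCH_CHECK_PROPERTY(NAME, TYPE) (NAME<TYPE>::value)
#define SKETCH_ENABLE_IF(CONDITION) \
	typename std::enable_if<CONDITION, int>::type = 0

// Declare a property and opt in the types that should have it.
SKETCH_DECLARE_PROPERTY(Trait_Any)
SKETCH_APPLY_PROPERTY(Trait_Any, int32_t)
SKETCH_APPLY_PROPERTY(Trait_Any, float)

// This overload only takes part in overload resolution for types with the property.
template <typename T, SKETCH_ENABLE_IF(SKETCH_CHECK_PROPERTY(Trait_Any, T))>
inline T twice(const T &value) {
	return value + value;
}
```

Under this design, adding a new type to a trait is a single line next to the type's definition, which matches how the vector types are registered for DsrTrait_Any elsewhere in this commit.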

+ 52 - 0
Source/DFPSR/base/noSimd.h

@@ -27,7 +27,9 @@
 #define DFPSR_NO_SIMD
 
 
 #include <stdint.h>
+#include <cmath>
 #include "SafePointer.h"
+#include "DsrTraits.h"
 
 
 namespace dsr {
 	// Type conversions.
@@ -106,6 +108,56 @@ namespace dsr {
 		return left >> bitOffset;
 	}
 
 
+	// A minimum function that can take more than two arguments.
+	// Post-condition: Returns the smallest of all given values, which must be comparable using the < operator and have the same type.
+	template <typename T, DSR_ENABLE_IF(DSR_CHECK_PROPERTY(DsrTrait_Scalar, T))>
+	inline T min(const T &a, const T &b) {
+		return (a < b) ? a : b;
+	}
+	template <typename T, typename... TAIL, DSR_ENABLE_IF(DSR_CHECK_PROPERTY(DsrTrait_Scalar, T))>
+	inline T min(const T &a, const T &b, TAIL... tail) {
+		return min(min(a, b), tail...);
+	}
+
+	// A maximum function that can take more than two arguments.
+	// Post-condition: Returns the largest of all given values, which must be comparable using the > operator and have the same type.
+	template <typename T, DSR_ENABLE_IF(DSR_CHECK_PROPERTY(DsrTrait_Scalar, T))>
+	inline T max(const T &a, const T &b) {
+		return (a > b) ? a : b;
+	}
+	template <typename T, typename... TAIL, DSR_ENABLE_IF(DSR_CHECK_PROPERTY(DsrTrait_Scalar, T))>
+	inline T max(const T &a, const T &b, TAIL... tail) {
+		return max(max(a, b), tail...);
+	}
+
+	// TODO: Implement min and max for integer vectors in simd.h.
+	//       Start by implementing vectorized comparisons and blend functions as a fallback for unsupported types.
+
+	// Pre-condition: minValue <= maxValue
+	// Post-condition: Returns value clamped from minValue to maxValue.
+	template <typename T, DSR_ENABLE_IF(DSR_CHECK_PROPERTY(DsrTrait_Any, T))>
+	inline T clamp(const T &minValue, const T &value, const T &maxValue) {
+		return max(minValue, min(value, maxValue));
+	}
+
+	// Post-condition: Returns value clamped to minValue.
+	template <typename T, DSR_ENABLE_IF(DSR_CHECK_PROPERTY(DsrTrait_Any, T))>
+	inline T clampLower(const T &minValue, const T &value) {
+		return max(minValue, value);
+	}
+
+	// Post-condition: Returns value clamped to maxValue.
+	template <typename T, DSR_ENABLE_IF(DSR_CHECK_PROPERTY(DsrTrait_Any, T))>
+	inline T clampUpper(const T &value, const T &maxValue) {
+		return min(value, maxValue);
+	}
+
+	inline float reciprocal(float value) { return 1.0f / value; }
+
+	inline float reciprocalSquareRoot(float value) { return 1.0f / sqrt(value); }
+
+	inline float squareRoot(float value) { return sqrt(value); }
+
 	// TODO: Add more functions from simd.h.
 }
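
The variadic min and max and the clamp functions added in this hunk can be tried in isolation with the sketch below. It drops the DSR_ENABLE_IF constraints and lives in a hypothetical `sketch` namespace, so it is an approximation of the pattern rather than the library code. Note the argument order of clamp: the lower bound comes first, the value second, and the upper bound last.

```cpp
#include <cassert>

namespace sketch {
	// Two-argument base cases.
	template <typename T>
	inline T min(const T &a, const T &b) { return (a < b) ? a : b; }
	template <typename T>
	inline T max(const T &a, const T &b) { return (a > b) ? a : b; }

	// Recursive cases: reduce the first two arguments, then continue with the tail.
	template <typename T, typename... TAIL>
	inline T min(const T &a, const T &b, TAIL... tail) { return min(min(a, b), tail...); }
	template <typename T, typename... TAIL>
	inline T max(const T &a, const T &b, TAIL... tail) { return max(max(a, b), tail...); }

	// Pre-condition: minValue <= maxValue
	template <typename T>
	inline T clamp(const T &minValue, const T &value, const T &maxValue) {
		return max(minValue, min(value, maxValue));
	}
	template <typename T>
	inline T clampLower(const T &minValue, const T &value) { return max(minValue, value); }
	template <typename T>
	inline T clampUpper(const T &value, const T &maxValue) { return min(value, maxValue); }
}
```

With two arguments, both the base case and the variadic overload are viable, and partial ordering of function templates selects the non-variadic base case, which is what ends the recursion.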
 
 

+ 142 - 255
Source/DFPSR/base/simd.h

@@ -529,124 +529,74 @@
 			#endif
 			this->writeAlignedUnsafe(pointer);
 		}
-		// 1 / x
-		//   Useful for multiple divisions with the same denominator
-		//   Useless if the denominator is a constant
-		F32x4 reciprocal() const {
-			#if defined USE_BASIC_SIMD
-				#if defined USE_SSE2
-					// Approximate
-					SIMD_F32x4 lowQ = _mm_rcp_ps(this->v);
-					// Refine
-					return F32x4(SUB_F32_SIMD(ADD_F32_SIMD(lowQ, lowQ), MUL_F32_SIMD(this->v, MUL_F32_SIMD(lowQ, lowQ))));
-				#elif defined USE_NEON
-					// Approximate
-					SIMD_F32x4 result = vrecpeq_f32(this->v);
-					// Refine
-					result = MUL_F32_SIMD(vrecpsq_f32(this->v, result), result);
-					return F32x4(MUL_F32_SIMD(vrecpsq_f32(this->v, result), result));
-				#else
-					assert(false);
-					return F32x4(0);
-				#endif
-			#else
-				return F32x4(1.0f / this->scalars[0], 1.0f / this->scalars[1], 1.0f / this->scalars[2], 1.0f / this->scalars[3]);
-			#endif
-		}
-		// 1 / sqrt(x)
-		//   Useful for normalizing vectors
-		F32x4 reciprocalSquareRoot() const {
-			#if defined USE_BASIC_SIMD
-				#if defined USE_SSE2
-					SIMD_F32x4 reciRoot = _mm_rsqrt_ps(this->v);
-					SIMD_F32x4 mul = MUL_F32_SIMD(MUL_F32_SIMD(this->v, reciRoot), reciRoot);
-					reciRoot = MUL_F32_SIMD(MUL_F32_SIMD(LOAD_SCALAR_F32_SIMD(0.5f), reciRoot), SUB_F32_SIMD(LOAD_SCALAR_F32_SIMD(3.0f), mul));
-					return F32x4(reciRoot);
-				#elif defined USE_NEON
-					// Approximate
-					SIMD_F32x4 reciRoot = vrsqrteq_f32(this->v);
-					// Refine
-					reciRoot = MUL_F32_SIMD(vrsqrtsq_f32(MUL_F32_SIMD(this->v, reciRoot), reciRoot), reciRoot);
-					return F32x4(reciRoot);
-				#else
-					assert(false);
-					return F32x4(0);
-				#endif
-			#else
-				return F32x4(1.0f / sqrt(this->scalars[0]), 1.0f / sqrt(this->scalars[1]), 1.0f / sqrt(this->scalars[2]), 1.0f / sqrt(this->scalars[3]));
-			#endif
-		}
-		// sqrt(x)
-		//   Useful for getting lengths of vectors
-		F32x4 squareRoot() const {
-			#if defined USE_BASIC_SIMD
-				#if defined USE_SSE2
-					SIMD_F32x4 half = LOAD_SCALAR_F32_SIMD(0.5f);
-					// Approximate
-					SIMD_F32x4 root = _mm_sqrt_ps(this->v);
-					// Refine
-					root = _mm_mul_ps(_mm_add_ps(root, _mm_div_ps(this->v, root)), half);
-					return F32x4(root);
-				#elif defined USE_NEON
-					return F32x4(MUL_F32_SIMD(this->v, this->reciprocalSquareRoot().v));
-				#else
-					assert(false);
-					return F32x4(0);
-				#endif
-			#else
-				return F32x4(sqrt(this->scalars[0]), sqrt(this->scalars[1]), sqrt(this->scalars[2]), sqrt(this->scalars[3]));
-			#endif
-		}
-		F32x4 clamp(float minimum, float maximum) const {
-			#if defined USE_BASIC_SIMD
-				return F32x4(MIN_F32_SIMD(MAX_F32_SIMD(this->v, LOAD_SCALAR_F32_SIMD(minimum)), LOAD_SCALAR_F32_SIMD(maximum)));
+	};
+
+	// 1 / value
+	inline F32x4 reciprocal(const F32x4 &value) {
+		#if defined USE_BASIC_SIMD
+			#if defined USE_SSE2
+				// Approximate
+				SIMD_F32x4 lowQ = _mm_rcp_ps(value.v);
+				// Refine
+				return F32x4(SUB_F32_SIMD(ADD_F32_SIMD(lowQ, lowQ), MUL_F32_SIMD(value.v, MUL_F32_SIMD(lowQ, lowQ))));
+			#elif defined USE_NEON
+				// Approximate
+				SIMD_F32x4 result = vrecpeq_f32(value.v);
+				// Refine
+				result = MUL_F32_SIMD(vrecpsq_f32(value.v, result), result);
+				return F32x4(MUL_F32_SIMD(vrecpsq_f32(value.v, result), result));
 			#else
-				float val0 = this->scalars[0];
-				float val1 = this->scalars[1];
-				float val2 = this->scalars[2];
-				float val3 = this->scalars[3];
-				if (minimum > val0) { val0 = minimum; }
-				if (maximum < val0) { val0 = maximum; }
-				if (minimum > val1) { val1 = minimum; }
-				if (maximum < val1) { val1 = maximum; }
-				if (minimum > val2) { val2 = minimum; }
-				if (maximum < val2) { val2 = maximum; }
-				if (minimum > val3) { val3 = minimum; }
-				if (maximum < val3) { val3 = maximum; }
-				return F32x4(val0, val1, val2, val3);
+				assert(false);
+				return F32x4(0);
 			#endif
-		}
-		F32x4 clampLower(float minimum) const {
-			#if defined USE_BASIC_SIMD
-				return F32x4(MAX_F32_SIMD(this->v, LOAD_SCALAR_F32_SIMD(minimum)));
+		#else
+			return F32x4(1.0f / value.scalars[0], 1.0f / value.scalars[1], 1.0f / value.scalars[2], 1.0f / value.scalars[3]);
+		#endif
+	}
+
+	// 1 / sqrt(value)
+	inline F32x4 reciprocalSquareRoot(const F32x4 &value) {
+		#if defined USE_BASIC_SIMD
+			#if defined USE_SSE2
+				SIMD_F32x4 reciRoot = _mm_rsqrt_ps(value.v);
+				SIMD_F32x4 mul = MUL_F32_SIMD(MUL_F32_SIMD(value.v, reciRoot), reciRoot);
+				reciRoot = MUL_F32_SIMD(MUL_F32_SIMD(LOAD_SCALAR_F32_SIMD(0.5f), reciRoot), SUB_F32_SIMD(LOAD_SCALAR_F32_SIMD(3.0f), mul));
+				return F32x4(reciRoot);
+			#elif defined USE_NEON
+				// Approximate
+				SIMD_F32x4 reciRoot = vrsqrteq_f32(value.v);
+				// Refine
+				reciRoot = MUL_F32_SIMD(vrsqrtsq_f32(MUL_F32_SIMD(value.v, reciRoot), reciRoot), reciRoot);
+				return F32x4(reciRoot);
 			#else
-				float val0 = this->scalars[0];
-				float val1 = this->scalars[1];
-				float val2 = this->scalars[2];
-				float val3 = this->scalars[3];
-				if (minimum > val0) { val0 = minimum; }
-				if (minimum > val1) { val1 = minimum; }
-				if (minimum > val2) { val2 = minimum; }
-				if (minimum > val3) { val3 = minimum; }
-				return F32x4(val0, val1, val2, val3);
+				assert(false);
+				return F32x4(0);
 			#endif
-		}
-		F32x4 clampUpper(float maximum) const {
-			#if defined USE_BASIC_SIMD
-				return F32x4(MIN_F32_SIMD(this->v, LOAD_SCALAR_F32_SIMD(maximum)));
+		#else
+			return F32x4(1.0f / sqrt(value.scalars[0]), 1.0f / sqrt(value.scalars[1]), 1.0f / sqrt(value.scalars[2]), 1.0f / sqrt(value.scalars[3]));
+		#endif
+	}
+
+	// sqrt(value)
+	inline F32x4 squareRoot(const F32x4 &value) {
+		#if defined USE_BASIC_SIMD
+			#if defined USE_SSE2
+				SIMD_F32x4 half = LOAD_SCALAR_F32_SIMD(0.5f);
+				// Approximate
+				SIMD_F32x4 root = _mm_sqrt_ps(value.v);
+				// Refine
+				root = _mm_mul_ps(_mm_add_ps(root, _mm_div_ps(value.v, root)), half);
+				return F32x4(root);
+			#elif defined USE_NEON
+				return F32x4(MUL_F32_SIMD(value.v, reciprocalSquareRoot(value).v));
 			#else
-				float val0 = this->scalars[0];
-				float val1 = this->scalars[1];
-				float val2 = this->scalars[2];
-				float val3 = this->scalars[3];
-				if (maximum < val0) { val0 = maximum; }
-				if (maximum < val1) { val1 = maximum; }
-				if (maximum < val2) { val2 = maximum; }
-				if (maximum < val3) { val3 = maximum; }
-				return F32x4(val0, val1, val2, val3);
+				assert(false);
+				return F32x4(0);
 			#endif
-		}
-	};
+		#else
+			return F32x4(sqrt(value.scalars[0]), sqrt(value.scalars[1]), sqrt(value.scalars[2]), sqrt(value.scalars[3]));
+		#endif
+	}
 
 
 	union I32x4 {
 		private:
@@ -1258,151 +1208,72 @@
 			#endif
 			this->writeAlignedUnsafe(pointer);
 		}
-		// 1 / x
-		//   Useful for multiple divisions with the same denominator
-		//   Useless if the denominator is a constant
-		F32x8 reciprocal() const {
-			#if defined USE_AVX2
-				// Approximate
-				SIMD_F32x8 lowQ = _mm256_rcp_ps(this->v);
-				// Refine
-				return F32x8(SUB_F32_SIMD256(ADD_F32_SIMD256(lowQ, lowQ), MUL_F32_SIMD256(this->v, MUL_F32_SIMD256(lowQ, lowQ))));
-			#else
-				return F32x8(
-				  1.0f / this->scalars[0],
-				  1.0f / this->scalars[1],
-				  1.0f / this->scalars[2],
-				  1.0f / this->scalars[3],
-				  1.0f / this->scalars[4],
-				  1.0f / this->scalars[5],
-				  1.0f / this->scalars[6],
-				  1.0f / this->scalars[7]
-				);
-			#endif
-		}
-		// 1 / sqrt(x)
-		//   Useful for normalizing vectors
-		F32x8 reciprocalSquareRoot() const {
-			#if defined USE_AVX2
-				//__m128 reciRoot = _mm256_rsqrt_ps(this->v);
-				SIMD_F32x8 reciRoot = _mm256_rsqrt_ps(this->v);
-				SIMD_F32x8 mul = MUL_F32_SIMD256(MUL_F32_SIMD256(this->v, reciRoot), reciRoot);
-				reciRoot = MUL_F32_SIMD256(MUL_F32_SIMD256(LOAD_SCALAR_F32_SIMD256(0.5f), reciRoot), SUB_F32_SIMD256(LOAD_SCALAR_F32_SIMD256(3.0f), mul));
-				return F32x8(reciRoot);
-			#else
-				return F32x8(
-				  1.0f / sqrt(this->scalars[0]),
-				  1.0f / sqrt(this->scalars[1]),
-				  1.0f / sqrt(this->scalars[2]),
-				  1.0f / sqrt(this->scalars[3]),
-				  1.0f / sqrt(this->scalars[4]),
-				  1.0f / sqrt(this->scalars[5]),
-				  1.0f / sqrt(this->scalars[6]),
-				  1.0f / sqrt(this->scalars[7])
-				);
-			#endif
-		}
-		// sqrt(x)
-		//   Useful for getting lengths of vectors
-		F32x8 squareRoot() const {
-			#if defined USE_AVX2
-				SIMD_F32x8 half = LOAD_SCALAR_F32_SIMD256(0.5f);
-				// Approximate
-				SIMD_F32x8 root = _mm256_sqrt_ps(this->v);
-				// Refine
-				root = _mm256_mul_ps(_mm256_add_ps(root, _mm256_div_ps(this->v, root)), half);
-				return F32x8(root);
-			#else
-				return F32x8(
-				  sqrt(this->scalars[0]),
-				  sqrt(this->scalars[1]),
-				  sqrt(this->scalars[2]),
-				  sqrt(this->scalars[3]),
-				  sqrt(this->scalars[4]),
-				  sqrt(this->scalars[5]),
-				  sqrt(this->scalars[6]),
-				  sqrt(this->scalars[7]));
-			#endif
-		}
-		F32x8 clamp(float minimum, float maximum) const {
-			#if defined USE_256BIT_F_SIMD
-				return F32x8(MIN_F32_SIMD256(MAX_F32_SIMD256(this->v, LOAD_SCALAR_F32_SIMD256(minimum)), LOAD_SCALAR_F32_SIMD256(maximum)));
-			#else
-				float val0 = this->scalars[0];
-				float val1 = this->scalars[1];
-				float val2 = this->scalars[2];
-				float val3 = this->scalars[3];
-				float val4 = this->scalars[4];
-				float val5 = this->scalars[5];
-				float val6 = this->scalars[6];
-				float val7 = this->scalars[7];
-				if (minimum > val0) { val0 = minimum; }
-				if (maximum < val0) { val0 = maximum; }
-				if (minimum > val1) { val1 = minimum; }
-				if (maximum < val1) { val1 = maximum; }
-				if (minimum > val2) { val2 = minimum; }
-				if (maximum < val2) { val2 = maximum; }
-				if (minimum > val3) { val3 = minimum; }
-				if (maximum < val3) { val3 = maximum; }
-				if (minimum > val4) { val4 = minimum; }
-				if (maximum < val4) { val4 = maximum; }
-				if (minimum > val5) { val5 = minimum; }
-				if (maximum < val5) { val5 = maximum; }
-				if (minimum > val6) { val6 = minimum; }
-				if (maximum < val6) { val6 = maximum; }
-				if (minimum > val7) { val7 = minimum; }
-				if (maximum < val7) { val7 = maximum; }
-				return F32x8(val0, val1, val2, val3, val4, val5, val6, val7);
-			#endif
-		}
-		F32x8 clampLower(float minimum) const {
-			#if defined USE_256BIT_F_SIMD
-				return F32x8(MAX_F32_SIMD256(this->v, LOAD_SCALAR_F32_SIMD256(minimum)));
-			#else
-				float val0 = this->scalars[0];
-				float val1 = this->scalars[1];
-				float val2 = this->scalars[2];
-				float val3 = this->scalars[3];
-				float val4 = this->scalars[4];
-				float val5 = this->scalars[5];
-				float val6 = this->scalars[6];
-				float val7 = this->scalars[7];
-				if (minimum > val0) { val0 = minimum; }
-				if (minimum > val1) { val1 = minimum; }
-				if (minimum > val2) { val2 = minimum; }
-				if (minimum > val3) { val3 = minimum; }
-				if (minimum > val4) { val4 = minimum; }
-				if (minimum > val5) { val5 = minimum; }
-				if (minimum > val6) { val6 = minimum; }
-				if (minimum > val7) { val7 = minimum; }
-				return F32x8(val0, val1, val2, val3, val4, val5, val6, val7);
-			#endif
-		}
-		F32x8 clampUpper(float maximum) const {
-			#if defined USE_256BIT_F_SIMD
-				return F32x8(MIN_F32_SIMD256(this->v, LOAD_SCALAR_F32_SIMD256(maximum)));
-			#else
-				float val0 = this->scalars[0];
-				float val1 = this->scalars[1];
-				float val2 = this->scalars[2];
-				float val3 = this->scalars[3];
-				float val4 = this->scalars[4];
-				float val5 = this->scalars[5];
-				float val6 = this->scalars[6];
-				float val7 = this->scalars[7];
-				if (maximum < val0) { val0 = maximum; }
-				if (maximum < val1) { val1 = maximum; }
-				if (maximum < val2) { val2 = maximum; }
-				if (maximum < val3) { val3 = maximum; }
-				if (maximum < val4) { val4 = maximum; }
-				if (maximum < val5) { val5 = maximum; }
-				if (maximum < val6) { val6 = maximum; }
-				if (maximum < val7) { val7 = maximum; }
-				return F32x8(val0, val1, val2, val3, val4, val5, val6, val7);
-			#endif
-		}
 	};
 
 
+	// 1 / value
+	inline F32x8 reciprocal(const F32x8 &value) {
+		#if defined USE_AVX2
+			// Approximate
+			SIMD_F32x8 lowQ = _mm256_rcp_ps(value.v);
+			// Refine
+			return F32x8(SUB_F32_SIMD256(ADD_F32_SIMD256(lowQ, lowQ), MUL_F32_SIMD256(value.v, MUL_F32_SIMD256(lowQ, lowQ))));
+		#else
+			return F32x8(
+			  1.0f / value.scalars[0],
+			  1.0f / value.scalars[1],
+			  1.0f / value.scalars[2],
+			  1.0f / value.scalars[3],
+			  1.0f / value.scalars[4],
+			  1.0f / value.scalars[5],
+			  1.0f / value.scalars[6],
+			  1.0f / value.scalars[7]
+			);
+		#endif
+	}
+
+	// 1 / sqrt(value)
+	inline F32x8 reciprocalSquareRoot(const F32x8 &value) {
+		#if defined USE_AVX2
+			SIMD_F32x8 reciRoot = _mm256_rsqrt_ps(value.v);
+			SIMD_F32x8 mul = MUL_F32_SIMD256(MUL_F32_SIMD256(value.v, reciRoot), reciRoot);
+			reciRoot = MUL_F32_SIMD256(MUL_F32_SIMD256(LOAD_SCALAR_F32_SIMD256(0.5f), reciRoot), SUB_F32_SIMD256(LOAD_SCALAR_F32_SIMD256(3.0f), mul));
+			return F32x8(reciRoot);
+		#else
+			return F32x8(
+			  1.0f / sqrt(value.scalars[0]),
+			  1.0f / sqrt(value.scalars[1]),
+			  1.0f / sqrt(value.scalars[2]),
+			  1.0f / sqrt(value.scalars[3]),
+			  1.0f / sqrt(value.scalars[4]),
+			  1.0f / sqrt(value.scalars[5]),
+			  1.0f / sqrt(value.scalars[6]),
+			  1.0f / sqrt(value.scalars[7])
+			);
+		#endif
+	}
+
+	// sqrt(value)
+	inline F32x8 squareRoot(const F32x8 &value) {
+		#if defined USE_AVX2
+			SIMD_F32x8 half = LOAD_SCALAR_F32_SIMD256(0.5f);
+			// Approximate
+			SIMD_F32x8 root = _mm256_sqrt_ps(value.v);
+			// Refine
+			root = _mm256_mul_ps(_mm256_add_ps(root, _mm256_div_ps(value.v, root)), half);
+			return F32x8(root);
+		#else
+			return F32x8(
+			  sqrt(value.scalars[0]),
+			  sqrt(value.scalars[1]),
+			  sqrt(value.scalars[2]),
+			  sqrt(value.scalars[3]),
+			  sqrt(value.scalars[4]),
+			  sqrt(value.scalars[5]),
+			  sqrt(value.scalars[6]),
+			  sqrt(value.scalars[7]));
+		#endif
+	}
+
 	union I32x8 {
 		private:
 			// The uninitialized default constructor is private for safety reasons.
@@ -3912,6 +3783,16 @@
 	DSR_APPLY_PROPERTY(DsrTrait_Any_I32, I32x8)
 	DSR_APPLY_PROPERTY(DsrTrait_Any_F32, F32x4)
 	DSR_APPLY_PROPERTY(DsrTrait_Any_F32, F32x8)
+	DSR_APPLY_PROPERTY(DsrTrait_Any , U8x16)
+	DSR_APPLY_PROPERTY(DsrTrait_Any , U8x32)
+	DSR_APPLY_PROPERTY(DsrTrait_Any, U16x8)
+	DSR_APPLY_PROPERTY(DsrTrait_Any, U16x16)
+	DSR_APPLY_PROPERTY(DsrTrait_Any, U32x4)
+	DSR_APPLY_PROPERTY(DsrTrait_Any, U32x8)
+	DSR_APPLY_PROPERTY(DsrTrait_Any, I32x4)
+	DSR_APPLY_PROPERTY(DsrTrait_Any, I32x8)
+	DSR_APPLY_PROPERTY(DsrTrait_Any, F32x4)
+	DSR_APPLY_PROPERTY(DsrTrait_Any, F32x8)
 
 
 	// TODO: Use as independent types when the largest vector lengths are not known in compile time on ARM SVE.
 	//DSR_APPLY_PROPERTY(DsrTrait_Any_U8 , U8xX)
@@ -3920,6 +3801,12 @@
 	//DSR_APPLY_PROPERTY(DsrTrait_Any_I32, I32xX)
 	//DSR_APPLY_PROPERTY(DsrTrait_Any_F32, F32xX)
 	//DSR_APPLY_PROPERTY(DsrTrait_Any_F32, F32xF)
+	//DSR_APPLY_PROPERTY(DsrTrait_Any , U8xX)
+	//DSR_APPLY_PROPERTY(DsrTrait_Any, U16xX)
+	//DSR_APPLY_PROPERTY(DsrTrait_Any, U32xX)
+	//DSR_APPLY_PROPERTY(DsrTrait_Any, I32xX)
+	//DSR_APPLY_PROPERTY(DsrTrait_Any, F32xX)
+	//DSR_APPLY_PROPERTY(DsrTrait_Any, F32xF)
 
 
 	}
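
The "Approximate" and "Refine" steps in the SSE2 and AVX2 branches of reciprocal and reciprocalSquareRoot have the shape of one Newton-Raphson iteration applied to a hardware estimate. For an input a and an estimate x0, the reciprocal step computes x1 = 2·x0 − a·x0², and the reciprocal square root step computes x1 = 0.5·x0·(3 − a·x0²). The scalar sketch below uses hypothetical function names and a hand-picked rough estimate in place of the hardware instruction, to show that one step reduces the error.

```cpp
#include <cassert>
#include <cmath>

// One Newton-Raphson step towards 1 / a, starting from the estimate x0.
// Mirrors SUB(ADD(lowQ, lowQ), MUL(v, MUL(lowQ, lowQ))) in the vector code.
inline float refineReciprocal(float a, float x0) {
	return (x0 + x0) - a * (x0 * x0);
}

// One Newton-Raphson step towards 1 / sqrt(a), starting from the estimate x0.
// Mirrors MUL(MUL(0.5, reciRoot), SUB(3.0, MUL(MUL(v, reciRoot), reciRoot))).
inline float refineReciprocalSquareRoot(float a, float x0) {
	return (0.5f * x0) * (3.0f - (a * x0) * x0);
}

// Absolute error helper for comparing estimates.
inline float absoluteError(float estimate, float exact) {
	return std::fabs(estimate - exact);
}
```

For a = 4, a rough estimate of 0.24 for the reciprocal refines to about 0.2496, and a rough estimate of 0.48 for the reciprocal square root refines to about 0.4988, both closer to the exact values 0.25 and 0.5.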
 
 

+ 4 - 4
Source/DFPSR/base/simd3D.h

@@ -72,10 +72,10 @@ inline SIMD_TYPE squareLength(const VECTOR_TYPE &v) { \
 	return dotProduct(v, v); \
 } \
 inline SIMD_TYPE length(const VECTOR_TYPE &v) { \
-	return squareLength(v).squareRoot(); \
+	return squareRoot(squareLength(v)); \
 } \
 inline VECTOR_TYPE normalize(const VECTOR_TYPE &v) { \
-	return v * squareLength(v).reciprocalSquareRoot(); \
+	return v * reciprocalSquareRoot(squareLength(v)); \
 }
 
 
 // These are the infix operations for 3D SIMD vectors F32x4x3, F32x8x3...
@@ -117,10 +117,10 @@ inline SIMD_TYPE squareLength(const VECTOR_TYPE &v) { \
 	return dotProduct(v, v); \
 } \
 inline SIMD_TYPE length(const VECTOR_TYPE &v) { \
-	return squareLength(v).squareRoot(); \
+	return squareRoot(squareLength(v)); \
 } \
 inline VECTOR_TYPE normalize(const VECTOR_TYPE &v) { \
-	return v * squareLength(v).reciprocalSquareRoot(); \
+	return v * reciprocalSquareRoot(squareLength(v)); \
 }
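
The rewritten length and normalize bodies only need free functions named squareRoot and reciprocalSquareRoot to exist for the lane type. A scalar sketch of the same shape follows; `Vector3` is a hypothetical template written for this example, whereas the real header generates its types with macros.

```cpp
#include <cassert>
#include <cmath>

// Scalar versions of the free functions, matching the ones added to noSimd.h.
inline float squareRoot(float value) { return std::sqrt(value); }
inline float reciprocalSquareRoot(float value) { return 1.0f / std::sqrt(value); }

// Hypothetical 3D vector where S is the type of each component (a scalar or a SIMD vector).
template <typename S>
struct Vector3 {
	S x, y, z;
};

template <typename S>
inline S squareLength(const Vector3<S> &v) {
	return v.x * v.x + v.y * v.y + v.z * v.z;
}

// Same call pattern as the updated macros: free functions instead of member calls.
template <typename S>
inline S length(const Vector3<S> &v) {
	return squareRoot(squareLength(v));
}

template <typename S>
inline Vector3<S> normalize(const Vector3<S> &v) {
	S scale = reciprocalSquareRoot(squareLength(v));
	return Vector3<S>{v.x * scale, v.y * scale, v.z * scale};
}
```

Instantiating `Vector3<float>` works here because `float` gets its squareRoot through an overload, which would not have been possible with the old `squareLength(v).squareRoot()` member syntax.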
 
 
 // These are the available in-place operations for 2D SIMD vectors F32x4x2, F32x8x2...

+ 9 - 11
Source/DFPSR/image/PackOrder.h

@@ -27,9 +27,8 @@
 #include <cstdint>
 #include "Color.h"
 #include "../base/endian.h"
-#include "../base/DsrTraits.h"
+#include "../base/noSimd.h"
 #include "../api/stringAPI.h"
-#include "../math/scalar.h"
 
 
 namespace dsr {
 
 
@@ -185,17 +184,16 @@ U packOrder_packBytes(const U &s0, const U &s1, const U &s2, const U &s3, const
 //   From F32x4 to U32x4
 //   From F32x8 to U32x8
 //   From F32xX to U32xX
-//   From F32xF to U32xF
 template<typename U, typename F, DSR_ENABLE_IF(
 	 DSR_CHECK_PROPERTY(DsrTrait_Any_U32, U)
   && DSR_CHECK_PROPERTY(DsrTrait_Any_F32, F)
 )>
 inline U packOrder_floatToSaturatedByte(const F &s0, const F &s1, const F &s2, const F &s3) {
 	return packOrder_packBytes(
-	  truncateToU32(s0.clamp(0.1f, 255.1f)),
-	  truncateToU32(s1.clamp(0.1f, 255.1f)),
-	  truncateToU32(s2.clamp(0.1f, 255.1f)),
-	  truncateToU32(s3.clamp(0.1f, 255.1f))
+	  truncateToU32(clampUpper(s0, F(255.1f))),
+	  truncateToU32(clampUpper(s1, F(255.1f))),
+	  truncateToU32(clampUpper(s2, F(255.1f))),
+	  truncateToU32(clampUpper(s3, F(255.1f)))
 	);
 }
 // Using a specified pack order
@@ -205,10 +203,10 @@ template<typename U, typename F, DSR_ENABLE_IF(
 )>
 inline U packOrder_floatToSaturatedByte(const F &s0, const F &s1, const F &s2, const F &s3, const PackOrder &order) {
 	return packOrder_packBytes(
-	  truncateToU32(s0.clamp(0.1f, 255.1f)),
-	  truncateToU32(s1.clamp(0.1f, 255.1f)),
-	  truncateToU32(s2.clamp(0.1f, 255.1f)),
-	  truncateToU32(s3.clamp(0.1f, 255.1f)),
+	  truncateToU32(clampUpper(s0, F(255.1f))),
+	  truncateToU32(clampUpper(s1, F(255.1f))),
+	  truncateToU32(clampUpper(s2, F(255.1f))),
+	  truncateToU32(clampUpper(s3, F(255.1f))),
 	  order
 	);
 }
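
A scalar sketch of the saturate, truncate and pack sequence is given below. The byte layout in `packBytes` (first channel in the least significant byte) and all function bodies are assumptions made for illustration; the real packOrder_packBytes and truncateToU32 are not shown in this diff. Because the new code only applies the upper bound of 255.1f and no longer the lower bound of 0.1f, the sketch assumes non-negative inputs.

```cpp
#include <cassert>
#include <cstdint>

// Scalar stand-ins with the same names as the functions called in the diff.
inline float clampUpper(float value, float maxValue) {
	return (value < maxValue) ? value : maxValue;
}

// Assumes value >= 0, because converting a negative float to an unsigned integer is undefined.
inline uint32_t truncateToU32(float value) {
	return static_cast<uint32_t>(value);
}

// Hypothetical layout: s0 in the lowest byte, s3 in the highest byte.
inline uint32_t packBytes(uint32_t s0, uint32_t s1, uint32_t s2, uint32_t s3) {
	return s0 | (s1 << 8) | (s2 << 16) | (s3 << 24);
}

// Saturates each channel at 255, truncates the fraction and packs four channels into 32 bits.
inline uint32_t floatToSaturatedByte(float s0, float s1, float s2, float s3) {
	return packBytes(
	  truncateToU32(clampUpper(s0, 255.1f)),
	  truncateToU32(clampUpper(s1, 255.1f)),
	  truncateToU32(clampUpper(s2, 255.1f)),
	  truncateToU32(clampUpper(s3, 255.1f))
	);
}
```

Using 255.1f instead of 255.0f as the limit keeps the truncated result at 255 even if the clamped value lands slightly below the limit due to rounding.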

+ 1 - 33
Source/DFPSR/math/scalar.h

@@ -24,42 +24,10 @@
 #ifndef DFPSR_MATH_SCALAR
 #define DFPSR_MATH_SCALAR
 
 
-#include <cmath>
-#include "../base/DsrTraits.h"
+#include "../base/noSimd.h"
 
 
 namespace dsr {
 
 
-// A minimum function that can take more than two arguments.
-// Post-condition: Returns the smallest of all given values, which must be comparable using the < operator and have the same type.
-template <typename T, DSR_ENABLE_IF(DSR_CHECK_PROPERTY(DsrTrait_Scalar, T))>
-inline T min(const T &a, const T &b) {
-	return (a < b) ? a : b;
-}
-template <typename T, typename... TAIL, DSR_ENABLE_IF(DSR_CHECK_PROPERTY(DsrTrait_Scalar, T))>
-inline T min(const T &a, const T &b, TAIL... tail) {
-	return min(min(a, b), tail...);
-}
-
-// A maximum function that can take more than two arguments.
-// Post-condition: Returns the largest of all given values, which must be comparable using the > operator and have the same type.
-template <typename T, DSR_ENABLE_IF(DSR_CHECK_PROPERTY(DsrTrait_Scalar, T))>
-inline T max(const T &a, const T &b) {
-	return (a > b) ? a : b;
-}
-template <typename T, typename... TAIL, DSR_ENABLE_IF(DSR_CHECK_PROPERTY(DsrTrait_Scalar, T))>
-inline T max(const T &a, const T &b, TAIL... tail) {
-	return max(max(a, b), tail...);
-}
-
-// Pre-condition: minValue <= maxValue
-// Post-condition: Returns value clamped from minValue to maxValue.
-template <typename T, DSR_ENABLE_IF(DSR_CHECK_PROPERTY(DsrTrait_Scalar, T))>
-T clamp(const T &minValue, T value, const T &maxValue) {
-	if (value > maxValue) value = maxValue;
-	if (value < minValue) value = minValue;
-	return value;
-}
-
 // Returns a modulo b where 0 <= a < b
 template <typename I, typename U, DSR_ENABLE_IF(DSR_CHECK_PROPERTY(DsrTrait_Scalar_SignedInteger, I) && DSR_CHECK_PROPERTY(DsrTrait_Scalar_Integer, U))>
 inline int32_t signedModulo(I a, U b) {

+ 1 - 1
Source/DFPSR/render/shader/fillerTemplates.h

@@ -224,7 +224,7 @@ inline void fillRowSuper(void *data, PixelShadingCallback pixelShaderFunction, S
 			FVector4D depth = vRecDepth.get();
 			// After linearly interpolating (1 / W, U / W, V / W) based on the affine weights...
 			// Divide 1 by 1 / W to get the linear depth W
-			F32x4 vLinearDepth = vRecDepth.reciprocal();
+			F32x4 vLinearDepth = reciprocal(vRecDepth);
 			// Multiply the vertex weights to the second and third edges with the depth to compensate for that we divided them by depth before interpolating.
 			F32x4 weightB = vRecU * vLinearDepth;
 			F32x4 weightC = vRecV * vLinearDepth;

+ 10 - 10
Source/SDK/SpriteEngine/lightAPI.cpp

@@ -44,13 +44,13 @@ void directedLight(const FMatrix3x3& normalToWorldSpace, OrderedImageRgbaU8& lig
 				F32xXx3 negativeSurfaceNormal = unpackRgb_U32xX_to_F32xXx3(normalColor) - 128.0f;
 				// Calculate light intensity
 				//   Normalization and negation is already pre-multiplied into reverseLightDirection
-				F32xX intensity = dotProduct(negativeSurfaceNormal, reverseLightDirection).clampLower(0.0f);
+				F32xX intensity = clampLower(dotProduct(negativeSurfaceNormal, reverseLightDirection), F32xX(0.0f));
 				F32xX red = intensity * colorR;
 				F32xX green = intensity * colorG;
 				F32xX blue = intensity * colorB;
-				red = red.clampUpper(255.1f);
-				green = green.clampUpper(255.1f);
-				blue = blue.clampUpper(255.1f);
+				red = clampUpper(red, F32xX(255.1f));
+				green = clampUpper(green, F32xX(255.1f));
+				blue = clampUpper(blue, F32xX(255.1f));
 				// TODO: Let color packing handle arbitrary vector lengths.
 				U8xX light = reinterpret_U8FromU32(packOrder_packBytes(truncateToU32(red), truncateToU32(green), truncateToU32(blue)));
 				if (ADD_LIGHT) {
@@ -248,9 +248,9 @@ static void addPointLightSuper(const OrthoView& camera, const IVector2D& worldCe
 					F32xX red = intensity * colorR;
 					F32xX green = intensity * colorG;
 					F32xX blue = intensity * colorB;
-					red = red.clampUpper(255.1f);
-					green = green.clampUpper(255.1f);
-					blue = blue.clampUpper(255.1f);
+					red = clampUpper(red, F32xX(255.1f));
+					green = clampUpper(green, F32xX(255.1f));
+					blue = clampUpper(blue, F32xX(255.1f));
 					// Add light to the image
 					U8xX morelight = reinterpret_U8FromU32(packOrder_packBytes(truncateToU32(red), truncateToU32(green), truncateToU32(blue)));
 					addLight(lightPixel, morelight);
@@ -306,9 +306,9 @@ void blendLight(AlignedImageRgbaU8& colorBuffer, const OrderedImageRgbaU8& diffu
 				F32xX red = (floatFromU32(packOrder_getRed(diffuse)) * floatFromU32(packOrder_getRed(light))) * scale;
 				F32xX green = (floatFromU32(packOrder_getGreen(diffuse)) * floatFromU32(packOrder_getGreen(light))) * scale;
 				F32xX blue = (floatFromU32(packOrder_getBlue(diffuse)) * floatFromU32(packOrder_getBlue(light))) * scale;
-				red = red.clampUpper(255.1f);
-				green = green.clampUpper(255.1f);
-				blue = blue.clampUpper(255.1f);
+				red = clampUpper(red, F32xX(255.1f));
+				green = clampUpper(green, F32xX(255.1f));
+				blue = clampUpper(blue, F32xX(255.1f));
 				U32xX color = packOrder_packBytes(truncateToU32(red), truncateToU32(green), truncateToU32(blue), targetOrder);
 				color.writeAligned(targetPixel, "blendLight: writing color");
 				targetPixel += laneCountX_32Bit;

+ 2 - 2
Source/soundManagers/AlsaSound.cpp

@@ -90,8 +90,8 @@ bool sound_streamToSpeakers(int channels, int sampleRate, std::function<bool(Saf
 			// SIMD vectorized sound conversion with scaling and clamping to signed 16-bit integers.
 			F32x4 lowerFloats = F32x4::readAligned(floatData + t, "sound_streamToSpeakers: Reading lower floats");
 			F32x4 upperFloats = F32x4::readAligned(floatData + t + 4, "sound_streamToSpeakers: Reading upper floats");
-			I32x4 lowerInts = truncateToI32((lowerFloats * 32767.0f).clamp(-32768.0f, 32767.0f));
-			I32x4 upperInts = truncateToI32((upperFloats * 32767.0f).clamp(-32768.0f, 32767.0f));
+			I32x4 lowerInts = truncateToI32(clamp(F32x4(-32768.0f), lowerFloats * 32767.0f, F32x4(32767.0f)));
+			I32x4 upperInts = truncateToI32(clamp(F32x4(-32768.0f), upperFloats * 32767.0f, F32x4(32767.0f)));
 			// TODO: Create I16x8 SIMD vectors for processing sound as 16-bit integers?
 			//       Or just move unzip into simd.h with a fallback solution and remove simdExtra.h.
 			//       Or just implement reading and writing of 16-bit signed integers using multiple SIMD registers or smaller memory regions.
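
For reference, a scalar sketch of what each lane of the conversion above computes: scale the sample by 32767, clamp it to the signed 16-bit range, then drop the fraction. The names `clampScalar` and `sampleToPcm16` are hypothetical, and truncation toward zero is an assumption based on the name `truncateToI32`. The argument order (minimum, value, maximum) follows the `clamp` calls in the hunk.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical scalar counterpart of clamp(minValue, value, maxValue).
inline float clampScalar(float minValue, float value, float maxValue) {
	if (value > maxValue) value = maxValue;
	if (value < minValue) value = minValue;
	return value;
}

// Hypothetical helper: converts one float sample to a 16-bit PCM value,
// returned as a 32-bit integer like the I32x4 lanes in the hunk.
inline int32_t sampleToPcm16(float sample) {
	// Clamp before converting, so that the cast never sees an out-of-range value.
	float scaled = clampScalar(-32768.0f, sample * 32767.0f, 32767.0f);
	// static_cast from float to integer truncates toward zero.
	return static_cast<int32_t>(scaled);
}

// Usage: in-range samples are scaled, out-of-range samples saturate.
inline void sampleToPcm16Example() {
	assert(sampleToPcm16(1.0f) == 32767);
	assert(sampleToPcm16(0.5f) == 16383);
	assert(sampleToPcm16(2.0f) == 32767);
	assert(sampleToPcm16(-2.0f) == -32768);
}
```

Because the scale factor is 32767 rather than 32768, an input of -1.0 maps to -32767; the lower bound of -32768 is only reached by samples below -1.0.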

+ 2 - 2
Source/soundManagers/WinMMSound.cpp

@@ -107,8 +107,8 @@ bool sound_streamToSpeakers(int channels, int sampleRate, std::function<bool(Saf
 					// SIMD vectorized sound conversion with scaling and clamping to signed 16-bit integers.
 					F32x4 lowerFloats = F32x4::readAligned(floatData + t, "sound_streamToSpeakers: Reading lower floats");
 					F32x4 upperFloats = F32x4::readAligned(floatData + t + 4, "sound_streamToSpeakers: Reading upper floats");
-					I32x4 lowerInts = truncateToI32((lowerFloats * 32767.0f).clamp(-32768.0f, 32767.0f));
-					I32x4 upperInts = truncateToI32((upperFloats * 32767.0f).clamp(-32768.0f, 32767.0f));
+					I32x4 lowerInts = truncateToI32(clamp(F32x4(-32768.0f), lowerFloats * 32767.0f, F32x4(32767.0f)));
+					I32x4 upperInts = truncateToI32(clamp(F32x4(-32768.0f), upperFloats * 32767.0f, F32x4(32767.0f)));
 					// TODO: Create I16x8 SIMD vectors for processing sound as 16-bit integers?
 					//       Or just move unzip into simd.h with a fallback solution and remove simdExtra.h.
 					//       Or just implement reading and writing of 16-bit signed integers using multiple SIMD registers or smaller memory regions.

+ 0 - 1
Source/test.sh

@@ -5,7 +5,6 @@ TEMP_ROOT=${ROOT_PATH}/../../temporary
 CPP_VERSION=-std=c++14
 MODE="-DDEBUG"
 DEBUGGER="-g"
-#MODE="-msse2 -mssse3 -mavx2"
 O_LEVEL=-O2
 
 chmod +x ${ROOT_PATH}/tools/build.sh;

+ 14 - 8
Source/test/tests/SimdTest.cpp

@@ -3,6 +3,8 @@
 #include "../../DFPSR/base/simd.h"
 
 // TODO: Test: allLanesNotEqual, allLanesLesser, allLanesGreater, allLanesLesserOrEqual, allLanesGreaterOrEqual, reinterpret_U16FromU32, reinterpret_U32FromU16, operand ~
+// TODO: Test that truncateToU32 saturates to minimum and maximum values.
+// TODO: Test that truncateToI32 saturates to minimum and maximum values.
 
 START_TEST(Simd)
 	printText("\nSIMD test is compiled using:\n");
@@ -146,13 +148,13 @@ START_TEST(Simd)
 	ASSERT(allLanesEqual(U16x8(12, 0, 34, 0, 56, 0, 78, 0).get_U32(), U32x4(12, 34, 56, 78)));
 
 	// Reciprocal: 1 / x
-	ASSERT(allLanesEqual(F32x4(0.5f, 1.0f, 2.0f, 4.0f).reciprocal(), F32x4(2.0f, 1.0f, 0.5f, 0.25f)));
+	ASSERT(allLanesEqual(reciprocal(F32x4(0.5f, 1.0f, 2.0f, 4.0f)), F32x4(2.0f, 1.0f, 0.5f, 0.25f)));
 
 	// Square root: sqrt(x)
-	ASSERT(allLanesEqual(F32x4(1.0f, 4.0f, 9.0f, 100.0f).squareRoot(), F32x4(1.0f, 2.0f, 3.0f, 10.0f)));
+	ASSERT(allLanesEqual(squareRoot(F32x4(1.0f, 4.0f, 9.0f, 100.0f)), F32x4(1.0f, 2.0f, 3.0f, 10.0f)));
 
 	// Reciprocal square root: 1 / sqrt(x)
-	ASSERT(allLanesEqual(F32x4(1.0f, 4.0f, 16.0f, 100.0f).reciprocalSquareRoot(), F32x4(1.0f, 0.5f, 0.25f, 0.1f)));
+	ASSERT(allLanesEqual(reciprocalSquareRoot(F32x4(1.0f, 4.0f, 16.0f, 100.0f)), F32x4(1.0f, 0.5f, 0.25f, 0.1f)));
 
 	// Minimum
 	ASSERT(allLanesEqual(min(F32x4(1.1f, 2.2f, 3.3f, 4.4f), F32x4(5.0f, 3.0f, 1.0f, -1.0f)), F32x4(1.1f, 2.2f, 1.0f, -1.0f)));
@@ -161,7 +163,9 @@ START_TEST(Simd)
 	ASSERT(allLanesEqual(max(F32x4(1.1f, 2.2f, 3.3f, 4.4f), F32x4(5.0f, 3.0f, 1.0f, -1.0f)), F32x4(5.0f, 3.0f, 3.3f, 4.4f)));
 
 	// Clamp
-	ASSERT(allLanesEqual(F32x4(-35.1f, 1.0f, 2.0f, 45.7f).clamp(-1.5f, 1.5f), F32x4(-1.5f, 1.0f, 1.5f, 1.5f)));
+	ASSERT(allLanesEqual(clamp(F32x4(-1.5f), F32x4(-35.1f, 1.0f, 2.0f, 45.7f), F32x4(1.5f)), F32x4(-1.5f, 1.0f, 1.5f, 1.5f)));
+	ASSERT(allLanesEqual(clampUpper(F32x4(-35.1f, 1.0f, 2.0f, 45.7f), F32x4(1.5f)), F32x4(-35.1f, 1.0f, 1.5f, 1.5f)));
+	ASSERT(allLanesEqual(clampLower(F32x4(-1.5f), F32x4(-35.1f, 1.0f, 2.0f, 45.7f)), F32x4(-1.5f, 1.0f, 2.0f, 45.7f)));
 
 	// F32x4 operations
 	ASSERT(allLanesEqual(F32x4(1.1f, -2.2f, 3.3f, 4.0f) + F32x4(2.2f, -4.4f, 6.6f, 8.0f), F32x4(3.3f, -6.6f, 9.9f, 12.0f)));
@@ -428,13 +432,13 @@ START_TEST(Simd)
 	ASSERT(allLanesEqual(U16x16(12, 0, 34, 0, 56, 0, 78, 0, 11, 0, 22, 0, 33, 0, 44, 2).get_U32(), U32x8(12, 34, 56, 78, 11, 22, 33, 131116)));
 
 	// Reciprocal: 1 / x
-	ASSERT(allLanesEqual(F32x8(0.5f, 1.0f, 2.0f, 4.0f, 8.0f, 10.0f, 100.0f, 1000.0f).reciprocal(), F32x8(2.0f, 1.0f, 0.5f, 0.25f, 0.125f, 0.1f, 0.01f, 0.001f)));
+	ASSERT(allLanesEqual(reciprocal(F32x8(0.5f, 1.0f, 2.0f, 4.0f, 8.0f, 10.0f, 100.0f, 1000.0f)), F32x8(2.0f, 1.0f, 0.5f, 0.25f, 0.125f, 0.1f, 0.01f, 0.001f)));
 
 	// Square root: sqrt(x)
-	ASSERT(allLanesEqual(F32x8(1.0f, 4.0f, 9.0f, 100.0f, 64.0f, 256.0f, 1024.0f, 4096.0f).squareRoot(), F32x8(1.0f, 2.0f, 3.0f, 10.0f, 8.0f, 16.0f, 32.0f, 64.0f)));
+	ASSERT(allLanesEqual(squareRoot(F32x8(1.0f, 4.0f, 9.0f, 100.0f, 64.0f, 256.0f, 1024.0f, 4096.0f)), F32x8(1.0f, 2.0f, 3.0f, 10.0f, 8.0f, 16.0f, 32.0f, 64.0f)));
 
 	// Reciprocal square root: 1 / sqrt(x)
-	ASSERT(allLanesEqual(F32x8(1.0f, 4.0f, 16.0f, 100.0f, 400.0f, 64.0f, 25.0f, 100.0f).reciprocalSquareRoot(), F32x8(1.0f, 0.5f, 0.25f, 0.1f, 0.05f, 0.125f, 0.2f, 0.1f)));
+	ASSERT(allLanesEqual(reciprocalSquareRoot(F32x8(1.0f, 4.0f, 16.0f, 100.0f, 400.0f, 64.0f, 25.0f, 100.0f)), F32x8(1.0f, 0.5f, 0.25f, 0.1f, 0.05f, 0.125f, 0.2f, 0.1f)));
 
 	// Minimum
 	ASSERT(allLanesEqual(min(F32x8(1.1f, 2.2f, 3.3f, 4.4f, 5.5f, 6.6f, 7.7f, 8.8f), F32x8(5.0f, 3.0f, 1.0f, -1.0f, 4.0f, 5.0f, -2.5f, 10.0f)), F32x8(1.1f, 2.2f, 1.0f, -1.0f, 4.0f, 5.0f, -2.5f, 8.8f)));
@@ -443,7 +447,9 @@ START_TEST(Simd)
 	ASSERT(allLanesEqual(max(F32x8(1.1f, 2.2f, 3.3f, 4.4f, 5.5f, 6.6f, 7.7f, 8.8f), F32x8(5.0f, 3.0f, 1.0f, -1.0f, 4.0f, 5.0f, -2.5f, 10.0f)), F32x8(5.0f, 3.0f, 3.3f, 4.4f, 5.5f, 6.6f, 7.7f, 10.0f)));
 
 	// Clamp
-	ASSERT(allLanesEqual(F32x8(-35.1f, 1.0f, 2.0f, 45.7f, 0.0f, -1.0f, 2.1f, -1.9f).clamp(-1.5f, 1.5f), F32x8(-1.5f, 1.0f, 1.5f, 1.5f, 0.0f, -1.0f, 1.5f, -1.5f)));
+	ASSERT(allLanesEqual(clamp(F32x8(-1.5f), F32x8(-35.1f, 1.0f, 2.0f, 45.7f, 0.0f, -1.0f, 2.1f, -1.9f), F32x8(1.5f)), F32x8(-1.5f, 1.0f, 1.5f, 1.5f, 0.0f, -1.0f, 1.5f, -1.5f)));
+	ASSERT(allLanesEqual(clampUpper(F32x8(-35.1f, 1.0f, 2.0f, 45.7f, 0.0f, -1.0f, 2.1f, -1.9f), F32x8(1.5f)), F32x8(-35.1f, 1.0f, 1.5f, 1.5f, 0.0f, -1.0f, 1.5f, -1.9f)));
+	ASSERT(allLanesEqual(clampLower(F32x8(-1.5f), F32x8(-35.1f, 1.0f, 2.0f, 45.7f, 0.0f, -1.0f, 2.1f, -1.9f)), F32x8(-1.5f, 1.0f, 2.0f, 45.7f, 0.0f, -1.0f, 2.1f, -1.5f)));
 
 	// F32x8 operations
 	ASSERT(allLanesEqual(F32x8(1.1f, -2.2f, 3.3f, 4.0f, 1.4f, 2.3f, 3.2f, 4.1f) + F32x8(2.2f, -4.4f, 6.6f, 8.0f, 4.11f, 3.22f, 2.33f, 1.44f), F32x8(3.3f, -6.6f, 9.9f, 12.0f, 5.51f, 5.52f, 5.53f, 5.54f)));