Quantization Utilities¶
Reference Implementation Methods¶
-
template<typename T, layout_t LAYOUT = layout_t::KCX>
void QuantizeGroupwise(const float *src, int K, int C, int X, int G, const float *scales, const std::int32_t *zero_points, T *dst)¶
Quantize floating point data in src to type T.
- Template Parameters:
T – output quantized data type (int8_t, uint8_t, and int32_t are supported)
LAYOUT – layout of the input tensor in src (KCX and KXC are supported). KCX corresponds to KCRS or KCTRS (for weight tensors with a time dimension); KXC corresponds to KRSC or KTRSC (for weight tensors with a time dimension).
- Parameters:
K – Output channels for weight tensors
C – Number of channels
X – R*S or T*R*S
G – Groups (if G == C the function performs channelwise quantization; if 1 < G < C the function performs groupwise quantization; if G == 1 the function performs per-tensor quantization)
scales – floating point scales; size should equal G
zero_points – zero points (should be representable in type T); size should equal G
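For illustration, a minimal sketch of channelwise quantization (G == C) of a KCRS weight tensor through this API; the fbgemm namespace and header path are assumptions based on the library layout:

#include <cstdint>
#include <vector>
#include "fbgemm/QuantUtils.h" // assumed header for QuantizeGroupwise

int main() {
  // Hypothetical KCRS weight tensor: K = 8 output channels, C = 4, R = S = 3.
  const int K = 8, C = 4, R = 3, S = 3;
  const int X = R * S;
  const int G = C; // G == C -> channelwise quantization
  std::vector<float> src(static_cast<std::size_t>(K) * C * X, 0.5f);
  std::vector<std::int8_t> dst(src.size());

  // One scale and one zero point per group (here: per channel).
  std::vector<float> scales(G, 0.02f);
  std::vector<std::int32_t> zero_points(G, 0);

  // LAYOUT defaults to layout_t::KCX, which matches KCRS.
  fbgemm::QuantizeGroupwise<std::int8_t>(
      src.data(), K, C, X, G, scales.data(), zero_points.data(), dst.data());
}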
-
template<typename T>
void FusedQuantizeDequantize(const float *src, float *dst, std::int64_t len, const TensorQuantizationParams &qparams, int thread_id = 0, int num_threads = 1, float noise_ratio = 0.0f)¶
Fused integer quantization-dequantization kernel to accelerate quantization-aware training. Quantize fp32 values in src to (u)int8 using the provided qparams, and dequantize the quantized integer values back into fp32.
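As a usage sketch for quantization-aware training (the TensorQuantizationParams field names scale, zero_point, and precision are assumptions based on the library's quantization parameter struct):

#include <cstdint>
#include <vector>
#include "fbgemm/QuantUtils.h" // assumed header

int main() {
  std::vector<float> src = {-1.3f, 0.0f, 0.7f, 2.5f};
  std::vector<float> dst(src.size());

  fbgemm::TensorQuantizationParams qparams; // field names assumed
  qparams.scale = 0.1f;
  qparams.zero_point = 128;
  qparams.precision = 8;

  // dst holds "fake quantized" fp32 values: quantized to uint8 and
  // immediately dequantized, so only uint8-representable values remain.
  fbgemm::FusedQuantizeDequantize<std::uint8_t>(
      src.data(), dst.data(), static_cast<std::int64_t>(src.size()), qparams);
}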
-
template<typename InputType>
void FloatOrHalfToFusedNBitRowwiseQuantizedSBHalf(int bit_rate, const InputType *input, size_t input_rows, int input_columns, std::uint8_t *output, const InputType *rowwise_min_max = nullptr)¶
Convert float (fp32 or fp16) inputs to rowwise quantized outputs. bit_rate specifies the number of bits in the quantized output. Scale and bias are in fp16, and each row's scale and bias are stored in the row itself (fused) at the end.
- Parameters:
bit_rate – can be 2, 4, or 8
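The fused layout determines the output buffer size. A sketch for 4-bit quantization, assuming each output row is the packed payload followed by an fp16 scale and an fp16 bias (2 bytes each):

#include <cstdint>
#include <vector>
#include "fbgemm/QuantUtils.h" // assumed header

int main() {
  const int bit_rate = 4; // 2, 4, or 8
  const std::size_t input_rows = 16;
  const int input_columns = 64;
  std::vector<float> input(input_rows * input_columns, 0.25f);

  // ceil(input_columns * bit_rate / 8) payload bytes per row, plus
  // 2 bytes of fp16 scale and 2 bytes of fp16 bias (row stride assumed
  // from the fused layout described above).
  const std::size_t bytes_per_row = (input_columns * bit_rate + 7) / 8 + 2 + 2;
  std::vector<std::uint8_t> output(input_rows * bytes_per_row);

  fbgemm::FloatOrHalfToFusedNBitRowwiseQuantizedSBHalf<float>(
      bit_rate, input.data(), input_rows, input_columns, output.data());
}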
AVX-2 Implementation Methods¶
-
uint32_t Xor128()¶
Random number generator in [0, 9] based on Marsaglia's xorshift paper.
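A minimal sketch of the classic xorshift128 recurrence from that paper (the seed values are the paper's, and the reduction to [0, 9] is an illustrative assumption, not necessarily this library's internals):

#include <cstdint>
#include <cstdio>

// xorshift128 state; seeds taken from Marsaglia's paper.
static std::uint32_t x = 123456789, y = 362436069, z = 521288629, w = 88675123;

std::uint32_t xor128() {
  const std::uint32_t t = x ^ (x << 11);
  x = y; y = z; z = w;
  return w = w ^ (w >> 19) ^ t ^ (t >> 8);
}

int main() {
  // Reduce to [0, 9]; the slight modulo bias is fine for illustration.
  for (int i = 0; i < 5; ++i)
    std::printf("%u\n", xor128() % 10);
}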
-
void FindMinMax(const float *m, float *min, float *max, int64_t len)¶
Find the min and max value in a float matrix.
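A typical use is range calibration before choosing quantization parameters. The affine uint8 derivation below is an illustrative assumption, not a specific FBGEMM routine, and the header path is assumed:

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>
#include "fbgemm/QuantUtilsAvx2.h" // assumed header for FindMinMax

int main() {
  std::vector<float> m = {-2.0f, 0.5f, 3.0f, 1.5f};
  float mn = 0.0f, mx = 0.0f;
  fbgemm::FindMinMax(m.data(), &mn, &mx, static_cast<std::int64_t>(m.size()));

  // Standard affine mapping of [mn, mx] onto [0, 255].
  const float scale = (mx - mn) / 255.0f;
  std::int32_t zero_point = static_cast<std::int32_t>(std::lround(-mn / scale));
  zero_point = std::min(255, std::max(0, zero_point));
  (void)scale;
  (void)zero_point;
}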
-
template<bool A_SYMMETRIC, bool B_SYMMETRIC, QuantizationGranularity Q_GRAN, bool HAS_BIAS, bool FUSE_RELU, typename BIAS_TYPE = std::int32_t, bool DIRECT = false>
void requantizeOutputProcessingAvx2(std::uint8_t *out, const std::int32_t *inp, const block_type_t &block, int ld_out, int ld_in, const requantizationParams_t<BIAS_TYPE> &r)¶
Requantize with AVX2; the bias addition is fused into the requantization.
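As a rough scalar sketch of what each vector lane computes (the names below are illustrative assumptions, not the exact requantizationParams_t fields; the A_SYMMETRIC/B_SYMMETRIC template flags let the compiler drop the corresponding offset terms when a zero point is zero):

#include <algorithm>
#include <cmath>
#include <cstdint>

// One requantized output element; parameter names are illustrative.
std::uint8_t requantize_one(
    std::int32_t acc,          // int32 GEMM accumulator (inp)
    std::int32_t a_zero_point, // activation zero point
    std::int32_t b_zero_point, // weight zero point
    std::int32_t row_sum,      // sum over k of A[i][k]
    std::int32_t col_sum,      // sum over k of B[k][j]
    std::int32_t k,            // reduction depth
    std::int32_t bias,         // fused bias term (HAS_BIAS)
    float multiplier,          // requantization scale
    std::int32_t c_zero_point, // output zero point
    bool fuse_relu) {          // FUSE_RELU
  // sum_k (A - a_zp) * (B - b_zp)
  //   = acc - a_zp * col_sum - b_zp * row_sum + k * a_zp * b_zp
  const std::int32_t raw = acc - a_zero_point * col_sum -
                           b_zero_point * row_sum +
                           k * a_zero_point * b_zero_point + bias;
  const std::int32_t q =
      static_cast<std::int32_t>(std::lrintf(raw * multiplier)) + c_zero_point;
  const std::int32_t lo = fuse_relu ? c_zero_point : 0; // ReLU clamps at zp
  return static_cast<std::uint8_t>(std::min(255, std::max(lo, q)));
}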
AVX-512 Implementation Methods¶
-
template<bool A_SYMMETRIC, bool B_SYMMETRIC, QuantizationGranularity Q_GRAN, bool HAS_BIAS, bool FUSE_RELU, int C_PER_G, typename BIAS_TYPE = std::int32_t>
void requantizeOutputProcessingGConvAvx512(std::uint8_t *out, const std::int32_t *inp, const block_type_t &block, int ld_out, int ld_in, const requantizationParams_t<BIAS_TYPE> &r)¶
Requantize with AVX-512; the GConv variant targets grouped convolution, with C_PER_G channels per group.