
Quantization Utilities

Reference Implementation Methods

template<typename T, layout_t LAYOUT = layout_t::KCX>
void QuantizeGroupwise(const float *src, int K, int C, int X, int G, const float *scales, const std::int32_t *zero_points, T *dst)

Quantize floating point data in src to type T.

Template Parameters:
  • T – output quantized data type (int8_t, uint8_t, and int32_t are supported)

  • LAYOUT – layout of the input tensor in src (KCX and KXC are supported). KCX corresponds to KCRS, or KCTRS for weight tensors with a time dimension; KXC corresponds to KRSC, or KTRSC for weight tensors with a time dimension

Parameters:
  • K – Output channels for weight tensors

  • C – Number of channels

  • X – R*S, or T*R*S for weight tensors with a time dimension

  • G – Groups. If G == C, the function performs channelwise quantization; if 1 < G < C, groupwise quantization; if G == 1, per-tensor quantization. A usage sketch follows the parameter list below.

  • scales – floating point scales. Size should be equal to G

  • zero_points – zero points (should be representable in type T). Size should be equal to G
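A minimal usage sketch, assuming the declaration lives in fbgemm/QuantUtils.h under the fbgemm namespace as in the FBGEMM source tree; it quantizes a small KCX-layout weight tensor channelwise (G == C):

#include <cstdint>
#include <vector>
#include "fbgemm/QuantUtils.h"

int main() {
  constexpr int K = 4, C = 4, X = 3 * 3, G = 4; // G == C: channelwise
  std::vector<float> src(static_cast<std::size_t>(K) * C * X, 0.5f);
  std::vector<float> scales(G, 0.02f);         // one scale per group
  std::vector<std::int32_t> zero_points(G, 0); // must fit in int8_t
  std::vector<std::int8_t> dst(src.size());

  // LAYOUT defaults to layout_t::KCX, i.e. KCRS-style weight layout.
  fbgemm::QuantizeGroupwise<std::int8_t>(
      src.data(), K, C, X, G, scales.data(), zero_points.data(), dst.data());
  return 0;
}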

template<typename T>
void FusedQuantizeDequantize(const float *src, float *dst, std::int64_t len, const TensorQuantizationParams &qparams, int thread_id = 0, int num_threads = 1, float noise_ratio = 0.0f)

Fused integer quantization/dequantization kernel to accelerate quantization-aware training. Quantizes the fp32 values in src to (u)int8 using the provided qparams, then dequantizes the quantized integer values back to fp32 in dst.
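The sketch below shows a plausible fake-quantization pass over a buffer; the TensorQuantizationParams field names (scale, zero_point, precision) follow the FBGEMM headers but should be treated as assumptions here:

#include <cstdint>
#include <vector>
#include "fbgemm/QuantUtils.h"

void fake_quantize(const std::vector<float>& in, std::vector<float>& out) {
  fbgemm::TensorQuantizationParams qparams;
  qparams.scale = 0.05f;  // fp32 step size
  qparams.zero_point = 0;
  qparams.precision = 8;  // quantize into 8 bits

  // Single-threaded call: thread_id/num_threads default to 0/1 and
  // noise_ratio defaults to 0.0f.
  fbgemm::FusedQuantizeDequantize<std::int8_t>(
      in.data(), out.data(), static_cast<std::int64_t>(in.size()), qparams);
}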

template<typename InputType>
void FloatOrHalfToFusedNBitRowwiseQuantizedSBHalf(int bit_rate, const InputType *input, size_t input_rows, int input_columns, std::uint8_t *output)

Convert float (fp32 or fp16) inputs to rowwise quantized outputs. bit_rate specifies the number of bits in the quantized output. Scale and bias are in fp16. Each row's scale and bias are stored at the end of the row itself (fused).

Parameters:
  • bit_rate – can be 2, 4, or 8
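An illustrative call for bit_rate = 4 is sketched below. The output-buffer sizing is derived from the description above (a packed n-bit payload plus two fp16 values per row) and should be checked against the FBGEMM header before relying on it:

#include <cstddef>
#include <cstdint>
#include <vector>
#include "fbgemm/QuantUtils.h"

int main() {
  constexpr int bit_rate = 4; // 2, 4, or 8
  constexpr std::size_t rows = 8;
  constexpr int cols = 64;
  std::vector<float> input(rows * cols, 1.0f);

  // Per-row output: packed 4-bit payload, then fp16 scale and fp16 bias.
  const std::size_t row_bytes =
      (cols * bit_rate + 7) / 8 + 2 * sizeof(std::uint16_t);
  std::vector<std::uint8_t> output(rows * row_bytes);

  fbgemm::FloatOrHalfToFusedNBitRowwiseQuantizedSBHalf<float>(
      bit_rate, input.data(), rows, cols, output.data());
  return 0;
}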

AVX-2 Implementation Methods

uint32_t Xor128(void)

Random number generator in [0, 9] based on Marsaglia's xorshift paper.
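For reference, a standalone sketch of the classic xorshift128 generator from that paper; the FBGEMM kernel keeps equivalent state internally, so this is illustrative rather than the library's implementation:

#include <cstdint>

std::uint32_t xor128() {
  static std::uint32_t x = 123456789, y = 362436069, z = 521288629,
                       w = 88675123;
  std::uint32_t t = x ^ (x << 11);
  x = y; y = z; z = w;
  w = w ^ (w >> 19) ^ (t ^ (t >> 8));
  return w;
}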

void FindMinMax(const float *m, float *min, float *max, int64_t len)

Find the min and max value in a float matrix.
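A usage sketch, assuming the declaration is in fbgemm/QuantUtilsAvx2.h under the fbgemm namespace; the extrema found this way are typically used to derive quantization parameters:

#include <cstdint>
#include <vector>
#include "fbgemm/QuantUtilsAvx2.h"

void example() {
  std::vector<float> m = {0.25f, -1.5f, 3.0f, 0.0f};
  float min_val = 0.0f;
  float max_val = 0.0f;
  fbgemm::FindMinMax(m.data(), &min_val, &max_val,
                     static_cast<std::int64_t>(m.size()));
  // min_val == -1.5f, max_val == 3.0f
}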

template<bool A_SYMMETRIC, bool B_SYMMETRIC, QuantizationGranularity Q_GRAN, bool HAS_BIAS, bool FUSE_RELU, typename BIAS_TYPE = std::int32_t, bool DIRECT = false>
void requantizeOutputProcessingAvx2(std::uint8_t *out, const std::int32_t *inp, const block_type_t &block, int ld_out, int ld_in, const requantizationParams_t<BIAS_TYPE> &r)

Requantize with AVX2; the bias addition is fused.
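To clarify what the vectorized kernel computes, here is a scalar sketch of the requantization arithmetic (zero-point corrections, fused bias, rescale, clamp). It is built from the standard affine-quantized GEMM identity, not from the FBGEMM source; the parameter names are hypothetical, and FBGEMM precomputes the row/column sums (folding constant terms into them) rather than passing them per element:

#include <algorithm>
#include <cmath>
#include <cstdint>

// For int8 inputs with accumulator acc = sum_k A[i][k] * B[k][j],
// correcting for zero points uses the identity
//   sum_k (a - a_zp)(b - b_zp)
//     = acc - b_zp * rowsum(A) - a_zp * colsum(B) + k * a_zp * b_zp.
std::uint8_t requantize_one(
    std::int32_t acc,     // raw int32 GEMM accumulator
    std::int32_t a_zp,    // activation zero point
    std::int32_t b_zp,    // weight zero point
    std::int32_t row_sum, // sum over the i-th row of A
    std::int32_t col_sum, // sum over the j-th column of B
    std::int32_t k,       // inner (reduction) dimension
    std::int32_t bias,    // fused bias term (HAS_BIAS)
    float c_multiplier,   // (A_scale * B_scale) / C_scale
    std::int32_t c_zp,    // output zero point
    bool fuse_relu) {
  std::int32_t corrected =
      acc - b_zp * row_sum - a_zp * col_sum + k * a_zp * b_zp + bias;
  std::int32_t out =
      static_cast<std::int32_t>(std::nearbyint(corrected * c_multiplier)) +
      c_zp;
  if (fuse_relu) out = std::max(out, c_zp); // FUSE_RELU clamps below at c_zp
  return static_cast<std::uint8_t>(std::min(255, std::max(0, out)));
}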

AVX-512 Implementation Methods

template<bool A_SYMMETRIC, bool B_SYMMETRIC, QuantizationGranularity Q_GRAN, bool HAS_BIAS, bool FUSE_RELU, int C_PER_G, typename BIAS_TYPE = std::int32_t>
void requantizeOutputProcessingGConvAvx512(std::uint8_t *out, const std::int32_t *inp, const block_type_t &block, int ld_out, int ld_in, const requantizationParams_t<BIAS_TYPE> &r)

Requantize with AVX-512 (grouped convolution variant).
