Block-Decomposed TurboQuant with PDX Layout
Authors: Will Manning
Status: Proposal
Date: 2026-04-02 (updated 2026-04-06)
Summary
We propose evolving the TurboQuant vector quantization encoding in stages:
- MSE-only TurboQuant (in progress — PR #7269): a complete, self-contained building block. 8-bit default, internal zero-padding for non-power-of-2 dimensions, FixedSizeListArray rotation signs supporting variable SORF rounds.
- Block decomposition: for dimensions where a valid B exists (greatest power-of-2 ≥ 64 dividing d), split into blocks of size B. For power-of-2 dimensions, B = d (single block). Dimensions with no qualifying B fall back to internal zero-padding to a power of 2. Per-block norms are stored as internal children.
- PDX layout (later): transpose codes into dimension-major order within groups of 64 vectors for SIMD scan performance.
QJL correction is deferred to a later stage and may ultimately be dropped. Community findings from multiple independent TurboQuant implementations often show that MSE-only outperforms MSE+QJL for KV-cache attention [8]. For ANN ranking and vector-search workloads, the evidence is currently less complete, so QJL should remain an empirical question rather than a settled conclusion.
Background
TurboQuant
TurboQuant [1] is a lossy vector quantization algorithm for high-dimensional embeddings. It works by:
- Randomly rotating a unit-norm vector so that each coordinate follows a known marginal distribution — specifically (1 - x²)^((d-3)/2) on [-1, 1], a concentrated Beta distribution (Lemma 1 in [1]).
- Applying an MSE-optimal scalar quantizer (Max-Lloyd centroids) independently to each coordinate.
- Optionally adding a 1-bit QJL (Quantized Johnson-Lindenstrauss) correction on the residual for unbiased inner product estimation (Theorem 2 in [1]).
The paper prescribes a full random orthogonal rotation (QR decomposition of a matrix with i.i.d. N(0,1) entries, yielding a Haar-uniform orthogonal matrix) for the MSE stage — O(d²) storage and O(d²) per-vector. For the QJL stage, the paper uses a random Gaussian projection matrix S with i.i.d. N(0,1) entries (not an orthogonal rotation); this distinction matters for the unbiasedness proof.
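For reference, the paper's dense MSE-stage rotation is a few lines of NumPy (illustrative only; the sign correction on diag(R) is what makes Q Haar-uniform):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64

# QR of an i.i.d. N(0,1) matrix; scaling columns by the signs of diag(R)
# yields a Haar-uniform random orthogonal matrix.
G = rng.standard_normal((d, d))
Q, R = np.linalg.qr(G)
Q = Q * np.sign(np.diag(R))

x = rng.standard_normal(d)
x /= np.linalg.norm(x)
y = Q @ x  # rotated unit vector; norm is preserved exactly
```

Both the O(d²) storage for Q and the O(d²) matrix-vector product per encoded vector are visible here, which is what motivates the structured SORF alternative described below.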
Comparison to Product Quantization. TurboQuant's block decomposition (Stage 2 of this RFC) is structurally similar to Product Quantization (PQ) [9]: both partition a vector into sub-vectors and quantize each independently. The key differences are:
| Aspect | TurboQuant | PQ |
|---|---|---|
| Quantization type | Scalar (per-coordinate, after rotation) | Vector (per-sub-vector, learned codebook) |
| Codebook | Analytically derived from Beta distribution; data-oblivious | Learned via k-means on training data; data-dependent |
| Rotation | Random orthogonal within each sub-vector | Typically none (OPQ [10] adds a learned rotation) |
| Theoretical guarantees | Provable data-oblivious MSE bound (Theorem 1 [1]) | No comparable data-oblivious bound |
| Codebook training | None (centroids derived from theory) | Requires training pass over data |
| Bits per sub-vector | Scalar: b bits per coordinate | Vector: typically 8 bits per sub-vector (256 codewords) |
TurboQuant trades PQ's flexibility (data-dependent codebooks can exploit structure) for data-obliviousness (no training, provable bounds, no offline index-training phase). Encode-time work (rotation + quantization) still applies. In return, PQ and OPQ retain a major advantage in expressivity: they learn sub-vector codebooks from data rather than applying an analytic scalar quantizer. In practice this means TurboQuant is attractive when training-free operation, simple deployment, and theoretical guarantees matter most, while PQ or OPQ may still win empirically when a learned vector codebook can exploit dataset-specific structure.
Current Vortex implementation
The current implementation (Rust, in the vortex-tensor crate,
merged via PR #7269) implements MSE-only TurboQuant as a Vortex
array encoding that compresses FixedSizeList<float> arrays — the storage
format of Vector and FixedShapeTensor extension types. The
original QJL-inclusive PR was closed in favor of this MSE-only
approach. Key design choices and characteristics:
Rotation. Instead of the paper's O(d²) QR rotation, we use a 3-round
Structured Orthogonal Random Features (SORF) transform HD₃·HD₂·HD₁ [5],
giving O(d) storage (3d sign bits, bitpacked) and O(d log d) per-vector. The
rotation signs are stored as a bitpacked child array rather than recomputed from
a seed at decode time. The 3-round SORF was introduced for kernel approximation
[5] and approximates a random orthogonal matrix. It is distinct from the
single-round SRHT (R·H·D) analyzed by Tropp [3] and the FJLT (P·H·D) of
Ailon-Chazelle [2], both of which are dimensionality-reducing projections rather
than rotation approximations.
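For intuition, a minimal SORF can be sketched with a normalized fast Walsh-Hadamard transform and random sign diagonals (an illustrative sketch; the real implementation bitpacks the signs and operates over f32):

```python
import numpy as np

rng = np.random.default_rng(0)

def fwht(x):
    # Orthonormal fast Walsh-Hadamard transform (the H in HD); O(d log d),
    # requires len(x) to be a power of 2.
    x = x.copy()
    h, d = 1, len(x)
    while h < d:
        for i in range(0, d, 2 * h):
            for j in range(i, i + h):
                a, c = x[j], x[j + h]
                x[j], x[j + h] = a + c, a - c
        h *= 2
    return x / np.sqrt(d)

def sorf(x, signs):
    # R-round HD_R · ... · HD_1: each round flips signs (D) then mixes (H).
    for s in signs:
        x = fwht(x * s)
    return x

d = 1024
signs = [rng.choice([-1.0, 1.0], size=d) for _ in range(3)]  # 3 rounds: 3d sign bits total
x = rng.standard_normal(d)
x /= np.linalg.norm(x)
r = sorf(x, signs)
```

Each HD factor is orthogonal, so the product preserves norms exactly; what needs empirical validation is how closely the rotated coordinates match the Beta marginal, not whether the transform is a rotation.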
Centroids. Max-Lloyd centroids are computed via numerical integration
(trapezoid rule, 1000 points per interval) of the marginal Beta distribution at
the padded dimension, using the HalfIntExponent type for exact integer/half-
integer exponent arithmetic. Centroids are cached in a global DashMap keyed by
(dimension, bit_width) and stored as a shared PrimitiveArray<f32> child.
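The production code derives the centroids by numerical integration; purely as an illustration of what Max-Lloyd centroids are, the same codebook can be approximated by running Lloyd's algorithm on samples of the marginal (a sketch, not the implementation's method):

```python
import numpy as np

rng = np.random.default_rng(0)
d, b = 1024, 3
n = 200_000

# One coordinate of a uniform random unit vector in R^d follows the
# (1 - x^2)^((d-3)/2) marginal on [-1, 1]; sample it as g / sqrt(g^2 + chi2_{d-1}).
g = rng.standard_normal(n)
samples = g / np.sqrt(g * g + rng.chisquare(d - 1, n))

# Lloyd's algorithm: alternate nearest-centroid assignment with
# conditional-mean updates; this converges to the Max-Lloyd (MSE-optimal) codebook.
centroids = np.quantile(samples, (np.arange(2**b) + 0.5) / 2**b)
for _ in range(50):
    idx = np.abs(samples[:, None] - centroids[None, :]).argmin(axis=1)
    centroids = np.array([samples[idx == i].mean() for i in range(2**b)])

idx = np.abs(samples[:, None] - centroids[None, :]).argmin(axis=1)
nmse = d * np.mean((samples - centroids[idx]) ** 2)  # normalized vector MSE
```

At b=3 and d=1024 the resulting normalized MSE should land below the paper's bound of 2.72/4³ ≈ 0.0425 (see Theoretical MSE bound).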
Array structure. The TurboQuantArray stores 4 child slots: codes
(FixedSizeListArray<u8>, one per vector, list_size = padded_dim), norms
(PrimitiveArray<f32>), centroids (PrimitiveArray<f32>, shared), and MSE
rotation signs (PrimitiveArray<u8>, shared, bitpacked). Codes are stored as
u8 centroid indices; the cascade compressor (BitPacked encoding) handles packing
to the actual bit width on disk.
Compute pushdowns. Slice and take propagate to per-row children (codes, norms) while sharing rotation signs and centroids. Quantized cosine similarity and dot product operate directly on codes and centroids without decompression. L2 norm returns the stored norm directly (O(1) readthrough).
Compression scheme. TurboQuantScheme implements the Scheme trait for the
BtrBlocks cascading compressor. It matches Vector and FixedShapeTensor
extension arrays with non-nullable float elements and dimension ≥ 128,
using 8-bit MSE-only as the default (256 centroids, near-lossless with
normalized MSE ~4e-5, achieving ~4× compression on f32).
Input handling. All float types (f16, f32, f64) are converted to f32 before quantization. Per-vector L2 norms are computed and stored as f32. Non-power-of-2 dimensions are zero-padded to the next power of 2 for SORF compatibility. The minimum dimension for scheme auto-selection is 128; the array-level minimum remains 3 (at d=2 the marginal is the arcsine distribution, which is U-shaped and unsuitable for Max-Lloyd centroids designed for concentrated distributions).
Metadata. Currently serialized as a raw single byte (bit_width). This lacks framing and versioning and cannot be extended backward-compatibly; migrating to a structured/extensible format is a Stage 1 item (the upcoming vtable refactor may eliminate the need for separate serialized metadata entirely).
The Eviox corrections study [7] identified several bugs in the paper's reference
Python implementation; none affect our implementation (see Appendix A). There is
also a notational ambiguity in the MSE bound constant; we use √3·π/2 ≈ 2.72
(see Appendix A for the full analysis).
Multiple independent TurboQuant implementations report that MSE-only often outperforms MSE+QJL for KV-cache attention at the same bit budget [8], likely due to softmax amplifying QJL variance. For ANN ranking the evidence is less settled; MSE-only is the default pending dedicated benchmarks (see Appendix B for details).
Current limitations
The SORF requires a power-of-2 input dimension. The TQ array handles this by zero-padding non-power-of-2 dimensions to the next power of 2 internally (e.g., 768 → 1024). This padding means:
- 33% storage overhead for 768-d vectors: 1024 codes stored vs. 768 useful (equivalently, 25% of stored codes are wasted on zero-padded dimensions).
- No scan-optimized layout: row-major code storage prevents SIMD-over-vectors distance computation.
Stage 2's block decomposition eliminates this padding for dimensions with a qualifying B (e.g., 768 → 3×256 blocks), since each block is natively power-of-2.
PDX
PDX [4] is a data layout for vector similarity search. The paper (SIGMOD '25) describes a dimension-major layout within fixed-size blocks of 64 vectors, enabling the compiler to auto-vectorize the inner distance loop over vectors rather than dimensions. The paper reports an average 2× speedup for auto-vectorized PDX distance kernels vs. explicitly SIMD-optimized row-major baselines (SimSIMD, FAISS) across four architectures, with larger gains at low dimensionality (5.5× at D ≤ 32) and ~1.5× at D > 32 [4, Table 4]. Dimension-pruning methods (ADSampling, BSA) recover much larger end-to-end gains (2-7×) when paired with the PDX layout [4]. The block size of 64 is empirically optimal across AVX-512, AVX2, and NEON architectures [4, Table 5].
PDX open-source implementation. The open-source implementation has evolved beyond the paper in several ways relevant to this RFC. Note: the following describes the code repository, not the paper — the paper operates exclusively on float32 and does not discuss int8 layouts.
- 8-bit scalar quantization (IndexPDXIVFTreeSQ8): Maps floats to 0-255 via linear min-max scaling. The int8 layout differs from float32: dimensions are packed in groups of 4 ("4 dims × 16 vecs") to leverage hardware dot-product instructions (VPDPBUSD on x86, UDOT/SDOT on ARM) that process 4 byte pairs per operation. This is a different tiling than the paper's "1 dim × 64 vecs."
- ADSampling with random rotation: The pruner applies a random orthogonal rotation to the entire collection as a preprocessing step. This makes coordinates approximately independent, enabling dimension-by-dimension hypothesis testing for early pruning. The rotation serves a similar purpose to TurboQuant's rotation — making the coordinate distribution known — but for pruning rather than quantization.
- Dimension zones: Consecutive dimensions are grouped into zones; at query time, zones are ranked by "distance-to-means" and the most discriminative zones are scanned first, enabling faster pruning (~30% faster than per-dimension pruning [4]).
Implications for our design. The PDX paper's float32 layout ("1 dim × 64 vecs") maps cleanly to our quantized-code scan kernel, where the inner loop gathers from a centroid-product distance table over 64 vectors. However, if we pursue direct int8 arithmetic (b_mse=8 with linear centroids, see GPU section), the "4 dims × 16 vecs" int8 layout from the PDX implementation may be more appropriate, as it enables hardware dot-product instructions.
Additionally, ADSampling's dimension-pruning approach is complementary to TurboQuant's block structure: when scanning with block decomposition, the pruner could skip entire TQ blocks (B dimensions at a time) if the partial distance already exceeds the candidate threshold. This combines the storage efficiency of quantization with the computational savings of early termination.
Proposal
Block size strategy
For each dimension d, choose B = the greatest power-of-2 ≥ 64 that evenly divides d. If no such B exists (e.g., d=96), the TQ array falls back to internal zero-padding (single padded block, as in Stage 1). For common embedding dimensions, this rule always produces a valid B and avoids padding entirely:
| Dimension d | Block size B | Blocks k | Notes |
|---|---|---|---|
| 512 | 512 | 1 | Single block (= current TQ) |
| 768 | 256 | 3 | Greatest dividing power-of-2 |
| 1024 | 1024 | 1 | Single block |
| 1536 | 512 | 3 | |
| 2048 | 2048 | 1 | Single block |
| 3072 | 1024 | 3 | |
| 4096 | 4096 | 1 | Single block |
Key observations:
- Power-of-2 dimensions (512, 1024, 2048, 4096) use B = d — a single block, identical to the current implementation except with PDX underneath (Stage 3). No block decomposition overhead, no per-block norms. These dimensions are already well-served by the current design.
- Non-power-of-2 dimensions (768, 1536, 3072) decompose into k=3 blocks at B=256 or B=512. No padding waste. Each block has its own SORF rotation and shares a single centroid set.
- No qualifying B is rare for common embedding dimensions. Dimensions where no power-of-2 ≥ 64 divides d (e.g., 96, 100) fall back to internal zero-padding. A future straggler-block extension could handle these without padding (see Stage 2: Straggler blocks). These dimensions are uncommon in modern model architectures.
- The SORF approximation at B=256+ is expected to be adequate: 3 rounds at B=256 provides 24 butterfly stages, and at B=512 provides 27 — both comparable to the current B=1024 (30 stages). This needs empirical validation; see Experimental plan.
Minimum dimension
The compression scheme should only select TurboQuant for vectors with dimension ≥ 128. Below this threshold, several factors degrade quality and efficiency:
- SORF mixing quality: 3-round SORF at d=64 provides only 18 butterfly stages (vs. 21 at d=128, 30 at d=1024). The coordinate distribution deviates more from the analytical Beta, making Max-Lloyd centroids less optimal. Stage 1's variable-round rotation signs (see Stage 1) may allow compensating with additional SORF rounds at lower dimensions — this should be benchmarked.
- Practical MSE: At smaller d, the SORF mixing quality and coordinate-independence approximations are weaker, potentially worsening practical quantization quality beyond what the dimension-free theoretical bound captures. The actual MSE at each d is an empirical question.
- Overhead ratio: Per-vector norm (32 bits) is a larger fraction of the compressed representation at small d. At d=32, b=5: codes=160 bits, norm=32 bits, total=192 — norm is ~17% of compressed size. At d=768: <1%.
- Diminishing returns for high bit widths: With fewer coordinates, the fine-grained centroid structure of high-b quantization has less to exploit.
The threshold of 128 is conservative:
- d=128 (SIFT) is the smallest dimension in our recommended benchmark table.
- SORF at d=128 has 21 butterfly stages — tested and adequate in the current implementation.
- The block-size rule produces B=128 for d=128 (single block, no decomposition).
Whether TQ works well at all below d=64 is an open question — SORF mixing quality degrades rapidly at small dimensions, and the overhead ratio makes TQ increasingly uncompetitive vs. simpler scalar quantization. The scheme minimum of 128 is conservative; the experimental plan should determine the true minimum (likely in the 64-128 range). Padding modest amounts (e.g., 96 → 128) is probably acceptable; padding large fractions (e.g., 32 → 64) is not.
The exact threshold should be validated experimentally — see Experimental plan.
Stage 1: MSE-only TurboQuant (in progress — PR #7269)
Stage 1 delivers MSE-only TurboQuant as a complete, self-contained building block. The initial implementation is merged; the original QJL-inclusive PR was closed in favor of this MSE-only approach. Work remaining to complete Stage 1 is described below.
The goal is to arrive at a wire format that we believe is ready for backward-compatibility guarantees — one we would be comfortable freezing — without formally committing to stability until confirmed by Stage 2 implementation and benchmarking.
Target properties:
- MSE-only, no QJL. 4 child slots: codes, norms, centroids, rotation_signs. QJL code can be resurrected from the original PR if Phase 4 is pursued.
- 8-bit default (256 centroids). Near-lossless: normalized MSE ~4e-5, ~4× compression on f32. Lower bit widths available via TurboQuantConfig.
- Power-of-2 block size with internal padding. The TQ array requires block_size to be a power of 2. Non-power-of-2 dimensions are zero-padded internally to the next power of 2 (e.g., 768 → 1024), so codes.list_size (= padded_dim) may exceed dimension. Stage 2's block decomposition eliminates this padding for dimensions with a qualifying B (e.g., 768 → 3×256 blocks, each natively power-of-2).
- Variable-round SORF rotation. Rotation signs are stored as a FixedSizeListArray where each element is a FixedSizeList(u8, padded_dim, NonNullable) — one bitpacked diagonal per SORF round. The array length R equals the number of rounds (default 3). This makes the round count a property of the array shape rather than a hard-coded constant. More rounds may improve mixing quality at lower dimensions or lower bit widths (see Experimental plan: "Test 3, 4, 5 SORF rounds at each B"). Signs are stored in inverse-friendly (read-optimized) order.
- Scheme auto-selection for dimension ≥ 128 (see Minimum dimension). Smaller power-of-2 dimensions remain available via explicit construction.
- Compute pushdowns: slice/take/scalar_at, quantized cosine similarity and dot product, compression scheme integration.
- Dtype-matching norms: f64 for f64 input, f32 for f32/f16.
- Codes and centroids remain separate children. The codes (FixedSizeListArray<u8>) and centroids (PrimitiveArray<f32>) are independent child slots. Operations that need a unified view (e.g., canonicalize) can construct a DictArray from codes and centroids and apply the inverse rotation to produce a canonical decoded form.
- Forward-compatible metadata: dimension: u32, block_size: u32 (= padded_dim in Stage 1), num_blocks: u32 (always = 1 in Stage 1), num_rounds: u32 (= R, default 3). These fields are inert in Stage 1 but enable Stage 2 decoders to read Stage 1 files. The serialization format is TBD — the upcoming vtable refactor may make the current raw-byte metadata unnecessary by encoding these fields directly in the vtable. If the refactor does not land first, a structured format (e.g., protobuf) is needed. (PDX is handled via the codes child type, not a metadata flag — see Stage 3.)
Remaining work (relative to the initial implementation):
- Restructure rotation signs from flat PrimitiveArray<u8> to FixedSizeListArray (variable SORF rounds, as described above).
- Dtype-matching norms (currently always f32).
- Structured metadata (currently a raw single byte).
- Restrict new_unchecked visibility to pub(crate).
- f64-to-f32 truncation in encode path: add comment or checked cast.
- CENTROID_CACHE: document intentional unboundedness.
- Note MSE bound caveat: Theorem 1 is proved for Haar matrices, not SORF.
Stage 2: Block decomposition
Block decomposition splits a FixedSizeListArray vertically by dimension into
fixed-size blocks, each encoded independently. This is structurally analogous
to ChunkedArray (which splits horizontally by rows) — both are general-purpose
structural transforms over arrays, not specific to any particular encoding. Like
PDX (Stage 3), block decomposition is a layout concern that can wrap arbitrary
child encodings.
In the initial implementation, block decomposition is embedded inside
TurboQuantArray — all blocks use TQ MSE-only encoding with independent SORF
rotations, and TQ-specific children (centroids, rotation signs) are stored
alongside the blocks. However, the concept of block decomposition is
encoding-agnostic: a future refactor could extract it into a general-purpose
BlockDecomposedFSLArray that wraps k independently-encoded child arrays. This
matters for straggler-block support (see below), where the straggler may use a
different encoding than the main blocks.
For dimensions where the block-size rule produces a valid B (see table above), the scheme splits the input into k = d/B blocks of size B. Each block is a power-of-2 TQ array with an independent B-dim SORF rotation.
Changes vs. Stage 1 (with TQ blocks):
| Aspect | Stage 1 | Stage 2 |
|---|---|---|
| Block count | k = 1 (single power-of-2 block) | k = d/B (multiple blocks) |
| SORF dimension | padded_dim (next power-of-2 ≥ dim) | B (e.g., 256 for d=768) |
| Rotation signs | FSL, len = R, element dim = padded_dim | FSL, len = k × R, element dim = B |
| Centroids | Computed for padded_dim distribution | Computed for B-dim distribution (different codebook!) |
| Norms child | PrimitiveArray<F>, 1 per vector | PrimitiveArray<F> (k=1) or FixedSizeListArray<F> (k>1), same dtype F |
| Codes list_size | padded_dim | k × B (= d) |
| Scheme compress() | Single SORF → quantize | Choose B → split → per-block normalize/rotate/quantize |
| Quantized dot product | Single sum over padded_dim centroids | Per-block weighted sum (Σ_k norm_a_k · norm_b_k · unit_dot_k) |
| L2 norm readthrough | O(1) — return stored norm | O(k) — compute √(Σ_k norm_k²) |
Unchanged from Stage 1: SORF construction (R-round HD, default R=3),
Max-Lloyd algorithm, f32 internal quantization, slice/take semantics (per-row
data sliced, shared data cloned), FixedSizeListArray rotation sign storage,
compression scheme trait.
For power-of-2 dimensions: B = d, k = 1. The encoding produces an identical wire format to Stage 1 (single norm, single SORF, single codes block). A Stage 2 encoder writing k=1 data is fully backward-compatible with Stage 1 decoders.
Key design properties:
- Structural, not encoding-specific. The block decomposition itself is a vertical split of a FixedSizeListArray by dimension. Each block is an independently-encoded child. In the initial implementation all blocks are TQ MSE-only, but the structure allows heterogeneous child encodings in future.
- One shared centroid set for all TQ blocks at the same B-dim distribution.
- Per-block SORF rotation signs. Each block's SORF is independent (different seed). Signs are R × B bits per block (R = number of SORF rounds, default 3), stored as a FixedSizeListArray with len = k × R.
Straggler blocks (future work)
The current block-size rule requires B to evenly divide d, so dimensions with no qualifying power-of-2 B ≥ 64 (e.g., d=96) fall back to internal zero-padding (single padded block, as in Stage 1). A natural extension is straggler blocks: allow k blocks where k-1 are full-size B and the final block covers the remaining d - (k-1)×B dimensions.
Because the block decomposition is encoding-agnostic (each block is an independently-encoded child array), the straggler block need not use the same encoding as the main blocks. For example, d=800 could be decomposed as 3×256 = 768 TQ-encoded dimensions plus a 32-dimension straggler. SORF is unlikely to be effective at such small straggler dimensions (see Minimum dimension), so the straggler would use a different strategy:
- Uncompressed: store the straggler dimensions as raw floats. Simplest; the overhead is modest (32 × 4 = 128 bytes per vector for a 32-dim straggler).
- Padded TQ: pad the straggler to the next power-of-2 (e.g., 32 → 64), encode with standard TQ. Only viable if the padded dimension is large enough for SORF to be effective (≥ 64, probably ≥ 128).
- Exact-rotation TQ: use a dense random orthogonal matrix (QR of Gaussian) instead of SORF for the straggler block. Eliminates the power-of-2 constraint at the cost of O(B_s²) rotation, where B_s is the straggler size.
- Scalar quantization or PQ: the block decomposition structure supports heterogeneous child encodings.
Note that for some dimensions (e.g., d=800), padding the entire vector to the next power-of-2 (1024) may be preferable to block decomposition with a straggler, depending on the overhead tradeoff. This is an empirical question.
This is deferred: the block-size rule already handles all common embedding dimensions (768, 1024, 1536, etc.) without stragglers, and the rare no-qualifying-B case (d=96) is adequately served by internal zero-padding for now.
Norm architecture
Per-block norms are stored as an internal child of the TurboQuant array:
- For k = 1 (power-of-2 dims): PrimitiveArray<F> with len = num_rows (identical to Stage 1's single-norm layout).
- For k > 1: FixedSizeListArray<F> with list_size = k, len = num_rows.
The norm dtype F matches or widens the input element type:
| Input dtype | Norm dtype | Rationale |
|---|---|---|
| f16 | f32 | f16 has insufficient range/precision for norms |
| f32 | f32 | Same type |
| f64 | f64 | Preserve full precision |
Norms are stored as plain child arrays; the cascading compressor handles secondary encoding (ALP, Pco, etc.).
Note: centroids and quantization always operate in f32 internally (the current implementation converts all input to f32 before quantization). For f64 input, decode produces f32 unit-direction reconstructions scaled by f64 norms — a mixed-precision multiply that preserves norm precision.
Zero-norm sub-vectors
When splitting a vector into B-dim blocks, some blocks may have zero norm. The encoding handles ‖xₖ‖ = 0 explicitly: skip rotation and quantization, store norm = 0, decode as all zeros.
Theoretical MSE bound
The paper's MSE bound (Theorem 1 in [1]) is:
E[‖x - x̂‖² / ‖x‖²] ≤ (√3 · π / 2) / 4^b ≈ 2.72 / 4^b
Crucially, Theorem 1 is proved for true random orthogonal matrices (QR of Gaussian), not SORF. Our SORF is an approximation. The bound holds exactly only with a true random orthogonal rotation or with empirical SORF validation (see Experimental plan).
Assuming the per-block MSE bound holds, for a vector split into blocks the first line is an algebraic identity (exact); the inequality on the second line applies Theorem 1's probabilistic bound to each block and should be read as holding in expectation over independent per-block rotations, not almost surely:
‖x - x̂‖² / ‖x‖² = Σ_k (‖xₖ‖² / ‖x‖²) × (‖xₖ - x̂ₖ‖² / ‖xₖ‖²) (exact)
E[...] ≤ MSE_bound × Σ_k (‖xₖ‖² / ‖x‖²) = MSE_bound (in expectation)
The conclusion: E[‖x - x̂‖² / ‖x‖²] ≤ MSE_bound assuming independent
per-block rotations. (Theorem 1 applies because each block is normalized to
unit norm before rotation and quantization; the per-block encoding pipeline is:
split → normalize → rotate → quantize, matching the theorem's unit-sphere
assumption.) Note that TurboQuant's original analysis uses a single
global rotation in high-d where coordinates are nearly independent; with
smaller block dimension B, within-block coordinate dependence after rotation may
be stronger even when marginals are correct — this is an additional motivation
for the experimental plan's comparison of block sizes.
The actual MSE may depend on block dimension B: at larger B the coordinate distribution is more concentrated (variance ~1/B), giving the Max-Lloyd quantizer more to exploit. See Experimental plan.
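The energy-weighted decomposition in the first line of the derivation above is exact for any reconstruction, which is easy to confirm numerically (arbitrary stand-in reconstruction, no quantizer involved):

```python
import numpy as np

rng = np.random.default_rng(0)
d, B = 768, 256
k = d // B

x = rng.standard_normal(d)
xhat = x + 0.05 * rng.standard_normal(d)  # any reconstruction works for the identity

lhs = np.sum((x - xhat) ** 2) / np.sum(x ** 2)
rhs = 0.0
for i in range(k):
    xb, xbh = x[i*B:(i+1)*B], xhat[i*B:(i+1)*B]
    wb = np.sum(xb ** 2) / np.sum(x ** 2)            # block's share of total energy
    rb = np.sum((xb - xbh) ** 2) / np.sum(xb ** 2)   # block's relative error
    rhs += wb * rb
```

Only the second line of the derivation (bounding each block's relative error by Theorem 1) is probabilistic; the weighting itself introduces no slack.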
SORF approximation. The R-round SORF HD_R·...·HD₂·HD₁ [5] provides
log₂(B) butterfly stages per round × R rounds = R·log₂(B) total. At R=3
(default): 18 at B=64, 24 at B=256, 27 at B=512. At R=5: 30 at B=64, 40 at
B=256. Counting butterfly stages is a rough heuristic for mixing quality with
no theoretical backing: [5] proves near-unbiasedness for kernel approximation
(Theorem 3) and pairwise near-orthogonality (Theorem 4), but does not prove
distributional closeness to Haar measure, does not analyze convergence rate as
a function of rounds × dimension, and leaves tight variance bounds for SORF as
an open problem. The variable-round rotation signs (Stage 1) enable testing
more rounds at smaller B or lower bit widths where mixing quality matters more.
Empirical validation is needed.
Fallback: dense rotation. If SORF proves insufficient at the chosen B, use a B × B random orthogonal matrix (QR of Gaussian). Storage at B=256: 256 KB per block. For d=768 with k=3: 768 KB total. Amortizes for large columns (100K+ vectors). Each block must have an independent rotation matrix.
DCT and other fixed structured transforms are not suitable for TurboQuant's rotation (they do not produce the required Beta marginal). Sharing a rotation with ADSampling-style pruning is a speculative future direction. See Appendix C for details on both.
Quantized-domain operations
All quantized operations read per-block norms from the internal child array:
- L2 distance: ‖a-b‖² = Σ_k ‖aₖ‖² + Σ_k ‖bₖ‖² - 2·Σ_k ‖aₖ‖·‖bₖ‖·unit_dotₖ. Primary ANN metric; reuses per-block dot product and norms.
- Dot product: <a,b> ≈ Σ_k ‖aₖ‖·‖bₖ‖ · Σ_j centroids[code_aₖ[j]] · centroids[code_bₖ[j]].
- Cosine similarity: cos(a,b) ≈ dot(a,b) / (‖a‖·‖b‖) where ‖a‖ = √(Σ_k ‖aₖ‖²).
- L2 norm: √(Σ_k ‖xₖ‖²). O(k) per vector — a regression from the current O(1) single-norm readthrough, but modest.
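The per-block weighted dot product is exact when the per-block unit dot products are exact, so quantization error enters only through unit_dotₖ. A quick numerical check (no quantizer, exact unit dots):

```python
import numpy as np

rng = np.random.default_rng(1)
d, B = 768, 256
k = d // B
a, b = rng.standard_normal(d), rng.standard_normal(d)

dot = 0.0
for i in range(k):
    ab, bb = a[i*B:(i+1)*B], b[i*B:(i+1)*B]
    na, nb = np.linalg.norm(ab), np.linalg.norm(bb)
    unit_dot = np.dot(ab / na, bb / nb)  # what the quantized codes approximate
    dot += na * nb * unit_dot            # per-block norm weighting
```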
Encoding algorithm
Input: x ∈ ℝ^d, b_mse bits per coordinate, block_size B
k = d / B (exact division, no straggler for chosen B)
num_centroids = 2^b_mse
# Block split and normalize
for i in 0..k:
xᵢ = x[i*B .. (i+1)*B]
nᵢ = ‖xᵢ‖
if nᵢ > 0:
ûᵢ = xᵢ / nᵢ
else:
ûᵢ = zeros(B)
# MSE stage (per block, SORF rotation)
for i in 0..k:
if nᵢ > 0:
rᵢ = SORFᵢ(ûᵢ)
cᵢ[j] = nearest_centroid(rᵢ[j])
else:
cᵢ[j] = 0
Store (all as internal children):
codes (k × B per vector), norms (k per vector),
centroids (2^b_mse, shared), SORF signs (k × R × B, shared; R = SORF rounds)
Decoding algorithm
for i in 0..k:
r̂ᵢ[j] = centroids[cᵢ[j]]
ûᵢ = SORF⁻¹ᵢ(r̂ᵢ)
x̂ᵢ = nᵢ × ûᵢ (nᵢ read from internal norms child)
x̃ = concat(x̂₀, ..., x̂ₖ₋₁)
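The two listings above can be exercised end-to-end. This runnable sketch substitutes uniformly spaced stand-in centroids for the Max-Lloyd codebook and a plain sign-flip-plus-Hadamard round for each SORF round (both simplifications for illustration; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def fwht(x):
    # Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of 2.
    x = x.copy()
    h, d = 1, len(x)
    while h < d:
        for i in range(0, d, 2 * h):
            for j in range(i, i + h):
                a, c = x[j], x[j + h]
                x[j], x[j + h] = a + c, a - c
        h *= 2
    return x / np.sqrt(d)

d, B, b, R = 768, 256, 8, 3
k = d // B
signs = rng.choice([-1.0, 1.0], size=(k, R, B))  # independent diagonals per block/round
# Stand-in codebook: 2^b uniformly spaced centroids (the real codebook is Max-Lloyd).
centroids = np.linspace(-0.25, 0.25, 2**b)

def rotate(u, s):    # forward: H·D per round
    for r in range(R):
        u = fwht(u * s[r])
    return u

def unrotate(u, s):  # inverse: rounds reversed, D·H per round (H and D are self-inverse)
    for r in reversed(range(R)):
        u = fwht(u) * s[r]
    return u

x = rng.standard_normal(d)
norms = np.zeros(k)
codes = np.zeros((k, B), dtype=np.uint8)
for i in range(k):  # split -> normalize -> rotate -> quantize
    xb = x[i * B:(i + 1) * B]
    norms[i] = np.linalg.norm(xb)
    if norms[i] > 0:
        r = rotate(xb / norms[i], signs[i])
        codes[i] = np.abs(r[:, None] - centroids[None, :]).argmin(axis=1)

# Decode: centroid lookup -> inverse rotation -> rescale -> concat
xhat = np.concatenate(
    [norms[i] * unrotate(centroids[codes[i]], signs[i]) for i in range(k)]
)
nmse = np.sum((x - xhat) ** 2) / np.sum(x ** 2)
```

Zero-norm blocks take the else-branch of the pseudocode: no rotation, code 0, and decode reproduces zeros after the norm multiply.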
Stage 3: PDX dimension-major layout
Introduce a new PDXArray encoding type that wraps any FixedSizeListArray
with a dimension-major layout within groups of 64 vectors [4]. Like block
decomposition (Stage 2), PDXArray is a structural transform over
FixedSizeListArray, not specific to any particular encoding — it is a
general-purpose layout optimization for any FixedSizeList of scalar elements
(raw float vectors, scalar-quantized vectors, TurboQuant codes, etc.).
Changes vs. Stage 2:
| Aspect | Stage 2 | Stage 3 |
|---|---|---|
| Codes child type | FixedSizeListArray<u8> | PDXArray<u8> (wraps FSL with transposed layout) |
| Codes detection | N/A (codes always FSL) | TQ checks child type: FSL → row-major decode, PDXArray → un-transpose first |
| Distance kernel | Per-vector loop with per-element centroid lookup | SIMD-friendly 64-vector inner loop with distance-table lookup |
| Decode path | Direct inverse SORF per vector | PDXArray.to_fsl() first, then inverse SORF |
Unchanged from Stage 2: Block size B, centroid computation, norm storage, SORF rotation, all encoding logic. The encode path produces row-major codes (FSL), then the compressor wraps them in a PDXArray; the decode path converts PDXArray back to FSL then decodes.
PDXArray design:
PDXArray<T> (general-purpose dimension-major layout for FixedSizeList)
├── metadata: { list_size, chunk_size (= 64) }
├── elements: PrimitiveArray<T> # transposed: 64 values per dim, contiguous
├── validity: ... # same as FSL validity
- PDXArray::try_new(fsl) — transposes a FixedSizeListArray into PDX layout
- PDXArray::to_fsl() — un-transposes back to row-major FSL (for decode, scalar_at, or non-aligned slice/take)
- PDXArray::elements_for_dim(dim, chunk) — O(1) access to a contiguous slice of 64 values for one dimension within one chunk
- Slice/take: un-transpose to FSL (simplest). Un-transpose cost is O(rows × list_size) per operation; consider 64-row-aligned fast paths for hot scan workloads. Preserving PDX layout is possible only for 64-vector-aligned ranges.
- The cascade compressor treats PDXArray as a valid encoding of FSL-typed data.
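The transpose itself is a per-chunk reshape. A compact sketch of the layout and the accessors named above (simplified NumPy stand-ins, no validity handling; the names mirror the design, not an existing API):

```python
import numpy as np

CHUNK = 64  # PDX chunk size (empirically optimal across architectures)

def to_pdx(codes):
    # codes: (num_rows, list_size) row-major -> list of dimension-major chunk buffers.
    n, d = codes.shape
    chunks = []
    for start in range(0, n, CHUNK):
        chunk = codes[start:start + CHUNK]            # (<=64, d) row-major
        chunks.append(np.ascontiguousarray(chunk.T))  # (d, <=64) dimension-major
    return chunks

def elements_for_dim(chunks, dim, chunk_idx):
    # O(1): one contiguous run of up to 64 values for a single dimension.
    return chunks[chunk_idx][dim]

def to_fsl(chunks):
    # Un-transpose back to row-major (for decode, scalar_at, non-aligned slice/take).
    return np.concatenate([c.T for c in chunks], axis=0)

rng = np.random.default_rng(0)
codes = rng.integers(0, 256, size=(130, 8), dtype=np.uint8)  # 130 rows: 2 full chunks + tail
pdx = to_pdx(codes)
```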
Benefits of PDXArray as a separate type:
- PDX logic tested and maintained independently of TurboQuant
- Other encodings (raw float vectors, scalar quantization, future encodings) get PDX scan performance for free
- TurboQuant doesn't need an is_pdx metadata flag — it checks its codes child's type at runtime
- The distance kernel operates on PDXArray's dimension-contiguous slices
Within each 64-vector chunk, codes are stored dimension-major:
TQ block 0, dim 0: [v0 v1 v2 ... v63]
TQ block 0, dim 1: [v0 v1 v2 ... v63]
...
TQ block 0, dim (B - 1): [v0 v1 v2 ... v63]
TQ block 1, dim 0: [v0 v1 v2 ... v63]
...
The inner SIMD loop (64 vectors) has no inter-vector dependencies. TQ block boundaries only affect where norm weighting occurs — they don't affect the transpose.
Quantized distance kernel (dot product):
let dist_table = precompute_product_table(&centroids);
// At b_mse=4: 16×16 = 256 floats = 1KB, fits in L1
let mut distances = [0.0f32; 64];
let mut unit_dots = [0.0f32; 64];
let mut offset = 0;
for tq_block in 0..k {
for dim in 0..B {
let qd = query_codes[tq_block * B + dim];
let row = &dist_table[qd as usize];
for v in 0..64 { // SIMD-friendly: no inter-vector deps
unit_dots[v] += row[codes[offset] as usize];
offset += 1;
}
}
// Weight per-block unit-norm dot product by both vectors' block norms
for v in 0..64 {
distances[v] += query_norms[tq_block] * data_norms[v][tq_block]
* unit_dots[v];
unit_dots[v] = 0.0; // reset for next TQ block
}
}
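The product table the kernel indexes into can be built as below; `precompute_product_table` is the name used in the kernel sketch, not an existing Vortex function, and the 2-bit centroids here are toy values (the real ones are Max-Lloyd centroids for the Beta marginal):

```rust
/// Sketch of the (2^b)^2 product table: table[q][c] = centroids[q] * centroids[c],
/// so one table lookup replaces a decode-and-multiply per (query code, data code)
/// pair in the scan kernel.
fn precompute_product_table(centroids: &[f32]) -> Vec<Vec<f32>> {
    centroids
        .iter()
        .map(|&q| centroids.iter().map(|&c| q * c).collect())
        .collect()
}

fn main() {
    // Toy 2-bit "centroids" (illustrative, not Max-Lloyd values).
    let centroids = [-0.75f32, -0.25, 0.25, 0.75];
    let table = precompute_product_table(&centroids);
    assert_eq!(table.len(), 4);
    assert_eq!(table[0][3], -0.5625); // (-0.75) * 0.75
    // At b_mse = 4: 16 x 16 = 256 f32 entries = 1 KiB, as noted above.
    assert_eq!(16 * 16 * std::mem::size_of::<f32>(), 1024);
    println!("table ok");
}
```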
Int8 layout variant. The reference PDX implementation (`pdx-impl`) uses a different tiling for int8 data: "4 dims × 16 vecs" to leverage VPDPBUSD/UDOT hardware dot-product instructions (which process 4 unsigned×signed byte pairs per operation). For TurboQuant codes at b_mse ≤ 8, codes are uint8 centroid indices, so VPDPBUSD doesn't apply directly — we need the distance-table-lookup path shown above. However, at b_mse=8 with high B, the Max-Lloyd centroids are near-uniformly spaced (see GPU section), potentially enabling direct hardware dot-product on the codes. Whether this requires a separate linear quantization mode or works with the existing Max-Lloyd centroids is an empirical question. The "4 dims × 16 vecs" layout would be a Stage 3 optimization to evaluate alongside the "1 dim × 64 vecs" float-style layout.
ADSampling integration. The PDX dimension-pruning approach (ADSampling [4]) is complementary to TurboQuant's block structure. During a scan, the pruner could evaluate partial distances after each TQ block (B dimensions) and skip remaining blocks if the partial L2 distance already exceeds the candidate threshold. This requires the per-block norm weighting to happen at block boundaries (as shown in the kernel above), which our design already provides.
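The block-granular pruning idea reduces to a monotone early exit: partial squared L2 distance only grows as blocks are added, so a scan can stop once it exceeds the candidate threshold. A minimal sketch, taking already-computed per-block distance contributions as input (in the real kernel these come from the table-lookup loop); `pruned_l2_sq` is a hypothetical name and this is not the ADSampling hypothesis test itself:

```rust
/// Accumulate per-block squared L2 contributions and stop early once
/// the partial sum exceeds the current candidate threshold. Remaining
/// blocks can only increase the distance, so skipping them is safe.
fn pruned_l2_sq(block_dist_sq: &[f32], threshold_sq: f32) -> Option<f32> {
    let mut acc = 0.0f32;
    for (i, d) in block_dist_sq.iter().enumerate() {
        acc += d;
        if acc > threshold_sq {
            // Pruned: this vector cannot beat the current candidate.
            let skipped = block_dist_sq.len() - i - 1;
            eprintln!("pruned after block {i}, skipped {skipped} blocks");
            return None;
        }
    }
    Some(acc)
}

fn main() {
    // k = 3 TQ blocks of per-block squared distance.
    assert_eq!(pruned_l2_sq(&[1.0, 2.0, 3.0], 10.0), Some(6.0)); // survives
    assert_eq!(pruned_l2_sq(&[5.0, 6.0, 7.0], 10.0), None); // pruned at block 1
    println!("pruning ok");
}
```

This is why the per-block norm weighting at block boundaries matters: it gives the pruner a valid partial distance after every B dimensions.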
Open design questions:
- Should PDXArray live in `vortex-array` (general infrastructure) or `vortex-tensor` (vector-specific)?
- Should the cascade compressor automatically PDX-transpose FSL children when it detects a scan-heavy workload, or should PDX be opt-in?
- Should we support the "4 dims × 16 vecs" uint8 layout variant (for hardware dot-product) alongside the "1 dim × 64 vecs" float-style layout?
QJL correction (deferred — experimental)
Based on community findings [8], QJL is deferred to after the MSE stages are validated.
Changes vs. MSE-only (if pursued):
| Aspect | MSE-only | MSE + QJL |
|---|---|---|
| Bit budget | All b bits → MSE (2^b centroids) | b-1 bits MSE + 1 bit QJL (2^(b-1) centroids) |
| Inner product estimate | Biased (MSE quantization noise) | Unbiased (QJL correction; see TurboQuant_prod in [1]) |
| Additional children | None | QJL signs, QJL residual norms, QJL projection params |
| Encode cost | SORF only | SORF + QJL projection (O(B²) for Gaussian, O(B log B) for SORF) |
| Decode cost | Inverse SORF only | Inverse SORF + QJL inverse projection |
If pursued, four strategies should be compared:
| Strategy | Theoretical | Speed | Storage |
|---|---|---|---|
| Per-block Gaussian | Correct (Lemma 4 [1]) | O(B²)/block | k×B²×4 bytes |
| Per-block SORF | Approximate | O(B log B)/block | k×R×B bits |
| Full-dim SORF | Approximate | O(d log d) total | R×d bits |
| MSE-only (no QJL) | N/A | 0 | None |
The paper's QJL uses Gaussian S (not SORF); Lemma 4 [1] is proved specifically for Gaussian. Using SORF for QJL is therefore an additional approximation (our original QJL implementation used SORF for this step). Per-block QJL can incur up to a d/B times larger variance bound than full-dimension QJL (Lemma 4 [1]), depending on how query and residual energy are distributed across blocks.
Community reports indicate MSE-only often wins for KV-cache attention at all tested bit widths [8]. Whether this extends to ANN ranking is an empirical question (see Experimental plan); QJL may not be worth the complexity. Note: the original QJL PR flagged a known SORF-related QJL bias for non-power-of-2 padded dimensions (#7245); the merged MSE-only encoding avoids this path.
Array layout
Stage 1 (MSE-only single block)
TurboQuantArray
├── metadata: { dimension, b_mse,
│ block_size (= padded_dim, next power-of-2 ≥ dimension),
│ num_blocks (= 1), num_rounds (= R, default 3) }
│
│ # Per-row children
├── codes: FixedSizeListArray<u8> # list_size = padded_dim
│ (or PDXArray<u8> after Stage 3)
├── norms: PrimitiveArray<F> # len = num_rows (F = f64 for f64, f32 otherwise)
│
│ # Shared children
├── centroids: PrimitiveArray<f32> # len = 2^b_mse
├── mse_rotation_signs: FixedSizeListArray # len = R (default 3)
│ element dtype: FixedSizeList(u8, padded_dim, NonNullable)
│ # each element = one bitpacked sign diagonal, inverse-friendly order
For power-of-2 dimensions, padded_dim = dimension (no waste). For
non-power-of-2 (e.g., d=768), padded_dim = 1024 (33% overhead, eliminated
by Stage 2 block decomposition).
The codes child is FixedSizeListArray in Stages 1-2 and may be swapped to
PDXArray in Stage 3 — TurboQuant checks the child type at runtime, not via
a metadata flag.
Stage 2 (block decomposition)
TurboQuantArray (self-contained, handles blocks internally)
├── metadata: { dimension, b_mse, block_size, num_blocks,
│ num_rounds }
│
│ # Per-row children (sliced/taken on row operations)
├── codes: FixedSizeListArray<u8> # list_size = k × B
│ (or PDXArray<u8> after Stage 3)
├── norms: PrimitiveArray<F> # len = num_rows (k=1)
│ or FixedSizeListArray<F> # list_size = k (k>1)
│
│ # Shared children (cloned on row operations, not sliced)
├── centroids: PrimitiveArray<f32> # len = 2^b_mse
├── mse_rotation_signs: FixedSizeListArray # len = k × R
│ element dtype: FixedSizeList(u8, B, NonNullable)
│ # k blocks × R rounds, each element = one bitpacked sign diagonal
Compression ratio
For f32 input, b_mse bits MSE, k = d/B blocks, N vectors (for f64 input, replace 32 with 64 in the norms row — ratios decrease accordingly):
| Component | Bits per vector |
|---|---|
| MSE codes | k × B × b_mse |
| Block norms | k × 32 |
| Component | Shared bits |
|---|---|
| Centroids | 2^b_mse × 32 |
| SORF signs | k × R × B |
Worked examples (f32, N=1000)
At b_mse=8 (default, near-lossless):
| d | B | k | Per-vec bits | Ratio | Notes |
|---|---|---|---|---|---|
| 768 | 256 | 3 | 3×256×8 + 3×32 = 6240 | 3.9× | Block decomp; no padding |
| 1024 | 1024 | 1 | 1024×8 + 32 = 8224 | 4.0× | Single block (= current) |
| 768 (padded) | 1024 | 1 | 1024×8 + 32 = 8224 | 3.0× | Padded; 33% overhead |
At b_mse=5 (32 centroids):
| d | B | k | Per-vec bits | Ratio | Notes |
|---|---|---|---|---|---|
| 768 | 256 | 3 | 3×256×5 + 3×32 = 3936 | 6.2× | Block decomp; no padding |
| 1024 | 1024 | 1 | 1024×5 + 32 = 5152 | 6.4× | Single block (= current) |
| 768 (padded) | 1024 | 1 | 1024×5 + 32 = 5152 | 4.8× | Padded; 33% overhead |
Block decomposition improves the compression ratio at both bit widths. At b=8 for d=768: from ~3.0× (padded) to ~3.9× (block decomp). At b=5 for d=768: from ~4.8× to ~6.2×. For d=1024, the encoding is identical to current (single block).
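The per-vector arithmetic in the tables above is mechanical and can be checked directly (shared centroids/signs excluded, matching the tables):

```rust
/// Per-vector compressed bits for f32 input: k blocks of B codes at
/// b_mse bits each, plus one f32 norm per block.
fn per_vec_bits(k: u32, block_size: u32, b_mse: u32) -> u32 {
    k * block_size * b_mse + k * 32
}

/// Compression ratio vs. uncompressed f32 vectors of dimension d.
fn ratio(d: u32, k: u32, block_size: u32, b_mse: u32) -> f64 {
    (d * 32) as f64 / per_vec_bits(k, block_size, b_mse) as f64
}

fn main() {
    // b_mse = 8 rows of the worked-example table:
    assert_eq!(per_vec_bits(3, 256, 8), 6240); // d=768, block decomp
    assert_eq!(per_vec_bits(1, 1024, 8), 8224); // d=1024 single block
    assert!((ratio(768, 3, 256, 8) - 3.94).abs() < 0.01); // ~3.9x
    assert!((ratio(768, 1, 1024, 8) - 2.99).abs() < 0.01); // padded, ~3.0x
    // b_mse = 5 rows:
    assert_eq!(per_vec_bits(3, 256, 5), 3936);
    assert!((ratio(768, 3, 256, 5) - 6.24).abs() < 0.01); // ~6.2x
    println!("ratios ok");
}
```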
Shared overhead note: centroids and SORF signs are amortized over N vectors; for small N, per-column shared metadata is significant — report totals with and without amortization when publishing ratios.
Performance analysis
Encode/decode throughput
SORF at B dimensions (heuristic — real cost is dominated by memory bandwidth and constant factors): R × B × log₂(B) butterflies + R × B sign applications per block (R = SORF rounds, default 3; plus B normalization multiplies, omitted). For k blocks, R=3:
| B | SORF FLOPs/block | k (d=768) | Total MSE FLOPs |
|---|---|---|---|
| 256 | 3×256×8 + 768 = 6,912 | 3 | 20,736 |
| 512 | 3×512×9 + 1536 = 15,360 | — | — |
| 1024 (current) | 3×1024×10 + 3072 = 33,792 | 1 | 33,792 |
Block decomposition at d=768 is ~40% fewer FLOPs than the padded single-block approach, despite more blocks, because each block is smaller.
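The heuristic FLOP model behind the table can be written down and checked; it counts R·B·log₂(B) butterflies plus R·B sign applications per block, omitting the B normalization multiplies exactly as the table does:

```rust
/// Heuristic SORF cost per block: R*B*log2(B) butterflies + R*B sign
/// applications (normalization multiplies omitted, matching the table).
fn sorf_flops_per_block(b: u32, r: u32) -> u32 {
    // B is a power of two, so log2(B) == trailing_zeros(B).
    r * b * b.trailing_zeros() + r * b
}

fn main() {
    let r = 3;
    assert_eq!(sorf_flops_per_block(256, r), 6_912); // 3*256*8 + 768
    assert_eq!(sorf_flops_per_block(512, r), 15_360);
    assert_eq!(sorf_flops_per_block(1024, r), 33_792);
    // d = 768: three 256-blocks vs. one padded 1024-block.
    let blocked = 3 * sorf_flops_per_block(256, r);
    assert_eq!(blocked, 20_736);
    let saved = 1.0 - blocked as f64 / sorf_flops_per_block(1024, r) as f64;
    assert!(saved > 0.38 && saved < 0.40); // ~39% fewer FLOPs
    println!("flops ok");
}
```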
Benchmarking plan
- Encode/decode throughput: block TQ vs. current TQ at d=128, 768, 1024
- Quantized cosine similarity: block vs. current
- L2 norm readthrough: O(k) vs. O(1)
- PDX scan throughput vs. row-major (Stage 3)
Experimental plan
Minimum dimension threshold
Test TurboQuant quality at d ∈ {32, 64, 96, 128, 256} to validate the scheme minimum of 128:
- Compare TurboQuant MSE distortion and ANN recall@k against scalar quantization matched on total compressed bits per vector (codes + norm + amortized shared metadata), not just bits-per-coordinate — this is critical at small d where norm overhead is significant
- Plot the crossover point: at what d does TurboQuant's recall@k drop below the rate-matched scalar baseline?
- Test SORF coordinate distribution quality at each d (histogram vs. Beta)
- Measure overhead ratio (norm bits / total compressed bits) at each d
The scheme minimum should be set at the smallest d where TurboQuant reliably beats the scalar baseline on recall@k across the benchmarking datasets. Default scalar baseline: per-dimension linear min-max quantization at b bits per coordinate plus an f32 norm (matching TurboQuant's norm overhead). Report results at a reference N (e.g., N=100K vectors) where shared metadata is amortized; optionally show sensitivity to small N where shared costs dominate. The current proposal of 128 is conservative; experiments may justify lowering to 64 or raising to 256.
MSE quality and scan performance vs. block size
- Compare actual normalized MSE at B ∈ {64, 128, 256, 512} vs. single-block at full power-of-2 dimension, at bit widths b ∈ {2, 3, 4, 5, 8}
- Compare ANN recall@k and scan throughput at fixed d (e.g., d=3072) across B ∈ {256, 512, 1024} — smaller B gives more pruning checkpoints for ADSampling-style early termination but increases norm overhead
- Test SORF coordinate distribution at each B: histogram vs. analytical Beta
- Test 3, 4, 5 SORF rounds at each B
- Determine if the practical MSE constant is worse at smaller B
The block-size rule ("greatest qualifying B") is a starting heuristic that maximizes per-block quality and minimizes norm count. Experiments may show that smaller B with more pruning checkpoints yields better end-to-end scan performance despite higher per-block overhead.
QJL strategy comparison (if pursued)
- Per-block Gaussian QJL vs. per-block SORF QJL vs. full-dim SORF QJL vs. MSE-only
- Key metric: ANN recall@k on the datasets above (Contriever, OpenAI, SIFT)
- Per community findings for attention, MSE-only is expected to win [8]; ANN ranking is the key open question
Benchmarking datasets
The current test suite uses i.i.d. Gaussian vectors as a theory anchor and sanity check: for isotropic data, a random orthogonal transform is distributionally neutral, which cleanly validates theoretical bounds. This is not a universal "worst case" for all production workloads — heavy-tailed or clustered embeddings can behave differently. Recent work (VIBE [11]) argues that traditional benchmarks (SIFT, GloVe) are no longer representative of modern ANN workloads.
Recommended datasets:
| Dataset | Dim | Size | Source | Why |
|---|---|---|---|---|
| Contriever | 768 | ~1M | PDX paper [4] | Key non-power-of-2 target; real embeddings |
| OpenAI text-embedding-3-large | 1536 | ~1M | Common in RAG | High-d production embeddings |
| SIFT | 128 | 1M | Classic | Low-d power-of-2 baseline, well-studied recall numbers |
| arXiv embeddings | 768 | 2.25M | PDX paper [4] | Same dim as Contriever, larger scale |
| DEEP | 96 | 10M | Image embeddings | Large scale; d=96 < scheme min (128) and has no B ≥ 64 — requires explicit TurboQuantArray construction or benchmark-only scheme override |
| Synthetic Gaussian | varies | varies | Internal | Theory anchor / sanity check; not universal worst case |
Metrics (at b_mse ∈ {2, 3, 4, 5, 8}):
- Recall@10, Recall@100 (ANN ranking quality)
- Normalized MSE distortion (reconstruction quality)
- Inner product mean signed relative error (bias measurement)
- Encode/decode throughput (vectors/sec)
The Gaussian baseline validates that theoretical bounds hold. The real-embedding datasets measure practical quality — which may be better than Gaussian (structured data benefits more from rotation) or worse (if the data has adversarial properties for the specific rotation).
Dimensions with no qualifying B
Rare for common embedding dimensions (e.g., d=96). Currently these fall back to internal zero-padding to the next power-of-2 (single padded block). See "Straggler blocks (future work)" in Stage 2 for a potential alternative using heterogeneous per-block encodings.
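The block-size rule and the padding fallback are both small bit tricks; a sketch (function names are illustrative, not the Vortex scheme API):

```rust
/// Stage 2 rule: the greatest power-of-2 >= 64 dividing d, or None if
/// no such B exists (fall back to zero-padding).
fn choose_block_size(d: u32) -> Option<u32> {
    if d == 0 {
        return None;
    }
    let b = 1u32 << d.trailing_zeros(); // largest power of 2 dividing d
    if b >= 64 { Some(b) } else { None }
}

/// Stage 1 / fallback padding target: next power of 2 >= d.
fn padded_dim(d: u32) -> u32 {
    d.next_power_of_two()
}

fn main() {
    assert_eq!(choose_block_size(768), Some(256)); // 768 = 3 x 256
    assert_eq!(choose_block_size(1024), Some(1024)); // B = d, single block
    assert_eq!(choose_block_size(1536), Some(512));
    assert_eq!(choose_block_size(3072), Some(1024));
    assert_eq!(choose_block_size(96), None); // 96 = 3 x 32, and 32 < 64
    assert_eq!(padded_dim(96), 128); // fallback: internal zero-padding
    assert_eq!(padded_dim(768), 1024);
    println!("block sizes ok");
}
```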
Phasing
Phase 1 (in progress) — MSE-only single-block TurboQuant: Initial
implementation merged as PR #7269. Remaining:
FixedSizeListArray rotation signs (variable SORF rounds), dtype-matching
norms, structured metadata, and review items (see Stage 1: Remaining work).
Phase 2 — Block decomposition: Add block splitting for dimensions where a
valid B exists (greatest power-of-2 ≥ 64 dividing d). Per-block norms stored as
internal children. The TurboQuantScheme::compress() method must be updated to:
(a) choose B based on d, (b) split input into blocks, (c) normalize per-block,
(d) encode each block, and (e) store per-block norms as an internal child array.
Phase 3 — PDXArray + scan kernels: Introduce PDXArray as a general-purpose
dimension-major layout for FixedSizeListArray. TurboQuant's codes child is
swapped from FSL to PDXArray by the compressor. Distance computation kernels
operate on PDXArray's dimension-contiguous slices.
Phase 4 (experimental) — QJL: If the experimental plan shows QJL improves recall@k beyond MSE-only, add per-block Gaussian or SORF QJL. Based on KV-cache community reports [8], this may not be pursued.
Practical recommendations
For common model dimensions, the most promising configurations are:
| Dimension | Recommendation | Rationale |
|---|---|---|
| 512, 1024, 2048, 4096 | Single-block MSE-only + PDX | B=d, no decomposition needed. Same as current TQ but with PDX scan layout. |
| 768, 1536, 3072 | 3-block MSE-only + PDX | B=256 or 512. No padding waste. 3 blocks, shared centroids. |
| No qualifying B (rare) | Padded single-block | Internal zero-padding to next power-of-2, single SORF. |
In all cases, MSE-only is the recommended starting point. QJL should only be added if experiments demonstrate clear recall@k improvements for the target workload.
Future work: GPU decode and fused distance computation
The B-dim block structure maps naturally to GPU tile sizes and tensor cores. For a single block (k=1; Stage 2 generalizes to k independent per-block GEMMs) with a batch of N vectors sharing the same rotation matrix R⁻¹:
decoded_batch = diag(norms) × R⁻¹ × codebook_lookup_batch(codes)
↑ B×N matrix
↑ B×B × B×N = GEMM
The codebook gather + inverse rotation + norm scaling can be fused into a single kernel using an IO-aware streaming pattern analogous to Flash-KMeans [6] — not the same algorithm (Flash-KMeans is GPU k-means), but a similar systems goal: reduce HBM traffic and avoid full materialization. For distance computation without full decode, a precomputed (2^b_mse)²-entry distance table fits in shared memory at low bit widths (1 KB at b_mse=4, 4 KB at b_mse=5). At the default b_mse=8, the table is 256² × 4 = 256 KB, which exceeds typical GPU shared memory (48-228 KB); the distance-table approach is therefore practical only at b ≤ 5 on GPU, or requires tiling/streaming for b=8. On CPU, the table fits in L2 at all bit widths. The kernel streams code bytes from HBM with gather-reduce accumulation, using 4-8× less bandwidth than full float vectors.
At b_mse=8, codes are uint8 indices (0-255). Direct low-precision GEMM on hardware accelerators (tensor cores on GPU, byte-dot-product instructions on CPU) requires approximately linear centroids — but at high B the Max-Lloyd centroids are already near-uniform (the Beta distribution is highly concentrated, approaching Gaussian, for which high-resolution optimal quantization is approximately uniform). Whether the existing Max-Lloyd centroids are "linear enough" for hardware dot-product instructions is an empirical question worth testing before introducing a separate linear quantization mode.
Integration with Vortex scan engine
TurboQuant's quantized-domain operations must integrate with Vortex's expression
evaluation and scan pushdown infrastructure. The current implementation provides
this via ScalarFnVTable implementations in vortex-tensor.
Current integration path. The CosineSimilarity, DotProduct, and L2Norm
scalar functions check whether their input storage arrays are TurboQuant-encoded
(via TurboQuant::try_match()). If both operands are TurboQuant and the
ApproxOptions::Approximate flag is set, the scalar function dispatches to the
quantized-domain kernel (e.g., cosine_similarity_quantized_column), bypassing
full decompression. Otherwise, it falls back to the exact path (decompress →
compute on floats).
Stage 2 changes. With block decomposition, the quantized kernels must be updated to iterate over TQ blocks, weighting by per-block norms:
- `cosine_similarity_quantized_column`: currently computes a single unit-norm dot product per row pair. Must change to `Σ_k norm_a_k · norm_b_k · unit_dot_k / (‖a‖ · ‖b‖)` with `‖a‖ = √(Σ_k norm_a_k²)`.
- `dot_product_quantized_column`: same per-block weighting.
- `l2_norm`: currently returns the stored norm directly (O(1)). Must change to `√(Σ_k norm_k²)` — read the norms child (`PrimitiveArray` for k=1, `FixedSizeListArray` for k>1) and compute.
- Both operands must have the same block size B, compatible centroids (same `b_mse` and B-dim codebook), and bit-identical MSE rotation parameters (`mse_rotation_signs` and same SORF construction) for the quantized inner-product path to be valid. Two stored columns with different rotations must fall back to exact (decompress → float). The common column-vs-constant-query path avoids this: the query is re-encoded with the column's rotation and centroids at query time.
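The per-block cosine reweighting can be isolated as a pure function over block norms and per-block unit-norm dot products; `blockwise_cosine` is a hypothetical helper name, not an existing kernel:

```rust
/// Stage 2 cosine from per-block pieces:
/// cos(a, b) = sum_k na_k * nb_k * ud_k / (||a|| * ||b||),
/// with ||a|| = sqrt(sum_k na_k^2). Sketch only.
fn blockwise_cosine(norms_a: &[f32], norms_b: &[f32], unit_dots: &[f32]) -> f32 {
    let num: f32 = norms_a
        .iter()
        .zip(norms_b)
        .zip(unit_dots)
        .map(|((na, nb), ud)| na * nb * ud)
        .sum();
    let norm = |ns: &[f32]| ns.iter().map(|n| n * n).sum::<f32>().sqrt();
    num / (norm(norms_a) * norm(norms_b))
}

fn main() {
    // Identical vectors: every block's unit dot is 1, so cosine is 1
    // regardless of how energy splits across blocks.
    let c = blockwise_cosine(&[3.0, 4.0], &[3.0, 4.0], &[1.0, 1.0]);
    assert!((c - 1.0).abs() < 1e-6);
    // Orthogonal within every block: cosine 0.
    let c = blockwise_cosine(&[3.0, 4.0], &[1.0, 2.0], &[0.0, 0.0]);
    assert_eq!(c, 0.0);
    println!("cosine ok");
}
```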
Stage 3 changes. The PDX distance kernel (shown in Stage 3 pseudocode) is a
new execution path that operates on PDXArray-typed codes. It should be exposed
as an alternative ScalarFnVTable implementation that activates when the codes
child is a PDXArray and the scan is over a contiguous 64-vector-aligned range.
For non-aligned ranges or single-vector access (scalar_at), the PDXArray is
converted to FSL first via PDXArray::to_fsl().
Expression tree integration. The typical ANN scan expression is:
top_k(cosine_similarity(column, constant_query), k=10)
The constant_query is broadcast to match the column length. The
CosineSimilarity scalar function receives both the column (TurboQuant-encoded)
and the query (ConstantArray wrapping a single vector). For the quantized path,
the query is first encoded with the column's rotation and centroids to produce
query codes and query block norms, then the PDX kernel runs over the column's
codes without decompressing them.
Migration and compatibility
TurboQuant has not been included in a release yet, so the wire format can still change freely. The Stage 1 target wire format is intended to be one we believe is ready for backward-compatibility guarantees, without formally committing to stability until confirmed by Stage 2 implementation and benchmarking.
Strategy: single array ID, versioned metadata. All stages use the same array
ID (vortex.turboquant). The metadata includes block_size, num_blocks, and
num_rounds fields. Stage 1 always writes num_blocks=1, but the field exists
so that Stage 2 decoders can read Stage 1 files without migration.
Decoder invariant: block_size is always power-of-2.
codes.list_size = num_blocks × block_size. Note that dimension (the
original input dimension) may differ from codes.list_size in Stage 1 when
internal padding applies (e.g., dimension=768, block_size=1024, list_size=1024).
In Stage 2, dimension = num_blocks × block_size (no padding, since B is
chosen to divide d exactly). The decoder validates that
codes.list_size == num_blocks × block_size (reject files where this does not
hold). num_rounds must equal rotation_signs.len / num_blocks.
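The decoder invariants above can be collected into one validation pass; the struct and field names here are illustrative, not the actual Vortex metadata schema:

```rust
/// Illustrative metadata shape for the versioned wire format.
struct TqMetadata {
    dimension: u32,
    block_size: u32,
    num_blocks: u32,
    num_rounds: u32,
}

/// Checks the invariants listed above against the codes child's
/// list_size and the rotation_signs length.
fn validate(m: &TqMetadata, codes_list_size: u32, signs_len: u32) -> Result<(), String> {
    if !m.block_size.is_power_of_two() {
        return Err("block_size must be a power of 2".into());
    }
    if codes_list_size != m.num_blocks * m.block_size {
        return Err("codes.list_size != num_blocks * block_size".into());
    }
    if signs_len != m.num_rounds * m.num_blocks {
        return Err("num_rounds != rotation_signs.len / num_blocks".into());
    }
    // Stage 1 padding: dimension < list_size is legal only for num_blocks == 1.
    if m.num_blocks > 1 && m.dimension != m.num_blocks * m.block_size {
        return Err("Stage 2 requires dimension == num_blocks * block_size".into());
    }
    Ok(())
}

fn main() {
    // Stage 1 padded file: d=768 stored as one 1024-wide block, R=3.
    let s1 = TqMetadata { dimension: 768, block_size: 1024, num_blocks: 1, num_rounds: 3 };
    assert!(validate(&s1, 1024, 3).is_ok());
    // Stage 2 file: d=768 as 3 blocks of 256, 3 rounds per block.
    let s2 = TqMetadata { dimension: 768, block_size: 256, num_blocks: 3, num_rounds: 3 };
    assert!(validate(&s2, 768, 9).is_ok());
    assert!(validate(&s2, 1024, 9).is_err()); // inconsistent list_size: reject
    println!("validation ok");
}
```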
Norms are always internal children. The TurboQuant array is self-contained — it stores norms as a child slot, not in a parent encoding. This means:
- Stage 1: norms child is `PrimitiveArray<F>`, one norm per vector (F = f64 for f64 input, f32 otherwise).
- Stage 2 with k=1 (power-of-2 dims): same as Stage 1, identical wire format.
- Stage 2 with k>1: norms child is `FixedSizeListArray<F>`, k norms per vector.
The decoder distinguishes k=1 from k>1 by reading num_blocks from metadata.
A k=1 decoder is backward-compatible with Stage 1 files. A k>1 decoder is a
new code path that only applies to files written by Stage 2+.
Stage 3 (PDXArray) is additive. PDX is not a TurboQuant metadata flag — it's
a separate array type (PDXArray) that wraps the codes child. Stage 1/2 files
have FixedSizeListArray codes; Stage 3 files have PDXArray codes. The
TurboQuant decoder checks the child type and un-transposes PDXArray on decode if
needed. PDXArray itself is registered as a new encoding, independent of
TurboQuant.
Incremental shipping:
| Stage | Ships to users? | Reads prior stage files? | Notes |
|---|---|---|---|
| 1 (MSE) | Yes | N/A (first stable version) | Single block, variable SORF rounds |
| 2 (blocks) | Yes | Yes (k=1 is identical) | k>1 files need Stage 2+ decoder |
| 3 (PDX) | Yes | Yes (FSL codes still work) | PDX codes need PDXArray registered |
Each stage is independently shippable. Users can upgrade incrementally. Files written by earlier stages are always readable by later decoders.
References
All lemma, theorem, and definition numbers for [1] refer to arXiv:2504.19874v1. The ICLR 2026 camera-ready proceedings may use different numbering.
[1] Zandieh, A., Daliri, M., Hadian, M. and Mirrokni, V. "TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate." ICLR 2026. arXiv:2504.19874, April 2025.
[2] Ailon, N. and Chazelle, B. "The Fast Johnson-Lindenstrauss Transform and Approximate Nearest Neighbors." SIAM J. Comput. 39(1):302-322, 2009.
[3] Tropp, J.A. "Improved Analysis of the Subsampled Randomized Hadamard Transform." Adv. Adaptive Data Analysis 3(1-2):115-126, 2011.
[4] Kuffo, L., Krippner, E. and Boncz, P. "PDX: A Data Layout for Vector Similarity Search." SIGMOD '25. arXiv:2503.04422, March 2025.
[5] Yu, F.X., Suresh, A.T., Choromanski, K., Holtmann-Rice, D. and Kumar, S. "Orthogonal Random Features." NeurIPS 2016. arXiv:1610.09072.
[6] Yang, S. et al. "Flash-KMeans: Fast and Memory-Efficient Exact K-Means." arXiv:2603.09229, March 2026.
[7] Pathare, T. et al. "TurboQuant: Implementation Corrections, Production Hardening, and Deployment Infrastructure." Eviox Tech Report v1.2.0, March 2026. https://eviox.tech/nexus/eviox_turboquant_corrections_study.pdf (Note: this URL may require Eviox account access; not publicly indexed.)
[8] Community TurboQuant implementation reports (primarily KV-cache attention):
- https://github.com/tonbistudio/turboquant-pytorch — MSE-only (V3) vs MSE+QJL (V2); reports MSE-only wins for attention and generation quality.
- https://github.com/ggml-org/llama.cpp/discussions/20969 — TurboQuant discussion; quantized attention analysis and MSE vs Prod comparison.
- https://github.com/0xSero/turboquant — Triton kernels; paper validation.
- https://github.com/scos-lab/turboquant — Reference reproduction; MSE vs Prod/QJL comparison. Multiple groups report MSE-only beating MSE+QJL for attention metrics at tested bit widths. ANN ranking conclusions remain preliminary pending dedicated benchmarks.
[9] Jégou, H., Douze, M. and Schmid, C. "Product Quantization for Nearest Neighbor Search." IEEE Trans. PAMI 33(1):117-128, 2011.
[10] Ge, T., He, K., Ke, Q. and Sun, J. "Optimized Product Quantization." IEEE Trans. PAMI 36(4):744-755, 2014.
[11] Jääsaari, E., Hyvönen, V., Ceccarello, M., Roos, T. and Aumüller, M. "VIBE: Vector Index Benchmark for Embeddings." arXiv:2505.17810, May 2025.
Appendix A: Reference implementation bugs and Theorem 1 constant
Reference implementation bugs
The Eviox corrections study [7] identified six material bugs in the paper's
reference Python implementation. The most critical is a mathematical error in
the QJL scale factor: the reference code used √(π/(2d)) instead of
√(π/2)/d (Definition 1 in [1]), differing by a factor of √d (≈11× at d=128).
Our current implementation uses the correct formula
(sqrt(FRAC_PI_2) / padded_dim in Rust), so this bug does not affect us.
Other notable Eviox findings: (a) the reference code recomputes codebooks at
every instantiation (we cache in a DashMap); (b) the reference uses float16
for codebook distance computation, causing misassignment at small centroid
spacings (we cast to f32 before quantization). See [7] for the full list.
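The √d discrepancy in the scale-factor bug is easy to verify numerically (this checks the arithmetic of the claim, not the reference code itself):

```rust
/// The reference code used sqrt(pi/(2d)); Definition 1 prescribes
/// sqrt(pi/2)/d. Their ratio is exactly sqrt(d): ~11.3x at d = 128.
use std::f64::consts::FRAC_PI_2;

fn main() {
    let d = 128.0f64;
    let correct = FRAC_PI_2.sqrt() / d; // sqrt(pi/2)/d, as implemented
    let buggy = (FRAC_PI_2 / d).sqrt(); // sqrt(pi/(2d)), the reference bug
    let blowup = buggy / correct;
    assert!((blowup - d.sqrt()).abs() < 1e-9); // ratio is exactly sqrt(d)
    assert!((blowup - 11.31).abs() < 0.01); // ~11x at d = 128
    println!("scale ratio at d=128: {blowup:.2}");
}
```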
Theorem 1 constant
There is an ambiguity in the paper's notation for the MSE bound constant. The
formal proof gives (√3 · π / 2) · 4^{-b} where the constant √3·π/2 ≈ 2.72.
The Eviox report [7] (Item 7) deliberately adopts the alternative parsing
√(3π)/2 ≈ 1.535, claiming it is "consistent with the formal proof." We treat
√3·π/2 ≈ 2.72 as the theorem constant because: (a) the paper's prose
describes the constant as "≈ 2.7," which matches 2.72 not 1.535; and (b) the
paper's reported distortion values (b=2: 0.117, b=3: 0.03) exceed the 1.535-
based bound (b=2: 0.096, b=3: 0.024), ruling out √(3π)/2 as a valid
upper bound on the measured quantity. The definitive resolution requires
checking the exact LaTeX grouping in the ICLR 2026 camera-ready proof. The
paper's "explicit values" (0.36, 0.117, 0.03, 0.009) are the actual computed
distortion of the optimal quantizer, not the bound itself — they are well below
the 2.72/4^b bound.
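The constant-parsing argument reduces to a numeric comparison that can be checked directly (using the reported distortions quoted above):

```rust
/// The paper's reported distortions exceed the sqrt(3*pi)/2 bound but
/// sit below the sqrt(3)*pi/2 bound, supporting the 2.72 reading.
fn main() {
    let c_big = 3.0f64.sqrt() * std::f64::consts::PI / 2.0; // sqrt(3)*pi/2
    let c_small = (3.0f64 * std::f64::consts::PI).sqrt() / 2.0; // sqrt(3*pi)/2
    assert!((c_big - 2.721).abs() < 0.001); // matches the paper's "~2.7" prose
    assert!((c_small - 1.535).abs() < 0.001);
    // Reported distortions at b = 2 and b = 3:
    for (b, reported) in [(2i32, 0.117f64), (3, 0.03)] {
        let rate = 4.0f64.powi(-b);
        assert!(reported > c_small * rate); // violates the 1.535 parsing
        assert!(reported < c_big * rate); // consistent with the 2.72 parsing
    }
    println!("constant check ok");
}
```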
Appendix B: Community findings on QJL
Multiple independent TurboQuant implementations have repeatedly reported a practical finding for KV-cache attention: MSE-only often outperforms MSE+QJL at the same bit budget. The likely mechanism is a variance-bias tradeoff: QJL removes bias in raw inner-product estimation but adds variance, and the softmax nonlinearity amplifies variance more than it penalizes bias. In that setting, allocating all bits to MSE (more centroids, lower quantization variance) can beat splitting the budget between MSE + QJL. This behavior has been reported by multiple groups across Python, C, and Rust implementations [8].
For ANN search, cosine ranking, and other non-softmax vector-search workloads, the evidence is currently less settled. MSE-only is still a reasonable default because it is simpler and better supported by the current implementation work, but the ANN question should be treated as empirical until evaluated on ANN datasets with recall@k and ranking metrics (see Experimental plan).
Appendix C: Alternative rotation strategies
Why not DCT?
DCT is O(B log B) and invertible, but it is a fixed structured transform,
not a random rotation — it does not produce the Beta marginal distribution
(1-x²)^((B-3)/2) (in block dimension B) that TurboQuant's Max-Lloyd centroids
are optimized for. ADSampling only needs approximate coordinate independence
(for hypothesis-testing pruning), so a fixed orthogonal transform like DCT
suffices there. TurboQuant needs a specific known marginal distribution, so only
random orthogonal rotations (QR or SORF) are suitable.
Shared rotation with ADSampling (speculative)
Both TurboQuant and ADSampling apply a random orthogonal rotation to make coordinates independent. If we integrate ADSampling-style dimension pruning (see Stage 3), the same rotation could in principle serve both purposes. However, this is not automatic under the Stage 2 block-decomposed design: ADSampling is formulated around a single full-dimensional random projection whose coordinates can be sequentially sampled, whereas Stage 2 introduces per-block rotations and per-block norm weighting. Reusing one rotation across both systems should be treated as a future research direction that requires new analysis or direct empirical validation. If it proves viable, it would avoid rotating the data twice. The query would also need to be rotated at query time with the same stored transform.