- Start Date: 2026-02-27
- Authors: Connor Tsui
- RFC PR: vortex-data/rfcs#5
# Extension Types

## Summary
We would like to build a more robust system for extension data types (or DTypes). This RFC
proposes a direction for extending the ExtVTable trait to support richer behavior (beyond
forwarding to the storage type), lays out the completed and in-progress work, and identifies the
open questions that remain.
## Motivation
A limitation of the current type system in Vortex is that we cannot easily add new logical types.
For example, the efforts to add FixedSizeList (vortex#4372) and to change List to
ListView (vortex#4699) were both very intrusive.
It is much easier to add wrappers around canonical types (treating the canonical dtype as a
"storage type") and implement some additional logic than to add a new variant to the DType enum.
## Storage DTypes
Extension types work by wrapping an existing canonical DType, called the storage dtype. The
storage dtype is itself a logical type (e.g., Primitive, Struct, List), and the extension
type is a logical wrapper over it that layers on additional semantics such as validation, display
formatting, and (eventually) custom compute logic.
For example, a Timestamp extension type has a Primitive storage dtype. Under the hood, a
timestamp array is just a primitive array of integers, but the extension layer knows that those
integers represent microseconds since the Unix epoch. Similarly, a Union extension type might
use Struct as its storage dtype, wrapping a struct of fields with union-specific dispatch logic.
This separation means that adding a new logical type does not require changes to the core canonical type system, the compressor, or the I/O layer. Extension types get compression for free because data is always read from and written to disk as the underlying storage dtype.
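To make the split concrete, here is a minimal, self-contained sketch of the idea. The names (ExtensionDType, StorageDType, and so on) are illustrative stand-ins rather than Vortex's actual API: an extension type only records an ID plus the canonical storage dtype it wraps, while the data itself stays in the storage representation.

```rust
// Illustrative stand-ins for the storage-dtype idea; not the Vortex API.

/// A hypothetical set of canonical "storage" dtypes.
#[derive(Debug, Clone)]
enum StorageDType {
    Primitive,
    Struct(Vec<String>),
}

/// A hypothetical extension dtype: a logical wrapper over a canonical storage dtype.
#[derive(Debug, Clone)]
struct ExtensionDType {
    /// Identifies the extension semantics, e.g. "vortex.timestamp".
    id: &'static str,
    /// How the data is physically represented.
    storage: StorageDType,
}

fn main() {
    // Logically a timestamp column; physically just a primitive array of i64 values.
    let timestamp = ExtensionDType {
        id: "vortex.timestamp",
        storage: StorageDType::Primitive,
    };
    // The data never leaves its storage representation; only its interpretation changes.
    let micros_since_epoch: Vec<i64> = vec![0, 1_700_000_000_000_000];
    println!("{timestamp:?} wraps {micros_since_epoch:?}");
}
```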
## Current State
Vortex provides an Extension variant of DType to help with this. Currently, implementors can
add a new extension type by defining an extension ID (for example, vortex.time or vortex.date)
and specifying a storage dtype. For example, the time extension types use a primitive storage dtype,
meaning they wrap the primitive scalars or primitive arrays with some extra logic on top (mostly
checking that the stored values represent valid timestamps).
We would like to add many more extension types. Some notable extension types (and their likely storage types) include:
- Matrix / Tensor: This would be an extension over `FixedSizeList`, where dimensions correspond to levels of nesting. There are many open questions on the design of this, but that is out of scope of this RFC.
- Union: The sum type of an algebraic data type, like a Rust enum. One approach is to implement this with a type tag paired with a `Struct` (so `Struct { Primitive, Struct { types } }`; see the sketch after this list). Vortex is well suited to represent this because it can compress each of the type field arrays independently, so we do not need to distinguish between a "Sparse" or "Dense" Union.
- UUID: Since this is a 128-bit number, we likely want to add `FixedSizeBinary`. This is out of scope for this RFC.
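As referenced above, here is a rough, self-contained sketch of that Union layout: a type-tag child (Primitive) plus one equal-length child per variant, so the whole thing fits in a Struct. The names (Value, UnionColumns, encode) are hypothetical, and the placeholder slots stand in for nulls.

```rust
// Hypothetical sketch of the `Struct { Primitive, Struct { types } }` union layout.

/// The logical union type we want to represent, like a Rust enum.
enum Value {
    Int(i64),
    Text(String),
}

/// A columnar, struct-like layout: a tag child plus one equal-length child per variant.
struct UnionColumns {
    tags: Vec<u8>,      // Primitive child: 0 = Int, 1 = Text
    ints: Vec<i64>,     // child array for the Int variant
    texts: Vec<String>, // child array for the Text variant
}

fn encode(values: &[Value]) -> UnionColumns {
    let mut cols = UnionColumns { tags: Vec::new(), ints: Vec::new(), texts: Vec::new() };
    for value in values {
        match value {
            Value::Int(i) => {
                cols.tags.push(0);
                cols.ints.push(*i);
                // Placeholder slot; in a real array this would be null.
                cols.texts.push(String::new());
            }
            Value::Text(s) => {
                cols.tags.push(1);
                cols.ints.push(0); // placeholder slot, again effectively null
                cols.texts.push(s.clone());
            }
        }
    }
    cols
}
```

Because each variant's child column is compressed independently, the placeholder runs in inactive variants are cheap, which is why a single layout can serve both the "Sparse" and "Dense" cases.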
The issue with the current system is that it only forwards logic to the underlying storage type. The only other behavior we support is serializing and pretty-printing extension arrays. This means that we cannot define custom compute logic for extension types.
Take the time extension types as an example of where this limitation does not matter. If we want to
run a compare expression over a timestamp array, we just run the compare over the underlying
primitive array. For simple types like timestamps, this is sufficient (and this is what we do right
now). For types like Tensors (which are simply type aliases over FixedSizeList), this is also
fine.
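As a rough sketch of that forwarding (with hypothetical types, not the Vortex compute API), a comparison over a timestamp array can simply delegate to a comparison over the storage integers:

```rust
// Hypothetical types; a compare over the extension array forwards to its storage array.

/// A timestamp array is, physically, a primitive array of i64 microseconds.
struct TimestampArray {
    storage: Vec<i64>, // microseconds since the Unix epoch
}

/// "Less than" over timestamps is just "less than" over the storage integers.
fn less_than(lhs: &TimestampArray, rhs: &TimestampArray) -> Vec<bool> {
    lhs.storage
        .iter()
        .zip(&rhs.storage)
        .map(|(l, r)| l < r)
        .collect()
}
```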
However, for more complex types like UUID, Union, or JSON, forwarding to the storage type is likely
insufficient as these types need custom compute logic. Given that, we want a more robust
implementation path instead of wrapping ExtensionArray and performing significant internal
dispatch work.
## Design

### Background
vortex#6081 introduced vtables (virtual tables,
or Rust unit structs with methods) for extension DTypes. Each extension type (e.g., Timestamp)
now implements ExtDTypeVTable, which handles validation, serialization, and metadata.
The type-erased ExtDTypeRef carries this vtable with it inside DType::Extension.
There were a few blockers (detailed in the tracking issue vortex#6547), but now that those have been resolved, we can move forward.
### Proposed Design
Now that vortex-scalar and vortex-dtype have been merged into vortex-array, we can place
all extension logic (for types, scalars, and arrays) onto an ExtVTable (renamed from
ExtDTypeVTable).
It will look something like the following:
```rust
// Note: naming should be considered unstable.

/// The public API for defining new extension types.
///
/// This is the non-object-safe trait that plugin authors implement to define a new extension
/// type. It specifies the type's identity, metadata, serialization, and validation.
pub trait ExtVTable: 'static + Sized + Send + Sync + Clone + Debug + Eq + Hash {
    /// Associated type containing the deserialized metadata for this extension type.
    type Metadata: 'static + Send + Sync + Clone + Debug + Display + Eq + Hash;

    /// A native Rust value that represents a scalar of the extension type.
    ///
    /// The value only represents non-null values. We denote nullable values as `Option<Value>`.
    type NativeValue<'a>: Display;

    /// Returns the ID for this extension type.
    fn id(&self) -> ExtId;

    // Methods related to the extension `DType`.

    /// Serialize the metadata into a byte vector.
    fn serialize_metadata(&self, metadata: &Self::Metadata) -> VortexResult<Vec<u8>>;

    /// Deserialize the metadata from a byte slice.
    fn deserialize_metadata(&self, metadata: &[u8]) -> VortexResult<Self::Metadata>;

    /// Validate that the given storage type is compatible with this extension type.
    fn validate_dtype(&self, metadata: &Self::Metadata, storage_dtype: &DType) -> VortexResult<()>;

    // Methods related to the extension scalar values.

    /// Validate the given storage value is compatible with the extension type.
    ///
    /// By default, this calls [`unpack_native()`](ExtVTable::unpack_native) and discards the
    /// result.
    ///
    /// # Errors
    ///
    /// Returns an error if the storage [`ScalarValue`] is not compatible with the extension type.
    fn validate_scalar_value(
        &self,
        metadata: &Self::Metadata,
        storage_dtype: &DType,
        storage_value: &ScalarValue,
    ) -> VortexResult<()> {
        self.unpack_native(metadata, storage_dtype, storage_value)
            .map(|_| ())
    }

    /// Validate and unpack a native value from the storage [`ScalarValue`].
    ///
    /// Note that [`ExtVTable::validate_dtype()`] is always called first to validate the storage
    /// [`DType`], and the [`Scalar`](crate::scalar::Scalar) implementation will verify that the
    /// storage value is compatible with the storage dtype on construction.
    ///
    /// # Errors
    ///
    /// Returns an error if the storage [`ScalarValue`] is not compatible with the extension type.
    fn unpack_native<'a>(
        &self,
        metadata: &'a Self::Metadata,
        storage_dtype: &'a DType,
        storage_value: &'a ScalarValue,
    ) -> VortexResult<Self::NativeValue<'a>>;

    // Methods related to the extension `ArrayRef`.

    fn validate_array(&self, metadata: &Self::Metadata, storage_array: &ArrayRef) -> VortexResult<()>;

    fn cast_array(&self, metadata: &Self::Metadata, array: &ArrayRef, target: &DType) -> VortexResult<ArrayRef> { ... }

    // Additional compute methods TBD.
}
```
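To get a feel for the trait, here is a rough sketch of what an implementation for a timestamp extension type might look like. The signatures are simplified (plain Result and a stand-in storage dtype instead of VortexResult, DType, and ScalarValue), and the names TimestampVTable, TimeUnit, and Timestamp are hypothetical; the point is only to illustrate the division of labor between validate_dtype and unpack_native.

```rust
use std::fmt;

/// Stand-in for the canonical storage dtypes relevant to this example.
#[derive(Debug, PartialEq)]
enum SimpleDType {
    I64,
    Utf8,
}

/// Metadata for the timestamp extension: the unit of the stored integers.
#[derive(Debug, Clone)]
enum TimeUnit {
    Millis,
    Micros,
}

/// Stand-in for `NativeValue<'a>`: the unpacked, natively typed scalar.
struct Timestamp {
    value: i64,
    unit: TimeUnit,
}

impl fmt::Display for Timestamp {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{} {:?} since the Unix epoch", self.value, self.unit)
    }
}

/// Hypothetical vtable for a "vortex.timestamp" extension type.
struct TimestampVTable;

impl TimestampVTable {
    fn id(&self) -> &'static str {
        "vortex.timestamp"
    }

    /// Mirrors `validate_dtype`: timestamps must be stored as 64-bit integers.
    fn validate_dtype(&self, _metadata: &TimeUnit, storage: &SimpleDType) -> Result<(), String> {
        if *storage == SimpleDType::I64 {
            Ok(())
        } else {
            Err(format!("timestamp storage must be I64, got {storage:?}"))
        }
    }

    /// Mirrors `unpack_native`: reinterpret the raw storage integer as a native value.
    fn unpack_native(&self, metadata: &TimeUnit, storage_value: i64) -> Result<Timestamp, String> {
        Ok(Timestamp { value: storage_value, unit: metadata.clone() })
    }
}
```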
Most of the implementation work will be making sure that ExtDTypeRef (which we pass around as the
Extension variant of DType) has the correct methods that access the internal, type-erased
ExtVTable.
Take extension scalars as an example. The only behavior we need from extension scalars is validating
that they have correct values, displaying them, and unpacking them into native types. So we added
these methods to ExtDTypeRef:
```rust
impl ExtDTypeRef {
    /// Formats an extension scalar value using the current dtype for metadata context.
    pub fn fmt_storage_value<'a>(
        &'a self,
        f: &mut fmt::Formatter<'_>,
        storage_value: &'a ScalarValue,
    ) -> fmt::Result { ... }

    /// Validates that the given storage scalar value is valid for this dtype.
    pub fn validate_storage_value(&self, storage_value: &ScalarValue) -> VortexResult<()> { ... }
}
```
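The pattern this enables is delegation: a type-erased extension scalar does not need to know how to print or validate itself, because its dtype does. A minimal sketch of that shape, using stand-in types rather than the real ExtDTypeRef and ScalarValue:

```rust
use std::fmt;

/// Stand-in for `ExtDTypeRef`: holds the metadata needed to interpret a storage value.
struct ExtDType {
    id: &'static str,
}

impl ExtDType {
    /// Analogue of `fmt_storage_value`: format a raw storage value with extension semantics.
    fn fmt_storage_value(&self, f: &mut fmt::Formatter<'_>, storage_value: i64) -> fmt::Result {
        write!(f, "{}({storage_value})", self.id)
    }

    /// Analogue of `validate_storage_value`: reject values the extension type cannot hold.
    fn validate_storage_value(&self, storage_value: i64) -> Result<(), String> {
        if storage_value >= 0 {
            Ok(())
        } else {
            Err(format!("{} values must be non-negative", self.id))
        }
    }
}

/// Stand-in for an extension `Scalar`: a storage value paired with its extension dtype.
struct ExtScalar {
    dtype: ExtDType,
    storage_value: i64,
}

impl fmt::Display for ExtScalar {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // All extension-specific logic lives behind the dtype; the scalar stays type-erased.
        self.dtype.fmt_storage_value(f, self.storage_value)
    }
}
```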
Open question: What should the API for extension arrays look like? The answer will determine
what additional methods ExtDTypeRef needs beyond the scalar-related ones shown above.
## Compatibility
This should not break anything because extension types are mostly related to in-memory APIs (since data is read from and written to disk as the storage type).
## Drawbacks
If forwarding to the storage type turns out to be sufficient for all extension types, the additional vtable surface area adds complexity without clear benefit.
## Alternatives
We could have many ExtensionArray wrappers with custom logic. This approach would be clunky and
may not scale.
## Prior Art
Apache Arrow allows defining extension types and also provides a set of canonical extension types.
## Unresolved Questions
- Is forwarding to the storage type insufficient, and which extension types genuinely need custom compute logic?
- What should the `ExtVTable` API for extension arrays look like? What methods beyond `validate_array` are needed?
- How should compute expressions be defined and dispatched for extension types?
## Future Possibilities
If we can get extension types working well, we can add all of the following types:
- DateTimeParts (`Primitive`)
- Matrix (`FixedSizeList`)
- Tensor (`FixedSizeList`)
- UUID (Do we need to add `FixedSizeBinary` as a canonical type?)
- JSON (`UTF8`)
- PDX: https://arxiv.org/pdf/2503.04422v1 (`FixedSizeList`)
- Union
  - Sparse (`Struct { Primitive, Struct { types } }`)
  - Dense[^1]
- Map (`List<Struct { K, V }>`)
- Tags: See this discussion, where we think we can represent this with (`ListView<Utf8>`)
- `Struct` but with protobuf-style field numbers (`Struct`)
- NOT Variant: see RFC 0015 (Variant Type). Variant cannot be an extension type because there is no way to define a storage dtype when the schema is not known ahead of time for each row. Instead, Variant will have its own `DType` variant.
- And likely more.
[^1]: Struct doesn't work here because children can have different lengths, but what we could do
is simply force the inner Struct { types } to hold SparseArray fields, which would
effectively be the exact same but with the overhead of tracking indices for each of the child
fields. In that case, it might just be better to always use a "sparse" union and let the
compressor decide what to do.