Performance
Tova generates high-performance code through compile-time optimizations and performance decorators. You write clean, expressive code and the compiler makes it fast. These features range from automatic rewrites (zero effort) to explicit decorators (@wasm, @fast) for compute-intensive workloads.
Benchmark Results
All benchmarks run on Apple Silicon. Each reports the best of 3 runs.
| Benchmark | Time | Technique |
|---|---|---|
| Sort 1M integers | 27ms | Rust FFI radix sort (O(n)) |
| JSON parse 11MB | 37ms | SIMD-accelerated parser |
| JSON stringify 100K objects | 19ms | Native serialization |
| Fibonacci iterative (n=40) | 20ms | JIT-optimized tight loop |
| @fast dot product 1M | 97ms | Float64Array coercion |
| N-body simulation | 22ms | Floating-point optimization |
| @wasm integer compute (200x500K) | 117ms | Native WebAssembly binary |
| Array find (1M items, 100x) | 7ms | Optimized builtins |
| 10-arm match dispatch (10M iter) | 18.8ms | Compiled to if-chain |
| @fast Kahan sum 1M | 380ms | Float64Array + compensated summation |
| @fast vector add 1M | 90ms | TypedArray element-wise ops |
| Prime sieve 10M | 25ms | Uint8Array fill optimization |
| Result.map 3x chain (10M iter) | 10ms | Compile-time fusion |
| Result create+check (10M iter) | 17ms | Scalar replacement (zero allocation) |
| Option create+unwrapOr (10M iter) | 10ms | Scalar replacement (zero allocation) |
| unwrapOr alternating (10M iter) | 7ms | Compile-time devirtualization |
HTTP Server
| Mode | Requests/sec |
|---|---|
| Fast mode (auto-detected) | 108,000 |
| Standard mode | 90,000 |
HTTP fast mode is automatically enabled when the compiler detects a simple server (no middleware, sessions, or WebSockets). It emits synchronous handlers with direct if-chain dispatch instead of a middleware pipeline.
Optimization Impact
Several benchmarks improved dramatically through compiler optimizations:
| Benchmark | Before | After | Improvement |
|---|---|---|---|
| Prime sieve 10M | 78ms | 25ms | 3.1x (array fill + Uint8Array) |
| Result.map 3x chain | 101ms | 10ms | 10x (map chain fusion) |
| Result create+check 10M | 36ms | 17ms | 2.1x (scalar replacement) |
| Option create+unwrapOr 10M | 190ms | 10ms | 19x (scalar replacement) |
| HTTP req/s | 66K | 108K | 1.6x (fast mode) |
| Lexer throughput | 0.079ms/iter | 0.045ms/iter | 1.8x (substring extraction) |
Automatic Optimizations
These happen at compile time. You get them for free without changing your code.
Array Fill Pattern Detection
When the compiler sees an empty array followed by a push loop with a constant value, it replaces the entire pattern with a single pre-allocated array:
// You write:
var scores = []
for i in range(1000) {
scores.push(0)
}
// Compiler generates: new Array(1000).fill(0)
For boolean fills, the compiler upgrades to a Uint8Array for contiguous memory:
// You write:
var flags = []
for i in range(1000) {
flags.push(false)
}
// Compiler generates: new Uint8Array(1000)
Impact: 3x faster for the prime sieve benchmark (78ms to 25ms). The Uint8Array upgrade gives contiguous memory access instead of boxed boolean objects.
This optimization triggers when:
- The variable starts as an empty array literal []
- The next statement is a for i in range(n) loop
- The loop body is a single push() call with a value that doesn't reference the loop variable
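The rewrite described above can be sketched in plain JavaScript. This is a hand-written illustration, not actual compiler output; the function names `fillByPush`, `fillPreallocated`, and `falseFlags` are hypothetical.

```javascript
// Literal translation of the push loop: the array grows one element at a time.
function fillByPush(n, value) {
  const out = [];
  for (let i = 0; i < n; i++) out.push(value);
  return out;
}

// The pre-allocated form named in the text: one allocation, one fill.
function fillPreallocated(n, value) {
  return new Array(n).fill(value);
}

// Boolean case: a Uint8Array starts zeroed, so 0 stands in for false.
function falseFlags(n) {
  return new Uint8Array(n);
}

const flags = falseFlags(4);
flags[2] = true; // typed arrays convert the value to a number, so this stores 1
```

One consequence of the Uint8Array form in plain JavaScript: elements read back as 0 or 1 rather than false or true. Truthiness checks such as `if flags[i]` behave the same, but a strict comparison against `false` would not.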
Range Loop Optimization
for i in range(n) compiles to a C-style for loop instead of allocating an array:
for i in range(1000000) {
// loop body
}
// Compiles to: for (let i = 0; i < 1000000; i++) { ... }
// NOT: for (const i of range(1000000)) { ... }
This avoids allocating a million-element array just to iterate over indices.
Result.map Chain Fusion
Chains of .map() calls on Ok or Some values are fused into a single operation, eliminating all intermediate allocations:
result = Ok(5)
.map(fn(x) x * 2)
.map(fn(x) x + 3)
.map(fn(x) x * 10)
// Compiler generates: Ok((((5 * 2) + 3) * 10))
// Instead of creating 3 intermediate Ok wrappers
Impact: 10x faster for chains of 3+ maps (101ms to 10ms). This is a zero-cost abstraction -- the functional style compiles to the same code as manual computation.
The optimization applies when:
- The receiver is an Ok() or Some() call
- Each .map() argument is a single-expression lambda with one parameter
- Two or more .map() calls are chained
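To make the saving concrete, here is a rough JavaScript sketch of the unfused and fused forms. The `Ok` class is a minimal stand-in written for this example; Tova's actual runtime wrapper is assumed to differ.

```javascript
// Hypothetical minimal Ok wrapper, for illustration only.
class Ok {
  constructor(value) { this.value = value; }
  map(f) { return new Ok(f(this.value)); }
  unwrap() { return this.value; }
}

// Unfused: every .map() call allocates a new wrapper and invokes a closure.
const chained = new Ok(5)
  .map((x) => x * 2)
  .map((x) => x + 3)
  .map((x) => x * 10);

// Fused: the lambda bodies are substituted into one expression,
// so only the final wrapper is created.
const fused = new Ok((((5 * 2) + 3) * 10));
```

Both forms hold the same value; the fused one skips the intermediate wrappers and the closure calls.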
Result/Option Devirtualization
When the compiler sees a method call on a known Result or Option constructor, it inlines the method body directly, eliminating the object allocation entirely:
// You write:
value = Ok(42).unwrap()
// Compiler generates: const value = 42;
// No Ok wrapper is ever created
This applies to all methods on Ok(), Err(), Some(), and None:
Ok(x).isOk() // compiles to: true
Err(e).unwrapOr(0) // compiles to: 0
None.isSome() // compiles to: false
Some(x).unwrapOr(d) // compiles to: x
For .map() and .flatMap() with simple lambdas, the lambda body is inlined:
Ok(5).map(fn(x) x * 3) // compiles to: Ok((5 * 3))
Combined with map chain fusion, a chain like Ok(val).map(f).map(g).unwrap() compiles down to a single arithmetic expression with zero allocations.
Impact: unwrapOr on pre-created values runs at 7ms/10M iterations -- faster than Go.
Scalar Replacement for Local Results
When a Result or Option is created via an if/else and only accessed through safe methods (.isOk(), .unwrap(), .unwrapOr(), etc.), the compiler replaces the object with a boolean+value pair:
// You write:
r = if x > 0 { Ok(x * 2) } else { Err("negative") }
if r.isOk() { r.unwrap() } else { -1 }
// Compiler generates:
// let r__ok, r__v;
// if (x > 0) { r__ok = true; r__v = x * 2; }
// else { r__ok = false; r__v = "negative"; }
// if (r__ok) { r__v } else { -1 }
No Ok or Err object is ever allocated. This is the same strategy that JVM JIT compilers use ("escape analysis" / "scalar replacement"), but applied at compile time.
Impact: Result create+check pattern runs at 17ms/10M iterations (was 36ms). Option create+unwrapOr runs at 10ms/10M iterations (was 190ms -- a 19x improvement).
The optimization triggers when:
- A variable is assigned from an if/else where all branches return Result or Option constructors
- All subsequent uses of the variable are safe method calls (no bare references, no passing to functions, no returning the value)
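As a runnable illustration of the transformation, the sketch below puts an object-based version next to a scalar-replaced version of the same logic. The `Ok` and `Err` classes and both function names are hypothetical stand-ins, not Tova's runtime.

```javascript
// Hypothetical wrappers, only to give the "before" form something to allocate.
class Ok {
  constructor(value) { this.value = value; }
  isOk() { return true; }
  unwrap() { return this.value; }
}
class Err {
  constructor(error) { this.error = error; }
  isOk() { return false; }
  unwrap() { throw new Error(String(this.error)); }
}

// Before: one wrapper object is allocated on every call.
function withObjects(x) {
  const r = x > 0 ? new Ok(x * 2) : new Err("negative");
  return r.isOk() ? r.unwrap() : -1;
}

// After: the wrapper is replaced by a boolean tag plus a payload variable.
function withScalars(x) {
  let rOk, rV;
  if (x > 0) { rOk = true; rV = x * 2; }
  else { rOk = false; rV = "negative"; }
  return rOk ? rV : -1;
}
```

The two functions return the same result for every input, which is what makes the replacement safe when the value never escapes the local scope.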
Lexer Fast Path
The lexer uses substring() extraction instead of character-by-character string concatenation for identifiers and numbers.
Impact: 43% less lexing time per iteration (0.079ms to 0.045ms, about 1.8x), 9% faster full compilation pipeline.
HTTP Fast Mode
The compiler detects simple servers at compile time and emits optimized code:
// When the server has no middleware, sessions, WebSockets, or error handlers:
server {
fn get_users() -> [User] { users }
fn add_user(name: String) -> User { ... }
}
// Compiler emits:
// - Synchronous handler (not async)
// - Direct if-chain route dispatch (no middleware pipeline)
// - No AsyncLocalStorage, request IDs, or per-request timing
Impact: 64% improvement (66K to 108K req/s).
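The difference between the two dispatch styles might look roughly like the following in JavaScript. Everything here is an assumption made for illustration: the route paths, the request shape, and the function names are hypothetical and are not taken from the compiler's output.

```javascript
// Hypothetical handlers standing in for get_users and add_user.
const users = [{ name: "ada" }];
const getUsers = () => users;
const addUser = (name) => {
  const user = { name };
  users.push(user);
  return user;
};

// Fast-mode style: synchronous, direct if-chain on method and path.
function routeIfChain(req) {
  if (req.method === "GET" && req.path === "/users") {
    return { status: 200, body: getUsers() };
  }
  if (req.method === "POST" && req.path === "/users") {
    return { status: 200, body: addUser(req.body.name) };
  }
  return { status: 404, body: null };
}

// Standard style: every request awaits a list of middleware before routing.
async function pipelineDispatch(middleware, req) {
  for (const mw of middleware) await mw(req);
  return routeIfChain(req);
}
```

In this sketch the pipeline version pays for a promise per request and one call per middleware even when the list is empty, which is the kind of overhead the fast path avoids.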
@wasm -- WebAssembly Compilation
The @wasm decorator compiles a function directly to WebAssembly binary format. No external toolchain required. The Tova compiler includes a complete WASM code generator that produces raw binary bytes embedded in the output.
@wasm fn fibonacci(n: Int) -> Int {
if n <= 1 { return n }
fibonacci(n - 1) + fibonacci(n - 2)
}
// Call it like any other function
print(fibonacci(40))
How it works
The compiler's WASM code generator (src/codegen/wasm-codegen.js):
- Parses the function AST
- Infers types from annotations and context
- Emits WASM binary sections (type, function, export, code)
- Uses LEB128 encoding for variable-length integers
- Handles i32/f64 type conversions at instruction boundaries
- Generates glue code that instantiates the WASM module at runtime
If WASM compilation fails (unsupported operation), the compiler falls back to standard codegen with a warning.
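The section layout and LEB128 encoding mentioned above can be sketched by assembling a tiny module by hand. This is not the code in wasm-codegen.js; it is a minimal example, assuming a single exported function `add(i32, i32) -> i32`, that shows the same four sections and the instantiation step. The helper names `uleb128` and `section` are hypothetical.

```javascript
// Unsigned LEB128: 7 payload bits per byte, high bit set on all but the last byte.
function uleb128(value) {
  const out = [];
  do {
    let byte = value & 0x7f;
    value >>>= 7;
    if (value !== 0) byte |= 0x80;
    out.push(byte);
  } while (value !== 0);
  return out;
}

// A section is: id byte, LEB128-encoded payload length, payload bytes.
const section = (id, payload) => [id, ...uleb128(payload.length), ...payload];

const I32 = 0x7f;
const addModuleBytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, // magic "\0asm"
  0x01, 0x00, 0x00, 0x00, // binary format version 1
  // type section: one function type, (i32, i32) -> i32
  ...section(1, [0x01, 0x60, 0x02, I32, I32, 0x01, I32]),
  // function section: function 0 uses type 0
  ...section(3, [0x01, 0x00]),
  // export section: name "add", kind 0 (function), index 0
  ...section(7, [0x01, 0x03, 0x61, 0x64, 0x64, 0x00, 0x00]),
  // code section: no locals; local.get 0, local.get 1, i32.add, end
  ...section(10, [0x01, 0x07, 0x00, 0x20, 0x00, 0x20, 0x01, 0x6a, 0x0b]),
]);

// Glue: compile and instantiate synchronously, then call the export.
const wasmModule = new WebAssembly.Module(addModuleBytes);
const instance = new WebAssembly.Instance(wasmModule);
const add = instance.exports.add;
```

Synchronous compilation is fine for a module this small; engines may restrict it for large modules on a browser's main thread, in which case `WebAssembly.instantiate` is the asynchronous alternative.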
Supported operations
| Category | Supported |
|---|---|
| Types | Int (i32), Float (f64), Bool (i32) |
| Arithmetic | +, -, *, /, % |
| Comparison | ==, !=, <, >, <=, >= |
| Logic | and, or, not |
| Control flow | if/elif/else, while, for |
| Variables | var, assignment |
| Calls | Self-recursion, other @wasm functions |
Limitations
- Only numeric types and booleans -- no strings, arrays, or objects
- Can only call other @wasm functions (no external calls)
- No closures or captured variables
- Assignment targets must be simple variables
Performance
| Benchmark | @wasm | Standard codegen |
|---|---|---|
| compute 200x500K | 117ms | 170ms |
| fibonacci(40) | 554ms | 936ms |
The WASM path avoids JIT warmup and deoptimization overhead for pure numeric computation.
When to use @wasm
Use @wasm for CPU-bound numeric kernels: recursive algorithms, simulations, mathematical computations. The sweet spot is tight loops over integers or floats with no string/object manipulation.
@fast -- TypedArray Optimization
The @fast decorator enables TypedArray coercion for function parameters. Array parameters with numeric type annotations are automatically converted to typed arrays at function entry, enabling native-speed numeric operations.
@fast fn dot_product(a: [Float], b: [Float]) -> Float {
typedDot(a, b)
}
@fast fn scale_vector(v: [Float], factor: Float) -> [Float] {
typedScale(v, factor)
}
@fast fn normalize(v: [Float]) -> [Float] {
n = typedNorm(v)
typedScale(v, 1.0 / n)
}
How it works
- The compiler detects @fast on a function declaration
- Array parameters with numeric type annotations (e.g., [Float], [Int]) are wrapped in TypedArray constructors at function entry
- For-loops over typed arrays use index-based iteration (avoids iterator protocol overhead)
- Numeric array literals in the function body are emitted as typed arrays
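A plausible shape for the emitted JavaScript is sketched below for a dot-product function with two [Float] parameters. The exact code the compiler generates is not shown in this document, so the `instanceof` check and the function name `dot` are assumptions.

```javascript
// Hypothetical output for: @fast fn dot(a: [Float], b: [Float]) -> Float
function dot(a, b) {
  // Coerce at function entry; skip the copy if the caller already passed a Float64Array.
  const ta = a instanceof Float64Array ? a : new Float64Array(a);
  const tb = b instanceof Float64Array ? b : new Float64Array(b);

  let sum = 0;
  // Index-based loop instead of for...of, so no iterator objects are created.
  for (let i = 0; i < ta.length; i++) {
    sum += ta[i] * tb[i];
  }
  return sum;
}
```

Constructing a typed array from a plain array copies every element, so in a sketch like this the coercion costs O(n) per call. Passing data that is already a typed array would avoid that copy.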
Type mapping
| Tova Annotation | TypedArray | Bytes per element |
|---|---|---|
| [Int] | Int32Array | 4 |
| [Float] | Float64Array | 8 |
| [Byte] | Uint8Array | 1 |
| [Int8] | Int8Array | 1 |
| [Int16] | Int16Array | 2 |
| [Int32] | Int32Array | 4 |
| [Uint8] | Uint8Array | 1 |
| [Uint16] | Uint16Array | 2 |
| [Uint32] | Uint32Array | 4 |
| [Float32] | Float32Array | 4 |
| [Float64] | Float64Array | 8 |
Typed stdlib functions
These are optimized for TypedArray input and available without imports:
| Function | Description |
|---|---|
| typedSum(arr) | Sum with Kahan compensation (minimizes float error) |
| typedDot(a, b) | Dot product of two arrays |
| typedNorm(arr) | L2 norm (Euclidean length) |
| typedAdd(a, b) | Element-wise addition (returns new typed array) |
| typedScale(arr, s) | Multiply every element by scalar |
| typedMap(arr, f) | Map function over elements, preserving type |
| typedReduce(arr, f, init) | Reduce with typed array input |
| typedSort(arr) | Sort (returns new typed array) |
| typedZeros(n) | Float64Array of zeros |
| typedOnes(n) | Float64Array of ones |
| typedFill(n, val) | New Float64Array filled with value |
| typedRange(start, end, step) | Float64Array range |
| typedLinspace(start, end, n) | n evenly-spaced Float64Array values |
Performance
| Operation (1M elements, 100 iterations) | Time |
|---|---|
| Dot product | 97ms |
| Kahan sum | 380ms |
| Vector add | 90ms |
| Vector norm | 324ms |
Example: numerically stable summation
@fast fn precise_sum(data: [Float]) -> Float {
typedSum(data)
}
// Regular sum of [1e16, 1, -1e16] might lose the 1
// typedSum uses Kahan compensated summation to preserve it
result = precise_sum([1e16, 1.0, -1e16])
print(result) // 1.0 (not 0)
When to use @fast
Use @fast for array-heavy numeric code: signal processing, statistics, linear algebra, physics simulations, financial calculations. TypedArrays give the runtime contiguous, unboxed memory to work with, which enables SIMD-level optimization.
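For readers curious about the compensated summation behind typedSum, here is a standalone JavaScript sketch of naive summation, textbook Kahan summation, and Neumaier's variant. Which variant typedSum actually implements is not stated here, so treat this as background rather than a description of the stdlib. The distinction matters for the example above: in IEEE double precision, textbook Kahan returns 0 for [1e16, 1, -1e16], while Neumaier's variant returns the 1.0 quoted there.

```javascript
// Plain left-to-right summation.
function naiveSum(xs) {
  let sum = 0;
  for (const x of xs) sum += x;
  return sum;
}

// Textbook Kahan: carry the rounding error of each addition into the next term.
function kahanSum(xs) {
  let sum = 0;
  let c = 0; // running compensation
  for (const x of xs) {
    const y = x - c;
    const t = sum + y;
    c = (t - sum) - y;
    sum = t;
  }
  return sum;
}

// Neumaier's variant: accumulate the error separately and add it once at the end.
// It also handles the case where the incoming term is larger than the running sum.
function neumaierSum(xs) {
  let sum = 0;
  let c = 0;
  for (const x of xs) {
    const t = sum + x;
    c += Math.abs(sum) >= Math.abs(x) ? (sum - t) + x : (x - t) + sum;
    sum = t;
  }
  return sum + c;
}
```

With the input [1e16, 1, 1, -1e16], naive summation yields 0 while both compensated versions yield 2, which shows the general benefit. The three-element input is the harder case that separates the two compensated variants.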
parallel_map -- Multi-Core Worker Pool
parallel_map distributes array processing across all CPU cores using a persistent worker pool:
results = await parallelMap(large_array, fn(item) {
expensive_computation(item)
})
How it works
- On first call, creates one worker thread per CPU core
- Workers persist and are reused for all subsequent calls (zero startup overhead)
- The array is chunked evenly across workers
- Each worker reconstructs the mapped function and processes its chunk
- Results are gathered and returned in order
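The chunk-and-gather steps above can be illustrated without real worker threads. In the sketch below each "worker" is just an async function, which is an assumption made to keep the example self-contained; the names `chunkEvenly` and `chunkedMap` are hypothetical and the actual pool implementation is not shown in this document.

```javascript
// Split items into at most `workers` contiguous chunks whose sizes differ by at most one.
function chunkEvenly(items, workers) {
  const k = Math.max(1, Math.min(workers, items.length));
  const base = Math.floor(items.length / k);
  const extra = items.length % k; // the first `extra` chunks get one more item
  const chunks = [];
  let start = 0;
  for (let i = 0; i < k; i++) {
    const size = base + (i < extra ? 1 : 0);
    chunks.push(items.slice(start, start + size));
    start += size;
  }
  return chunks;
}

// Stand-in for the pool: process every chunk concurrently, then gather.
async function chunkedMap(items, fn, workers = 4) {
  const chunks = chunkEvenly(items, workers);
  const parts = await Promise.all(chunks.map(async (chunk) => chunk.map(fn)));
  // Promise.all preserves the order of its inputs, so results stay in input order.
  return parts.flat();
}
```

Because chunks are contiguous and gathered in chunk order, the output order matches the input order regardless of which chunk finishes first.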
// Specify number of workers explicitly
results = await parallelMap(data, process_item, 8)
Performance
| Implementation | Time (64 items x 10M work) | Speedup |
|---|---|---|
| Sequential map() | 1,355ms | 1.0x |
| parallel_map (pooled) | 379ms | 3.57x |
The persistent pool eliminates worker creation overhead. Second call latency drops from 50-90ms (per-call workers) to 6ms (pooled workers).
When to use parallel_map
- Array of 4+ elements (falls back to sequential below this)
- Each element requires significant computation (CPU-bound, not I/O-bound)
- For I/O-bound work (network requests, file reads), use async with Promise.all instead
Radix Sort via Rust FFI
Tova's sorted() function uses a 3-tier strategy for maximum performance on numeric arrays:
- Arrays > 128 elements, numeric, FFI available: Radix sort via Rust FFI (O(n) time)
- Arrays > 128 elements, numeric, no FFI: Float64Array.sort() fallback
- Arrays <= 128 elements: Standard comparator sort (low overhead)
| Size | Time (Rust FFI) |
|---|---|
| 10K | 0.4ms |
| 100K | 2.4ms |
| 1M | 23.6ms |
Radix sort achieves O(n) time complexity vs comparison sort's O(n log n), which is why it scales efficiently to large arrays.
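To show why radix sort runs in linear time for fixed-width keys, here is a small least-significant-digit radix sort in JavaScript. It is an illustration of the algorithm only: the real implementation lives behind the Rust FFI and is not shown here, and this sketch is restricted to unsigned 32-bit integers, whereas the production path presumably also handles negative and floating-point values.

```javascript
// LSD radix sort for unsigned 32-bit integers: four passes, one byte per pass.
function radixSortU32(input) {
  let src = Uint32Array.from(input); // copy, so the input is left untouched
  let dst = new Uint32Array(src.length);

  for (let shift = 0; shift < 32; shift += 8) {
    // Histogram of the current byte, offset by one slot for the prefix sum.
    const counts = new Uint32Array(257);
    for (let i = 0; i < src.length; i++) {
      counts[((src[i] >>> shift) & 0xff) + 1]++;
    }
    // Prefix sums turn counts into the starting offset of each bucket.
    for (let b = 0; b < 256; b++) {
      counts[b + 1] += counts[b];
    }
    // Stable scatter: equal bytes keep their relative order from the previous pass.
    for (let i = 0; i < src.length; i++) {
      const bucket = (src[i] >>> shift) & 0xff;
      dst[counts[bucket]++] = src[i];
    }
    [src, dst] = [dst, src];
  }
  return src;
}
```

Each pass touches every element a constant number of times and the number of passes is fixed by the key width, so total work grows linearly with the array length.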
filled() -- Pre-allocated Arrays
filled(n, value) creates a pre-allocated array in a single operation:
grid = filled(1000, 0)
flags = filled(256, false)
names = filled(10, "unknown")
This compiles to new Array(n).fill(val) and is faster than building arrays with push loops because it allocates the full size upfront.
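One property of the underlying JavaScript is worth knowing. `Array.prototype.fill` stores the same value in every slot, so if the fill value is an object or array, every slot refers to that one object. Whether filled() adds any handling for this case is not stated here; the sketch below only demonstrates the plain JavaScript behavior of the form it is said to compile to.

```javascript
// Primitive fill values are safe: each slot holds its own copy of the number.
const zeros = new Array(3).fill(0);

// An object fill value is shared: all three slots point at ONE array.
const shared = new Array(3).fill([]);
shared[0].push("x"); // shared[1] and shared[2] now also contain "x"

// Building each slot from a factory gives independent values.
const independent = Array.from({ length: 3 }, () => []);
independent[0].push("x"); // only the first slot changes
```

For numbers, booleans, and strings, as in the examples above, this distinction does not arise.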
Combining Features
These features compose naturally:
@fast fn process_batch(data: [Float]) -> Float {
typedDot(data, data) |> Math.sqrt()
}
// Process many batches in parallel across all CPU cores
norms = await parallelMap(all_batches, process_batch)
For the most demanding workloads, layer them:
@wasm fn kernel(x: Float, y: Float) -> Float {
// Inner computation as native WASM
var result = 0.0
var i = 0
while i < 1000 {
result = result + x * y / (1.0 + result)
i = i + 1
}
result
}
@fast fn process(data: [Float]) -> [Float] {
// TypedArray operations with WASM kernel
typedMap(data, fn(x) kernel(x, 1.0))
}
// Distribute across cores
results = await parallelMap(batches, fn(batch) process(batch))
Running the Benchmarks
The benchmark suite lives in benchmarks/ and includes 14 workloads:
# Run all benchmarks
./benchmarks/run_benchmarks.sh
# Quick mode (benchmarks 01-07 only)
./benchmarks/run_benchmarks.sh --quick
# Tova only
./benchmarks/run_benchmarks.sh --tova-only
# Single benchmark
./benchmarks/run_benchmarks.sh 03
The runner executes each benchmark and outputs a formatted results table.
Concurrency: Tova vs Go
A dedicated comparison suite benchmarks Tova's concurrency runtime (Tokio + Wasmtime + Crossbeam) against Go's goroutines and channels:
# Run side-by-side comparison (requires go and bun)
bash benchmarks/concurrent/run_comparison.sh
# Show raw output for debugging
VERBOSE=1 bash benchmarks/concurrent/run_comparison.sh
Six benchmarks are compared: spawn overhead, channel throughput (1M messages), ping-pong latency, fan-out (4 workers), select multiplexing (4 channels), and concurrent compute (40K × fib(30)). The runner prints a formatted table with timing and ratios.