A high-performance tensor library built on WebGPU, designed for both eager and lazy execution with automatic CPU/GPU device management.
Compute shaders are GPU programs that run massively parallel computations. Unlike graphics shaders that render pixels, compute shaders can perform arbitrary calculations on large datasets. WebGPU exposes this power through a modern, cross-platform API that works in browsers and native environments.
Performance Note: For small matrices (< 1K elements), CPU is often faster due to GPU setup overhead. For large matrices (> 100K elements), GPU parallelism dominates. Our library handles both seamlessly.
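As a rough illustration of that crossover, a dispatcher can pick the device from the element count. The `pickDevice` helper and the 32K threshold below are hypothetical, not part of the library:

```typescript
// Hypothetical sketch: choose a device by tensor size. The threshold is
// illustrative; the real crossover depends on hardware and op complexity.
type Device = "cpu" | "gpu";

function pickDevice(elementCount: number): Device {
  // Below this size, GPU dispatch and readback overhead tends to outweigh
  // the parallel speedup, so stay on the CPU.
  const GPU_THRESHOLD = 32_768;
  return elementCount < GPU_THRESHOLD ? "cpu" : "gpu";
}

const data = new Array(1_000_000).fill(1);
const device = pickDevice(data.length); // "gpu" for a million elements
```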
- Dual execution modes: Eager execution (immediate) and planned compiled execution (fused operations)
- Cross-device: Seamless CPU ↔ GPU transfers
- WebGPU native: Built for modern compute workloads
- TypeScript: Full type safety with device-aware types
```typescript
import { Tensor } from "./src/tensor.ts";

// Create tensors on CPU or GPU
const a = Tensor.fromArray([1, 2, 3, 4, 5], { device: "gpu" });
const b = Tensor.fromArray([2, 3, 4, 5, 6], { device: "gpu" });

// Method chaining works - operations execute immediately
const result = await a.add(b).mul(2); // Each step creates a new tensor
const cpuResult = await result.to("cpu"); // Transfer back
console.log(cpuResult.toArray()); // [6, 10, 14, 18, 22]
```
```typescript
import { Tensor, compile } from "./src/index.ts";

// Same chaining syntax, but with major optimizations:
const fusedOp = compile((x: Tensor, y: Tensor) => {
  return x.add(y).mul(2).relu(); // Fused into a single kernel
});

// Compiled mode advantages:
// ✅ Kernel fusion: Single compute shader instead of 3 separate ones
// ✅ In-place operations: Dynamic buffer allocator minimizes memory usage
// ✅ Auto cleanup: Intermediate tensors destroyed at closure end
// ✅ Reuse: First call compiles; subsequent calls are fast
const result = fusedOp(a, b); // Much faster than eager
```
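For intuition, here is roughly what a fused kernel for `x.add(y).mul(2).relu()` could look like as a single WGSL shader. This is a hand-written sketch of the fusion idea, not the compiler's actual output:

```typescript
// Sketch: one shader evaluates add, mul, and relu per element, so no
// intermediate buffers are written between the three operations.
const fusedShader = /* wgsl */ `
@group(0) @binding(0) var<storage, read> x: array<f32>;
@group(0) @binding(1) var<storage, read> y: array<f32>;
@group(0) @binding(2) var<storage, read_write> out: array<f32>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
  let i = gid.x;
  if (i < arrayLength(&out)) {
    // add, mul(2), relu fused into one expression
    out[i] = max((x[i] + y[i]) * 2.0, 0.0);
  }
}
`;
```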
WebGPU Async Nature: WebGPU operations are inherently async, but intermediate steps don't always need an explicit `await`; the runtime automatically queues operations and awaits them where necessary. This allows better performance through automatic batching.
Syntax Choices:
- `Tensor.fromArray()` for explicit construction
- `.to(device)` for clear device transfers
- Method chaining with automatic queueing
- Device-aware TypeScript types prevent cross-device errors at compile time (see the sketch below)
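One way such device-aware typing can work is to carry the device in a generic parameter, so mixing devices fails to typecheck. The declarations below are an illustrative sketch, not the library's actual definitions:

```typescript
// Sketch: the device is part of the type, so gpuT.add(cpuT) cannot compile.
type Device = "cpu" | "gpu";

declare class Tensor<D extends Device = Device> {
  readonly device: D;
  // Only tensors on the same device can be combined.
  add(other: Tensor<D>): Tensor<D>;
  to<T extends Device>(device: T): Promise<Tensor<T>>;
}

declare const gpuT: Tensor<"gpu">;
declare const cpuT: Tensor<"cpu">;
// gpuT.add(cpuT); // type error: Tensor<"cpu"> is not assignable to Tensor<"gpu">
```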
Currently using Deno exclusively due to its excellent built-in WebGPU support (`--unstable-webgpu`). Future plans include bundling for web browsers and Node.js.
- Deno 1.40+ with WebGPU support
```sh
deno task dev   # Run basic examples
deno run --unstable-webgpu --allow-all examples/basic_add.ts
deno run --unstable-webgpu --allow-all examples/performance_comparison.ts

deno task test  # Run all tests
deno test --unstable-webgpu --allow-all tests/tensor_test.ts
```
We need help with several areas:
- Better TypeScript support: Current device typing could be more ergonomic
- Shape broadcasting: Automatic shape compatibility (see the sketch after this list)
- Error handling: Better error messages and recovery
- Memory leaks: GPU buffer cleanup
- Edge cases: Empty tensors, large arrays, device switching
- Performance regressions: Benchmark against baselines
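For the broadcasting item above, NumPy-style rules align shapes from the trailing dimension; two dimensions are compatible when they are equal or one of them is 1. A minimal sketch of computing the broadcast output shape (`broadcastShapes` is a hypothetical helper, not in the codebase):

```typescript
// Hypothetical helper: NumPy-style broadcast of two shapes.
function broadcastShapes(a: number[], b: number[]): number[] {
  const out: number[] = [];
  const len = Math.max(a.length, b.length);
  // Walk both shapes from the right; missing dims count as 1.
  for (let i = 1; i <= len; i++) {
    const da = a[a.length - i] ?? 1;
    const db = b[b.length - i] ?? 1;
    if (da !== db && da !== 1 && db !== 1) {
      throw new Error(`Cannot broadcast shapes [${a}] and [${b}]`);
    }
    out.unshift(Math.max(da, db));
  }
  return out;
}

console.log(broadcastShapes([4, 1], [3])); // [4, 3]
// broadcastShapes([2, 3], [4, 3])         // throws
```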
Easy starter tasks! Copy `src/kernels/add.ts` to implement (see the sketch after this list):
- Element-wise ops: `sub.ts`, `mul.ts`, `div.ts`
- Activation functions: `relu.ts`, `sigmoid.ts`, `tanh.ts`
- Reductions: `sum.ts`, `mean.ts`, `max.ts`
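Assuming `add.ts` follows the naive one-thread-per-element pattern described below, the binary starter kernels differ only in their per-element expression, so a small template makes the ports mechanical. This factory is an illustrative sketch, not code from the repo:

```typescript
// Sketch: stamp out elementwise kernels by swapping the WGSL expression.
const elementwiseWGSL = (expr: string) => /* wgsl */ `
@group(0) @binding(0) var<storage, read> a: array<f32>;
@group(0) @binding(1) var<storage, read> b: array<f32>;
@group(0) @binding(2) var<storage, read_write> out: array<f32>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
  let i = gid.x;
  if (i < arrayLength(&out)) { out[i] = ${expr}; }
}
`;

const subShader = elementwiseWGSL("a[i] - b[i]"); // sub.ts
const mulShader = elementwiseWGSL("a[i] * b[i]"); // mul.ts
const divShader = elementwiseWGSL("a[i] / b[i]"); // div.ts
```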
Current kernels are naive (one thread per element). Planned optimizations (see the tiling sketch after this list):
- 2D tiling: Each thread handles 8x8 tiles
- Memory coalescing: Optimal memory access patterns
- Workgroup optimization: Better thread group utilization
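To make the tiling idea concrete, here is a sketch where each invocation processes an 8x8 tile of a 2D tensor instead of one element (shown on a simple scale-by-2 op). It is illustrative only; the access pattern still needs coalescing work and the workgroup size needs tuning:

```typescript
// Sketch of 8x8 thread coarsening; not a tuned kernel.
const tiledWGSL = /* wgsl */ `
struct Dims { rows: u32, cols: u32 }

@group(0) @binding(0) var<storage, read> a: array<f32>;
@group(0) @binding(1) var<storage, read_write> out: array<f32>;
@group(0) @binding(2) var<uniform> dims: Dims;

@compute @workgroup_size(8, 8)
fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
  // Each thread owns the 8x8 tile whose top-left corner is
  // (gid.y * 8, gid.x * 8).
  for (var dy = 0u; dy < 8u; dy++) {
    for (var dx = 0u; dx < 8u; dx++) {
      let r = gid.y * 8u + dy;
      let c = gid.x * 8u + dx;
      if (r < dims.rows && c < dims.cols) {
        out[r * dims.cols + c] = a[r * dims.cols + c] * 2.0;
      }
    }
  }
}
`;
```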
- `compile()` API: Lazy execution with kernel fusion
- Automatic differentiation: Backpropagation support
- Shape inference: Automatic output shape calculation
- Memory pooling: Buffer reuse and allocation optimization (see the sketch after this list)
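A minimal sketch of the memory-pooling idea, assuming a standard WebGPU `GPUDevice` (the `BufferPool` class is hypothetical): recycle freed buffers by size instead of allocating fresh ones.

```typescript
// Hypothetical sketch: free buffers are binned by size and reused, cutting
// createBuffer/destroy churn in hot loops. A real pool would also key on
// usage flags and cap how much idle memory it holds.
class BufferPool {
  private free = new Map<number, GPUBuffer[]>();

  constructor(private device: GPUDevice) {}

  acquire(size: number, usage: GPUBufferUsageFlags): GPUBuffer {
    const reused = this.free.get(size)?.pop();
    return reused ?? this.device.createBuffer({ size, usage });
  }

  release(buffer: GPUBuffer): void {
    const bin = this.free.get(buffer.size) ?? [];
    bin.push(buffer);
    this.free.set(buffer.size, bin);
  }
}
```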
- API reference: Complete function documentation
- Tutorials: WebGPU tensor programming guide
- Examples: Real-world use cases
- Performance testing: Comprehensive CPU vs GPU benchmarks (see the harness sketch after this list)
- Memory profiling: Track buffer allocation and cleanup
- Regression testing: Ensure optimizations don't break performance
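For the benchmarking item, even a crude harness catches regressions early. A sketch, with the caveat that GPU timings must await the readback or the async queue hides the real cost (`bench` is a hypothetical helper):

```typescript
// Hypothetical micro-benchmark: average wall-clock time over several runs.
async function bench(
  label: string,
  fn: () => Promise<unknown>,
  runs = 10,
): Promise<void> {
  const start = performance.now();
  for (let i = 0; i < runs; i++) await fn();
  const ms = (performance.now() - start) / runs;
  console.log(`${label}: ${ms.toFixed(2)} ms/run`);
}

// await bench("gpu add", async () => {
//   const r = await aGpu.add(bGpu);
//   await r.to("cpu"); // force completion before stopping the clock
// });
```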
Seriously, we need thorough testing of edge cases, memory management, and cross-device operations.
MIT