WebGPU Tensor Library

A high-performance tensor library built on WebGPU, designed for both eager and lazy execution with automatic CPU/GPU device management.

What are Compute Shaders?

Compute shaders are GPU programs that run massively parallel computations. Unlike graphics shaders that render pixels, compute shaders can perform arbitrary calculations on large datasets. WebGPU exposes this power through a modern, cross-platform API that works in browsers and native environments.
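
To make this concrete, here is a minimal standalone sketch of a raw WebGPU compute dispatch (not this library's code) that doubles every element of a buffer. Error handling and result readback are omitted for brevity.

// Minimal raw-WebGPU sketch: double every element of a buffer on the GPU.
const adapter = await navigator.gpu.requestAdapter();
const device = await adapter!.requestDevice();

const module = device.createShaderModule({
  code: `
    @group(0) @binding(0) var<storage, read_write> data: array<f32>;

    @compute @workgroup_size(64)
    fn main(@builtin(global_invocation_id) id: vec3<u32>) {
      if (id.x < arrayLength(&data)) {
        data[id.x] = data[id.x] * 2.0; // one thread per element
      }
    }
  `,
});

const input = new Float32Array([1, 2, 3, 4]);
const buffer = device.createBuffer({
  size: input.byteLength,
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC,
  mappedAtCreation: true,
});
new Float32Array(buffer.getMappedRange()).set(input);
buffer.unmap();

const pipeline = device.createComputePipeline({
  layout: "auto",
  compute: { module, entryPoint: "main" },
});
const bindGroup = device.createBindGroup({
  layout: pipeline.getBindGroupLayout(0),
  entries: [{ binding: 0, resource: { buffer } }],
});

const encoder = device.createCommandEncoder();
const pass = encoder.beginComputePass();
pass.setPipeline(pipeline);
pass.setBindGroup(0, bindGroup);
pass.dispatchWorkgroups(Math.ceil(input.length / 64));
pass.end();
device.queue.submit([encoder.finish()]);
// (Reading results back requires a copy into a MAP_READ staging buffer.)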

Performance Note: For small tensors (< 1K elements), the CPU is often faster because GPU dispatch overhead (buffer setup, queue submission, readback) dominates. For large tensors (> 100K elements), GPU parallelism wins. The library handles both seamlessly.
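
To observe the crossover yourself, here is a rough micro-benchmark sketch using only the eager API described under API Examples below (examples/performance_comparison.ts is the maintained benchmark; timings vary by hardware):

import { Tensor } from "./src/tensor.ts";

// Rough micro-benchmark sketch; absolute numbers vary by hardware.
for (const n of [1_000, 100_000, 10_000_000]) {
  const data = Array.from({ length: n }, (_, i) => i);
  for (const device of ["cpu", "gpu"] as const) {
    const a = Tensor.fromArray(data, { device });
    const b = Tensor.fromArray(data, { device });
    const start = performance.now();
    const result = await a.add(b);
    await result.to("cpu"); // force completion before stopping the timer
    console.log(`${device} n=${n}: ${(performance.now() - start).toFixed(2)} ms`);
  }
}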

Key Features

  • Dual execution modes: Eager execution (immediate) and planned compiled execution (fused operations)
  • Cross-device: Seamless CPU ↔ GPU transfers
  • WebGPU native: Built for modern compute workloads
  • TypeScript: Full type safety with device-aware types

API Examples

Eager Execution (Current)

import { Tensor } from "./src/tensor.ts";

// Create tensors on CPU or GPU
const a = Tensor.fromArray([1, 2, 3, 4, 5], { device: "gpu" });
const b = Tensor.fromArray([2, 3, 4, 5, 6], { device: "gpu" });

// Method chaining works - operations execute immediately
const result = await a.add(b).mul(2); // Each step creates a new tensor
const cpuResult = await result.to("cpu"); // Transfer back

console.log(cpuResult.toArray()); // [6, 10, 14, 18, 22]

Compiled Execution (Planned)

import { Tensor, compile } from "./src/index.ts";

// Same chaining syntax, but with major optimizations:
const fusedOp = compile((x: Tensor, y: Tensor) => {
  return x.add(y).mul(2).relu(); // Fused into single kernel
});

// Compiled mode advantages:
// ✅ Kernel fusion: single compute shader instead of three separate dispatches
// ✅ In-place operations: a dynamic buffer allocator minimizes memory usage
// ✅ Auto cleanup: intermediate tensors are destroyed when the closure ends
// ✅ Reuse: the first call compiles the kernel; subsequent calls reuse it

const result = fusedOp(a, b); // Runs as a single fused kernel

Design Decisions

WebGPU Async Nature: WebGPU operations are inherently async, but we don't always await intermediate steps since the runtime automatically queues and awaits necessary operations. This allows for better performance through automatic batching.

Syntax Choices:

  • Tensor.fromArray() for explicit construction
  • .to(device) for clear device transfers
  • Method chaining with automatic queueing
  • Device-aware TypeScript types prevent cross-device errors at compile time (see the sketch below)
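
As an illustration of the last point, here is a rough sketch (not the library's actual typings) of how a device type parameter can reject cross-device operations at compile time:

// Hypothetical sketch of device-aware typing; names and signatures
// are illustrative, not the library's real declarations.
type Device = "cpu" | "gpu";

declare class Tensor<D extends Device = Device> {
  readonly device: D;
  static fromArray<D extends Device>(
    data: number[],
    opts: { device: D },
  ): Tensor<D>;
  add(other: Tensor<D>): Tensor<D>; // operands must share a device
  to<T extends Device>(device: T): Promise<Tensor<T>>;
}

const cpu = Tensor.fromArray([1, 2], { device: "cpu" });
const gpu = Tensor.fromArray([3, 4], { device: "gpu" });

// cpu.add(gpu); // ❌ compile error: Tensor<"gpu"> is not Tensor<"cpu">
const moved = await cpu.to("gpu"); // explicit transfer first
gpu.add(moved); // ✅ OK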

Platform Support

The library currently targets Deno exclusively, thanks to its excellent built-in WebGPU support (--unstable-webgpu). Future plans include bundles for web browsers and Node.js.

Getting Started

Prerequisites

  • Deno 1.40+ with WebGPU support

Run Examples

deno task dev              # Run basic examples
deno run --unstable-webgpu --allow-all examples/basic_add.ts
deno run --unstable-webgpu --allow-all examples/performance_comparison.ts

Run Tests

deno task test             # Run all tests
deno test --unstable-webgpu --allow-all tests/tensor_test.ts

Contributing

We need help in several areas:

🔧 API Improvements

  • Better TypeScript support: Current device typing could be more ergonomic
  • Shape broadcasting: Automatic shape compatibility
  • Error handling: Better error messages and recovery

🐛 Bug Hunting

  • Memory leaks: GPU buffer cleanup
  • Edge cases: Empty tensors, large arrays, device switching
  • Performance regressions: Benchmark against baselines

⚡ New Kernels

Easy starter tasks! Copy src/kernels/add.ts as a template to implement (see the sketch after this list):

  • Element-wise ops: sub.ts, mul.ts, div.ts
  • Activation functions: relu.ts, sigmoid.ts, tanh.ts
  • Reductions: sum.ts, mean.ts, max.ts
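
For reference, here is a hypothetical sketch of what an element-wise kernel such as mul.ts could look like, assuming the same one-thread-per-element pattern as the existing add kernel (the actual file layout in src/kernels/ may differ):

// Hypothetical src/kernels/mul.ts sketch; the repo's real kernel
// structure may differ.
export const mulShader = /* wgsl */ `
  @group(0) @binding(0) var<storage, read> a: array<f32>;
  @group(0) @binding(1) var<storage, read> b: array<f32>;
  @group(0) @binding(2) var<storage, read_write> out: array<f32>;

  @compute @workgroup_size(64)
  fn main(@builtin(global_invocation_id) id: vec3<u32>) {
    if (id.x < arrayLength(&out)) {
      out[id.x] = a[id.x] * b[id.x]; // element-wise multiply
    }
  }
`;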

🚀 Kernel Optimization

Current kernels are naive (one thread computes one element). Promising optimizations (sketched after this list):

  • 2D tiling: Each thread handles 8x8 tiles
  • Memory coalescing: Optimal memory access patterns
  • Workgroup optimization: Better thread group utilization
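
As a rough illustration of these ideas, here is a hedged WGSL sketch of thread coarsening, where each thread handles a small tile of elements instead of one (tile size and workgroup dimensions are illustrative, not tuned):

// Illustrative WGSL only; the tiled kernels are yet to be written.
const coarsenedAdd = /* wgsl */ `
  @group(0) @binding(0) var<storage, read> a: array<f32>;
  @group(0) @binding(1) var<storage, read> b: array<f32>;
  @group(0) @binding(2) var<storage, read_write> out: array<f32>;

  const TILE: u32 = 4u;

  @compute @workgroup_size(64)
  fn main(@builtin(global_invocation_id) id: vec3<u32>) {
    // Each thread handles TILE consecutive elements instead of one,
    // amortizing per-thread overhead across more work.
    let base = id.x * TILE;
    for (var i = 0u; i < TILE; i++) {
      let idx = base + i;
      if (idx < arrayLength(&out)) {
        out[idx] = a[idx] + b[idx];
      }
    }
  }
`;

Note that for memory coalescing, a strided layout (thread i touching elements i, i + stride, ...) often beats consecutive chunks, since neighboring threads then read neighboring addresses.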

📦 Core Features

  • compile() API: Lazy execution with kernel fusion
  • Automatic differentiation: Backpropagation support
  • Shape inference: Automatic output shape calculation
  • Memory pooling: Buffer reuse and allocation optimization

📚 Documentation

  • API reference: Complete function documentation
  • Tutorials: WebGPU tensor programming guide
  • Examples: Real-world use cases

📊 Benchmarks

  • Performance testing: Comprehensive CPU vs GPU benchmarks
  • Memory profiling: Track buffer allocation and cleanup
  • Regression testing: Ensure optimizations don't break performance

🔍 More Bug Hunting

Seriously, we need thorough testing of edge cases, memory management, and cross-device operations.

License

MIT
