This project provides a development environment for the Bend programming language using Visual Studio Code and Docker. The environment is configured to support GPU acceleration with NVIDIA CUDA.

Getting Started
    bend run-c parallel_sum.bend -s

If you have an NVIDIA GPU, you can also run it with CUDA (massively parallel):

    bend run-cu parallel_sum.bend -s

In Bend, a program is parallelized just by changing the run command. If your code can run in parallel, it will run in parallel. ...
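To make "if your code can run in parallel, it will run in parallel" concrete, here is a minimal Python sketch of a divide-and-conquer sum. It assumes `parallel_sum.bend` has this usual tree shape; the file's contents are not shown here, so the names `tree_sum` and `tree_sum_two_tasks` are hypothetical. The two recursive calls share no state, which is the property a runtime needs in order to evaluate them concurrently.

```python
from concurrent.futures import ThreadPoolExecutor


def tree_sum(depth: int, x: int) -> int:
    """Sum the leaves of an implicit binary tree with 2**depth leaves."""
    if depth == 0:
        return x
    # The two halves are independent: neither reads anything the other writes.
    fst = tree_sum(depth - 1, x * 2 + 0)
    snd = tree_sum(depth - 1, x * 2 + 1)
    return fst + snd


def tree_sum_two_tasks(depth: int, x: int) -> int:
    """Same result, but the two top-level halves are submitted as separate tasks.

    This only illustrates that evaluation order does not matter. CPython
    threads will not speed up this CPU-bound work.
    """
    if depth == 0:
        return x
    with ThreadPoolExecutor(max_workers=2) as pool:
        fst = pool.submit(tree_sum, depth - 1, x * 2 + 0)
        snd = pool.submit(tree_sum, depth - 1, x * 2 + 1)
        return fst.result() + snd.result()


if __name__ == "__main__":
    # Leaves are 0..15, so both calls print 120.
    print(tree_sum(4, 0))
    print(tree_sum_two_tasks(4, 0))
```

In Python the task split has to be written out by hand. The point of the text above is that Bend does not ask for this: the same source is passed to `bend run-c` or `bend run-cu` unchanged.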
GPU, NVIDIA RTX 4090, 16k threads: 0.21 seconds

That's a 57x speedup by doing nothing. No thread spawning, no explicit management of locks or mutexes. We just asked Bend to run our program on the RTX, and it did. Simple as that.
bend run-cu: GPU, NVIDIA RTX 4090: 0.21 seconds

Bitonic Sorter code:

    # Sorting Network = just rotate trees!
    def sort(d, s, tree):
      switch d:
        case 0:
          return tree
        case _:
          (x,y) = tree
          lft = sort(d-1, 0, x)
          rgt = sort(d-1, 1, y)
          return rots(d, s...
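The Bend listing is cut off, so the following is not a completion of it. It is a minimal Python sketch of the standard list-based bitonic sort, included only to show the recursion shape the fragment hints at: sort the first half in one direction, the second half in the other, then merge. The function names and the use of lists instead of trees are assumptions, and the input length must be a power of two.

```python
from typing import List


def bitonic_merge(xs: List[int], ascending: bool) -> List[int]:
    """Sort a bitonic sequence (power-of-two length) in the given direction."""
    n = len(xs)
    if n <= 1:
        return list(xs)
    half = n // 2
    lo, hi = list(xs[:half]), list(xs[half:])
    # Compare-and-swap element i of each half. Every pair is independent,
    # which is what makes the network easy to parallelize.
    for i in range(half):
        if (lo[i] > hi[i]) == ascending:
            lo[i], hi[i] = hi[i], lo[i]
    return bitonic_merge(lo, ascending) + bitonic_merge(hi, ascending)


def bitonic_sort(xs: List[int], ascending: bool = True) -> List[int]:
    """Sort a list whose length is a power of two."""
    n = len(xs)
    if n <= 1:
        return list(xs)
    half = n // 2
    # Opposite directions make the concatenation bitonic, mirroring the
    # sort(d-1, 0, x) / sort(d-1, 1, y) calls in the fragment above.
    first = bitonic_sort(xs[:half], True)
    second = bitonic_sort(xs[half:], False)
    return bitonic_merge(first + second, ascending)


if __name__ == "__main__":
    print(bitonic_sort([3, 1, 4, 2]))  # prints [1, 2, 3, 4]
```

As in the tree sum, the two recursive calls in `bitonic_sort` do not depend on each other, so the same "independent halves" argument applies.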
Moreover, since demo_shader isn't doing many allocations, it can operate entirely inside the GPU's "shared memory" (L1 cache). Each GPU thread has a local space of 64 IC nodes. Functions that don't need more than that, like demo_shader, can run up to 5x faster! On my GPU, it ...