
January 28, 2026

Scaling Our Highlights Server for Semantic Search

This engineering blog post details the challenges and solutions we tackled while scaling our highlights server, a key component of our neural search engine.


TL;DR

  • Exa Highlights extracts and embeds content chunks from search results in real time.
  • The original Python pipeline was bottlenecked by the GIL, limiting CPU utilization and causing GPU idleness.
  • Work was divided between CPU-bound pre-processing and GPU-bound inference.
  • Attempts to parallelize in Python involved multiple processes and queues, but IPC overhead and GIL contention limited performance.
  • Setting CPU affinity for the main Python process revealed GIL-blocked IPC as a major bottleneck.
  • Migrating to Rust allowed for simpler, safer memory management and effective parallelization using rayon.
  • The Rust implementation removed multiprocessing overheads, resulting in a 4X throughput improvement.
  • GPU out-of-memory (OOM) issues arose from using parallel iterators with Rust's Torch bindings (tch-rs) across multiple GPUs.
  • Refactoring inference to use serial iterators with async CUDA resolved GPU memory issues.
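The last few bullets describe a pipeline shape that can be sketched with the standard library alone. This is a minimal illustration, not Exa's implementation: the real system uses rayon's parallel iterators for CPU work and tch-rs for inference, and `preprocess` and `infer` here are hypothetical stand-ins. The key structure is that CPU-bound pre-processing fans out across threads, while GPU submission stays on a single serial loop so only one batch at a time is resident per device.

```rust
use std::thread;

// Stand-in for CPU-bound pre-processing (chunking, tokenization).
// In the real pipeline this work is parallelized with rayon.
fn preprocess(doc: &str) -> Vec<u32> {
    doc.split_whitespace().map(|w| w.len() as u32).collect()
}

// Stand-in for GPU-bound inference; in the real pipeline this is a
// tch-rs forward pass issued asynchronously on a CUDA stream.
fn infer(tokens: &[u32]) -> u32 {
    tokens.iter().sum()
}

fn main() {
    let docs = vec!["chunk one here", "second chunk", "third longer chunk text"];

    // Fan CPU work out across threads (par_iter in the real code).
    let preprocessed: Vec<Vec<u32>> = thread::scope(|s| {
        docs.iter()
            .map(|d| s.spawn(move || preprocess(d)))
            .collect::<Vec<_>>()
            .into_iter()
            .map(|h| h.join().unwrap())
            .collect()
    });

    // Keep inference serial: submitting batches one at a time per
    // device bounds GPU memory, avoiding the OOM hit with parallel
    // iterators over tch-rs models.
    let results: Vec<u32> = preprocessed.iter().map(|t| infer(t)).collect();
    println!("{:?}", results); // prints [12, 11, 20]
}
```

Serializing only the device-facing loop keeps the throughput win from parallel pre-processing while making peak GPU memory predictable.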