January 28, 2026
Scaling Our Highlights Server for Semantic Search
This post details the engineering challenges and solutions we tackled while scaling our highlights server, a key component of our neural search engine.

TL;DR
- Exa Highlights extracts and embeds content chunks from search results in real time.
- The original Python pipeline was bottlenecked by the GIL, limiting CPU utilization and leaving GPUs idle.
- Work was divided between CPU-bound pre-processing and GPU-bound inference.
- Attempts to parallelize in Python involved multiple processes and queues, but IPC overhead and GIL contention limited performance.
- Setting CPU affinity for the main Python process revealed GIL-blocked IPC as a major bottleneck.
- Migrating to Rust allowed for simpler, safer memory management and effective parallelization using rayon.
- The Rust implementation removed the multiprocessing overhead, resulting in a 4x throughput improvement.
- GPU out-of-memory (OOM) errors arose from driving Rust's Torch bindings (tch-rs) with parallel iterators across multiple GPUs.
- Refactoring inference to use serial iterators with async CUDA resolved the OOM errors.
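The CPU-bound pre-processing step is a data-parallel map over independent chunks, which is exactly the shape rayon's `par_iter().map(...)` handles in one line. As a dependency-free illustration, the sketch below fans the same work out over scoped std threads; `tokenize_chunk` is a hypothetical stand-in for the real pre-processing, not our actual code.

```rust
use std::thread;

// Hypothetical stand-in for the real pre-processing (tokenize + normalize).
fn tokenize_chunk(text: &str) -> Vec<String> {
    text.split_whitespace().map(|w| w.to_lowercase()).collect()
}

// Split the chunk list into batches and process each batch on its own
// thread. With rayon this whole function collapses to:
//   chunks.par_iter().map(|c| tokenize_chunk(c)).collect()
fn preprocess_parallel(chunks: &[&str], workers: usize) -> Vec<Vec<String>> {
    let batch = chunks.len().div_ceil(workers.max(1));
    let mut out = vec![Vec::new(); chunks.len()];
    thread::scope(|s| {
        for (slots, inputs) in out.chunks_mut(batch).zip(chunks.chunks(batch)) {
            s.spawn(move || {
                for (slot, input) in slots.iter_mut().zip(inputs) {
                    *slot = tokenize_chunk(input);
                }
            });
        }
    });
    out
}

fn main() {
    let chunks = ["Exa Highlights", "scales with Rust"];
    let tokens = preprocess_parallel(&chunks, 2);
    println!("{:?}", tokens);
}
```

Because the threads share one address space, no data is serialized or copied between workers, which is the overhead the Python multiprocessing approach could not avoid.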