
January 28, 2026

Scaling Our Highlights Server for Semantic Search

This engineering blog post details the challenges and solutions we tackled while scaling our highlights server, a key component of our neural search engine.


TL;DR

  • Exa Highlights extracts and embeds content chunks from search results in real time.
  • The original Python pipeline was bottlenecked by the GIL, limiting CPU utilization and causing GPU idleness.
  • Work was divided between CPU-bound pre-processing and GPU-bound inference.
  • Attempts to parallelize in Python involved multiple processes and queues, but IPC overhead and GIL contention limited performance.
  • Setting CPU affinity for the main Python process revealed GIL-blocked IPC as a major bottleneck.
  • Migrating to Rust allowed for simpler, safer memory management and effective parallelization using rayon.
  • The Rust implementation removed multiprocessing overheads, resulting in a 4X throughput improvement.
  • GPU out-of-memory (OOM) issues arose from using parallel iterators with Rust's Torch bindings (tch-rs) across multiple GPUs.
  • Refactoring inference to use serial iterators with async CUDA resolved GPU memory issues.
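The last few bullets describe a pipeline shape that can be sketched with the standard library alone. This is a minimal illustration, not Exa's implementation: the real system uses rayon's parallel iterators for CPU work and tch-rs for inference, and `preprocess` and `infer` here are hypothetical stand-ins. The key structure is that CPU-bound pre-processing fans out across threads, while GPU submission stays on a single serial loop so only one batch at a time is resident per device.

```rust
use std::thread;

// Stand-in for CPU-bound pre-processing (chunking, tokenization).
// In the real pipeline this work is parallelized with rayon.
fn preprocess(doc: &str) -> Vec<u32> {
    doc.split_whitespace().map(|w| w.len() as u32).collect()
}

// Stand-in for GPU-bound inference; in the real pipeline this is a
// tch-rs forward pass issued asynchronously on a CUDA stream.
fn infer(tokens: &[u32]) -> u32 {
    tokens.iter().sum()
}

fn main() {
    let docs = vec!["chunk one here", "second chunk", "third longer chunk text"];

    // Fan CPU work out across threads (par_iter in the real code).
    let preprocessed: Vec<Vec<u32>> = thread::scope(|s| {
        docs.iter()
            .map(|d| s.spawn(move || preprocess(d)))
            .collect::<Vec<_>>()
            .into_iter()
            .map(|h| h.join().unwrap())
            .collect()
    });

    // Keep inference serial: submitting batches one at a time per
    // device bounds GPU memory, avoiding the OOM hit with parallel
    // iterators over tch-rs models.
    let results: Vec<u32> = preprocessed.iter().map(|t| infer(t)).collect();
    println!("{:?}", results); // prints [12, 11, 20]
}
```

Serializing only the device-facing loop keeps the throughput win from parallel pre-processing while making peak GPU memory predictable.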