
Three Generations of a Warehouse Routing Engine

· 27 min read
Series parts
  1. Part 1: Walking Is the Most Expensive Warehouse Operation
  2. Part 2: Three Generations of a Warehouse Routing Engine
  3. Part 3: Running a Warehouse System on a 4 GB Server with No Docker
  4. Part 4: Streaming Excel to a Database Without Losing a Single Row

The optimization engine is the core of the warehouse system. An operator scans a barcode; the engine receives a list of items with their warehouse positions and available quantities; it returns an optimized pick sequence. That computation happens hundreds of times per day, and the engine needs to answer hundreds of shortest-path queries per batch order. Each query asks: what is the walking distance between position X and position Y, accounting for shelf racks that block direct paths? The pathfinder sits in the inner loop of the route optimizer. Its speed determines whether operators wait at the scanning station or start walking immediately.

The engine went through three complete rewrites across three languages and two algorithmic paradigms. Not because the first two were wrong; each one hit a different constraint that forced the next evolution. The current implementation, a Rust/WASM binary with Jump Point Search pathfinding and a nearest-neighbor + 2-opt solver, is faster, more deterministic, and more portable than its predecessors. I built an ILP-based exact solver to benchmark it, and the greedy closes within 1% of optimal on average. This post tells the full story: what worked, what broke, and why a simple algorithm with the right local-improvement step turned out to be the better engineering choice.

Generation 1: Node.js + npm libraries (2019)

I started with a Python notebook to visualize the grid and sketch a solution, but I never wanted Python in production. The first shipped version stayed inside Node.js. I forked node-tspsolver, tuned its simulated annealing strategy, and used pathfinding for A* shortest paths. It was a pragmatic MVP: no RPC layer, no child-process integration, just a single Node.js service running the solver inline.

Large batch orders with 10+ items spread across all aisles blocked the Node.js event loop for seconds. HTTP requests timed out. Operators stood at the scanning station waiting for a route that wouldn’t come. On a 4 GB server already running other Vivaldi services (the constraint list from Part 1), the memory overhead made things worse. JavaScript’s garbage collector kicked in at the worst moments, and the A* open/closed sets for large pathfinding queries consumed far more memory than equivalent data structures in a systems language would have. I could see in the profiler that the algorithm converged to the right answer. It just took too long to get there.

I knew I needed to move the computation out of Node.js. The question was where.

Generation 2: C++ with Simulated Annealing (2020)

I rewrote the solver in C++ and bound it to Node.js via Node-API (N-API). I kept simulated annealing from the Node.js version but implemented it natively and tuned it further. SA treats the pick sequence as a permutation and repeatedly swaps pairs of positions, accepting swaps that reduce the total walking distance, and occasionally accepting worse swaps (with probability that decays over a “cooling schedule”) to escape local minima. The trade is runtime for solution quality: given enough iterations, SA converges to a near-optimal tour. Route optimization went from seconds to milliseconds. The system finally worked in production.
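The acceptance rule at the heart of SA fits in a few lines. A dependency-free sketch of just that rule, not the production solver (the `Lcg` stand-in RNG and the function names are illustrative):

```rust
// Minimal simulated-annealing acceptance rule: always take an improving
// swap; take a worsening swap with probability exp(-delta / temperature),
// which shrinks as the temperature cools. The LCG (Knuth's MMIX constants)
// stands in for a real RNG so the sketch has no dependencies.
struct Lcg(u64);

impl Lcg {
    fn next_f64(&mut self) -> f64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        (self.0 >> 11) as f64 / (1u64 << 53) as f64 // uniform in [0, 1)
    }
}

/// `delta` is the change in total walking distance caused by a swap.
fn accept(delta: f64, temperature: f64, rng: &mut Lcg) -> bool {
    delta < 0.0 || rng.next_f64() < (-delta / temperature).exp()
}

fn main() {
    let mut rng = Lcg(42);
    // An improving swap (negative delta) is always accepted.
    assert!(accept(-5.0, 1.0, &mut rng));
    // A worsening swap at near-zero temperature is effectively never accepted.
    assert!(!accept(5.0, 1e-9, &mut rng));
}
```

The cooling schedule is simply a loop that multiplies `temperature` by a decay factor each iteration; everything stochastic about SA lives in that one `rng.next_f64()` call.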

But C++ introduced a different class of pain. The native module depended on node-gyp for compilation, and node-gyp was barely usable: cryptic build errors, implicit dependencies on Python 2 and platform-specific toolchains, and failure modes I never fully understood. Upgrading Node.js versions was brittle; a minor version bump could break the native module’s ABI compatibility, requiring a full recompile. I verified builds locally on Windows WSL 2 to match the production Debian environment, pinning the same Node.js version via nvm, and deployed the precompiled binary to the server. That workflow was manageable but fragile; the build had to target the exact same OS and Node.js version as production, and any drift meant a broken deployment I’d only discover after restarting the service.

I lived with these tradeoffs for about two years because the solver produced correct routes and business priority shifted elsewhere. The system was in production, operators used it daily, and the C++ deployment pain was a known quantity I could manage manually.

Generation 3: Rust/WASM (2022)

When I revisited the solver in early 2022, I’d been writing Rust for other projects and realized WebAssembly could eliminate every deployment problem the C++ version had. WASM runs anywhere Node.js runs: no native toolchain, no cross-compilation, no brittle binaries. A .wasm file is as portable as a .js file.

The rewrite was not just a language migration. I replaced simulated annealing entirely with a multi-phase greedy solver using Jump Point Search distance computation. SA had always felt like overkill for a warehouse with fewer than 100 named positions, and the language migration gave me the confidence to rethink the algorithm alongside the runtime. The next section explains why in detail.

The ~144 KB WASM binary loads instantly and produces deterministic output: same input, same route, every time. tsify auto-generates TypeScript types from Rust structs, so the Node.js/Rust boundary is type-checked at compile time. No more mystery crashes at the FFI seam.

Each new solver was built alongside the existing one and swapped in once proven, following the Strangler Fig pattern, applied twice. The first swap replaced the Node.js + npm-library solver with C++/N-API; the second replaced C++/N-API with Rust/WASM. In both cases, the optimization function’s signature stayed stable: pass in the requested items and their slot availability, get back the optimized pick sequence. The API never changed; only the engine behind it did.

find_best_sorting.rs: stable across all three generations
// warehouse-opt/src/opt/find_best_sorting.rs

#[derive(Debug, PartialEq, Serialize, Deserialize, Tsify)]
#[tsify(into_wasm_abi, from_wasm_abi)]
#[serde(rename_all = "camelCase")]
pub struct FindBestSortingParams {
  pub requested_items: Vec<RequestedItem>,
  pub items_in_positions: Vec<ItemInPosition>,
}

#[derive(Debug, PartialEq, Serialize, Deserialize, Tsify)]
#[tsify(into_wasm_abi, from_wasm_abi)]
#[serde(rename_all = "camelCase")]
pub struct FindBestSortingResult {
  pub available: Vec<AvailableItem>,
  pub unavailable: Vec<UnavailableItem>,
}

pub fn find_best_sorting(params: FindBestSortingParams)
  -> FindBestSortingResult
{ /* ... */ }

The #[tsify(into_wasm_abi, from_wasm_abi)] annotation tells tsify to generate TypeScript type declarations for FindBestSortingParams and FindBestSortingResult. The #[serde(rename_all = "camelCase")] handles the naming convention mismatch: Rust uses snake_case, TypeScript expects camelCase, and serde translates between them during WASM serialization. The TypeScript code calling this function never knows it’s talking to Rust.

The WASM bridge itself is seven lines:

The entire WASM facade
// warehouse-opt-wasm/src/lib.rs

use warehouse_opt::{
  find_best_sorting, FindBestSortingParams, FindBestSortingResult
};
use wasm_bindgen::prelude::wasm_bindgen;

#[wasm_bindgen(js_name = findBestSorting)]
pub fn find_best_sorting_wasm(
  params: FindBestSortingParams
) -> FindBestSortingResult {
  find_best_sorting(params)
}

This is the Facade Pattern applied to the Rust/WASM boundary. The warehouse-opt crate contains all domain logic (pathfinding, caching, optimization) and has zero knowledge of WASM. The warehouse-opt-wasm crate is a thin shim that annotates types with wasm_bindgen and delegates to the core library. The core crate is testable with standard cargo test; the WASM crate exists only to produce the .wasm binary. If I ever needed to target native FFI instead of WASM, only the shim would change.

Why I Replaced Simulated Annealing

The transition from simulated annealing to multi-phase greedy deserves its own explanation because it went against the conventional wisdom. SA is a well-studied metaheuristic for TSP; a greedy solver is a textbook baseline that papers use as a comparison point. Replacing a sophisticated algorithm with a simpler one felt like a step backward.

It wasn’t. The comparison table makes the reasoning concrete:

| Property | Simulated Annealing (Gen 2) | Multi-Phase Greedy + 2-opt (Gen 3) |
| --- | --- | --- |
| Objective | Minimize total walking distance (classic TSP) | Multi-objective: minimize distance + prioritize ergonomics + reduce inventory fragmentation |
| Determinism | Stochastic: different runs may produce different routes | Deterministic: same input always produces same output |
| Solution quality | Near-optimal for pure TSP, but solving the wrong problem | Within ~1% of optimal on the actual warehouse problem |
| Domain awareness | None; treats all positions equally | Ground-level priority, slot depletion, nearest-neighbor chaining |
| Verifiability | “The RNG settled here” | Every decision traceable to a concrete rule |

SA optimizes a single scalar: total walking distance. That’s the right objective for the textbook TSP, but it’s not the right objective for a warehouse. An operator who picks 6 items from ground-level shelves and walks slightly farther finishes faster and with less fatigue than one who follows the distance-optimal path but reaches overhead shelves four times. SA can’t express “prefer ground-level shelves” without contorting the cost function into a weighted multi-objective mess that’s hard to tune and impossible to reason about deterministically.

The greedy solver encodes these priorities structurally: it selects ground-level items first, depletes nearly-empty slots to free warehouse space, and orders the final tour with nearest-neighbor chaining improved by 2-opt swaps. The behavior is transparent. An operator who receives a surprising route can trace the logic; with SA, the explanation is “the random number generator explored this permutation space and settled here.” Determinism matters when humans need to trust the output.

Pathfinding: A* and the Symmetry Problem

Before explaining the current solver’s phases, I need to cover the pathfinding algorithm that underpins all of them. The optimizer needs the walking distance between any two warehouse positions, and that distance must account for shelf racks that block direct paths. Computing this distance is a shortest-path problem on a 17x31 grid with obstacles.

A* is the standard algorithm for this. It uses a heuristic function to estimate the remaining distance to the goal, expanding the most promising nodes first. With an admissible heuristic (one that never overestimates), A* is optimal: it finds the true shortest path. It’s the default choice, and it was my first instinct.

The problem is efficiency, not correctness. On a uniform-cost grid with open corridors, many optimal paths exist between any two points. Moving right three cells then down two costs exactly the same as moving down two then right three, or any interleaving of those moves. A* doesn’t know this; it evaluates all symmetric variants, expanding far more nodes than necessary. In a warehouse grid with wide corridors and regular aisle spacing, the symmetry is severe. I profiled the Gen 1 Node.js solver and found A* spending most of its time rediscovering the same shortest distance through different orderings of identical-cost moves. The pathfinder was correct but wasteful.

I found the solution in a 2011 paper by Daniel Harabor and Alban Grastien1: Jump Point Search (JPS). It’s an optimization of A* that prunes redundant path evaluations on uniform-cost grids, maintaining optimality while dramatically reducing node expansions.

The core idea is best understood through progressive refinement, from the simplest mechanism to the most nuanced.

Directional pruning. When A* expands a node, it adds all neighbors to the open set. JPS instead examines the direction the algorithm traveled to reach the current node from its parent. Any neighbor reachable more efficiently through an alternative path, one that doesn’t go through the current node, gets pruned. What remains are “natural neighbors,” the nodes that genuinely require expansion from this direction. On an open grid, this prunes most of the symmetric paths immediately.

Straight-line jumping. When moving horizontally or vertically, JPS prunes all neighbors except the one directly ahead, then “jumps” forward in that direction. It skips intermediate nodes entirely, no expanding, no adding to the open set, until it hits an obstacle or discovers something interesting. In a warehouse corridor that runs 15 cells without interruption, A* would expand all 15 nodes; JPS jumps to the end in one step.

Diagonal jumping. When moving diagonally, JPS first recurses horizontally and vertically from the current position. If either direction reveals something interesting (a forced neighbor or the goal), it stops and records that position. Otherwise, it takes one diagonal step and repeats. This recursive structure is where most of the pruning power comes from: a single diagonal move triggers two full-length straight-line scans, and any findings propagate back immediately.

Forced neighbors. This is the key mechanism. An obstacle adjacent to the current path creates a position that is reachable optimally only through the current node; there’s no alternative path that avoids this node and still reaches that neighbor at the same cost. When JPS detects such a forced neighbor, it stops jumping and adds the current position to the priority queue as a “jump point.” Only jump points enter the queue, and there are far fewer of them than the total nodes A* would expand.
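The straight-line jump with its forced-neighbor check can be sketched compactly. This is an illustration of the mechanism for one direction (+x) on a boolean grid, not the production implementation:

```rust
// Sketch of the straight-line "jump" along +x (true = walkable). Cells
// are skipped, not expanded: the scan stops only at the goal, a dead end,
// or a forced neighbor.
fn blocked(grid: &[Vec<bool>], x: isize, y: isize) -> bool {
    y < 0
        || y as usize >= grid.len()
        || x < 0
        || x as usize >= grid[y as usize].len()
        || !grid[y as usize][x as usize]
}

fn jump_right(
    grid: &[Vec<bool>],
    start: (isize, isize),
    goal: (isize, isize),
) -> Option<(isize, isize)> {
    let (mut x, y) = start;
    loop {
        x += 1;
        if blocked(grid, x, y) {
            return None; // ran off the grid or into a shelf
        }
        if (x, y) == goal {
            return Some((x, y)); // the goal is always a jump point
        }
        // Forced neighbor: a wall directly above/below with an open cell
        // diagonally ahead is reachable optimally only through (x, y),
        // so (x, y) must enter the priority queue as a jump point.
        for dy in [-1, 1] {
            if blocked(grid, x, y + dy) && !blocked(grid, x + 1, y + dy) {
                return Some((x, y));
            }
        }
    }
}

fn main() {
    // A 3-row corridor with one shelf cell above the walking row at x = 3.
    let mut grid = vec![vec![true; 8]; 3];
    grid[0][3] = false;
    // The shelf forces a jump point at (3, 1); an open corridor row jumps
    // straight to the goal in a single call.
    assert_eq!(jump_right(&grid, (0, 1), (7, 1)), Some((3, 1)));
    assert_eq!(jump_right(&grid, (0, 2), (7, 2)), Some((7, 2)));
}
```

The full algorithm repeats this scan in all eight directions, with the diagonal case recursing into the two straight-line scans as described above.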

On Harabor’s benchmarks across game-style grid maps, JPS achieves 3-30x speedup over vanilla A* while maintaining optimality2. It requires no preprocessing and no additional memory beyond A*'s standard open/closed sets.

A* vs JPS Pathfinding
[Interactive demo: both algorithms solving the same query on the warehouse grid. A* expands 177 nodes; JPS expands only 13, with jump points highlighted.]

JPS in This Warehouse

The 17x31 warehouse grid is an ideal JPS target. Long, obstacle-free corridors let the algorithm jump far before encountering forced neighbors. Regular aisle spacing means most jumps traverse the full corridor width without interruption. The shelf racks form a predictable grid of obstacles that creates forced neighbors at consistent, sparse intervals, exactly the topology where JPS delivers its best speedups.

The implementation supports 8-directional movement with an octile distance heuristic: cardinal moves cost 1, diagonal moves cost √2. The octile heuristic is tighter than Chebyshev: it computes the exact cost of the optimal unconstrained path, so JPS expands fewer nodes before reaching the goal.

On heuristics: a heuristic in A*/JPS is a function h(n) that estimates the remaining cost from node n to the goal. An admissible heuristic never overestimates the true cost. With Chebyshev distance, h(n) = max(|Δx|, |Δy|), treating diagonal moves as cost 1. With octile distance, h(n) = max(|Δx|, |Δy|) + (√2 − 1) · min(|Δx|, |Δy|), which accounts for the actual √2 cost of diagonal moves. Both are admissible; octile is tighter, so the search wastes less time on nodes that look promising but aren’t.
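The two heuristics are one-liners; a minimal sketch, assuming unit cardinal cost:

```rust
// Chebyshev vs octile heuristics for 8-directional grids. Both are
// admissible; octile is tighter because it charges sqrt(2) per diagonal
// step instead of 1.
fn chebyshev(dx: f64, dy: f64) -> f64 {
    dx.abs().max(dy.abs())
}

fn octile(dx: f64, dy: f64) -> f64 {
    let (dx, dy) = (dx.abs(), dy.abs());
    dx.max(dy) + (2f64.sqrt() - 1.0) * dx.min(dy)
}

fn main() {
    // Moving 3 right and 2 down: two diagonal steps plus one straight step.
    assert_eq!(chebyshev(3.0, 2.0), 3.0);
    assert!((octile(3.0, 2.0) - (1.0 + 2.0 * 2f64.sqrt())).abs() < 1e-12);
    // Octile dominates Chebyshev: it never estimates less.
    assert!(octile(3.0, 2.0) >= chebyshev(3.0, 2.0));
}
```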

Compile-Time Grid Pipeline

JPS operates on (x, y) grid coordinates, not on position labels like “A1” or “C7.” The system needs a mapping from human-readable labels to grid coordinates, and that mapping must stay in sync with the physical warehouse layout. The code generation pipeline that produces this mapping evolved through three generations alongside the solver.

Generation 1: Haskell to TypeScript + SQL. A Haskell program parsed a warehouse map definition and generated TypeScript source files (grid arrays, position-to-coordinate maps) plus SQL INSERT INTO statements populating every valid position in the database.

Generation 2: Haskell to C++ + SQL. The same Haskell script extended to generate C++ source files when the solver moved to C++.

Generation 3: Rust build.rs, compile-time. The Rust build system incorporated code generation directly, eliminating the separate Haskell tool. A text file called grid.txt is the single source of truth for the warehouse layout. Each cell is either 0 (walkable floor), 1 (shelf/obstacle), @ (entry door, the tour’s start and end point), or a position label like A1:6 (a named rack with 6 shelves, also walkable). The build.rs script reads this file at compile time and generates generated.rs containing five artifacts:

  1. A phf_map!, a compile-time perfect hash map for O(1) label-to-coordinate lookups. The entry door @ is included as a first-class coordinate.
  2. A SHELF_MAP, the maximum shelf count per rack, used for validation.
  3. A static 17x31 Matrix<u8> encoding walkable vs. obstacle cells.
  4. ENTRY_RACK: &str = "@", the tour’s origin, resolved from the @ marker’s grid position.
  5. A OnceCell singleton wrapping the WarehouseGrid struct.
build.rs: reads grid.txt, emits generated.rs
use std::{env, fs, path::PathBuf};

fn main() {
  let out_path = PathBuf::from(env::var("OUT_DIR").unwrap());
  let grid = fs::read_to_string("grid.txt")
    .expect("Unable to read grid.txt file!");

  // Filled in when the @ marker is encountered during parsing.
  let mut entry_door: Option<(usize, usize)> = None;

  let grid: Vec<Vec<GridTile>> = grid.lines().map(|line| {
    line.trim_start_matches('[')
        .trim_end_matches(']')
        .split(',')
        .map(|s| match s.trim() {
          "0"      => GridTile::Walkable,
          "1"      => GridTile::Shelf,
          "@"      => { entry_door = Some(...); GridTile::Walkable },
          position => GridTile::Position2D(position.into()),
        })
        .collect()
  }).collect();

  let generated_rs = make_generated_rs(grid);
  fs::write(out_path.join("generated.rs"), &generated_rs).unwrap();

  println!("cargo:rerun-if-changed=build.rs");
  println!("cargo:rerun-if-changed=grid.txt");
}

The build.rs parses each cell, extracts position labels with their (column, row) coordinates, sorts them lexicographically by aisle, and emits the phf_map! entries grouped by aisle with blank lines between groups. The @ marker gets its own entry in both COORD_MAP and SHELF_MAP (with 0 shelves), so distance("@", "B5") works through the same JPS infrastructure as any rack-to-rack query.

A separate Node.js script generates the SQL INSERT INTO statements for the database. A layout change requires three steps: edit grid.txt, run the SQL generation script, recompile and redeploy. If the grid file is malformed, a typo in a position label or an inconsistent row length, the build fails. No malformed grid reaches production.

Codegen Pipeline
[Interactive diagram: the codegen stages and what each produces.]

From Position Labels to Grid Coordinates

The generated phf_map! maps every position label to its (x, y) coordinate on the 17x31 grid:

generated.rs: compile-time perfect hash map (abbreviated)
const COORD_MAP: CoordMap = phf_map! {
  // position label => (x, y) coordinate in the warehouse grid
  "@" => ( 0, 16 ),   // entry door, tour start/end

  "A1" => ( 0, 1 ),
  "A2" => ( 1, 2 ),
  "A3" => ( 1, 3 ),
  // ...
  "B1" => ( 5, 0 ),
  "B2" => ( 5, 1 ),
  "B3" => ( 5, 2 ),
  // ...
  "I1" => ( 29, 0 ),
  "I7" => ( 29, 6 ),

  "J1" => ( 29, 15 ),
  "J5" => ( 25, 15 ),
};

“Perfect hashing” means the hash function has no collisions; every key maps to a unique bucket. The phf crate computes this function at compile time by analyzing the full key set. At runtime, looking up COORD_MAP["C7"] is a single array index operation, not a hash table probe. There’s no runtime parsing of position strings, no string splitting, no character inspection. The label goes in, the coordinate comes out, in O(1) with a small constant.

On perfect hashing: a standard HashMap uses a general-purpose hash function and resolves collisions with chaining or probing. A perfect hash function is constructed for a specific, known key set and guarantees zero collisions. The tradeoff is that the key set must be known at compile time, so you can’t insert new keys at runtime. For a warehouse grid that changes only when the physical layout changes (and triggers a recompile), this tradeoff is ideal.
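The generated map can’t be reproduced here without the phf crate, but a stdlib stand-in conveys the contract. This is not phf’s actual output (phf emits a collision-free hash function, not a match), just an illustration of a lookup resolved entirely at compile time, using a subset of the labels from the abbreviated map above:

```rust
// Stand-in for the phf_map!: a match over a fixed key set, compiled
// ahead of time. Like a perfect hash, it allows no runtime insertion
// and no runtime hash probing.
fn coord(label: &str) -> Option<(u8, u8)> {
    match label {
        "@"  => Some((0, 16)), // entry door, tour start/end
        "A1" => Some((0, 1)),
        "B1" => Some((5, 0)),
        "I1" => Some((29, 0)),
        _    => None,
    }
}

fn main() {
    assert_eq!(coord("@"), Some((0, 16)));
    assert_eq!(coord("Z9"), None); // unknown labels fail fast
}
```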

When the optimizer needs the distance between two positions, it translates both labels to coordinates via the phf_map!, then passes those coordinates to JPS. The map is regenerated whenever grid.txt changes, so a layout reorganization cannot produce stale coordinate lookups. If the grid file and the code disagree, the build fails. The disagreement is impossible in production.

Distance Cache

The real performance win comes from caching JPS results, not from raw JPS speed. With ~96 named positions plus the @ entry door, the maximum number of unique pairwise distances is (97 · 96) / 2 = 4,656.

Distance Matrix Heatmap
[Interactive heatmap: each cell is the walking distance between two warehouse positions. The 96 positions form a symmetric matrix with 4,560 unique pairs; distances range from 0 to 47 steps (teal = nearby, red = far apart).]

A custom SortedPair key type exploits the mathematical property that d(A, B) = d(B, A):

sorted_pair.rs: symmetric edge key
/// A homogeneous pair of values that are always sorted in ascending order.
/// SortedPair::new(1, 2) and SortedPair::new(2, 1) are both stored as
/// (1, 2) under the hood. This allows using SortedPair as a key in a
/// HashMap whenever the order of the pair doesn't matter, e.g., for
/// modeling symmetric edges in a graph.
#[derive(PartialEq, Eq, PartialOrd, Ord, Hash, Debug)]
pub struct SortedPair<T> {
  smaller: T,
  bigger: T,
}

impl<T: Ord> SortedPair<T> {
  pub fn new(a: T, b: T) -> SortedPair<T> {
    if a < b {
      SortedPair { smaller: a, bigger: b }
    } else {
      SortedPair { smaller: b, bigger: a }
    }
  }
}

SortedPair::new("C7", "A1") and SortedPair::new("A1", "C7") produce the same key. This halves the number of JPS computations required to fill the cache. The distance values are stored as u16. JPS computes a floating-point octile distance, which gets multiplied by 100 and rounded to an integer. A u16 can represent distances up to 655.35 grid units; the warehouse is only 17x31, so the longest possible path fits comfortably.
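The fixed-point conversion is a one-liner; a sketch of the ×100 scaling described above (the function name is illustrative):

```rust
// Store a floating-point octile distance as a u16 in hundredths of a
// grid unit. The 655.35-unit ceiling dwarfs any path in a 17x31 grid.
fn to_fixed(distance: f64) -> u16 {
    (distance * 100.0).round() as u16
}

fn main() {
    assert_eq!(to_fixed(2f64.sqrt()), 141); // one diagonal step
    assert_eq!(to_fixed(12.0), 1200);       // twelve cardinal steps
}
```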

The cache sits inside the WarehouseGrid struct, guarded by an RwLock for thread safety (required by the OnceCell singleton, even though the system doesn’t use multi-threading):

warehouse_matrix.rs: distance cache with read/write lock
#[derive(Debug, Default)]
pub struct WarehouseGrid {
  grid: Matrix<u8>,
  distance_cache: RwLock<HashMap<SortedPair<String>, u16>>,
}

impl WarehouseGrid {
  /// Returns the walking distance between two warehouse positions.
  /// Checks the cache first; computes via JPS on a miss.
  pub fn distance(&self, source_label: &str, target_label: &str) -> u16 {
    let key: SortedPair<String> =
      (source_label.to_string(), target_label.to_string()).into();

    let distance_cache = self.distance_cache.read().unwrap();
    if let Some(distance) = distance_cache.get(&key) {
      return *distance;
    }
    drop(distance_cache);

    let distance = self.compute_distance(source_label, target_label);

    let mut distance_cache = self.distance_cache.write().unwrap();
    distance_cache.insert(key, distance);
    distance
  }
}

The lookup pattern is read-optimistic: acquire a read lock, check if the key exists, return immediately on a hit. On a miss, drop the read lock, compute the distance via JPS, acquire a write lock, insert the result. After a brief warm-up period (the first few batch orders that exercise different position pairs), nearly all distance lookups are cache hits. JPS only runs for previously unseen position pairs, and with 4,656 maximum entries, the cache fills quickly.

The Current Algorithm

The solver executes in four phases. Each phase narrows the problem: the first removes what can’t be done, the second identifies slots worth emptying, the third selects items with ergonomic priority, and the fourth optimizes the walk sequence.

Phase 1: Availability Filtering

Build a sorted map of requested items with their warehouse quantities. For each item, compare the total available quantity across all positions against the requested quantity. Items where demand exceeds supply are extracted into an unavailable list with the unfulfilled gap. The remaining items proceed to Phase 2 with their quantities capped to what’s available. This phase is O(n) in the number of requested items.
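The demand-vs-supply split can be sketched in a few lines. The function and type names here are illustrative, not the production API:

```rust
use std::collections::BTreeMap;

// Phase 1 sketch: compare each requested quantity against total stock,
// cap fulfillable demand at what's available, and record the unfulfilled
// gap for the unavailable list.
fn filter_availability(
    requested: &BTreeMap<&str, u32>, // item -> requested quantity
    stock: &BTreeMap<&str, u32>,     // item -> total units across positions
) -> (BTreeMap<String, u32>, BTreeMap<String, u32>) {
    let mut available = BTreeMap::new();
    let mut unavailable = BTreeMap::new();
    for (&item, &want) in requested {
        let have = stock.get(item).copied().unwrap_or(0);
        if have > 0 {
            available.insert(item.to_string(), want.min(have)); // capped demand
        }
        if want > have {
            unavailable.insert(item.to_string(), want - have); // unfulfilled gap
        }
    }
    (available, unavailable)
}

fn main() {
    let requested = BTreeMap::from([("bolt", 5), ("nut", 3)]);
    let stock = BTreeMap::from([("bolt", 2)]);
    let (available, unavailable) = filter_availability(&requested, &stock);
    assert_eq!(available.get("bolt"), Some(&2));   // capped at stock
    assert_eq!(unavailable.get("bolt"), Some(&3)); // gap of 3
    assert_eq!(unavailable.get("nut"), Some(&3));  // entirely unavailable
}
```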

Phase 2: Depletion Detection

Before any picking decisions, identify positions that could be fully emptied by this order. A position is “depletable” if the item exists at 2+ warehouse locations and units_in_position ≤ remaining_demand. Emptying a slot frees physical warehouse space, a business priority that the old radial-sort algorithm couldn’t express.

The depletable set is computed once and stored as a sorted Vec for O(log n) binary-search lookups in the sort comparators. No heap allocations occur during comparisons.
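A sketch of the depletability predicate and the sorted result (names are illustrative):

```rust
// Phase 2 sketch: a slot is depletable when the item is stocked in at
// least two locations and this slot's units fit within remaining demand.
// The result stays sorted so later comparators can binary-search it
// without allocating.
fn depletable_positions(
    slots: &[(&str, u32)], // (position, units) for one item, all locations
    remaining_demand: u32,
) -> Vec<String> {
    if slots.len() < 2 {
        return Vec::new(); // single location: never deplete the only stock
    }
    let mut out: Vec<String> = slots
        .iter()
        .filter(|&&(_, units)| units <= remaining_demand)
        .map(|&(pos, _)| pos.to_string())
        .collect();
    out.sort(); // sorted Vec enables O(log n) binary_search in comparators
    out
}

fn main() {
    let slots = [("C7", 10), ("B3", 2), ("E5", 4)];
    let depletable = depletable_positions(&slots, 4);
    assert_eq!(depletable, vec!["B3".to_string(), "E5".to_string()]);
    assert!(depletable.binary_search(&"B3".to_string()).is_ok());
    assert!(depletable_positions(&[("A1", 1)], 5).is_empty());
}
```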

Phase 3: Tiered Selection

Items are partitioned into ground-level (shelves 1, 4, 7, 10, where (s − 1) mod 3 = 0) and upper-shelf. Within each tier, items are sorted by a composite key:

sort key = (is_depletable ↓, JPS_distance ↑, units_in_position ↑)

Depletable positions sort first; among depletable positions, closer ones come first; among equidistant positions, smaller stocks come first (draining small slots before large ones). A fold then walks through the sorted items, consuming requested quantities and producing AvailableItem records. Ground-level items are processed first; the remaining demand carries over to upper-shelf positions.
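The composite key maps directly onto a Rust tuple sort. A sketch with an illustrative `Slot` type (not the production struct):

```rust
// Phase 3 sketch: the composite sort key as a tuple. `!depletable` puts
// depletable slots first (false < true); distance and units then sort
// ascending, exactly the (↓, ↑, ↑) key above.
#[derive(Debug, PartialEq, Clone)]
struct Slot {
    position: &'static str,
    depletable: bool,
    distance: u16, // cached JPS distance
    units: u32,    // stock remaining in this slot
}

fn sort_tier(slots: &mut [Slot]) {
    slots.sort_by_key(|s| (!s.depletable, s.distance, s.units));
}

fn main() {
    let mut tier = vec![
        Slot { position: "C7", depletable: false, distance: 300, units: 5 },
        Slot { position: "B3", depletable: true,  distance: 800, units: 2 },
        Slot { position: "A1", depletable: false, distance: 300, units: 1 },
    ];
    sort_tier(&mut tier);
    let order: Vec<_> = tier.iter().map(|s| s.position).collect();
    // Depletable first despite being farther; distance ties break by units.
    assert_eq!(order, vec!["B3", "A1", "C7"]);
}
```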

Phase 4: NN + 2-opt Tour Ordering

The items selected in Phase 3 are now reordered for the actual walking tour. This is where the biggest improvement came from.

What I had before: a one-shot sort by distance from the entry door @. This produces a radial sweep: positions at similar distances from the door but in different aisles get interleaved, causing the operator to zig-zag. On hand-crafted scenarios, this radial sort produced tours 55% longer than optimal.

What I have now: nearest-neighbor construction followed by 2-opt local search. Starting from @, the algorithm repeatedly picks the closest unvisited position. Then it runs a 2-opt improvement pass: for every pair of edges (i,j)(i, j) in the tour, it checks whether reversing the sub-segment between them would shorten the total distance. If so, it applies the reversal and repeats until no improvement is found.

The 2-opt operates on the open path (@ → pos[0] → … → pos[n−1], no return-to-origin assumption), so it’s composable with the two-tier structure: ground positions are optimized as one open path from @, then upper-shelf positions are optimized as another open path from the last ground position. The return leg to @ is added separately at the end.

This is O(n²) per pass; for 15 positions, that’s ~225 edge-pair checks. Trivial. The improvement over pure nearest-neighbor is substantial because 2-opt fixes the “greedy trap” where NN makes a locally optimal choice that forces a later backtrack.
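Both steps fit in a short sketch on a symmetric distance matrix, where index 0 stands in for the depot and the other indices for position labels. This is an illustration of the technique, not the production solver:

```rust
// Nearest-neighbor construction + 2-opt improvement on an open path
// starting at node 0 (the depot).
fn path_len(d: &[Vec<u32>], tour: &[usize]) -> u32 {
    let mut len = d[0][tour[0]]; // depot -> first stop
    for w in tour.windows(2) {
        len += d[w[0]][w[1]];
    }
    len
}

fn nearest_neighbor(d: &[Vec<u32>], n: usize) -> Vec<usize> {
    let mut tour = Vec::with_capacity(n);
    let mut left: Vec<usize> = (1..=n).collect();
    let mut cur = 0; // start at the depot
    while !left.is_empty() {
        let (i, _) = left
            .iter()
            .enumerate()
            .min_by_key(|&(_, &v)| d[cur][v])
            .unwrap();
        cur = left.remove(i);
        tour.push(cur);
    }
    tour
}

fn two_opt(d: &[Vec<u32>], tour: &mut Vec<usize>) {
    let mut improved = true;
    while improved {
        improved = false;
        for i in 0..tour.len() - 1 {
            for j in i + 1..tour.len() {
                let mut cand = tour.clone();
                cand[i..=j].reverse(); // 2-opt move: reverse a sub-segment
                if path_len(d, &cand) < path_len(d, tour) {
                    *tour = cand;
                    improved = true;
                }
            }
        }
    }
}

fn main() {
    // Positions on a line at x = 0 (depot), 1, 10, 11, -1; d[i][j] = |x_i - x_j|.
    let d: Vec<Vec<u32>> = vec![
        vec![0, 1, 10, 11, 1],
        vec![1, 0, 9, 10, 2],
        vec![10, 9, 0, 1, 11],
        vec![11, 10, 1, 0, 12],
        vec![1, 2, 11, 12, 0],
    ];
    let mut tour = nearest_neighbor(&d, 4);
    assert_eq!(path_len(&d, &tour), 15); // greedy trap: node 1 grabbed first
    two_opt(&d, &mut tour);
    assert_eq!(path_len(&d, &tour), 13); // one reversal untangles it
}
```

The test data is the greedy trap in miniature: NN grabs the nearest node and strands the one behind the depot; a single 2-opt reversal recovers the shorter path.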

Route Comparison
[Interactive demo: the naive radial-sort route (C7 → I5 → H2 → H1 → B3 → A2 → E5 → A5) vs the optimized route (A2 → B3 → A5 → C7 → E5 → I5 → H1 → H2) over the same destinations.]

The two sorted lists are chained:

let available_items: Vec<_> = ground_level_selected_items
  .into_iter()
  .chain(non_ground_selected_items.into_iter())
  .collect();

The output is a single FindBestSortingResult containing the available items in pick order and the unavailable items with their unfulfilled quantities.

Solver Generations
[Interactive timeline: the three solver eras with simulated execution.]

Measuring Against the Optimum

For a long time, I had no way of knowing how close the greedy was to optimal. I assumed the Travelling Salesman Problem was too expensive to solve exactly for benchmarking purposes, and that simulated annealing or heuristic comparisons were the best I could do.

I was wrong. I should have built an ILP formulation much earlier.

The warehouse TSP instances are small, at most 15 positions per order, split into two tiers of fewer than 10 each. Modern ILP solvers handle these sizes in milliseconds. I used HiGHS (open-source, no licensing cost) via the good_lp Rust crate with an MTZ subtour-elimination formulation3.

The formulation for each tier’s sub-tour: given n positions and a depot, find the minimum-cost open Hamiltonian path starting at the depot. The trick to convert a cycle formulation into an open path is to add a virtual end node e with zero-cost edges to and from all real nodes. Forcing x_{e,0} = 1 (the virtual end connects back to the depot) closes the cycle, but since the virtual edges are free, the real-edge cost equals the open-path cost.

Decision variables:

x_{ij} \in \{0, 1\} \quad \forall\ i, j \in \{0, \ldots, n, e\}

u_i \in \{1, \ldots, m - 1\} \quad \forall\ i \geq 1 \quad (u_0 = 0)

Objective:

\min \sum_{i,j} c_{ij} \cdot x_{ij}

where c_{ij} is the cost matrix (JPS distance + shelf access penalty for arriving at position j), and edges involving the virtual end have zero cost.

Constraints:

\sum_{j \neq i} x_{ij} = 1 \quad \forall\ i \qquad \text{(exactly one outgoing edge)}

\sum_{j \neq i} x_{ji} = 1 \quad \forall\ i \qquad \text{(exactly one incoming edge)}

x_{ii} = 0 \quad \forall\ i \qquad \text{(no self-loops)}

x_{e,0} = 1, \quad x_{0,e} = 0 \qquad \text{(virtual end returns to depot; depot doesn't skip to end)}

u_i - u_j + m \cdot x_{ij} \leq m - 1 \quad \forall\ i, j \geq 1 \qquad \text{(MTZ subtour elimination)}

The MTZ constraints ensure the solution forms a single connected path rather than multiple disjoint loops. The u_i variables encode the visit order, and the constraint u_i − u_j + m · x_{ij} ≤ m − 1 prevents any subset of nodes from forming an isolated cycle.
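To see the mechanism concretely, suppose two non-depot nodes i and j formed their own two-node loop, so x_{ij} = x_{ji} = 1. The corresponding MTZ constraints become:

```latex
u_i - u_j + m \cdot 1 \leq m - 1 \;\Rightarrow\; u_i - u_j \leq -1
u_j - u_i + m \cdot 1 \leq m - 1 \;\Rightarrow\; u_j - u_i \leq -1
```

Adding the two inequalities gives 0 ≤ −2, a contradiction, so no feasible solution contains such a loop. The same telescoping argument rules out any longer cycle that avoids the depot: summing the constraints around a k-node cycle yields km ≤ k(m − 1), i.e. 0 ≤ −k.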

What the Numbers Showed

I ran both solvers across 10 hand-crafted scenarios (8-14 positions each) and 50 random scenarios (5 items each), using three different cost strategies:

| Strategy | Description | Mean gap | p50 | Max gap |
| --- | --- | --- | --- | --- |
| Walk-only | Pure JPS distance, no shelf penalty | 0.8% | 0.0% | 24.7% |
| Walk + shelf | Linear shelf penalty (300 per floor) | 0.7% | 0.0% | 20.0% |
| Walk + heavy shelf | Non-linear (600 mid / 1500 high) | 0.5% | 0.0% | 15.7% |

The median gap is zero; the greedy finds the optimal tour in more than half the scenarios. The mean gap is under 1%. The worst case (24.7%) appears in a single pathological random scenario. On the hand-crafted scenarios, 8 out of 10 match the optimum exactly; the remaining two have gaps of 0.4%.

This was the most important thing I learned in this project: I should have started with the ILP formulation earlier. Not to use it in production (HiGHS depends on native C++ and can’t compile to WASM) but as a benchmark. For years I was tuning the greedy heuristic with no ground truth, relying on intuition about whether a route “looked right.” The ILP gave me an exact upper bound on solution quality. It told me when the greedy was already optimal and when it wasn’t. It told me that the radial-sort output ordering was terrible (55% gap) and that nearest-neighbor + 2-opt closed the gap almost entirely (0.4% on the same scenarios). Without the ILP, I’d still be guessing.

The greedy runs in ~2-6ms; the ILP takes 10-700ms depending on instance size. Both are fast enough for an interactive system, but only the greedy compiles to WASM. The ILP lives in a separate `warehouse-opt-bench` crate that's never shipped to production; it exists solely to keep the greedy honest.

Performance

Four optimizations compound to produce the system’s runtime characteristics:

| Optimization | Impact |
| --- | --- |
| Memoization with symmetric key normalization | Custom `SortedPair` key type halves JPS computations by encoding $d(A,B) = d(B,A)$ at the type level |
| Compile-time code generation | `build.rs` reads `grid.txt` and generates a perfect hash map, static grid literal, lazy singleton, and coordinate lookup function |
| Aggressive release profile | Link-time optimization, `opt-level = 3`, disabled overflow checks (safe: u16 distances max at 655.35 grid units, warehouse is only 17x31) |
| Compact binary | ~144 KB compiled WASM, loads instantly on the memory-constrained 4 GB server |
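The symmetric-key idea can be sketched in a few lines. This is a simplified reconstruction, not the crate's actual `SortedPair` implementation: the constructor orders the endpoints, so `d(A, B)` and `d(B, A)` resolve to the same cache entry.

```rust
use std::collections::HashMap;

// Simplified reconstruction of the SortedPair idea: normalize the key so
// both argument orders hit one memoized distance.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct SortedPair(u16, u16);

impl SortedPair {
    fn new(a: u16, b: u16) -> Self {
        if a <= b { SortedPair(a, b) } else { SortedPair(b, a) }
    }
}

fn main() {
    let mut cache: HashMap<SortedPair, u16> = HashMap::new();
    cache.insert(SortedPair::new(7, 3), 42); // memoize d(7, 3)

    // Querying d(3, 7) hits the same entry: the key normalized the order.
    assert_eq!(cache.get(&SortedPair::new(3, 7)), Some(&42));
    assert_eq!(cache.len(), 1); // one entry covers both directions
}
```

Because the normalization lives in the key type rather than at each call site, no code path can accidentally compute or store a distance twice.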
Release profile: every byte earns its keep
```toml
# rust/Cargo.toml

[profile.release]
lto = true              # link-time optimization
strip = "debuginfo"     # strip debug sections
opt-level = 3           # optimize for speed
overflow-checks = false # safe: u16 max is 65535, warehouse is 17x31
```

The `lto = true` directive enables link-time optimization across all crate boundaries, allowing the compiler to inline functions from `warehouse-opt` into `warehouse-opt-wasm` and eliminate dead code that wouldn't be visible within a single crate. `strip = "debuginfo"` removes debug sections from the binary without affecting function names (useful for any remaining stack traces). `overflow-checks = false` disables runtime integer overflow detection, safe here because u16 distances cap at 65,535 and the warehouse's longest possible path is well under that limit.
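A back-of-envelope check makes the overflow-safety claim concrete: the grid has 17 × 31 = 527 cells, so even a path that somehow visited every cell once would stay far below the u16 ceiling.

```rust
fn main() {
    // 17 x 31 grid: no simple path can exceed the total cell count,
    // so u16 distance arithmetic cannot overflow in practice.
    let cells: u16 = 17 * 31;
    assert_eq!(cells, 527);
    assert!(cells < u16::MAX); // 65_535
}
```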

The REST API measures WASM execution time for every batch order request. Route computation completes in sub-second time; the full API response including database queries and serialization stays well within interactive thresholds. The first request after a server restart is slower (WASM module instantiation, grid singleton initialization, cold distance cache), but subsequent requests benefit from the warm cache and initialized singleton.

On the migration strategy: the Strangler Fig pattern made the three-generation evolution possible without ever taking the system offline. Each new solver was deployed alongside the previous one, with a feature flag in the API to switch between them. Once the new solver passed validation on real orders (same inputs, comparing outputs), the flag was flipped and the old implementation was removed in the next deploy. The API contract (`FindBestSortingParams` in, `FindBestSortingResult` out) never changed. The optimization engine is a replaceable module behind a stable interface, and I've replaced it twice.
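The dispatch behind that flag can be sketched as follows. Only `FindBestSortingParams` and `FindBestSortingResult` come from the real API contract; the trait, flag, solver types, and their placeholder logic are hypothetical.

```rust
// Hypothetical shapes standing in for the real API contract types.
struct FindBestSortingParams { items: Vec<u32> }
struct FindBestSortingResult { order: Vec<u32> }

// Stable interface: both generations implement the same trait.
trait Solver {
    fn solve(&self, p: &FindBestSortingParams) -> FindBestSortingResult;
}

struct OldSolver; // previous generation, kept deployed during validation
struct NewSolver; // candidate generation behind the flag

impl Solver for OldSolver {
    fn solve(&self, p: &FindBestSortingParams) -> FindBestSortingResult {
        FindBestSortingResult { order: p.items.clone() } // placeholder logic
    }
}

impl Solver for NewSolver {
    fn solve(&self, p: &FindBestSortingParams) -> FindBestSortingResult {
        let mut order = p.items.clone();
        order.reverse(); // placeholder logic
        FindBestSortingResult { order }
    }
}

// The feature flag picks the implementation; callers never see the switch.
fn pick_solver(use_new_engine: bool) -> Box<dyn Solver> {
    if use_new_engine { Box::new(NewSolver) } else { Box::new(OldSolver) }
}

fn main() {
    let p = FindBestSortingParams { items: vec![1, 2, 3] };
    assert_eq!(pick_solver(false).solve(&p).order, vec![1, 2, 3]);
    assert_eq!(pick_solver(true).solve(&p).order, vec![3, 2, 1]);
}
```

Validation then amounts to running both boxes on the same `FindBestSortingParams` and diffing the results before flipping the flag.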


Three generations, three languages, two algorithmic paradigms, and the final engine is the simplest of the three. The ~144 KB WASM binary with compile-time grid generation, JPS pathfinding, and a nearest-neighbor + 2-opt solver runs faster, deploys more easily, and produces routes within 1% of the mathematical optimum. The greedy doesn't need to be optimal; it needs to encode ergonomic priorities and inventory defragmentation that a stochastic optimizer can't express, and it needs to produce the same route every time. The ILP sitting next to it in the benchmark crate ensures I know the cost of those tradeoffs in exact numbers, not gut feelings.

Part 3 picks up at the system boundary: how the REST API, gRPC services, and WASM engine are composed into a deployable system. I’ll cover the monorepo structure, the Vertical Slice Architecture that repeats across all API modules, the hybrid REST/gRPC architecture, the log pipeline that turns stdout into email alerts, and the systemd deployment model that replaced Docker on a server where Docker wasn’t allowed. The constraints from Part 1 and the optimization engine from this post are the foundation; Part 3 shows how they’re wired together and kept running.