
Stochastic Greedy: Scaling Submodular Maximization to Massive Datasets

Series parts
  1. Part 1: An Introduction to Submodularity
  2. Part 2: The Greedy Algorithm for Submodular Maximization
  3. Part 3: Stochastic Greedy: Scaling Submodular Maximization to Massive Datasets

In Part 2, I showed that the greedy algorithm achieves a $(1 - 1/e) \approx 0.632$ approximation ratio for maximizing a monotone submodular function under a cardinality constraint, the best any polynomial-time algorithm can guarantee. The algorithm is simple: at each of $k$ steps, scan all $n$ elements, pick the one with the largest marginal gain, and add it to the solution.

The cost is $\mathcal{O}(n \cdot k)$ evaluations of $f$. For modest problem sizes, this is perfectly adequate. The trouble starts when $n$ is large.

The Scalability Problem

Consider a concrete scenario: you are selecting $k = 1{,}000$ representative images from a corpus of $n = 10{,}000{,}000$ for a dataset summarization task. Each evaluation of $f$, which might involve computing pairwise similarities or coverage scores, takes around 1ms. The greedy algorithm needs $n \cdot k = 10^{10}$ evaluations. At 1ms each, that is $10^7$ seconds, or roughly 115 days.

Lazy Greedy (Part 2) can help in practice by skipping redundant recomputations, sometimes by a factor of 5x to 100x. But the worst-case bound remains $\mathcal{O}(n \cdot k)$, and on adversarial or poorly structured instances the priority queue offers no benefit. We need an algorithm whose theoretical runtime is better, not just one that is faster on benign inputs.

The question is whether we can replace the linear dependence on $k$ with something smaller, while preserving (or nearly preserving) the $(1 - 1/e)$ guarantee.

The Key Idea: Random Subsampling

The answer turns out to be straightforward. Instead of scanning all $n$ remaining elements at each step to find the best marginal gain, sample a small random subset $Q$ and pick the best element from $Q$.

The intuition is natural. The optimal solution $S^*$ contains $k$ elements scattered across the ground set $\mathcal{V}$. If you draw a random sample of size $s$ from the $n$ available elements, the probability that $Q$ contains at least one element from $S^*$ is at least $1 - (1 - k/n)^s$. Set $s = (n/k) \cdot \ln(1/\varepsilon)$ and this probability exceeds $1 - \varepsilon$. The sample does not need to be large; it just needs to be large enough to "hit" at least one good element with high probability.

Hit Probability Simulator

Below is a ground set of $n = 100$ elements, 5 of which are "good" (the optimal solution $S^*$). Each time you draw a sample, the simulator picks $s$ random elements and checks whether any overlap with $S^*$; a hit means the sample found at least one good element. The question: how large does $s$ need to be to hit reliably?

[Interactive simulator. The theoretical curve is $P(\text{hit}) = 1 - (1 - k/n)^s$: at $s = 20$ it sits near 64.2%, and with $\varepsilon = 0.01$, the formula gives $s = 92$, guaranteeing $P(\text{hit}) \geq 99\%$. Try a small $s$ (say 5) and notice how often you miss; by $s = 40$–$50$, hits become near-certain.]
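If you cannot run the simulator, the same experiment takes a few lines of Python. This is my own Monte Carlo sketch of the setup above (`hit_probability` is an illustrative helper, not from the post), using the simulator's $n = 100$, $k = 5$, $s = 20$:

```python
import random
from math import comb

def hit_probability(n, k, s, trials=100_000, seed=0):
    """Monte Carlo estimate of P(a random size-s subset of [n] contains
    at least one of the k 'good' elements)."""
    rng = random.Random(seed)
    good = set(range(k))  # by symmetry, which k elements are "good" is arbitrary
    hits = sum(
        not good.isdisjoint(rng.sample(range(n), s))
        for _ in range(trials)
    )
    return hits / trials

n, k, s = 100, 5, 20                       # the simulator's setup
bound = 1 - (1 - k / n) ** s               # the 1 - (1 - k/n)^s curve
exact = 1 - comb(n - k, s) / comb(n, s)    # exact hypergeometric probability
print(f"bound ≈ {bound:.3f}, exact ≈ {exact:.3f}, "
      f"simulated ≈ {hit_probability(n, k, s):.3f}")
```

Note that drawing a subset (sampling without replacement) does slightly better than the $1 - (1 - k/n)^s$ curve, which is a lower bound on the hit probability; the simulated rate therefore lands a bit above 64.2% at $s = 20$.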

If you have worked with stochastic gradient descent, the pattern is familiar. SGD replaces the full gradient computation (over all data points) with a noisy estimate computed on a mini-batch. The estimate is noisy on any single step, but the cumulative effect over many steps converges to the right answer, with controllable variance. Stochastic Greedy operates on the same principle: each step is noisier than full greedy, but the overall solution quality degrades only by a small, tunable factor $\varepsilon$.

The Stochastic Greedy Algorithm

Mirzasoleiman et al. (2015) introduced this algorithm under the evocative title "Lazier Than Lazy Greedy." The entire modification to the standard greedy algorithm fits in a single line: the for each over $\mathcal{V} \setminus S$ becomes a for each over a random sample $Q$:

$$
\boxed{
\begin{aligned}
& \textbf{StochasticGreedy}(f, \mathcal{V}, k, \varepsilon) \\
& \quad s \leftarrow \left\lfloor \frac{n}{k} \cdot \ln\!\left(\frac{1}{\varepsilon}\right) \right\rfloor \\
& \quad S_0 \leftarrow \varnothing \\
& \quad \textbf{for } j = 1 \textbf{ to } k\textbf{:} \\
& \quad \quad Q \leftarrow \text{random subset of } \mathcal{V} \setminus S_{j-1} \text{ of size } \min(s,\ |\mathcal{V} \setminus S_{j-1}|) \\
& \quad \quad e^* \leftarrow \arg\max_{e \in Q}\, f(e \mid S_{j-1}) \\
& \quad \quad S_j \leftarrow S_{j-1} \cup \{e^*\} \\
& \quad \textbf{return } S_k
\end{aligned}
}
$$

The sample size $s = \lfloor (n/k) \cdot \ln(1/\varepsilon) \rfloor$ is the only new parameter. It controls the tradeoff between speed and approximation quality:

  • Smaller $\varepsilon$ $\Rightarrow$ larger $s$ $\Rightarrow$ closer to full greedy $\Rightarrow$ better guarantee, slower.
  • Larger $\varepsilon$ $\Rightarrow$ smaller $s$ $\Rightarrow$ more aggressive subsampling $\Rightarrow$ faster, weaker guarantee.

The sample $Q$ is re-drawn independently at every step, which means Stochastic Greedy covers the entire ground set in expectation over $k$ iterations.
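The one-line nature of the change is easiest to see in code. Here is a minimal Python sketch of the algorithm; the maximum-coverage objective and the helper names (`stochastic_greedy`, `gain`, `covers`) are my own illustration, not from the paper:

```python
import math
import random

def stochastic_greedy(gain, ground, k, eps=0.01, seed=0):
    """Stochastic Greedy: per step, scan only a random subsample of size
    s = floor((n/k)·ln(1/eps)) instead of all of V.

    gain(e, S) should return the marginal gain f(S ∪ {e}) − f(S).
    """
    rng = random.Random(seed)
    n = len(ground)
    s = max(1, math.floor(n / k * math.log(1 / eps)))
    S, remaining = [], set(ground)
    for _ in range(k):
        # The only change vs. standard greedy: draw Q from V \ S and scan Q.
        Q = rng.sample(sorted(remaining), min(s, len(remaining)))
        best = max(Q, key=lambda e: gain(e, S))
        S.append(best)
        remaining.discard(best)
    return S

# Toy monotone submodular objective: maximum coverage. Element i covers
# items {i, ..., i+4}; f(S) is the number of distinct items covered.
covers = {i: set(range(i, i + 5)) for i in range(50)}

def gain(e, S):
    covered = set().union(*(covers[x] for x in S)) if S else set()
    return len(covers[e] - covered)

selection = stochastic_greedy(gain, range(50), k=5, eps=0.01)
print(selection)
```

Standard greedy would be the same loop with `Q = sorted(remaining)`; everything else is identical.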

Runtime Analysis

Evaluations per step: $s = \mathcal{O}((n/k) \cdot \log(1/\varepsilon))$ instead of $n$.

Total evaluations: $k \cdot s = k \cdot \mathcal{O}\!\left(\frac{n}{k} \cdot \log\frac{1}{\varepsilon}\right) = \mathcal{O}(n \cdot \log(1/\varepsilon))$.

The factor of $k$ in the greedy runtime has been replaced by $\log(1/\varepsilon)$. For any fixed $\varepsilon$, this is a constant, so the total cost is essentially linear in $n$.

A concrete comparison makes the difference tangible:

| Algorithm | $n$ | $k$ | $\varepsilon$ | Evaluations |
| --- | --- | --- | --- | --- |
| Greedy | $1{,}000{,}000$ | $1{,}000$ | - | $10^9$ |
| Stochastic Greedy | $1{,}000{,}000$ | $1{,}000$ | $0.01$ | $\approx 4.6 \times 10^6$ |

That is a ~217x speedup. At 1ms per evaluation, the wall-clock time drops from 11.6 days to 77 minutes.

The speedup factor is $k / \log(1/\varepsilon)$. As $k$ grows, the advantage of Stochastic Greedy becomes more pronounced. For $k = 10{,}000$ and $\varepsilon = 0.01$, the speedup is $\sim$2,174x.
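The numbers in the table fall straight out of the two formulas. A throwaway sanity check (function names are mine):

```python
import math

def greedy_evals(n, k):
    # Standard greedy: full scan of n elements at each of k steps.
    return n * k

def stochastic_evals(n, k, eps):
    # k steps, each scanning a sample of size (n/k)·ln(1/eps),
    # i.e. n·ln(1/eps) evaluations in total.
    return round(k * (n / k) * math.log(1 / eps))

n, k, eps = 1_000_000, 1_000, 0.01
print(greedy_evals(n, k))                                # 10^9
print(stochastic_evals(n, k, eps))                       # ≈ 4.6 × 10^6
print(greedy_evals(n, k) / stochastic_evals(n, k, eps))  # ≈ 217x speedup
```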

The Approximation Guarantee

Theorem (Mirzasoleiman et al. 2015). Let $f : 2^{\mathcal{V}} \rightarrow \mathbb{R}_+$ be a monotone submodular function, $k$ a positive integer, and $\varepsilon > 0$. The Stochastic Greedy algorithm returns a set $S_k$ satisfying:

$$\mathbb{E}[f(S_k)] \geq \left(1 - \frac{1}{e} - \varepsilon\right) \cdot f(S^*)$$

The guarantee is $(1 - 1/e - \varepsilon)$ in expectation, compared to the deterministic $(1 - 1/e)$ of standard greedy. The additive loss of $\varepsilon$ is the price we pay for subsampling; it can be made arbitrarily small by increasing $s$.

Why Does It Still Work? Proof Intuition

The proof follows the same three-step structure as the greedy proof from Part 2, with a probabilistic layer on top.

Step 1: Bounding a single step. Recall that in the standard greedy proof, we showed that the element $e^*$ picked by greedy satisfies $f(e^* \mid S_j) \geq \delta_j / k$, where $\delta_j = f(S^*) - f(S_j)$ is the gap at step $j$. The argument relied on the fact that the $k$ elements of $S^*$ collectively account for at least $\delta_j$ in marginal gain, so at least one of them has gain $\geq \delta_j / k$, and greedy picks the best over all elements.

With Stochastic Greedy, we pick the best over a random sample $Q$ of size $s$. The probability that $Q$ contains none of the $k$ "good" elements from $S^*$ is at most:

$$\left(1 - \frac{k}{n}\right)^s \leq e^{-sk/n} = e^{-\ln(1/\varepsilon)} = \varepsilon$$

With probability $\geq 1 - \varepsilon$, the sample $Q$ contains at least one element with marginal gain $\geq \delta_j / k$. Stochastic Greedy picks the best in $Q$, so it does at least as well.

Step 2: Compounding. Just as before, the gap shrinks geometrically at each step. The per-step shrinkage factor is slightly worse, $(1 - (1 - \varepsilon)/k)$ instead of $(1 - 1/k)$, because we occasionally miss all good elements (with probability $\varepsilon$).

Step 3: The limit. After $k$ steps, the expected gap is bounded by:

$$\mathbb{E}[\delta_k] \leq \left(1 - \frac{1-\varepsilon}{k}\right)^k \cdot f(S^*) \leq \left(\frac{1}{e} + \varepsilon\right) \cdot f(S^*)$$

Rearranging: $\mathbb{E}[f(S_k)] \geq (1 - 1/e - \varepsilon) \cdot f(S^*)$.

The structure is identical to the deterministic proof. The only difference is the ε\varepsilon leakage at each step, which accumulates to an additive ε\varepsilon in the final bound. The proof is clean because submodularity gives the per-step bound, random sampling gives the “hit probability,” and the rest is the same geometric compounding from Part 2.
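The compounding inequality is also easy to check numerically. This quick sanity check is my own addition, not part of the original proof; it verifies that $(1 - (1-\varepsilon)/k)^k \leq 1/e + \varepsilon$ for a few values of $k$ and $\varepsilon$:

```python
import math

def gap_factor(k, eps):
    """Bound on the fraction of the gap remaining after k steps:
    (1 - (1 - eps)/k)^k, which converges to e^(-(1-eps)) as k grows."""
    return (1 - (1 - eps) / k) ** k

for k in (10, 100, 1000):
    for eps in (0.1, 0.01):
        assert gap_factor(k, eps) <= 1 / math.e + eps
        print(f"k={k:5d}, eps={eps}: gap factor ≈ {gap_factor(k, eps):.4f}")
```

Since $(1 - x/k)^k$ increases toward $e^{-x}$, the factor is always at most $e^{-(1-\varepsilon)} \leq 1/e + \varepsilon$, which is exactly the bound used in Step 3.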

Comparison Table

| Property | Greedy | Lazy Greedy | Stochastic Greedy |
| --- | --- | --- | --- |
| Evaluations per step | $n$ | $n$ (worst case) | $(n/k) \cdot \log(1/\varepsilon)$ |
| Total evaluations | $\mathcal{O}(n \cdot k)$ | $\mathcal{O}(n \cdot k)$ (worst case) | $\mathcal{O}(n \cdot \log(1/\varepsilon))$ |
| Approximation ratio | $1 - 1/e$ | $1 - 1/e$ | $1 - 1/e - \varepsilon$ (expected) |
| Deterministic | Yes | Yes | No |
| Practical speedup over Greedy | Baseline | 5x–100x (instance-dependent) | $\sim k / \log(1/\varepsilon)$ (predictable) |

The tradeoff is explicit: Stochastic Greedy trades a small, tunable approximation loss $\varepsilon$ for a predictable runtime improvement of factor $k / \log(1/\varepsilon)$. Lazy Greedy can sometimes match or beat this in practice, but offers no worst-case guarantee beyond standard greedy.

Practical Guidance

Small $n$ (under $\sim$10,000). Standard Greedy or Lazy Greedy is fine. The overhead of random sampling and the $\varepsilon$ degradation are not worth it when the full scan over $\mathcal{V}$ is already cheap.

Moderate $n$, small $k$. Lazy Greedy tends to perform well here. When $k$ is small, the priority queue rarely needs many re-evaluations per step, and the deterministic guarantee of $(1 - 1/e)$ is preferable to the expected $(1 - 1/e - \varepsilon)$.

Large $n$, moderate-to-large $k$. This is where Stochastic Greedy shines. The runtime $\mathcal{O}(n \cdot \log(1/\varepsilon))$ is independent of $k$, making it the only option that scales gracefully when both $n$ and $k$ are large.

Choice of $\varepsilon$. A common conservative choice is $\varepsilon = 1/(4n)$. In practice, $\varepsilon = 0.01$ or even $\varepsilon = 0.1$ works well; the theoretical bound is a worst case, and real-world instances tend to behave better. The corresponding $\log(1/\varepsilon)$ values are $\approx 4.6$ and $\approx 2.3$, respectively, which keeps the sample size small.
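To see concretely how the choice of $\varepsilon$ translates into a per-step sample size, here is a small helper (my own sketch; `sample_size` is an illustrative name) evaluated at the $\varepsilon$ values discussed above:

```python
import math

def sample_size(n, k, eps):
    """Per-step sample size s = floor((n/k)·ln(1/eps)), at least 1."""
    return max(1, math.floor(n / k * math.log(1 / eps)))

n, k = 1_000_000, 1_000
for eps in (0.1, 0.01, 1 / (4 * n)):
    print(f"eps={eps:.2e}: s={sample_size(n, k, eps)} "
          f"(vs a full scan of n={n} per step)")
```

Even the conservative $\varepsilon = 1/(4n)$ keeps the per-step scan around 15,000 elements here, versus a million for standard greedy.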

Key Takeaways

  • Standard Greedy requires $\mathcal{O}(n \cdot k)$ evaluations, which becomes prohibitive when both $n$ and $k$ are large.
  • Stochastic Greedy replaces the full scan at each step with a random subsample of size $s = \lfloor (n/k) \cdot \ln(1/\varepsilon) \rfloor$, reducing the total cost to $\mathcal{O}(n \cdot \log(1/\varepsilon))$.
  • The approximation guarantee degrades from $(1 - 1/e)$ to $(1 - 1/e - \varepsilon)$ in expectation, where $\varepsilon$ is a tunable parameter. For practical values of $\varepsilon$, the speedup factor is $\sim k / \log(1/\varepsilon)$, often 100x or more.
  • The algorithm is trivial to implement: the only change from standard greedy is replacing the scan over $\mathcal{V} \setminus S$ with a scan over a random sample $Q$.

Further Reading

  • B. Mirzasoleiman, A. Badanidiyuru, A. Karbasi, J. Vondrák, and A. Krause, “Lazier Than Lazy Greedy” (2015), the original Stochastic Greedy paper.
  • A. Badanidiyuru and J. Vondrák, “Fast Algorithms for Maximizing Submodular Functions” (2014), the Decreasing Threshold Greedy framework.
  • A. Krause and D. Golovin, “Submodular Function Maximization” (2014), a comprehensive survey covering greedy, continuous relaxation, and streaming methods.