
Stochastic Greedy: Scaling Submodular Maximization to Massive Datasets

Series parts
  1. Part 1: An Introduction to Submodularity
  2. Part 2: The Greedy Algorithm for Submodular Maximization
  3. Part 3: Stochastic Greedy: Scaling Submodular Maximization to Massive Datasets

In Part 2, I showed that the greedy algorithm achieves a $(1 - 1/e) \approx 0.632$ approximation ratio for maximizing a monotone submodular function under a cardinality constraint, the best any polynomial-time algorithm can guarantee. The algorithm is simple: at each of $k$ steps, scan all $n$ elements, pick the one with the largest marginal gain, and add it to the solution.

The cost is $\mathcal{O}(n \cdot k)$ evaluations of $f$. For modest problem sizes, this is perfectly adequate. The trouble starts when $n$ is large.

The Scalability Problem

Consider a concrete scenario: you are selecting $k = 1{,}000$ representative images from a corpus of $n = 10{,}000{,}000$ for a dataset summarization task. Each evaluation of $f$, which might involve computing pairwise similarities or coverage scores, takes around 1ms. The greedy algorithm needs $n \cdot k = 10^{10}$ evaluations. At 1ms each, that is $10^7$ seconds, or roughly 115 days.

Lazy Greedy (Part 2) can help in practice by skipping redundant recomputations, sometimes by a factor of 5x to 100x. But the worst-case bound remains $\mathcal{O}(n \cdot k)$, and on adversarial or poorly structured instances the priority queue offers no benefit. We need an algorithm whose theoretical runtime is better, not just one that is faster on benign inputs.

The question is whether we can replace the linear dependence on $k$ with something smaller, while preserving (or nearly preserving) the $(1 - 1/e)$ guarantee.

The Key Idea: Random Subsampling

The answer turns out to be straightforward. Instead of scanning all $n$ remaining elements at each step to find the best marginal gain, sample a small random subset $Q$ and pick the best element from $Q$.

The intuition is natural. The optimal solution $S^*$ contains $k$ elements scattered across the ground set $\mathcal{V}$. If you draw a random sample of size $s$ from the $n$ available elements, the probability that $Q$ contains at least one element from $S^*$ is at least $1 - (1 - k/n)^s$. Set $s = (n/k) \cdot \ln(1/\varepsilon)$ and this probability exceeds $1 - \varepsilon$. The sample does not need to be large; it just needs to be large enough to "hit" at least one good element with high probability.

Hit Probability Simulator

Below is a ground set of $n = 100$ elements, 5 of which are "good" (the optimal solution $S^*$). Each time you draw a sample, the simulator picks $s$ random elements and checks whether any overlap with $S^*$; a hit means the sample found at least one good element. The question: how large does $s$ need to be to hit reliably?

[Interactive simulator. The theoretical curve is $P(\text{hit}) = 1 - (1 - k/n)^s$: at $s = 20$ it sits near 64.2%, and with $\varepsilon = 0.01$, the formula gives $s = 92$, guaranteeing $P(\text{hit}) \geq 99\%$. Try a small $s$ (say 5) and notice how often you miss; by $s = 40$–$50$, hits become near-certain.]
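If you cannot run the simulator, the same experiment takes a few lines of Python. This is my own Monte Carlo sketch of the setup above (`hit_probability` is an illustrative helper, not from the post), using the simulator's $n = 100$, $k = 5$, $s = 20$:

```python
import random
from math import comb

def hit_probability(n, k, s, trials=100_000, seed=0):
    """Monte Carlo estimate of P(a random size-s subset of [n] contains
    at least one of the k 'good' elements)."""
    rng = random.Random(seed)
    good = set(range(k))  # by symmetry, which k elements are "good" is arbitrary
    hits = sum(
        not good.isdisjoint(rng.sample(range(n), s))
        for _ in range(trials)
    )
    return hits / trials

n, k, s = 100, 5, 20                       # the simulator's setup
bound = 1 - (1 - k / n) ** s               # the 1 - (1 - k/n)^s curve
exact = 1 - comb(n - k, s) / comb(n, s)    # exact hypergeometric probability
print(f"bound ≈ {bound:.3f}, exact ≈ {exact:.3f}, "
      f"simulated ≈ {hit_probability(n, k, s):.3f}")
```

Note that drawing a subset (sampling without replacement) does slightly better than the $1 - (1 - k/n)^s$ curve, which is a lower bound on the hit probability; the simulated rate therefore lands a bit above 64.2% at $s = 20$.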

If you have worked with stochastic gradient descent, the pattern is familiar. SGD replaces the full gradient computation (over all data points) with a noisy estimate computed on a mini-batch. The estimate is noisy on any single step, but the cumulative effect over many steps converges to the right answer, with controllable variance. Stochastic Greedy operates on the same principle: each step is noisier than full greedy, but the overall solution quality degrades only by a small, tunable factor $\varepsilon$.

The Stochastic Greedy Algorithm

Mirzasoleiman et al. (2015) introduced this algorithm under the evocative title "Lazier Than Lazy Greedy." The entire modification to the standard greedy algorithm fits in a single line: the for each over $\mathcal{V} \setminus S$ becomes a for each over a random sample $Q$:

$$
\boxed{
\begin{aligned}
& \textbf{StochasticGreedy}(f, \mathcal{V}, k, \varepsilon) \\
& \quad s \leftarrow \left\lfloor \frac{n}{k} \cdot \ln\!\left(\frac{1}{\varepsilon}\right) \right\rfloor \\
& \quad S_0 \leftarrow \varnothing \\
& \quad \textbf{for } j = 1 \textbf{ to } k\textbf{:} \\
& \quad \quad Q \leftarrow \text{random subset of } \mathcal{V} \setminus S_{j-1} \text{ of size } \min(s,\ |\mathcal{V} \setminus S_{j-1}|) \\
& \quad \quad e^* \leftarrow \arg\max_{e \in Q}\, f(e \mid S_{j-1}) \\
& \quad \quad S_j \leftarrow S_{j-1} \cup \{e^*\} \\
& \quad \textbf{return } S_k
\end{aligned}
}
$$

The sample size $s = \lfloor (n/k) \cdot \ln(1/\varepsilon) \rfloor$ is the only new parameter. It controls the tradeoff between speed and approximation quality:

  • Smaller $\varepsilon$ $\Rightarrow$ larger $s$ $\Rightarrow$ closer to full greedy $\Rightarrow$ better guarantee, slower.
  • Larger $\varepsilon$ $\Rightarrow$ smaller $s$ $\Rightarrow$ more aggressive subsampling $\Rightarrow$ faster, weaker guarantee.

The sample $Q$ is re-drawn independently at every step, which means Stochastic Greedy covers the entire ground set in expectation over $k$ iterations.
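The one-line nature of the change is easiest to see in code. Here is a minimal Python sketch of the algorithm; the maximum-coverage objective and the helper names (`stochastic_greedy`, `gain`, `covers`) are my own illustration, not from the paper:

```python
import math
import random

def stochastic_greedy(gain, ground, k, eps=0.01, seed=0):
    """Stochastic Greedy: per step, scan only a random subsample of size
    s = floor((n/k)·ln(1/eps)) instead of all of V.

    gain(e, S) should return the marginal gain f(S ∪ {e}) − f(S).
    """
    rng = random.Random(seed)
    n = len(ground)
    s = max(1, math.floor(n / k * math.log(1 / eps)))
    S, remaining = [], set(ground)
    for _ in range(k):
        # The only change vs. standard greedy: draw Q from V \ S and scan Q.
        Q = rng.sample(sorted(remaining), min(s, len(remaining)))
        best = max(Q, key=lambda e: gain(e, S))
        S.append(best)
        remaining.discard(best)
    return S

# Toy monotone submodular objective: maximum coverage. Element i covers
# items {i, ..., i+4}; f(S) is the number of distinct items covered.
covers = {i: set(range(i, i + 5)) for i in range(50)}

def gain(e, S):
    covered = set().union(*(covers[x] for x in S)) if S else set()
    return len(covers[e] - covered)

selection = stochastic_greedy(gain, range(50), k=5, eps=0.01)
print(selection)
```

Standard greedy would be the same loop with `Q = sorted(remaining)`; everything else is identical.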

Runtime Analysis

Evaluations per step: $s = \mathcal{O}((n/k) \cdot \log(1/\varepsilon))$ instead of $n$.

Total evaluations: $k \cdot s = k \cdot \mathcal{O}\!\left(\frac{n}{k} \cdot \log\frac{1}{\varepsilon}\right) = \mathcal{O}(n \cdot \log(1/\varepsilon))$.

The factor of $k$ in the greedy runtime has been replaced by $\log(1/\varepsilon)$. For any fixed $\varepsilon$, this is a constant, so the total cost is essentially linear in $n$.

A concrete comparison makes the difference tangible:

| Algorithm | $n$ | $k$ | $\varepsilon$ | Evaluations |
| --- | --- | --- | --- | --- |
| Greedy | $1{,}000{,}000$ | $1{,}000$ | - | $10^9$ |
| Stochastic Greedy | $1{,}000{,}000$ | $1{,}000$ | $0.01$ | $\approx 4.6 \times 10^6$ |

That is a ~217x speedup. At 1ms per evaluation, the wall-clock time drops from 11.6 days to 77 minutes.

The speedup factor is $k / \log(1/\varepsilon)$. As $k$ grows, the advantage of Stochastic Greedy becomes more pronounced. For $k = 10{,}000$ and $\varepsilon = 0.01$, the speedup is $\sim$2,174x.
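The numbers in the table fall straight out of the two formulas. A throwaway sanity check (function names are mine):

```python
import math

def greedy_evals(n, k):
    # Standard greedy: full scan of n elements at each of k steps.
    return n * k

def stochastic_evals(n, k, eps):
    # k steps, each scanning a sample of size (n/k)·ln(1/eps),
    # i.e. n·ln(1/eps) evaluations in total.
    return round(k * (n / k) * math.log(1 / eps))

n, k, eps = 1_000_000, 1_000, 0.01
print(greedy_evals(n, k))                                # 10^9
print(stochastic_evals(n, k, eps))                       # ≈ 4.6 × 10^6
print(greedy_evals(n, k) / stochastic_evals(n, k, eps))  # ≈ 217x speedup
```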

The Approximation Guarantee

Theorem (Mirzasoleiman et al. 2015). Let $f : 2^{\mathcal{V}} \rightarrow \mathbb{R}_+$ be a monotone submodular function, $k$ a positive integer, and $\varepsilon > 0$. The Stochastic Greedy algorithm returns a set $S_k$ satisfying:

$$\mathbb{E}[f(S_k)] \geq \left(1 - \frac{1}{e} - \varepsilon\right) \cdot f(S^*)$$

The guarantee is $(1 - 1/e - \varepsilon)$ in expectation, compared to the deterministic $(1 - 1/e)$ of standard greedy. The additive loss of $\varepsilon$ is the price we pay for subsampling; it can be made arbitrarily small by increasing $s$.

Why Does It Still Work? Proof Intuition

The proof follows the same three-step structure as the greedy proof from Part 2, with a probabilistic layer on top.

Step 1: Bounding a single step. Recall that in the standard greedy proof, we showed that the element $e^*$ picked by greedy satisfies $f(e^* \mid S_j) \geq \delta_j / k$, where $\delta_j = f(S^*) - f(S_j)$ is the gap at step $j$. The argument relied on the fact that the $k$ elements of $S^*$ collectively account for at least $\delta_j$ in marginal gain, so at least one of them has gain $\geq \delta_j / k$, and greedy picks the best over all elements.

With Stochastic Greedy, we pick the best over a random sample $Q$ of size $s$. The probability that $Q$ contains none of the $k$ "good" elements from $S^*$ is at most:

$$\left(1 - \frac{k}{n}\right)^s \leq e^{-sk/n} = e^{-\ln(1/\varepsilon)} = \varepsilon$$

With probability $\geq 1 - \varepsilon$, the sample $Q$ contains at least one element with marginal gain $\geq \delta_j / k$. Stochastic Greedy picks the best in $Q$, so it does at least as well.

Step 2: Compounding. Just as before, the gap shrinks geometrically at each step. The per-step shrinkage factor is slightly worse, $(1 - (1 - \varepsilon)/k)$ instead of $(1 - 1/k)$, because we occasionally miss all good elements (with probability $\varepsilon$).

Step 3: The limit. After $k$ steps, the expected gap is bounded by:

$$\mathbb{E}[\delta_k] \leq \left(1 - \frac{1-\varepsilon}{k}\right)^k \cdot f(S^*) \leq \left(\frac{1}{e} + \varepsilon\right) \cdot f(S^*)$$

Rearranging: $\mathbb{E}[f(S_k)] \geq (1 - 1/e - \varepsilon) \cdot f(S^*)$.

The structure is identical to the deterministic proof. The only difference is the ε\varepsilon leakage at each step, which accumulates to an additive ε\varepsilon in the final bound. The proof is clean because submodularity gives the per-step bound, random sampling gives the “hit probability,” and the rest is the same geometric compounding from Part 2.
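The compounding inequality is also easy to check numerically. This quick sanity check is my own addition, not part of the original proof; it verifies that $(1 - (1-\varepsilon)/k)^k \leq 1/e + \varepsilon$ for a few values of $k$ and $\varepsilon$:

```python
import math

def gap_factor(k, eps):
    """Bound on the fraction of the gap remaining after k steps:
    (1 - (1 - eps)/k)^k, which converges to e^(-(1-eps)) as k grows."""
    return (1 - (1 - eps) / k) ** k

for k in (10, 100, 1000):
    for eps in (0.1, 0.01):
        assert gap_factor(k, eps) <= 1 / math.e + eps
        print(f"k={k:5d}, eps={eps}: gap factor ≈ {gap_factor(k, eps):.4f}")
```

Since $(1 - x/k)^k$ increases toward $e^{-x}$, the factor is always at most $e^{-(1-\varepsilon)} \leq 1/e + \varepsilon$, which is exactly the bound used in Step 3.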

Comparison Table

| Property | Greedy | Lazy Greedy | Stochastic Greedy |
| --- | --- | --- | --- |
| Evaluations per step | $n$ | $n$ (worst case) | $(n/k) \cdot \log(1/\varepsilon)$ |
| Total evaluations | $\mathcal{O}(n \cdot k)$ | $\mathcal{O}(n \cdot k)$ (worst case) | $\mathcal{O}(n \cdot \log(1/\varepsilon))$ |
| Approximation ratio | $1 - 1/e$ | $1 - 1/e$ | $1 - 1/e - \varepsilon$ (expected) |
| Deterministic | Yes | Yes | No |
| Practical speedup over Greedy | Baseline | 5x–100x (instance-dependent) | $\sim k / \log(1/\varepsilon)$ (predictable) |

The tradeoff is explicit: Stochastic Greedy trades a small, tunable approximation loss $\varepsilon$ for a predictable runtime improvement of factor $k / \log(1/\varepsilon)$. Lazy Greedy can sometimes match or beat this in practice, but offers no worst-case guarantee beyond standard greedy.

Practical Guidance

Small $n$ (under $\sim$10,000). Standard Greedy or Lazy Greedy is fine. The overhead of random sampling and the $\varepsilon$ degradation are not worth it when the full scan over $\mathcal{V}$ is already cheap.

Moderate $n$, small $k$. Lazy Greedy tends to perform well here. When $k$ is small, the priority queue rarely needs many re-evaluations per step, and the deterministic guarantee of $(1 - 1/e)$ is preferable to the expected $(1 - 1/e - \varepsilon)$.

Large $n$, moderate-to-large $k$. This is where Stochastic Greedy shines. The runtime $\mathcal{O}(n \cdot \log(1/\varepsilon))$ is independent of $k$, making it the only option that scales gracefully when both $n$ and $k$ are large.

Choice of $\varepsilon$. A common conservative choice is $\varepsilon = 1/(4n)$. In practice, $\varepsilon = 0.01$ or even $\varepsilon = 0.1$ works well; the theoretical bound is a worst case, and real-world instances tend to behave better. The corresponding $\log(1/\varepsilon)$ values are $\approx 4.6$ and $\approx 2.3$, respectively, which keeps the sample size small.
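To see concretely how the choice of $\varepsilon$ translates into a per-step sample size, here is a small helper (my own sketch; `sample_size` is an illustrative name) evaluated at the $\varepsilon$ values discussed above:

```python
import math

def sample_size(n, k, eps):
    """Per-step sample size s = floor((n/k)·ln(1/eps)), at least 1."""
    return max(1, math.floor(n / k * math.log(1 / eps)))

n, k = 1_000_000, 1_000
for eps in (0.1, 0.01, 1 / (4 * n)):
    print(f"eps={eps:.2e}: s={sample_size(n, k, eps)} "
          f"(vs a full scan of n={n} per step)")
```

Even the conservative $\varepsilon = 1/(4n)$ keeps the per-step scan around 15,000 elements here, versus a million for standard greedy.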

Key Takeaways

  • Standard Greedy requires $\mathcal{O}(n \cdot k)$ evaluations, which becomes prohibitive when both $n$ and $k$ are large.
  • Stochastic Greedy replaces the full scan at each step with a random subsample of size $s = \lfloor (n/k) \cdot \ln(1/\varepsilon) \rfloor$, reducing the total cost to $\mathcal{O}(n \cdot \log(1/\varepsilon))$.
  • The approximation guarantee degrades from $(1 - 1/e)$ to $(1 - 1/e - \varepsilon)$ in expectation, where $\varepsilon$ is a tunable parameter. For practical values of $\varepsilon$, the speedup factor is $\sim k / \log(1/\varepsilon)$, often 100x or more.
  • The algorithm is trivial to implement: the only change from standard greedy is replacing the scan over $\mathcal{V} \setminus S$ with a scan over a random sample $Q$.

Further Reading

  • B. Mirzasoleiman, A. Badanidiyuru, A. Karbasi, J. Vondrák, and A. Krause, “Lazier Than Lazy Greedy” (2015), the original Stochastic Greedy paper.
  • A. Badanidiyuru and J. Vondrák, “Fast Algorithms for Maximizing Submodular Functions” (2014), the Decreasing Threshold Greedy framework.
  • A. Krause and D. Golovin, “Submodular Function Maximization” (2014), a comprehensive survey covering greedy, continuous relaxation, and streaming methods.