## Friday, August 12, 2011

### Information Theory of Compressed Sensing, Sparse Vectors

The papers leading up to this point are the papers in the compressed sensing literature about sparse recovery (recovering sparse vectors of unknown support with few linear measurements relative to the ambient dimension), and most recently, a paper about quantization of compressed sensing measurements1. Here is the specific setup:

High dimensional space $\mathbb{R}^N$
Class of $k$-sparse signals $\Sigma_k = \{ x\in \mathbb{R}^N: |\mbox{supp}(x)| = k \}$
Measurement matrix $\Phi \in \mathbb{R}^{m\times N}$, $m \ll N$, matrix satisfies some special property.
Compression:  Given $x \in \Sigma_k$,  compute $y = \Phi x$, store $y$.
Recovery:  Solve $\min_z \|z\|_1$ s.t. $\Phi z = y$.

The recent theoretical results allow successful recovery with $m \sim C_1 k \log(N/k)$ for some constant $C_1$. Furthermore, if the measurements are perturbed by some small vector $e$ with $\|e\|_2 \leq \epsilon$, then using a modified recovery algorithm: $\min_z \|z\|_1$ s.t. $\| \Phi z - y \|_2 \leq \epsilon$, the recovered vector $x_\ast$ satisfies  $\|x - x_\ast\|_2 \leq C_2 \epsilon / \sqrt{m}$.

If we apply this result to quantization error, rounding each measurement $y_i$ to the nearest point in an evenly spaced grid $\delta \mathbb{Z}$, then we note $\|e\|_2 \leq \sqrt{m} \delta$ so that the recovered vector will be off by $\leq C_2 \delta$. Note there is no dependence on $m$. By rounding each measurement independently, we have not taken advantage of the fact that the measurements are correlated (a union of $k$-dimensional subspaces embedded into $m > k$ dimensions). The higher the number of measurements, the more correlated the measurements are.

The main result of the quantization paper above is that if we use $\Sigma\Delta$ modulation as the quantization scheme, then we can achieve a better error bound which depends on $m$:

Theorem B. Let $\Phi$ be an $m\times N$ matrix whose entries are i.i.d. according to $\mathcal{N}(0,1)$. Suppose $\alpha \in (0,1)$ and $\lambda := m/k \geq c(\log N)^{1/(1-\alpha)}$ where $c = c(r,\alpha)$. Then there are two constants $c'$ and $C$ that depend only on $r$ such that with probability at least $1-\exp(-c' m \lambda^{-\alpha})$ on the draw of $\Phi$, the following holds: For every $x\in \Sigma_k$ such that $\min_{j \in \rm{supp}(x)} |x_j| \geq C\delta$, the reconstruction satisfies
$\|x - \tilde{x}\|_2 \lesssim \delta \lambda^{-\alpha (r-1/2)}$

Here $r$ is the order of $\Sigma\Delta$ used, where we note that for this particular result, for larger $r$ we need more bits in the quantization to keep the scheme stable. This result is also just for random Gaussian measurements. We will ignore the size condition $|x_j| \geq C \delta$ for now, but the paper mentions that this technical condition is also practical ($\delta$ is smaller than the max uncertainty from rounding, may as well assume elements with magnitude $\leq C \delta$ are $0$)

For the rest, assume also that $|x_j| \leq 1$, so that we are studying the compression of $X = \Sigma_k \cap B_1(l^\infty(\mathbb{R}^N))$. To summarize the information-theoretic results of the paper, for an $r$-th order scheme, the number of bits needed in the quantizer is $\approx \log_2\left(\frac{5}{C\delta} \ 2^{r+1/2} \lambda^{(1-\alpha)/2} k\right)$, and the corresponding distortion is $D(r,\lambda) \lesssim \frac{\lambda^{-\alpha(r-1/2)} \delta}{2^{r+1/2}}$. Given a bit budget $B$, let $b(r,\lambda)$ be number of bits of the quantizer. We need to minimize the distortion $D(r,\lambda)$ such that $k\lambda b(r,\lambda) \leq B$. Then this will give us a rate-distortion curve $D(B)$ using compressed sensing and sigma delta quantization. (To be continued... Edit: There's something missing here and I can't seem to figure out what I need to make all the parameters and constants make sense...)

To compare to a sillier encoding scheme, we can simply store ordered pairs $(x_i, i)$, where we use $b$ bits to store the coefficient and $\log_2 N$ bits to store the index. In total this would use $bk \log(N)$ bits. Setting this to the bit budget $B$, we have $b = B / (k \log(N))$. Then to compute the distortion, again we assume $|x_i| \leq 1$, and if we use spacing $\delta$ for quantization, then we need $b = \log_2(2/\delta)$. Then $\delta = 2^{(b-1)}$. Each entry would pick up an error of at most $\delta/2 = 2^{-b}$, and then the overall distortion in the $2$-norm would be $\sqrt{k}\delta/2 = \sqrt{k} 2^{-B/(k \log(N))} =: D(B)$.
Or if we look at the inverse function,
$B(D) = k\ \log_2(\sqrt{k} / D) \log(N)$

Kolmogorov Entropy

We can also compute the theoretical limits via Kolmogorov entropy, or logarithm of the minimum number of balls (with respect to a fixed norm, $l^2$ say) of radius $D$ needed to cover the set $X$. In this setup, a cover corresponds to an encoding scheme, which is simply to map a given element to the nearest ball center. Then the number of bits of the scheme is the logarithm of the size of the cover.

The minimum cover size is the same as the maximum number of points mutually separated by $D$, since a cover must necessarily cover each point, and no $D$-ball can cover two such $D$-separated points. A maximal set of $D$-separated points can be used to form a cover by using the points as the centers of the $D$-balls, and it is necessarily a cover by maximality of the separated set.

Consider again $X=\Sigma_k \cap B_1(l^\infty(\mathbb{R}^N))$. This is a $k$-dimensional subset, and effectively the ($k$-dimensional) volume of $X$ is $\binom{N}{k} 2^k$. Now consider a maximal $\epsilon$-separated set. If we consider (disjoint) $l^2(\mathbb{R}^N)$ balls of radius $D/2$ around each point in the set, and intersect with $X$, we have a subset of $X$ with effective volume bounded below by
${\rm \#\ points} \times {\rm vol} (B_{D/2}(l^2(\mathbb{R}^k)))$ so that
${\rm \#\ points} \leq \frac{1}{c_k} (2/D)^k \binom{N}{k} 2^k \sim \frac{1}{c_k \sqrt{k}} (4\lambda / D)^k$
Here $c_k = \pi^{k/2} / \Gamma(k/2+1) \sim (2\pi e/k)^{k/2} / \sqrt{\pi k}$
In other words, the optimal bitrate given distortion has the bound
$B(D) \lesssim k \log(c\sqrt{k} \lambda/D)$

(What's hidden: This bound is probably not that great because intersecting the ball with $X$ may have many $k$-dimensional pieces, especially around the origin where all $\binom{N}{k}$ hyperplanes pass through)

Another upper bound can be obtained by multiplying the number of optimal $\epsilon$-separated points in $B_1(l^\infty(\mathbb{R}^k)$ by $\binom{N}{k}$. This gives
$B(D) \lesssim k \log(c\lambda / D)$
(generic result for bounded sets in finite dimensions can be found in paper by Kolmogorov-Tihomirov, $\epsilon$-entropy and $\epsilon$-capacity of sets in function spaces)

Edit: The difference of $\sqrt{k}$ seems to be a difference in normalization... actually the $\sqrt{k}$ inside the logarithm does not affect the asymptotics with respect to $1/D$, as $D$ decreases to $0$, so maybe something is being ignored in the latter estimate.