Blurring the Line Between Deep Learning and Physical Models

Quarks to Cosmos with AI - CMU 2021


François Lanusse

slides at eiffl.github.io/talks/QtC2021

the $\Lambda$CDM view of the Universe

the Rubin Observatory Legacy Survey of Space and Time

  • 1000 images each night, 15 TB/night for 10 years

  • 18,000 square degrees, observed once every few days

  • Tens of billions of objects, each one observed $\sim1000$ times

Previous generation survey: SDSS

Image credit: Peter Melchior

Current generation survey: DES

Image credit: Peter Melchior

LSST precursor survey: HSC

Image credit: Peter Melchior

Can AI solve all of our problems?

Credit: NAOJ

A perfect task for Deep Learning


CMUDeepLens Architecture
(Lanusse et al. 2017)
Huang et al. (2019) - arXiv:1906.00970

  • This works great, as long as you don't care too much about which lenses are identified and why
Branched GAN model for deblending (Reiman & Göhre, 2018)
The issue with using deep learning as a black box
  • No explicit control of noise, PSF, number of sources.
    • Model would have to be retrained for all observing configurations

  • No guarantees on the network output (e.g. flux preservation, artifacts)

  • No proper uncertainty quantification.

Focus of this talk

  • Deep Learning for astronomical data reduction

  • Bayesian Neural Networks

  • Physical Bayesian inference


This talk
Generic approach to uncertainty quantification and interpretability:
  • Physical Forward Models
  • Deep Generative Models
  • Bayesian Inference

Deep Generative Models as Data-Driven Bayesian Priors

Gravitational lensing

Galaxy shapes as estimators for gravitational shear
$$ e = \gamma + e_i \qquad \mbox{ with } \qquad e_i \sim \mathcal{N}(0, I)$$
  • We are trying to measure the ellipticity $e$ of galaxies as an estimator of the gravitational shear $\gamma$

Weak Lensing Mass-Mapping as an Inverse Problem

Shear $\gamma$
Convergence $\kappa$
$$\gamma_1 = \frac{1}{2} (\partial_1^2 - \partial_2^2) \ \Psi \quad;\quad \gamma_2 = \partial_1 \partial_2 \ \Psi \quad;\quad \kappa = \frac{1}{2} (\partial_1^2 + \partial_2^2) \ \Psi$$
$$\boxed{\gamma = \mathbf{P} \kappa}$$

Linear inverse problems

$\boxed{y = \mathbf{A}x + n}$

$\mathbf{A}$ is known and encodes our physical understanding of the problem.
$\Longrightarrow$ When non-invertible or ill-conditioned, the inverse problem is ill-posed with no unique solution $x$
Deconvolution
Inpainting
Denoising

What Would a Bayesian Do?

$\boxed{y = \mathbf{A}x + n}$

The Bayesian view of the problem:
$$ p(x | y) \propto p(y | x) \ p(x) $$
  • $p(y | x)$ is the data likelihood, which contains the physics

  • $p(x)$ is the prior knowledge on the solution.


    With these concepts in hand we can:
  • Estimate, for instance, the Maximum A Posteriori solution:
    $$\hat{x} = \arg\max\limits_x \ \log p(y \ | \ x) + \log p(x)$$
  • Sample from the full posterior $p(x|y)$ with MCMC or Variational Inference methods (see the sketch below).
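
As a toy illustration (not from the original slides), here is a minimal NumPy sketch of MAP estimation for $y = \mathbf{A}x + n$ with Gaussian noise and a Gaussian prior, by gradient ascent on the log posterior; the operator, noise level, and prior scale below are arbitrary placeholders.

import numpy as np

# Toy setup: a known linear operator A, Gaussian noise, and a Gaussian prior on x.
rng = np.random.default_rng(0)
n_dim = 16
A = rng.normal(size=(n_dim, n_dim)) / np.sqrt(n_dim)
x_true = rng.normal(size=n_dim)
sigma_n, sigma_x = 0.1, 1.0                     # assumed noise and prior scales
y = A @ x_true + sigma_n * rng.normal(size=n_dim)

def grad_log_posterior(x):
    # d/dx [ -||y - A x||^2 / (2 sigma_n^2) - ||x||^2 / (2 sigma_x^2) ]
    return A.T @ (y - A @ x) / sigma_n**2 - x / sigma_x**2

x_map = np.zeros(n_dim)
for _ in range(1000):                           # plain gradient ascent on log p(x|y)
    x_map += 1e-3 * grad_log_posterior(x_map)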

How do you choose the prior?

Classical examples of signal priors

Sparse $$ \log p(x) = - \parallel \mathbf{W} x \parallel_1 \ + \ \mathrm{cst} $$
Gaussian $$ \log p(x) = - \frac{1}{2} x^t \mathbf{\Sigma}^{-1} x \ + \ \mathrm{cst} $$
Total Variation $$ \log p(x) = - \parallel \nabla x \parallel_1 \ + \ \mathrm{cst} $$

But what about this?




$\Longrightarrow$ Prior in the form of numerical simulations.

Can we use Deep Learning to embed simulation-driven priors within a physical Bayesian model?

What is generative modeling?


  • The goal of generative modeling is to learn the distribution $\mathbb{P}$ from which the training set $X = \{x_0, x_1, \ldots, x_n \}$ is drawn.

  • Usually, this means building a parametric model $\mathbb{P}_\theta$ that tries to be close to $\mathbb{P}$.


True $\mathbb{P}$

Samples $x_i \sim \mathbb{P}$

Model $\mathbb{P}_\theta$


  • Once trained, you can typically sample from $\mathbb{P}_\theta$ and/or evaluate the likelihood $p_\theta(x)$.

Do you know this person?

Probably not, this is a randomly generated person: thispersondoesntexist.com

Getting started with Deep Priors: deep denoising example

$$ \boxed{{\color{Orchid} y} = {\color{SkyBlue} x} + n} $$
  • Let us assume we have access to examples of $ {\color{SkyBlue} x}$ without noise.

  • We learn the distribution of noiseless data $\log p_\theta(x)$ from samples using a deep generative model.

  • The solution should lie on the realistic data manifold, symbolized by the two-moons distribution.

    We want to solve for the Maximum A Posteriori solution:

    $$\arg \max\limits_{\color{SkyBlue} x} \ - \frac{1}{2} \parallel {\color{Orchid} y} - {\color{SkyBlue} x} \parallel_2^2 + \log p_\theta({\color{SkyBlue} x})$$ This can be done by gradient descent as long as one has access to the score function $\frac{d \log p(x)}{dx}$.

The score is all you need!


  • Whether you are looking for the MAP or sampling with HMC or MALA, you only need access to the score of the posterior: $$\frac{d \log p(x | y)}{dx}$$
    • Gradient descent: $x_{t+1} = x_t + \tau \nabla_x \log p(x_t | y) $
    • Langevin algorithm: $x_{t+1} = x_t + \tau \nabla_x \log p(x_t | y) + \sqrt{2\tau} n_t$



  • The score of the full posterior is simply: $$\nabla_x \log p(x |y) = \underbrace{\nabla_x \log p(y |x)}_{\mbox{known}} \quad + \quad \underbrace{\nabla_x \log p(x)}_{\mbox{can be learned}}$$ $\Longrightarrow$ all we have to do is model/learn the score of the prior.
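
To make this decomposition concrete, here is a minimal NumPy sketch (an assumption for illustration, not the talk's implementation): the posterior score and one Langevin update for a linear problem $y = \mathbf{A}x + n$ with Gaussian noise, given a learned prior score function.

import numpy as np

def score_posterior(x, A, y, sigma_n, score_prior):
    # known physics: gradient of the Gaussian data likelihood, d log p(y|x) / dx
    score_likelihood = A.T @ (y - A @ x) / sigma_n**2
    # learned component: d log p(x) / dx, provided by a generative model
    return score_likelihood + score_prior(x)

def langevin_step(x, A, y, sigma_n, score_prior, tau=1e-3, rng=np.random.default_rng()):
    # x_{t+1} = x_t + tau * grad log p(x_t | y) + sqrt(2 tau) n_t
    noise = rng.normal(size=x.shape)
    return x + tau * score_posterior(x, A, y, sigma_n, score_prior) + np.sqrt(2 * tau) * noise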

Neural Score Estimation by Denoising Score Matching

  • Denoising Score Matching: An optimal Gaussian denoiser learns the score of a given distribution.
    • If $x \sim \mathbb{P}$ is corrupted by additive Gaussian noise $u \sim \mathcal{N}(0, \sigma^2 \mathbf{I})$ to yield $$x^\prime = x + u$$
    • Let's consider a denoiser $r_\theta$ trained under an $\ell_2$ loss: $$\mathcal{L}=\parallel x - r_\theta(x^\prime, \sigma) \parallel_2^2$$
    • The optimal denoiser $r_{\theta^\star}$ verifies: $$\boxed{\boldsymbol{r}_{\theta^\star}(\boldsymbol{x}', \sigma) = \boldsymbol{x}' + \sigma^2 \nabla_{\boldsymbol{x}} \log p_{\sigma^2}(\boldsymbol{x}')}$$
Panels: noisy input $\boldsymbol{x}'$, clean image $\boldsymbol{x}$, residual $\boldsymbol{x}'- \boldsymbol{r}^\star(\boldsymbol{x}', \sigma)$, and denoised output $\boldsymbol{r}^\star(\boldsymbol{x}', \sigma)$.

Training a Neural Score Estimator in practice




A standard UNet
  • We use a very standard residual UNet, and we adopt a residual score matching loss: $$ \mathcal{L}_{DSM} = \underset{\boldsymbol{x} \sim P}{\mathbb{E}} \underset{\begin{subarray}{c} \boldsymbol{u} \sim \mathcal{N}(0, I) \\ \sigma_s \sim \mathcal{N}(0, s^2) \end{subarray}}{\mathbb{E}} \parallel \boldsymbol{u} + \sigma_s \boldsymbol{r}_{\theta}(\boldsymbol{x} + \sigma_s \boldsymbol{u}, \sigma_s) \parallel_2^2$$ $\Longrightarrow$ direct estimator of the score $\nabla \log p_\sigma(x)$ (a minimal training-step sketch is given below)

  • Lipschitz regularization to improve robustness:

    Without regularization
    With regularization
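
A minimal TensorFlow sketch of the residual score matching loss above; `score_net` is a stand-in for the residual UNet $r_\theta(x', \sigma)$ and its call signature is an assumption, with noise levels drawn half-normal for simplicity.

import tensorflow as tf

def dsm_loss(score_net, x_batch, s=1.0):
    # one noise level per image (half-normal of scale s, a simplifying choice)
    sigma = tf.abs(tf.random.normal([tf.shape(x_batch)[0], 1, 1, 1], stddev=s))
    u = tf.random.normal(tf.shape(x_batch))      # unit Gaussian noise
    x_noisy = x_batch + sigma * u                # x' = x + sigma * u
    r = score_net(x_noisy, sigma)                # network estimate of grad log p_sigma(x')
    # || u + sigma * r_theta(x + sigma u, sigma) ||_2^2, averaged over the batch
    return tf.reduce_mean(tf.reduce_sum((u + sigma * r)**2, axis=[1, 2, 3]))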

Efficient sampling by Annealed HMC

  • Even with gradients, sampling in a large number of dimensions is difficult!
    $\Longrightarrow$ Use a parallel annealing strategy to effectively sample from the full distribution.

  • We use the fact that our score network $\mathbf{r}_\theta(x, \sigma)$ learns the score of a noise-convolved distribution, $\nabla \log p_\sigma$:
    $$\sigma_1 > \sigma_2 > \sigma_3 > \sigma_4 $$
  • Run many HMC chains in parallel, progressively annealing $\sigma$ to 0, and keep the last point of each chain as an independent sample (see the sketch below).

Example of one chain during annealing

Annealed Langevin samples from DSM model in Song & Ermon (2020)
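
A minimal sketch of the annealing loop, with Langevin updates written out instead of full HMC for brevity; `score_posterior_sigma(x, sigma)` is an assumed callable returning $\nabla_x \log p_\sigma(x|y)$ built from the noise-conditional score network.

import numpy as np

def annealed_sampling(score_posterior_sigma, x0, sigmas=(1.0, 0.5, 0.1, 0.01),
                      n_steps=100, tau=1e-3, rng=np.random.default_rng()):
    x = x0
    for sigma in sigmas:                         # sigma_1 > sigma_2 > ... annealed towards 0
        for _ in range(n_steps):
            x = x + tau * score_posterior_sigma(x, sigma) \
                  + np.sqrt(2 * tau) * rng.normal(size=x.shape)
    return x                                     # keep the last point of the chain as the sample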

Back to the Mass-Mapping Problem

$$ \log p( \kappa | e) = \underbrace{\log p(e | \kappa)}_{\simeq -\frac{1}{2} \parallel e - P \kappa \parallel_\Sigma^2} + \log p(\kappa) +cst $$
  • The likelihood term is known analytically.
  • There is no closed-form expression for the full non-Gaussian prior of the convergence.
    However:
    • We do have access to samples from the full prior through simulations: $X = \{x_0, x_1, \ldots, x_n \}$ with $x_i \sim \mathbb{P}$
$\Longrightarrow$ Our strategy: Learn the prior score $\nabla \log p(\kappa)$ with Denoising Score Matching, and then sample the full posterior with annealed HMC.

Illustration on MassiveNuS simulations

Probabilistic Mass-Mapping of the HST COSMOS field


  • COSMOS shear data from Schrabback et al. 2010

  • Prior learned from MassiveNuS at fiducial cosmology (320x320 maps at 0.4 arcsec resolution).

  • Known massive X-ray clusters are indicated with crosses, along with their redshifts; the right panel shows cutouts of the central cluster from multiple posterior samples.


Uncertainty quantification in Magnetic Resonance Imaging (MRI)

Ramzi, Remy, Lanusse et al. 2020


$$\boxed{y = \mathbf{M} \mathbf{F} x + n}$$



$\Longrightarrow$ We can see which parts of the image are well constrained by data, and which regions are uncertain.

Example of application to deblending


  • Hybrid Physical-Deep Learning Model for Astronomical Inverse Problems
    F. Lanusse, P. Melchior, F. Moolekamp


$\mathcal{L} = \frac{1}{2} \parallel \mathbf{\Sigma}^{-1/2} (\ Y - P \ast A S \ ) \parallel_2^2 - \sum_{i=1}^K \log p_{\theta}(S_i)$



isolated galaxy $\log p_\theta(x) = 3293.7$
artificial blend $\log p_\theta(x) = 3100.5 $

What if I do not have access to perfect data/simulations to train my prior?

Complications specific to astronomical images: spot the differences!


CelebA

HSC PDR-2 wide

  • There is noise
  • We have a Point Spread Function

A Physicist's approach: let's build a model

Lanusse et al. (2020)
$\longrightarrow \ g_\theta \ \longrightarrow$ PSF $\longrightarrow$ Pixelation $\longrightarrow$ Noise
Probabilistic model
$$ x \sim ? $$
$$ x \sim \mathcal{N}(z, \Sigma) \quad z \sim ? $$
latent $z$ is a denoised galaxy image
$$ x \sim \mathcal{N}( \mathbf{P} z, \Sigma) \quad z \sim ?$$
latent $z$ is a super-resolved and denoised galaxy image
$$ x \sim \mathcal{N}( \mathbf{P} (\Pi \ast z), \Sigma) \quad z \sim ? $$
latent $z$ is a deconvolved, super-resolved, and denoised galaxy image
$$ x \sim \mathcal{N}( \mathbf{P} (\Pi \ast g_\theta(z)), \Sigma) \quad z \sim \mathcal{N}(0, \mathbf{I}) $$
latent $z$ is a Gaussian sample
$\theta$ are parameters of the model



$\Longrightarrow$ Decouples the morphology model from the observing conditions.
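
A minimal TensorFlow sketch of evaluating the likelihood of this forward model; `decoder` ($g_\theta$), `psf`, the pixelation factor, and the noise level are hypothetical placeholders, and average pooling stands in for the pixelation operator $\mathbf{P}$.

import tensorflow as tf
import tensorflow_probability as tfp

def log_likelihood(x_obs, z, decoder, psf, factor=2, sigma_noise=0.01):
    light = decoder(z)                                      # g_theta(z): deconvolved, super-resolved image
    convolved = tf.nn.conv2d(light, psf,                    # Pi * g_theta(z): convolution by the PSF
                             strides=1, padding="SAME")
    pixelated = tf.nn.avg_pool2d(convolved, ksize=factor,   # P: downsampling to the instrument pixel grid
                                 strides=factor, padding="SAME")
    dist = tfp.distributions.Normal(loc=pixelated, scale=sigma_noise)
    return tf.reduce_sum(dist.log_prob(x_obs), axis=[1, 2, 3])  # log N(x | P(Pi * g_theta(z)), sigma^2 I)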

Flow-VAE samples



Going one step further: generative models as data-driven priors


$\boxed{y = \mathbf{A}x + n}$

$\mathbf{A}$ is known and encodes our physical understanding of the problem.

$y$ (simulated Rubin Observatory image) $\ = \ \Pi \ \ast \ x$ (Hubble Space Telescope image) $\ + \ n$

Bayesian Inference a.k.a. Uncertainty Quantification

The Bayesian view of the problem: $$ p(z | x ) \propto p_\theta(x | z, \Sigma, \mathbf{\Pi}) p(z)$$ where:
  • $p( z | x )$ is the posterior
  • $p( x | z )$ is the data likelihood, contains the physics
  • $p( z )$ is the prior
Panels: data $x_n$; truth $x_0$; posterior samples $g_\theta(z)$ and $\mathbf{P} (\Pi \ast g_\theta(z))$; posterior median; data residuals $x_n - \mathbf{P} (\Pi \ast g_\theta(z))$; standard deviation.
$\Longrightarrow$ Uncertainties are fully captured by the posterior.

takeaways



Benefits of Bayesian forward modeling
Using a Bayesian approach has great advantages: model-based physical interpretation & uncertainty quantification.

  • Explicit likelihood, which uses all of our physical knowledge.
    $\Longrightarrow$ The method can be applied for varying PSF, noise, or even different instruments!

  • Deep generative models can be used to provide data-driven priors.
    $\Longrightarrow$ Embeds priors that are only accessible through samples (e.g. numerical simulations); these can be embedded in hierarchical models.

  • Neural Score Estimation is a scalable approach to learn a prior score.

  • We implemented the "ultimate mass-mapping method"
    $\Longrightarrow$ Recovered the best convergence map of the COSMOS field to date.



Automatically Differentiable Physics

traditional cosmological inference

HSC cosmic shear power spectrum
HSC Y1 constraints on $(S_8, \Omega_m)$
(Hikage et al. 2018)
  • Measure the ellipticity $\epsilon = \epsilon_i + \gamma$ of all galaxies
    $\Longrightarrow$ Noisy tracer of the weak lensing shear $\gamma$

  • Compute summary statistics based on 2pt functions,
    e.g. the power spectrum

  • Run an MCMC to recover a posterior on model parameters, using an analytic likelihood $$ p(\theta | x ) \propto \underbrace{p(x | \theta)}_{\mathrm{likelihood}} \ \underbrace{p(\theta)}_{\mathrm{prior}}$$
Main limitation: the need for an explicit likelihood
We can only compute the likelihood for simple summary statistics and on large scales

$\Longrightarrow$ We are dismissing most of the information!

A different road: forward modeling

  • Instead of trying to analytically evaluate the likelihood, let us build a forward model of the observables.

  • Each component of the model is now tractable, but at the cost of a large number of latent variables.

$\Longrightarrow$ How do we perform efficient inference in this large number of dimensions?


    A non-exhaustive list of methods:
  • Hamiltonian Monte-Carlo
  • Variational Inference
  • MAP+Laplace
  • Gold Mining
  • Dimensionality reduction by Fisher-Information Maximization


What do they all have in common?
$\Longrightarrow$ They require fast, accurate, differentiable forward simulations
(Schneider et al. 2015)

Forward Models in Cosmology

Linear Field $\longrightarrow$ (N-body simulations) Final Dark Matter $\longrightarrow$ (group finding algorithms) Dark Matter Halos $\longrightarrow$ (semi-analytic & distribution models) Galaxies

How do we make cosmological simulations fast and differentiable?

the hammer behind the Deep Learning revolution


  • Automatic differentiation allows you to compute analytic derivatives of arbitrary expressions:
    If I form the expression $y = a * x + b$, it is decomposed into fundamental ops: $$ y = u + b \qquad u = a * x $$ then gradients can be obtained by the chain rule: $$\frac{\partial y}{\partial x} = \frac{\partial y}{\partial u} \frac{ \partial u}{\partial x} = 1 \times a = a$$ (see the TensorFlow example below)

  • This is a fundamental tool in Machine Learning, and autodiff frameworks include TensorFlow, JAX or PyTorch.
    $\Longrightarrow$ In addition to autodiff, they all provide GPU/TPU acceleration.
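
The toy expression above, written with one of these frameworks (a short TensorFlow sketch):

import tensorflow as tf

a, b = tf.constant(3.0), tf.constant(1.0)
x = tf.Variable(2.0)
with tf.GradientTape() as tape:
    y = a * x + b                   # internally decomposed into u = a * x, then y = u + b
dy_dx = tape.gradient(y, x)         # chain rule gives dy/dx = a
print(dy_dx.numpy())                # 3.0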


the Fast Particle-Mesh scheme

The idea: approximate gravitational forces by estimating densities on a grid.
  • The numerical scheme:

    • Estimate the density of particles on a mesh
      $\Longrightarrow$ compute gravitational forces by FFT

    • Interpolate forces at particle positions

    • Update particle velocity and positions, and iterate

  • Fast and simple, at the cost of approximating short range interactions.
$\Longrightarrow$ Only a series of FFTs and interpolations (a minimal sketch of one force evaluation follows below).
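
A minimal NumPy sketch of one such force evaluation, with nearest-grid-point painting instead of cloud-in-cell and constants/units dropped, just to show the "FFTs and interpolations" structure:

import numpy as np

def pm_forces(positions, nc=32, boxsize=100.0):
    # 1) paint particles on the mesh (NGP for simplicity)
    cells = np.floor(positions / boxsize * nc).astype(int) % nc
    delta = np.zeros((nc, nc, nc))
    np.add.at(delta, tuple(cells.T), 1.0)
    delta = delta / delta.mean() - 1.0                       # overdensity field

    # 2) solve the Poisson equation in Fourier space: phi_k = -delta_k / k^2
    k = np.fft.fftfreq(nc) * 2 * np.pi * nc / boxsize
    kx, ky, kz = np.meshgrid(k, k, k, indexing="ij")
    k2 = kx**2 + ky**2 + kz**2
    k2[0, 0, 0] = 1.0                                        # avoid division by zero at k = 0
    phi_k = -np.fft.fftn(delta) / k2

    # 3) forces = -grad(phi), read back at the particle cells
    forces_mesh = [np.real(np.fft.ifftn(-1j * kk * phi_k)) for kk in (kx, ky, kz)]
    return np.stack([f[tuple(cells.T)] for f in forces_mesh], axis=-1)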

Introducing FlowPM: Particle-Mesh Simulations in TensorFlow

Modi, Lanusse, Seljak (2020)

import numpy as np
import tensorflow as tf
import flowpm

# Defines integration steps
stages = np.linspace(0.1, 1.0, 10, endpoint=True)

initial_conds = flowpm.linear_field(32,       # size of the cube
                                    100,      # Physical size
                                    ipklin,   # Initial powerspectrum
                                    batch_size=16)

# Sample particles and displace them by LPT
state = flowpm.lpt_init(initial_conds, a0=0.1)

# Evolve particles down to z=0
final_state = flowpm.nbody(state, stages, 32)

# Retrieve final density field
final_field = flowpm.cic_paint(tf.zeros_like(initial_conds),
                               final_state[0])
  • Seamless interfacing with deep learning components
  • Gradients readily available
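
As a usage sketch building on the snippet above (assuming TensorFlow 2 eager execution), the gradient of any scalar summary of the simulation with respect to the initial conditions comes straight from a GradientTape:

with tf.GradientTape() as tape:
    tape.watch(initial_conds)
    state = flowpm.lpt_init(initial_conds, a0=0.1)
    final_state = flowpm.nbody(state, stages, 32)
    final_field = flowpm.cic_paint(tf.zeros_like(initial_conds), final_state[0])
    loss = tf.reduce_sum(final_field**2)         # any scalar summary of the final field
grads = tape.gradient(loss, initial_conds)       # d loss / d initial conditions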









Bayesian Reconstruction of Cosmological Fields


Going back to simpler times...
$$\arg\max_s \ \log p(x | f(s)) \ + \ \log p(s) $$ where:
  • $f$ is the forward model
  • $p(x | f(s))$ is a tractable data likelihood
  • $s$ are the initial conditions (early universe)
  • $x$ is the data (e.g. galaxies, lensing, etc.)
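
A minimal sketch of this MAP optimization with a differentiable forward model; `forward` (e.g. built from FlowPM as above), the data `x_data`, the noise level `sigma_n`, and the unit Gaussian prior on $s$ are placeholders for illustration.

import tensorflow as tf

s = tf.Variable(tf.random.normal([1, 32, 32, 32]))         # current estimate of the initial conditions
optimizer = tf.keras.optimizers.Adam(learning_rate=0.1)

for step in range(500):
    with tf.GradientTape() as tape:
        log_likelihood = -0.5 * tf.reduce_sum((x_data - forward(s))**2) / sigma_n**2
        log_prior = -0.5 * tf.reduce_sum(s**2)              # stand-in Gaussian prior on s
        loss = -(log_likelihood + log_prior)                # negative log posterior
    grads = tape.gradient(loss, [s])
    optimizer.apply_gradients(zip(grads, [s]))              # gradient step towards the MAP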

MAP optimization in action

$$\arg\max_s \ \log p(x_{dm} | f(s)) \ + \ \log p(s) $$
credit: C. Modi


Panels: true initial conditions $s_0$; reconstructed initial conditions $s$; reconstructed dark matter distribution $x_{dm} = f(s)$; data $x_{dm} = f(s_0)$.


If only MAP optimization was easy...

$$\arg\max_s \ \log p(x_{dm} | f(s)) \ + \ \log p(s) $$
Direct optimization
ad-hoc annealing
  • Direct optimization of MAP leads to poor solutions on large scales.
  • Annealing recovers unbiased large scales, but at the cost of ad-hoc tempering procedure.

Instead of guessing an optimization scheme, could we learn to optimize?

A closer look at the optimization algorithm


$$\arg\max_x \ \log p(y | f(x)) \ + \ \log p(x) $$
  • Standard Gradient Descent Algorithm: $$x_{i+1} = x_i - \epsilon \left[ \nabla_x{\log p(y | f(x_i))} + \nabla_x \log p(x_i) \right]$$
    $$x_{i+1} = x_i - \Gamma \left(\nabla_x{\log p(y | f(x_i))}, \nabla_x \log p(x_i) \right)$$ with update function $\Gamma: (u,v) \rightarrow \epsilon(u + v)$


  • Many algorithms (e.g. ADAM, LBFGS) can be expressed in this form with a different choice of $\Gamma$.

$\Longrightarrow$ What if we could learn this update function?

Recurrent Inference Machines for Solving Inverse Problems
Putzky & Welling, 2017

  • Introduce a Recurrent Neural Network (RNN) $h_\phi$, and state variable $s$, so that: $$ s_{i+1} = h^*_\phi( \nabla \log p(y|x_{i}), x_i, s_{i})$$ $$ x_{i+1} = x_i + h_\phi(\nabla \log p(y|x_{i}), x_i, s_{i+1})$$
  • Train according to: $$\mathcal{L} = \sum_{i=1}^T w_i\mathcal{L}(x_i, x)$$
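
A minimal sketch of this recursion; `update_net` is an assumed learned network standing in for $h_\phi$, mapping the likelihood gradient, the current estimate, and the state to an increment and a new state.

def rim_solve(x0, grad_log_likelihood, update_net, n_steps=10, state=None):
    x, trajectory = x0, []
    for _ in range(n_steps):
        # h_phi( grad log p(y|x_i), x_i, s_i ) -> (increment, s_{i+1})
        dx, state = update_net(grad_log_likelihood(x), x, state)
        x = x + dx                                           # x_{i+1} = x_i + h_phi(...)
        trajectory.append(x)
    # training: weighted sum of losses L(x_i, x_true) over the trajectory
    return trajectory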

Illustration on inverse problems






From left to right: input masked image, solutions after an increasing number of steps, and ground truth.




CosmicRIM: Recurrent Inference Machines for Initial Condition Reconstruction

Modi, Lanusse, Seljak, Spergel, Perreault-Levasseur (2021)

Recurrent Neural Network Architecture
  • A few notable differences to a vanilla RIM:
    • We provide gradients of both prior and likelihood to the model.

    • Because our forward model couples scales, we use a multiscale U-Net architecture.

    • Input gradients are pre-scaled with the ADAM formula.

Experiments

Settings
  • Forward model: $64^3$ particles, 400 Mpc/h box, 2LPT dynamics with 2nd order bias model
  • RIM: 10 steps, trained under an $\ell_2$ loss
Initial conditions cross-correlation
Transfer function
  • CosmicRIM: Learn to optimize by embedding a Neural Network in the optimization algorithm.
    $\Longrightarrow$ converges 40x faster than LBFGS.

Mesh FlowPM: distributed, GPU-accelerated, and automatically differentiable simulations

  • Developed a Mesh TensorFlow implementation that can scale on GPU clusters (horovod+NCCL).


  • For a $2048^3$ simulation:
    • Distributed on 256 NVIDIA V100 GPUs
    • Runtime: 3 mins


  • Don't hesitate to reach out if you have a use case for model parallelism!

Conclusion

Conclusion

What can be gained by merging Deep Learning and Physical Models?
$\Longrightarrow$ Makes Bayesian inference possible at scale and with non-trivial models! This provides robustness, interpretability, and uncertainty quantification.

  • Complement known physical models with data-driven components
    • Use data-driven generative model as prior for solving inverse problems.


  • Differentiable physical models for fast inference



Thank you !