# Blurring the Line Between Deep Learning and Physical Models

## François Lanusse

slides at eiffl.github.io/talks/LPC2021

### the Rubin Observatory Legacy Survey of Space and Time

• 1000 images each night, 15 TB/night for 10 years

• 18,000 square degrees, observed once every few days

• Tens of billions of objects, each one observed $\sim1000$ times

Previous generation survey: SDSS. Current generation survey: DES. LSST precursor survey: HSC.
(Image credit: Peter Melchior)

### The challenges of modern surveys

$\Longrightarrow$ Modern surveys will provide large volumes of high-quality data

A Blessing
• Unprecedented statistical power

• Great potential for new discoveries

A Curse
• Existing methods are reaching their limits at every step of the science analysis

• Control of systematics is paramount
(Figure: LSST forecast on dark energy parameters.)

$\Longrightarrow$ Dire need for novel analysis techniques to fully realize the potential of modern surveys.

## Can AI solve all of our problems?

Image credit: NAOJ

### A perfect task for Deep Learning

CMUDeepLens architecture (Lanusse et al. 2017); lens candidates from Huang et al. (2019), arXiv:1906.00970

• This works great, as long as you don't care too much about which lenses are identified and why

### The AI bubble

astro-ph abstracts mentioning Deep Learning, CNN, or Neural Networks

### Be careful to ask the right questions!

The most common pitfall: Covariate Shift

$\Longrightarrow$ The training data plays an integral part in the answer.

Example: a branched GAN model for deblending (Reiman & Göhre, 2018). The issues with using deep learning as a black box:
• No explicit control of noise, PSF, number of sources.
• Model would have to be retrained for all observing configurations

• No guarantees on the network output (e.g. flux preservation, artifacts)

• No proper uncertainty quantification.

### Focus of this talk

• Deep Learning for astronomical data reduction

• Bayesian Neural Networks

• Physical Bayesian inference

This talk: a generic approach to uncertainty quantification and interpretability, combining:
• Physical Forward Models
• Deep Generative Models
• Bayesian Inference

# Deep Generative Models as Data-Driven Bayesian Priors

### Gravitational lensing

Galaxy shapes as estimators for gravitational shear
$$e = \gamma + e_i \qquad \mbox{ with } \qquad e_i \sim \mathcal{N}(0, I)$$
• We are trying to measure the ellipticity $e$ of galaxies as a noisy estimator of the gravitational shear $\gamma$

### Weak Lensing Mass-Mapping as an Inverse Problem

Shear $\gamma$
Convergence $\kappa$
$$\gamma_1 = \frac{1}{2} (\partial_1^2 - \partial_2^2) \ \Psi \quad;\quad \gamma_2 = \partial_1 \partial_2 \ \Psi \quad;\quad \kappa = \frac{1}{2} (\partial_1^2 + \partial_2^2) \ \Psi$$
$$\boxed{\gamma = \mathbf{P} \kappa}$$
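As a concrete illustration, here is a minimal NumPy sketch of the forward operator $\mathbf{P}$ in Fourier space, assuming a flat sky and periodic boundaries (the function name and conventions are ours, not those of any production pipeline):

```python
import numpy as np

def forward_shear(kappa):
    """Apply the Kaiser-Squires forward operator gamma = P kappa.

    kappa: real (N, N) convergence map.
    Returns the complex shear field gamma = gamma1 + 1j * gamma2.
    """
    n = kappa.shape[0]
    k1, k2 = np.meshgrid(np.fft.fftfreq(n), np.fft.fftfreq(n), indexing="ij")
    k_sq = k1**2 + k2**2
    k_sq[0, 0] = 1.0  # avoid division by zero; the k=0 mode carries no shear
    # Fourier-space kernel relating convergence to shear
    kernel = ((k1**2 - k2**2) + 2j * k1 * k2) / k_sq
    kernel[0, 0] = 0.0
    return np.fft.ifft2(kernel * np.fft.fft2(kappa))
```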

### Illustration on the Dark Energy Survey (DES) Y3

Jeffrey et al. (2021)

### Linear inverse problems

$\boxed{y = \mathbf{A}x + n}$

$\mathbf{A}$ is known and encodes our physical understanding of the problem.
$\Longrightarrow$ When $\mathbf{A}$ is non-invertible or ill-conditioned, the inverse problem is ill-posed: there is no unique solution $x$.
Canonical examples: deconvolution, inpainting, denoising.
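To see how ill-posedness arises, consider inpainting: $\mathbf{A}$ is then a diagonal masking operator, so infinitely many images agree with the observation on the measured pixels. A toy setup (purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 64))            # true image
mask = rng.uniform(size=x.shape) > 0.3   # 1 where observed, 0 where missing

# A is a diagonal masking operator: it is not invertible, so every image
# that matches y on the observed pixels is an equally valid solution.
y = mask * x + 0.1 * rng.normal(size=x.shape)
```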

### What Would a Bayesian Do?

$\boxed{y = \mathbf{A}x + n}$

The Bayesian view of the problem:
$$p(x | y) \propto p(y | x) \ p(x)$$
• $p(y | x)$ is the data likelihood, which contains the physics

• $p(x)$ is the prior knowledge on the solution.

With these concepts in hand we can:
• Estimate, for instance, the Maximum A Posteriori solution:
$$\hat{x} = \arg\max\limits_x \ \log p(y \ | \ x) + \log p(x)$$
• Sample from the full posterior $p(x|y)$ with MCMC or Variational Inference methods.

### Classical examples of signal priors

Sparse
$$\log p(x) = -\parallel \mathbf{W} x \parallel_1 + \mathrm{cst}$$
Gaussian $$\log p(x) = -\frac{1}{2} x^t \mathbf{\Sigma}^{-1} x + \mathrm{cst}$$
Total Variation $$\log p(x) = -\parallel \nabla x \parallel_1 + \mathrm{cst}$$

$\Longrightarrow$ Priors only available in the form of numerical simulations.

## Can we use Deep Learning to embed simulation-driven priors within a physical Bayesian model?

### What is generative modeling?

• The goal of generative modeling is to learn the distribution $\mathbb{P}$ from which the training set $X = \{x_0, x_1, \ldots, x_n \}$ is drawn.

• Usually, this means building a parametric model $\mathbb{P}_\theta$ that tries to be close to $\mathbb{P}$.

(Illustration: the true distribution $\mathbb{P}$, samples $x_i \sim \mathbb{P}$, and the learned model $\mathbb{P}_\theta$.)

• Once trained, you can typically sample from $\mathbb{P}_\theta$ and/or evaluate the likelihood $p_\theta(x)$.

### Why isn't it easy?

• The curse of dimensionality puts all points far apart in high dimension

Distance between pairs of points drawn from a Gaussian distribution.

• Classical methods for estimating probability densities, e.g. Kernel Density Estimation (KDE), start to fail in high dimension because of all the gaps between samples

### The evolution of generative models

• Deep Belief Network
(Hinton et al. 2006)

• Variational AutoEncoder
(Kingma & Welling 2014)

• Generative Adversarial Network
(Goodfellow et al. 2014)

• Wasserstein GAN
(Arjovsky et al. 2017)

### A visual Turing test

(Panels: fake PixelCNN samples vs. real galaxies from SDSS.)

### Getting started with Deep Priors: deep denoising example

$$\boxed{{\color{Orchid} y} = {\color{SkyBlue} x} + n}$$
• Let us assume we have access to examples of ${\color{SkyBlue} x}$ without noise.

• We learn the distribution of noiseless data $\log p_\theta(x)$ from samples using a deep generative model.

• The solution should lie on the realistic data manifold, symbolized by the two-moons distribution.

We want to solve for the Maximum A Posteriori solution:

$$\arg \max\limits_{\color{SkyBlue} x} \ - \frac{1}{2} \parallel {\color{Orchid} y} - {\color{SkyBlue} x} \parallel_2^2 + \log p_\theta({\color{SkyBlue} x})$$

This can be done by gradient ascent as long as one has access to the score function $\nabla_x \log p_\theta(x)$.
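A minimal sketch of that gradient ascent, assuming a `score_prior` callable (e.g. a trained network) that returns $\nabla_x \log p_\theta(x)$; names and step sizes are illustrative:

```python
import numpy as np

def map_denoise(y, score_prior, n_steps=200, tau=0.01):
    """Gradient ascent on log p(y|x) + log p_theta(x) for y = x + n, n ~ N(0, I)."""
    x = np.array(y, copy=True)
    for _ in range(n_steps):
        score_lik = y - x                           # gradient of -0.5 * ||y - x||^2
        x = x + tau * (score_lik + score_prior(x))  # ascend the posterior score
    return x
```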

### The score is all you need!

• Whether you are looking for the MAP or sampling with HMC or MALA, you only need access to the score of the posterior: $$\nabla_x \log p(x | y)$$
• Gradient ascent: $x_{t+1} = x_t + \tau \nabla_x \log p(x_t | y)$
• Langevin algorithm: $x_{t+1} = x_t + \tau \nabla_x \log p(x_t | y) + \sqrt{2\tau} \ n_t$ with $n_t \sim \mathcal{N}(0, I)$

• The score of the full posterior is simply: $$\nabla_x \log p(x |y) = \underbrace{\nabla_x \log p(y |x)}_{\mbox{known}} \quad + \quad \underbrace{\nabla_x \log p(x)}_{\mbox{can be learned}}$$ $\Longrightarrow$ all we have to do is model/learn the score of the prior.

### Neural Score Estimation by Denoising Score Matching

• Denoising Score Matching: An optimal Gaussian denoiser learns the score of a given distribution.
• If $x \sim \mathbb{P}$ is corrupted by additive Gaussian noise $u \sim \mathcal{N}(0, \sigma^2 I)$ to yield $$x^\prime = x + u$$
• Let's consider a denoiser $r_\theta$ trained under an $\ell_2$ loss: $$\mathcal{L}=\parallel x - r_\theta(x^\prime, \sigma) \parallel_2^2$$
• The optimal denoiser $r_{\theta^\star}$ verifies: $$\boxed{\boldsymbol{r}_{\theta^\star}(\boldsymbol{x}', \sigma) = \boldsymbol{x}' + \sigma^2 \nabla_{\boldsymbol{x}} \log p_{\sigma^2}(\boldsymbol{x}')}$$
(Illustration: the optimal denoiser $\boldsymbol{r}^\star(\boldsymbol{x}', \sigma)$ maps the noisy $\boldsymbol{x}'$ back towards $\boldsymbol{x}$; the residual $\boldsymbol{x}' - \boldsymbol{r}^\star(\boldsymbol{x}', \sigma)$ is proportional to the score.)

### Training a Neural Score Estimator in practice

A standard UNet
• We use a very standard residual UNet, and we adopt a residual score matching loss (a TensorFlow sketch of this loss follows below): $$\mathcal{L}_{DSM} = \underset{\boldsymbol{x} \sim P}{\mathbb{E}} \underset{\begin{subarray}{c} \boldsymbol{u} \sim \mathcal{N}(0, I) \\ \sigma_s \sim \mathcal{N}(0, s^2) \end{subarray}}{\mathbb{E}} \parallel \boldsymbol{u} + \sigma_s \boldsymbol{r}_{\theta}(\boldsymbol{x} + \sigma_s \boldsymbol{u}, \sigma_s) \parallel_2^2$$ $\Longrightarrow$ direct estimator of the score $\nabla \log p_\sigma(x)$

• Lipschitz regularization to improve robustness:

(Comparison: score field without regularization vs. with regularization.)
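A sketch of the loss above in TensorFlow, assuming `r_theta(x_noisy, sigma)` is the UNet score estimator and `x` a batch of NHWC images (names are ours):

```python
import tensorflow as tf

def dsm_loss(r_theta, x, s=1.0):
    """Residual denoising score matching loss for one batch of images x."""
    u = tf.random.normal(tf.shape(x))                              # u ~ N(0, I)
    sigma = tf.random.normal([tf.shape(x)[0], 1, 1, 1], stddev=s)  # sigma_s ~ N(0, s^2)
    # r_theta directly outputs an estimate of the score of p_sigma
    res = u + sigma * r_theta(x + sigma * u, sigma)
    return tf.reduce_mean(tf.reduce_sum(res**2, axis=[1, 2, 3]))
```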

### Efficient sampling by Annealed HMC

• Even with gradients, sampling in high dimension is difficult!
$\Longrightarrow$ Use a parallel annealing strategy to effectively sample from the full distribution.

• We use the fact that our score network $\mathbf{r}_\theta(x, \sigma)$ is learning the score of a noise-convolved distribution, $\nabla \log p_\sigma$
$$\sigma_1 > \sigma_2 > \sigma_3 > \sigma_4$$
• Run many HMC chains in parallel, progressively annealing $\sigma$ to 0, and keep the last point of each chain as an independent sample (see the schematic below).
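A schematic of the annealing loop; for brevity this sketch uses Langevin updates in place of full HMC, and `score_net` stands in for the trained $\mathbf{r}_\theta$:

```python
import numpy as np

def annealed_sampling(score_net, shape, sigmas=(1.0, 0.5, 0.1, 0.01),
                      n_steps=100, tau=1e-3, rng=None):
    """Anneal sigma towards 0; the chain's last point is kept as one sample.

    score_net(x, sigma) estimates the score of the noise-convolved p_sigma.
    Running many such chains in parallel yields independent samples.
    """
    rng = rng or np.random.default_rng()
    x = rng.normal(size=shape)   # start at the most noise-dominated level
    for sigma in sigmas:         # sigma_1 > sigma_2 > ... -> 0
        for _ in range(n_steps):
            x = (x + tau * score_net(x, sigma)
                 + np.sqrt(2 * tau) * rng.normal(size=shape))
    return x
```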

### Back to the Mass-Mapping Problem

$$\log p( \kappa | e) = \underbrace{\log p(e | \kappa)}_{\simeq -\frac{1}{2} \parallel e - \mathbf{P} \kappa \parallel_\Sigma^2} + \log p(\kappa) + \mathrm{cst}$$
• The likelihood term is known analytically.
• There is no closed-form expression for the full non-Gaussian prior on the convergence.
However:
• We do have access to samples from the full prior through simulations: $X = \{\kappa_0, \kappa_1, \ldots, \kappa_n \}$ with $\kappa_i \sim \mathbb{P}$
$\Longrightarrow$ Our strategy: Learn the prior score $\nabla \log p(\kappa)$ with Denoising Score Matching, and then sample the full posterior with annealed HMC.
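Concretely, the analytic part of the posterior score is just the gradient of the Gaussian likelihood above, $$\nabla_\kappa \log p(e | \kappa) = \mathbf{P}^T \Sigma^{-1} (e - \mathbf{P} \kappa),$$ so each sampling step only requires applying $\mathbf{P}$ and its adjoint, plus one evaluation of the learned prior score network.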

### Illustration on $\kappa$-TNG simulations

(Panels: true convergence map, Wiener filter, posterior mean (ours), and posterior samples.)

### Probabilistic Mass-Mapping of the HST COSMOS field

Remy, Lanusse et al. 2020

• COSMOS shear data from Schrabback et al. 2010

• Prior learned from MassiveNuS at fiducial cosmology (320x320 maps at 0.4 arcsec resolution).

• Known massive X-ray clusters are indicated with crosses, along with their redshifts; the right panel shows cutouts of the central cluster from multiple posterior samples.

### Uncertainty quantification in Magnetic Resonance Imaging (MRI)

Ramzi, Remy, Lanusse et al. 2020

$$\boxed{y = \mathbf{M} \mathbf{F} x + n}$$
where $\mathbf{F}$ is the Fourier transform and $\mathbf{M}$ the $k$-space sampling mask.

$\Longrightarrow$ We can see which parts of the image are well constrained by data, and which regions are uncertain.

### Example of application to deblending

• Hybrid Physical-Deep Learning Model for Astronomical Inverse Problems
F. Lanusse, P. Melchior, F. Moolekamp

$$\mathcal{L} = \frac{1}{2} \parallel \mathbf{\Sigma}^{-1/2} (\ Y - P \ast A S \ ) \parallel_2^2 - \sum_{i=1}^K \log p_{\theta}(S_i)$$

(Example: isolated galaxy, $\log p_\theta(x) = 3293.7$; artificial blend, $\log p_\theta(x) = 3100.5$.)

### takeaways

Benefits of Bayesian forward modeling
Using a Bayesian approach has great advantages: model-based physical interpretation & uncertainty quantification.

• Explicit likelihood, makes use of all of our physical knowledge.
$\Longrightarrow$ The method can be applied to varying PSF, noise, or even different instruments!

• Deep generative models can be used to provide data-driven priors.
$\Longrightarrow$ Embeds priors that are only accessible through samples (e.g. numerical simulations), and can itself be embedded in a hierarchical model.

• Neural Score Estimation is a scalable approach to learn a prior score.

• We implemented the "ultimate mass-mapping method"
$\Longrightarrow$ Recovered the best convergence map of the COSMOS field to date.

# Automatically Differentiable Physics

### traditional cosmological inference

(Figures: HSC cosmic shear power spectrum and HSC Y1 constraints on $(S_8, \Omega_m)$; Hikage et al. 2018.)
• Measure the ellipticity $\epsilon = \epsilon_i + \gamma$ of all galaxies
$\Longrightarrow$ Noisy tracer of the weak lensing shear $\gamma$

• Compute summary statistics based on 2pt functions,
e.g. the power spectrum

• Run an MCMC to recover a posterior on model parameters, using an analytic likelihood $$p(\theta | x ) \propto \underbrace{p(x | \theta)}_{\mathrm{likelihood}} \ \underbrace{p(\theta)}_{\mathrm{prior}}$$
Main limitation: the need for an explicit likelihood
We can only compute the likelihood for simple summary statistics and on large scales

$\Longrightarrow$ We are dismissing most of the information!

### A different road: forward modeling

• Instead of trying to analytically evaluate the likelihood, let us build a forward model of the observables.

• Each component of the model is now tractable, but at the cost of a large number of latent variables.

$\Longrightarrow$ How to perform efficient inference in this large number of dimensions?

A non-exhaustive list of methods:
• Hamiltonian Monte-Carlo
• Variational Inference
• MAP+Laplace
• Gold Mining
• Dimensionality reduction by Fisher-Information Maximization

What do they all have in common?
$\Longrightarrow$ They require fast, accurate, differentiable forward simulations
(Illustration: Schneider et al. 2015)

### Forward Models in Cosmology

Linear Field $\longrightarrow$ [N-body simulations] $\longrightarrow$ Final Dark Matter $\longrightarrow$ [Group finding algorithms] $\longrightarrow$ Dark Matter Halos $\longrightarrow$ [Semi-analytic & distribution models] $\longrightarrow$ Galaxies

## How do we make cosmological simulations fast and differentiable?

### the hammer behind the Deep Learning revolution

• Automatic differentiation allows you to compute analytic derivatives of arbitrary expressions:
If I form the expression $y = a * x + b$, it is separated into fundamental ops: $$y = u + b \qquad u = a * x$$ then gradients can be obtained by the chain rule: $$\frac{\partial y}{\partial x} = \frac{\partial y}{\partial u} \frac{ \partial u}{\partial x} = 1 \times a = a$$

• This is a fundamental tool in Machine Learning, and autodiff frameworks include TensorFlow, JAX or PyTorch.
$\Longrightarrow$ In addition to autodiff, they all provide GPU/TPU acceleration.
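The toy example above takes a few lines in TensorFlow:

```python
import tensorflow as tf

a, b = tf.constant(3.0), tf.constant(1.0)
x = tf.Variable(2.0)

with tf.GradientTape() as tape:
    y = a * x + b            # traced as fundamental ops: u = a * x, y = u + b

# The chain rule is applied automatically: dy/dx = a
print(tape.gradient(y, x))   # tf.Tensor(3.0, shape=(), dtype=float32)
```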

### the Fast Particle-Mesh scheme

The idea: approximate gravitational forces by estimating densities on a grid.
• The numerical scheme:

• Estimate the density of particles on a mesh
$\Longrightarrow$ compute gravitational forces by FFT

• Interpolate forces at particle positions

• Update particle velocities and positions, and iterate

• Fast and simple, at the cost of approximating short-range interactions.
$\Longrightarrow$ Only a series of FFTs and interpolations (sketched below).
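To make this concrete, here is a heavily simplified NumPy sketch of the force computation, using nearest-grid-point painting instead of the usual Cloud-In-Cell kernel and dropping physical constants (illustrative only):

```python
import numpy as np

def pm_forces(positions, n_grid):
    """Approximate gravitational forces on particles in a periodic box.

    positions: (n_particles, 3) array in grid units [0, n_grid).
    """
    # 1. Paint particles onto the mesh to estimate the overdensity
    idx = np.floor(positions).astype(int) % n_grid
    delta = np.zeros((n_grid,) * 3)
    np.add.at(delta, tuple(idx.T), 1.0)
    delta = delta / delta.mean() - 1.0

    # 2. Solve the Poisson equation by FFT: phi_k = -delta_k / k^2
    k = 2 * np.pi * np.fft.fftfreq(n_grid)
    kx, ky, kz = np.meshgrid(k, k, k, indexing="ij")
    k_sq = kx**2 + ky**2 + kz**2
    k_sq[0, 0, 0] = 1.0
    phi_k = -np.fft.fftn(delta) / k_sq

    # 3. Forces F = -grad(phi), read back at the particles' grid cells
    #    (real codes interpolate; velocities and positions are then updated)
    return -np.stack([np.real(np.fft.ifftn(1j * ki * phi_k))[tuple(idx.T)]
                      for ki in (kx, ky, kz)], axis=-1)
```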

### Introducing FlowPM: Particle-Mesh Simulations in TensorFlow

Modi, Lanusse, Seljak (2020)

```python
import numpy as np
import tensorflow as tf
import flowpm

# Defines integration steps
stages = np.linspace(0.1, 1.0, 10, endpoint=True)

# Sample a Gaussian linear field
initial_conds = flowpm.linear_field(32,       # Size of the cube
                                    100,      # Physical size
                                    ipklin,   # Initial power spectrum
                                    batch_size=16)

# Sample particles and displace them by LPT
state = flowpm.lpt_init(initial_conds, a0=0.1)

# Evolve particles down to z=0
final_state = flowpm.nbody(state, stages, 32)

# Retrieve final density field
final_field = flowpm.cic_paint(tf.zeros_like(initial_conds),
                               final_state[0])
```

• Seamless interfacing with deep learning components
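Because every step above is a TensorFlow op, the simulation can be differentiated end-to-end; for instance, a sketch reusing the snippet above with an arbitrary scalar objective:

```python
# Differentiate the simulation with respect to the initial conditions
with tf.GradientTape() as tape:
    tape.watch(initial_conds)
    state = flowpm.lpt_init(initial_conds, a0=0.1)
    final_state = flowpm.nbody(state, stages, 32)
    final_field = flowpm.cic_paint(tf.zeros_like(initial_conds),
                                   final_state[0])
    loss = tf.reduce_sum(final_field**2)    # any scalar objective works

grads = tape.gradient(loss, initial_conds)  # d loss / d initial conditions
```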

### Bayesian Reconstruction of Cosmological Fields

Going back to simpler times...
$$\arg\max_s \ \log p(x | f(s)) \ + \ \log p(s)$$ where:
• $f$ is the forward model
• $p(x | f(s))$ is a tractable data likelihood
• $s$ are the initial conditions (early universe)
• $x$ is the data (e.g. galaxies, lensing, etc.)

### MAP optimization in action

$$\arg\max_s \ \log p(x_{dm} | f(s)) \ + \ \log p(s)$$
credit: C. Modi

(Panels: true initial conditions $s_0$; reconstructed initial conditions $s$; reconstructed dark matter distribution $x_{dm} = f(s)$; data $x_{dm} = f(s_0)$.)

### If only MAP optimization was easy...

$$\arg\max_s \ \log p(x_{dm} | f(s)) \ + \ \log p(s)$$
• Direct optimization of the MAP objective leads to poor solutions on large scales.
• Annealing recovers unbiased large scales, but at the cost of an ad-hoc tempering procedure.

## Instead of guessing an optimization scheme, could we learn to optimize?

### A closer look at the optimization algorithm

$$\arg\max_x \ \log p(y | f(x)) \ + \ \log p(x)$$
• Standard gradient ascent on the log posterior: $$x_{i+1} = x_i + \epsilon \left[ \nabla_x{\log p(y | f(x_i))} + \nabla_x \log p(x_i) \right]$$
$$x_{i+1} = x_i + \Gamma \left(\nabla_x{\log p(y | f(x_i))}, \nabla_x \log p(x_i) \right)$$ with update function $\Gamma: (u,v) \mapsto \epsilon(u + v)$

• Many algorithms (e.g. ADAM, LBFGS) can be expressed in this form with a different choice of $\Gamma$.

$\Longrightarrow$ What if we could learn this update function?

### Recurrent Inference Machines for Solving Inverse Problems (Putzky & Welling, 2017)

• Introduce a Recurrent Neural Network (RNN) $h_\phi$, and state variable $s$, so that: $$s_{i+1} = h^*_\phi( \nabla \log p(y|x_{i}), x_i, s_{i})$$ $$x_{i+1} = x_i + h_\phi(\nabla \log p(y|x_{i}), x_i, s_{i+1})$$
• Train according to a loss summed over iterations: $$\mathcal{L}_{total} = \sum_{i=1}^T w_i \, \mathcal{L}(x_i, x)$$ A pseudocode sketch of the unrolled loop follows below.
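The unrolled loop might look as follows; `rnn_cell` and `grad_log_likelihood` are hypothetical stand-ins for $h_\phi$ and the known physics gradient:

```python
import tensorflow as tf

def rim_solve(y, x0, s0, rnn_cell, grad_log_likelihood, n_steps=10):
    """Unrolled Recurrent Inference Machine iterations.

    rnn_cell(inputs, state) -> (delta_x, new_state) is the learned
    update function; grad_log_likelihood(x, y) encodes the physics.
    """
    x, state = x0, s0
    xs = []
    for _ in range(n_steps):
        grad = grad_log_likelihood(x, y)
        delta_x, state = rnn_cell(tf.concat([x, grad], axis=-1), state)
        x = x + delta_x                  # learned update replaces eps * grad
        xs.append(x)
    return xs  # intermediate estimates, for the weighted training loss
```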

### Illustration on inverse problems

From left to right: input masked image, increasing number of steps of solutions, ground truth.

### CosmicRIM: Recurrent Inference Machines for Initial Condition Reconstruction

Modi, Lanusse, Seljak, Spergel, Perreault-Levasseur (2021)

Recurrent Neural Network Architecture
• A few notable differences to a vanilla RIM:
• We provide gradients of both prior and likelihood to the model.

• Because our forward model couples scales, we use a multiscale U-Net architecture.

• Input gradients are pre-scaled with the ADAM formula.

### Experiments

Settings
• Forward model: $64^3$ particles, 400 Mpc/h box, 2LPT dynamics with 2nd order bias model
• RIM: 10 steps, trained under an $\ell_2$ loss

(Figures: initial conditions cross-correlation and transfer function.)
• CosmicRIM: Learn to optimize by embedding a Neural Network in the optimization algorithm.
$\Longrightarrow$ converges 40x faster than LBFGS.

### Mesh FlowPM: distributed, GPU-accelerated, and automatically differentiable simulations

• Developed a Mesh TensorFlow implementation that can scale on GPU clusters (horovod+NCCL).

• For a $2048^3$ simulation:
• Distributed on 256 NVIDIA V100 GPUs
• Runtime: 3 mins

• Don't hesitate to reach out if you have a use case for model parallelism!

# Conclusion

### Conclusion

What can be gained by merging Deep Learning and Physical Models?
$\Longrightarrow$ Makes Bayesian inference possible at scale and with non-trivial models! This provides robustness, interpretability, and uncertainty quantification.

• Complement known physical models with data-driven components
• Use data-driven generative models as priors for solving inverse problems.

• Differentiable physical models for fast inference

Thank you!