Merging deep learning with physical models

for the analysis of modern cosmological surveys


François Lanusse










slides at eiffl.github.io/Vienna2022

the $\Lambda$CDM view of the Universe















the Rubin Observatory Legacy Survey of Space and Time

  • 1000 images each night, 15 TB/night for 10 years

  • 18,000 square degrees, observed once every few days

  • Tens of billions of objects, each one observed $\sim1000$ times

Previous generation survey: SDSS




















Image credit: Peter Melchior

Current generation survey: DES




















Image credit: Peter Melchior

LSST precursor survey: HSC




















Image credit: Peter Melchior

The challenges of modern surveys

$\Longrightarrow$ Modern surveys will provide large volumes of high quality data

A Blessing
  • Unprecedented statistical power

  • Great potential for new discoveries

A Curse
  • Existing methods are reaching their limits at every step of the science analysis

  • Control of systematics is paramount
LSST forecast on dark energy parameters
$\Longrightarrow$ Dire need for novel analysis techniques to fully realize the potential of modern surveys.

Can AI solve all of our problems?

The AI bubble


astro-ph abstracts mentioning Deep Learning, CNN, or Neural Networks
Branched GAN model for deblending (Reiman & Göhre, 2018)
The issue with using deep learning as a black-box
  • No explicit control of noise, PSF, number of sources.
    • Model would have to be retrained for all observing configurations

  • No guarantees on the network output (e.g. flux preservation, artifacts)

  • No proper uncertainty quantification.

Focus of this talk

  • Deep Learning for astronomical data reduction

  • Bayesian Neural Networks

  • Physical Bayesian inference


This talk
Generic approach to uncertainty quantification and interpretability:
  • (Differentiable) Physical Forward Models
  • Deep Generative Models
  • Bayesian Inference

Generative Modeling

The Key to Manipulating Implicit Distributions

What is generative modeling?


  • The goal of generative modeling is to learn an implicit distribution $\mathbb{P}$ from which the training set $X = \{x_0, x_1, \ldots, x_n \}$ is drawn.

  • Usually, this means building a parametric model $\mathbb{P}_\theta$ that tries to be close to $\mathbb{P}$.


True $\mathbb{P}$

Samples $x_i \sim \mathbb{P}$

Model $\mathbb{P}_\theta$


  • Once trained, you can typically sample from $\mathbb{P}_\theta$ and/or evaluate the likelihood $p_\theta(x)$.

Why isn't it easy?


  • The curse of dimensionality puts all points far apart in high dimensions

Distance between pairs of points drawn from a Gaussian distribution.

  • Classical methods for estimating probability densities, e.g. Kernel Density Estimation (KDE), start to fail in high dimensions because of all the gaps

The evolution of generative models




  • Deep Belief Network
    (Hinton et al. 2006)

  • Variational AutoEncoder
    (Kingma & Welling 2014)

  • Generative Adversarial Network
    (Goodfellow et al. 2014)

  • Wasserstein GAN
    (Arjovsky et al. 2017)



A visual Turing test


Fake PixelCNN samples

Real galaxies from SDSS

Why are these generative models useful?

Implicit distributions are everywhere!

Case I: Examples from data, no accurate physical model

Mandelbaum et al. 2014

Case II: Physical model only available as a simulator

Osato et al. 2020


$\Longrightarrow$ Generative models will enable Bayesian inference in cases where implicit distributions are involved, by providing a tractable $p_\theta(x)$.

Solving Inverse Problems

with deep implicit priors

Linear inverse problems

$\boxed{y = \mathbf{A}x + n}$

$\mathbf{A}$ is known and encodes our physical understanding of the problem.
$\Longrightarrow$ When non-invertible or ill-conditioned, the inverse problem is ill-posed with no unique solution $x$
Deconvolution
Inpainting
Denoising

Gravitational lensing

Galaxy shapes as estimators for gravitational shear
$$ e = \gamma + e_i \qquad \mbox{ with } \qquad e_i \sim \mathcal{N}(0, I)$$
  • We are trying to measure the ellipticity $e$ of galaxies as an estimator for the gravitational shear $\gamma$

Weak Lensing Mass-Mapping as an Inverse Problem

Shear $\gamma$
Convergence $\kappa$
$$\gamma_1 = \frac{1}{2} (\partial_1^2 - \partial_2^2) \ \Psi \quad;\quad \gamma_2 = \partial_1 \partial_2 \ \Psi \quad;\quad \kappa = \frac{1}{2} (\partial_1^2 + \partial_2^2) \ \Psi$$
$$\boxed{\gamma = \mathbf{P} \kappa}$$
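The relation $\gamma = \mathbf{P} \kappa$ is diagonal in Fourier space, which is what the classical Kaiser-Squires estimator exploits. A minimal flat-sky sketch of the forward operator and its inverse (assuming a periodic grid; array shapes and function names are purely illustrative):

    # Flat-sky sketch of gamma = P kappa and the Kaiser-Squires inverse.
    # Assumes periodic boundaries; not meant as a production implementation.
    import numpy as np

    def kappa_to_gamma(kappa):
        k1, k2 = np.meshgrid(np.fft.fftfreq(kappa.shape[0]),
                             np.fft.fftfreq(kappa.shape[1]), indexing='ij')
        k2norm = k1**2 + k2**2
        k2norm[0, 0] = 1.0                               # avoid division by zero at k = 0
        p = ((k1**2 - k2**2) + 2j * k1 * k2) / k2norm    # Fourier representation of P
        gamma = np.fft.ifft2(p * np.fft.fft2(kappa))
        return gamma.real, gamma.imag                    # gamma_1, gamma_2

    def kaiser_squires(gamma1, gamma2):
        k1, k2 = np.meshgrid(np.fft.fftfreq(gamma1.shape[0]),
                             np.fft.fftfreq(gamma1.shape[1]), indexing='ij')
        k2norm = k1**2 + k2**2
        k2norm[0, 0] = 1.0
        pinv = ((k1**2 - k2**2) - 2j * k1 * k2) / k2norm  # conjugate filter inverts P
        kappa = np.fft.ifft2(pinv * np.fft.fft2(gamma1 + 1j * gamma2))
        return kappa.real                                 # E-mode convergence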

Illustration on the Dark Energy Survey (DES) Y3

Jeffrey, et al. (2021)

What Would a Bayesian Do?

$\boxed{y = \mathbf{A}x + n}$

The Bayesian view of the problem:
$$ p(x | y) \propto p(y | x) \ p(x) $$
  • $p(y | x)$ is the data likelihood, which contains the physics

  • $p(x)$ is the prior knowledge on the solution.


    With these concepts in hand we can:
  • Estimate for instance the Maximum A Posteriori solution:
    $$\hat{x} = \arg\max\limits_x \ \log p(y \ | \ x) + \log p(x)$$
  • Sample from the full posterior $p(x|y)$ with MCMC or Variational Inference methods.

How do you choose the prior?

Classical examples of signal priors

Sparse
$$ \log p(x) = - \parallel \mathbf{W} x \parallel_1 + cst $$
Gaussian $$ \log p(x) = - \frac{1}{2} x^t \mathbf{\Sigma}^{-1} x + cst $$
Total Variation $$ \log p(x) = - \parallel \nabla x \parallel_1 + cst $$

But what about this?




$\Longrightarrow$ Implicit prior $p(x)$ in the form of cosmological simulations.

Getting started with Deep Priors: deep denoising example

$$ \boxed{{\color{Orchid} y} = {\color{SkyBlue} x} + n} $$
  • Let us assume we have access to examples of $ {\color{SkyBlue} x}$ without noise.

  • We learn the distribution of noiseless data $\log p_\theta(x)$ from samples using a deep generative model.

  • The solution should lie on the realistic data manifold, symbolized by the two-moons distribution.

    We want to solve for the Maximum A Posteriori solution:

    $$\hat{x} = \arg\max\limits_{\color{SkyBlue} x} \; - \frac{1}{2} \parallel {\color{Orchid} y} - {\color{SkyBlue} x} \parallel_2^2 + \log p_\theta({\color{SkyBlue} x})$$
    This can be done by gradient descent as long as one has access to the score function ${\color{orange} \frac{d \log p(x)}{dx}}$.
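For intuition, here is a minimal sketch of this MAP estimation by gradient ascent, assuming a function prior_score(x) that returns $\frac{d \log p_\theta(x)}{dx}$ (e.g. a trained score network); the function names, step size, and number of iterations are placeholders:

    # MAP estimation by gradient ascent on log p(x|y) for the denoising model y = x + n.
    import jax
    import jax.numpy as jnp

    def log_likelihood(x, y, sigma_n=1.0):
        # Gaussian likelihood term: -1/2 ||y - x||^2 / sigma_n^2
        return -0.5 * jnp.sum((y - x)**2) / sigma_n**2

    def map_estimate(y, prior_score, x0, tau=1e-2, n_steps=500):
        x = x0
        for _ in range(n_steps):
            # Posterior score = gradient of the log-likelihood + prior score
            score = jax.grad(log_likelihood)(x, y) + prior_score(x)
            x = x + tau * score          # ascend the log-posterior
        return x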

Writing down the convergence map log posterior

$$ \log p( \kappa | e) = \underbrace{\log p(e | \kappa)}_{\simeq -\frac{1}{2} \parallel e - P \kappa \parallel_\Sigma^2} + \log p(\kappa) +cst $$
  • The likelihood term is known analytically.
  • There is no closed-form expression for the full non-Gaussian prior of the convergence.
    However:
    • We do have access to samples of the full implicit prior through simulations: $X = \{x_0, x_1, \ldots, x_n \}$ with $x_i \sim \mathbb{P}$
$\Longrightarrow$ Our strategy: Learn the prior from simulation, and then sample the full posterior.

The score is all you need!


  • Whether you are looking for the MAP or sampling with HMC or MALA, you only need access to the score of the posterior: $$ {\color{orange} \frac{d \log p(x | y)}{dx}} $$
    • Gradient descent: $x_{t+1} = x_t + \tau \nabla_x \log p(x_t | y) $
    • Langevin algorithm: $x_{t+1} = x_t + \tau \nabla_x \log p(x_t | y) + \sqrt{2\tau} n_t$



Neural Score Estimation by Denoising Score Matching

  • Denoising Score Matching: An optimal Gaussian denoiser learns the score of a given distribution.
    • If $x \sim \mathbb{P}$ is corrupted by additive Gaussian noise $u \sim \mathcal{N}(0, \sigma^2 \mathbf{I})$ to yield $$x^\prime = x + u$$
    • Let's consider a denoiser $r_\theta$ trained under an $\ell_2$ loss: $$\mathcal{L}=\parallel x - r_\theta(x^\prime, \sigma) \parallel_2^2$$
    • The optimal denoiser $r_{\theta^\star}$ verifies: $$\boxed{\boldsymbol{r}_{\theta^\star}(\boldsymbol{x}', \sigma) = \boldsymbol{x}' + \sigma^2 \nabla_{\boldsymbol{x}} \log p_{\sigma^2}(\boldsymbol{x}')}$$
(Illustration: noisy input $\boldsymbol{x}'$, true image $\boldsymbol{x}$, estimated noise $\boldsymbol{x}'- \boldsymbol{r}^\star(\boldsymbol{x}', \sigma)$, denoised output $\boldsymbol{r}^\star(\boldsymbol{x}', \sigma)$)

Training a Neural Score Estimator in practice




A standard UNet
  • We use a very standard residual UNet, and we adopt a residual score matching loss: $$ \mathcal{L}_{DSM} = \underset{\boldsymbol{x} \sim P}{\mathbb{E}} \underset{\begin{subarray}{c} \boldsymbol{u} \sim \mathcal{N}(0, I) \\ \sigma_s \sim \mathcal{N}(0, s^2) \end{subarray}}{\mathbb{E}} \parallel \boldsymbol{u} + \sigma_s \boldsymbol{r}_{\theta}(\boldsymbol{x} + \sigma_s \boldsymbol{u}, \sigma_s) \parallel_2^2$$
    $\Longrightarrow$ direct estimator of the score $\nabla \log p_\sigma(x)$ (a minimal sketch of this loss is given below)

  • Lipschitz regularization to improve robustness:

    Without regularization
    With regularization
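As a concrete illustration of the residual score matching loss above, here is a minimal sketch in JAX. The score network score_net(params, x_noisy, sigma) and the noise-level scale s are assumptions; this is not the actual training code:

    # Residual denoising score matching loss: || u + sigma * r_theta(x + sigma*u, sigma) ||^2
    import jax
    import jax.numpy as jnp

    def dsm_loss(params, score_net, x, key, s=1.0):
        key_u, key_s = jax.random.split(key)
        u = jax.random.normal(key_u, x.shape)                     # u ~ N(0, I)
        # one noise level per batch element, sigma_s ~ N(0, s^2)
        sigma = s * jax.random.normal(key_s, (x.shape[0],) + (1,) * (x.ndim - 1))
        x_noisy = x + sigma * u
        r = score_net(params, x_noisy, sigma)                     # estimates grad log p_sigma
        return jnp.mean(jnp.sum((u + sigma * r)**2, axis=tuple(range(1, x.ndim))))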

Efficient sampling by Annealed HMC

  • Even with gradients, sampling in high dimensions is difficult!
    $\Longrightarrow$ Use a parallel annealing strategy to effectively sample from the full distribution.

  • We use the fact that our score network $\mathbf{r}_\theta(x, \sigma)$ is learning the score of a noise-convolved distribution, $\nabla \log p_\sigma$
    $$\sigma_1 > \sigma_2 > \sigma_3 > \sigma_4 $$
  • Run many HMC chains in parallel, progressively annealing $\sigma$ to 0, and keep the last point of each chain as an independent sample (schematic sketch below).
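A schematic version of this annealing loop, written here with Langevin updates for brevity (the actual sampler is HMC); posterior_score(x, sigma), the noise schedule, and the step sizes are all illustrative:

    # Annealed sampling: start at a high noise level sigma_1 and progressively anneal towards 0.
    import jax
    import jax.numpy as jnp

    def annealed_sampling(posterior_score, x_init, key,
                          sigmas=jnp.geomspace(1.0, 1e-3, 20), tau=1e-3):
        x = x_init                                   # a batch of parallel chains
        for sigma in sigmas:                         # sigma_1 > sigma_2 > ... -> 0
            for _ in range(50):
                key, sub = jax.random.split(key)
                noise = jax.random.normal(sub, x.shape)
                x = x + tau * posterior_score(x, sigma) + jnp.sqrt(2 * tau) * noise
        return x                                     # last point of each chain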

Example of one chain during annealing

Illustration on $\kappa$-TNG simulations

Remy, Lanusse, et al. (2022)

True convergence map
Traditional Kaiser-Squires
Wiener Filter
Posterior Mean (ours)



Posterior samples

Reconstruction of the HST/ACS COSMOS field


Massey et al. (2007)
Remy et al. (2022) Posterior mean
Remy et al. (2022) Posterior samples

Example of application to deblending


  • Hybrid Physical-Deep Learning Model for Astronomical Inverse Problems (F. Lanusse, P. Melchior, F. Moolekamp)





















The Scarlet algorithm: deblending as an optimization problem

Melchior et al. 2018
$$ \mathcal{L} = \frac{1}{2} \parallel \mathbf{\Sigma}^{-1/2} (\ Y - P \ast A S \ ) \parallel_2^2 - \sum_{i=1}^K \log p_{\theta}(S_i) + \sum_{i=1}^K g_i(A_i) + \sum_{i=1}^K f_i(S_i)$$
Where for a $K$ component blend:
  • $P$ is the convolution with the instrumental response

  • $A_i$ are channel-wise galaxy SEDs, $S_i$ are the morphology models

  • $\mathbf{\Sigma}$ is the noise covariance

  • $\log p_\theta$ is a PixelCNN prior

  • $f_i$ and $g_i$ are arbitrary additional non-smooth constraints, e.g. positivity, monotonicity...
$\Longrightarrow$ Explicit physical modeling of the observed sky
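For concreteness, a sketch of the quadratic data-fidelity term of the objective above, for a $K$-component blend observed in $C$ bands. The psf_convolve placeholder stands in for the instrumental response $P$; the shapes and names are illustrative and do not follow the actual scarlet API:

    # Data-fidelity term 1/2 || Sigma^{-1/2} (Y - P * A S) ||^2 for a K-component blend.
    import jax.numpy as jnp

    def data_fidelity(Y, A, S, psf_convolve, inv_std):
        # A: (C, K) per-band SEDs, S: (K, H, W) morphologies, Y: (C, H, W) observed cube
        model = jnp.einsum('ck,khw->chw', A, S)      # scene model A S
        model = psf_convolve(model)                  # apply the instrumental response P
        residual = inv_std * (Y - model)             # whiten by Sigma^{-1/2} (assumed diagonal)
        return 0.5 * jnp.sum(residual**2)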

Training the morphology prior

Postage stamps of isolated COSMOS galaxies used for training, at WFIRST resolution and fixed fiducial PSF
isolated galaxy $\log p_\theta(x) = 3293.7$
artificial blend $\log p_\theta(x) = 3100.5 $

Scarlet in action

Input blend
Solution
Residuals
  • Classic priors (monotonicity, symmetry).

  • Deep Morphology prior.
True Galaxy
Deep Morphology Prior Solution
Monotonicity + Symmetry Solution

Uncertainty quantification in Magnetic Resonance Imaging (MRI)

Ramzi, Remy, Lanusse et al. 2020


$$\boxed{y = \mathbf{M} \mathbf{F} x + n}$$



$\Longrightarrow$ We can see which parts of the image are well constrained by data, and which regions are uncertain.
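A minimal sketch of the boxed forward model $y = \mathbf{M} \mathbf{F} x + n$, with $\mathbf{F}$ a 2D Fourier transform and $\mathbf{M}$ a binary sampling mask; the mask, noise level, and shapes are illustrative:

    # Undersampled MRI forward model: mask the Fourier transform of the image and add noise.
    import numpy as np

    def mri_forward(x, mask, sigma_n=0.01, rng=np.random.default_rng(0)):
        kspace = np.fft.fft2(x, norm='ortho')                    # F x
        noise = sigma_n * (rng.standard_normal(x.shape)
                           + 1j * rng.standard_normal(x.shape))  # complex Gaussian noise
        return mask * kspace + noise                             # y = M F x + n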

takeaways



Benefits of Bayesian forward modeling for inverse problems
Using a Bayesian approach has great advantages: model-based physical interpretation & uncertainty quantification.

  • Explicit likelihood, making use of all of our physical knowledge.
    $\Longrightarrow$ The method can be applied for varying noise, observing conditions, or different instruments

  • Deep generative models can be used to provide simulation or data driven priors.
    $\Longrightarrow$ Embed prior only accessible from samples (e.g. numerical simulations or data).



Simulation-Based Inference

Leveraging Physical Simulators for Inference with Generative Modeling

limits of traditional cosmological inference

HSC cosmic shear power spectrum
HSC Y1 constraints on $(S_8, \Omega_m)$
(Hikage et al. 2018)
  • Measure the ellipticity $\epsilon = \epsilon_i + \gamma$ of all galaxies
    $\Longrightarrow$ Noisy tracer of the weak lensing shear $\gamma$

  • Compute summary statistics based on 2pt functions,
    e.g. the power spectrum

  • Run an MCMC to recover a posterior on model parameters, using an analytic likelihood $$ p(\theta | x ) \propto \underbrace{p(x | \theta)}_{\mathrm{likelihood}} \ \underbrace{p(\theta)}_{\mathrm{prior}}$$
Main limitation: the need for an explicit likelihood
We can only compute the likelihood for simple summary statistics and on large scales

$\Longrightarrow$ We are dismissing a large amount of information!

A different road: forward modeling

  • Instead of trying to analytically evaluate the likelihood, let us build a forward model of the observables.




$\Longrightarrow$ Learn an implicit likelihood $p(x|\theta)$ given by the simulator as our physical model




End-to-end framework for likelihood-free parameter inference with DES SV

Jeffrey, Alsing, Lanusse (2021)

Suite of N-body + raytracing simulations: $\mathcal{D}$

Our method

A two-step approach to inference
  • Automatically learn an optimal low-dimensional summary statistic $$y = f_\varphi(\kappa_{KS}) $$

  • Use Neural Density Estimation to build an estimate $p_\phi$ of the likelihood function $p(y \ | \ \theta)$ (Neural Likelihood Estimation)

  • Run a conventional Markov Chain Monte Carlo sampler

Learning summary statistics by Variational Mutual Information Maximization



  • Mutual information between $X$ and $Y$:
    “"amount of information" obtained about one random variable through observing the other random variable”

  • Given a parametric summarizing function $y = f_\phi(\kappa(\theta))$, optimizing $f_\phi$ can be done by maximizing $I(y; \theta)$.

  • In practice, $f_\phi$ is a CNN, trained to maximize a variational lower bound on the mutual information: $$ I(y ; \theta) \ \ge \ \mathbb{E}_{y, \theta} [ \log q_\phi(\theta | y) ] + H(\Theta) $$
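A minimal sketch of this training objective: maximize $\mathbb{E}[ \log q_\phi(\theta | f_\phi(x)) ]$ over the simulation set. Here compressor plays the role of $f_\phi$ and log_q is any conditional density estimator (Gaussian, normalizing flow, ...); all names are placeholders:

    # Negative variational lower bound on the mutual information (up to the constant H(theta)).
    import jax.numpy as jnp

    def neg_mi_lower_bound(params, compressor, log_q, x_batch, theta_batch):
        y = compressor(params['f'], x_batch)           # learned summary y = f(x)
        logq = log_q(params['q'], theta_batch, y)      # log q(theta | y), one value per example
        return -jnp.mean(logq)                         # minimize this to maximize the bound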

deep residual networks for lensing maps compression


  • Deep Residual Network $y = f_\phi(x)$ followed by neural density estimator $q_\phi(\theta | y)$

  • Training on weak lensing maps simulated for different cosmologies




  • Training by Variational Mutual Information Maximization: $$\mathbb{E}_{(x, \theta) \sim \mathcal{D}} [ \log q_\phi(\theta | f_\phi(x) ) ]$$

Estimating the likelihood by Neural Density Estimation


$\Longrightarrow$ We cannot assume a Gaussian likelihood for the summary $y = f_\phi(\kappa)$ but we can learn $p(y | \theta)$: Neural Likelihood Estimation.


Dinh et al. 2016
Neural Likelihood Estimation by Normalizing Flow
  • We use a conditional Normalizing Flow to build an explicit model for the likelihood function $$ \log p_\varphi (y | \theta)$$

  • In practice we use the pyDELFI package and an ensemble of NDEs for robustness.

  • Once learned, we can use the likelihood as part of a conventional MCMC chain (toy sketch below).
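A toy sketch of plugging such a learned likelihood into a Metropolis-Hastings chain. Here log_likelihood(y_obs, theta) stands in for the trained conditional density estimator and log_prior for the cosmological prior; step size and chain length are illustrative:

    # Random-walk Metropolis-Hastings using a learned log-likelihood.
    import numpy as np

    def metropolis(log_likelihood, log_prior, y_obs, theta0, n_steps=10000,
                   step=0.05, rng=np.random.default_rng(0)):
        chain, theta = [], np.asarray(theta0, dtype=float)
        logp = log_likelihood(y_obs, theta) + log_prior(theta)
        for _ in range(n_steps):
            prop = theta + step * rng.standard_normal(theta.shape)   # symmetric proposal
            logp_prop = log_likelihood(y_obs, prop) + log_prior(prop)
            if np.log(rng.uniform()) < logp_prop - logp:              # accept/reject
                theta, logp = prop, logp_prop
            chain.append(theta.copy())
        return np.array(chain)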


Parameter constraints from DES SV data

Automatically Differentiable Physics

the hammer behind the Deep Learning revolution: Automatic Differentiation

  • Automatic differentiation allows you to compute analytic derivatives of arbitrary expressions:
    If I form the expression $y = a * x + b$, it is separated into fundamental ops: $$ y = u + b \qquad u = a * x $$ then gradients can be obtained by the chain rule: $$\frac{\partial y}{\partial x} = \frac{\partial y}{\partial u} \frac{ \partial u}{\partial x} = 1 \times a = a$$ (see the JAX example below)

  • This is a fundamental tool in Machine Learning, and autodiff frameworks include TensorFlow and PyTorch.


Enter JAX: NumPy + Autograd + GPU
  • JAX follows the NumPy API!
    
      import jax.numpy as np
  • Arbitrary order derivatives
  • Accelerated execution on GPU and TPU
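The chain-rule example from the previous slide, done automatically with JAX (values are arbitrary):

    import jax
    import jax.numpy as jnp

    def y(x, a, b):
        return a * x + b

    dy_dx = jax.grad(y, argnums=0)      # derivative with respect to x
    print(dy_dx(2.0, 3.0, 1.0))         # -> 3.0, i.e. dy/dx = a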

Back to forward modeling: the Hierarchical Bayesian Inference perspective

  • Another approach to using simulations is to consider them as large Hierarchical Bayesian Models.

  • Each component of the model is now tractable, but at the cost of a large number of latent variables.

$\Longrightarrow$ How to perform efficient inference in this large number of dimensions?


    A non-exhaustive list of methods:
  • Hamiltonian Monte-Carlo
  • Variational Inference
  • MAP+Laplace
  • Gold Mining
  • Dimensionality reduction by Fisher-Information Maximization


What do they all have in common?
$\Longrightarrow$ They require fast, accurate, differentiable forward simulations
(Schneider et al. 2015)

How do we simulate the Universe in a fast and differentiable way?

Forward Models in Cosmology

Linear Field $\xrightarrow{\text{N-body simulations}}$ Final Dark Matter $\xrightarrow{\text{Group Finding algorithms}}$ Dark Matter Halos $\xrightarrow{\text{Semi-analytic \& distribution models}}$ Galaxies

You can try to learn the simulation...

Learning particle displacement with a UNet. S. He, et al. (2019)


The issue with using deep learning as a black-box
  • No guarantees to work outside of training regime.
  • No guarantees to capture dependence on cosmology accurately.

the Fast Particle-Mesh scheme for N-body simulations

The idea: approximate gravitational forces by estimating densities on a grid.
  • The numerical scheme:

    • Estimate the density of particles on a mesh
      => compute gravitational forces by FFT

    • Interpolate forces at particle positions

    • Update particle velocity and positions, and iterate

  • Fast and simple, at the cost of approximating short range interactions.
$\Longrightarrow$ Only a series of FFTs and interpolations.
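For concreteness, a minimal NumPy sketch of the Fourier-space Poisson solve at the heart of one PM step (physical constants and the cloud-in-cell painting/readout are omitted; this is not the FlowPM implementation):

    # Compute gravitational forces on the mesh from a density field, via FFT.
    import numpy as np

    def gravitational_forces(density, boxsize=100.0):
        n = density.shape[0]                                        # (n, n, n) mesh assumed
        delta_k = np.fft.rfftn(density / density.mean() - 1.0)      # overdensity in Fourier space
        kx = 2 * np.pi * np.fft.fftfreq(n, d=boxsize / n)
        kz = 2 * np.pi * np.fft.rfftfreq(n, d=boxsize / n)
        k = np.stack(np.meshgrid(kx, kx, kz, indexing='ij'))        # wave vectors (3, n, n, n//2+1)
        k2 = np.sum(k**2, axis=0)
        k2[0, 0, 0] = 1.0                                           # avoid division by zero
        phi_k = -delta_k / k2                                       # Poisson equation (up to 4 pi G rho_bar)
        # Force = -grad(phi): multiply by -i k and transform back to real space
        return np.stack([np.fft.irfftn(-1j * ki * phi_k, s=density.shape) for ki in k])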

introducing FlowPM: Particle-Mesh Simulations in TensorFlow

Modi, Lanusse, Seljak (2020)

      import numpy as np
      import tensorflow as tf
      import flowpm

      # Defines integration steps
      stages = np.linspace(0.1, 1.0, 10, endpoint=True)

      initial_conditions = flowpm.linear_field(32,       # size of the cube
                                               100,      # physical size
                                               ipklin,   # initial power spectrum
                                               batch_size=16)

      # Sample particles and displace them by LPT
      state = flowpm.lpt_init(initial_conditions, a0=0.1)

      # Evolve particles down to z=0
      final_state = flowpm.nbody(state, stages, 32)

      # Retrieve final density field
      final_field = flowpm.cic_paint(tf.zeros_like(initial_conditions),
                                     final_state[0])
      												
  • Seamless interfacing with deep learning components
  • Mesh TensorFlow implementation for distribution on supercomputers









Mesh FlowPM: distributed, GPU-accelerated, and automatically differentiable simulations

  • We developed a Mesh TensorFlow implementation that can scale on GPU clusters (horovod+NCCL).


  • For a $2048^3$ simulation:
    • Distributed on 256 NVIDIA V100 GPUs
    • Runtime: 3 mins


  • Don't hesitate to reach out if you have a use case for model parallelism!

Example use-case: reconstructing initial conditions by MAP optimization


Going back to simpler times...
$$\arg\max_z \ \log p(x_{dm} \,|\, f(z)) \ + \ \log p(z) $$ where:
  • $f$ is FlowPM
  • $z$ are the initial conditions (early universe)
  • $x_{dm}$ is the present day dark matter distribution
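A sketch of this reconstruction with a generic differentiable simulator simulate(z) (FlowPM plays this role here); the Gaussian likelihood, the whitened prior on $z$, and the optimizer settings are illustrative assumptions:

    # MAP reconstruction of the initial conditions z by gradient descent through the simulator.
    import jax
    import jax.numpy as jnp

    def neg_log_posterior(z, x_dm, simulate, sigma_n=0.1):
        x_model = simulate(z)                                    # forward model f(z)
        log_like = -0.5 * jnp.sum((x_dm - x_model)**2) / sigma_n**2
        log_prior = -0.5 * jnp.sum(z**2)                         # whitened initial conditions
        return -(log_like + log_prior)

    def reconstruct(x_dm, simulate, z0, lr=0.1, n_steps=200):
        grad_fn = jax.grad(lambda z: neg_log_posterior(z, x_dm, simulate))
        z = z0
        for _ in range(n_steps):
            z = z - lr * grad_fn(z)                              # plain gradient descent
        return z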

MAP optimization in action

$$\arg\max_z \ \log p(x_{dm} \,|\, f(z)) \ + \ \log p(z) $$
credit: C. Modi


True initial conditions
$z_0$

Reconstructed initial conditions $z$

Reconstructed dark matter distribution $x = f(z)$

Data
$x_{DM} = f(z_0)$


CosmicRIM: Recurrent Inference Machines for Initial Condition Reconstruction

Modi, Lanusse, Seljak, Spergel, Perreault-Levasseur (2021)
Recurrent Neural Network Architecture
Initial conditions cross-correlation
  • CosmicRIM: Learn to optimize by embedding a Neural Network in the optimization algorithm.
    $\Longrightarrow$ converges 40x faster than LBFGS.

Conclusion

Conclusion

Merging Deep Learning with Physical Models for Bayesian Inference
$\Longrightarrow$ Makes Bayesian inference possible at scale and with non-trivial models!

  • Complement known physical models with data-driven components
    • Use data-driven generative model as prior for solving inverse problems.

  • Enables inference in high dimension from numerical simulators.
    • Automagically construct summary statistics.
    • Provides the density estimation tools needed.

  • Differentiable physical models for fast inference
    • Differentiability enables Bayesian inference over large scale simulations.
    • Models can directly be embedded alongside deep learning components.



Thank you !