# A view of the potential of Deep Learning for Cosmology

## François Lanusse

slides at eiffl.github.io/talks/Gif2021

### the Rubin Observatory Legacy Survey of Space and Time

• 1000 images each night, 15 TB/night for 10 years

• 18,000 square degrees, observed once every few days

• Tens of billions of objects, each one observed $\sim1000$ times

Image comparison across survey generations — previous generation: SDSS; current generation: DES; LSST precursor: HSC. Image credit: Peter Melchior

### The challenges of modern surveys

$\Longrightarrow$ Modern surveys will provide large volumes of high quality data

A Blessing
• Unprecedented statistical power

• Great potential for new discoveries

A Curse
• Existing methods are reaching their limits at every step of the science analysis

• Control of systematics is paramount

Figure: LSST forecast on dark energy parameters.

$\Longrightarrow$ Dire need for novel analysis techniques to fully realize the potential of modern surveys.

## Can AI solve all of our problems?

### WWAD: What Would an Astrophysicist Do?

Figure: RingFinder lens-finding pipeline — gri composite, $g - \alpha i$ residual, detected areas, HST images (Gavazzi et al. 2014).

Gavazzi et al. (2014), Collett (2015)

$\Longrightarrow$ Plainly intractable at the scale of LSST

### The Deep Learning approach: Convolutional Neural Networks

Figure: CMUDeepLens architecture (Lanusse et al. 2017).
The Deep Learning approach in three steps:

• Step I: Introduce a parametric function $f_\theta$; this is your neural network. For images, choose a CNN.

• Step II: Create a dataset $\mathcal{D}$ of examples $x_i$ and associated labels $y_i$: $$\mathcal{D} = \{ (x_i, y_i) \}_{i \in [0, N]}$$

• Step III: Optimize the parameters $\theta$ so as to minimize an appropriate loss function.
For classification, the binary cross-entropy: $$\arg\min_\theta - \sum_{i=1}^N y_i \log f_\theta(x_i) + (1 - y_i) \log (1 - f_\theta(x_i))$$
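
As a concrete illustration, here is a minimal sketch of these three steps in Keras; the layer sizes and input shape are illustrative placeholders, not the CMUDeepLens architecture.

```python
import tensorflow as tf

# Step I: a small CNN f_theta (illustrative sizes, not CMUDeepLens).
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(45, 45, 1)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation='relu'),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation='sigmoid'),  # outputs P(lens | image)
])

# Step III: minimize the binary cross-entropy over the labelled dataset D.
model.compile(optimizer='adam', loss='binary_crossentropy')
# model.fit(x_train, y_train, epochs=10)  # Step II: the (x_i, y_i) pairs
```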

### The Euclid strong-lens finding challenge

Metcalf, ..., Lanusse, et al. (2018)

Better accuracy than human visual inspection!
• Before: Experts would carefully craft image statistics they would expect a priori to be useful for their problem.

• After: Reframe the task as an optimization problem, and let the optimization figure out how to solve it.

### The AI bubble

Figure: number of astro-ph abstracts mentioning Deep Learning, CNNs, or Neural Networks.

### Be careful to ask the right questions!

The most common pitfall: Covariate Shift

$\Longrightarrow$ The training data plays an integral part in the answer

### An example of a bad idea: Naive Image Deconvolution with a CNN

$y = P \ast x + n$

Diagram: an observed ground-based telescope image $y$ is mapped by a deep Convolutional Neural Network $f_\theta$ to the unknown Hubble Space Telescope-quality image $x$.

• A standard approach would be to train a neural network $f_\theta$ to estimate $x$ given $y$ on a dataset $\mathcal{D}=\{(x_i, y_i)\}$.
• Is this a good idea?

### Let's be critical for a minute

Is Strong Lens Finding a good application of Deep Learning?

Figure: simulated training examples alongside an example of real data.

### What can Deep Learning do for Cosmology?

What are the right questions for Cosmology?
• As physicists, we want to remain model-based.
We want to compare an interpretable/physical model to data.

• We already have the foundational statistical tools for comparing models and data: Bayesian inference

• $\Longrightarrow$ Where Deep Learning can be useful is in places where the Bayesian inference approach becomes classically intractable.
What we will talk about today:

• Likelihood-Free Inference

• Data-driven priors for Inverse Problems

• Automatically Differentiable Physics

## Likelihood-Free Inference

### limits of traditional cosmological inference

HSC cosmic shear power spectrum
HSC Y1 constraints on $(S_8, \Omega_m)$
(Hikage et al. 2018)
• Measure the ellipticity $\epsilon = \epsilon_i + \gamma$ of all galaxies
$\Longrightarrow$ Noisy tracer of the weak lensing shear $\gamma$

• Compute summary statistics based on 2pt functions,
e.g. the power spectrum

• Run an MCMC to recover a posterior on model parameters, using an analytic likelihood $$p(\theta | x ) \propto \underbrace{p(x | \theta)}_{\mathrm{likelihood}} \ \underbrace{p(\theta)}_{\mathrm{prior}}$$
Main limitation: the need for an explicit likelihood
We can only compute the likelihood for simple summary statistics and on large scales

$\Longrightarrow$ We are dismissing a large amount of information!

## Can I use Deep Learning to perform proper Bayesian inference without analytic likelihoods?

### let us rephrase the question

• I assume a forward model of the observations: $$p(x, \theta) = p(x | \theta) \ p(\theta)$$ All I ask is the ability to sample from the model, to obtain $\mathcal{D} = \{(x_i, \theta_i)\}_{i\in \mathbb{N}}$

• I am going to assume $q_\phi(\theta | x)$, a parametric conditional density

• Optimize the parameters $\phi$ of $q_{\phi}$ according to $$\min\limits_{\phi} \sum\limits_{i} - \log q_{\phi}(\theta_i | x_i)$$ In the limit of a large number of samples and sufficient flexibility: $$\boxed{q_{\phi^\ast}(\theta | x) \approx p(\theta | x)}$$
$\Longrightarrow$ In statistical terms: one can asymptotically recover the posterior by optimizing a parametric estimator over the Bayesian joint distribution.
$\Longrightarrow$ In Deep Learning terms: one can asymptotically recover the posterior by optimizing a Deep Neural Network over a simulated training set.
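
A minimal sketch of this recipe, assuming hypothetical `sample_prior` and `simulator` functions and a conditional density model `q_phi` (e.g. the Mixture Density Network sketched later) whose `log_prob(theta, x)` returns $\log q_\phi(\theta | x)$:

```python
import tensorflow as tf

# Each batch is drawn from the Bayesian joint p(x, theta) = p(x|theta) p(theta)
# by forward simulation; q_phi is fit by minimizing -log q_phi(theta | x).
optimizer = tf.keras.optimizers.Adam(1e-3)

for step in range(10_000):
    theta = sample_prior(batch_size=64)   # theta_i ~ p(theta)
    x = simulator(theta)                  # x_i ~ p(x | theta_i)
    with tf.GradientTape() as tape:
        loss = -tf.reduce_mean(q_phi.log_prob(theta, x))
    grads = tape.gradient(loss, q_phi.trainable_variables)
    optimizer.apply_gradients(zip(grads, q_phi.trainable_variables))
```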

### The Likelihood-Free Inference approach

• Instead of trying to analytically evaluate the likelihood, let us build a forward model of the observables.

$\Longrightarrow$ The simulator becomes the physical model

A two-step approach to inference
• Automatically learn an optimal low-dimensional summary statistic: $$y = f_\varphi(\kappa_{KS})$$
• Use Neural Density Estimation to either:
• build an estimate $p_\phi$ of the likelihood function $p(y \ | \ \theta)$ (Neural Likelihood Estimation)

• build an estimate $p_\phi$ of the posterior distribution $p(\theta \ | \ y)$ (Neural Posterior Estimation)

### Neural Density Estimation

Bishop (1994)
• Mixture Density Networks (MDN): $$p(\theta | x) = \sum_i \pi_i(x) \ \mathcal{N}\left(\theta; \ \mu_i(x), \ \sigma_i(x) \right)$$

• Flourishing Machine Learning literature on density estimators, e.g. GLOW (Kingma & Dhariwal, 2018)
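
A minimal MDN sketch with TensorFlow Probability, assuming a scalar parameter $\theta$ and illustrative layer sizes:

```python
import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions

n_components = 8  # illustrative choice

# The network outputs the mixture parameters (pi_i, mu_i, sigma_i)
# as a function of the conditioning variable x.
net = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(3 * n_components),
])

def q_phi(x):
    logits, mu, log_sigma = tf.split(net(x), 3, axis=-1)
    return tfd.MixtureSameFamily(
        mixture_distribution=tfd.Categorical(logits=logits),
        components_distribution=tfd.Normal(loc=mu, scale=tf.exp(log_sigma)))

# Training objective: -log q_phi(theta | x), averaged over simulated pairs.
def loss_fn(x, theta):
    return -tf.reduce_mean(q_phi(x).log_prob(theta))
```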

### End-to-end framework for likelihood-free parameter inference with DES SV

Jeffrey, Alsing, Lanusse (2021)

Suite of N-body + raytracing simulations: $\mathcal{D}$

### Learning summary statistics by Variational Mutual Information Maximization

• Mutual information between $X$ and $Y$:
“"amount of information" obtained about one random variable through observing the other random variable”

• Given a parametric summarizing function $y = f_\varphi(\kappa(\theta))$, optimizing $f_\varphi$ can be done by maximizing $I(y; \theta)$.

• In practice, $f_\varphi$ is a CNN, trained to maximize a variational lower bound on the mutual information: $$I(y ; \theta) \ \ge \ \mathbb{E}_{y, \theta} [ \log q_\phi(\theta | y) ] + H(\theta)$$

### deep residual networks for lensing maps compression

• Deep Residual Network $y = f_\varphi(x)$ followed by a Mixture Density Network $q_\phi(\theta | y)$

• Training on weak lensing maps simulated for different cosmologies

• Optimization of the variational lower bound: $$\mathbb{E}_{(x, \theta) \in \mathcal{D}} [ \log q_\phi(\theta | f_\varphi(x) ) ]$$
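
A sketch of this objective, assuming a hypothetical `compressor` network for $f_\varphi$ and reusing the `q_phi` MDN factory sketched above:

```python
import tensorflow as tf

# Maximizing the variational lower bound on I(y; theta) amounts to
# minimizing this loss over simulated (x, theta) pairs.
def vmim_loss(x, theta):
    y = compressor(x)                    # summary statistic y = f_varphi(x)
    return -tf.reduce_mean(q_phi(y).log_prob(theta))
```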

### Estimating the likelihood by Neural Density Estimation

$\Longrightarrow$ We cannot assume a Gaussian likelihood for the summary $y = f_\varphi(\kappa)$, but we can learn $p(y | \theta)$: Neural Likelihood Estimation.

Dinh et al. 2016
Neural Likelihood Estimation by Normalizing Flow
• We use a conditional Normalizing Flow to build an explicit model for the likelihood function $$\log p_\varphi (y | \theta)$$

• In practice we use the pyDELFI package and an ensemble of NDEs for robustness.

• Once learned, we can use this likelihood as part of a conventional MCMC chain
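
For instance, a sketch with the emcee sampler, assuming hypothetical callables `learned_log_like(y, theta)` and `log_prior(theta)`, an observed summary `y_obs`, and a fiducial starting point `theta_fid`:

```python
import numpy as np
import emcee

# Plug the learned likelihood into a standard ensemble sampler.
def log_posterior(theta):
    lp = log_prior(theta)
    if not np.isfinite(lp):
        return -np.inf
    return lp + learned_log_like(y_obs, theta)

ndim, nwalkers = 2, 32                                   # e.g. (S_8, Omega_m)
p0 = theta_fid + 1e-3 * np.random.randn(nwalkers, ndim)  # start near fiducial
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_posterior)
sampler.run_mcmc(p0, 5000)
```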

### Let's be critical for a minute

Is this Likelihood-Free Inference approach a good application of Deep Learning?

Deep Learning For Cosmological Inference
• It's only a shift in how we model physics and observables: $$\mbox{analytic modeling} \rightarrow \mbox{forward simulation}$$ but we remain within the same physical Bayesian framework.

• This allows us to free ourselves from the restrictions of tractable analytic likelihoods (choice of summary statistics, handling of systematics, etc.)

## Deep Generative Models as Data-Driven Bayesian Priors

### Gravitational lensing

Galaxy shapes as estimators for gravitational shear
$$e = \gamma + e_i \qquad \mbox{ with } \qquad e_i \sim \mathcal{N}(0, I)$$
• We are trying to measure the ellipticity $e$ of galaxies as an estimator for the gravitational shear $\gamma$

### Weak Lensing Mass-Mapping as an Inverse Problem

Shear $\gamma$
Convergence $\kappa$
$$\gamma_1 = \frac{1}{2} (\partial_1^2 - \partial_2^2) \ \Psi \quad;\quad \gamma_2 = \partial_1 \partial_2 \ \Psi \quad;\quad \kappa = \frac{1}{2} (\partial_1^2 + \partial_2^2) \ \Psi$$
$$\boxed{\gamma = \mathbf{P} \kappa}$$

### Linear inverse problems

$\boxed{y = \mathbf{A}x + n}$

$\mathbf{A}$ is known and encodes our physical understanding of the problem.
$\Longrightarrow$ When non-invertible or ill-conditioned, the inverse problem is ill-posed with no unique solution $x$
Examples of ill-posed inverse problems: deconvolution, inpainting, denoising.

### What Would a Bayesian Do?

$\boxed{y = \mathbf{A}x + n}$

The Bayesian view of the problem:
$$p(x | y) \propto p(y | x) \ p(x)$$
• $p(y | x)$ is the data likelihood, which contains the physics

• $p(x)$ is the prior knowledge on the solution.

With these concepts in hand we can:
• Estimate, for instance, the Maximum A Posteriori solution:
$$\hat{x} = \arg\max\limits_x \ \log p(y \ | \ x) + \log p(x)$$
• Estimate the full posterior $p(x|y)$ with MCMC or Variational Inference methods.

### Classical examples of signal priors

Sparse
$$\log p(x) = - \parallel \mathbf{W} x \parallel_1$$
Gaussian $$\log p(x) = - \frac{1}{2} x^t \mathbf{\Sigma}^{-1} x$$
Total Variation $$\log p(x) = - \parallel \nabla x \parallel_1$$

$\Longrightarrow$ Prior in the form of numerical simulations.

## Can we use Deep Learning to embed simulation-driven priors within a physical Bayesian model?

### What is generative modeling?

• The goal of generative modeling is to learn the distribution $\mathbb{P}$ from which the training set $X = \{x_0, x_1, \ldots, x_n \}$ is drawn.

• Usually, this means building a parametric model $\mathbb{P}_\theta$ that tries to be close to $\mathbb{P}$.

Figure: the true distribution $\mathbb{P}$, samples $x_i \sim \mathbb{P}$, and the learned model $\mathbb{P}_\theta$.

• Once trained, you can typically sample from $\mathbb{P}_\theta$ and/or evaluate the likelihood $p_\theta(x)$.

### Why isn't it easy?

• The curse of dimensionality puts all points far apart in high dimension

Figure: distance between pairs of points drawn from a Gaussian distribution.

• Classical methods for estimating probability densities, e.g. Kernel Density Estimation (KDE), start to fail in high dimension because of all the gaps

### The evolution of generative models

• Deep Belief Network
(Hinton et al. 2006)

• Variational AutoEncoder
(Kingma & Welling 2014)

• Generative Adversarial Network
(Goodfellow et al. 2014)

• Wasserstein GAN
(Arjovsky et al. 2017)

### A visual Turing test

Fake PixelCNN samples

Real galaxies from SDSS

### Getting started with Deep Priors: deep denoising example

$$\boxed{{\color{Orchid} y} = {\color{SkyBlue} x} + n}$$
• Let us assume we have access to examples of ${\color{SkyBlue} x}$ without noise.

• We learn the distribution of noiseless data $\log p_\theta(x)$ from samples using a deep generative model.

• The solution should lie on the realistic data manifold, symbolized by the two-moons distribution.

We want to solve for the Maximum A Posteriori solution:

$$\arg \max_{\color{SkyBlue} x} - \frac{1}{2} \parallel {\color{Orchid} y} - {\color{SkyBlue} x} \parallel_2^2 + \log p_\theta({\color{SkyBlue} x})$$ This can be done by gradient descent as long as one has access to the score function $\nabla_x \log p(x)$.
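
A minimal sketch of this gradient ascent, assuming a hypothetical `score_fn(x)` returning $\nabla_x \log p_\theta(x)$ from a trained generative model:

```python
import jax.numpy as jnp

# MAP estimation by gradient ascent on log p(x | y), for y = x + n with
# unit-variance Gaussian noise.
def map_step(x, y, score_fn, step=1e-2):
    grad_log_like = y - x           # gradient of -0.5 * ||y - x||^2 w.r.t. x
    grad_log_prior = score_fn(x)    # d log p_theta(x) / dx
    return x + step * (grad_log_like + grad_log_prior)

# x = y                             # initialize at the observation
# for _ in range(1000):
#     x = map_step(x, y, score_fn)
```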

### Neural Score Estimation by Denoising Score Matching

• Denoising Score Matching: An optimal Gaussian denoiser learns the score of a given distribution.
• If $x \sim \mathbb{P}$ is corrupted by additive Gaussian noise $u \sim \mathcal{N}(0, \sigma^2 I)$ to yield $$x^\prime = x + u$$
• Let's consider a denoiser $r_\theta$ trained under an $\ell_2$ loss: $$\mathcal{L}=\parallel x - r_\theta(x^\prime, \sigma) \parallel_2^2$$
• The optimal denoiser $r_{\theta^\star}$ verifies: $$\boxed{\boldsymbol{r}_{\theta^\star}(\boldsymbol{x}', \sigma) = \boldsymbol{x}' + \sigma^2 \nabla_{\boldsymbol{x}'} \log p_{\sigma^2}(\boldsymbol{x}')}$$
Figure: noisy input $\boldsymbol{x}'$, clean signal $\boldsymbol{x}$, optimal denoiser output $\boldsymbol{r}^\star(\boldsymbol{x}', \sigma)$, and residual $\boldsymbol{x}' - \boldsymbol{r}^\star(\boldsymbol{x}', \sigma)$.
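
A sketch of the corresponding training loss and of the score it implies, assuming a hypothetical denoiser network `denoiser(params, x, sigma)`:

```python
import jax
import jax.numpy as jnp

# Denoising score matching: train the denoiser under an l2 loss on samples
# corrupted with Gaussian noise of scale sigma.
def dsm_loss(params, denoiser, x, sigma, key):
    x_noisy = x + sigma * jax.random.normal(key, x.shape)
    return jnp.mean((x - denoiser(params, x_noisy, sigma)) ** 2)

# The optimal denoiser encodes the score of the smoothed distribution,
# r(x', sigma) = x' + sigma^2 * grad log p_sigma(x'), hence:
def score(params, denoiser, x, sigma):
    return (denoiser(params, x, sigma) - x) / sigma ** 2
```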

### Back to the Mass-Mapping Problem

$$\log p( \kappa | e) = \underbrace{\log p(e | \kappa)}_{\simeq -\frac{1}{2} \parallel e - \mathbf{P} \kappa \parallel_\Sigma^2} + \log p(\kappa) + \mathrm{cst}$$
• The likelihood term is known analytically.
• There is no closed-form expression for the full non-Gaussian prior of the convergence.
However:
• We do have access to samples of the full prior through simulations: $X = \{x_0, x_1, \ldots, x_n \}$ with $x_i \sim \mathbb{P}$
$\Longrightarrow$ Our strategy: learn the prior score $\nabla \log p(\kappa)$ with Denoising Score Matching, then sample the full posterior with annealed HMC.

### Probabilistic Mass-Mapping of the HST COSMOS field

• COSMOS shear data from Schrabback et al. 2010

• Prior learned from MassiveNuS simulations at fiducial cosmology ($320\times320$ maps at 0.4 arcmin resolution).

• Known massive X-ray clusters are indicated with crosses, along with their redshifts; the right panel shows cutouts of the central cluster from multiple posterior samples.

### Uncertainty quantification in Magnetic Resonance Imaging (MRI)

Ramzi, Remy, Lanusse et al. 2020

$$\boxed{y = \mathbf{M} \mathbf{F} x + n}$$
where $\mathbf{F}$ is the Fourier transform and $\mathbf{M}$ a $k$-space sampling mask.

$\Longrightarrow$ We can see which parts of the image are well constrained by data, and which regions are uncertain.

### Example of application to deblending

• Hybrid Physical-Deep Learning Model for Astronomical Inverse Problems
F. Lanusse, P. Melchior, F. Moolekamp

$\mathcal{L} = \frac{1}{2} \parallel \mathbf{\Sigma}^{-1/2} ( Y - P \ast A S ) \parallel_2^2 - \sum_{i=1}^K \log p_{\theta}(S_i)$

isolated galaxy $\log p_\theta(x) = 3293.7$
artificial blend $\log p_\theta(x) = 3100.5$

### Let's be critical for a minute

Is this Deep Priors approach a good application of Deep Learning?

Deep Data-Driven Priors for Inverse Problems
• Deep Generative Models are only a way to perform Bayesian inference in a classically intractable regime: where we only have access to samples from the prior $p(x)$, with a high-dimensional $x$.

• We retain an explicit physical likelihood (we can directly account for changes in masks, noise, etc. without retraining)

## Automatically Differentiable Physics

### the hammer behind the Deep Learning revolution

• Automatic differentiation allows you to compute analytic derivatives of arbitrary expressions:
If I form the expression $y = a * x + b$, it is decomposed into fundamental ops: $$y = u + b \qquad u = a * x$$ then gradients can be obtained by the chain rule: $$\frac{\partial y}{\partial x} = \frac{\partial y}{\partial u} \frac{ \partial u}{\partial x} = 1 \times a = a$$

• This is a fundamental tool in Machine Learning, and autodiff frameworks include TensorFlow, JAX or PyTorch.
$\Longrightarrow$ In addition to autodiff, they all provide GPU/TPU acceleration.
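
As a trivial sketch, the chain-rule example above is one line of JAX:

```python
import jax

# y = a * x + b is decomposed into u = a * x and y = u + b; jax.grad
# applies the chain rule through these fundamental ops.
def f(x, a=2.0, b=1.0):
    return a * x + b

print(jax.grad(f)(3.0))  # dy/dx = a = 2.0
```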

### jax-cosmo: Finally a differentiable cosmology library, and it's in JAX!

• Easy to use, runs on GPU, and validated against the DESC Core Cosmology Library.

```python
import jax
import jax.numpy as np
import jax_cosmo as jc

# Defining a Cosmology
cosmo = jc.Planck15()

# Define a redshift distribution with smail_nz(a, b, z0)
nz = jc.redshift.smail_nz(1., 2., 1.)

# Build a lensing tracer with a single redshift bin
probe = jc.probes.WeakLensing([nz])

# Compute angular Cls for some ell
ell = np.logspace(0.1, 3)
cls = jc.angular_cl(cosmo, ell, [probe])
```

• Fisher matrices become trivial and exact!

```python
# Fisher matrix in just one line:
F = - jax.hessian(gaussian_likelihood)(theta)
```

No derivatives were harmed by finite differences in the computation of this Fisher!

### LSST DESC 3x2pt Tomography Challenge

Zuntz, Lanusse, et al. (2021)
Description of the challenge
“Given (g)riz photometry, find a tomographic bin assignment method that optimizes a 3x2pt analysis.”
• Metrics: Total Signal-to-Noise: $m_{SNR} = \sqrt{\mu^t \mathbf{C}^{-1} \mu}$ ; DETF Figure of Merit: $m_{FOM} = \frac{1}{\sqrt{ \det(\mathbf{F}^{-1})}}$
• Conventional strategy: use a photo-z code to estimate redshifts, then bin galaxies based on their photo-z.

• Strategy with Differentiable Physics:
• Introduce a parametric bin assignment function $f_\theta(x_{phot})$
• Optimize $\theta$ by back-propagating through the challenge metrics.
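
A sketch of that strategy, with hypothetical `mlp`, `params`, `x_phot`, and `snr_metric` standing in for the assignment network, its weights, the photometry, and a differentiable challenge metric:

```python
import jax
import jax.numpy as jnp

# Soft (differentiable) tomographic bin assignment per galaxy.
def soft_assignment(params, x_phot):
    logits = mlp(params, x_phot)             # assumed parametric network
    return jax.nn.softmax(logits, axis=-1)   # probabilities over bins

# Objective: a differentiable metric of the resulting 3x2pt analysis.
def objective(params, x_phot):
    return snr_metric(soft_assignment(params, x_phot))

grads = jax.grad(objective)(params, x_phot)  # backprop through the metric
```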

### Another approach to cosmological inference: Hierarchical Bayesian Modeling

• Instead of trying to analytically evaluate the likelihood, let us build a forward model of the observables.

• Each component of the model is now tractable, but at the cost of a large number of latent variables.

$\Longrightarrow$ How to perform efficient inference in this large number of dimensions?

A non-exhaustive list of methods:
• Hamiltonian Monte-Carlo
• Variational Inference
• MAP+Laplace
• Gold Mining
• Dimensionality reduction by Fisher-Information Maximization

What do they all have in common?
$\Longrightarrow$ They require fast, accurate, differentiable forward simulations
(Schneider et al. 2015)

### Forward Models in Cosmology

Linear Field $\longrightarrow$ (N-body simulations) $\longrightarrow$ Final Dark Matter $\longrightarrow$ (Group Finding algorithms) $\longrightarrow$ Dark Matter Halos $\longrightarrow$ (Semi-analytic & distribution models) $\longrightarrow$ Galaxies

## How do we make cosmological simulations fast and differentiable?

### You can try to learn the simulation...

Learning particle displacement with a UNet. S. He, et al. (2019)

The issue with using deep learning as a black box
• No guarantees to work outside of training regime.
• No guarantees to capture dependence on cosmology accurately.

### the Fast Particle-Mesh scheme

The idea: approximate gravitational forces by estimating densities on a grid.
• The numerical scheme:

• Estimate the density of particles on a mesh
$\Longrightarrow$ compute gravitational forces by FFT

• Interpolate forces at particle positions

• Update particle velocity and positions, and iterate

• Fast and simple, at the cost of approximating short range interactions.
$\Longrightarrow$ Only a series of FFTs and interpolations.
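
A sketch of the core FFT force step, assuming a periodic unit box and a density contrast `delta` already painted on the mesh:

```python
import numpy as np

# Solve the Poisson equation in Fourier space and differentiate to get
# the force field on the mesh (illustrative units and conventions).
def pm_forces(delta):
    n = delta.shape[0]
    k = 2 * np.pi * np.fft.fftfreq(n)
    kx, ky, kz = np.meshgrid(k, k, k, indexing='ij')
    k2 = kx**2 + ky**2 + kz**2
    k2[0, 0, 0] = 1.0                    # avoid dividing by zero
    phi_k = -np.fft.fftn(delta) / k2     # Poisson: -k^2 phi_k = delta_k
    phi_k[0, 0, 0] = 0.0                 # remove the mean mode
    # Force = -grad(phi), i.e. -i k phi_k in Fourier space.
    return [np.real(np.fft.ifftn(-1j * kk * phi_k)) for kk in (kx, ky, kz)]
```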

### Introducing FlowPM: Particle-Mesh Simulations in TensorFlow

Modi, Lanusse, Seljak (2020)

```python
import numpy as np
import tensorflow as tf
import flowpm

# Defines integration steps
stages = np.linspace(0.1, 1.0, 10, endpoint=True)

initial_conditions = flowpm.linear_field(32,        # size of the cube
                                         100,       # physical size
                                         ipklin,    # initial power spectrum
                                         batch_size=16)

# Sample particles and displace them by LPT
state = flowpm.lpt_init(initial_conditions, a0=0.1)

# Evolve particles down to z=0
final_state = flowpm.nbody(state, stages, 32)

# Retrieve final density field
final_field = flowpm.cic_paint(tf.zeros_like(initial_conditions),
                               final_state[0])
```

• Seamless interfacing with deep learning components

### Bayesian Reconstruction of Cosmological Fields

Going back to simpler times...
$$\arg\max_s \ \log p(x | f(s)) \ + \ \log p(s)$$ where:
• $f$ is the forward model
• $p(x | f(s))$ is a tractable data likelihood
• $s$ are the initial conditions (early universe)
• $x$ is the data (e.g. galaxies, lensing, etc.)
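
With a differentiable simulator such as FlowPM, this objective can be optimized by plain gradient descent; a minimal sketch, assuming a forward model `f`, data `x`, noise level `sigma`, and initial-condition shape `shape` (all hypothetical names here):

```python
import tensorflow as tf

# Negative log posterior over initial conditions s, with a Gaussian data
# likelihood and a unit Gaussian prior on the (whitened) initial modes.
def neg_log_posterior(s):
    log_like = -0.5 * tf.reduce_sum((x - f(s))**2) / sigma**2
    log_prior = -0.5 * tf.reduce_sum(s**2)
    return -(log_like + log_prior)

s = tf.Variable(tf.random.normal(shape))   # initial guess
opt = tf.keras.optimizers.Adam(0.1)
for _ in range(500):
    with tf.GradientTape() as tape:
        loss = neg_log_posterior(s)
    opt.apply_gradients([(tape.gradient(loss, s), s)])  # autodiff through f
```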

### MAP optimization in action

$$\arg\max_s \ \log p(x_{dm} | f(s)) \ + \ \log p(s)$$
credit: C. Modi

Panels: true initial conditions $s_0$; reconstructed initial conditions $s$; reconstructed dark matter distribution $x_{dm} = f(s)$; data $x_{dm} = f(s_0)$.

### If only MAP optimization was easy...

$$\arg\max_s \ \log p(x_{dm} | f(s)) \ + \ \log p(s)$$
• Direct optimization of the MAP leads to poor solutions on large scales.
• Annealing recovers unbiased large scales, but at the cost of an ad-hoc tempering procedure.

## Instead of guessing an optimization scheme, could we learn to optimize?

### A closer look at the optimization algorithm

$$\arg\max_x \ \log p(y | f(x)) \ + \ \log p(x)$$
• Standard Gradient Descent Algorithm: $$x_{i+1} = x_i - \epsilon \left[ \nabla_x{\log p(y | f(x_i))} + \nabla_x \log p(x_i) \right]$$
which can be rewritten as $$x_{i+1} = x_i - \Gamma \left(\nabla_x{\log p(y | f(x_i))}, \ \nabla_x \log p(x_i) \right)$$ with update function $\Gamma: (u,v) \mapsto \epsilon(u + v)$

• Many algorithms (e.g. ADAM, LBFGS) can be expressed in this form with a different choice of $\Gamma$.

$\Longrightarrow$ What if we could learn this update function?

### Recurrent Inference Machines for Solving Inverse Problems (Putzky & Welling, 2017)

• Introduce a Recurrent Neural Network (RNN) $h_\phi$, and state variable $s$, so that: $$s_{i+1} = h^*_\phi( \nabla \log p(y|x_{i}), x_i, s_{i})$$ $$x_{i+1} = x_i + h_\phi(\nabla \log p(y|x_{i}), x_i, s_{i+1})$$
• Train according to: $$\mathcal{L} = \sum_{i=1}^T w_i \ \mathcal{L}(x_i, x)$$

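A sketch of one RIM update with a hypothetical recurrent cell, for flattened signals:

```python
import tensorflow as tf

# One Recurrent Inference Machine step (illustrative sizes): the cell sees
# the current estimate x_i and the likelihood gradient, and proposes an
# additive update, mirroring the equations above.
class RIMStep(tf.keras.layers.Layer):
    def __init__(self, x_dim, units=64):
        super().__init__()
        self.cell = tf.keras.layers.GRUCell(units)
        self.to_update = tf.keras.layers.Dense(x_dim)  # hidden state -> dx

    def call(self, x, grad_log_like, states):
        inputs = tf.concat([x, grad_log_like], axis=-1)
        h, states = self.cell(inputs, states)
        return x + self.to_update(h), states

# States start as [tf.zeros([batch, units])]; the step is applied T times,
# re-evaluating the physical likelihood gradient at each iterate.
```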

### Illustration on inverse problems

From left to right: input masked image, increasing number of steps of solutions, ground truth.

### CosmicRIM: Recurrent Inference Machines for Initial Condition Reconstruction

Modi, Lanusse, Seljak, Spergel, Perreault-Levasseur (2021)

Recurrent Neural Network Architecture
• A few notable differences from a vanilla RIM:
• We provide gradients of both prior and likelihood to the model.

• Because our forward model couples scales, we use a multiscale U-Net architecture.

### Experiments

Settings
• Forward model: $64^3$ particles, 400 Mpc/h box, 2LPT dynamics with 2nd order bias model
• RIM: 10 steps, trained under an $\ell_2$ loss

Figures: initial conditions cross-correlation; transfer function.
• CosmicRIM: Learn to optimize by embedding a Neural Network in the optimization algorithm.
$\Longrightarrow$ converges 40x faster than LBFGS.

## Conclusion

### Conclusion

Where do I think Deep Learning can be useful for Cosmology?
$\Longrightarrow$ Makes Bayesian inference possible at scale and with non-trivial models!

• Enables inference in high dimension from numerical simulators.
• Automagically construct summary statistics.
• Provides the density estimation tools needed.

• Complement known physical models with data-driven components
• Use data-driven generative model as prior for solving inverse problems.

• Differentiable physical models for fast inference

Thank you!