A view of the potential of Deep Learning for Cosmology

Ecole de Gif - Marseille, 2021

François Lanusse

slides at eiffl.github.io/talks/Gif2021

the $\Lambda$CDM view of the Universe

the Rubin Observatory Legacy Survey of Space and Time

  • 1000 images each night, 15 TB/night for 10 years

  • 18,000 square degrees, observed once every few days

  • Tens of billions of objects, each one observed $\sim1000$ times

Previous generation survey: SDSS

Image credit: Peter Melchior

Current generation survey: DES

Image credit: Peter Melchior

LSST precursor survey: HSC

Image credit: Peter Melchior

The challenges of modern surveys

$\Longrightarrow$ Modern surveys will provide large volumes of high quality data

A Blessing
  • Unprecedented statistical power

  • Great potential for new discoveries

A Curse
  • Existing methods are reaching their limits at every step of the science analysis

  • Control of systematics is paramount
LSST forecast on dark energy parameters
$\Longrightarrow$ Dire need for novel analysis techniques to fully realize the potential of modern surveys.

Can AI solve all of our problems?

example of application: gravitational time delays

WWAD: What Would an Astrophysicist Do ?

gri composite
g - $\alpha$ i
detected areas
HST images
RingFinder (Gavazzi et al. 2014)

Gavazzi et al. (2014), Collett (2015)

$\Longrightarrow$ Plainly intractable at the scale of LSST

The Deep Learning approach: Convolutional Neural Networks

CMUDeepLens Architecture
(Lanusse et al. 2017)
The Deep Learning approach in three steps:

  • Step I: Introduce a parameteric function $f_\theta$, this is your neural network. For images, choose a CNN.

  • Step II: Create a dataset $\mathcal{D}$ of examples $x_i$ and associated labels $y_i$: $$\mathcal{D} = \{ (x_i, y_i) \}_{i \in [0, N]}$$

  • Step III: Optimize the parameters $\theta$ as to minimize an appropriate loss function.
    For classification, the binary cross entropy: $$\arg\min_\theta \sum_{i=1}^N y_i \log f_\theta(x_i) + (1 - y_i) \log f_{\theta}(x_i)$$

The Euclid strong-lens finding challenge

Metcalf, . . ., Lanusse, et al. (2018)

Better accuracy than human visual inspection!
A true paradigm shift
  • Before: Experts would carefully craft image statistics they would expect a priori to be useful for their problem.

  • After: Reframe the task as an optimization problem, and let the optimization figure out how to solve the task

The AI bubble

astro-ph abstracts mentioning Deep Learning, CNN, or Neural Networks

Be careful to ask the right questions!

The most common pitfall: Covariate Shift

$\Longrightarrow$ The training data plays an integral part in the answer

An example of bad idea: Naive Image Deconvolution with a CNN

$ y = P \ast x + n $

Observed $y$

Ground-Based Telescope


some deep Convolutional Neural Network

Unknown $x$

Hubble Space Telescope

  • A standard approach would be to train a neural network $f_\theta$ to estimate $x$ given $y$ on a dataset $\mathcal{D}=\{(x_i, yi)\}$.
  • Is this a good idea?

Let's be critical for a minute

Is Strong Lens Finding a good application of Deep Learning?

Training examples

Example of real data

What can Deep Learning do for Cosmology?

What are the right questions for Cosmology?
  • As physicists, we want to remain model-based.
    We want to compare an interpretable/physical model to data.

  • We already have the foundational statistical tools for comparing models and data: Bayesian inference

  • $\Longrightarrow$ Wherever Deep Learning can be useful is in places where Bayesian inference approach becomes classically intractable.
What we will talk about today:

  • Likelihood-Free Inference

  • Data-driven priors for Inverse Problems

  • Automatically Differentiable Physics

Likelihood-Free Inference

limits of traditional cosmological inference

HSC cosmic shear power spectrum
HSC Y1 constraints on $(S_8, \Omega_m)$
(Hikage et al. 2018)
  • Measure the ellipticity $\epsilon = \epsilon_i + \gamma$ of all galaxies
    $\Longrightarrow$ Noisy tracer of the weak lensing shear $\gamma$

  • Compute summary statistics based on 2pt functions,
    e.g. the power spectrum

  • Run an MCMC to recover a posterior on model parameters, using an analytic likelihood $$ p(\theta | x ) \propto \underbrace{p(x | \theta)}_{\mathrm{likelihood}} \ \underbrace{p(\theta)}_{\mathrm{prior}}$$
Main limitation: the need for an explicit likelihood
We can only compute the likelihood for simple summary statistics and on large scales

$\Longrightarrow$ We are dismissing a large amount of information!

Can I use a Deep Learning to perform proper Bayesian inference without analytic likelihoods?

let us rephrase the question

  • I assume a forward model of the observations: \begin{equation} p( x ) = p(x | \theta) \ p(\theta) \nonumber \end{equation} All I ask is the ability to sample from the model, to obtain $\mathcal{D} = \{x_i, \theta_i \}_{i\in \mathbb{N}}$

  • I am going to assume $q_\phi(\theta | x)$ a parametric conditional density

  • Optimize the parameters $\phi$ of $q_{\phi}$ according to \begin{equation} \min\limits_{\phi} \sum\limits_{i} - \log q_{\phi}(\theta_i | x_i) \nonumber \end{equation} In the limit of large number of samples and sufficient flexibility \begin{equation} \boxed{q_{\phi^\ast}(\theta | x) \approx p(\theta | x)} \nonumber \end{equation}
$\Longrightarrow$ One can asymptotically recover the posterior by optimizing a parametric estimator over
the Bayesian joint distribution
$\Longrightarrow$ One can asymptotically recover the posterior by optimizing a Deep Neural Network over
a simulated training set.

The Likelihood-Free Inference approach

  • Instead of trying to analytically evaluate the likelihood, let us build a forward model of the observables.

$\Longrightarrow$ The simulator becomes the physical model

A two-steps approach to inference
  • Automatically learn an optimal low-dimensional summary statistic $$y = f_\varphi(\kappa_{KS}) $$
  • Use Neural Density Estimation to either:
    • build an estimate $p_\phi$ of the likelihood function $p(y \ | \ \theta)$ (Neural Likelihood Estimation)

    • build an estimate $p_\phi$ of the posterior distribution $p(\theta \ | \ y)$ (Neural Posterior Estimation)

Neural Density Estimation

Bishop (1994)
  • Mixture Density Networks (MDN) \begin{equation} p(\theta | x) = \prod_i \pi_i(x) \ \mathcal{N}\left(\mu_i(x), \ \sigma_i(x) \right) \nonumber \end{equation}

  • Flourishing Machine Learning literature on density estimators
    GLOW, (Kingma & Dhariwal, 2018)

End-to-end framework for likelihood-free parameter inference with DES SV

Jeffrey, Alsing, Lanusse (2021)

Suite of N-body + raytracing simulations: $\mathcal{D}$

Learning summary statistics by Variational Mutual Information Maximization

  • Mutual information between $X$ and $Y$:
    “"amount of information" obtained about one random variable through observing the other random variable”

  • Given a parametric summarizing function $y = f_\phi(\kappa(\theta))$ optimizing $f_\phi$ can be done by maximizing $I(y, \theta)$.

  • In practice, $f_\phi$ is a CNN, trained to maximize a variational lower bound on the mutual information: $$ I(y ; \theta) \ \ge \ \mathbb{E}_{y, \theta} [ \log q_\phi(\theta | y) ] + H(\Theta) $$

deep residual networks for lensing maps compression

  • Deep Residual Network $y = f_\phi(x)$ followed by mixture density network $q_\phi(\theta | y)$

  • Training on weak lensing maps simulated for different cosmologies

  • Optimization of the variational lower bound: $$\mathbb{E}_{(x, \theta) \in \mathcal{D}} [ \log q_\phi(\theta | f_\phi(y) ) ]$$

Estimating the likelihood by Neural Density Estimation

$\Longrightarrow$ We cannot assume a Gaussian likelihood for the summary $y = f_\phi(\kappa)$ but we can learn $p(y | \theta)$: Neural Likelihood Estimation.

Dinh et al. 2016
Neural Likelihood Estimation by Normalizing Flow
  • We use a conditional Normalizing Flow to build an explicit model for the likelihood function $$ \log p_\varphi (y | \theta)$$

  • In practice we use the pyDELFI package and an ensemble of NDEs for robustness.

  • Once learned, we can use the likelihood as part of a conventional MCMC chain

Parameter constraints from DES SV data

Let's be critical for a minute

Is this Likelihood-Free Inference approach a good application of Deep Learning?

Deep Learning For Cosmological Inference
  • It's only a shift in how we model physics and observables: $$ \mbox{analytic modeling} \rightarrow \mbox{forward simulation} $$ but we remain within the same physical Bayesian framework.

  • This allows us to free ourselves from the restrictions of tractable analytic likelihoods (choice of summary statistics, handling of systematics, etc...)

Deep Generative Models as Data-Driven Bayesian Priors

Gravitational lensing

Galaxy shapes as estimators for gravitational shear
$$ e = \gamma + e_i \qquad \mbox{ with } \qquad e_i \sim \mathcal{N}(0, I)$$
  • We are trying the measure the ellipticity $e$ of galaxies as an estimator for the gravitational shear $\gamma$

Weak Lensing Mass-Mapping as an Inverse Problem

Shear $\gamma$
Convergence $\kappa$
$$\gamma_1 = \frac{1}{2} (\partial_1^2 - \partial_2^2) \ \Psi \quad;\quad \gamma_2 = \partial_1 \partial_2 \ \Psi \quad;\quad \kappa = \frac{1}{2} (\partial_1^2 + \partial_2^2) \ \Psi$$
$$\boxed{\gamma = \mathbf{P} \kappa}$$

Linear inverse problems

$\boxed{y = \mathbf{A}x + n}$

$\mathbf{A}$ is known and encodes our physical understanding of the problem.
$\Longrightarrow$ When non-invertible or ill-conditioned, the inverse problem is ill-posed with no unique solution $x$

What Would a Bayesian Do?

$\boxed{y = \mathbf{A}x + n}$

The Bayesian view of the problem:
$$ p(x | y) \propto p(y | x) \ p(x) $$
  • $p(y | x)$ is the data likelihood, which contains the physics

  • $p(x)$ is the prior knowledge on the solution.

    With these concepts in hand we can:
  • Estimate for instance the Maximum A Posteriori solution:
    $$\hat{x} = \arg\max\limits_x \ \log p(y \ | \ x) + \log p(x)$$
  • Estimate from the full posterior p(x|y) with MCMC or Variational Inference methods.

How do you choose the prior ?

Classical examples of signal priors

$$ \log p(x) = \parallel \mathbf{W} x \parallel_1 $$
Gaussian $$ \log p(x) = x^t \mathbf{\Sigma^{-1}} x $$
Total Variation $$ \log p(x) = \parallel \nabla x \parallel_1 $$

But what about this?

$\Longrightarrow$ Prior in the form of numerical simulations.

Can we use Deep Learning to embed simulation-driven priors within a physical Bayesian model?

What is generative modeling?

  • The goal of generative modeling is to learn the distribution $\mathbb{P}$ from which the training set $X = \{x_0, x_1, \ldots, x_n \}$ is drawn.

  • Usually, this means building a parametric model $\mathbb{P}_\theta$ that tries to be close to $\mathbb{P}$.

True $\mathbb{P}$

Samples $x_i \sim \mathbb{P}$

Model $\mathbb{P}_\theta$

  • Once trained, you can typically sample from $\mathbb{P}_\theta$ and/or evaluate the likelihood $p_\theta(x)$.

Why isn't it easy?

  • The curse of dimensionality put all points far apart in high dimension

Distance between pairs of points drawn from a Gaussian distribution.

  • Classical methods for estimating probability densities, i.e. Kernel Density Estimation (KDE) start to fail in high dimension because of all the gaps

The evolution of generative models

  • Deep Belief Network
    (Hinton et al. 2006)

  • Variational AutoEncoder
    (Kingma & Welling 2014)

  • Generative Adversarial Network
    (Goodfellow et al. 2014)

  • Wasserstein GAN
    (Arjovsky et al. 2017)

A visual Turing test

Fake PixelCNN samples

Real galaxies from SDSS

Getting started with Deep Priors: deep denoising example

$$ \boxed{{\color{Orchid} y} = {\color{SkyBlue} x} + n} $$
  • Let us assume we have access to examples of $ {\color{SkyBlue} x}$ without noise.

  • We learn the distribution of noiseless data $\log p_\theta(x)$ from samples using a deep generative model.

  • The solution should lie on the realistic data manifold, symbolized by the two-moons distribution.

    We want to solve for the Maximum A Posterior solution:

    $$\arg \max - \frac{1}{2} \parallel {\color{Orchid} y} - {\color{SkyBlue} x} \parallel_2^2 + \log p_\theta({\color{SkyBlue} x})$$ This can be done by gradient descent as long as one has access to the score function $\frac{\color{orange} d \color{orange}\log \color{orange}p\color{orange}(\color{orange}x\color{orange})}{\color{orange} d \color{orange}x}$.

Neural Score Estimation by Denoising Score Matching

  • Denoising Score Matching: An optimal Gaussian denoiser learns the score of a given distribution.
    • If $x \sim \mathbb{P}$ is corrupted by additional Gaussian noise $u \in \mathcal{N}(0, \sigma^2)$ to yield $$x^\prime = x + u$$
    • Let's consider a denoiser $r_\theta$ trained under an $\ell_2$ loss: $$\mathcal{L}=\parallel x - r_\theta(x^\prime, \sigma) \parallel_2^2$$
    • The optimal denoiser $r_{\theta^\star}$ verifies: $$\boxed{\boldsymbol{r}_{\theta^\star}(\boldsymbol{x}', \sigma) = \boldsymbol{x}' + \sigma^2 \nabla_{\boldsymbol{x}} \log p_{\sigma^2}(\boldsymbol{x}')}$$
$\boldsymbol{x}'- \boldsymbol{r}^\star(\boldsymbol{x}', \sigma)$
$\boldsymbol{r}^\star(\boldsymbol{x}', \sigma)$

Back to the Mass-Mapping Problem

$$ \log p( \kappa | e) = \underbrace{\log p(e | \kappa)}_{\simeq -\frac{1}{2} \parallel e - P \kappa \parallel_\Sigma^2} + \log p(\kappa) +cst $$
  • The likelihood term is known analytically.
  • There is no close form expression for the full non-Gaussian prior of the convergence.
    • We do have access to samples of full prior through simulations: $X = \{x_0, x_1, \ldots, x_n \}$ with $x_i \sim \mathbb{P}$
$\Longrightarrow$ Our strategy: Learn the prior score $\nabla \log p(\kappa)$ with Denoising Score Matching, and then sample the full posterior with annealed HMC.

Illustration on MassiveNuS simulations

Probabilistic Mass-Mapping of the HST COSMOS field

  • COSMOS shear data from Schrabback et al. 2010

  • Prior learned from MassiveNuS at fiducial cosmology (320x320 maps at 0.4 arcsec resolution).

  • Known massive X-ray clusters indicated with crosses, along with their redshifts, right pannel shows cutouts of central cluster from multiple posterior samples.

Uncertainty quantification in Magnetic Resonance Imaging (MRI)

Ramzi, Remy, Lanusse et al. 2020

$$\boxed{y = \mathbf{M} \mathbf{F} x + n}$$

$\Longrightarrow$ We can see which parts of the image are well constrained by data, and which regions are uncertain.

Example of application to deblending

  • Hybrid Physical-Deep Learning Model for Astronomical Inverse Problems
    F. Lanusse, P. Melchior, F. Moolekamp

$\mathcal{L} = \frac{1}{2} \parallel \mathbf{\Sigma}^{-1/2} (\ Y - P \ast A S \ ) \parallel_2^2 - \sum_{i=1}^K \log p_{\theta}(S_i)$

isolated galaxy $\log p_\theta(x) = 3293.7$
artificial blend $\log p_\theta(x) = 3100.5 $

Let's be critical for a minute

Is this Deep Priors approach a good application of Deep Learning?

Deep Data-Driven Priors for Inverse Problems
  • Deep Generative Models are only a way to perform Bayesian inference in a classically intractable regime: from having only access to samples from the prior p(x) with a high dimensional $x$.

  • We retain a physical explicit likelihood (can directly account for changes in masks, noise, etc. without retraining)

Automatically Differentiable Physics

the hammer behind the Deep Learning revolution

  • Automatic differentiation allows you to compute analytic derivatives of arbitraty expressions:
    If I form the expression $y = a * x + b$, it is separated in fundamental ops: $$ y = u + b \qquad u = a * x $$ then gradients can be obtained by the chain rule: $$\frac{\partial y}{\partial x} = \frac{\partial y}{\partial u} \frac{ \partial u}{\partial x} = 1 \times a = a$$

  • This is a fundamental tool in Machine Learning, and autodiff frameworks include TensorFlow, JAX or PyTorch.
    $\Longrightarrow$ In addition to autodiff, they all provide GPU/TPU acceleration.

jax-cosmo: Finally a differentiable cosmology library, and it's in JAX!

Open In Colab
  • Easy to use, runs on GPU, and validated against the DESC Core Cosmology Library.
    																import jax_cosmo as jc
    																# Defining a Cosmology
    																cosmo = jc.Planck15()
    																# Define a redshift distribution with smail_nz(a, b, z0)
    																nz = jc.redshift.smail_nz(1., 2., 1.)
    																# Build a lensing tracer with a single redshift bin
    																probe = probes.WeakLensing([nz])
    																# Compute angular Cls for some ell
    																ell = np.logspace(0.1,3)
    																cls = angular_cl(cosmo_jax, ell, [probe])
  • Fisher matrices become trivial and exact!
    													# Fisher matrix in just one line:
    													F = - jax.hessian(gaussian_likelihood)(theta)
    No derivatives were harmed by finite differences in the computation of this Fisher!

LSST DESC 3x2pt Tomography Challenge

Zuntz, Lanusse, et al. (2021)
Description of the challenge
“Given (g)riz photometry, find a tomographic bin assignment method that optimizes a 3x2pt analysis.”
  • Metrics: Total Signal-to-Noise: $m_{SNR} = \sqrt{\mu^t \mathbf{C}^{-1} \mu}$ ; DETF Figure of Merit: $m_{FOM} = \frac{1}{\sqrt{ \det(\mathbf{F}^{-1})}}$
  • Idealized setting: assumes perfect training set. More info at: https://github.com/LSSTDESC/tomo_challenge
  • Conventional strategy: Use a photoz code to estimate redshifts, then bin galaxies based on their photoz.

  • Strategy with Differentiable Physics:
    • Introduce a parametric bin assignement function $f_\theta(x_{phot})$
    • Optimize $\theta$ by back-propagating through the challenge metrics.

Challenge results based on Buzzard simulations

Another approach to cosmological inference: Hierarchical Bayesian Modeling

  • Instead of trying to analytically evaluate the likelihood, let us build a forward model of the observables.

  • Each component of the model is now tractable, but at the cost of a large number of latent variables.

$\Longrightarrow$ How to peform efficient inference in this large number of dimensions?

    A non-exhaustive list of methods:
  • Hamiltonian Monte-Carlo
  • Variational Inference
  • MAP+Laplace
  • Gold Mining
  • Dimensionality reduction by Fisher-Information Maximization

What do they all have in common?
-> They require fast, accurate, differentiable forward simulations
(Schneider et al. 2015)

Forward Models in Cosmology

Linear Field
Final Dark Matter

Dark Matter Halos
N-body simulations
Group Finding
Semi-analytic &
distribution models

How do we make cosmological simulations fast and differentiable?

You can try to learn the simulation...

Learning particle displacement with a UNet. S. He, et al. (2019)

The issue with using deep learning as a black-box
  • No guarantees to work outside of training regime.
  • No guarantees to capture dependence on cosmology accurately.

the Fast Particle-Mesh scheme

The idea: approximate gravitational forces by estimating densities on a grid.
  • The numerical scheme:

    • Estimate the density of particles on a mesh
      => compute gravitational forces by FFT

    • Interpolate forces at particle positions

    • Update particle velocity and positions, and iterate

  • Fast and simple, at the cost of approximating short range interactions.
$\Longrightarrow$ Only a series of FFTs and interpolations.

Introducing FlowPM: Particle-Mesh Simulations in TensorFlow

Modi, Lanusse, Seljak (2020)

			import tensorflow as tf
			import flowpm
			# Defines integration steps
			stages = np.linspace(0.1, 1.0, 10, endpoint=True)

			initial_conds = flowpm.linear_field(32,       # size of the cube
			                                    100,      # Physical size
			                                    ipklin,   # Initial powerspectrum

			# Sample particles and displace them by LPT
			state = flowpm.lpt_init(initial_conds, a0=0.1)

			# Evolve particles down to z=0
			final_state = flowpm.nbody(state, stages, 32)

			# Retrieve final density field
			final_field = flowpm.cic_paint(tf.zeros_like(initial_conditions),
  • Seamless interfacing with deep learning components
  • Gradients readily available

Bayesian Reconstruction of Cosmological Fields

Going back to simpler times...
$$\arg\max_s \ \log p(x | f(s)) \ + \ \log p(s) $$ where:
  • $f$ is the forward model
  • $p(x | f(s))$ is a tractable data likelihood
  • $s$ are the initial conditions (early universe)
  • $x$ is the data (e.g. galaxies, lensing, etc.)

MAP optimization in action

$$\arg\max_s \ \log p(x_{dm} | f(s)) \ + \ \log p(s) $$
credit: C. Modi

True initial conditions

Reconstructed initial conditions $s$

Reconstructed dark matter distribution $x_{dm} = f(s)$

$x_{dm} = f(s_0)$

If only MAP optimization was easy...

$$\arg\max_s \ \log p(x_{dm} | f(s)) \ + \ \log p(s) $$
Direct optimization
ad-hoc annealing
  • Direct optimization of MAP leads to poor solutions on large scales.
  • Annealing recovers unbiased large scales, but at the cost of ad-hoc tempering procedure.

Instead of guessing an optimization scheme, could we learn to optimize?

A closer look at the optimization algorithm

$$\arg\max_x \ \log p(y | f(x)) \ + \ \log p(x) $$
  • Standard Gradient Descent Algorithm: $$x_{i+1} = x_i - \epsilon \left[ \nabla_x{\log p(y | f(x_i))} + \nabla_x \log p(x_i) \right]$$
    $$x_{i+1} = x_i - \Gamma \left(\nabla_x{\log p(y | f(x_i))}, \nabla_x \log p(x_i) \right]$$ with update function $\Gamma: (u,v) \rightarrow \epsilon(u + v)$

  • Many algorithms (e.g. ADAM, LBFGS) can expressed in this form with a different choice of $\Gamma$.

$\Longrightarrow$ What if we could learn this update function?

Recurrent Inference Machines for Solving Inverse Problems
Putzky & Welling, 2017

  • Introduce a Recurrent Neural Network (RNN) $h_\phi$, and state variable $s$, so that: $$ s_{i+1} = h^*_\phi( \nabla \log p(y|x_{i}), x_i, s_{i})$$ $$ x_{i+1} = x_i + h_\phi(\nabla \log p(y|x_{i}), x_i, s_{i+1})$$
  • Train according to: $$\mathcal{L} = \sum_{i}^T w_i\mathcal{L}(x_i, x)$$

Inside an RNN Cell


Illustration on inverse problems

From left to right: input masked image, increasing number of steps of solutions, ground truth.

CosmicRIM: Recurrence Inference Machines for Initial Condition Reconstruction

Modi, Lanusse, Seljak, Spergel, Perreault-Levasseur (2021)

Recurrent Neural Network Architecture
  • A few notable differences to a vanilla RIM:
    • We provide gradients of both prior and likelihood to the model.

    • Because our forward model couples scales, we use a multiscale U-Net architecture.

    • Input gradients are pre-scaled with the ADAM formula.


  • Forward model: $64^3$ particles, 400 Mpc/h box, 2LPT dynamics with 2nd order bias model
  • RIM: 10 steps, trained under l2 loss
Initial conditions cross-correlation
Transfer function
  • CosmicRIM: Learn to optimize by embedding a Neural Network in the optimization algorithm.
    $\Longrightarrow$ converges 40x faster than LBFGS.



Where do I think Deep Learning can be useful for Cosmology?
$\Longrightarrow$ Makes Bayesian inference possible at scale and with non-trivial models!

  • Enables inference in high dimension from numerical simulators.
    • Automagically construct summary statistics.
    • Provides the density estimation tools needed.

  • Complement known physical models with data-driven components
    • Use data-driven generative model as prior for solving inverse problems.

  • Differentiable physical models for fast inference

Thank you !