Probabilistic Deep Learning for Weak Lensing

from Mass-Mapping to Cosmological Parameter Inference


François Lanusse










slides at eiffl.github.io/LAM2022

the $\Lambda$CDM view of the Universe















the Rubin Observatory Legacy Survey of Space and Time

  • 1000 images each night, 15 TB/night for 10 years

  • 18,000 square degrees, observed once every few days

  • Tens of billions of objects, each one observed $\sim1000$ times

Previous generation survey: SDSS




















Image credit: Peter Melchior

Current generation survey: DES




















Image credit: Peter Melchior

LSST precursor survey: HSC




















Image credit: Peter Melchior

The challenges of modern surveys

$\Longrightarrow$ Modern surveys will provide large volumes of high quality data

A Blessing
  • Unprecedented statistical power

  • Great potential for new discoveries

A Curse
  • Existing methods are reaching their limits at every step of the science analysis

  • Control of systematics is paramount
LSST forecast on dark energy parameters
$\Longrightarrow$ Dire need for novel analysis techniques to fully realize the potential of modern surveys.

Can AI solve all of our problems?

The AI bubble


astro-ph abstracts mentioning Deep Learning, CNN, or Neural Networks
Branched GAN model for deblending (Reiman & Göhre, 2018)
The issue with using deep learning as a black-box
  • No explicit control of noise, PSF, number of sources.
    • Model would have to be retrained for all observing configurations

  • No guarantees on the network output (e.g. flux preservation, artifacts)

  • No proper uncertainty quantification.

Focus of this talk

  • Deep Learning for astronomical data reduction

  • Bayesian Neural Networks

  • Physical Bayesian inference


This talk
Generic approach to uncertainty quantification and interpretability:
  • (Differentiable) Physical Forward Models
  • Deep Generative Models
  • Bayesian Inference

Probabilistic Weak Lensing Mass Mapping by
Neural Score Matching


Work in collaboration with:
Benjamin Remy, Niall Jeffrey


$\Longrightarrow$ Learn complex priors by Neural Score Estimation and sample from posterior with gradient-based MCMC.

Gravitational lensing

Galaxy shapes as estimators for gravitational shear
$$ e = \gamma + e_i \qquad \mbox{ with } \qquad e_i \sim \mathcal{N}(0, I)$$
  • We are trying to measure the ellipticity $e$ of galaxies as an estimator for the gravitational shear $\gamma$

Weak Lensing Mass-Mapping as an Inverse Problem

Shear $\gamma$
Convergence $\kappa$
$$\gamma_1 = \frac{1}{2} (\partial_1^2 - \partial_2^2) \ \Psi \quad;\quad \gamma_2 = \partial_1 \partial_2 \ \Psi \quad;\quad \kappa = \frac{1}{2} (\partial_1^2 + \partial_2^2) \ \Psi$$
$$\boxed{\gamma = \mathbf{P} \kappa}$$
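The operator $\mathbf{P}$ is diagonal in Fourier space (the Kaiser-Squires kernel), so a flat-sky version of the forward model takes only a few lines. A minimal JAX sketch on a periodic grid, ignoring masks and boundary effects:

    import jax.numpy as jnp

    def shear_from_convergence(kappa):
        # Kaiser-Squires in Fourier space: gamma_hat = (k1^2 - k2^2 + 2i k1 k2) / |k|^2 * kappa_hat
        n = kappa.shape[0]
        k1, k2 = jnp.meshgrid(jnp.fft.fftfreq(n), jnp.fft.fftfreq(n), indexing='ij')
        ksq = k1**2 + k2**2
        kernel = ((k1**2 - k2**2) + 2j * k1 * k2) / jnp.where(ksq > 0, ksq, 1.)  # k=0 mode dropped
        gamma = jnp.fft.ifft2(kernel * jnp.fft.fft2(kappa))
        return jnp.real(gamma), jnp.imag(gamma)   # (gamma1, gamma2)

This linear map plays the role of the $\mathbf{A}$ in the inverse problems below, and being written in JAX it is automatically differentiable with respect to $\kappa$.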

Illustration on the Dark Energy Survey (DES) Y3

Jeffrey, et al. (2021)

Linear inverse problems

$\boxed{y = \mathbf{A}x + n}$

$\mathbf{A}$ is known and encodes our physical understanding of the problem.
$\Longrightarrow$ When non-invertible or ill-conditioned, the inverse problem is ill-posed with no unique solution $x$
Deconvolution
Inpainting
Denoising

What Would a Bayesian Do?

$\boxed{y = \mathbf{A}x + n}$

The Bayesian view of the problem:
$$ p(x | y) \propto p(y | x) \ p(x) $$
  • $p(y | x)$ is the data likelihood, which contains the physics

  • $p(x)$ is the prior knowledge on the solution.


    With these concepts in hand we can:
  • Estimate for instance the Maximum A Posteriori solution:
    $$\hat{x} = \arg\max\limits_x \ \log p(y \ | \ x) + \log p(x)$$
  • Sample from the full posterior $p(x|y)$ with MCMC or Variational Inference methods.

How do you choose the prior ?

Classical examples of signal priors

Sparse
$$ \log p(x) \propto \ - \parallel \mathbf{W} x \parallel_1 $$
Gaussian $$ \log p(x) \propto \ - \frac{1}{2} x^t \mathbf{\Sigma}^{-1} x $$
Total Variation $$ \log p(x) \propto \ - \parallel \nabla x \parallel_1 $$

But what about this?




$\Longrightarrow$ Implicit prior $p(x)$ in the form of cosmological simulations.

Can we use Deep Learning to embed simulation-driven priors within a physical Bayesian model?

What is generative modeling?


  • The goal of generative modeling is to learn an implicit distribution $\mathbb{P}$ from which the training set $X = \{x_0, x_1, \ldots, x_n \}$ is drawn.

  • Usually, this means building a parametric model $\mathbb{P}_\theta$ that tries to be close to $\mathbb{P}$.


True $\mathbb{P}$

Samples $x_i \sim \mathbb{P}$

Model $\mathbb{P}_\theta$


  • Once trained, you can typically sample from $\mathbb{P}_\theta$ and/or evaluate the likelihood $p_\theta(x)$.

Why isn't it easy?


  • The curse of dimensionality puts all points far apart in high dimensions

Distance between pairs of points drawn from a Gaussian distribution.

  • Classical methods for estimating probability densities, e.g. Kernel Density Estimation (KDE), start to fail in high dimensions because of all the gaps
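The concentration of pairwise distances can be checked in a few lines; a small sketch drawing standard Gaussian samples in increasing dimension:

    import jax
    import jax.numpy as jnp

    key = jax.random.PRNGKey(0)
    for d in [2, 16, 128, 1024]:
        key, sub = jax.random.split(key)
        x = jax.random.normal(sub, (100, d))                      # 100 samples in dimension d
        dists = jnp.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
        dists = dists[jnp.triu_indices(100, k=1)]                 # unique pairs only
        print(d, jnp.mean(dists), jnp.std(dists) / jnp.mean(dists))
        # the mean distance grows like sqrt(2 d) while the relative spread shrinks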

The evolution of generative models




  • Deep Belief Network
    (Hinton et al. 2006)

  • Variational AutoEncoder
    (Kingma & Welling 2014)

  • Generative Adversarial Network
    (Goodfellow et al. 2014)

  • Wasserstein GAN
    (Arjovsky et al. 2017)



A visual Turing test


Fake PixelCNN samples

Real galaxies from SDSS

Getting started with Deep Priors: deep denoising example

$$ \boxed{{\color{Orchid} y} = {\color{SkyBlue} x} + n} $$
  • Let us assume we have access to examples of $ {\color{SkyBlue} x}$ without noise.

  • We learn the distribution of noiseless data $\log p_\theta(x)$ from samples using a deep generative model.

  • The solution should lie on the realistic data manifold, symbolized by the two-moons distribution.

    We want to solve for the Maximum A Posteriori solution:

    $$\arg \max\limits_{\color{SkyBlue} x} \ - \frac{1}{2} \parallel {\color{Orchid} y} - {\color{SkyBlue} x} \parallel_2^2 + \log p_\theta({\color{SkyBlue} x})$$ This can be done by gradient descent as long as one has access to the score function $\frac{d \log p(x)}{dx}$.
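A minimal sketch of that gradient ascent, assuming a learned prior score function `prior_score(x)` $\approx \nabla_x \log p_\theta(x)$ (a placeholder for the trained generative model):

    import jax
    import jax.numpy as jnp

    def map_denoise(y, prior_score, n_steps=500, tau=1e-2):
        # Gradient ascent on log p(x|y) = -1/2 ||y - x||^2 + log p_theta(x) + cst
        log_likelihood = lambda x: -0.5 * jnp.sum((y - x)**2)
        x = y                                            # start from the noisy observation
        for _ in range(n_steps):
            posterior_score = jax.grad(log_likelihood)(x) + prior_score(x)
            x = x + tau * posterior_score
        return x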

Writing down the convergence map log posterior

Remy, Lanusse, et al. (2022)
$$ \log p( \kappa | e) = \underbrace{\log p(e | \kappa)}_{\simeq -\frac{1}{2} \parallel e - P \kappa \parallel_\Sigma^2} + \log p(\kappa) +cst $$
  • The likelihood term is known analytically.
  • There is no closed-form expression for the full non-Gaussian prior of the convergence. However:
    • We do have access to samples of full implicit prior through simulations: $X = \{x_0, x_1, \ldots, x_n \}$ with $x_i \sim \mathbb{P}$
$\Longrightarrow$ Our strategy: Learn the prior from simulation, and then sample the full posterior.

The score is all you need!


  • Whether you are looking for the MAP or sampling with HMC or MALA, you only need access to the score of the posterior: $$\frac{d \log p(x | y)}{dx}$$
    • Gradient descent: $x_{t+1} = x_t + \tau \nabla_x \log p(x_t | y) $
    • Langevin algorithm: $x_{t+1} = x_t + \tau \nabla_x \log p(x_t | y) + \sqrt{2\tau} n_t$
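For instance, the Langevin update above becomes a one-liner once a `posterior_score(x)` function is available (the name is illustrative); a minimal unadjusted Langevin sampler in JAX:

    import jax
    import jax.numpy as jnp

    def langevin_chain(key, x0, posterior_score, n_steps=1000, tau=1e-4):
        # x_{t+1} = x_t + tau * grad log p(x_t | y) + sqrt(2 tau) n_t,   n_t ~ N(0, I)
        x = x0
        for _ in range(n_steps):
            key, subkey = jax.random.split(key)
            x = x + tau * posterior_score(x) + jnp.sqrt(2 * tau) * jax.random.normal(subkey, x.shape)
        return x

MALA adds a Metropolis accept/reject step on top of this proposal.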



Neural Score Estimation by Denoising Score Matching

  • Denoising Score Matching: An optimal Gaussian denoiser learns the score of a given distribution.
    • If $x \sim \mathbb{P}$ is corrupted by additive Gaussian noise $u \sim \mathcal{N}(0, \sigma^2 \mathbf{I})$ to yield $$x^\prime = x + u$$
    • Let's consider a denoiser $r_\theta$ trained under an $\ell_2$ loss: $$\mathcal{L}=\parallel x - r_\theta(x^\prime, \sigma) \parallel_2^2$$
    • The optimal denoiser $r_{\theta^\star}$ verifies: $$\boxed{\boldsymbol{r}_{\theta^\star}(\boldsymbol{x}', \sigma) = \boldsymbol{x}' + \sigma^2 \nabla_{\boldsymbol{x}} \log p_{\sigma^2}(\boldsymbol{x}')}$$
Illustration: noisy input $\boldsymbol{x}'$, true signal $\boldsymbol{x}$, denoised output $\boldsymbol{r}^\star(\boldsymbol{x}', \sigma)$, and residual $\boldsymbol{x}'- \boldsymbol{r}^\star(\boldsymbol{x}', \sigma)$

Training a Neural Score Estimator in practice




A standard UNet
  • We use a very standard residual UNet, and we adopt a residual score matching loss: $$ \mathcal{L}_{DSM} = \underset{\boldsymbol{x} \sim P}{\mathbb{E}} \underset{\begin{subarray}{c} \boldsymbol{u} \sim \mathcal{N}(0, I) \\ \sigma_s \sim \mathcal{N}(0, s^2) \end{subarray}}{\mathbb{E}} \parallel \boldsymbol{u} + \sigma_s \boldsymbol{r}_{\theta}(\boldsymbol{x} + \sigma_s \boldsymbol{u}, \sigma_s) \parallel_2^2$$ $\Longrightarrow$ direct estimator of the score $\nabla \log p_\sigma(x)$

  • Lipschitz regularization to improve robustness:

    Without regularization
    With regularization
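A minimal sketch of the residual score matching loss $\mathcal{L}_{DSM}$ above, with the score network `r_theta(params, x, sigma)` passed in as a callable (names and signature are illustrative, not the paper's code):

    import jax
    import jax.numpy as jnp

    def dsm_loss(params, key, x_batch, r_theta, s=1.0):
        # x_batch: convergence maps of shape [batch, ny, nx, 1]
        key_u, key_s = jax.random.split(key)
        u = jax.random.normal(key_u, x_batch.shape)                        # u ~ N(0, I)
        # noise level per example, taken positive here for simplicity
        sigma = jnp.abs(s * jax.random.normal(key_s, (x_batch.shape[0], 1, 1, 1)))
        residual = u + sigma * r_theta(params, x_batch + sigma * u, sigma)
        return jnp.mean(jnp.sum(residual**2, axis=(1, 2, 3)))              # || u + sigma r ||^2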

Efficient sampling by Annealed HMC

  • Even with gradients, sampling in high dimension is difficult!
    $\Longrightarrow$ Use a parallel annealing strategy to effectively sample from the full distribution.

  • We use the fact that our score network $\mathbf{r}_\theta(x, \sigma)$ is learning the score of a noise-convolved distribution, $\nabla \log p_\sigma$
    $$\sigma_1 > \sigma_2 > \sigma_3 > \sigma_4 $$
  • Run many HMC chains in parallel, progressively annealing $\sigma$ to 0, and keep the last point of each chain as an independent sample.

Example of one chain during annealing

Illustration on $\kappa$-TNG simulations


True convergence map
Traditional Kaiser-Squires
Wiener Filter
Posterior Mean (ours)



Posterior samples

Validating Bayesian Posterior in Gaussian case

Reconstruction of the HST/ACS COSMOS field


Massey et al. (2007)
Remy et al. (2022) Posterior mean
Remy et al. (2022) Posterior samples

Other examples of Deep Priors


  • Hybrid Physical-Deep Learning Model for Astronomical Inverse Problems
    F. Lanusse, P. Melchior, F. Moolekamp


$\mathcal{L} = \frac{1}{2} \parallel \mathbf{\Sigma}^{-1/2} (\ Y - P \ast A S \ ) \parallel_2^2 - \sum_{i=1}^K \log p_{\theta}(S_i)$



  • Denoising Score-Matching for Uncertainty Quantification in Inverse Problems
    Z. Ramzi, B. Remy, F. Lanusse, P. Ciuciu, J.L. Starck

Simulation-Based Inference
by Neural Summarisation and Density Estimation


Work in collaboration with Niall Jeffrey, Justin Alsing



$\Longrightarrow$ Learn an implicit data likelihood from simulations.

limits of traditional cosmological inference

HSC cosmic shear power spectrum
HSC Y1 constraints on $(S_8, \Omega_m)$
(Hikage et al. 2018)
  • Measure the ellipticity $\epsilon = \epsilon_i + \gamma$ of all galaxies
    $\Longrightarrow$ Noisy tracer of the weak lensing shear $\gamma$

  • Compute summary statistics based on 2pt functions,
    e.g. the power spectrum

  • Run an MCMC to recover a posterior on model parameters, using an analytic likelihood $$ p(\theta | x ) \propto \underbrace{p(x | \theta)}_{\mathrm{likelihood}} \ \underbrace{p(\theta)}_{\mathrm{prior}}$$
Main limitation: the need for an explicit likelihood
We can only compute the likelihood for simple summary statistics and on large scales

$\Longrightarrow$ We are dismissing a large amount of information!

A different road: forward modeling

  • Instead of trying to analytically evaluate the likelihood, let us build a forward model of the observables.




$\Longrightarrow$ Learn an implicit likelihood $p(x|\theta)$ given by the simulator as our physical model




End-to-end framework for likelihood-free parameter inference with DES SV

Jeffrey, Alsing, Lanusse (2021)

Suite of N-body + raytracing simulations: $\mathcal{D}$

Our method

A two-step approach to inference
  • Automatically learn an optimal low-dimensional summary statistic $$y = f_\varphi(\kappa_{KS}) $$

  • Use Neural Density Estimation to build an estimate $p_\phi$ of the likelihood function $p(y \ | \ \theta)$ (Neural Likelihood Estimation)

  • Run a conventional Markov Chain Monte Carlo sampling

Learning summary statistics by Variational Mutual Information Maximization



  • Mutual information between $X$ and $Y$:
    “"amount of information" obtained about one random variable through observing the other random variable”

  • Given a parametric summarizing function $y = f_\phi(\kappa(\theta))$, optimizing $f_\phi$ can be done by maximizing the mutual information $I(y; \theta)$.

  • In practice, $f_\phi$ is a CNN, trained to maximize a variational lower bound on the mutual information: $$ I(y ; \theta) \ \ge \ \mathbb{E}_{y, \theta} [ \log q_\phi(\theta | y) ] + H(\Theta) $$
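As a sketch of this objective: a compressor network produces $y$, a simple conditional Gaussian plays the role of $q_\phi(\theta|y)$, and we minimize $-\mathbb{E}[\log q_\phi(\theta|y)]$ (all function names here are illustrative; the actual analysis uses the networks described on the next slide):

    import jax.numpy as jnp

    def neg_mi_lower_bound(params, kappa_batch, theta_batch, compressor, gaussian_head):
        # y = f_phi(kappa): low-dimensional summary, e.g. from a CNN
        y = compressor(params, kappa_batch)
        # q_phi(theta | y) modelled as a diagonal Gaussian with network-predicted moments
        mu, log_sigma = gaussian_head(params, y)
        log_q = -0.5 * jnp.sum(((theta_batch - mu) / jnp.exp(log_sigma))**2
                               + 2 * log_sigma + jnp.log(2 * jnp.pi), axis=-1)
        # minimizing this maximizes E[log q(theta|y)], a lower bound on I(y; theta) up to H(theta)
        return -jnp.mean(log_q)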

deep residual networks for lensing maps compression


  • Deep Residual Network $y = f_\phi(x)$ followed by neural density estimator $q_\phi(\theta | y)$

  • Training on weak lensing maps simulated for different cosmologies




  • Training by Variational Mutual Information Maximization: $$\mathbb{E}_{(x, \theta) \in \mathcal{D}} [ \log q_\phi(\theta | f_\phi(x) ) ]$$

Estimating the likelihood by Neural Density Estimation


$\Longrightarrow$ We cannot assume a Gaussian likelihood for the summary $y = f_\phi(\kappa)$ but we can learn $p(y | \theta)$: Neural Likelihood Estimation.


Dinh et al. 2016
Neural Likelihood Estimation by Normalizing Flow
  • We use a conditional Normalizing Flow to build an explicit model for the likelihood function $$ \log p_\varphi (y | \theta)$$

  • In practice we use the pyDELFI package and an ensemble of NDEs for robustness.

  • Once learned, we can use the likelihood as part of a conventional MCMC chain
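Once trained, the flow simply provides a log-likelihood callable for the sampler; a minimal random-walk Metropolis sketch, where `flow_log_likelihood` and `log_prior` are placeholders for the trained NDE and the prior:

    import jax
    import jax.numpy as jnp

    def metropolis(key, theta0, log_post, n_samples=10000, step=0.02):
        samples, theta, logp = [], theta0, log_post(theta0)
        for _ in range(n_samples):
            key, k1, k2 = jax.random.split(key, 3)
            prop = theta + step * jax.random.normal(k1, theta.shape)      # random-walk proposal
            logp_prop = log_post(prop)
            if jnp.log(jax.random.uniform(k2)) < logp_prop - logp:        # accept/reject
                theta, logp = prop, logp_prop
            samples.append(theta)
        return jnp.stack(samples)

    # log_post = lambda t: flow_log_likelihood(y_obs, t) + log_prior(t)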


Parameter constraints from DES SV data

GPU accelerated and Automatically Differentiable Physics




Collection of collaborative projects, join us :-)


the hammer behind the Deep Learning revolution: Automatic Differentiation

  • Automatic differentiation allows you to compute analytic derivatives of arbitrary expressions:
    If I form the expression $y = a * x + b$, it is decomposed into fundamental ops: $$ y = u + b \qquad u = a * x $$ then gradients can be obtained by the chain rule: $$\frac{\partial y}{\partial x} = \frac{\partial y}{\partial u} \frac{ \partial u}{\partial x} = 1 \times a = a$$

  • This is a fundamental tool in Machine Learning, and autodiff frameworks include TensorFlow and PyTorch.


Enters JAX: NumPy + Autograd + GPU
  • JAX follows the NumPy api!
    
    import jax.numpy as np
          									
  • Arbitrary order derivatives
  • Accelerated execution on GPU and TPU
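For instance, the toy chain-rule example of the previous slide, with higher-order derivatives obtained by simply composing `jax.grad`:

    import jax

    def y(x, a=3.0, b=1.0):
        return a * x + b

    print(jax.grad(y)(2.0))              # dy/dx = a = 3.0
    print(jax.grad(jax.grad(y))(2.0))    # d2y/dx2 = 0.0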

jax-cosmo: Finally a differentiable cosmology library, and it's in JAX!


    import jax.numpy as np
    import jax_cosmo as jc

    # Defining a Cosmology
    cosmo = jc.Planck15()

    # Define a redshift distribution with smail_nz(a, b, z0)
    nz = jc.redshift.smail_nz(1., 2., 1.)

    # Build a lensing tracer with a single redshift bin
    probe = jc.probes.WeakLensing([nz])

    # Compute angular Cls for some ell
    ell = np.logspace(0.1, 3)
    cls = jc.angular_cl.angular_cl(cosmo, ell, [probe])
      									
Current main features
  • Weak Lensing and Number counts probes
  • Eisenstein & Hu (1998) power spectrum + halofit
  • Angular $C_\ell$ under Limber approximation
$\Longrightarrow$ 3x2pt DES Y1 capable

Validating against the DESC Core Cosmology Library

let's compute a Fisher matrix


$$F = - \mathbb{E}_{p(x | \theta)}[ H_\theta(\log p(x| \theta)) ] $$

    import jax
    import jax.numpy as np
    import jax_cosmo as jc

    # .... define probes, and load a data vector

    def gaussian_likelihood( theta ):
      # Build the cosmology for given parameters
      cosmo = jc.Planck15(Omega_c=theta[0], sigma8=theta[1])

      # Compute mean and covariance
      mu, cov = jc.angular_cl.gaussian_cl_covariance_and_mean(cosmo,
                                                        ell, probes)
      # returns the log-likelihood of the data under the model
      return jc.likelihood.gaussian_likelihood(data, mu, cov)

    # Fisher matrix in just one line:
    F = - jax.hessian(gaussian_likelihood)(theta)
      					


  • No derivatives were harmed by finite differences in the computation of this Fisher!
  • Only a small additional compute time compared to one forward evaluation of the model

Inference becomes fast and scalable

  • Current cosmological MCMC chains take days, and typically require access to large computer clusters.

  • Gradients of the log posterior are required for modern efficient and scalable inference techniques:
    • Variational Inference
    • Hamiltonian Monte-Carlo

  • In jax-cosmo, we can trivially obtain exact gradients:
    
    def log_posterior( theta ):
        return gaussian_likelihood( theta ) + log_prior(theta)

    score = jax.grad(log_posterior)(theta)
    									      					

  • On a DES Y1 analysis, we find convergence in 70,000 samples with vanilla HMC, 140,000 with Metropolis-Hastings

DES Y1 posterior, jax-cosmo HMC vs Cobaya MH
(credit: Joe Zuntz)
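For reference, "vanilla HMC" here means the textbook leapfrog scheme driven entirely by `jax.grad(log_posterior)`; a generic single-step sketch (not the actual analysis code):

    import jax
    import jax.numpy as jnp

    def hmc_step(key, theta, log_posterior, step_size=0.01, n_leapfrog=20):
        grad_fn = jax.grad(log_posterior)
        key_p, key_u = jax.random.split(key)
        p0 = jax.random.normal(key_p, theta.shape)             # resample momentum (identity mass)

        # Leapfrog integration of the Hamiltonian dynamics
        q, p = theta, p0 + 0.5 * step_size * grad_fn(theta)
        for _ in range(n_leapfrog - 1):
            q = q + step_size * p
            p = p + step_size * grad_fn(q)
        q = q + step_size * p
        p = p + 0.5 * step_size * grad_fn(q)

        # Metropolis accept/reject on the total energy
        h0 = -log_posterior(theta) + 0.5 * jnp.sum(p0**2)
        h1 = -log_posterior(q) + 0.5 * jnp.sum(p**2)
        accept = jnp.log(jax.random.uniform(key_u)) < h0 - h1
        return jnp.where(accept, q, theta)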

LSST DESC 3x2pt Tomography Challenge

Zuntz, Lanusse, et al. (2021)
Description of the challenge
“Given (g)riz photometry, find a tomographic bin assignment method that optimizes a 3x2pt analysis.”
  • Metrics: Total Signal-to-Noise: $m_{SNR} = \sqrt{\mu^t \mathbf{C}^{-1} \mu}$ ; DETF Figure of Merit: $m_{FOM} = \frac{1}{\sqrt{ \det(\mathbf{F}^{-1})}}$
  • Idealized setting: assumes perfect training set. More info at: https://github.com/LSSTDESC/tomo_challenge
  • Conventional strategy: Use a photoz code to estimate redshifts, then bin galaxies based on their photoz.

  • Strategy with Differentiable Physics:
    • Introduce a parametric bin assignment function $f_\theta(x_{phot})$
    • Optimize $\theta$ by back-propagating through the challenge metrics.
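A heavily simplified sketch of that strategy: a softmax classifier gives each galaxy soft bin memberships, and its parameters are optimized by differentiating straight through the metric (here `snr_metric` is a placeholder for the challenge's 3x2pt signal-to-noise computation):

    import jax
    import jax.numpy as jnp

    def bin_probs(theta, x_phot):
        # toy parametric assignment f_theta: one linear layer + softmax over n_bins
        w, b = theta
        return jax.nn.softmax(x_phot @ w + b, axis=-1)

    def loss(theta, x_phot, z_true, snr_metric):
        # snr_metric maps soft bin assignments (and true redshifts) to the total S/N
        return -snr_metric(bin_probs(theta, x_phot), z_true)

    # gradient of the metric with respect to the assignment parameters:
    # grads = jax.grad(loss)(theta, x_phot, z_true, snr_metric)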

introducing FlowPM: Particle-Mesh Simulations in TensorFlow

Modi, Lanusse, Seljak (2020)

    import numpy as np
    import tensorflow as tf
    import flowpm

    # Defines integration steps
    stages = np.linspace(0.1, 1.0, 10, endpoint=True)

    initial_conds = flowpm.linear_field(32,       # size of the cube
                                        100,      # Physical size
                                        ipklin,   # Initial powerspectrum
                                        batch_size=16)

    # Sample particles and displace them by LPT
    state = flowpm.lpt_init(initial_conds, a0=0.1)

    # Evolve particles down to z=0
    final_state = flowpm.nbody(state, stages, 32)

    # Retrieve final density field
    final_field = flowpm.cic_paint(tf.zeros_like(initial_conds),
                                   final_state[0])
      												
  • Seamless interfacing with deep learning components
  • Mesh TensorFlow implementation for distribution on supercomputers









Mesh FlowPM: distributed, GPU-accelerated, and automatically differentiable simulations

  • We developed a Mesh TensorFlow implementation that can scale on GPU clusters (horovod+NCCL).


  • For a $2048^3$ simulation:
    • Distributed on 256 NVIDIA V100 GPUs
    • Runtime: 3 mins


  • Don't hesitate to reach out if you have a use case for model parallelism!

Conclusion

Conclusion

Merging Deep Learning with Physical Models for Bayesian Inference
$\Longrightarrow$ Makes Bayesian inference possible at scale and with non-trivial models!

  • Complement known physical models with data-driven components
    • Use data-driven generative model as prior for solving inverse problems.

  • Enables inference in high dimension from numerical simulators.
    • Automagically construct summary statistics.
    • Provides the density estimation tools needed.

  • Differentiable physical models for fast inference
    • Differentiability enables Bayesian inference over large scale simulations.
    • Models can directly be embedded alongside deep learning components.



Thank you !