Probabilistic Deep Learning for Weak Lensing

from Mass-Mapping to Cosmological Parameter Inference


François Lanusse










slides at eiffl.github.io/LAM2022

the $\Lambda$CDM view of the Universe















the Rubin Observatory Legacy Survey of Space and Time

  • 1000 images each night, 15 TB/night for 10 years

  • 18,000 square degrees, observed once every few days

  • Tens of billions of objects, each one observed $\sim1000$ times

Previous generation survey: SDSS




















Image credit: Peter Melchior

Current generation survey: DES




















Image credit: Peter Melchior

LSST precursor survey: HSC




















Image credit: Peter Melchior

The challenges of modern surveys

$\Longrightarrow$ Modern surveys will provide large volumes of high quality data

A Blessing
  • Unprecedented statistical power

  • Great potential for new discoveries

A Curse
  • Existing methods are reaching their limits at every step of the science analysis

  • Control of systematics is paramount
LSST forecast on dark energy parameters
$\Longrightarrow$ Dire need for novel analysis techniques to fully realize the potential of modern surveys.

Can AI solve all of our problems?

The AI bubble


astro-ph abstracts mentioning Deep Learning, CNN, or Neural Networks
Branched GAN model for deblending (Reiman & Göhre, 2018)
The issue with using deep learning as a black-box
  • No explicit control of noise, PSF, number of sources.
    • Model would have to be retrained for all observing configurations

  • No guarantees on the network output (e.g. flux preservation, artifacts)

  • No proper uncertainty quantification.

Focus of this talk

  • Deep Learning for astronomical data reduction

  • Bayesian Neural Networks

  • Physical Bayesian inference


This talk
Generic approach to uncertainty quantification and interpretability:
  • (Differentiable) Physical Forward Models
  • Deep Generative Models
  • Bayesian Inference

Probabilistic Weak Lensing Mass Mapping by
Neural Score Matching


Work in collaboration with:
Benjamin Remy, Niall Jeffrey


$\Longrightarrow$ Learn complex priors by Neural Score Estimation and sample from posterior with gradient-based MCMC.

Gravitational lensing

Galaxy shapes as estimators for gravitational shear
$$ e = \gamma + e_i \qquad \mbox{ with } \qquad e_i \sim \mathcal{N}(0, I)$$
  • We are trying to measure the ellipticity $e$ of galaxies as an estimator for the gravitational shear $\gamma$

Weak Lensing Mass-Mapping as an Inverse Problem

Shear $\gamma$
Convergence $\kappa$
$$\gamma_1 = \frac{1}{2} (\partial_1^2 - \partial_2^2) \ \Psi \quad;\quad \gamma_2 = \partial_1 \partial_2 \ \Psi \quad;\quad \kappa = \frac{1}{2} (\partial_1^2 + \partial_2^2) \ \Psi$$
$$\boxed{\gamma = \mathbf{P} \kappa}$$
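The operator $\mathbf{P}$ is diagonal in Fourier space (the Kaiser-Squires kernel), so a flat-sky version of the forward model takes only a few lines. A minimal JAX sketch on a periodic grid, ignoring masks and boundary effects:

    import jax.numpy as jnp

    def shear_from_convergence(kappa):
        # Kaiser-Squires in Fourier space: gamma_hat = (k1^2 - k2^2 + 2i k1 k2) / |k|^2 * kappa_hat
        n = kappa.shape[0]
        k1, k2 = jnp.meshgrid(jnp.fft.fftfreq(n), jnp.fft.fftfreq(n), indexing='ij')
        ksq = k1**2 + k2**2
        kernel = ((k1**2 - k2**2) + 2j * k1 * k2) / jnp.where(ksq > 0, ksq, 1.)  # k=0 mode dropped
        gamma = jnp.fft.ifft2(kernel * jnp.fft.fft2(kappa))
        return jnp.real(gamma), jnp.imag(gamma)   # (gamma1, gamma2)

This linear map plays the role of the $\mathbf{A}$ in the inverse problems below, and being written in JAX it is automatically differentiable with respect to $\kappa$.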

Illustration on the Dark Energy Survey (DES) Y3

Jeffrey, et al. (2021)

Linear inverse problems

$\boxed{y = \mathbf{A}x + n}$

$\mathbf{A}$ is known and encodes our physical understanding of the problem.
$\Longrightarrow$ When non-invertible or ill-conditioned, the inverse problem is ill-posed with no unique solution $x$
Deconvolution
Inpainting
Denoising

What Would a Bayesian Do?

$\boxed{y = \mathbf{A}x + n}$

The Bayesian view of the problem:
$$ p(x | y) \propto p(y | x) \ p(x) $$
  • $p(y | x)$ is the data likelihood, which contains the physics

  • $p(x)$ is the prior knowledge on the solution.


    With these concepts in hand we can:
  • Estimate for instance the Maximum A Posteriori solution:
    $$\hat{x} = \arg\max\limits_x \ \log p(y \ | \ x) + \log p(x)$$
  • Sample from the full posterior $p(x|y)$ with MCMC or Variational Inference methods.

How do you choose the prior ?

Classical examples of signal priors

Sparse
$$ \log p(x) \propto \ - \parallel \mathbf{W} x \parallel_1 $$
Gaussian $$ \log p(x) \propto \ - \frac{1}{2} x^t \mathbf{\Sigma}^{-1} x $$
Total Variation $$ \log p(x) \propto \ - \parallel \nabla x \parallel_1 $$

But what about this?




$\Longrightarrow$ Implicit prior $p(x)$ in the form of cosmological simulations.

Can we use Deep Learning to embed simulation-driven priors within a physical Bayesian model?

What is generative modeling?


  • The goal of generative modeling is to learn an implicit distribution $\mathbb{P}$ from which the training set $X = \{x_0, x_1, \ldots, x_n \}$ is drawn.

  • Usually, this means building a parametric model $\mathbb{P}_\theta$ that tries to be close to $\mathbb{P}$.


True $\mathbb{P}$

Samples $x_i \sim \mathbb{P}$

Model $\mathbb{P}_\theta$


  • Once trained, you can typically sample from $\mathbb{P}_\theta$ and/or evaluate the likelihood $p_\theta(x)$.

Why isn't it easy?


  • The curse of dimensionality puts all points far apart in high dimensions

Distance between pairs of points drawn from a Gaussian distribution.

  • Classical methods for estimating probability densities, e.g. Kernel Density Estimation (KDE), start to fail in high dimensions because of all the gaps
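The concentration of pairwise distances can be checked in a few lines; a small sketch drawing standard Gaussian samples in increasing dimension:

    import jax
    import jax.numpy as jnp

    key = jax.random.PRNGKey(0)
    for d in [2, 16, 128, 1024]:
        key, sub = jax.random.split(key)
        x = jax.random.normal(sub, (100, d))                      # 100 samples in dimension d
        dists = jnp.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
        dists = dists[jnp.triu_indices(100, k=1)]                 # unique pairs only
        print(d, jnp.mean(dists), jnp.std(dists) / jnp.mean(dists))
        # the mean distance grows like sqrt(2 d) while the relative spread shrinks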

The evolution of generative models




  • Deep Belief Network
    (Hinton et al. 2006)

  • Variational AutoEncoder
    (Kingma & Welling 2014)

  • Generative Adversarial Network
    (Goodfellow et al. 2014)

  • Wasserstein GAN
    (Arjovsky et al. 2017)



A visual Turing test


Fake PixelCNN samples

Real galaxies from SDSS

Getting started with Deep Priors: deep denoising example

$$ \boxed{{\color{Orchid} y} = {\color{SkyBlue} x} + n} $$
  • Let us assume we have access to examples of $ {\color{SkyBlue} x}$ without noise.

  • We learn the distribution of noiseless data $\log p_\theta(x)$ from samples using a deep generative model.

  • The solution should lie on the realistic data manifold, symbolized by the two-moons distribution.

    We want to solve for the Maximum A Posteriori solution:

    $$\arg \max\limits_{\color{SkyBlue} x} \ - \frac{1}{2} \parallel {\color{Orchid} y} - {\color{SkyBlue} x} \parallel_2^2 + \log p_\theta({\color{SkyBlue} x})$$ This can be done by gradient descent as long as one has access to the score function $\frac{d \log p(x)}{dx}$.
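A minimal sketch of that gradient ascent, assuming a learned prior score function `prior_score(x)` $\approx \nabla_x \log p_\theta(x)$ (a placeholder for the trained generative model):

    import jax
    import jax.numpy as jnp

    def map_denoise(y, prior_score, n_steps=500, tau=1e-2):
        # Gradient ascent on log p(x|y) = -1/2 ||y - x||^2 + log p_theta(x) + cst
        log_likelihood = lambda x: -0.5 * jnp.sum((y - x)**2)
        x = y                                            # start from the noisy observation
        for _ in range(n_steps):
            posterior_score = jax.grad(log_likelihood)(x) + prior_score(x)
            x = x + tau * posterior_score
        return x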

Writing down the convergence map log posterior

Remy, Lanusse, et al. (2022)
$$ \log p( \kappa | e) = \underbrace{\log p(e | \kappa)}_{\simeq -\frac{1}{2} \parallel e - P \kappa \parallel_\Sigma^2} + \log p(\kappa) +cst $$
  • The likelihood term is known analytically.
  • There is no closed-form expression for the full non-Gaussian prior of the convergence. However:
    • We do have access to samples of full implicit prior through simulations: $X = \{x_0, x_1, \ldots, x_n \}$ with $x_i \sim \mathbb{P}$
$\Longrightarrow$ Our strategy: Learn the prior from simulation, and then sample the full posterior.

The score is all you need!


  • Whether you are looking for the MAP or sampling with HMC or MALA, you only need access to the score of the posterior: $$\frac{d \log p(x | y)}{dx}$$
    • Gradient descent: $x_{t+1} = x_t + \tau \nabla_x \log p(x_t | y) $
    • Langevin algorithm: $x_{t+1} = x_t + \tau \nabla_x \log p(x_t | y) + \sqrt{2\tau} n_t$
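For instance, the Langevin update above becomes a one-liner once a `posterior_score(x)` function is available (the name is illustrative); a minimal unadjusted Langevin sampler in JAX:

    import jax
    import jax.numpy as jnp

    def langevin_chain(key, x0, posterior_score, n_steps=1000, tau=1e-4):
        # x_{t+1} = x_t + tau * grad log p(x_t | y) + sqrt(2 tau) n_t,   n_t ~ N(0, I)
        x = x0
        for _ in range(n_steps):
            key, subkey = jax.random.split(key)
            x = x + tau * posterior_score(x) + jnp.sqrt(2 * tau) * jax.random.normal(subkey, x.shape)
        return x

MALA adds a Metropolis accept/reject step on top of this proposal.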



Neural Score Estimation by Denoising Score Matching

  • Denoising Score Matching: An optimal Gaussian denoiser learns the score of a given distribution.
    • If $x \sim \mathbb{P}$ is corrupted by additive Gaussian noise $u \sim \mathcal{N}(0, \sigma^2 \mathbf{I})$ to yield $$x^\prime = x + u$$
    • Let's consider a denoiser $r_\theta$ trained under an $\ell_2$ loss: $$\mathcal{L}=\parallel x - r_\theta(x^\prime, \sigma) \parallel_2^2$$
    • The optimal denoiser $r_{\theta^\star}$ verifies: $$\boxed{\boldsymbol{r}_{\theta^\star}(\boldsymbol{x}', \sigma) = \boldsymbol{x}' + \sigma^2 \nabla_{\boldsymbol{x}} \log p_{\sigma^2}(\boldsymbol{x}')}$$
Illustration: noisy input $\boldsymbol{x}'$, true signal $\boldsymbol{x}$, denoised output $\boldsymbol{r}^\star(\boldsymbol{x}', \sigma)$, and residual $\boldsymbol{x}'- \boldsymbol{r}^\star(\boldsymbol{x}', \sigma)$

Training a Neural Score Estimator in practice




A standard UNet
  • We use a very standard residual UNet, and we adopt a residual score matching loss: $$ \mathcal{L}_{DSM} = \underset{\boldsymbol{x} \sim P}{\mathbb{E}} \underset{\begin{subarray}{c} \boldsymbol{u} \sim \mathcal{N}(0, I) \\ \sigma_s \sim \mathcal{N}(0, s^2) \end{subarray}}{\mathbb{E}} \parallel \boldsymbol{u} + \sigma_s \boldsymbol{r}_{\theta}(\boldsymbol{x} + \sigma_s \boldsymbol{u}, \sigma_s) \parallel_2^2$$ $\Longrightarrow$ direct estimator of the score $\nabla \log p_\sigma(x)$

  • Lipschitz regularization to improve robustness:

    Without regularization
    With regularization
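A minimal sketch of the residual score matching loss $\mathcal{L}_{DSM}$ above, with the score network `r_theta(params, x, sigma)` passed in as a callable (names and signature are illustrative, not the paper's code):

    import jax
    import jax.numpy as jnp

    def dsm_loss(params, key, x_batch, r_theta, s=1.0):
        # x_batch: convergence maps of shape [batch, ny, nx, 1]
        key_u, key_s = jax.random.split(key)
        u = jax.random.normal(key_u, x_batch.shape)                        # u ~ N(0, I)
        # noise level per example, taken positive here for simplicity
        sigma = jnp.abs(s * jax.random.normal(key_s, (x_batch.shape[0], 1, 1, 1)))
        residual = u + sigma * r_theta(params, x_batch + sigma * u, sigma)
        return jnp.mean(jnp.sum(residual**2, axis=(1, 2, 3)))              # || u + sigma r ||^2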

Efficient sampling by Annealed HMC

  • Even with gradients, sampling in high dimension is difficult!
    $\Longrightarrow$ Use a parallel annealing strategy to effectively sample from the full distribution.

  • We use the fact that our score network $\mathbf{r}_\theta(x, \sigma)$ is learning the score of a noise-convolved distribution, $\nabla \log p_\sigma$
    $$\sigma_1 > \sigma_2 > \sigma_3 > \sigma_4 $$
  • Run many HMC chains in parallel, progressively annealing $\sigma$ to 0, and keep the last point of each chain as an independent sample.

Example of one chain during annealing

Illustration on $\kappa$-TNG simulations


True convergence map
Traditional Kaiser-Squires
Wiener Filter
Posterior Mean (ours)



Posterior samples

Validating Bayesian Posterior in Gaussian case

Reconstruction of the HST/ACS COSMOS field


Massey et al. (2007)
Remy et al. (2022) Posterior mean
Remy et al. (2022) Posterior samples

Other examples of Deep Priors


  • Hybrid Physical-Deep Learning Model for Astronomical Inverse Problems
    F. Lanusse, P. Melchior, F. Moolekamp


$\mathcal{L} = \frac{1}{2} \parallel \mathbf{\Sigma}^{-1/2} (\ Y - P \ast A S \ ) \parallel_2^2 - \sum_{i=1}^K \log p_{\theta}(S_i)$



  • Denoising Score-Matching for Uncertainty Quantification in Inverse Problems
    Z. Ramzi, B. Remy, F. Lanusse, P. Ciuciu, J.L. Starck

Simulation-Based Inference
by Neural Summarisation and Density Estimation


Work in collaboration with Niall Jeffrey, Justin Alsing



$\Longrightarrow$ Learn an implicit data likelihood from simulations.

limits of traditional cosmological inference

HSC cosmic shear power spectrum
HSC Y1 constraints on $(S_8, \Omega_m)$
(Hikage et al. 2018)
  • Measure the ellipticity $\epsilon = \epsilon_i + \gamma$ of all galaxies
    $\Longrightarrow$ Noisy tracer of the weak lensing shear $\gamma$

  • Compute summary statistics based on 2pt functions,
    e.g. the power spectrum

  • Run an MCMC to recover a posterior on model parameters, using an analytic likelihood $$ p(\theta | x ) \propto \underbrace{p(x | \theta)}_{\mathrm{likelihood}} \ \underbrace{p(\theta)}_{\mathrm{prior}}$$
Main limitation: the need for an explicit likelihood
We can only compute the likelihood for simple summary statistics and on large scales

$\Longrightarrow$ We are dismissing a large amount of information!

A different road: forward modeling

  • Instead of trying to analytically evaluate the likelihood, let us build a forward model of the observables.




$\Longrightarrow$ Learn an implicit likelihood $p(x|\theta)$ given by the simulator as our physical model




End-to-end framework for likelihood-free parameter inference with DES SV

Jeffrey, Alsing, Lanusse (2021)

Suite of N-body + raytracing simulations: $\mathcal{D}$

Our method

A two-step approach to inference
  • Automatically learn an optimal low-dimensional summary statistic $$y = f_\varphi(\kappa_{KS}) $$

  • Use Neural Density Estimation to build an estimate $p_\phi$ of the likelihood function $p(y \ | \ \theta)$ (Neural Likelihood Estimation)

  • Run a conventional Markov Chain Monte Carlo sampling

Learning summary statistics by Variational Mutual Information Maximization



  • Mutual information between $X$ and $Y$:
    “"amount of information" obtained about one random variable through observing the other random variable”

  • Given a parametric summarizing function $y = f_\phi(\kappa(\theta))$, optimizing $f_\phi$ can be done by maximizing the mutual information $I(y; \theta)$.

  • In practice, $f_\phi$ is a CNN, trained to maximize a variational lower bound on the mutual information: $$ I(y ; \theta) \ \ge \ \mathbb{E}_{y, \theta} [ \log q_\phi(\theta | y) ] + H(\Theta) $$
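As a sketch of this objective: a compressor network produces $y$, a simple conditional Gaussian plays the role of $q_\phi(\theta|y)$, and we minimize $-\mathbb{E}[\log q_\phi(\theta|y)]$ (all function names here are illustrative; the actual analysis uses the networks described on the next slide):

    import jax.numpy as jnp

    def neg_mi_lower_bound(params, kappa_batch, theta_batch, compressor, gaussian_head):
        # y = f_phi(kappa): low-dimensional summary, e.g. from a CNN
        y = compressor(params, kappa_batch)
        # q_phi(theta | y) modelled as a diagonal Gaussian with network-predicted moments
        mu, log_sigma = gaussian_head(params, y)
        log_q = -0.5 * jnp.sum(((theta_batch - mu) / jnp.exp(log_sigma))**2
                               + 2 * log_sigma + jnp.log(2 * jnp.pi), axis=-1)
        # minimizing this maximizes E[log q(theta|y)], a lower bound on I(y; theta) up to H(theta)
        return -jnp.mean(log_q)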

deep residual networks for lensing maps compression


  • Deep Residual Network $y = f_\phi(x)$ followed by neural density estimator $q_\phi(\theta | y)$

  • Training on weak lensing maps simulated for different cosmologies




  • Training by Variational Mutual Information Maximization: $$\mathbb{E}_{(x, \theta) \in \mathcal{D}} [ \log q_\phi(\theta | f_\phi(x) ) ]$$

Estimating the likelihood by Neural Density Estimation


$\Longrightarrow$ We cannot assume a Gaussian likelihood for the summary $y = f_\phi(\kappa)$ but we can learn $p(y | \theta)$: Neural Likelihood Estimation.


Dinh et al. 2016
Neural Likelihood Estimation by Normalizing Flow
  • We use a conditional Normalizing Flow to build an explicit model for the likelihood function $$ \log p_\varphi (y | \theta)$$

  • In practice we use the pyDELFI package and an ensemble of NDEs for robustness.

  • Once learned, we can use the likelihood as part of a conventional MCMC chain
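Once trained, the flow simply provides a log-likelihood callable for the sampler; a minimal random-walk Metropolis sketch, where `flow_log_likelihood` and `log_prior` are placeholders for the trained NDE and the prior:

    import jax
    import jax.numpy as jnp

    def metropolis(key, theta0, log_post, n_samples=10000, step=0.02):
        samples, theta, logp = [], theta0, log_post(theta0)
        for _ in range(n_samples):
            key, k1, k2 = jax.random.split(key, 3)
            prop = theta + step * jax.random.normal(k1, theta.shape)      # random-walk proposal
            logp_prop = log_post(prop)
            if jnp.log(jax.random.uniform(k2)) < logp_prop - logp:        # accept/reject
                theta, logp = prop, logp_prop
            samples.append(theta)
        return jnp.stack(samples)

    # log_post = lambda t: flow_log_likelihood(y_obs, t) + log_prior(t)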


Parameter constraints from DES SV data

GPU accelerated and Automatically Differentiable Physics




Collection of collaborative projects, join us :-)


the hammer behind the Deep Learning revolution: Automatic Differentiation

  • Automatic differentiation allows you to compute analytic derivatives of arbitrary expressions:
    If I form the expression $y = a * x + b$, it is decomposed into fundamental ops: $$ y = u + b \qquad u = a * x $$ then gradients can be obtained by the chain rule: $$\frac{\partial y}{\partial x} = \frac{\partial y}{\partial u} \frac{ \partial u}{\partial x} = 1 \times a = a$$

  • This is a fundamental tool in Machine Learning, and autodiff frameworks include TensorFlow and PyTorch.


Enters JAX: NumPy + Autograd + GPU
  • JAX follows the NumPy api!
    
    import jax.numpy as np
          									
  • Arbitrary order derivatives
  • Accelerated execution on GPU and TPU
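For instance, the toy chain-rule example of the previous slide, with higher-order derivatives obtained by simply composing `jax.grad`:

    import jax

    def y(x, a=3.0, b=1.0):
        return a * x + b

    print(jax.grad(y)(2.0))              # dy/dx = a = 3.0
    print(jax.grad(jax.grad(y))(2.0))    # d2y/dx2 = 0.0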

jax-cosmo: Finally a differentiable cosmology library, and it's in JAX!


    import jax.numpy as np
    import jax_cosmo as jc

    # Defining a Cosmology
    cosmo = jc.Planck15()

    # Define a redshift distribution with smail_nz(a, b, z0)
    nz = jc.redshift.smail_nz(1., 2., 1.)

    # Build a lensing tracer with a single redshift bin
    probe = jc.probes.WeakLensing([nz])

    # Compute angular Cls for some ell
    ell = np.logspace(0.1, 3)
    cls = jc.angular_cl.angular_cl(cosmo, ell, [probe])
      									
Current main features
  • Weak Lensing and Number counts probes
  • Eisenstein & Hu (1998) power spectrum + halofit
  • Angular $C_\ell$ under Limber approximation
$\Longrightarrow$ 3x2pt DES Y1 capable

Validating against the DESC Core Cosmology Library

let's compute a Fisher matrix


$$F = - \mathbb{E}_{p(x | \theta)}[ H_\theta(\log p(x| \theta)) ] $$

    import jax
    import jax.numpy as np
    import jax_cosmo as jc

    # .... define probes, and load a data vector

    def gaussian_likelihood( theta ):
      # Build the cosmology for given parameters
      cosmo = jc.Planck15(Omega_c=theta[0], sigma8=theta[1])

      # Compute mean and covariance
      mu, cov = jc.angular_cl.gaussian_cl_covariance_and_mean(cosmo,
                                                        ell, probes)
      # returns the log-likelihood of the data under the model
      return jc.likelihood.gaussian_likelihood(data, mu, cov)

    # Fisher matrix in just one line:
    F = - jax.hessian(gaussian_likelihood)(theta)
      					


  • No derivatives were harmed by finite differences in the computation of this Fisher!
  • Only a small additional compute time compared to one forward evaluation of the model

Inference becomes fast and scalable

  • Current cosmological MCMC chains take days, and typically require access to large computer clusters.

  • Gradients of the log posterior are required for modern efficient and scalable inference techniques:
    • Variational Inference
    • Hamiltonian Monte-Carlo

  • In jax-cosmo, we can trivially obtain exact gradients:
    
    def log_posterior( theta ):
        return gaussian_likelihood( theta ) + log_prior(theta)

    score = jax.grad(log_posterior)(theta)
    									      					

  • On a DES Y1 analysis, we find convergence in 70,000 samples with vanilla HMC, 140,000 with Metropolis-Hastings

DES Y1 posterior, jax-cosmo HMC vs Cobaya MH
(credit: Joe Zuntz)
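For reference, "vanilla HMC" here means the textbook leapfrog scheme driven entirely by `jax.grad(log_posterior)`; a generic single-step sketch (not the actual analysis code):

    import jax
    import jax.numpy as jnp

    def hmc_step(key, theta, log_posterior, step_size=0.01, n_leapfrog=20):
        grad_fn = jax.grad(log_posterior)
        key_p, key_u = jax.random.split(key)
        p0 = jax.random.normal(key_p, theta.shape)             # resample momentum (identity mass)

        # Leapfrog integration of the Hamiltonian dynamics
        q, p = theta, p0 + 0.5 * step_size * grad_fn(theta)
        for _ in range(n_leapfrog - 1):
            q = q + step_size * p
            p = p + step_size * grad_fn(q)
        q = q + step_size * p
        p = p + 0.5 * step_size * grad_fn(q)

        # Metropolis accept/reject on the total energy
        h0 = -log_posterior(theta) + 0.5 * jnp.sum(p0**2)
        h1 = -log_posterior(q) + 0.5 * jnp.sum(p**2)
        accept = jnp.log(jax.random.uniform(key_u)) < h0 - h1
        return jnp.where(accept, q, theta)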

LSST DESC 3x2pt Tomography Challenge

Zuntz, Lanusse, et al. (2021)
Description of the challenge
“Given (g)riz photometry, find a tomographic bin assignment method that optimizes a 3x2pt analysis.”
  • Metrics: Total Signal-to-Noise: $m_{SNR} = \sqrt{\mu^t \mathbf{C}^{-1} \mu}$ ; DETF Figure of Merit: $m_{FOM} = \frac{1}{\sqrt{ \det(\mathbf{F}^{-1})}}$
  • Idealized setting: assumes perfect training set. More info at: https://github.com/LSSTDESC/tomo_challenge
  • Conventional strategy: Use a photoz code to estimate redshifts, then bin galaxies based on their photoz.

  • Strategy with Differentiable Physics:
    • Introduce a parametric bin assignment function $f_\theta(x_{phot})$
    • Optimize $\theta$ by back-propagating through the challenge metrics.
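A heavily simplified sketch of that strategy: a softmax classifier gives each galaxy soft bin memberships, and its parameters are optimized by differentiating straight through the metric (here `snr_metric` is a placeholder for the challenge's 3x2pt signal-to-noise computation):

    import jax
    import jax.numpy as jnp

    def bin_probs(theta, x_phot):
        # toy parametric assignment f_theta: one linear layer + softmax over n_bins
        w, b = theta
        return jax.nn.softmax(x_phot @ w + b, axis=-1)

    def loss(theta, x_phot, z_true, snr_metric):
        # snr_metric maps soft bin assignments (and true redshifts) to the total S/N
        return -snr_metric(bin_probs(theta, x_phot), z_true)

    # gradient of the metric with respect to the assignment parameters:
    # grads = jax.grad(loss)(theta, x_phot, z_true, snr_metric)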

introducing FlowPM: Particle-Mesh Simulations in TensorFlow

Modi, Lanusse, Seljak (2020)

    import numpy as np
    import tensorflow as tf
    import flowpm

    # Defines integration steps
    stages = np.linspace(0.1, 1.0, 10, endpoint=True)

    initial_conds = flowpm.linear_field(32,       # size of the cube
                                        100,      # Physical size
                                        ipklin,   # Initial powerspectrum
                                        batch_size=16)

    # Sample particles and displace them by LPT
    state = flowpm.lpt_init(initial_conds, a0=0.1)

    # Evolve particles down to z=0
    final_state = flowpm.nbody(state, stages, 32)

    # Retrieve final density field
    final_field = flowpm.cic_paint(tf.zeros_like(initial_conds),
                                   final_state[0])
      												
  • Seamless interfacing with deep learning components
  • Mesh TensorFlow implementation for distribution on supercomputers









Mesh FlowPM: distributed, GPU-accelerated, and automatically differentiable simulations

  • We developed a Mesh TensorFlow implementation that can scale on GPU clusters (horovod+NCCL).


  • For a $2048^3$ simulation:
    • Distributed on 256 NVIDIA V100 GPUs
    • Runtime: 3 mins


  • Don't hesitate to reach out if you have a use case for model parallelism!

Conclusion

Conclusion

Merging Deep Learning with Physical Models for Bayesian Inference
$\Longrightarrow$ Makes Bayesian inference possible at scale and with non-trivial models!

  • Complement known physical models with data-driven components
    • Use data-driven generative model as prior for solving inverse problems.

  • Enables inference in high dimension from numerical simulators.
    • Automagically construct summary statistics.
    • Provides the density estimation tools needed.

  • Differentiable physical models for fast inference
    • Differentiability enables Bayesian inference over large scale simulations.
    • Models can directly be embedded alongside deep learning components.



Thank you !