# Hybrid Physical-Deep Learning Models

## From Galaxy Morphology to Large Scale Structure

### François Lanusse

University of California, Berkeley

### The Large Synoptic Survey Telescope

• 1000 images each night, 15 TB/night for 10 years

• 18,000 square degrees, observed once every few days

• Tens of billions of objects, each one observed $\sim1000$ times

Example images (credit: Peter Melchior): previous generation survey SDSS, current generation survey DES, LSST precursor survey HSC.

## Generative Models for Galaxy Image Simulation

### Work in collaboration with Rachel Mandelbaum, Siamak Ravanbakhsh, Barnabas Poczos, Peter Freeman

Lanusse et al., in prep
Ravanbakhsh, Lanusse, et al. (2017)

### The weak lensing shape measurement problem

Shape measurement biases
The measured ellipticity $e$ is typically a biased tracer of the underlying shear $\gamma$: $$\langle e \rangle = (1 + m)\,\gamma + c$$

### Simulation and calibration strategy

The GREAT3 approach
• Input galaxies from deep HST/ACS COSMOS images ($I < 25.2$ mag)

• Apply a range of PSFs and noise levels sampled from the survey

• Measure response of shape measurement to a known shear and estimate $m$ and $c$
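
To make the last step concrete, here is a minimal sketch (not the GREAT3 code) of how $m$ and $c$ could be recovered from simulations with known input shears, using hypothetical, purely illustrative arrays of applied shears and mean measured ellipticities:

```python
import numpy as np

# Hypothetical, illustrative inputs (not real measurements):
#   g_true -- shear values applied to the simulated images
#   e_meas -- mean measured ellipticities at each applied shear
g_true = np.array([-0.02, -0.01, 0.0, 0.01, 0.02])
e_meas = np.array([-0.0206, -0.0104, 0.0002, 0.0098, 0.0205])

# Fit <e> = (1 + m) * gamma + c to recover the multiplicative and additive biases
slope, c = np.polyfit(g_true, e_meas, deg=1)
m = slope - 1.0
print(f"m = {m:.4f}, c = {c:.5f}")
```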

### The Bayesian hierarchical modeling strategy

Hierarchical forward model relating the shear field $\boldsymbol{\gamma}$ to the observed pixel data $\mathbf{x}$.

• The root of the problem is that the likelihood $p(x | \gamma)$ is a complicated beast.

• Provides an explicit description of the pixel-level data in terms of simple, tractable distributions
(Schneider et al. 2015)

### Impact of galaxy morphology

In both cases we are building a forward model of the data; how accurate does this model need to be?

Mandelbaum, et al. (2013), Mandelbaum, et al. (2014)

$\Longrightarrow$ We cannot measure shear without an accurate model of galaxy morphology

## Can we learn a model for galaxy morphologies from the data itself?

### The evolution of generative models

• Deep Belief Network
(Hinton et al. 2006)

• Variational AutoEncoder
(Kingma & Welling 2014)

• Generative Adversarial Network
(Goodfellow et al. 2014)

• Wasserstein GAN
(Arjovsky et al. 2017)

### Complications specific to astronomical images: spot the differences!

CelebA

HSC PDR-2 wide

• There is noise
• We have a Point Spread Function

### Combining deep generative and physical models

Probabilistic model
Dataset of $N$ i.i.d. samples $\{x_i \}$ generated from $$x \sim p_{\theta}(x | z, \Sigma, \Pi ) \ p(z)$$
• $z$ is a set of latent variables
• $\Pi$ is the PSF, $\Sigma$ is the noise covariance
• Under a Gaussian noise model:
$\qquad p_{\theta}(x | z, \Sigma, \Pi)=\mathcal{N}( \Pi \ast g_\theta(z), \Sigma)$
$\Longrightarrow$ Decouples the morphology model from the observing conditions.
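
A rough illustration (not the exact implementation used in this work) of the Gaussian likelihood above in TensorFlow, where `g_theta` stands for the decoder network and the PSF image is assumed to be centered at pixel (0, 0):

```python
import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions

def log_likelihood(x, z, psf, sigma, g_theta):
    """log p(x | z, Sigma, Pi) under the Gaussian noise model above.

    A sketch: `g_theta` is a decoder mapping the latent code z to a
    noiseless, deconvolved surface brightness image, and the noise
    covariance Sigma is assumed diagonal with standard deviation `sigma`.
    """
    model = g_theta(z)
    # Pi * g_theta(z): convolution with the PSF, done in Fourier space
    convolved = tf.signal.irfft2d(tf.signal.rfft2d(model) * tf.signal.rfft2d(psf))
    # Independent Gaussian pixel noise
    return tf.reduce_sum(tfd.Normal(loc=convolved, scale=sigma).log_prob(x))
```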

### How to train your ~~dragon~~ model

• Training the generative model amounts to finding $\theta_\star$ that maximizes the marginal likelihood of the model: $$p_\theta(x | \Sigma, \Pi) = \int \mathcal{N}( \Pi \ast g_\theta(z), \Sigma) \ p(z) \ dz$$
$\Longrightarrow$ This is generally intractable

• Efficient training of the parameters $\theta$ is made possible by Amortized Variational Inference.
Auto-Encoding Variational Bayes (Kingma & Welling, 2014)
• We introduce a parametric distribution $q_\phi(z | x, \Pi, \Sigma)$ which aims to model the posterior $p_{\theta}(z | x, \Pi, \Sigma)$.

• Working out the KL divergence between these two distributions leads to: $$\log p_\theta(x | \Sigma, \Pi) \quad \geq \quad - \mathbb{D}_{KL}\left( q_\phi(z | x, \Sigma, \Pi) \parallel p(z) \right) \quad + \quad \mathbb{E}_{z \sim q_{\phi}(. | x, \Sigma, \Pi)} \left[ \log p_\theta(x | z, \Sigma, \Pi) \right]$$ $\Longrightarrow$ This is the Evidence Lower-Bound, which is differentiable with respect to $\theta$ and $\phi$.
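
A minimal sketch of a single-sample Monte-Carlo estimate of this bound, reusing the `log_likelihood` helper above and a hypothetical `encoder` network that returns the mean and log-variance of $q_\phi(z | x, \Sigma, \Pi)$:

```python
import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions

def elbo(x, psf, sigma, encoder, g_theta, latent_dim=32):
    """Single-sample estimate of the Evidence Lower-Bound for one stamp."""
    mu, log_var = encoder(x, psf, sigma)
    q_z = tfd.MultivariateNormalDiag(loc=mu, scale_diag=tf.exp(0.5 * log_var))
    p_z = tfd.MultivariateNormalDiag(loc=tf.zeros(latent_dim))
    z = q_z.sample()                      # reparameterized sample
    # Reconstruction error: Gaussian likelihood of the PSF-convolved model
    reconstruction = log_likelihood(x, z, psf, sigma, g_theta)
    # Code regularization: KL(q_phi || p)
    kl = q_z.kl_divergence(p_z)
    return reconstruction - kl
```

Both terms are differentiable with respect to the encoder and decoder weights, so the bound can be maximized directly by stochastic gradient descent.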

### The famous Variational Auto-Encoder

$$\log p_\theta(x| \Sigma, \Pi ) \geq - \underbrace{\mathbb{D}_{KL}\left( q_\phi(z | x, \Sigma, \Pi) \parallel p(z) \right)}_{\mbox{code regularization}} + \underbrace{\mathbb{E}_{z \sim q_{\phi}(. | x, \Sigma, \Pi)} \left[ \log p_\theta(x | z, \Sigma, \Pi) \right]}_{\mbox{reconstruction error}}$$

### Illustration on HST/ACS COSMOS images

Fitting observations with VAE and Bulge+Disk parametric model.
• Training set: GalSim COSMOS HST/ACS postage stamps
• 80,000 deblended galaxies from the $I < 25.2$ sample
• Drawn on 128x128 stamps at 0.03 arcsec resolution
• Each stamp comes with:
• PSF
• Noise power spectrum
• Bulge+Disk parametric fit

• Auto-Encoder model:
• Deep residual autoencoder:
7 stages of 2 ResNet blocks each
• Dense bottleneck of size 32.
• Outputs a positive, noiseless, deconvolved galaxy surface brightness.
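
For illustration, a compact Keras sketch of an encoder with these dimensions (the filter counts, pooling, and block layout are hypothetical choices, not the exact architecture; the decoder mirrors it and ends with a positivity-enforcing activation such as softplus):

```python
import tensorflow as tf
from tensorflow.keras import layers

def resnet_block(x, filters):
    """A plain residual block (hypothetical layer choices)."""
    h = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
    h = layers.Conv2D(filters, 3, padding='same')(h)
    if x.shape[-1] != filters:
        x = layers.Conv2D(filters, 1, padding='same')(x)
    return layers.ReLU()(layers.Add()([x, h]))

def build_encoder(latent_dim=32):
    inp = layers.Input((128, 128, 1))          # COSMOS postage stamp
    h, filters = inp, 16
    for stage in range(7):                     # 7 stages of 2 ResNet blocks each
        h = resnet_block(h, filters)
        h = resnet_block(h, filters)
        h = layers.AveragePooling2D(2)(h)      # 128 -> 1 after 7 halvings
        filters = min(filters * 2, 256)
    h = layers.Flatten()(h)
    # Dense bottleneck of size 32: parameters of q_phi(z | x)
    mu = layers.Dense(latent_dim)(h)
    log_var = layers.Dense(latent_dim)(h)
    return tf.keras.Model(inp, [mu, log_var])
```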

### Sampling from the model

Whoops... what's going on?

### Tradeoff between code regularization and image quality

$$\log p_\theta(x| \Sigma, \Pi ) \geq - \underbrace{\mathbb{D}_{KL}\left( q_\phi(z | x, \Sigma, \Pi) \parallel p(z) \right)}_{\mbox{code regularization}} + \underbrace{\mathbb{E}_{z \sim q_{\phi}(. | x, \Sigma, \Pi)} \left[ \log p_\theta(x | z, \Sigma, \Pi) \right]}_{\mbox{reconstruction error}}$$

### Latent space modeling with Normalizing Flows

$\Longrightarrow$ All we need to do is sample from the aggregate posterior of the data instead of sampling from the prior.

Dinh et al. 2016
Normalizing Flows
• Assumes a bijective mapping between data space $x$ and latent space $z$ with prior $p(z)$: $$z = f_{\theta} ( x ) \qquad \mbox{and} \qquad x = f^{-1}_{\theta}(z)$$
• Admits an explicit marginal likelihood: $$\log p_\theta(x) = \log p\left(f_\theta(x)\right) + \log \left| \det \frac{\partial f_\theta}{\partial x}(x) \right|$$
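
A toy example of this construction with TensorFlow Probability (assuming a recent release; note that TFP's bijector maps the base sample $z$ to the data $x$, so it plays the role of $f_\theta^{-1}$):

```python
import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions
tfb = tfp.bijectors

# A fixed affine bijector keeps the example tiny; any invertible, trainable
# bijector (RealNVP, MAF, ...) follows the same pattern.
flow = tfd.TransformedDistribution(
    distribution=tfd.MultivariateNormalDiag(loc=tf.zeros(2)),     # p(z)
    bijector=tfb.Chain([tfb.Shift([1.0, -0.5]), tfb.Scale([2.0, 0.3])]))

x = tf.constant([[0.5, 0.1]])
log_px = flow.log_prob(x)    # log p(f(x)) + log |det df/dx|, evaluated exactly
samples = flow.sample(4)     # x = f^{-1}(z) with z ~ p(z)
```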

### Conditional sampling in VAE latent space

• We build a latent space model $p_\varphi(z)$ using a Masked Autoregressive Flow (MAF) (Papamakarios, et al. 2017)

• Since we are learning to sample from the latent space, we can also learn to sample conditionally: $$p_\varphi(z | y)$$

• Here we learn to sample images conditioned on:
• Size: half-light radius $r$
• Brightness: I band magnitude $mag\_auto$
• Redshift: COSMOS photometric redshift $zphot$
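
A minimal sketch of such a conditional latent-space model using TensorFlow Probability's MAF implementation (the conditional plumbing shown assumes a recent TFP release; the actual model used in this work may differ):

```python
import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions
tfb = tfp.bijectors

latent_dim, n_cond = 32, 3   # VAE code size; conditioning on (r, mag_auto, zphot)

# Autoregressive network whose shift/log-scale outputs also see the
# conditioning variables y.
made = tfb.AutoregressiveNetwork(
    params=2, event_shape=[latent_dim], hidden_units=[128, 128],
    conditional=True, conditional_event_shape=[n_cond])

latent_flow = tfd.TransformedDistribution(
    distribution=tfd.Sample(tfd.Normal(0., 1.), sample_shape=[latent_dim]),
    bijector=tfb.MaskedAutoregressiveFlow(made))

# Training: maximize log p_varphi(z | y) over latent codes z from the VAE encoder
# loss = -tf.reduce_mean(
#     latent_flow.log_prob(z, bijector_kwargs={'conditional_input': y}))

# Sampling codes for requested galaxy properties y_new, to be decoded by g_theta
# z_new = latent_flow.sample(16, bijector_kwargs={'conditional_input': y_new})
```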

### Testing conditional sampling

$\Longrightarrow$ We can successfully condition galaxy generation.

### Takeaway message

• We have combined physical and deep learning components to model observed noisy and PSF-convolved galaxy images.
$\Longrightarrow$ This framework can handle multi-band, multi-resolution, multi-instrument data.

• We are overcoming the limitations of standard VAEs with an additional latent space model.
$\Longrightarrow$ Can produce sharp and meaningful images.

• We demonstrate conditional sampling of galaxy light profiles
$\Longrightarrow$ Image simulation can be combined with larger survey simulation efforts.

GalSim Hub

## Differentiable models of the Large-Scale Structure

### Work in collaboration with Chirag Modi, Uroš Seljak

Modi, Lanusse, et al., in prep
Modi, et al. (2018)

Figures: HSC cosmic shear power spectrum and HSC Y1 constraints on $(S_8, \Omega_m)$
(Hikage, ..., Lanusse, et al. 2018)
• Measure the ellipticity $\epsilon = \epsilon_i + \gamma$ of all galaxies, where $\epsilon_i$ is the intrinsic ellipticity
$\Longrightarrow$ Noisy tracer of the weak lensing shear $\gamma$

• Compute summary statistics based on 2pt functions,
e.g. the power spectrum

• Run an MCMC to recover a posterior on model parameters, using an analytic likelihood $$p(\theta | x ) \propto \underbrace{p(x | \theta)}_{\mathrm{likelihood}} \ \underbrace{p(\theta)}_{\mathrm{prior}}$$
Main limitation: the need for an explicit likelihood
We can only compute the likelihood for simple summary statistics and on large scales

$\Longrightarrow$ We are dismissing most of the information!

### A different road: forward modeling

• Instead of trying to analytically evaluate the likelihood, let us build a forward model of the observables.

• Each component of the model is now tractable, but at the cost of a large number of latent variables.

$\Longrightarrow$ How to perform efficient inference in such a large number of dimensions?

A non-exhaustive list of methods:
• Hamiltonian Monte-Carlo
• Variational Inference
• MAP+Laplace
• Gold Mining
• Dimensionality reduction by Fisher-Information Maximization
What do they all have in common?
$\Longrightarrow$ They all require fast, accurate, differentiable forward simulations
(Schneider et al. 2015)

## How do we simulate the Universe in a fast and differentiable way?

### Forward Models in Cosmology

Linear Field $\xrightarrow{\ \text{N-body simulations}\ }$ Final Dark Matter

### Introducing FlowPM: Particle-Mesh Simulations in TensorFlow


```python
import numpy as np
import tensorflow as tf
import flowpm

# Defines integration steps
stages = np.linspace(0.1, 1.0, 10, endpoint=True)

# ipklin: interpolated linear matter power spectrum, assumed defined above
initial_conds = flowpm.linear_field(32,           # Size of the cube
                                    100,          # Physical size
                                    ipklin,       # Initial power spectrum
                                    batch_size=16)

# Sample particles and displace them by LPT
state = flowpm.lpt_init(initial_conds, a0=0.1)

# Evolve particles down to z=0
final_state = flowpm.nbody(state, stages, 32)

# Retrieve final density field
final_field = flowpm.cic_paint(tf.zeros_like(initial_conds),
                               final_state[0])

with tf.Session() as sess:
    sim = sess.run(final_field)
```

• Seamless interfacing with deep learning components
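
Because the whole simulation is expressed as a TensorFlow graph, gradients propagate all the way back to the initial conditions, which is exactly what MAP optimization or Hamiltonian Monte-Carlo over the initial field requires. A minimal sketch continuing from the snippet above, with a hypothetical scalar summary statistic:

```python
# Differentiate a (hypothetical) scalar summary of the simulated density
# field with respect to the initial linear field.
loss = tf.reduce_sum(final_field**2)
grad = tf.gradients(loss, initial_conds)[0]

with tf.Session() as sess:
    grad_value = sess.run(grad)
```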

### Forward Models in Cosmology

Linear Field $\xrightarrow{\ \text{N-body simulations (FlowPM)}\ }$ Final Dark Matter $\xrightarrow{\ \text{group finding algorithms}\ }$ Dark Matter Halos $\xrightarrow{\ \text{semi-analytic \& distribution models}\ }$ Galaxies

### Example of Extending Dark Matter Simulations with Deep Learning

Modi et al. 2018

### The practical challenge for inference at scale

• Simulations of scientifically interesting sizes do not fit in a single GPU's RAM
e.g. $128^3$ is operational, but $1024^3$ is needed for survey volumes
$\Longrightarrow$ We need a distributed Machine Learning Framework

• The most common form of distribution is data parallelism $\Longrightarrow$ has reached exascale on scientific deep learning applications

• What we need is model parallelism in HPC environments

$\Longrightarrow$ We have started investigating Mesh TensorFlow at NERSC and Google TPUs.

### Mesh TensorFlow in a few words

• Redefines the TensorFlow API in terms of abstract logical tensors, whose actual instantiation in memory across multiple devices is defined by:
• The specification of the mesh of computing devices
• The specification of rules for which tensor dimensions can be split

(Gholami et al. 2018)
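
A minimal sketch of these two ingredients with the Mesh TensorFlow API (mesh axis names, dimension names, and sizes are illustrative):

```python
import mesh_tensorflow as mtf

graph = mtf.Graph()
mesh = mtf.Mesh(graph, "simulation_mesh")

# Logical tensor: a batch of 3D density fields, independent of any device layout
batch = mtf.Dimension("batch", 8)
nx = mtf.Dimension("nx", 1024)
ny = mtf.Dimension("ny", 1024)
nz = mtf.Dimension("nz", 1024)
field = mtf.zeros(mesh, mtf.Shape([batch, nx, ny, nz]))

# 1) The mesh of computing devices: here a 2 x 16 grid of processors
mesh_shape = mtf.convert_to_shape("rows:2;cols:16")
# 2) The rules for which logical dimensions may be split across the mesh
layout_rules = mtf.convert_to_layout_rules("batch:rows;nx:cols")
```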

### Proof of concept with Mesh FlowPM, and why you should care :-)

Evolution from initial conditions to $z=0$, distributed over 2 nodes (16 GPUs)
Our assessment so far
• Provides an easy framework to write down distributed differentiable simulations and large scale Machine Learning tasks
• The Mesh TensorFlow project is still young and limited in scope:
$\Longrightarrow$ we need help from the Physics community to develop it for our needs!

### Takeaway message

• We are combining physical and deep learning components to model the Large-Scale Structure in a fast and differentiable way.
$\Longrightarrow$ This is a necessary backbone for large scale simulation-based inference.

• We are demonstrating that large-scale simulations can be implemented in distributed autodiff frameworks.
$\Longrightarrow$ We hope that this will one day become the norm.

• Our community has unique needs and limited resources; we will all gain by working collaboratively!

FlowPM