Hybrid Physical-Deep Learning Models

From Galaxy Morphology to Large Scale Structure



François Lanusse

University of California, Berkeley




The Large Synoptic Survey Telescope

  • 1000 images each night, 15 TB/night for 10 years

  • 18,000 square degrees, observed once every few days

  • Tens of billions of objects, each one observed $\sim1000$ times

Previous generation survey: SDSS




















Image credit: Peter Melchior

Current generation survey: DES




















Image credit: Peter Melchior

LSST precursor survey: HSC




















Image credit: Peter Melchior

Generative Models for Galaxy Image Simulation






Work in collaboration with
Rachel Mandelbaum, Siamak Ravanbakhsh, Barnabas Poczos, Peter Freeman



Lanusse et al., in prep
Ravanbakhsh, Lanusse, et al. (2017)

The weak lensing shape measurement problem

Shape measurement biases
The measured ellipticity $e$ is typically a biased tracer of the underlying shear $\gamma$: $$ \langle e \rangle = (1 + m) \, \gamma + c $$ where $m$ is the multiplicative bias and $c$ the additive bias.

Simulation and calibration strategy




The GREAT3 approach
  • Input galaxies from deep HST/ACS COSMOS images (I-band magnitude limit of 25.2)

  • Apply a range of PSFs and noise levels sampled from the survey

  • Measure response of shape measurement to a known shear and estimate $m$ and $c$
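
As a minimal sketch of the last step, $m$ and $c$ can be estimated by a straight-line fit of measured ellipticity against the known input shear. The bias values, shape noise level and sample size below are made-up assumptions for illustration, not GREAT3 settings, and the toy "simulation" stands in for a real image pipeline.

```python
import random

def estimate_bias(true_shears, measured_ellipticities):
    """Least-squares fit of <e> = (1 + m) * gamma + c; returns (m, c)."""
    n = len(true_shears)
    mean_g = sum(true_shears) / n
    mean_e = sum(measured_ellipticities) / n
    cov = sum((g - mean_g) * (e - mean_e)
              for g, e in zip(true_shears, measured_ellipticities))
    var = sum((g - mean_g) ** 2 for g in true_shears)
    slope = cov / var
    return slope - 1.0, mean_e - slope * mean_g

# Toy simulation: hypothetical biases and shape noise, for illustration only.
rng = random.Random(0)
m_true, c_true, shape_noise = 0.02, 0.001, 0.01
shears = [rng.uniform(-0.05, 0.05) for _ in range(20000)]
ellipticities = [(1 + m_true) * g + c_true + rng.gauss(0.0, shape_noise)
                 for g in shears]

m_hat, c_hat = estimate_bias(shears, ellipticities)
print(f"m = {m_hat:.4f}, c = {c_hat:.5f}")
```

The fit is only as good as the simulated galaxies: if their morphology differs from the real sample, the recovered $m$ and $c$ calibrate the wrong population.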

The Bayesian hierarchical modeling strategy


$\mathbf{x}$

$\mathbf{\gamma}$


  • The root of the problem is that the likelihood $p(x | \gamma)$ is a complicated beast.

  • Provides an explicit description of pixel level data, in terms of simple, tractable distributions
(Schneider et al. 2015)

Impact of galaxy morphology


In both cases, we are building a forward model of the data. How accurate does this model need to be?


Mandelbaum, et al. (2013), Mandelbaum, et al. (2014)

$\Longrightarrow$ We cannot measure shear without an accurate model of galaxy morphology

Can we learn a model for galaxy morphologies from the data itself?

The evolution of generative models




  • Deep Belief Network
    (Hinton et al. 2006)

  • Variational AutoEncoder
    (Kingma & Welling 2014)

  • Generative Adversarial Network
    (Goodfellow et al. 2014)

  • Wasserstein GAN
    (Arjovsky et al. 2017)



Complications specific to astronomical images: spot the differences!


CelebA

HSC PDR-2 wide

  • There is noise
  • We have a Point Spread Function

Combining deep generative and physical models

Probabilistic model
Dataset of $N$ i.i.d. samples $\{x_i \}$ generated from $$ x \sim p_{\theta}(x | z, \Sigma, \Pi ) \ p(z) $$
  • $z$ is a set of latent variables
  • $\Pi$ is the PSF, $\Sigma$ is the noise covariance
  • Under a Gaussian noise model:
    $\qquad p_{\theta}(x | z, \Sigma, \Pi)=\mathcal{N}( \Pi \ast g_\theta(z), \Sigma)$
$\Longrightarrow$ Decouples the morphology model from the observing conditions.
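
A minimal sketch of this likelihood, assuming uncorrelated pixel noise ($\Sigma = \sigma^2 I$) and a small PSF kernel applied by direct convolution. The function `toy_generator` is a hypothetical stand-in for the learned $g_\theta$ (a Gaussian blob with flux `z[0]` and width `z[1]`); the real model is a neural network.

```python
import math

def convolve2d_same(image, kernel):
    """Direct 'same'-size 2D convolution with zero padding (odd-sized kernel)."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for a in range(kh):
                for b in range(kw):
                    y, x = i + kh // 2 - a, j + kw // 2 - b
                    if 0 <= y < h and 0 <= x < w:
                        acc += kernel[a][b] * image[y][x]
            out[i][j] = acc
    return out

def toy_generator(z, size=9):
    """Hypothetical stand-in for g_theta: Gaussian blob, flux z[0], width z[1]."""
    c = size // 2
    img = [[math.exp(-((i - c) ** 2 + (j - c) ** 2) / (2 * z[1] ** 2))
            for j in range(size)] for i in range(size)]
    total = sum(map(sum, img))
    return [[z[0] * v / total for v in row] for row in img]

def log_likelihood(x, z, psf, sigma, generator):
    """log N(x | Pi * g(z), sigma^2 I): PSF and noise enter only here."""
    model = convolve2d_same(generator(z), psf)
    n_pix = len(x) * len(x[0])
    chi2 = sum((xv - mv) ** 2
               for xr, mr in zip(x, model) for xv, mv in zip(xr, mr)) / sigma ** 2
    return -0.5 * chi2 - 0.5 * n_pix * math.log(2 * math.pi * sigma ** 2)

psf = [[1 / 16, 2 / 16, 1 / 16], [2 / 16, 4 / 16, 2 / 16], [1 / 16, 2 / 16, 1 / 16]]
z_true = (10.0, 1.5)
x_obs = convolve2d_same(toy_generator(z_true), psf)   # noiseless toy observation
```

Because the generator output is noiseless and PSF-free, the same $g_\theta$ can be reused with a different `psf` and `sigma` for another instrument or exposure.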

How to train your dragon model

  • Training the generative model amounts to finding $\theta_\star$ that maximizes the marginal likelihood of the model: $$p_\theta(x | \Sigma, \Pi) = \int \mathcal{N}( \Pi \ast g_\theta(z), \Sigma) \ p(z) \ dz$$
    $\Longrightarrow$ This is generally intractable

  • Efficient training of the parameters $\theta$ is made possible by Amortized Variational Inference.
Auto-Encoding Variational Bayes (Kingma & Welling, 2014)
  • We introduce a parametric distribution $q_\phi(z | x, \Pi, \Sigma)$ which aims to model the posterior $p_{\theta}(z | x, \Pi, \Sigma)$.

  • Working out the KL divergence between these two distributions leads to: $$\log p_\theta(x | \Sigma, \Pi) \quad \geq \quad - \mathbb{D}_{KL}\left( q_\phi(z | x, \Sigma, \Pi) \parallel p(z) \right) \quad + \quad \mathbb{E}_{z \sim q_{\phi}(. | x, \Sigma, \Pi)} \left[ \log p_\theta(x | z, \Sigma, \Pi) \right]$$ $\Longrightarrow$ This is the Evidence Lower-Bound, which is differentiable with respect to $\theta$ and $\phi$.
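
The bound can be checked numerically on a one-dimensional toy model where everything is Gaussian and the evidence is known in closed form. This sketch assumes $p(z)=\mathcal{N}(0,1)$, $p(x|z)=\mathcal{N}(z, \sigma_n^2)$ and $q(z|x)=\mathcal{N}(\mu, \sigma_q^2)$; it drops the PSF and the neural networks entirely and is not the training code of the model above.

```python
import math
import random

def log_normal_pdf(x, mean, var):
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def kl_to_standard_normal(mu, sig2):
    """KL( N(mu, sig2) || N(0, 1) ), the code regularization term."""
    return 0.5 * (mu ** 2 + sig2 - 1.0 - math.log(sig2))

def elbo_analytic(x, mu, sig2, noise_var):
    """Closed-form ELBO for the toy Gaussian model."""
    expected_loglike = (-0.5 * math.log(2 * math.pi * noise_var)
                        - ((x - mu) ** 2 + sig2) / (2 * noise_var))
    return -kl_to_standard_normal(mu, sig2) + expected_loglike

def elbo_monte_carlo(x, mu, sig2, noise_var, n_samples=20000, seed=0):
    """Same bound with reparameterized samples z = mu + sigma * eps."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        z = mu + math.sqrt(sig2) * rng.gauss(0.0, 1.0)
        total += log_normal_pdf(x, z, noise_var)
    return -kl_to_standard_normal(mu, sig2) + total / n_samples

x, noise_var = 1.0, 0.25
log_evidence = log_normal_pdf(x, 0.0, 1.0 + noise_var)   # exact for this toy
post_mu = x / (1.0 + noise_var)                           # exact posterior mean
post_var = noise_var / (1.0 + noise_var)                  # exact posterior variance

print(elbo_analytic(x, 0.0, 1.0, noise_var), "<=", log_evidence)
print(elbo_analytic(x, post_mu, post_var, noise_var), "==", log_evidence)
```

The bound is tight only when $q_\phi$ equals the true posterior; any other choice gives a strictly lower value.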

The famous Variational Auto-Encoder



$$\log p_\theta(x| \Sigma, \Pi ) \geq - \underbrace{\mathbb{D}_{KL}\left( q_\phi(z | x, \Sigma, \Pi) \parallel p(z) \right)}_{\mbox{code regularization}} + \underbrace{\mathbb{E}_{z \sim q_{\phi}(. | x, \Sigma, \Pi)} \left[ \log p_\theta(x | z, \Sigma, \Pi) \right]}_{\mbox{reconstruction error}} $$

Illustration on HST/ACS COSMOS images


Fitting observations with VAE and Bulge+Disk parametric model.
  • Training set: GalSim COSMOS HST/ACS postage stamps
    • 80,000 deblended galaxies from I < 25.2 sample
    • Drawn on 128x128 stamps at 0.03 arcsec resolution
    • Each stamp comes with:
      • PSF
      • Noise power spectrum
      • Bulge+Disk parametric fit


  • Auto-Encoder model:
    • Deep residual autoencoder:
      7 stages of 2 resnet blocks each
    • Dense bottleneck of size 32.
    • Outputs positive, noiseless, deconvolved, galaxy surface brightness.

Sampling from the model

Whoops... what's going on?

Tradeoff between code regularization and image quality


$$\log p_\theta(x| \Sigma, \Pi ) \geq - \underbrace{\mathbb{D}_{KL}\left( q_\phi(z | x, \Sigma, \Pi) \parallel p(z) \right)}_{\mbox{code regularization}} + \underbrace{\mathbb{E}_{z \sim q_{\phi}(. | x, \Sigma, \Pi)} \left[ \log p_\theta(x | z, \Sigma, \Pi) \right]}_{\mbox{reconstruction error}} $$

Latent space modeling with Normalizing Flows


$\Longrightarrow$ All we need to do is sample from the aggregate posterior of the data instead of sampling from the prior.


Dinh et al. 2016
Normalizing Flows
  • Assumes a bijective mapping between data space $x$ and latent space $z$ with prior $p(z)$: $$ z = f_{\theta} ( x ) \qquad \mbox{and} \qquad x = f^{-1}_{\theta}(z)$$
  • Admits an explicit marginal likelihood: $$ \log p_\theta(x) = \log p(z) + \log \left| \det \frac{\partial f_\theta}{\partial x}(x) \right| , \qquad z = f_\theta(x) $$
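
The simplest possible instance is an element-wise affine bijection with a standard normal prior, where the Jacobian is diagonal and its log-determinant is a plain sum. The sketch below is a toy with hand-picked parameters (`shift`, `log_scale` are illustrative names), not a trained flow.

```python
import math

def forward(x, shift, log_scale):
    """z = f(x): element-wise affine bijection."""
    return [(xi - b) * math.exp(-s) for xi, b, s in zip(x, shift, log_scale)]

def inverse(z, shift, log_scale):
    """x = f^{-1}(z)."""
    return [zi * math.exp(s) + b for zi, b, s in zip(z, shift, log_scale)]

def log_prob(x, shift, log_scale):
    """log p(x) = log p(z) + log|det df/dx| with a standard normal prior p(z)."""
    z = forward(x, shift, log_scale)
    log_prior = sum(-0.5 * zi ** 2 - 0.5 * math.log(2 * math.pi) for zi in z)
    log_det = -sum(log_scale)      # Jacobian is diagonal with entries exp(-s)
    return log_prior + log_det

shift, log_scale = [1.0, -2.0], [0.5, -0.3]
x = [0.7, -1.9]
print(log_prob(x, shift, log_scale))
```

Deeper flows stack many such bijections; the log-determinants simply add up.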




Conditional sampling in VAE latent space

  • We build a latent space model $p_\varphi(z)$ using a Masked Autoregressive Flow (MAF) (Papamakarios, et al. 2017)

  • While we are learning to sample from the latent space, we can also learn to sample conditionally: $$ p_\varphi(z | y) $$

  • Here we learn to sample images conditioned on:
    • Size: half-light radius $r$
    • Brightness: I band magnitude $mag\_auto$
    • Redshift: COSMOS photometric redshift $zphot$
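
To illustrate what conditional sampling means mechanically, here is a two-dimensional autoregressive affine flow whose shift and scale depend on a conditioning value $y$ and on the previous latent dimensions. In an actual MAF the conditioner is a masked neural network with learned weights; the fixed `conditioner` below, with made-up coefficients, is only a hypothetical stand-in.

```python
import math
import random

def conditioner(i, z_prev, y):
    """Hypothetical fixed conditioner: (mean, log_std) of dimension i given
    the previous latent dimensions z_prev and the conditioning value y."""
    if i == 0:
        return 0.5 * y, -0.5
    return 0.8 * z_prev[0] - 0.2 * y, 0.1

def sample(y, rng, dim=2):
    """Draw z ~ p(z | y), transforming standard normal noise one dimension at a time."""
    z = []
    for i in range(dim):
        mean, log_std = conditioner(i, z, y)
        z.append(mean + math.exp(log_std) * rng.gauss(0.0, 1.0))
    return z

def log_prob(z, y):
    """log p(z | y) from the change of variables u_i = (z_i - mean_i) / std_i."""
    total = 0.0
    for i, zi in enumerate(z):
        mean, log_std = conditioner(i, z[:i], y)
        u = (zi - mean) * math.exp(-log_std)
        total += -0.5 * u ** 2 - 0.5 * math.log(2 * math.pi) - log_std
    return total

rng = random.Random(0)
draws = [sample(2.0, rng) for _ in range(20000)]
mean_z0 = sum(d[0] for d in draws) / len(draws)   # should track 0.5 * y
```

Changing `y` shifts the distribution of latent codes, and therefore of decoded images, without retraining the decoder.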

Flow-VAE samples



Testing conditional sampling

$\Longrightarrow$ We can successfully condition galaxy generation.

Testing galaxy morphologies





Takeaway message


  • We have combined physical and deep learning components to model observed noisy and PSF-convolved galaxy images.
    $\Longrightarrow$ This framework can handle multi-band, multi-resolution, multi-instrument data.

  • We are overcoming the limitations of standard VAEs with an additional latent space model.
    $\Longrightarrow$ Can produce sharp and meaningful images.

  • We demonstrate conditional sampling of galaxy light profiles
    $\Longrightarrow$ Image simulation can be combined with larger survey simulation efforts.


GalSim Hub

Differentiable models of the Large-Scale Structure




Work in collaboration with
Chirag Modi, Uroš Seljak


Modi, Lanusse, et al., in prep
Modi, et al. (2018)

Traditional cosmological inference

HSC cosmic shear power spectrum
HSC Y1 constraints on $(S_8, \Omega_m)$
(Hikage,..., Lanusse, et al. 2018)
  • Measure the ellipticity $\epsilon = \epsilon_i + \gamma$ of all galaxies
    $\Longrightarrow$ Noisy tracer of the weak lensing shear $\gamma$

  • Compute summary statistics based on 2pt functions,
    e.g. the power spectrum

  • Run an MCMC to recover a posterior on model parameters, using an analytic likelihood $$ p(\theta | x ) \propto \underbrace{p(x | \theta)}_{\mathrm{likelihood}} \ \underbrace{p(\theta)}_{\mathrm{prior}}$$
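
The last step can be sketched with a random-walk Metropolis sampler on a one-parameter toy problem. The Gaussian likelihood, the wide Gaussian prior, the step size and the synthetic data are all assumptions chosen for illustration; a real analysis uses a multi-parameter cosmological likelihood and a production sampler.

```python
import math
import random

def log_posterior(theta, data, noise_std=1.0, prior_std=10.0):
    """Unnormalized log p(theta | x) = log p(x | theta) + log p(theta)."""
    log_like = -0.5 * sum((x - theta) ** 2 for x in data) / noise_std ** 2
    log_prior = -0.5 * theta ** 2 / prior_std ** 2
    return log_like + log_prior

def metropolis(data, n_steps=20000, step=0.3, seed=1):
    """Random-walk Metropolis chain over the single parameter theta."""
    rng = random.Random(seed)
    theta, logp = 0.0, log_posterior(0.0, data)
    chain = []
    for _ in range(n_steps):
        proposal = theta + step * rng.gauss(0.0, 1.0)
        logp_new = log_posterior(proposal, data)
        # Accept with probability min(1, p_new / p_old)
        if rng.random() < math.exp(min(0.0, logp_new - logp)):
            theta, logp = proposal, logp_new
        chain.append(theta)
    return chain

rng = random.Random(0)
data = [0.8 + rng.gauss(0.0, 1.0) for _ in range(50)]   # toy summary statistics
chain = metropolis(data)[2000:]                           # drop burn-in
posterior_mean = sum(chain) / len(chain)
print(f"posterior mean = {posterior_mean:.3f}")
```

Every step needs one evaluation of `log_posterior`, which is why this approach requires a likelihood that can be written down explicitly.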
Main limitation: the need for an explicit likelihood
We can only compute the likelihood for simple summary statistics and on large scales

$\Longrightarrow$ We are dismissing most of the information!

A different road: forward modeling

  • Instead of trying to analytically evaluate the likelihood, let us build a forward model of the observables.

  • Each component of the model is now tractable, but at the cost of a large number of latent variables.

$\Longrightarrow$ How to perform efficient inference in this large number of dimensions?

    A non-exhaustive list of methods:
  • Hamiltonian Monte-Carlo
  • Variational Inference
  • MAP+Laplace
  • Gold Mining
  • Dimensionality reduction by Fisher-Information Maximization
What do they all have in common?
-> They require fast, accurate, differentiable forward simulations
(Schneider et al. 2015)
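
To make the role of gradients concrete, here is a minimal Hamiltonian Monte-Carlo sketch on a 50-dimensional standard normal target. The target, step size and trajectory length are toy assumptions; in the forward-modeling setting, `log_prob_and_grad` would be supplied by automatic differentiation through the simulation.

```python
import math
import random

def log_prob_and_grad(x):
    """Toy target: standard normal in len(x) dimensions, with its gradient."""
    return -0.5 * sum(v * v for v in x), [-v for v in x]

def hmc_step(x, rng, step_size=0.1, n_leapfrog=20):
    """One HMC transition; returns (new_x, accepted)."""
    p = [rng.gauss(0.0, 1.0) for _ in x]
    logp, grad = log_prob_and_grad(x)
    h_start = -logp + 0.5 * sum(v * v for v in p)
    x_new = list(x)
    # Leapfrog integration: half momentum step, then alternating full steps
    p = [pi + 0.5 * step_size * gi for pi, gi in zip(p, grad)]
    for i in range(n_leapfrog):
        x_new = [xi + step_size * pi for xi, pi in zip(x_new, p)]
        logp_new, grad = log_prob_and_grad(x_new)
        scale = 0.5 if i == n_leapfrog - 1 else 1.0
        p = [pi + scale * step_size * gi for pi, gi in zip(p, grad)]
    h_end = -logp_new + 0.5 * sum(v * v for v in p)
    # Metropolis correction for the integration error
    if rng.random() < math.exp(min(0.0, h_start - h_end)):
        return x_new, True
    return list(x), False

rng = random.Random(0)
x = [0.0] * 50
samples, n_accepted = [], 0
for step in range(350):
    x, accepted = hmc_step(x, rng)
    n_accepted += accepted
    if step >= 50:                       # discard burn-in
        samples.append(x)

values = [v for s in samples for v in s]
variance = sum(v * v for v in values) / len(values)
acceptance_rate = n_accepted / 350
print(f"acceptance = {acceptance_rate:.2f}, variance = {variance:.2f}")
```

Each trajectory costs `n_leapfrog` gradient evaluations, so the speed of the differentiable simulation directly sets the cost of inference.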

How do we simulate the Universe in a fast and differentiable way?

Forward Models in Cosmology

Linear Field
Final Dark Matter

$\longrightarrow$
N-body simulations

Introducing FlowPM: Particle-Mesh Simulations in TensorFlow


                  import numpy as np
                  import tensorflow as tf
                  import flowpm

                  # Scale factors of the integration steps, from a=0.1 to a=1 (z=0)
                  stages = np.linspace(0.1, 1.0, 10, endpoint=True)

                  # ipklin: callable returning the linear power spectrum,
                  # assumed to be defined beforehand
                  initial_conds = flowpm.linear_field(32,        # size of the cube
                                                      100,       # Physical size
                                                      ipklin,    # Initial powerspectrum
                                                      batch_size=16)

                  # Sample particles and displace them by LPT
                  state = flowpm.lpt_init(initial_conds, a0=0.1)

                  # Evolve particles down to z=0
                  final_state = flowpm.nbody(state, stages, 32)

                  # Retrieve final density field
                  final_field = flowpm.cic_paint(tf.zeros_like(initial_conds),
                                                 final_state[0])

                  # TensorFlow 1.x graph mode: run the graph to get the simulation
                  with tf.Session() as sess:
                      sim = sess.run(final_field)
              	
  • Seamless interfacing with deep learning components
  • Gradients readily available
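
The `cic_paint` call above deposits particles onto a mesh. A one-dimensional cloud-in-cell deposit can be sketched in plain Python as follows; this is an illustrative toy on a periodic mesh with unit cell size, not FlowPM's implementation.

```python
import math

def cic_paint_1d(positions, n_cells, weights=None):
    """Cloud-in-cell deposit on a periodic 1D mesh with unit cell size."""
    mesh = [0.0] * n_cells
    if weights is None:
        weights = [1.0] * len(positions)
    for x, w in zip(positions, weights):
        left = math.floor(x)
        frac = x - left                  # offset from the left cell, in [0, 1)
        # Each particle shares its mass linearly between two neighbouring cells
        mesh[left % n_cells] += w * (1.0 - frac)
        mesh[(left + 1) % n_cells] += w * frac
    return mesh

mesh = cic_paint_1d([2.25, 7.5, -0.5], n_cells=8)
print(mesh)
```

The deposited mass is piecewise linear in the particle positions, which is what lets gradients flow from the density field back to the particles.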

Forward Models in Cosmology

Linear Field
Final Dark Matter

Dark Matter Halos
Galaxies
$\longrightarrow$
N-body simulations
FlowPM
$\longrightarrow$
Group Finding
algorithms
$\longrightarrow$
Semi-analytic &
distribution models

Example of Extending Dark Matter Simulations with Deep Learning

$\longrightarrow$
Modi et al. 2018

The practical challenge for inference at scale



  • Simulations of scientifically interesting sizes do not fit in the memory of a single GPU
    e.g. $128^3$ operational, need $1024^3$ for survey volumes
    $\Longrightarrow$ We need a distributed Machine Learning Framework

  • Most common form of distribution is data-parallelism $\Longrightarrow$ Reached Exascale on scientific deep learning applications

  • What we need is model-parallelism on HPC environments


$\Longrightarrow$ We have started investigating Mesh TensorFlow on NERSC systems and Google TPUs.

Mesh TensorFlow in a few words

  • Redefines the TensorFlow API, in terms of abstract logical tensors with actual memory instantiation on multiple devices defined by:
    • The specification of the mesh of computing devices
    • The specification of rules for which dimensions can be split

(Gholami et al. 2018)
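
The effect of these two specifications can be illustrated without the library. The helper `shard_shape` below is hypothetical and does not call Mesh TensorFlow; it only mirrors the idea of a mesh shape plus layout rules. The 2 x 8 device mesh and the dimension names are assumptions made for the example.

```python
def shard_shape(tensor_dims, mesh_shape, layout_rules):
    """Per-device shape of a logical tensor.

    tensor_dims:  list of (tensor dimension name, size)
    mesh_shape:   list of (mesh dimension name, number of devices)
    layout_rules: list of (tensor dimension name, mesh dimension name)
    """
    mesh = dict(mesh_shape)
    layout = dict(layout_rules)
    shape = []
    for name, size in tensor_dims:
        n_split = mesh[layout[name]] if name in layout else 1
        if size % n_split:
            raise ValueError(f"{name}={size} is not divisible by {n_split}")
        shape.append(size // n_split)
    return tuple(shape)

field = [("nx", 1024), ("ny", 1024), ("nz", 1024)]   # logical 1024^3 field
mesh_shape = [("rows", 2), ("cols", 8)]              # 16 devices, 2 x 8 mesh
layout_rules = [("nx", "rows"), ("ny", "cols")]      # nz stays unsplit

local = shard_shape(field, mesh_shape, layout_rules)
gib_per_device = 4 * local[0] * local[1] * local[2] / 2 ** 30   # float32
print(local, gib_per_device)
```

A single float32 field of $1024^3$ cells takes 4 GiB; split this way, each device holds a 0.25 GiB slab, and the code describing the computation does not change with the mesh.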

Proof of concept with Mesh FlowPM, and why you should care :-)


Evolution from initial conditions to z=0, distributed over 2 nodes, 16 GPUs
Our assessment so far
  • Provides an easy framework to write down distributed differentiable simulations and large scale Machine Learning tasks
  • The Mesh TensorFlow project is still young and limited in scope:
    $\Longrightarrow$ we need help from the Physics community to develop it for our needs!

Takeaway message


  • We are combining physical and deep learning components to model the Large-Scale Structure in a fast and differentiable way.
    $\Longrightarrow$ This is a necessary backbone for large scale simulation-based inference.

  • We are demonstrating that large-scale simulations can be implemented in distributed autodiff frameworks.
    $\Longrightarrow$ We hope that this will one day become the norm.

  • Our community has unique needs and limited resources; we will all gain by working collaboratively!


FlowPM

Final words



Advertisement:


Thank you!