Convex Optimized feed.

Alternative Frontends for PyMC

Rob Zinkov — Sun, 19 Nov 2023 00:00:00 UT

When people start learning PyMC, they often assume the library’s syntax and workflow are fixed. But thanks to the modular way the library is implemented, it’s fairly easy to use it in a totally different way, as I’m going to show!

from functools import wraps
import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pymc as pm

Sample as a method on the model

Some people see sampling as more a method on the model than a function. We can always extend pm.Model for those who find that more intuitive:

class Model(pm.Model):
    def sample(self, *args, **kwargs):
        return pm.sample(*args, model=self, **kwargs)

    def sample_posterior_predictive(self, *args, **kwargs):
        return pm.sample_posterior_predictive(
            *args,
            model=self,
            **kwargs,
        )

Here is a simple example of it in action

with Model() as basic_model:
    x = pm.Normal("x", 0., 1.)
    y = pm.Normal("y", x, 1.)

idata = basic_model.sample(draws=1000)

Models as parameterised functions

The idea here is to create models by just using a decorator.

def model(f):
    @wraps(f)
    def make_model(*args, **kwargs):
        with Model() as m:
            f(*args, **kwargs)
            return m
    return make_model

With this change our previous model becomes:

@model
def basic_model(mu):
    x = pm.Normal("x", mu, 1.)
    y = pm.Normal("y", x, 1.)

In practice, this removes all need to think about context managers

m = basic_model(mu=0.)
idata = m.sample(draws=1000)

pm.plot_trace(idata)
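For anyone curious why this works: PyMC distributions attach themselves to whichever model is currently active in a with block, so the decorator only has to open that block on the function's behalf. Below is a minimal sketch of the mechanism using only the standard library. ToyModel and toy_normal are hypothetical stand-ins I made up for illustration, not PyMC classes, and the real implementation is more involved.

```python
from functools import wraps


class ToyModel:
    """Hypothetical stand-in for pm.Model: records variables created inside `with`."""

    _stack = []  # innermost active model is last

    def __init__(self):
        self.variables = []

    def __enter__(self):
        ToyModel._stack.append(self)
        return self

    def __exit__(self, exc_type, exc, tb):
        ToyModel._stack.pop()
        return False  # do not swallow exceptions


def toy_normal(name):
    """Hypothetical stand-in for pm.Normal: registers `name` with the active model."""
    if not ToyModel._stack:
        raise TypeError("no model on the context stack")
    ToyModel._stack[-1].variables.append(name)
    return name


def model(f):
    @wraps(f)
    def make_model(*args, **kwargs):
        # The decorator opens the context so the function body does not have to.
        with ToyModel() as m:
            f(*args, **kwargs)
        return m
    return make_model


@model
def basic_model(prefix):
    toy_normal(prefix + "x")
    toy_normal(prefix + "y")


# Each call builds a fresh, independent model.
m1 = basic_model("a_")
m2 = basic_model("b_")
print(m1.variables, m2.variables)
```

Every call to the decorated function produces a new model object, which is what makes it safe to call the same function again later with different data.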

But the real payoff for composition is how readily helper functions can be used

def hyperprior(name, **kwargs):
    mu = pm.Normal(name + "_mu", mu=0, sigma=1)
    sd = pm.HalfNormal(name + "_sd", sigma=1)
    return pm.Normal(name, mu=mu, sigma=sd, **kwargs)

@model
def model_with_helper():
    y = hyperprior("y")
    z = hyperprior("z", observed=2.)

m = model_with_helper()
idata = pm.sample(model=m)
pm.plot_trace(idata, kind="rank_bars")

And since the model returned is an ordinary pymc model object, it can be readily used for things like posterior predictive checks

y_data = np.random.normal(size=100)

@model
def ppc_model():
    x = pm.Normal("x")
    y = pm.Normal("y", x, 1., observed=y_data)

m = ppc_model()
idata = pm.sample(draws=1000, model=m)
idata = pm.sample_posterior_predictive(trace=idata, model=m)

pm.plot_ppc(idata)

Finally, one underappreciated aspect of this functional approach to defining models is that it avoids the need for pm.MutableData in simpler models. Porting an example from the documentation

n_obs = 100
true_beta = 2.5
true_alpha = 0.25

x = np.random.normal(size=n_obs)
true_p = 1 / (1 + np.exp(-(true_alpha + true_beta * x)))
y = np.random.binomial(n=1, p=true_p)

@model
def logistic_model(x, y):
    alpha = pm.Normal("alpha")
    beta = pm.Normal("beta")
    p = pm.Deterministic("p", pm.math.sigmoid(alpha + beta * x))
    obs = pm.Bernoulli("obs", p=p, observed=y, shape=x.shape[0])

lm = logistic_model(x, y)
idata = lm.sample()

idata = lm.sample_posterior_predictive(
    idata, extend_inferencedata=True,
)

We can then call the logistic_model function again with different arguments, this time using x_grid instead of x

grid_size = 250
x_grid = np.linspace(x.min(), x.max(), grid_size)
lm_grid = logistic_model(x_grid, y)
post_idata = lm_grid.sample_posterior_predictive(
    idata, var_names=["p", "obs"],
)

fig, ax = plt.subplots()
hdi = az.hdi(post_idata.posterior_predictive.p).p

ax.scatter(
    x,
    y,
    facecolor="none",
    edgecolor="k",
    label="Observed Data",
)
p_mean = post_idata.posterior_predictive.p.mean(dim=["chain", "draw"])
ax.plot(
    x_grid,
    p_mean,
    color="tab:red",
    label="Mean Posterior Probability",
)
ax.fill_between(
    x_grid,
    *hdi.values.T,
    color="tab:orange",
    alpha=0.25,
    label="94% HDI",
)
ax.legend()
ax.set(ylabel="Probability of $y=1$", xlabel="x value")
plt.show()

This even works really well for coords. It only requires that we change model a little bit

def model(f):
    @wraps(f)
    def make_model(*args, **kwargs):
        coords = kwargs.pop("coords", {})
        with Model(coords=coords) as m:
            f(*args, **kwargs)
            return m
    return make_model
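The kwargs.pop call is what keeps coords out of the model function's own arguments. Here is a small sketch of that behaviour, with the wrapper returning a plain dict rather than a model so that it runs without PyMC. The function linreg_like and the dict layout are made up for illustration.

```python
from functools import wraps


def model(f):
    @wraps(f)
    def make_model(*args, **kwargs):
        # Reserved keyword: consumed by the wrapper, never forwarded to f.
        coords = kwargs.pop("coords", {})
        return {"coords": coords, "result": f(*args, **kwargs)}
    return make_model


@model
def linreg_like(x, scale=1.0):
    # Hypothetical model function; note it has no `coords` parameter.
    return [scale * v for v in x]


out = linreg_like([1, 2], scale=2.0, coords={"year": [2022, 2023]})
print(out)
```

One consequence of this design is that a model function cannot have its own parameter called coords, since the wrapper consumes it first.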

Now let’s generate some data and fit a linear model

a_true = 2
b_true = -0.4
x = np.linspace(0, 10, 31)
year = np.arange(2022-len(x), 2022)
y = a_true + b_true * x + np.random.normal(size=len(x))

@model
def linreg_model(x):
    a = pm.Normal("a", 0, 3)
    b = pm.Normal("b", 0, 2)
    sigma = pm.HalfNormal("sigma", 2)
    
    pm.Normal("y", a + b * x, sigma, observed=y, dims="year")

m = linreg_model(x, coords={"year": year})
linreg_idata = pm.sample(model=m)

We can then update the coords seamlessly

m2 = linreg_model(x[-1] + x[1:3], coords={"year": [2022, 2023]})
pm.sample_posterior_predictive(
    linreg_idata,
    model=m2,
    predictions=True,
    extend_inferencedata=True,
)

az.plot_posterior(linreg_idata, group="predictions")

While I personally think these changes simplify the models and speed up the interactive workflow, that’s not the main reason I share them. I share them because more of us should be doing little experiments like these. There is certainly more low-hanging fruit to be had for people who are willing to join in!

Why care about Program Synthesis

Rob Zinkov — Sun, 17 Feb 2019 00:00:00 UT

Program synthesis is now emerging as an exciting new area of research not just in the programming languages community, but also in the machine learning community. In this post, I’d like to convince you that this area of study has the potential to solve precisely the kinds of problems that existing approaches built around differentiable programming struggle with.

Basics of Program Synthesis

To start, let’s define, somewhat informally, what makes something a program synthesis problem. Program synthesis is where, given some language \(\mathcal{L}\) and a specification \(\mathcal{S}\), we return a program \(\mathcal{P} \in \mathcal{L}\) which meets that specification.

So what languages (\(\mathcal{L}\)) will we use? In principle, any language can be used. So, we can synthesize Python code. In practice, because it is difficult these days to synthesize programs much longer than 20-30 lines of code, we concentrate on domain-specific languages (DSLs). DSLs are languages like SQL, regexes, or OpenGL shaders. If we are willing to be a bit loose about what defines a language, this can include synthesizing a set of library calls, as AutoPandas does. All that matters is that we can define a grammar that covers the space of programs we wish to consider.

    <regex>  ::= <regex> '|' <branch>
               | <branch>

    <branch> ::= { <piece> }

    <piece>  ::= <atom> { '*' }

    <atom>   ::= <char>
               | '\' <char>
               | '(' <regex> ')'

Regex grammar
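To make "a grammar that covers the space of programs" concrete, here is a rough sketch that enumerates that space for a simplified regex language. It assumes a toy two-letter alphabet and only three operators (star, concatenation, alternation), and it parenthesizes every composite expression to sidestep precedence. None of this is taken from a particular synthesis system.

```python
import re
from itertools import product

ALPHABET = ["a", "b"]  # assumed toy alphabet, chosen to keep the space small


def regexes(depth):
    """Return every regex string derivable with at most `depth` nested operators."""
    if depth == 0:
        return set(ALPHABET)
    smaller = regexes(depth - 1)
    out = set(smaller)
    for r in smaller:
        out.add(f"({r}*)")        # Kleene star
    for r, s in product(smaller, repeat=2):
        out.add(f"({r}{s})")      # concatenation
        out.add(f"({r}|{s})")     # alternation
    return out


space = regexes(2)

# Every enumerated string should be a syntactically valid regex.
for r in space:
    re.compile(r)

print(len(space), "candidate programs")
```

Even at depth 2 the space already holds a few hundred candidates, which hints at why the search strategy matters so much once the language gets bigger.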

What do we mean by a specification (\(\mathcal{S}\))?

This can actually be a wide variety of things. \(\mathcal{S}\) can be, in no particular order, one or more of the following:

A formal specification of the problem including things like theorems that must be proved along with other formal verification steps.
A set of input/output examples
A set of unit tests and property-based tests
A natural language description of the problem
A set of execution traces of the desired program
A sketch of a program where we have a partial program and some blanks we would like to fill in
A correct but inefficient implementation of the desired program

While not strictly necessary, we may also have some side information like:

Similar but incorrect programs
A set of other programs in \(\mathcal{L}\)

If we restrict ourselves to a specification that consists of input/output examples and a language of pure functions, we get something pretty similar to supervised machine learning. But because the specification can be much richer, we can actually tackle problems that are hard to pose in a way amenable to traditional machine learning algorithms.
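As a rough illustration of that restricted setting, the sketch below does brute-force synthesis from input/output examples. The DSL (the variable x, the constants 1 and 2, and the operators + and *) is a toy I picked for the example, and plain enumeration is only the simplest possible search strategy, not what the papers discussed here use.

```python
from itertools import product


def expressions(depth):
    """Yield (source, function) pairs for every expression up to the given depth."""
    leaves = [("x", lambda x: x), ("1", lambda x: 1), ("2", lambda x: 2)]
    if depth == 0:
        yield from leaves
        return
    smaller = list(expressions(depth - 1))
    yield from smaller
    for (ls, lf), (rs, rf) in product(smaller, repeat=2):
        # Default arguments bind the sub-functions at definition time.
        yield (f"({ls} + {rs})", lambda x, lf=lf, rf=rf: lf(x) + rf(x))
        yield (f"({ls} * {rs})", lambda x, lf=lf, rf=rf: lf(x) * rf(x))


def synthesize(examples, max_depth=2):
    """Return the source of the first program consistent with all examples, or None."""
    for depth in range(max_depth + 1):
        for source, fn in expressions(depth):
            if all(fn(i) == o for i, o in examples):
                return source
    return None


spec = [(0, 1), (1, 3), (2, 5)]  # examples generated from 2x + 1
program = synthesize(spec)
print(program)
```

The result is an expression we can read and run on new inputs, and the search never needed a gradient, only a way to check candidates against the specification.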

Program synthesis is good for

Introduction

Now, while this is a nice generic formalism, it isn’t very compelling if there aren’t problems that benefit from being posed this way. Deep learning and other optimization methods can now be used to solve a diverse set of problems. Which problems tend to be easier to solve with program synthesis? As things stand today, the main advantages of specifically wanting to generate a program have to do with interpretability, generalisability, verification, combinatorial problems, and cases where the output needs to be a program.

Interpretability

Consider the task of automatically grading assignments. How would you go about doing this? You might treat this as a classification task where you find the errors. The challenge with this problem is there can be multiple valid solutions, and the fix for the assignment will depend on which solution you think the student was attempting.

Instead, we can synthesize the correct program by exploring the space of small edits that get us from the incorrect program to a correct program that satisfies an already written specification. These edits can then be presented to the student. This is precisely what the paper Automated Feedback Generation for Introductory Programming Assignments does on a subset of the Python language, and what the paper Towards Specification-Directed Program Repair does for the robot manipulation DSL Karel.

If we didn’t treat this as a program, we would likely have ended up with some character edits, which are much less interpretable.

This can be seen more strikingly in Learning to Infer Graphics Programs from Hand-Drawn Images, where the learned output, by virtue of being a program, better communicates the structure in the image.
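A toy version of this edit search might look like the sketch below. It only tries single-token replacements in an arithmetic expression and treats a handful of test cases as the specification. The token vocabulary, the tests, and the student answer are all invented for illustration, and the cited papers use far richer edit models and solvers.

```python
def repair(tokens, tests, vocabulary):
    """Search single-token replacements for one that satisfies every test.

    Returns (position, old_token, new_token, fixed_tokens), or None if no
    single replacement works.
    """
    def passes(candidate):
        try:
            fn = eval("lambda x: " + " ".join(candidate))
            return all(fn(i) == o for i, o in tests)
        except Exception:
            return False  # not a valid expression, or it crashed on an input

    for pos, old in enumerate(tokens):
        for new in vocabulary:
            if new == old:
                continue
            candidate = tokens[:pos] + [new] + tokens[pos + 1:]
            if passes(candidate):
                return pos, old, new, candidate
    return None


tests = [(0, 1), (1, 3), (3, 7)]   # specification: f(x) should equal 2x + 1
student = "x * x + 1".split()      # hypothetical incorrect submission
vocabulary = ["x", "1", "2", "+", "-", "*"]

fix = repair(student, tests, vocabulary)
if fix is not None:
    pos, old, new, fixed = fix
    print(f"Try replacing '{old}' with '{new}' at token {pos}: {' '.join(fixed)}")
```

The edit that comes back is itself the feedback: it points at a specific token and says what to change, which is much easier to act on than a bare "wrong answer".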

Generalisability

Many deep learning models struggle with generalisability. They tend not to be very robust to small distribution differences between the training and the testing set, and they are prone to adversarial examples, where small imperceptible changes to the input radically change the prediction.

But for many domains, if we represent our function as a program, it can be made more robust to such perturbations of the input, as can be seen in Learning to Infer Graphics Programs from Hand-Drawn Images.

There are actually particular challenges facing the most popular machine learning models that give program synthesis approaches no problems. We know LSTMs have trouble with copy and reverse functions, as seen in the Differentiable Neural Computers paper.

LSTM models also have trouble generalising to test data longer than the training data, as can be seen in Neural Logic Machines.

In contrast, the papers Making Neural Programming Architectures Generalize via Recursion and Towards Synthesizing Complex Programs from Input-Output Examples show no issues with either of those tasks.

Verification

Another advantage comes from our output artifact being a program. Neural networks are difficult to formally verify, and at present doing so often requires major restrictions to be placed on the models. In contrast, with programs we can reuse existing infrastructure for verifying deterministic programs. We can thus verify that these programs terminate or obey a formal spec. In some domains, like robotics, we can check whether the program has controllability.

Problems with combinatorial shape

Problems that require dealing with graphs, trees, and permutations remain fairly challenging for existing machine learning algorithms. Programs are a natural representation for manipulating combinatorial structures. Pointer networks and Sinkhorn networks, along with work on Memory networks and Neural Turing Machines, show that at the moment it is difficult to learn a function that can handle anything beyond toy problems, and even those solutions have trouble generalizing to larger domains.

Required to use some API / output must be a program

And finally, sometimes for one reason or another you need an output that must satisfy some grammar. This might be learning to generate a phone number or a URL. We might have some API we need to conform to, for instance if we are trying to generate mobile software that needs to call out to Android or iOS primitives.

We could be using program synthesis for compiler optimization, so we must generate a valid program as output. We could be learning to deobfuscate code, or learning to generate code that would automatically hack a system.

Any other approach will need to model the grammar to produce acceptable output, and at that point it could be argued that it is also performing program synthesis.
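To show what "modelling the grammar" buys us, here is a small sketch that generates phone numbers by expanding a grammar, so every output is well-formed by construction. The NNN-NNN-NNNN format and the rule names are assumptions made for the example.

```python
import random
import re

# Hypothetical grammar: each nonterminal maps to a list of productions.
GRAMMAR = {
    "phone": [["area", "-", "exchange", "-", "line"]],
    "area": [["digit", "digit", "digit"]],
    "exchange": [["digit", "digit", "digit"]],
    "line": [["digit", "digit", "digit", "digit"]],
    "digit": [[d] for d in "0123456789"],
}


def generate(symbol, rng):
    """Expand `symbol` by recursively picking a random production."""
    if symbol not in GRAMMAR:
        return symbol  # terminal: emit as-is
    production = rng.choice(GRAMMAR[symbol])
    return "".join(generate(s, rng) for s in production)


rng = random.Random(0)  # seeded for repeatability
samples = [generate("phone", rng) for _ in range(5)]
print(samples)
```

A learned model would replace rng.choice with something smarter, but the validity guarantee comes from the grammar, not from the model.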

Conclusions

None of this is meant to say that these problems couldn’t be solved with other methods, but program synthesis has distinct advantages that enable it to solve them particularly well.

NeurIPS 2018: Papers to check out

Rob Zinkov — Fri, 21 Dec 2018 00:00:00 UT

It’s been a long time since I’ve done one of these, so below are some of the papers I found exciting at this past NeurIPS. One notable thing is that, as the conference has gotten larger, there are simply more papers being presented. Some people worry that in becoming a gigantic conference the quality is declining, but thanks to the great efforts of the community, more people has meant more interesting ideas and more great papers to read. I typically highlight 5 papers

Approximate Inference

Program Synthesis

Applications

Misc

Conference papers:

Approximate Inference

Implicit Reparameterization Gradients

Mikhail Figurnov, Shakir Mohamed, Andriy Mnih

Abstract: By providing a simple and efficient way of computing low-variance gradients of continuous random variables, the reparameterization trick has become the technique of choice for training a variety of latent variable models. However, it is not applicable to a number of important continuous distributions. We introduce an alternative approach to computing reparameterization gradients based on implicit differentiation and demonstrate its broader applicability by applying it to Gamma, Beta, Dirichlet, and von Mises distributions, which cannot be used with the classic reparameterization trick. Our experiments show that the proposed approach is faster and more accurate than the existing gradient estimators for these distributions.

Random Feature Stein Discrepancies

Jonathan Huggins, Lester Mackey

Abstract: Computable Stein discrepancies have been deployed for a variety of applications, ranging from sampler selection in posterior inference to approximate Bayesian inference to goodness-of-fit testing. Existing convergence-determining Stein discrepancies admit strong theoretical guarantees but suffer from a computational cost that grows quadratically in the sample size. While linear-time Stein discrepancies have been proposed for goodness-of-fit testing, they exhibit avoidable degradations in testing power—even when power is explicitly optimized. To address these shortcomings, we introduce feature Stein discrepancies (ΦSDs), a new family of quality measures that can be cheaply approximated using importance sampling. We show how to construct ΦSDs that provably determine the convergence of a sample to its target and develop high-accuracy approximations—random ΦSDs (RΦSDs)—which are computable in near-linear time. In our experiments with sampler selection for approximate posterior inference and goodness-of-fit testing, RΦSDs perform as well or better than quadratic-time KSDs while being orders of magnitude faster to compute.

Wasserstein Variational Inference

Luca Ambrogioni, Umut Güçlü, Yağmur Güçlütürk, Max Hinne, Marcel A. J. van Gerven, Eric Maris

Abstract: This paper introduces Wasserstein variational inference, a new form of approximate Bayesian inference based on optimal transport theory. Wasserstein variational inference uses a new family of divergences that includes both f-divergences and the Wasserstein distance as special cases. The gradients of the Wasserstein variational loss are obtained by backpropagating through the Sinkhorn iterations. This technique results in a very stable likelihood-free training method that can be used with implicit distributions and probabilistic programs. Using the Wasserstein variational inference framework, we introduce several new forms of autoencoders and test their robustness and performance against existing variational autoencoding techniques.

DeepProbLog: Neural Probabilistic Logic Programming

Robin Manhaeve, Sebastijan Dumancic, Angelika Kimmig, Thomas Demeester, Luc De Raedt

Abstract: We introduce DeepProbLog, a probabilistic logic programming language that incorporates deep learning by means of neural predicates. We show how existing inference and learning techniques can be adapted for the new language. Our experiments demonstrate that DeepProbLog supports (i) both symbolic and subsymbolic representations and inference, (ii) program induction, (iii) probabilistic (logic) programming, and (iv) (deep) learning from examples. To the best of our knowledge, this work is the first to propose a framework where general-purpose neural networks and expressive probabilistic-logical modeling and reasoning are integrated in a way that exploits the full expressiveness and strengths of both worlds and can be trained end-to-end based on examples.

Meta-Learning MCMC Proposals

Tongzhou Wang, Yi Wu, Dave Moore, Stuart J. Russell

Abstract: Effective implementations of sampling-based probabilistic inference often require manually constructed, model-specific proposals. Inspired by recent progresses in meta-learning for training learning agents that can generalize to unseen environments, we propose a meta-learning approach to building effective and generalizable MCMC proposals. We parametrize the proposal as a neural network to provide fast approximations to block Gibbs conditionals. The learned neural proposals generalize to occurrences of common structural motifs across different models, allowing for the construction of a library of learned inference primitives that can accelerate inference on unseen models with no model-specific training required. We explore several applications including open-universe Gaussian mixture models, in which our learned proposals outperform a hand-tuned sampler, and a real-world named entity recognition task, in which our sampler yields higher final F1 scores than classical single-site Gibbs sampling.

Importance Weighting and Variational Inference

Justin Domke, Daniel R. Sheldon

Abstract: Recent work used importance sampling ideas for better variational bounds on likelihoods. We clarify the applicability of these ideas to pure probabilistic inference, by showing the resulting Importance Weighted Variational Inference (IWVI) technique is an instance of augmented variational inference, thus identifying the looseness in previous work. Experiments confirm IWVI’s practicality for probabilistic inference. As a second contribution, we investigate inference with elliptical distributions, which improves accuracy in low dimensions, and convergence in high dimensions.

GILBO: One Metric to Measure Them All

Alexander A. Alemi, Ian Fischer

Abstract: We propose a simple, tractable lower bound on the mutual information contained in the joint generative density of any latent variable generative model: the GILBO (Generative Information Lower BOund). It offers a data-independent measure of the complexity of the learned latent variable description, giving the log of the effective description length. It is well-defined for both VAEs and GANs. We compute the GILBO for 800 GANs and VAEs each trained on four datasets (MNIST, FashionMNIST, CIFAR-10 and CelebA) and discuss the results.

Graphical model inference: Sequential Monte Carlo meets deterministic approximations

Fredrik Lindsten, Jouni Helske, Matti Vihola

Abstract: Approximate inference in probabilistic graphical models (PGMs) can be grouped into deterministic methods and Monte-Carlo-based methods. The former can often provide accurate and rapid inferences, but are typically associated with biases that are hard to quantify. The latter enjoy asymptotic consistency, but can suffer from high computational costs. In this paper we present a way of bridging the gap between deterministic and stochastic inference. Specifically, we suggest an efficient sequential Monte Carlo (SMC) algorithm for PGMs which can leverage the output from deterministic inference methods. While generally applicable, we show explicitly how this can be done with loopy belief propagation, expectation propagation, and Laplace approximations. The resulting algorithm can be viewed as a post-correction of the biases associated with these methods and, indeed, numerical results show clear improvements over the baseline deterministic methods as well as over “plain” SMC.

A Bayesian Nonparametric View on Count-Min Sketch

Diana Cai, Michael Mitzenmacher, Ryan P. Adams

Abstract The count-min sketch is a time- and memory-efficient randomized data structure that provides a point estimate of the number of times an item has appeared in a data stream. The count-min sketch and related hash-based data structures are ubiquitous in systems that must track frequencies of data such as URLs, IP addresses, and language n-grams. We present a Bayesian view on the count-min sketch, using the same data structure, but providing a posterior distribution over the frequencies that characterizes the uncertainty arising from the hash-based approximation. In particular, we take a nonparametric approach and consider tokens generated from a Dirichlet process (DP) random measure, which allows for an unbounded number of unique tokens. Using properties of the DP, we show that it is possible to straightforwardly compute posterior marginals of the unknown true counts and that the modes of these marginals recover the count-min sketch estimator, inheriting the associated probabilistic guarantees. Using simulated data with known ground truth, we investigate the properties of these estimators. Lastly, we also study a modified problem in which the observation stream consists of collections of tokens (i.e., documents) arising from a random measure drawn from a stable beta process, which allows for power law scaling behavior in the number of unique tokens.

Autoconj: Recognizing and Exploiting Conjugacy Without a Domain-Specific Language

Matthew D. Hoffman, Matthew J. Johnson, Dustin Tran

Abstract Deriving conditional and marginal distributions using conjugacy relationships can be time consuming and error prone. In this paper, we propose a strategy for automating such derivations. Unlike previous systems which focus on relationships between pairs of random variables, our system (which we call Autoconj) operates directly on Python functions that compute log-joint distribution functions. Autoconj provides support for conjugacy-exploiting algorithms in any Python-embedded PPL. This paves the way for accelerating development of novel inference algorithms and structure-exploiting modeling strategies. The package can be downloaded at https://github.com/google-research/autoconj.

Robust Hypothesis Testing Using Wasserstein Uncertainty Sets

Rui Gao, Liyan Xie, Yao Xie, Huan Xu

Abstract We develop a novel computationally efficient and general framework for robust hypothesis testing. The new framework features a new way to construct uncertainty sets under the null and the alternative distributions, which are sets centered around the empirical distribution defined via Wasserstein metric, thus our approach is data-driven and free of distributional assumptions. We develop a convex safe approximation of the minimax formulation and show that such approximation renders a nearly-optimal detector among the family of all possible tests. By exploiting the structure of the least favorable distribution, we also develop a tractable reformulation of such approximation, with complexity independent of the dimension of observation space and can be nearly sample-size-independent in general. Real-data example using human activity data demonstrated the excellent performance of the new robust detector.

Geometrically Coupled Monte Carlo Sampling

Mark Rowland, Krzysztof M. Choromanski, François Chalus, Aldo Pacchiano, Tamas Sarlos, Richard E. Turner, Adrian Weller

Abstract Monte Carlo sampling in high-dimensional, low-sample settings is important in many machine learning tasks. We improve current methods for sampling in Euclidean spaces by avoiding independence, and instead consider ways to couple samples. We show fundamental connections to optimal transport theory, leading to novel sampling algorithms, and providing new theoretical grounding for existing strategies. We compare our new strategies against prior methods for improving sample efficiency, including QMC, by studying discrepancy. We explore our findings empirically, and observe benefits of our sampling schemes for reinforcement learning and generative modelling.

Assessing Generative Models via Precision and Recall

Mehdi S. M. Sajjadi, Olivier Bachem, Mario Lucic, Olivier Bousquet, Sylvain Gelly

Abstract Recent advances in generative modeling have led to an increased interest in the study of statistical divergences as means of model comparison. Commonly used evaluation methods, such as the Frechet Inception Distance (FID), correlate well with the perceived quality of samples and are sensitive to mode dropping. However, these metrics are unable to distinguish between different failure cases since they only yield one-dimensional scores. We propose a novel definition of precision and recall for distributions which disentangles the divergence into two separate dimensions. The proposed notion is intuitive, retains desirable properties, and naturally leads to an efficient algorithm that can be used to evaluate generative models. We relate this notion to total variation as well as to recent evaluation metrics such as Inception Score and FID. To demonstrate the practical utility of the proposed approach we perform an empirical study on several variants of Generative Adversarial Networks and Variational Autoencoders. In an extensive set of experiments we show that the proposed metric is able to disentangle the quality of generated samples from the coverage of the target distribution.

DAGs with NO TEARS: Continuous Optimization for Structure Learning

Xun Zheng, Bryon Aragam, Pradeep K. Ravikumar, Eric P. Xing

Abstract Estimating the structure of directed acyclic graphs (DAGs, also known as Bayesian networks) is a challenging problem since the search space of DAGs is combinatorial and scales superexponentially with the number of nodes. Existing approaches rely on various local heuristics for enforcing the acyclicity constraint. In this paper, we introduce a fundamentally different strategy: we formulate the structure learning problem as a purely continuous optimization problem over real matrices that avoids this combinatorial constraint entirely. This is achieved by a novel characterization of acyclicity that is not only smooth but also exact. The resulting problem can be efficiently solved by standard numerical algorithms, which also makes implementation effortless. The proposed method outperforms existing ones, without imposing any structural assumptions on the graph such as bounded treewidth or in-degree.

Doubly Robust Bayesian Inference for Non-Stationary Streaming Data with β-Divergences

Jeremias Knoblauch, Jack E. Jewson, Theodoros Damoulas

Abstract We present the very first robust Bayesian Online Changepoint Detection algorithm through General Bayesian Inference (GBI) with β-divergences. The resulting inference procedure is doubly robust for both the predictive and the changepoint (CP) posterior, with linear time and constant space complexity. We provide a construction for exponential models and demonstrate it on the Bayesian Linear Regression model. In so doing, we make two additional contributions: Firstly, we make GBI scalable using Structural Variational approximations that are exact as β→0. Secondly, we give a principled way of choosing the divergence parameter β by minimizing expected predictive loss on-line. Reducing False Discovery Rates of CPs from up to 99% to 0% on real world data, this offers the state of the art.

The promises and pitfalls of Stochastic Gradient Langevin Dynamics

Nicolas Brosse, Alain Durmus, Eric Moulines

Abstract Stochastic Gradient Langevin Dynamics (SGLD) has emerged as a key MCMC algorithm for Bayesian learning from large scale datasets. While SGLD with decreasing step sizes converges weakly to the posterior distribution, the algorithm is often used with a constant step size in practice and has demonstrated spectacular successes in machine learning tasks. The current practice is to set the step size inversely proportional to N where N is the number of training samples. As N becomes large, we show that the SGLD algorithm has an invariant probability measure which significantly departs from the target posterior and behaves like Stochastic Gradient Descent (SGD). This difference is inherently due to the high variance of the stochastic gradients. Several strategies have been suggested to reduce this effect; among them, SGLD Fixed Point (SGLDFP) uses carefully designed control variates to reduce the variance of the stochastic gradients. We show that SGLDFP gives approximate samples from the posterior distribution, with an accuracy comparable to the Langevin Monte Carlo (LMC) algorithm for a computational cost sublinear in the number of data points. We provide a detailed analysis of the Wasserstein distances between LMC, SGLD, SGLDFP and SGD and explicit expressions of the means and covariance matrices of their invariant distributions. Our findings are supported by limited numerical experiments.

Reparameterization Gradient for Non-differentiable Models

Wonyeol Lee, Hangyeol Yu, Hongseok Yang

Abstract We present a new algorithm for stochastic variational inference that targets at models with non-differentiable densities. One of the key challenges in stochastic variational inference is to come up with a low-variance estimator of the gradient of a variational objective. We tackle the challenge by generalizing the reparameterization trick, one of the most effective techniques for addressing the variance issue for differentiable models, so that the trick works for non-differentiable models as well. Our algorithm splits the space of latent variables into regions where the density of the variables is differentiable, and their boundaries where the density may fail to be differentiable. For each differentiable region, the algorithm applies the standard reparameterization trick and estimates the gradient restricted to the region. For each potentially non-differentiable boundary, it uses a form of manifold sampling and computes the direction for variational parameters that, if followed, would increase the boundary’s contribution to the variational objective. The sum of all the estimates becomes the gradient estimate of our algorithm. Our estimator enjoys the reduced variance of the reparameterization gradient while remaining unbiased even for non-differentiable models. The experiments with our preliminary implementation confirm the benefit of reduced variance and unbiasedness.

Improving Explorability in Variational Inference with Annealed Variational Objectives

Chin-Wei Huang, Shawn Tan, Alexandre Lacoste, Aaron C. Courville

Abstract Despite the advances in the representational capacity of approximate distributions for variational inference, the optimization process can still limit the density that is ultimately learned. We demonstrate the drawbacks of biasing the true posterior to be unimodal, and introduce Annealed Variational Objectives (AVO) into the training of hierarchical variational methods. Inspired by Annealed Importance Sampling, the proposed method facilitates learning by incorporating energy tempering into the optimization objective. In our experiments, we demonstrate our method’s robustness to deterministic warm up, and the benefits of encouraging exploration in the latent space.

Program Synthesis

Neural Guided Constraint Logic Programming for Program Synthesis

Lisa Zhang, Gregory Rosenblatt, Ethan Fetaya, Renjie Liao, William Byrd, Matthew Might, Raquel Urtasun, Richard Zemel

Abstract Synthesizing programs using example input/outputs is a classic problem in artificial intelligence. We present a method for solving Programming By Example (PBE) problems by using a neural model to guide the search of a constraint logic programming system called miniKanren. Crucially, the neural model uses miniKanren’s internal representation as input; miniKanren represents a PBE problem as recursive constraints imposed by the provided examples. We explore Recurrent Neural Network and Graph Neural Network models. We contribute a modified miniKanren, drivable by an external agent, available at https://github.com/xuexue/neuralkanren. We show that our neural-guided approach using constraints can synthesize programs faster in many cases, and importantly, can generalize to larger problems.

Learning to Infer Graphics Programs from Hand-Drawn Images

Kevin Ellis, Daniel Ritchie, Armando Solar-Lezama, Josh Tenenbaum

Abstract We introduce a model that learns to convert simple hand drawings into graphics programs written in a subset of LaTeX. The model combines techniques from deep learning and program synthesis. We learn a convolutional neural network that proposes plausible drawing primitives that explain an image. These drawing primitives are a specification (spec) of what the graphics program needs to draw. We learn a model that uses program synthesis techniques to recover a graphics program from that spec. These programs have constructs like variable bindings, iterative loops, or simple kinds of conditionals. With a graphics program in hand, we can correct errors made by the deep network and extrapolate drawings.

Learning Libraries of Subroutines for Neurally–Guided Bayesian Program Induction

Kevin Ellis, Lucas Morales, Mathias Sablé-Meyer, Armando Solar-Lezama, Josh Tenenbaum

Abstract Successful approaches to program induction require a hand-engineered domain-specific language (DSL), constraining the space of allowed programs and imparting prior knowledge of the domain. We contribute a program induction algorithm that learns a DSL while jointly training a neural network to efficiently search for programs in the learned DSL. We use our model to synthesize functions on lists, edit text, and solve symbolic regression problems, showing how the model learns a domain-specific library of program components for expressing solutions to problems in the domain.

HOUDINI: Lifelong Learning as Program Synthesis

Lazar Valkov, Dipak Chaudhari, Akash Srivastava, Charles Sutton, Swarat Chaudhuri

Abstract We present a neurosymbolic framework for the lifelong learning of algorithmic tasks that mix perception and procedural reasoning. Reusing high-level concepts across domains and learning complex procedures are key challenges in lifelong learning. We show that a program synthesis approach that combines gradient descent with combinatorial search over programs can be a more effective response to these challenges than purely neural methods. Our framework, called HOUDINI, represents neural networks as strongly typed, differentiable functional programs that use symbolic higher-order combinators to compose a library of neural functions. Our learning algorithm consists of: (1) a symbolic program synthesizer that performs a type-directed search over parameterized programs, and decides on the library functions to reuse, and the architectures to combine them, while learning a sequence of tasks; and (2) a neural module that trains these programs using stochastic gradient descent. We evaluate HOUDINI on three benchmarks that combine perception with the algorithmic tasks of counting, summing, and shortest-path computation. Our experiments show that HOUDINI transfers high-level concepts more effectively than traditional transfer learning and progressive neural networks, and that the typed representation of networks significantly accelerates the search.

A Retrieve-and-Edit Framework for Predicting Structured Outputs

Tatsunori B. Hashimoto, Kelvin Guu, Yonatan Oren, Percy S. Liang

Abstract For the task of generating complex outputs such as source code, editing existing outputs can be easier than generating complex outputs from scratch. With this motivation, we propose an approach that first retrieves a training example based on the input (e.g., natural language description) and then edits it to the desired output (e.g., code). Our contribution is a computationally efficient method for learning a retrieval model that embeds the input in a task-dependent way without relying on a hand-crafted metric or incurring the expense of jointly training the retriever with the editor. Our retrieve-and-edit framework can be applied on top of any base model. We show that on a new autocomplete task for GitHub Python code and the Hearthstone cards benchmark, retrieve-and-edit significantly boosts the performance of a vanilla sequence-to-sequence model on both tasks.

Neural Code Comprehension: A Learnable Representation of Code Semantics

Tal Ben-Nun, Alice Shoshana Jakobovits, Torsten Hoefler

Abstract With the recent success of embeddings in natural language processing, research has been conducted into applying similar methods to code analysis. Most works attempt to process the code directly or use a syntactic tree representation, treating it like sentences written in a natural language. However, none of the existing methods are sufficient to comprehend program semantics robustly, due to structural features such as function calls, branching, and interchangeable order of statements. In this paper, we propose a novel processing technique to learn code semantics, and apply it to a variety of program analysis tasks. In particular, we stipulate that a robust distributional hypothesis of code applies to both human- and machine-generated programs. Following this hypothesis, we define an embedding space, inst2vec, based on an Intermediate Representation (IR) of the code that is independent of the source programming language. We provide a novel definition of contextual flow for this IR, leveraging both the underlying data- and control-flow of the program. We then analyze the embeddings qualitatively using analogies and clustering, and evaluate the learned representation on three different high-level tasks. We show that even without fine-tuning, a single RNN architecture and fixed inst2vec embeddings outperform specialized approaches for performance prediction (compute device mapping, optimal thread coarsening); and algorithm classification from raw code (104 classes), where we set a new state-of-the-art.

Tree-to-tree Neural Networks for Program Translation

Xinyun Chen, Chang Liu, Dawn Song

Abstract Program translation is an important tool to migrate legacy code in one language into an ecosystem built in a different language. In this work, we are the first to employ deep neural networks toward tackling this problem. We observe that program translation is a modular procedure, in which a sub-tree of the source tree is translated into the corresponding target sub-tree at each step. To capture this intuition, we design a tree-to-tree neural network to translate a source tree into a target one. Meanwhile, we develop an attention mechanism for the tree-to-tree model, so that when the decoder expands one non-terminal in the target tree, the attention mechanism locates the corresponding sub-tree in the source tree to guide the expansion of the decoder. We evaluate the program translation capability of our tree-to-tree model against several state-of-the-art approaches. Compared against other neural translation models, we observe that our approach is consistently better than the baselines with a margin of up to 15 points. Further, our approach can improve the previous state-of-the-art program translation approaches by a margin of 20 points on the translation of real-world projects.

Improving Neural Program Synthesis with Inferred Execution Traces

Richard Shin, Illia Polosukhin, Dawn Song

Abstract The task of program synthesis, or automatically generating programs that are consistent with a provided specification, remains a challenging task in artificial intelligence. As in other fields of AI, deep learning-based end-to-end approaches have made great advances in program synthesis. However, more so than other fields such as computer vision, program synthesis provides greater opportunities to explicitly exploit structured information such as execution traces, which contain a superset of the information input/output pairs. While they are highly useful for program synthesis, as execution traces are more difficult to obtain than input/output pairs, we use the insight that we can split the process into two parts: infer the trace from the input/output example, then infer the program from the trace. This simple modification leads to state-of-the-art results in program synthesis in the Karel domain, improving accuracy to 81.3% from the 77.12% of prior work.

Memory Augmented Policy Optimization for Program Synthesis and Semantic Parsing

Chen Liang, Mohammad Norouzi, Jonathan Berant, Quoc V. Le, Ni Lao

Abstract We present Memory Augmented Policy Optimization (MAPO), a simple and novel way to leverage a memory buffer of promising trajectories to reduce the variance of policy gradient estimate. MAPO is applicable to deterministic environments with discrete actions, such as structured prediction and combinatorial optimization tasks. We express the expected return objective as a weighted sum of two terms: an expectation over the high-reward trajectories inside the memory buffer, and a separate expectation over trajectories outside the buffer. To make an efficient algorithm of MAPO, we propose: (1) memory weight clipping to accelerate and stabilize training; (2) systematic exploration to discover high-reward trajectories; (3) distributed sampling from inside and outside of the memory buffer to scale up training. MAPO improves the sample efficiency and robustness of policy gradient, especially on tasks with sparse rewards. We evaluate MAPO on weakly supervised program synthesis from natural language (semantic parsing). On the WikiTableQuestions benchmark, we improve the state-of-the-art by 2.6%, achieving an accuracy of 46.3%. On the WikiSQL benchmark, MAPO achieves an accuracy of 74.9% with only weak supervision, outperforming several strong baselines with full supervision. Our source code is available at https://goo.gl/TXBp4e

Automatic Program Synthesis of Long Programs with a Learned Garbage Collector

Amit Zohar, Lior Wolf

Abstract We consider the problem of generating automatic code given sample input-output pairs. We train a neural network to map from the current state and the outputs to the program’s next statement. The neural network optimizes multiple tasks concurrently: the next operation out of a set of high level commands, the operands of the next statement, and which variables can be dropped from memory. Using our method we are able to create programs that are more than twice as long as existing state-of-the-art solutions, while improving the success rate for comparable lengths, and cutting the run-time by two orders of magnitude. Our code, including an implementation of various literature baselines, is publicly available at https://github.com/amitz25/PCCoder

Learning Loop Invariants for Program Verification

Xujie Si, Hanjun Dai, Mukund Raghothaman, Mayur Naik, Le Song

Abstract A fundamental problem in program verification concerns inferring loop invariants. The problem is undecidable and even practical instances are challenging. Inspired by how human experts construct loop invariants, we propose a reasoning framework Code2Inv that constructs the solution by multi-step decision making and querying an external program graph memory block. By training with reinforcement learning, Code2Inv captures rich program features and avoids the need for ground truth solutions as supervision. Compared to previous learning tasks in domains with graph-structured data, it addresses unique challenges, such as a binary objective function and an extremely sparse reward that is given by an automated theorem prover only after the complete loop invariant is proposed. We evaluate Code2Inv on a suite of 133 benchmark problems and compare it to three state-of-the-art systems. It solves 106 problems compared to 73 by a stochastic search-based system, 77 by a heuristic search-based system, and 100 by a decision tree learning-based system. Moreover, the strategy learned can be generalized to new programs: compared to solving new instances from scratch, the pre-trained agent is more sample efficient in finding solutions.

Application papers

Scalable End-to-End Autonomous Vehicle Testing via Rare-event Simulation

Matthew O’Kelly, Aman Sinha, Hongseok Namkoong, Russ Tedrake, John C. Duchi

Abstract While recent developments in autonomous vehicle (AV) technology highlight substantial progress, we lack tools for rigorous and scalable testing. Real-world testing, the de facto evaluation environment, places the public in danger, and, due to the rare nature of accidents, will require billions of miles in order to statistically validate performance claims. We implement a simulation framework that can test an entire modern autonomous driving system, including, in particular, systems that employ deep-learning perception and control algorithms. Using adaptive importance-sampling methods to accelerate rare-event probability evaluation, we estimate the probability of an accident under a base distribution governing standard traffic behavior. We demonstrate our framework on a highway scenario, accelerating system evaluation by 2-20 times over naive Monte Carlo sampling methods and 10-300P times (where P is the number of processors) over real-world testing.

Graph Convolutional Policy Network for Goal-Directed Molecular Graph Generation

Jiaxuan You, Bowen Liu, Zhitao Ying, Vijay Pande, Jure Leskovec

Abstract Generating novel graph structures that optimize given objectives while obeying some given underlying rules is fundamental for chemistry, biology and social science research. This is especially important in the task of molecular graph generation, whose goal is to discover novel molecules with desired properties such as drug-likeness and synthetic accessibility, while obeying physical laws such as chemical valency. However, designing models that find molecules that optimize desired properties while incorporating highly complex and non-differentiable rules remains a challenging task. Here we propose Graph Convolutional Policy Network (GCPN), a general graph convolutional network based model for goal-directed graph generation through reinforcement learning. The model is trained to optimize domain-specific rewards and adversarial loss through policy gradient, and acts in an environment that incorporates domain-specific rules. Experimental results show that GCPN can achieve 61% improvement on chemical property optimization over state-of-the-art baselines while resembling known molecules, and achieve 184% improvement on the constrained property optimization task.

Constrained Graph Variational Autoencoders for Molecule Design

Qi Liu, Miltiadis Allamanis, Marc Brockschmidt, Alexander Gaunt

Abstract Graphs are ubiquitous data structures for representing interactions between entities. With an emphasis on applications in chemistry, we explore the task of learning to generate graphs that conform to a distribution observed in training data. We propose a variational autoencoder model in which both encoder and decoder are graph-structured. Our decoder assumes a sequential ordering of graph extension steps and we discuss and analyze design choices that mitigate the potential downsides of this linearization. Experiments compare our approach with a wide range of baselines on the molecule generation task and show that our method is successful at matching the statistics of the original dataset on semantically important metrics. Furthermore, we show that by using appropriate shaping of the latent space, our model allows us to design molecules that are (locally) optimal in desired properties.

Misc

Neural Ordinary Differential Equations

Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, David K. Duvenaud

Abstract We introduce a new family of deep neural network models. Instead of specifying a discrete sequence of hidden layers, we parameterize the derivative of the hidden state using a neural network. The output of the network is computed using a blackbox differential equation solver. These continuous-depth models have constant memory cost, adapt their evaluation strategy to each input, and can explicitly trade numerical precision for speed. We demonstrate these properties in continuous-depth residual networks and continuous-time latent variable models. We also construct continuous normalizing flows, a generative model that can train by maximum likelihood, without partitioning or ordering the data dimensions. For training, we show how to scalably backpropagate through any ODE solver, without access to its internal operations. This allows end-to-end training of ODEs within larger models.

A Unified Framework for Extensive-Form Game Abstraction with Bounds

Christian Kroer, Tuomas Sandholm

Abstract Abstraction has long been a key component in the practical solving of large-scale extensive-form games. Despite this, abstraction remains poorly understood. There have been some recent theoretical results but they have been confined to specific assumptions on abstraction structure and are specific to various disjoint types of abstraction, and specific solution concepts, for example, exact Nash equilibria or strategies with bounded immediate regret. In this paper we present a unified framework for analyzing abstractions that can express all types of abstractions and solution concepts used in prior papers with performance guarantees—while maintaining comparable bounds on abstraction quality. Moreover, our framework gives an exact decomposition of abstraction error in a much broader class of games, albeit only in an ex-post sense, as our results depend on the specific strategy chosen. Nonetheless, we use this ex-post decomposition along with slightly weaker assumptions than prior work to derive generalizations of prior bounds on abstraction quality. We also show, via counterexample, that such assumptions are necessary for some games. Finally, we prove the first bounds for how ϵ-Nash equilibria computed in abstractions perform in the original game. This is important because often one cannot afford to compute an exact Nash equilibrium in the abstraction. All our results apply to general-sum n-player games.

Exponentiated Strongly Rayleigh Distributions

Zelda E. Mariet, Suvrit Sra, Stefanie Jegelka

Abstract Strongly Rayleigh (SR) measures are discrete probability distributions over the subsets of a ground set. They enjoy strong negative dependence properties, as a result of which they assign higher probability to subsets of diverse elements. We introduce in this paper Exponentiated Strongly Rayleigh (ESR) measures, which sharpen (or smoothen) the negative dependence property of SR measures via a single parameter (the exponent) that can be intuitively understood as an inverse temperature. We develop efficient MCMC procedures for approximate sampling from ESRs, and obtain explicit mixing time bounds for two concrete instances: exponentiated versions of Determinantal Point Processes and Dual Volume Sampling. We illustrate some of the potential of ESRs, by applying them to a few machine learning tasks; empirical results confirm that beyond their theoretical appeal, ESR-based models hold significant promise for these tasks.

Learning to Optimize Tensor Programs

Tianqi Chen, Lianmin Zheng, Eddie Yan, Ziheng Jiang, Thierry Moreau, Luis Ceze, Carlos Guestrin, Arvind Krishnamurthy

Abstract We introduce a learning-based framework to optimize tensor programs for deep learning workloads. Efficient implementations of tensor operators, such as matrix multiplication and high dimensional convolution are key enablers of effective deep learning systems. However, existing systems rely on manually optimized libraries such as cuDNN where only a narrow range of server class GPUs are well-supported. The reliance on hardware specific operator libraries limits the applicability of high-level graph optimizations and incurs significant engineering costs when deploying to new hardware targets. We use learning to remove this engineering burden. We learn domain specific statistical cost models to guide the search of tensor operator implementations over billions of possible program variants. We further accelerate the search by effective model transfer across workloads. Experimental results show that our framework delivers performance competitive with state-of-the-art hand-tuned libraries for low-power CPU, mobile GPU, and server-class GPU.

Modelling sparsity, heterogeneity, reciprocity and community structure in temporal interaction data

Xenia Miscouridou, Francois Caron, Yee Whye Teh

Abstract We propose a novel class of network models for temporal dyadic interaction data. Our objective is to capture important features often observed in social interactions: sparsity, degree heterogeneity, community structure and reciprocity. We use mutually-exciting Hawkes processes to model the interactions between each (directed) pair of individuals. The intensity of each process allows interactions to arise as responses to opposite interactions (reciprocity), or due to shared interests between individuals (community structure). For sparsity and degree heterogeneity, we build the non time dependent part of the intensity function on compound random measures following Todeschini et al., 2016. We conduct experiments on real-world temporal interaction data and show that the proposed model outperforms competing approaches for link prediction, and leads to interpretable parameters.

Workshop papers:

Infer to Control workshop

VIREL: A Variational Inference Framework for Reinforcement Learning

Matthew Fellows, Anuj Mahajan, Tim G. J. Rudner, Shimon Whiteson

Abstract Applying probabilistic models to reinforcement learning (RL) has become an exciting direction of research owing to powerful optimisation tools such as variational inference becoming applicable to RL. However, due to their formulation, existing inference frameworks and their algorithms pose significant challenges for learning optimal policies, for example, the absence of mode capturing behaviour in pseudo-likelihood methods and difficulties in optimisation of learning objective in maximum entropy RL based approaches. We propose VIREL, a novel, theoretically grounded probabilistic inference framework for RL that utilises the action-value function in a parametrised form to capture future dynamics of the underlying Markov decision process. Owing to its generality, our framework lends itself to current advances in variational inference. Applying the variational expectation-maximisation algorithm to our framework, we show that the actor-critic algorithm can be reduced to expectation-maximisation. We derive a family of methods from our framework, including state-of-the-art methods based on soft value functions. We evaluate two actor-critic algorithms derived from this family, which perform on par with soft actor critic, demonstrating that our framework offers a promising perspective on RL as inference.

Decomposing the ELBO

Rob Zinkov — Fri, 02 Nov 2018 00:00:00 UT

When performing Variational Inference, we are minimizing the KL divergence between some distribution we care about \(p(\v{z} \mid \v{x})\) and some distribution that is easier to work with \(q_\phi(\v{z} \mid \v{x})\).

\[ \begin{align} \phi^* &= \underset{\phi}{\mathrm{argmin}}\, \text{KL}(q_\phi(\v{z} \mid \v{x}) \;\|\; p(\v{z} \mid \v{x})) \\ &= \underset{\phi}{\mathrm{argmin}}\, \mathbb{E}_{q_\phi(\v{z} \mid \v{x})} \big[\log q_\phi(\v{z} \mid \v{x}) - \log p(\v{z} \mid \v{x}) \big]\\ \end{align} \]

Now because the density of \(p(\v{z} \mid \v{x})\) usually isn’t tractable, we use a property of the log model evidence \(\log\, p(\v{x})\) to define a different objective to optimize.

\[ \begin{align} \Expect_{q_\phi(\v{z} \mid \v{x})} \big[\log q_\phi(\v{z} \mid \v{x}) - \log p(\v{z} \mid \v{x})\big] &\leq \Expect_{q_\phi(\v{z} \mid \v{x})} \big[\log q_\phi(\v{z} \mid \v{x}) - \log p(\v{z} \mid \v{x})\big] - \log p(\v{x}) \\ &= \Expect _{q_\phi(\v{z} \mid \v{x})} \big[\log q_\phi(\v{z} \mid \v{x}) - \log p(\v{z} \mid \v{x}) - \log p(\v{x})\big] \\ &= \Expect _{q_\phi(\v{z} \mid \v{x})} \big[\log q_\phi(\v{z} \mid \v{x}) - \log p(\v{x}, \v{z})\big]\\ &= -\mathcal{L}(\phi) \end{align} \]

As \(\mathcal{L}(\phi) = \log p(\v{x}) - \text{KL}(q_\phi(\v{z} \mid \v{x}) \;\|\; p(\v{z} \mid \v{x}))\) and \(\log p(\v{x})\) does not depend on \(\phi\), maximizing \(\mathcal{L}(\phi)\) effectively minimizes our original KL.

This term \(\mathcal{L}(\phi)\) is sometimes called the evidence lower bound or ELBO. Because the KL term is always greater than or equal to zero, \(\mathcal{L}(\phi)\) can be seen as a lower bound on \(\log p(\v{x})\).
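To make the relationship between the evidence, the ELBO and the KL concrete, here is a minimal numerical sketch. The toy discrete model, its probabilities and the choice of \(q\) are all made up for illustration; only the identity being checked comes from the text.

```python
import numpy as np

# Hypothetical toy model: z takes 3 values, x takes 2 values.
p_z = np.array([0.5, 0.3, 0.2])               # prior p(z)
p_x_given_z = np.array([[0.9, 0.1],           # p(x | z), rows indexed by z
                        [0.4, 0.6],
                        [0.2, 0.8]])
x = 1                                         # the observation we condition on

p_xz = p_z * p_x_given_z[:, x]                # joint p(x, z) as a function of z
p_x = p_xz.sum()                              # evidence p(x)
posterior = p_xz / p_x                        # exact posterior p(z | x)

q = np.array([0.2, 0.5, 0.3])                 # an arbitrary approximation q(z | x)

elbo = np.sum(q * (np.log(p_xz) - np.log(q)))         # E_q[log p(x, z) - log q]
kl = np.sum(q * (np.log(q) - np.log(posterior)))      # KL(q || p(z | x))

# log p(x) and ELBO + KL agree, and the ELBO sits below log p(x).
print(np.log(p_x), elbo + kl, elbo)
```

Setting `q = posterior` drives the KL to zero, at which point the ELBO equals \(\log p(\v{x})\) exactly.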

Because expectations are linear, the ELBO can be rearranged into many different forms. This is useful for getting an intuition for what can be going wrong when you learn \(q_\phi(\v{z} \mid \v{x})\).

Now why does this matter? Couldn’t I just optimize this loss with SGD and be done? Well, you can, but often if something is going wrong it will show up as one or more terms being unusually off. Making these tradeoffs explicit in the loss function means you can adjust it to favor different properties of your learned representation, either by hand or automatically.

Entropy form

The classic form is in terms of an energy term and an entropy term. The first term encourages \(q\) to put high probability mass wherever \(p\) does so. The second term encourages \(q\) to maximize its entropy as much as possible and put probability mass everywhere it can.

\[ \mathcal{L}(\phi) = \Expect_{q_\phi(\v{z} \mid \v{x})}[\log p(\v{x}, \v{z})] + H(q_\phi(\v{z} \mid \v{x})) \]

where

\[ H(q_\phi(\v{z} \mid \v{x})) \triangleq - \Expect_{q_\phi(\v{z} \mid \v{x})}[\log q_\phi(\v{z} \mid \v{x})] \]

Reconstruction error minus KL on the prior

More often these days, we describe \(\mathcal{L}\) in terms of a reconstruction term and a KL on the prior for \(p\). Here the first term is saying we should put mass on latent codes \(\v{z}\) from which \(p\) is likely to generate our observation \(\v{x}\). The second term then trades this off against \(q\) also being near the prior.

\[ \mathcal{L}(\phi) = \Expect_{q_\phi(\v{z} \mid \v{x})}[\log p(\v{x} \mid \v{z})] - \text{KL}(q_\phi(\v{z} \mid \v{x}) \;\|\; p(\v{z}))\]
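As a sanity check that the entropy form and the reconstruction form are the same quantity, here is a small sketch on a made-up discrete model. All the numbers are hypothetical, and the check is for a single observation.

```python
import numpy as np

# Hypothetical toy model: z takes 3 values, x takes 2 values.
p_z = np.array([0.5, 0.3, 0.2])               # prior p(z)
p_x_given_z = np.array([[0.9, 0.1],           # p(x | z), rows indexed by z
                        [0.4, 0.6],
                        [0.2, 0.8]])
x = 0                                         # the observation
q = np.array([0.6, 0.3, 0.1])                 # an arbitrary q(z | x)

log_lik = np.log(p_x_given_z[:, x])           # log p(x | z) for each z
log_joint = np.log(p_z) + log_lik             # log p(x, z) for each z

# Entropy form: energy term plus entropy of q.
energy = np.sum(q * log_joint)
entropy = -np.sum(q * np.log(q))

# Reconstruction form: expected log-likelihood minus KL(q || prior).
reconstruction = np.sum(q * log_lik)
kl_prior = np.sum(q * (np.log(q) - np.log(p_z)))

print(energy + entropy, reconstruction - kl_prior)   # same number twice
```

The two forms only move \(\log p(\v{z})\) from the energy term into the KL term, which is why they agree for any choice of `q`.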

ELBO surgery

But there are other ways to think about this decomposition. Because we frequently use amortized inference to learn a \(\phi\) useful for describing all kinds of \(q\) distributions regardless of our choice of observation \(\v{x}\), we can talk about the average distribution we learn over our observed data, with \(p_d\) being the empirical distribution of our observations.

\[ \overline{q}_\phi(\v{z}) = \Expect_{p_d(\v{x})} \big[ q_\phi(\v{z} \mid \v{x}) \big] \]

This is sometimes called the aggregate posterior.

With that we can decompose our KL on the prior into a mutual information term that encourages each \(q_\phi(\v{z} \mid \v{x})\) we create to be near the average one \(\overline{q}_\phi(\v{z})\) and a KL between this average distribution and the prior. This encourages the representation generated for \(\v{z}\) to be useful.

\[ \mathcal{L}(\phi) = \Expect_{q_\phi(\v{z} \mid \v{x})}[\log p(\v{x} \mid \v{z})] - \mathbb{I}_q(\v{x},\v{z}) - \text{KL}(\overline{q}_\phi(\v{z}) \;\|\; p(\v{z})) \]

where

\[ \mathbb{I}_q(\v{x},\v{z}) \triangleq \Expect_{p_d}\big[\Expect_{q_\phi(\v{z} \mid \v{x})} \big[\log q_\phi(\v{z} \mid \v{x})\big] \big] - \Expect_{\overline{q}_\phi(\v{z})} \log \overline{q}_\phi(\v{z}) \]
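The split of the KL on the prior can be checked numerically. The sketch below assumes a made-up empirical distribution `p_d` over two observations and a made-up encoder `q`, and averages the KL over `p_d`, which is how I read the decomposition.

```python
import numpy as np

# Hypothetical setup: 2 possible observations, 3 latent values.
p_z = np.array([0.5, 0.3, 0.2])               # prior p(z)
p_d = np.array([0.4, 0.6])                    # empirical distribution p_d(x)
q = np.array([[0.7, 0.2, 0.1],                # q(z | x=0)
              [0.2, 0.5, 0.3]])               # q(z | x=1)

q_bar = p_d @ q                               # aggregate posterior, shape (3,)

# Average over the data of KL(q(z | x) || p(z)).
avg_kl_prior = np.sum(p_d[:, None] * q * (np.log(q) - np.log(p_z)))

# Mutual information between x and z under p_d(x) q(z | x).
mutual_info = (np.sum(p_d[:, None] * q * np.log(q))
               - np.sum(q_bar * np.log(q_bar)))

# KL between the aggregate posterior and the prior.
kl_aggregate = np.sum(q_bar * (np.log(q_bar) - np.log(p_z)))

print(avg_kl_prior, mutual_info + kl_aggregate)      # same number twice
```

If every row of `q` is identical, the mutual information is zero and the whole penalty comes from the aggregate posterior missing the prior.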

Difference of two KL divergences

With something like \(p_d\) around it is also possible to pull out the relationship between \(p\) and \(p_d\). This is particularly relevant if you intend to learn \(p\).

\[ \mathcal{L}(\phi) = - \text{KL}(q_\phi(\v{z} \mid \v{x}) \;\|\; p(\v{z} \mid \v{x})) - \text{KL}(p_d(\v{x}) \;\|\; p(\v{x})) \]
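One way to read the identity above is as an average over \(p_d\), in which case the entropy of \(p_d\) shows up as an extra additive constant that depends on neither \(\phi\) nor \(p\). The sketch below checks that reading on a made-up discrete model; the numbers and the explicit constant are my assumptions, not something stated in the equation.

```python
import numpy as np

# Hypothetical toy model: z takes 3 values, x takes 2 values.
p_z = np.array([0.5, 0.3, 0.2])               # prior p(z)
p_x_given_z = np.array([[0.9, 0.1],           # p(x | z), rows indexed by z
                        [0.4, 0.6],
                        [0.2, 0.8]])
p_d = np.array([0.4, 0.6])                    # empirical distribution p_d(x)
q = np.array([[0.7, 0.2, 0.1],                # q(z | x=0)
              [0.2, 0.5, 0.3]])               # q(z | x=1)

joint = p_z[:, None] * p_x_given_z            # p(x, z), shape (z, x)
p_x = joint.sum(axis=0)                       # model evidence p(x)
posterior = (joint / p_x).T                   # p(z | x), shape (x, z)

# ELBO for each observation, then averaged over the data.
elbo_per_x = np.sum(q * (np.log(joint.T) - np.log(q)), axis=1)
avg_elbo = p_d @ elbo_per_x

# The two KL terms, plus the constant entropy of the data distribution.
avg_kl_posterior = p_d @ np.sum(q * (np.log(q) - np.log(posterior)), axis=1)
kl_data = np.sum(p_d * (np.log(p_d) - np.log(p_x)))
entropy_data = -np.sum(p_d * np.log(p_d))

print(avg_elbo, -avg_kl_posterior - kl_data - entropy_data)
```

Since `entropy_data` is fixed by the dataset, dropping it changes the value of the objective but not its optimum.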

Full decomposition

Of course with more aggressive rearranging, we can just have a term to encourage learning better latent representations. In a setting where you aren’t learning \(p\) some of these terms are constant and can generally be ignored. I provide them here for completeness.

\[ \mathcal{L}(\phi) = \Expect_{q_\phi(\v{z} \mid \v{x})}\left[ \log\frac{p(\v{x} \mid \v{z})}{p(\v{x})} - \log\frac{q_\phi(\v{z} \mid \v{x})}{\overline{q}_\phi(\v{z})} \right] - \text{KL}(p_d(\v{x}) \;\|\; p(\v{x})) - \text{KL}(\overline{q}_\phi(\v{z}) \;\|\; p(\v{z}))\]

I highly encourage checking out the Appendix of the Structured Disentangled Representations paper to see how much further this can be pushed.

Final notes

Of course, all the above still holds in the VAE setting where \(p\) becomes \(p_\theta\) but I felt the notation was cluttered enough already. It’s kind of amazing how much insight can be gained through expanding and collapsing one loss function.

Calculating the golden-era of The Simpsons

Rob Zinkov — Fri, 03 Nov 2017 00:00:00 UT

Last week Nathan Cunningham published a fantastic article about using some stats to estimate at which episode The Simpsons started to decline.

Cameron Davidson-Pilon suggested this would make a great application of Bayesian changepoint models.

Someone want to Bayesian switchpoint model this? See first chapter of BMH https://t.co/QiGwzA0khD
— Cam DP 👨🏽‍💻 (@Cmrn_DP) 28 October 2017

It turns out he was totally right. Taking the coal-mining disaster example from the pymc3 quickstart guide and slightly modifying it is enough to do the job.

First we load the data

import pandas as pd

data = pd.read_csv("simpsons_ratings.csv")
index = data.index

Then we use some Gaussians to describe the average rating, and how that mean rating translates to the quality of any particular episode.

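import pymc3 as pm
import theano.tensor as tt

with pm.Model() as model:
    # Episode index at which the average rating changes
    switch = pm.DiscreteUniform('switch', lower=index.min(), upper=index.max())
    # Average rating before and after the switchpoint
    early_mean = pm.Normal('early_mean', mu=5., sd=1.)
    late_mean = pm.Normal('late_mean', mu=5., sd=1.)
    # Episodes up to and including the switchpoint use early_mean, the rest late_mean
    mean = tt.switch(switch >= index.values, early_mean, late_mean)
    ratings = pm.Normal('ratings', mu=mean, sd=1.,
                        observed=data["UserRating"].values)

    tr = pm.sample(10000, tune=500)
    pm.traceplot(tr)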

As we can see, around episode 220 is when our model thinks The Simpsons started its downward slide.

That would be

print("{}: {}".format(data["EpisodeID"][220], data["Episode"][220]))
# >>> S10E18: Simpsons Bible Stories

An episode I remember being alright. Generally the 10th season is acknowledged as the last of the golden years. In fact, Chicago Simpsons Trivia Night bills itself as not asking any questions from seasons after 10.
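If you want to build intuition for what the switchpoint model is looking for without running a sampler, a brute-force sketch helps. This is not the pymc3 model above: it scores every candidate switchpoint on synthetic data with plug-in means and a unit-variance Gaussian likelihood. The data, the means and the `log_lik` helper are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the ratings: 300 episodes with a drop after index 200.
true_switch = 200
ratings = np.concatenate([
    rng.normal(8.0, 0.5, size=true_switch),
    rng.normal(6.5, 0.5, size=300 - true_switch),
])

def log_lik(switch):
    """Gaussian log-likelihood (unit sd, up to a constant) with plug-in means."""
    early, late = ratings[:switch], ratings[switch:]
    return -0.5 * (np.sum((early - early.mean()) ** 2)
                   + np.sum((late - late.mean()) ** 2))

# Score every split that leaves at least one episode on each side.
candidates = np.arange(1, len(ratings))
scores = np.array([log_lik(s) for s in candidates])
best = candidates[np.argmax(scores)]

print(best)   # index of the best-scoring switchpoint
```

The Bayesian model does something similar but keeps a full posterior over the switchpoint and the two means instead of a single best split.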

Apologies in advance for not using more Simpsons jokes in this post. You can find the code and data I used on github.