Estimation Theory

Marlin, Benjamin M., and Nando de Freitas. "Asymptotic Efficiency of Deterministic Estimators for Discrete Energy-Based Models: Ratio Matching and Pseudolikelihood." UAI. 2011. 497-505. Abstractestimators_uai11_paper.pdf

Standard maximum likelihood estimation cannot be applied to discrete energy-based models in the general case because the computation of exact model probabilities is intractable. Recent research has seen the proposal of several new estimators designed specifically to overcome this intractability, but virtually nothing is known about their theoretical properties. In this paper, we present a generalized estimator that unifies many of the classical and recently proposed estimators. We use results from the standard asymptotic theory for M-estimators to derive a generic expression for the asymptotic covariance matrix of our generalized estimator. We apply these results to study the relative statistical efficiency of classical pseudolikelihood and the recently-proposed ratio matching estimator.

Swersky, Kevin, Marc'Aurelio Ranzato, David Buchman, Benjamin M. Marlin, and Nando de Freitas. "On Autoencoders and Score Matching for Energy Based Models." ICML. 2011. 1201-1208. Abstractautoencoders_icml11_paper.pdf

We consider estimation methods for the class of continuous-data energy based models (EBMs). Our main result shows that estimating the parameters of an EBM using score matching when the conditional distribution over the visible units is Gaussian corresponds to training a particular form of regularized autoencoder. We show how different Gaussian EBMs lead to different autoencoder architectures, providing deep links between these two families of models. We compare the score matching estimator for the mPoT model, a particular Gaussian EBM, to several other training methods on a variety of tasks including image denoising and unsupervised feature extraction. We show that the regularization function induced by score matching leads to superior classification performance relative to a standard autoencoder. We also show that score matching yields classification results that are indistinguishable from better-known stochastic approximation maximum likelihood estimators.

Marlin, Benjamin M., Richard S. Zemel, Sam T. Roweis, and Malcolm Slaney. "Recommender Systems, Missing Data and Statistical Model Estimation." IJCAI. 2011. 2686-2691. Abstractmissing_data_ijcai11_paper.pdf

The goal of rating-based recommender systems is to make personalized predictions and recommendations for individual users by leveraging the preferences of a community of users with respect to a collection of items like songs or movies. Recommender systems are often based on intricate statistical models that are estimated from data sets containing a very high proportion of missing ratings. This work describes evidence of a basic incompatibility between the properties of recommender system data sets and the assumptions required for valid estimation and evaluation of statistical models in the presence of missing data. We discuss the implications of this problem and describe extended modelling and evaluation frameworks that attempt to circumvent it. We present prediction and ranking results showing that models developed and tested under these extended frameworks can significantly outperform standard models.

Swersky, Kevin, Bo Chen, Benjamin M. Marlin, and Nando de Freitas. "A tutorial on stochastic approximation algorithms for training Restricted Boltzmann Machines and Deep Belief Nets." ITA. 2010. 80-89. Abstract

In this study, we provide a direct comparison of the Stochastic Maximum Likelihood algorithm and Contrastive Divergence for training Restricted Boltzmann Machines using the MNIST data set. We demonstrate that Stochastic Maximum Likelihood is superior when using the Restricted Boltzmann Machine as a classifier, and that the algorithm can be greatly improved using the technique of iterate averaging from the field of stochastic approximation. We further show that training with optimal parameters for classification does not necessarily lead to optimal results when Restricted Boltzmann Machines are stacked to form a Deep Belief Network. In our experiments we observe that fine tuning a Deep Belief Network significantly changes the distribution of the latent data, even though the parameter changes are negligible.

Marlin, Benjamin M., Kevin Swersky, Bo Chen, and Nando de Freitas. "Inductive Principles for Restricted Boltzmann Machine Learning." AISTATS. 2010. 509-516. Abstract

Recent research has seen the proposal of several new inductive principles designed specifically to avoid the problems associated with maximum likelihood learning in models with intractable partition functions. In this paper, we study learning methods for binary restricted Boltzmann machines (RBMs) based on ratio matching and generalized score matching. We compare these new RBM learning methods to a range of existing learning methods including stochastic maximum likelihood, contrastive divergence, and pseudo-likelihood. We perform an extensive empirical evaluation across multiple tasks and data sets.

Marlin, Benjamin M., Richard S. Zemel, Sam T. Roweis, and Malcolm Slaney. "Collaborative Filtering and the Missing at Random Assumption." UAI. 2007. 267-275. Abstract

Rating prediction is an important application, and a popular research topic in collaborative filtering. However, both the validity of learning algorithms, and the validity of standard testing procedures rest on the assumption that missing ratings are missing at random (MAR). In this paper we present the results of a user study in which we collect a random sample of ratings from current users of an online radio service. An analysis of the rating data collected in the study shows that the sample of random ratings has markedly different properties than ratings of user-selected songs. When asked to report on their own rating behaviour, a large number of users indicate they believe their opinion of a song does affect whether they choose to rate that song, a violation of the MAR condition. Finally, we present experimental results showing that incorporating an explicit model of the missing data mechanism can lead to significant improvements in prediction performance on the random sample of ratings.