<!DOCTYPE html>
	<html lang="en">
	<head>
	    <meta charset="utf-8">
	    <style type="text/css">
	@import url(//fonts.googleapis.com/css?family=Vollkorn:400,400italic,700,700italic&subset=latin);

html, body {
        padding:1em;
        margin:auto;
        max-width:42em;
        background:#fefefe;
  }
body {
  font: 1.3em "Vollkorn", Palatino, Times;
  color: #333;
  line-height: 1.4;
  text-align: justify;
  }
header, nav, article, footer {
  width: 700px;
  margin:0 auto;
  }
article {
  margin-top: 4em;
  margin-bottom: 4em;
  min-height: 400px;
  }
footer {
  margin-bottom:50px;
  }
video {
  margin: 2em 0;
  border:1px solid #ddd;
  }

nav {
  font-size: .9em;
  font-style: italic;
  border-bottom: 1px solid #ddd;
  padding: 1em 0;
  }
nav p {
  margin: 0;
  }

/* Typography
-------------------------------------------------------- */

h1 {
  margin-top: 0;
  font-weight: normal;
  }
h2 {
  font-weight: normal;
  }
h3 {
  font-weight: normal;
  font-style: italic;
  margin-top:3em;
  }
p {
  margin-top:0;
  -webkit-hyphens:auto;
  -moz-hyphens:auto;
  hyphens:auto;
  }
ul {
  list-style: square;
  padding-left:1.2em;
  }
ol {
  padding-left:1.2em;
  }
blockquote {
  margin-left: 1em;
  padding-left: 1em;
  border-left: 1px solid #ddd;
  }
code {
  font-family: "Consolas", "Menlo", "Monaco", monospace, serif;
  font-size: .9em;
  background: white;
  }
a {
  color: #2484c1;
  text-decoration: none;
  }
a:hover {
  text-decoration: underline;
  }
a img {
  border:none;
  outline:none;
  }
h1 a, h1 a:hover {
  color: #333;
  text-decoration: none;
  }
hr {
  color : #ddd;
  height : 1px;
  margin: 2em 0;
  border-top : solid 1px #ddd;
  border-bottom : none;
  border-left: 0;
  border-right: 0;
  }
p#heart{
  font-size: 2em;
  line-height: 1;
  text-align: center;
  color: #ccc;
  }
.red {
  color:#B50000;
  }

/* Home Page
--------------------------- */

body#index li {
  margin-bottom: 1em;
  }


/* iPad
-------------------------------------------------------- */
@media only screen and (max-device-width: 1024px) {
body {
  font-size: 120%;
  line-height: 1.4;
  }
} /* @iPad */

/* iPhone
-------------------------------------------------------- */
@media only screen and (max-device-width: 480px) {
body {
  text-align: left;
  }
article, footer {
  width: auto;
  }
article {
  padding: 0 10px;
  }
} /* @iPhone */

	    </style>
	</head>
	<body>
	<p>Notes on some ~30 interesting NIPS 2013 papers, read by Markus Heinonen.</p>

<p>All papers are <a href="http://papers.nips.cc/book/advances-in-neural-information-processing-systems-26-2013">available online</a> (see also the <a href="http://cs.stanford.edu/people/karpathy/nips2013/">visualized overview</a>).</p>

<p>The best of these papers are:</p>

<ul>
<li><em>Lopez-Paz, Hennig &amp; Scholkopf</em>: <strong>The Randomized Dependence Coefficient</strong></li>
<li><em>Zhang, Lee &amp; Teh</em>: <strong>Learning with Invariance via Linear Functionals on Reproducing Kernel Hilbert Space</strong></li>
<li><em>Paskov, West, Mitchell &amp; Hastie</em>: <strong>Compressive Feature Learning</strong></li>
</ul>

<hr />

<h2>Randomization &amp; Feature selection</h2>

<ul>
<li><p>Ungar: <strong>New Subsampling Algorithms for Fast Least Squares Regression</strong></p>

<ul>
<li>subsample data when n &gt;&gt; p, i.e. lots of data available</li>
<li>statistical notation (OLS)</li>
<li>a simple procedure: choose a subsample and use the remaining data to correct the bias (a minimal sketch follows this list)
<ol>
<li>subsample data and learn regression  </li>
<li>use non-sampled data as test set,</li>
<li>learn a second model that matches the test set residuals, and</li>
<li>combine these two models to fix bias</li>
</ol></li>
<li>not very impressive or novel</li>
</ul></li>
<li><p>Ungar:  <strong>Faster Ridge Regression via the Subsampled Randomized Hadamard Transform</strong></p>

<ul>
<li>subsample features when p &gt;&gt; n; analogous to the previous papers, but with ridge regression</li>
<li>apply Hadamard transform on sampled columns of X</li>
<li>not really novel; Hadamard/Fourier transforms have been known for a long time</li>
</ul></li>
<li><p>Buhman: <strong>Correlated random features for fast semi-supervised learning</strong></p>

<ul>
<li>correlated Nyström views (XNV) algorithm, semi-supervised, three steps (sketched in code after this list):
<ol>
<li>construct two random projections (views) of the data 
<ul>
<li>Nyström method or random Fourier features; Nyström was found to work better</li>
<li>assume there actually are two different views of the data, hence both views contain good predictors</li>
</ul></li>
<li>CCA over the two views to extract correlated features (use all unlabeled data)</li>
<li>CCA regression: penalize over found CCA coefficients, 
<ul>
<li>i.e. use mostly features that are correlated on both views</li>
</ul></li>
</ol></li>
<li>the Nyström method approximates the kernel matrix with low-dimensional random feature maps</li>
<li>fast; performance without unlabeled data is equivalent to other methods, and with unlabeled data it is nicely better</li>
</ul></li>
<li><p>Geurts: <strong>Understanding variable importances in forests of randomized trees</strong></p>

<ul>
<li>discusses the mean decrease of impurity (MDI) statistic for feature relevance</li>
<li>formulates the MDI for an ensemble of totally random trees (branch on random features, not learning much)
<ul>
<li>the MDI decomposes nicely into components measuring the relevance of individual features,
combinations of features, and feature dependencies across trees</li>
</ul></li>
<li>second part talks about realistic random forests, but does not provide any results for these
<ul>
<li>i.e. MDI statistic does not behave nicely with realistic trees</li>
<li>i.e. zero MDI does not mean irrelevant feature, etc.</li>
</ul></li>
</ul></li>
<li><p>Jaakkola: <strong>Learning Efficient Random Maximum A-Posteriori Predictors with Non-Decomposable Loss Functions</strong></p>

<ul>
<li>random MAP predictors for multi-task</li>
</ul></li>
<li><p>Rosasco: <strong>On the Sample Complexity of Subspace Learning</strong></p></li>
<li><p>Precup: <strong>Bellman Error Based Feature Generation using Random Projections on Sparse Spaces</strong></p></li>
</ul>
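<p>As referenced in the subsampling item above, a minimal sketch of the subsample-and-correct idea, assuming plain OLS; the function and variable names are illustrative, not the authors' code:</p>

<pre><code>import numpy as np

def subsample_and_correct(X, y, m, rng=np.random.default_rng(0)):
    """Two-stage fit: OLS on a random subsample, then a second OLS on the
    held-out residuals, combined to reduce the subsampling bias."""
    idx = rng.permutation(X.shape[0])
    sub, rest = idx[:m], idx[m:]

    # 1. subsample the data and learn a regression
    w1, *_ = np.linalg.lstsq(X[sub], y[sub], rcond=None)

    # 2.-3. use the non-sampled data as a test set and fit its residuals
    resid = y[rest] - X[rest] @ w1
    w2, *_ = np.linalg.lstsq(X[rest], resid, rcond=None)

    # 4. combine the two models
    return w1 + w2
</code></pre>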
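<p>And a simplified sketch of the XNV pipeline: random Fourier features for both views, CCA fitted on the unlabeled data, and a plain ridge regression standing in for the paper's CCA-correlation penalty. Names and parameters are illustrative assumptions:</p>

<pre><code>import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.linear_model import Ridge

def xnv_sketch(X_lab, y_lab, X_unlab, d=200, k=20, rng=np.random.default_rng(0)):
    """Two random views, CCA on unlabeled data, regression on CCA features."""
    p = X_lab.shape[1]

    def view(X, W, b):
        # random Fourier features approximating an RBF kernel
        return np.cos(X @ W + b)

    # 1. construct two random projections (views) of the data
    W1, b1 = rng.standard_normal((p, d)), rng.uniform(0, 2 * np.pi, d)
    W2, b2 = rng.standard_normal((p, d)), rng.uniform(0, 2 * np.pi, d)

    # 2. CCA over the two views, using all the (cheap) unlabeled data
    cca = CCA(n_components=k).fit(view(X_unlab, W1, b1), view(X_unlab, W2, b2))

    # 3. regression on the correlated features of the labeled data
    Z, _ = cca.transform(view(X_lab, W1, b1), view(X_lab, W2, b2))
    return Ridge(alpha=1.0).fit(Z, y_lab)
</code></pre>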

<hr />

<h2>Feature selection</h2>

<ul>
<li><p>Hernandez-Lobato: <strong>Learning Feature Selection Dependencies in Multi-task Learning</strong></p>

<ul>
<li>feature selection using the <em>horseshoe prior</em> in the n &lt; p case
<ul>
<li>infinite spike at zero, long tails to allow large values</li>
<li>somewhere between the Gaussian (L2) and Laplacian (L1) priors</li>
</ul></li>
<li>they add correlations between features to the horseshoe prior through latent variables</li>
<li>the correlation structure is constrained as PP^T where P is of smaller dimension than the full correlation structure
<ul>
<li>learned from data</li>
</ul></li>
<li>approximate inference with expectation propagation, using gradients of the log-likelihood with respect to P</li>
<li>extension to multitask case where correlation matrix is shared among tasks,
<ul>
<li>i.e. more evidence for correlations</li>
</ul></li>
<li>good results</li>
</ul></li>
<li><p>Hastie: <strong>Compressive Feature Learning</strong></p>

<ul>
<li>feature learning as a compression problem (as in "gzip")</li>
<li>unsupervised feature learning of text data using MDL principle and dictionary compression (Lempel-Ziv)
<ul>
<li>choose set of k-grams that minimize the dictionary compression cost = size (lossless compression)</li>
<li>this cost balances between having a small dictionary (just 1-grams) and the number of decompression pointers needed;
the compressed size equals dictionary + pointers (a toy cost function is sketched after this list)</li>
</ul></li>
<li>a binary programming problem where a subset of k-grams is chosen into the dictionary
to minimize the compressed size; this subset is the learned feature set</li>
<li>the non-convex problem is approximated by a series of relaxations into linear problems</li>
<li>nice results</li>
</ul></li>
</ul>
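<p>A toy illustration of the compression cost in the Hastie item above: dictionary size plus the number of pointers under a greedy longest-match parse. The helper names are made up for this sketch; the paper's actual objective is the relaxed binary program, not this greedy evaluation:</p>

<pre><code>def compression_cost(doc_tokens, dictionary):
    """Dictionary storage cost plus the number of pointers needed to
    reconstruct the document with a greedy longest-match parse."""
    max_k = max(len(g) for g in dictionary)
    dict_cost = sum(len(g) for g in dictionary)   # storing the k-grams
    pointers, i = 0, 0
    while i &lt; len(doc_tokens):
        for k in range(max_k, 0, -1):             # longest match first
            if tuple(doc_tokens[i:i + k]) in dictionary:
                pointers += 1
                i += k
                break
        else:
            pointers += 1                         # fall back to a raw token
            i += 1
    return dict_cost + pointers

doc = "to be or not to be".split()
print(compression_cost(doc, {("to",), ("be",), ("or",), ("not",)}))   # 4 + 6
print(compression_cost(doc, {("to", "be"), ("or",), ("not",)}))       # 4 + 4
</code></pre>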

<hr />

<h2>Kernel theory</h2>

<ul>
<li><p>Zhang: <strong>Learning with Invariance via Linear Functionals on Reproducing Kernel Hilbert Space</strong></p>

<ul>
<li>amazing paper</li>
<li>the paper discusses invariance in learning, i.e. some local transformations should have no effect on the model
<ul>
<li>e.g. digit detector should be invariant to rotations, scalings, translations, etc. </li>
<li>one way to handle this is to constrain the gradient of the learned function to be small along 
invariant transformations (or ideally zero)</li>
<li>in another approach the function f(x) should be invariant to a transformation T(x,\theta),
i.e. f(T(x,\theta1)) = f(T(x,\theta2))</li>
</ul></li>
<li>they propose invariances as local functionals <strong>in the given RKHS</strong> (translations, averages, gradients can all be done)
so that gradients of f along invariances are penalized</li>
<li>i.e. min L(y,f(x)) + ||f|| + Li(f), where Li is the invariance functional, with kernel forms obtained
through the representer theorem (a rough finite-difference sketch follows this list)</li>
<li>Several functionals are shown to be possible
<ul>
<li>semisupervised smoothness priors</li>
<li>transformations</li>
<li>averaging</li>
</ul></li>
<li>really cool results</li>
</ul></li>
<li><p>Mohri: <strong>Learning Kernels Using Local Rademacher Complexity</strong></p>

<ul>
<li>in Rademacher-based MKL methods, usually the kernel traces are controlled</li>
<li>the paper presents a <em>local</em> Rademacher complexity, which is the global Rademacher complexity when
the function variance is constrained; this can lead to tighter bounds and faster convergence</li>
<li>the local Rademacher complexity is then the tail sum of the kernel eigenvalues</li>
<li>in MKL the Rademacher complexity is often constrained by the trace of the kernel matrix
<ul>
<li>however, since the local RC gives tighter generalization bounds, they use the eigenvalue tail sums as constraints instead (sketched in code after this list)</li>
<li>i.e. they only use kernels that have a low eigenvalue tail sum (after a cut-off \theta)</li>
<li>this heavily favours non-noisy kernels in the MKL</li>
</ul></li>
<li>very nice results, the new complexity helps a lot</li>
</ul></li>
<li><p>Borgwardt: <strong>Rapid Distance-Based Outlier Detection via Sampling</strong></p>

<ul>
<li>an overview of outlier detection methods (especially distance-based methods: an outlier is a point too far from the others)</li>
<li>a robust method computes, for each point, its distances to all other points and forms an outlier score (O(n^2))
<ul>
<li>several complex methods have arisen to compute the distances efficiently</li>
</ul></li>
<li>in this paper they show that computing outlier scores against a single one-pass subsample is good enough (see the sketch after this list)</li>
<li>the results are baffling: on a dataset with over 5 million points the best subsample size is 20!
<ul>
<li>i.e. all distances of 5 million points are computed against only those 20</li>
<li>they don't give much analysis of the results..</li>
</ul></li>
</ul></li>
</ul>
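<p>The invariance regularizer in the Zhang item can be crudely approximated with a finite-difference penalty on f(T(x)) - f(x); the sketch below does that for kernel ridge regression. This is my simplification, not the paper's functional formulation:</p>

<pre><code>import numpy as np

def rbf(A, B, gamma=1.0):
    # RBF kernel matrix between the rows of A and the rows of B
    return np.exp(-gamma * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))

def invariant_kernel_ridge(X, y, T, lam=1e-2, mu=1.0, gamma=1.0):
    """Kernel ridge regression plus a penalty on (f(T(x_i)) - f(x_i))^2,
    a finite-difference stand-in for the paper's invariance functionals."""
    K  = rbf(X, X, gamma)       # k(x_i, x_j)
    Kt = rbf(T(X), X, gamma)    # k(T(x_i), x_j)
    D  = Kt - K                 # D @ alpha gives f(T(x_i)) - f(x_i)
    A  = K @ K + lam * K + mu * D.T @ D + 1e-8 * np.eye(len(X))
    alpha = np.linalg.solve(A, K @ y)
    return lambda Xnew: rbf(Xnew, X, gamma) @ alpha

# e.g. approximate translation invariance along the first coordinate (2-d data):
# model = invariant_kernel_ridge(X, y, lambda X: X + np.array([0.1, 0.0]))
</code></pre>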
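<p>The eigenvalue tail-sum constraint from the Mohri item is cheap to compute per kernel; a small sketch, where the cut-off index and the kernel matrices are whatever the MKL setup provides:</p>

<pre><code>import numpy as np

def eigenvalue_tail_sum(K, theta):
    """Sum of the kernel eigenvalues after the cut-off index theta,
    used here in place of the usual trace constraint."""
    lam = np.sort(np.linalg.eigvalsh(K))[::-1]   # eigenvalues, descending
    return lam[theta:].sum()

# kernels with a fast-decaying spectrum (low tail sum) would be favoured;
# noisy, flat-spectrum kernels would be penalized
</code></pre>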
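<p>And the sampling-based outlier scoring from the Borgwardt item fits in a few lines; here a nearest-neighbour-in-the-sample score as one plausible variant (the paper's exact scoring rule may differ):</p>

<pre><code>import numpy as np

def sampled_outlier_scores(X, m=20, rng=np.random.default_rng(0)):
    """Score every point by its distance to the nearest point in a single
    random subsample of size m (one pass, no n^2 distance matrix)."""
    S = X[rng.choice(len(X), size=m, replace=False)]
    d2 = ((X[:, None, :] - S[None, :, :]) ** 2).sum(-1)   # n x m distances
    d2[d2 == 0.0] = np.inf      # a sampled point should not score itself
    return np.sqrt(d2.min(axis=1))                        # large = outlier
</code></pre>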

<hr />

<h2>Gaussian processes</h2>

<ul>
<li>Hertzmann: <strong>Efficient Optimization for Sparse Gaussian Process Regression</strong></li>
<li>Stegle:    <strong>It is all in the noise: Efficient multi-task Gaussian process inference with structured residuals</strong>
<ul>
<li>interesting <strong>GP-kronsum</strong> model adds a task-covariance term so that similar targets get similar predictions</li>
<li><em>need to reread...</em></li>
</ul></li>
</ul>

<hr />

<h2>Statistical learning papers</h2>

<ul>
<li><p>Fleuret: <strong>Reservoir Boosting: Between Online and Offline Ensemble Learning</strong></p>

<ul>
<li>new online ensemble framework where data is inspected one at a time, but some data are
stored in a "reservoir" for fast access</li>
<li>i.e. each round they have the reservoir + a fresh batch of points, of which some are retained
as the new reservoir, and they compute the next boosted weak learner on the reservoir</li>
<li>the reservoir is populated with points with high residuals, but also with some low-residual points
to represent an unbiased version of the data (a toy resampling sketch follows this list)
<ul>
<li>good results with sampling points over the residual distribution</li>
<li>an alternative GEEM algorithm tries to keep the reservoir representative of the full dataset,
i.e. the weak learners trained on the reservoir and on the full data should be close</li>
</ul></li>
<li>they show good-ish results, but only compare against online learners that process data one point at a time
<ul>
<li>online learners have big disadvantage here</li>
<li>no comparison to offline boosting (!)</li>
</ul></li>
</ul></li>
<li><p>Hopcroft: <strong>Sign Cauchy Projections and Chi-Square Kernel</strong></p>

<ul>
<li>study stable random projections </li>
<li>???</li>
</ul></li>
<li><p>Jordan: <strong>Estimation, Optimization, and Parallelism when Data is Sparse</strong></p>

<ul>
<li>general analysis of learning when data is sparse</li>
<li>they show upper and lower bounds for minimax learning on sparse data</li>
<li>this allows them to do parallelisation</li>
</ul></li>
<li><p>Valiant: <strong>Estimating the Unseen: Improved Estimators for Entropy and other Properties</strong></p></li>
<li><p>Scholkopf: <strong>Statistical analysis of coupled time series with Kernel Cross-Spectral Density operators</strong></p></li>
<li><p>Garg: <strong>Adaptivity to Local Smoothness and Dimension in Kernel Regression</strong></p></li>
<li><p>Shalev-Shwartz: <strong>More data speeds up training time in learning halfspaces over sparse vectors</strong></p>

<ul>
<li>if you have so much data that adding more does not improve learning,
can you turn extra data into improved speed?</li>
<li>learning halfspaces over k-sparse vectors of length n
(think of google queries, millions (=n) of potential query words, usually just 1, 2 or 3 present)</li>
<li>with n^1.0 examples, runtime exp2(n)</li>
<li>with n^1.5 examples, runtime poly(n)</li>
<li>with n^2.0 examples, runtime lin(n)</li>
<li>reduction from 3SAT problem</li>
</ul></li>
<li><p>Scholkopf: <strong>The Randomized Dependence Coefficient</strong></p>

<ul>
<li>really interesting paper</li>
<li>presents a new kind of non-linear correlation measure, which is fast to compute (5 lines of code!), 
while previous measures (ACE, HSIC, KCCA, MIC) are heavy to compute</li>
<li>based on the abstract HGR correlation coefficient by Renyi</li>
<li>three steps to compute (sketched in code after this list):
<ol>
<li>empirical copula transformation to make the sample invariant to monotone marginal transformations (i.e. map into CDFs)</li>
<li>projection into k random non-linear maps (e.g. k sinusoidal maps with parameters drawn from a Gaussian [prior])</li>
<li>computation of largest canonical correlation coefficient (standard)</li>
</ol></li>
<li>runs in O(n log n); the difference to the HGR coefficient is bounded</li>
<li>excellent results in feature selection and resistance to noise</li>
</ul></li>
</ul>
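<p>The three RDC steps above really do fit in a few lines; a sketch following that recipe, where the scaling constant and the sine-only feature map are my simplifications of the paper's choices:</p>

<pre><code>import numpy as np
from scipy.stats import rankdata
from sklearn.cross_decomposition import CCA

def rdc(x, y, k=20, s=1.0 / 6.0, rng=np.random.default_rng(0)):
    """Randomized Dependence Coefficient: copula transform, random
    sinusoidal projections, largest canonical correlation."""
    n = len(x)
    # 1. empirical copula transform (map samples through their ECDFs)
    cx = np.column_stack([rankdata(x) / n, np.ones(n)])
    cy = np.column_stack([rankdata(y) / n, np.ones(n)])
    # 2. k random sinusoidal projections with Gaussian parameters
    fx = np.sin(s * cx @ rng.standard_normal((cx.shape[1], k)))
    fy = np.sin(s * cy @ rng.standard_normal((cy.shape[1], k)))
    # 3. largest canonical correlation between the two feature sets
    u, v = CCA(n_components=1).fit_transform(fx, fy)
    return abs(np.corrcoef(u.ravel(), v.ravel())[0, 1])
</code></pre>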
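<p>For the reservoir boosting item, a toy version of a residual-weighted reservoir refill; this captures only the sampling idea (keep hard points while staying roughly unbiased), not the authors' exact scheme or the GEEM variant:</p>

<pre><code>import numpy as np

def refill_reservoir(X_res, r_res, X_new, r_new, size, rng=np.random.default_rng(0)):
    """Pool the old reservoir with the fresh batch and keep `size` points,
    sampling with probability proportional to the absolute residuals so that
    high-residual points are retained but low-residual points still appear."""
    X = np.vstack([X_res, X_new])
    r = np.concatenate([r_res, r_new])
    p = np.abs(r) + 1e-12
    keep = rng.choice(len(X), size=size, replace=False, p=p / p.sum())
    return X[keep], r[keep]
</code></pre>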

<hr />

<h2>Misc  </h2>

<ul>
<li><p>Blaschko: <strong>B-tests: Low Variance Kernel Two-Sample Tests</strong></p></li>
<li><p>Adams: <strong>Contrastive Learning Using Spectral Methods</strong></p></li>
<li><p>Adams: <strong>Message Passing Inference with Chemical Reaction Networks</strong></p></li>
<li><p>Kim: <strong>A* Lasso for Learning a Sparse Bayesian Network Structure for Continuous Variables</strong></p></li>
<li><p>Borgwardt: <strong>Scalable kernels for graphs with continuous attributes</strong></p>

<ul>
<li>a more efficient shortest-path graph kernel</li>
<li>completely pointless; the SP kernel is already the fastest graph kernel, and it does not perform well</li>
</ul></li>
<li><p>Wipf: <strong>Non-Uniform Camera Shake Removal Using a Spatially-Adaptive Sparse Penalty</strong></p></li>
</ul>
</body>
	</html>
	