<!DOCTYPE html>
	<html lang="en">
	<head>
	    <meta charset="utf-8">
	    <style type="text/css">
	@import url(//fonts.googleapis.com/css?family=Vollkorn:400,400italic,700,700italic&subset=latin);

html, body {
        padding:1em;
        margin:auto;
        max-width:42em;
        background:#fefefe;
  }
body {
  font: 1.3em "Vollkorn", Palatino, Times;
  color: #333;
  line-height: 1.4;
  text-align: justify;
  }
header, nav, article, footer {
  width: 700px;
  margin:0 auto;
  }
article {
  margin-top: 4em;
  margin-bottom: 4em;
  min-height: 400px;
  }
footer {
  margin-bottom:50px;
  }
video {
  margin: 2em 0;
  border:1px solid #ddd;
  }

nav {
  font-size: .9em;
  font-style: italic;
  border-bottom: 1px solid #ddd;
  padding: 1em 0;
  }
nav p {
  margin: 0;
  }

/* Typography
-------------------------------------------------------- */

h1 {
  margin-top: 0;
  font-weight: normal;
  }
h2 {
  font-weight: normal;
  }
h3 {
  font-weight: normal;
  font-style: italic;
  margin-top:3em;
  }
p {
  margin-top:0;
  -webkit-hyphens:auto;
  -moz-hyphens:auto;
  hyphens:auto;
  }
ul {
  list-style: square;
  padding-left:1.2em;
  }
ol {
  padding-left:1.2em;
  }
blockquote {
  margin-left: 1em;
  padding-left: 1em;
  border-left: 1px solid #ddd;
  }
code {
  font-family: "Consolas", "Menlo", "Monaco", monospace, serif;
  font-size: .9em;
  background: white;
  }
a {
  color: #2484c1;
  text-decoration: none;
  }
a:hover {
  text-decoration: underline;
  }
a img {
  border:none;
  outline:none;
  }
h1 a, h1 a:hover {
  color: #333;
  text-decoration: none;
  }
hr {
  color : #ddd;
  height : 1px;
  margin: 2em 0;
  border-top : solid 1px #ddd;
  border-bottom : none;
  border-left: 0;
  border-right: 0;
  }
p#heart{
  font-size: 2em;
  line-height: 1;
  text-align: center;
  color: #ccc;
  }
.red {
  color:#B50000;
  }

/* Home Page
--------------------------- */

body#index li {
  margin-bottom: 1em;
  }


/* iPad
-------------------------------------------------------- */
@media only screen and (max-device-width: 1024px) {
body {
  font-size: 120%;
  line-height: 1.4;
  }
} /* @iPad */

/* iPhone
-------------------------------------------------------- */
@media only screen and (max-device-width: 480px) {
body {
  text-align: left;
  }
article, footer {
  width: auto;
  }
article {
  padding: 0 10px;
  }
} /* @iPhone */

	    </style>
	</head>
	<body>
	<p>Notes on some ~30 interesting NIPS 2013 papers, read by Markus Heinonen.</p>

<p>All papers are <a href="http://papers.nips.cc/book/advances-in-neural-information-processing-systems-26-2013">available online</a> (see also the <a href="http://cs.stanford.edu/people/karpathy/nips2013/">visualized overview</a>).</p>

<p>The best of these papers are:</p>

<ul>
<li><em>Lopez-Paz, Hennig &amp; Scholkopf</em>: <strong>The Randomized Dependence Coefficient</strong></li>
<li><em>Zhang, Lee &amp; Teh</em>: <strong>Learning with Invariance via Linear Functionals on Reproducing Kernel Hilbert Space</strong></li>
<li><em>Paskov, West, Mitchell &amp; Hastie</em>: <strong>Compressive Feature Learning</strong></li>
</ul>

<hr />

<h2>Randomization &amp; Feature selection</h2>

<ul>
<li><p>Ungar: <strong>New Subsampling Algorithms for Fast Least Squares Regression</strong></p>

<ul>
<li>subsample data when n &gt;&gt; p, i.e. lots of data available</li>
<li>statistical notation (OLS)</li>
<li>a simple procedure: choose a subsample and use the remaining data to correct the bias (a minimal sketch follows this list)
<ol>
<li>subsample data and learn regression  </li>
<li>use non-sampled data as test set,</li>
<li>learn a second model that matches the test set residuals, and</li>
<li>combine these two models to fix bias</li>
</ol></li>
<li>not very impressive or novel</li>
</ul></li>
<li><p>Ungar:  <strong>Faster Ridge Regression via the Subsampled Randomized Hadamard Transform</strong></p>

<ul>
<li>subsample features when p &gt;&gt; n; analogous to the previous papers, but with ridge regression</li>
<li>apply Hadamard transform on sampled columns of X</li>
<li>not really novel; Hadamard/Fourier transforms have been known for a long time</li>
</ul></li>
<li><p>Buhman: <strong>Correlated random features for fast semi-supervised learning</strong></p>

<ul>
<li>correlated Nyström views (XNV) algorithm, semi-supervised, three steps (sketched in code after this list):
<ol>
<li>construct two random projections (views) of the data 
<ul>
<li>Nyström method or random Fourier features; Nyström was found to work better</li>
<li>assume there actually are two different views of the data, hence both views contain good predictors</li>
</ul></li>
<li>CCA over the two views to extract correlated features (use all unlabeled data)</li>
<li>CCA regression: penalize over found CCA coefficients, 
<ul>
<li>i.e. use mostly features that are correlated on both views</li>
</ul></li>
</ol></li>
<li>the Nyström method approximates the kernel matrix with low-dimensional random feature maps</li>
<li>fast; performance without unlabeled data is equivalent to other methods, and with unlabeled data it is nicely better</li>
</ul></li>
<li><p>Geurts: <strong>Understanding variable importances in forests of randomized trees</strong></p>

<ul>
<li>discusses the mean decrease of impurity (MDI) statistic for feature relevance</li>
<li>formulates the MDI for an ensemble of totally random trees (branch on random features, not learning much)
<ul>
<li>the MDI decomposes nicely into components measuring the relevance of individual features,
combinations of features, and feature dependencies across trees</li>
</ul></li>
<li>second part talks about realistic random forests, but does not provide any results for these
<ul>
<li>i.e. MDI statistic does not behave nicely with realistic trees</li>
<li>i.e. zero MDI does not mean irrelevant feature, etc.</li>
</ul></li>
</ul></li>
<li><p>Jaakkola: <strong>Learning Efficient Random Maximum A-Posteriori Predictors with Non-Decomposable Loss Functions</strong></p>

<ul>
<li>random MAP predictors for multi-task</li>
</ul></li>
<li><p>Rosasco: <strong>On the Sample Complexity of Subspace Learning</strong></p></li>
<li><p>Precup: <strong>Bellman Error Based Feature Generation using Random Projections on Sparse Spaces</strong></p></li>
</ul>
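<p>As referenced in the subsampling item above, a minimal sketch of the subsample-and-correct idea, assuming plain OLS; the function and variable names are illustrative, not the authors' code:</p>

<pre><code>import numpy as np

def subsample_and_correct(X, y, m, rng=np.random.default_rng(0)):
    """Two-stage fit: OLS on a random subsample, then a second OLS on the
    held-out residuals, combined to reduce the subsampling bias."""
    idx = rng.permutation(X.shape[0])
    sub, rest = idx[:m], idx[m:]

    # 1. subsample the data and learn a regression
    w1, *_ = np.linalg.lstsq(X[sub], y[sub], rcond=None)

    # 2.-3. use the non-sampled data as a test set and fit its residuals
    resid = y[rest] - X[rest] @ w1
    w2, *_ = np.linalg.lstsq(X[rest], resid, rcond=None)

    # 4. combine the two models
    return w1 + w2
</code></pre>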
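<p>And a simplified sketch of the XNV pipeline: random Fourier features for both views, CCA fitted on the unlabeled data, and a plain ridge regression standing in for the paper's CCA-correlation penalty. Names and parameters are illustrative assumptions:</p>

<pre><code>import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.linear_model import Ridge

def xnv_sketch(X_lab, y_lab, X_unlab, d=200, k=20, rng=np.random.default_rng(0)):
    """Two random views, CCA on unlabeled data, regression on CCA features."""
    p = X_lab.shape[1]

    def view(X, W, b):
        # random Fourier features approximating an RBF kernel
        return np.cos(X @ W + b)

    # 1. construct two random projections (views) of the data
    W1, b1 = rng.standard_normal((p, d)), rng.uniform(0, 2 * np.pi, d)
    W2, b2 = rng.standard_normal((p, d)), rng.uniform(0, 2 * np.pi, d)

    # 2. CCA over the two views, using all the (cheap) unlabeled data
    cca = CCA(n_components=k).fit(view(X_unlab, W1, b1), view(X_unlab, W2, b2))

    # 3. regression on the correlated features of the labeled data
    Z, _ = cca.transform(view(X_lab, W1, b1), view(X_lab, W2, b2))
    return Ridge(alpha=1.0).fit(Z, y_lab)
</code></pre>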

<hr />

<h2>Feature selection</h2>

<ul>
<li><p>Hernandez-Lobato: <strong>Learning Feature Selection Dependencies in Multi-task Learning</strong></p>

<ul>
<li>feature selection using the <em>horseshoe prior</em> in the n &lt; p case
<ul>
<li>infinite spike at zero, long tails to allow large values</li>
<li>somewhere between the Gaussian (L2) and Laplacian (L1) priors</li>
</ul></li>
<li>they add correlations between features to the horseshoe prior through latent variables</li>
<li>the correlation structure is constrained as PP^T where P is of smaller dimension than the full correlation structure
<ul>
<li>learned from data</li>
</ul></li>
<li>approximate inference with expectation propagation, using gradients of the log-likelihood with respect to P</li>
<li>extension to multitask case where correlation matrix is shared among tasks,
<ul>
<li>i.e. more evidence for correlations</li>
</ul></li>
<li>good results</li>
</ul></li>
<li><p>Hastie: <strong>Compressive Feature Learning</strong></p>

<ul>
<li>feature learning as a compression problem (as in "gzip")</li>
<li>unsupervised feature learning of text data using MDL principle and dictionary compression (Lempel-Ziv)
<ul>
<li>choose set of k-grams that minimize the dictionary compression cost = size (lossless compression)</li>
<li>this cost balances between having a small dictionary (just 1-grams) and the number of decompression pointers needed;
the compressed size equals dictionary + pointers (a toy cost function is sketched after this list)</li>
</ul></li>
<li>a binary programming problem where a subset of k-grams is chosen into the dictionary
to minimize the compressed size; this subset is the learned feature set</li>
<li>the non-convex problem is approximated by a series of relaxations into linear problems</li>
<li>nice results</li>
</ul></li>
</ul>
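<p>A toy illustration of the compression cost in the Hastie item above: dictionary size plus the number of pointers under a greedy longest-match parse. The helper names are made up for this sketch; the paper's actual objective is the relaxed binary program, not this greedy evaluation:</p>

<pre><code>def compression_cost(doc_tokens, dictionary):
    """Dictionary storage cost plus the number of pointers needed to
    reconstruct the document with a greedy longest-match parse."""
    max_k = max(len(g) for g in dictionary)
    dict_cost = sum(len(g) for g in dictionary)   # storing the k-grams
    pointers, i = 0, 0
    while i &lt; len(doc_tokens):
        for k in range(max_k, 0, -1):             # longest match first
            if tuple(doc_tokens[i:i + k]) in dictionary:
                pointers += 1
                i += k
                break
        else:
            pointers += 1                         # fall back to a raw token
            i += 1
    return dict_cost + pointers

doc = "to be or not to be".split()
print(compression_cost(doc, {("to",), ("be",), ("or",), ("not",)}))   # 4 + 6
print(compression_cost(doc, {("to", "be"), ("or",), ("not",)}))       # 4 + 4
</code></pre>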

<hr />

<h2>Kernel theory</h2>

<ul>
<li><p>Zhang: <strong>Learning with Invariance via Linear Functionals on Reproducing Kernel Hilbert Space</strong></p>

<ul>
<li>amazing paper</li>
<li>the paper discusses invariance in learning, i.e. some local transformations should have no effect on the model
<ul>
<li>e.g. digit detector should be invariant to rotations, scalings, translations, etc. </li>
<li>one way to handle this is to constrain the gradient of the learned function to be small along 
invariant transformations (or ideally zero)</li>
<li>in another approach the function f(x) should be invariant to a transformation T(x,\theta),
i.e. f(T(x,\theta1)) = f(T(x,\theta2))</li>
</ul></li>
<li>they propose invariances as local functionals <strong>in the given RKHS</strong> (translations, averages, gradients can all be done)
so that gradients of f along invariances are penalized</li>
<li>i.e. min L(y,f(x)) + ||f|| + Li(f), where Li is the invariance functional, with kernel forms obtained
through the representer theorem (a rough finite-difference sketch follows this list)</li>
<li>Several functionals are shown to be possible
<ul>
<li>semisupervised smoothness priors</li>
<li>transformations</li>
<li>averaging</li>
</ul></li>
<li>really cool results</li>
</ul></li>
<li><p>Mohri: <strong>Learning Kernels Using Local Rademacher Complexity</strong></p>

<ul>
<li>in Rademacher-based MKL methods, usually the kernel traces are controlled</li>
<li>the paper presents a <em>local</em> Rademacher complexity, which is the global Rademacher complexity when
the function variance is constrained; this can lead to tighter bounds and faster convergence</li>
<li>the local Rademacher complexity is then the tail sum of the kernel eigenvalues</li>
<li>in MKL the Rademacher complexity is often constrained by the trace of the kernel matrix
<ul>
<li>however, since the local RC gives tighter generalization bounds, they use the eigenvalue tail sums as constraints instead (sketched in code after this list)</li>
<li>i.e. they only use kernels that have a low eigenvalue tail sum (after a cut-off \theta)</li>
<li>this heavily favours non-noisy kernels in the MKL</li>
</ul></li>
<li>very nice results, the new complexity helps a lot</li>
</ul></li>
<li><p>Borgwardt: <strong>Rapid Distance-Based Outlier Detection via Sampling</strong></p>

<ul>
<li>an overview of outlier detection methods (especially distance-based methods: an outlier is a point too far from the others)</li>
<li>a robust method computes, for each point, its distances to all other points and forms an outlier score (O(n^2))
<ul>
<li>several complex methods have arisen to compute the distances efficiently</li>
</ul></li>
<li>in this paper they show that computing outlier scores against a single one-pass subsample is good enough (see the sketch after this list)</li>
<li>the results are baffling: on a dataset with over 5 million points the best subsample size is 20!
<ul>
<li>i.e. all distances of 5 million points are computed against only those 20</li>
<li>they don't give much analysis of the results..</li>
</ul></li>
</ul></li>
</ul>
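<p>The invariance regularizer in the Zhang item can be crudely approximated with a finite-difference penalty on f(T(x)) - f(x); the sketch below does that for kernel ridge regression. This is my simplification, not the paper's functional formulation:</p>

<pre><code>import numpy as np

def rbf(A, B, gamma=1.0):
    # RBF kernel matrix between the rows of A and the rows of B
    return np.exp(-gamma * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))

def invariant_kernel_ridge(X, y, T, lam=1e-2, mu=1.0, gamma=1.0):
    """Kernel ridge regression plus a penalty on (f(T(x_i)) - f(x_i))^2,
    a finite-difference stand-in for the paper's invariance functionals."""
    K  = rbf(X, X, gamma)       # k(x_i, x_j)
    Kt = rbf(T(X), X, gamma)    # k(T(x_i), x_j)
    D  = Kt - K                 # D @ alpha gives f(T(x_i)) - f(x_i)
    A  = K @ K + lam * K + mu * D.T @ D + 1e-8 * np.eye(len(X))
    alpha = np.linalg.solve(A, K @ y)
    return lambda Xnew: rbf(Xnew, X, gamma) @ alpha

# e.g. approximate translation invariance along the first coordinate (2-d data):
# model = invariant_kernel_ridge(X, y, lambda X: X + np.array([0.1, 0.0]))
</code></pre>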
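<p>The eigenvalue tail-sum constraint from the Mohri item is cheap to compute per kernel; a small sketch, where the cut-off index and the kernel matrices are whatever the MKL setup provides:</p>

<pre><code>import numpy as np

def eigenvalue_tail_sum(K, theta):
    """Sum of the kernel eigenvalues after the cut-off index theta,
    used here in place of the usual trace constraint."""
    lam = np.sort(np.linalg.eigvalsh(K))[::-1]   # eigenvalues, descending
    return lam[theta:].sum()

# kernels with a fast-decaying spectrum (low tail sum) would be favoured;
# noisy, flat-spectrum kernels would be penalized
</code></pre>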
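<p>And the sampling-based outlier scoring from the Borgwardt item fits in a few lines; here a nearest-neighbour-in-the-sample score as one plausible variant (the paper's exact scoring rule may differ):</p>

<pre><code>import numpy as np

def sampled_outlier_scores(X, m=20, rng=np.random.default_rng(0)):
    """Score every point by its distance to the nearest point in a single
    random subsample of size m (one pass, no n^2 distance matrix)."""
    S = X[rng.choice(len(X), size=m, replace=False)]
    d2 = ((X[:, None, :] - S[None, :, :]) ** 2).sum(-1)   # n x m distances
    d2[d2 == 0.0] = np.inf      # a sampled point should not score itself
    return np.sqrt(d2.min(axis=1))                        # large = outlier
</code></pre>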

<hr />

<h2>Gaussian processes</h2>

<ul>
<li>Hertzmann: <strong>Efficient Optimization for Sparse Gaussian Process Regression</strong></li>
<li>Stegle:    <strong>It is all in the noise: Efficient multi-task Gaussian process inference with structured residuals</strong>
<ul>
<li>interesting <strong>GP-kronsum</strong> model adds a task-covariance term so that similar targets get similar predictions</li>
<li><em>need to reread...</em></li>
</ul></li>
</ul>

<hr />

<h2>Statistical learning papers</h2>

<ul>
<li><p>Fleuret: <strong>Reservoir Boosting: Between Online and Offline Ensemble Learning</strong></p>

<ul>
<li>new online ensemble framework where data is inspected one at a time, but some data are
stored in a "reservoir" for fast access</li>
<li>i.e. each round they have the reservoir + a fresh batch of points, of which some are retained
as the new reservoir, and they compute the next boosted weak learner on the reservoir</li>
<li>the reservoir is populated with points with high residuals, but also with some low-residual points
to represent an unbiased version of the data (a toy resampling sketch follows this list)
<ul>
<li>good results with sampling points over the residual distribution</li>
<li>an alternative GEEM algorithm tries to keep the reservoir representative of the full dataset,
i.e. the weak learners trained on the reservoir and on the full data should be close</li>
</ul></li>
<li>they show good-ish results, but only compare against online learners that process data one point at a time
<ul>
<li>online learners have big disadvantage here</li>
<li>no comparison to offline boosting (!)</li>
</ul></li>
</ul></li>
<li><p>Hopcroft: <strong>Sign Cauchy Projections and Chi-Square Kernel</strong></p>

<ul>
<li>study stable random projections </li>
<li>???</li>
</ul></li>
<li><p>Jordan: <strong>Estimation, Optimization, and Parallelism when Data is Sparse</strong></p>

<ul>
<li>general analysis of learning when data is sparse</li>
<li>they show upper and lower bounds for minimax learning on sparse data</li>
<li>this allows them to do parallelisation</li>
</ul></li>
<li><p>Valiant: <strong>Estimating the Unseen: Improved Estimators for Entropy and other Properties</strong></p></li>
<li><p>Scholkopf: <strong>Statistical analysis of coupled time series with Kernel Cross-Spectral Density operators</strong></p></li>
<li><p>Garg: <strong>Adaptivity to Local Smoothness and Dimension in Kernel Regression</strong></p></li>
<li><p>Shalev-Shwartz: <strong>More data speeds up training time in learning halfspaces over sparse vectors</strong></p>

<ul>
<li>if you have so much data that adding more does not improve learning,
can you turn extra data into improved speed?</li>
<li>learning halfspaces over k-sparse vectors of length n
(think of google queries, millions (=n) of potential query words, usually just 1, 2 or 3 present)</li>
<li>with n^1.0 examples, runtime exp2(n)</li>
<li>with n^1.5 examples, runtime poly(n)</li>
<li>with n^2.0 examples, runtime lin(n)</li>
<li>reduction from 3SAT problem</li>
</ul></li>
<li><p>Scholkopf: <strong>The Randomized Dependence Coefficient</strong></p>

<ul>
<li>really interesting paper</li>
<li>presents a new kind of non-linear correlation measure, which is fast to compute (5 lines of code!), 
while previous measures (ACE, HSIC, KCCA, MIC) are heavy to compute</li>
<li>based on the abstract HGR correlation coefficient by Renyi</li>
<li>three steps to compute (sketched in code after this list):
<ol>
<li>empirical copula transformation to make the sample invariant to monotone marginal transformations (i.e. map into CDFs)</li>
<li>projection into k random non-linear maps (e.g. k sinusoidal maps with parameters drawn from a Gaussian [prior])</li>
<li>computation of largest canonical correlation coefficient (standard)</li>
</ol></li>
<li>runs in O(n log n); the difference to the HGR coefficient is bounded</li>
<li>excellent results in feature selection and resistance to noise</li>
</ul></li>
</ul>
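<p>The three RDC steps above really do fit in a few lines; a sketch following that recipe, where the scaling constant and the sine-only feature map are my simplifications of the paper's choices:</p>

<pre><code>import numpy as np
from scipy.stats import rankdata
from sklearn.cross_decomposition import CCA

def rdc(x, y, k=20, s=1.0 / 6.0, rng=np.random.default_rng(0)):
    """Randomized Dependence Coefficient: copula transform, random
    sinusoidal projections, largest canonical correlation."""
    n = len(x)
    # 1. empirical copula transform (map samples through their ECDFs)
    cx = np.column_stack([rankdata(x) / n, np.ones(n)])
    cy = np.column_stack([rankdata(y) / n, np.ones(n)])
    # 2. k random sinusoidal projections with Gaussian parameters
    fx = np.sin(s * cx @ rng.standard_normal((cx.shape[1], k)))
    fy = np.sin(s * cy @ rng.standard_normal((cy.shape[1], k)))
    # 3. largest canonical correlation between the two feature sets
    u, v = CCA(n_components=1).fit_transform(fx, fy)
    return abs(np.corrcoef(u.ravel(), v.ravel())[0, 1])
</code></pre>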
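<p>For the reservoir boosting item, a toy version of a residual-weighted reservoir refill; this captures only the sampling idea (keep hard points while staying roughly unbiased), not the authors' exact scheme or the GEEM variant:</p>

<pre><code>import numpy as np

def refill_reservoir(X_res, r_res, X_new, r_new, size, rng=np.random.default_rng(0)):
    """Pool the old reservoir with the fresh batch and keep `size` points,
    sampling with probability proportional to the absolute residuals so that
    high-residual points are retained but low-residual points still appear."""
    X = np.vstack([X_res, X_new])
    r = np.concatenate([r_res, r_new])
    p = np.abs(r) + 1e-12
    keep = rng.choice(len(X), size=size, replace=False, p=p / p.sum())
    return X[keep], r[keep]
</code></pre>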

<hr />

<h2>Misc  </h2>

<ul>
<li><p>Blaschko: <strong>B-tests: Low Variance Kernel Two-Sample Tests</strong></p></li>
<li><p>Adams: <strong>Contrastive Learning Using Spectral Methods</strong></p></li>
<li><p>Adams: <strong>Message Passing Inference with Chemical Reaction Networks</strong></p></li>
<li><p>Kim: <strong>A* Lasso for Learning a Sparse Bayesian Network Structure for Continuous Variables</strong></p></li>
<li><p>Borgwardt: <strong>Scalable kernels for graphs with continuous attributes</strong></p>

<ul>
<li>a more efficient shortest-path graph kernel</li>
<li>completely pointless; the SP kernel is already the fastest graph kernel, and it does not perform well</li>
</ul></li>
<li><p>Wipf: <strong>Non-Uniform Camera Shake Removal Using a Spatially-Adaptive Sparse Penalty</strong></p></li>
</ul>
</body>
	</html>
	