ICML: “Test of time” paper shows how times have changed

July 27, 2024

Amazon researchers have nine new papers at this year’s International Conference on Machine Learning (ICML), one of the top conferences in AI. Matthias Seeger, a principal applied scientist with Amazon Web Services (AWS), is a coauthor on one of them, which reports work led by AWS applied scientist Cuong Nguyen.

But it’s a paper that Seeger cowrote ten years ago that’s one of the conference highlights. On July 1, the ICML awards committee announced that Seeger and his colleagues’ 2010 paper “Gaussian process optimization in the bandit setting: no regret and experimental design” had won the conference’s Test of Time Award, which honors “a paper from ICML ten years ago that has had substantial impact on the field of machine learning, including both research and practice.”

The award committee’s citation begins, “This paper brought together the fields of Bayesian optimization, bandits, and experimental design.” The work “has since cross-fertilized these separate research domains,” Seeger adds.

Matthias Seeger, principal applied scientist

Bayesian-optimization and bandit problems have the same general structure, Seeger explains, but “Bayesian optimization is generally done over continuous input spaces and more complicated functions,” he says. “Multi-armed bandits would normally assume finite spaces and linear or otherwise strongly restricted payoff functions. Maybe because Bayesian optimization is more flexible in this sense, it comes with a lot less solid theory. Multi-armed bandits is a more theoretically grounded area.”

Seeger and his colleagues’ 2010 paper generalized theoretical findings from the multi-armed bandit setting to Bayesian optimization (BO), providing strong performance bounds given particular choices of statistical models. This gave machine learning practitioners greater confidence in techniques they’d arrived at empirically and helped them identify circumstances in which those techniques might be less successful.

In the context of deep learning — which now dominates the field of artificial intelligence — BO is used for hyperparameter tuning, or optimizing structural features of the deep-learning model and parameters of the learning algorithm to maximize the efficacy of training on particular data.

To prove their result for BO, Seeger and his colleagues extended techniques borrowed from a third related field, experimental design. The tools they devised to bridge the related disciplines of BO, multi-armed bandits, and experimental design have proved useful to researchers working in all three; the paper has more than 1,000 citations on Google Scholar, which have helped make Seeger the fourth most highly cited researcher in the field of Bayesian optimization.

Seeger’s coauthors on the 2010 paper are Niranjan Srinivas, now a computational biologist at 10x Genomics; Andreas Krause, now a professor of computer science at ETH Zurich; and Sham Kakade, now a professor in the departments of computer science and statistics at the University of Washington.

With Bayesian optimization, Seeger explains, “you are essentially optimizing a function over some search space without actually knowing what this function looks like. You have to learn about that function as you sample it. But your real goal is finding the function’s maximum, or to sample it nearby.”

“If you sample forever, at some point you will find its optima,” he adds. “But since sampling is expensive and takes time, you want to finish as rapidly as possible. So what you are really interested in is to spend as few samples as possible before you converge to something useful, very close to the optimum.”

An example of Seeger and his colleagues’ sample selection procedure, taken from their 2010 paper. The first image (a) represents temperatures in different parts of a building; the two images at right (b and c) represent successive iterations of the procedure, in which individual sensors are briefly activated to take temperature readings (red circles). The black line represents the true temperatures, and the grey areas represent the method’s latest inference of the range of possible temperatures in each region. Crosses indicate points at which readings have already been taken. The procedure selects new sample points with the goal of either maximizing information gain (b) or finding optima (c).

Seeger and his colleagues proved that, under conditions that frequently hold for machine learning problems, the sampling process is guaranteed to converge. But they also showed that the convergence rate depends on specific problem parameters.

In BO, the function that you’re trying to optimize is a random function, Seeger explains. “Every time you plug in a point x, you get a random value f(x),” he says. A standard way to do BO is to model the outputs of the function using a probability distribution. If the outputs at any finite set of inputs are modeled as jointly Gaussian (the multivariate counterpart of the standard bell curve), then Bayesian optimization is said to use a Gaussian process as a surrogate model.
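The sampling loop described above can be sketched in a few lines. This is a minimal illustration, not the authors’ code: it fits a Gaussian process to the samples collected so far and picks the next point by an upper-confidence-bound rule (posterior mean plus a multiple of the posterior standard deviation). The toy `objective`, the squared-exponential kernel, the length scale, the noise level, the one-dimensional grid, and the fixed `beta` are all assumptions made for the example; in the paper, the confidence multiplier grows slowly with the round number.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=0.2):
    """Squared-exponential covariance between two sets of 1-D points."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def gp_posterior(x_obs, y_obs, x_grid, noise_var=1e-2):
    """Posterior mean and standard deviation of a zero-mean GP on x_grid."""
    K = rbf_kernel(x_obs, x_obs) + noise_var * np.eye(len(x_obs))
    K_s = rbf_kernel(x_obs, x_grid)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_obs))
    v = np.linalg.solve(L, K_s)
    mean = K_s.T @ alpha
    var = 1.0 - np.sum(v * v, axis=0)  # prior variance k(x, x) is 1
    return mean, np.sqrt(np.clip(var, 0.0, None))

def objective(x):
    """Hypothetical payoff function; the optimizer only sees noisy samples."""
    return np.sin(3.0 * x) + 0.5 * np.cos(7.0 * x)

def run_gp_ucb(n_rounds=25, beta=4.0, noise_std=0.1, seed=0):
    """Run a fixed-beta upper-confidence-bound loop on a 1-D grid."""
    rng = np.random.default_rng(seed)
    x_grid = np.linspace(0.0, 2.0, 200)

    # Start from one randomly chosen grid point.
    x_obs = np.array([x_grid[rng.integers(len(x_grid))]])
    y_obs = objective(x_obs) + noise_std * rng.standard_normal(1)

    for _ in range(n_rounds):
        mean, std = gp_posterior(x_obs, y_obs, x_grid, noise_std ** 2)
        # Optimistic score: high where the model predicts a large value
        # or is still uncertain. Using `std` alone instead would pick the
        # most uncertain point (pure exploration).
        ucb = mean + np.sqrt(beta) * std
        x_next = x_grid[np.argmax(ucb)]
        y_next = objective(x_next) + noise_std * rng.standard_normal()
        x_obs = np.append(x_obs, x_next)
        y_obs = np.append(y_obs, y_next)

    # Report the grid point with the highest posterior mean.
    mean, _ = gp_posterior(x_obs, y_obs, x_grid, noise_std ** 2)
    return x_grid[np.argmax(mean)], x_obs

if __name__ == "__main__":
    best_x, sampled = run_gp_ucb()
    print("estimated maximizer:", best_x)
    print("number of samples:", len(sampled))
```

The toy objective has its maximum near x = 0.8, so the estimate should typically land close to that value after a few dozen samples. Early rounds spread samples across the interval because uncertainty dominates the score; later rounds concentrate near the best region.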

One of the parameters of the surrogate model is its covariance function, which describes how strongly the outputs at two inputs (the f(x)’s) are correlated, typically as a function of the distance between those inputs (the x’s). There are several families of covariance functions, with an infinite range of functions within each family.

Seeger and his colleagues’ paper quantitatively relates the convergence rate of the function-sampling procedure to the specific choice of covariance function.
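For readers who want the shape of that relationship, the paper’s main bound can be summarized as follows. The notation here comes from the 2010 paper rather than from this article: R_T is the cumulative regret after T samples, β_T is a slowly growing confidence parameter, γ_T is the maximum information gain achievable with T samples, d is the input dimension, and ν is the smoothness parameter of the Matérn family. Constants and logarithmic factors are omitted, so treat this as a summary rather than a precise restatement of the theorems.

```latex
% Cumulative regret after T rounds, with x^* a maximizer of f
R_T = \sum_{t=1}^{T} \bigl( f(x^*) - f(x_t) \bigr)

% Maximum information gain from T noisy observations y_A of f at points A
\gamma_T = \max_{A \,:\, |A| = T} I(y_A ; f_A)

% High-probability regret bound, up to logarithmic factors
R_T = O^{*}\!\bigl( \sqrt{T \, \beta_T \, \gamma_T} \bigr)

% Growth of \gamma_T for common covariance functions
\gamma_T = O(d \log T)                                        % linear
\gamma_T = O\bigl( (\log T)^{d+1} \bigr)                      % squared exponential
\gamma_T = O\bigl( T^{d(d+1)/(2\nu + d(d+1))} \log T \bigr)   % Matern, \nu > 1
```

Because γ_T grows sublinearly in T for these covariance functions, the average regret R_T / T shrinks toward zero, which is the “no regret” property in the paper’s title. Smoother covariance functions give a slower-growing γ_T and hence faster convergence.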

“Some choices of covariance function imply smooth functions, which can faithfully be interpolated from measurements nearby,” Seeger says. “Others result in rough functions, for which interpolating even across short distances is an uncertain exercise.”
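A small calculation makes the contrast concrete. The sketch below compares two covariance functions chosen for illustration: the squared-exponential kernel, which implies very smooth functions, and the exponential kernel (the Matérn family with smoothness 1/2), which implies rough ones. It computes how uncertain a Gaussian process remains about the value midway between two exact, noise-free measurements. The unit length scale, the gap of 0.2, and the function names are assumptions for the example.

```python
import math

def squared_exponential(r, length_scale=1.0):
    """Covariance at distance r; implies very smooth functions."""
    return math.exp(-0.5 * (r / length_scale) ** 2)

def exponential(r, length_scale=1.0):
    """Matern covariance with smoothness 1/2; implies rough functions."""
    return math.exp(-abs(r) / length_scale)

def midpoint_variance(kernel, gap):
    """Posterior variance halfway between two noise-free measurements.

    With measurements at -gap/2 and +gap/2, a unit prior variance, and a
    test point at 0, the general GP formula 1 - k_*^T K^{-1} k_* reduces
    to 1 - 2 k(gap/2)^2 / (1 + k(gap)).
    """
    k_half = kernel(gap / 2.0)
    k_full = kernel(gap)
    return 1.0 - 2.0 * k_half ** 2 / (1.0 + k_full)

if __name__ == "__main__":
    gap = 0.2  # distance between the two measurements, in length scales
    smooth_var = midpoint_variance(squared_exponential, gap)
    rough_var = midpoint_variance(exponential, gap)
    print("squared exponential:", smooth_var)
    print("exponential:        ", rough_var)
```

With these settings, the remaining variance is on the order of 5e-5 for the smooth kernel and about 0.1 for the rough one, a difference of more than three orders of magnitude for the same pair of measurements.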