Maximin Action Identification: A New Bandit Framework for Games

We study an original problem of pure exploration in a strategic bandit model motivated by Monte Carlo Tree Search. It consists of identifying the best action in a game, when the player may sample random outcomes of sequentially chosen pairs of actions. We propose two strategies for the fixed-confidence setting: Maximin-LUCB, based on lower- and upper-confidence bounds; and Maximin-Racing, which operates by successively eliminating sub-optimal actions. We discuss the sample complexity of both methods and compare their performance empirically.
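The maximin criterion underlying this setting can be illustrated with a short sketch. The game matrix, the uniform allocation, and the sample size below are ours for illustration; this is not the Maximin-LUCB sampling rule, only the quantity it identifies:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2x2 game: mu[a, b] is the mean of the random outcome when
# the player chooses action a and the opponent answers with action b.
# The maximin action maximizes the worst-case mean over the opponent's replies.
mu = np.array([[0.7, 0.5],
               [0.9, 0.2]])

n = 2000                             # samples per pair, uniform allocation
means = rng.binomial(n, mu) / n      # empirical means of Bernoulli outcomes
maximin_action = int(np.argmax(means.min(axis=1)))
print(maximin_action)  # action 0: min(0.7, 0.5) = 0.5 beats min(0.9, 0.2) = 0.2
```

The confidence-bound and racing strategies of the paper aim to certify this argmax with far fewer samples than uniform allocation.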

Optimal Best Arm Identification with Fixed Confidence

We provide a complete characterization of the complexity of best-arm identification in one-parameter bandit problems. We prove a new, tight lower bound on the sample complexity. We propose the 'Track-and-Stop' strategy, which is proved to be asymptotically optimal. It combines a new sampling rule (which tracks the optimal proportions of arm draws highlighted by the lower bound) with a stopping rule named after Chernoff, for which we give a new analysis.
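The tracking idea can be sketched in a few lines. This is only an illustrative tracking step, not the full Track-and-Stop strategy; the counts and target proportions below are hypothetical (in the paper, the targets are the optimal weights appearing in the lower bound):

```python
import numpy as np

# Illustrative tracking step: given target proportions w, draw the arm
# whose number of pulls lags furthest behind its target share of the
# t draws made so far.
def tracking_arm(counts, w):
    counts = np.asarray(counts, dtype=float)
    t = counts.sum()
    return int(np.argmax(t * np.asarray(w) - counts))

counts = [10, 40, 50]           # pulls of each arm so far
w = [0.5, 0.3, 0.2]             # hypothetical optimal proportions
print(tracking_arm(counts, w))  # arm 0: 100 * 0.5 - 10 = 40 is the largest deficit
```

Repeating this step keeps the empirical draw frequencies close to the target proportions, which is what the sampling rule needs for asymptotic optimality.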

Practical targeted learning from large data sets by survey sampling

We address the practical construction of asymptotic confidence intervals for smooth (i.e., pathwise differentiable), real-valued statistical parameters by targeted learning from independent and identically distributed data, in contexts where the sample size is so large that it poses computational challenges. We observe some summary measure of all data and select a sub-sample from the complete data set by Poisson rejective sampling with unequal inclusion probabilities based on the summary measures. Targeted learning is then carried out on the easier-to-handle sub-sample.
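The sub-sampling step can be sketched as follows. The population size, the summary measure, and the choice of inclusion probabilities proportional to it are ours for illustration; the paper's probabilities are built from the actual summary measures of the data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative Poisson rejective sampling: each unit i is included
# independently with probability pi[i]; the draw is rejected and repeated
# until the realized sample size equals the target n.
def rejective_sample(pi, n):
    while True:
        keep = rng.random(len(pi)) < pi
        if keep.sum() == n:
            return np.flatnonzero(keep)

N, n = 1000, 50
summary = rng.random(N)                          # summary measure of each unit
pi = np.clip(n * summary / summary.sum(), 0, 1)  # unequal inclusion probabilities
idx = rejective_sample(pi, n)
print(len(idx))  # 50
```

The probabilities sum to n, so the Poisson draw has the target size in expectation and the rejection loop terminates quickly.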

Asymptotically Optimal Algorithms for Multiple Play Bandits with Partial Feedback

We study a variant of the multi-armed bandit problem with multiple plays in which the user wishes to sample the m out of k arms with the highest expected rewards, but at any given time can only sample ℓ ≤ m arms. When ℓ = m, Thompson sampling was recently shown to be asymptotically efficient. We derive an asymptotic regret lower bound for any uniformly efficient algorithm in our new setting where ℓ may be less than m. We then establish the asymptotic optimality of Thompson sampling for Bernoulli rewards, where our proof technique differs from earlier methods even when ℓ = m.
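The selection step of Thompson sampling with multiple plays can be sketched as below. The arm means, horizon, and the number of plays are ours for illustration, and this sketch only covers the basic top-ℓ selection, not the paper's full partial-feedback algorithm:

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative Thompson sampling for Bernoulli rewards with multiple plays:
# at each round, draw one Beta posterior sample per arm and play the ell
# arms with the largest samples.
mu = np.array([0.1, 0.4, 0.6, 0.8, 0.9])   # hypothetical arm means
k, ell, horizon = len(mu), 2, 3000
successes = np.ones(k)                     # Beta(1, 1) priors
failures = np.ones(k)

for _ in range(horizon):
    theta = rng.beta(successes, failures)  # one posterior sample per arm
    played = np.argsort(theta)[-ell:]      # the ell largest posterior samples
    rewards = rng.binomial(1, mu[played])
    successes[played] += rewards
    failures[played] += 1 - rewards

pulls = successes + failures - 2           # pulls per arm
print(np.argsort(pulls)[-ell:])            # the two most-played arms
```

With clearly separated means, the posterior samples concentrate and the two best arms end up accounting for almost all plays, which is the behavior the regret bounds quantify.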

Inference of a non-parametric covariate-adjusted variable importance measure of a continuous exposure

We consider a setting where a real-valued cause variable X affects a real-valued effect variable Y in the presence of a context variable W. The objective is to assess to what extent (X, W) influences Y while making as few assumptions as possible on the unknown distribution of (W, X, Y). Based on a user-supplied marginal structural model, our new variable importance measure is non-parametric and context-adjusted. It generalizes the variable importance measure introduced by Chambaz et al. [4].

SPADRO supports the Third ISNPS Conference

The ANR project SPADRO supports the Third ISNPS Conference.

In particular, SPADRO organizes and sponsors the invited session entitled Semiparametric inference. It will consist of four talks, by Emilien Joly (Modal'X, Paris Ouest Nanterre), Alex Luedtke (Fred Hutch), Wenjing Zheng (Division of Biostatistics, UC Berkeley), and Antoine Chambaz (Modal'X, Université Paris Ouest).

SPADRO workshop: Tuesday, May 17, in Toulouse

[the slides of the talks are available, with their authors' permission, at the bottom of this post]

Data-adaptive Inference of the Optimal Treatment Rule and its Mean Reward. The Masked Bandit

This article studies the data-adaptive inference of an optimal treatment rule (TR). A TR is an individualized treatment strategy in which treatment assignment for a patient is based on her measured baseline covariates. Eventually, a reward is measured on the patient. We also infer the mean reward under the optimal TR. We do so in the so-called non-exceptional case, i.e., assuming that there is no stratum of the baseline covariates where treatment is neither beneficial nor harmful, and under a companion margin assumption.

The talks from the "Sequential learning and applications" workshop are online

The SPADRO project took part in organizing the workshop "Sequential learning and applications", held in Toulouse on November 9-10, 2015, as part of the CIMI labex thematic semester devoted to machine learning. The videos of the talks are now online and available here.

Pierre Ménard starts a PhD on bandits

Welcome to the project, Pierre Ménard! This fall he began a PhD in Toulouse, co-supervised by Gilles Stoltz and Aurélien Garivier; his work will focus on a fine-grained study of bandit problems with a large (or infinite) number of arms.
