# Conformal prediction (old series of working papers)

## Current state of the project

The research described on this web page has resulted in a book, Algorithmic learning in a random world by Vovk, Gammerman and Shafer, published by Springer, New York, in 2005. The book's web site has now become the main site for this project; it also contains information about newer developments (since 2005).

## Overview of the Working Papers

The basic applied problem that we consider is on-line prediction: at each trial n=1,2,..., given the first n-1 observations we would like to say something about the nth observation. (Often the observations will consist of two parts, the object and the label, and we might be asked to give a prediction for the nth label given the nth object, but this case can be treated in a similar way.) The observations are usually assumed to be generated independently from the same probability distribution (the iid model); only in Working Paper 8 is this assumption relaxed. One setting that we consider is that of region prediction: the prediction algorithm is required to output a set of possible values for the nth observation. We show that there exists a simple class of prediction algorithms (called Transductive Confidence Machines, abbreviated TCM, on this web site, and called conformal predictors on the new web site) which are automatically calibrated: when given a "significance level" δ, they make errors with frequency at most δ (precisely δ if randomisation is allowed). Moreover, the algorithm is calibrated in a very strong non-asymptotic sense: in the randomised case, the probability of error is δ at each trial and errors are made independently at different trials.

Working paper 1 proves the calibration result for the iid model and in the "structured" situation, where each observation consists of an object and a label. In principle, it is easy to be calibrated: since an error is made, by definition, when the prediction region does not contain the true label, errors can be avoided if the regions are made large enough. Another goal in region prediction is to make as few uncertain predictions as possible, where a prediction region is called "uncertain" if it contains more than one possible label (assuming the set of possible labels is finite: the case of classification). The justification for the use of TCM given in Working Paper 1 was that it produces reasonable results on standard benchmark data sets, whereas the standard methods (combining algorithms for point prediction with PAC bounds on the probability of error) are impractical even on such a clean data set as the USPS data set of hand-written digits.

Working Paper 3 provides theoretical evidence for the efficiency of TCM (again in the case of the iid model): there is a simple TCM based on the Nearest Neighbours algorithm whose long-run frequency of uncertain predictions does not exceed the frequency of uncertain predictions of any other calibrated algorithm (even an algorithm that is optimised for the data-generating distribution).

The basic on-line scenario, when the true label is disclosed immediately after the prediction is made, is not realistic: typically there is a delay before the true label becomes known, and some (or most) labels are never disclosed. Working Papers 6 and 7 show that calibration continues to hold even if only a small fraction of labels is eventually disclosed (perhaps with a delay).

Working Paper 8 extends conformal prediction to a wide class of statistical models called on-line compression models (and closely connected with Martin-Löf's repetitive structures and Friedman's summarizing statistics). The standard approach to modelling uncertainty is to choose a family of probability distributions (statistical model) one of which is believed to be the true distribution generating, or explaining in a satisfactory way, the data. (In some applications of probability theory, the true distribution is assumed to be known, and so the statistical model is a one-element set. In Bayesian statistics, the statistical model is complemented by another element, a prior distribution on the distributions in the model.) All modern applications of probability depend on this scheme.

In 1963–1970 Andrei Kolmogorov suggested a different approach to modelling uncertainty based on information theory; its purpose was to provide a more direct link between the theory and applications of probability. On-line compression models are a natural adaptation of Kolmogorov's programme to the technique of conformal prediction. Working Paper 8 introduces three on-line compression models: exchangeability (equivalent to the iid model), Gaussian and Markov.

The content of all other Working Papers is briefly described in the next section.