The main topic of this book is conformal prediction, a method of prediction recently developed in machine learning. Conformal predictors are among the most accurate methods of machine learning, and unlike other state-of-the-art methods, they provide information about their own accuracy and reliability.
The book integrates mathematical theory and revealing experimental work. It demonstrates mathematically the validity of the reliability claimed by conformal predictors when they are applied to independent and identically distributed data, and it confirms experimentally that the accuracy is sufficient for many practical problems. Later chapters generalize these results to models called repetitive structures, which originate in the algorithmic theory of randomness and statistical physics. The approach is flexible enough to incorporate most existing methods of machine learning, including newer methods such as boosting and support vector machines and older methods such as nearest neighbors and the bootstrap.
Topics and Features:
Researchers in computer science, statistics, and artificial intelligence will find the book an authoritative and rigorous treatment of some of the most promising new developments in machine learning. Practitioners and students in all areas of research that use quantitative prediction or machine learning will learn about important new methods.
The book may be purchased directly from Springer and from many on-line booksellers, including amazon.com.
Vladimir Vovk and Alex Gammerman are Professors of Computer Science at Royal Holloway, University of London. Glenn Shafer is Professor in the Rutgers School of Business - Newark and New Brunswick. All three authors are affiliated with the Computer Learning Research Centre at Royal Holloway, University of London.
The following articles further develop and review the theory presented in the book:
This article is written for statisticians. We consider the on-line predictive version of the standard statistical problem of linear regression; the goal is to predict each consecutive response given the corresponding explanatory variables and all the previous observations. We are mainly interested in prediction intervals rather than point predictions. The standard treatment of prediction intervals in linear regression analysis has two drawbacks: (1) the usual prediction intervals guarantee that the probability of error is equal to the nominal significance level epsilon, but this property per se does not imply that the long-run frequency of error is close to epsilon; (2) it is not suitable for prediction of complex systems as it assumes that the number of observations exceeds the number of parameters. We state the book's result showing that in the on-line protocol the frequency of error does equal the nominal significance level, up to statistical fluctuations, and we describe alternative regression models in which informative prediction intervals can be found before the number of observations exceeds the number of parameters. One of these models, which only assumes that the observations are independent and identically distributed, is greatly underused in the statistical theory of regression. Published in the Annals of Statistics 37:1566–1590 (2009).
This is a written version of the second Computer Journal lecture, presented at the British Computer Society London Office on 12 June 2006. The lecture was followed by a discussion by some of the leading experts in machine learning, which is also included in the article. Its definitive version is published in the Computer Journal 50:151–177 (2007). This is the description of conformal prediction as given in the article's abstract:
Recent advances in machine learning make it possible to design efficient prediction algorithms for data sets with huge numbers of parameters. This article describes a new technique for "hedging" the predictions output by many such algorithms, including support vector machines, kernel ridge regression, kernel nearest neighbours, and by many other state-of-the-art methods. The hedged predictions for the labels of new objects include quantitative measures of their own accuracy and reliability. These measures are provably valid under the assumption of randomness, traditional in machine learning: the objects and their labels are assumed to be generated independently from the same probability distribution. In particular, it becomes possible to control (up to statistical fluctuations) the number of erroneous predictions by selecting a suitable confidence level. Validity being achieved automatically, the remaining goal of hedged prediction is efficiency: taking full account of the new objects' features and other available information to produce as accurate predictions as possible. This can be done successfully using the powerful machinery of modern machine learning.
This tutorial presents a self-contained account of the theory of conformal prediction and works through several numerical examples. Published in the Journal of Machine Learning Research 9:371–421 (2008).
A standard assumption in machine learning is the exchangeability of data, which is equivalent to assuming that the examples are generated from the same probability distribution independently. This paper is devoted to testing the assumption of exchangeability on-line: the examples arrive one by one, and after receiving each example we would like to have a valid measure of the degree to which the assumption of exchangeability has been falsified. Such measures are provided by exchangeability martingales. We extend known techniques for constructing exchangeability martingales and show that our new method is competitive with the martingales introduced before. Finally we investigate the performance of our testing method on two benchmark datasets, USPS and Statlog Satellite data; for the former, the known techniques give satisfactory results, but for the latter our new more flexible method becomes necessary. This article appears in the Proceedings of ICML 2012.
Conformal predictors are automatically valid in the sense of having coverage probability equal to or exceeding a given confidence level. Inductive conformal predictors are a computationally efficient version of conformal predictors satisfying the same property of validity. However, inductive conformal predictors have been only known to control unconditional coverage probability. This paper explores various versions of conditional validity and various ways to achieve them using inductive conformal predictors and their modifications. These are the data set (the Spambase data set contributed by George Forman to the UCI Machine Learning Repository with column names) and R programs used in the experimental section of the conference version of the paper. The conference version of this article is published in the Proceedings of ACML 2012 (JMLR: Workshop and Conference Proceedings 25:475–490, 2012). The journal version is published in Machine Learning (ACML 2012 Special Issue) 92:349–376 (2013).
This note introduces the method of cross-conformal prediction, which is a hybrid of the methods of inductive conformal prediction and cross-validation, and studies its validity and predictive efficiency empirically. The journal version is published in the Special Issue of Annals of Mathematics and Artificial Intelligence on Conformal Prediction.
This paper continues study, both theoretical and empirical, of the method of Venn prediction, concentrating on binary prediction problems. Venn predictors produce probability-type predictions for the labels of test objects which are guaranteed to be well calibrated under the standard assumption that the observations are generated independently from the same distribution. We give a simple formalization and proof of this property. We also introduce Venn-Abers predictors, a new class of Venn predictors based on the idea of isotonic regression, and report promising empirical results both for Venn-Abers predictors and for their more computationally efficient simplified version. For the code that we used for running the experiments, click here. The conference version is published in the UAI 2014 Proceedings.
This paper, published in the COPA 2013 Proceedings, discusses a transductive version of conformal predictors. This version is computationally inefficient for big test sets, but it turns out that apparently crude "Bonferroni predictors" are about as good in their information efficiency and vastly superior in computational efficiency.
This paper, which is also published in the COPA 2013 Proceedings, discusses conformal prediction for hypergraphical models (in the book, hypergraphical models are used only for Venn predictors).
Conformal prediction can be applied on top of a wide range of prediction algorithms, which often make strong assumptions about the data-generating mechanism. For the method to be really useful it is desirable that in the case where the assumptions of the underlying algorithm are satisfied, the conformal predictor loses little in efficiency as compared with the underlying algorithm (whereas being a conformal predictor, it produces valid results even when those assumptions are not satisfied, provided the data are IID). This paper (published in the COLT 2014 Proceedings) explores the degree to which this additional requirement of efficiency is satisfied in the case of Bayesian ridge regression.
We study optimal conformity measures for various criteria of efficiency in an idealized setting. This leads to an important class of criteria of efficiency that we call probabilistic; it turns out that the most standard criteria of efficiency used in literature are not probabilistic. The conference version is published in the proceedings of COPA 2016. The journal version is published in the Annals of Mathematics and Artificial Intelligence and is part of the COPA 2016 Special Issue.
This paper proposes a new method of probabilistic prediction, which is based on conformal prediction. The method is applied to the standard USPS data set and gives encouraging results. Published in the Proceedings of COPA 2014.
This paper introduces computationally efficient versions of Venn-Abers predictors. When the imprecise probabilities that they output are merged into precise probabilities, the resulting predictors, while losing the theoretical property of perfect calibration, are consistently more accurate than the existing methods in empirical studies. Published in the Proceedings of NIPS 2015.
We construct universal prediction systems in the spirit of Popper's falsifiability and Kolmogorov complexity and randomness. These prediction systems do not depend on any statistical assumptions (but under the IID assumption they dominate, to within the usual accuracy, conformal prediction). Our constructions give rise to a theory of algorithmic complexity and randomness of time containing analogues of several notions and results of the classical theory of Kolmogorov complexity and randomness. The conference version is published in the proceedings of COPA 2016. The journal version is published in the COPA 2016 Special Issue of the Annals of Mathematics and Artificial Intelligence.
The relation between the randomness (IID) and exchangeability models plays an important role in conformal prediction. In the case of binary sequences, this relation was clarified in a 1986 note, but its results were published without proofs. This working paper comprises a new English translation of the 1986 note, a brief description of its historical background, and the proofs.
This paper starts careful analysis of the notion of p-values, including randomized p-values, which are particularly important in conformal prediction. See also the on-line prediction wiki. Published in the Proceedings of COPA 2019.
This paper extends conformal prediction to predictive distributions in the case of regression. Its focus is on the Least Squares Prediction Machine. The conference version is published in the Proceedings of COPA 2017 and the journal version in the Special Issue of Machine Learning on Conformal Prediction.
This note describes a simple universally consistent algorithm for probability forecasting that satisfies a natural property of small-sample validity. Published in the Proceedings of COPA 2019.
This paper explains how conformal predictive distributions can be used for the purpose of decision making. It is published in the Proceedings of COPA 2018.
We kernelize the main algorithms of Working Paper 17, arriving at KRRPM (Kernel Ridge Regression Prediction Machines). Published in: Braverman Readings in Machine Learning. Key Ideas from Inception to Current State. For the Python code used for the figures, click here.
This paper explores the degree to which the validity of cross-conformal prediction (as described in Working Paper 6) can be broken (as in Working Paper 6, Appendix A, and the recent work by Henrik Linusson et al.). But perhaps its main value is in applying recent developments in mathematical finance, specifically robust risk aggregation, to the general problem of merging p-values. To be published in Biometrika (and published online).
We discuss two computationally efficient versions of conformal predictive systems, which we call split conformal predictive systems and cross-conformal predictive systems, and discuss their advantages and limitations. The paper is published in the Proceedings of COPA 2018.
Most existing examples of conformal predictive systems impose severe restrictions on the adaptation of predictive distributions to the object in hand. In this paper we develop split conformal and cross-conformal predictive systems that are fully adaptive.
This paper poses the problem of investigating the power and limitations of conformal martingales as a means of detecting deviations from randomness. It gives several preliminary propositions in this direction and discusses connections between randomness, exchangeability, and conformal martingales (cf. Working Paper 15). To be published in Statistical Science.
This paper proposes an alternative language for expressing results of the algorithmic theory of randomness. The language is more precise in that it does not involve unspecified additive or multiplicative constants, making mathematical results, in principle, applicable in practice. Our main testing ground for the proposed language is the problem of defining Bernoulli sequences (see Working Papers 15 and 24) and, more generally, connections between statistical randomness and exchangeability. This paper is also listed as Working Paper 1 in the project "Hypothesis testing with e-values" and published in Gurevich Festschrift 2020.
This note discusses a simple modification of cross-conformal prediction in which p-values are replaced by e-values. It draws the reader's attention to the obvious fact that this version of cross-conformal prediction enjoys a guaranteed property of validity.
The main topic of this paper is multiple hypothesis testing, but it is somewhat related to conformal prediction. First, it was motivated by cross-conformal prediction (see Working Paper 21), and second, in its empirical studies section it uses conformalized versions of permutation e-values, which are exactly (and not just approximately) valid. This paper is also listed as Working Paper 3 in the project "Hypothesis testing with e-values".
Efficiency criteria for conformal prediction (see Working Paper 11), such as observed fuzziness (i.e., the sum of p-values associated with false labels), are commonly used to evaluate the performance of given conformal predictors. Here, we investigate whether it is possible to exploit efficiency criteria to learn classifiers, both conformal predictors and point classifiers, by using such criteria as training objective functions.
We adapt conformal e-prediction (see Working Paper 26) to change detection, defining analogues of the Shiryaev-Roberts and CUSUM procedures for detecting violations of the IID assumption. Asymptotically, the frequency of false alarms for these analogues does not exceed the usual bounds.
This working paper presents conformal predictive distributions as an approach to nonparametric fiducial prediction. It is a written version of a talk at the Sixth BFF meeting (Duke University, 2019), and its intended audience is the BFF (Bayes–frequentist–fiducial) community.
This note continues study of exchangeability martingales. Violations of the IID assumption are sometimes referred to as dataset shift, and dataset shift is sometimes subdivided into concept shift, covariate shift, etc. Our primary interest is in concept shift, but we will also discuss exchangeability martingales that decompose perfectly into two components one of which detects concept shift and the other detects what we call label shift.
For other research on conformal prediction, see the on-line prediction wiki. See also the events page.