The main topic of this book is conformal prediction, a method of prediction recently developed in machine learning. Conformal predictors are among the most accurate methods of machine learning, and unlike other state-of-the-art methods, they provide information about their own accuracy and reliability.
The book integrates mathematical theory and revealing experimental work. It demonstrates mathematically the validity of the reliability claimed by conformal predictors when they are applied to independent and identically distributed data, and it confirms experimentally that the accuracy is sufficient for many practical problems. Later chapters generalize these results to models called repetitive structures, which originate in the algorithmic theory of randomness and statistical physics. The approach is flexible enough to incorporate most existing methods of machine learning, including newer methods such as boosting and support vector machines and older methods such as nearest neighbors and the bootstrap.
Topics and Features:
Researchers in computer science, statistics, and artificial intelligence will find the book an authoritative and rigorous treatment of some of the most promising new developments in machine learning. Practitioners and students in all areas of research that use quantitative prediction or machine learning will learn about important new methods.
The book may be purchased directly from Springer and from many on-line booksellers, including amazon.com.
Vladimir Vovk and Alex Gammerman are Professors of Computer Science at Royal Holloway, University of London. Glenn Shafer is Professor in the Rutgers School of Business - Newark and New Brunswick. All three authors are affiliated with the Centre for Reliable Machine Learning at Royal Holloway, University of London.
This section links to the authors' papers that develop and review the theory presented in the book. They are often also published on arXiv, in conference proceedings, or in journals, but the version given in this section is usually most up-to-date. The name of this project derives from the title of Part IV of the second edition and refers to a new kind of modelling uncertainty going back to Kolmogorov's complexity modelling. The most standard online compression model is the exchangeability model, and many of the papers in this project assume exchangeability.
The following working papers develop the ideas presented in the first edition:
This article is written for statisticians. We consider the on-line predictive version of the standard statistical problem of linear regression; the goal is to predict each consecutive response given the corresponding explanatory variables and all the previous observations. We are mainly interested in prediction intervals rather than point predictions. The standard treatment of prediction intervals in linear regression analysis has two drawbacks: (1) the usual prediction intervals guarantee that the probability of error is equal to the nominal significance level epsilon, but this property per se does not imply that the long-run frequency of error is close to epsilon; (2) it is not suitable for prediction of complex systems as it assumes that the number of observations exceeds the number of parameters. We state the book's result showing that in the on-line protocol the frequency of error does equal the nominal significance level, up to statistical fluctuations, and we describe alternative regression models in which informative prediction intervals can be found before the number of observations exceeds the number of parameters. One of these models, which only assumes that the observations are independent and identically distributed, is greatly underused in the statistical theory of regression. Published in the Annals of Statistics 37:1566–1590 (2009).
This is a written version of the second Computer Journal lecture, presented at the British Computer Society London Office on 12 June 2006. The lecture was followed by a discussion by some of the leading experts in machine learning, which is also included in the article. Its definitive version is published in the Computer Journal 50:151–177 (2007). This is the description of conformal prediction as given in the article's abstract:
Recent advances in machine learning make it possible to design efficient prediction algorithms for data sets with huge numbers of parameters. This article describes a new technique for "hedging" the predictions output by many such algorithms, including support vector machines, kernel ridge regression, kernel nearest neighbours, and by many other state-of-the-art methods. The hedged predictions for the labels of new objects include quantitative measures of their own accuracy and reliability. These measures are provably valid under the assumption of randomness, traditional in machine learning: the objects and their labels are assumed to be generated independently from the same probability distribution. In particular, it becomes possible to control (up to statistical fluctuations) the number of erroneous predictions by selecting a suitable confidence level. Validity being achieved automatically, the remaining goal of hedged prediction is efficiency: taking full account of the new objects' features and other available information to produce as accurate predictions as possible. This can be done successfully using the powerful machinery of modern machine learning.
This tutorial presents a self-contained account of the theory of conformal prediction and works through several numerical examples. Published in the Journal of Machine Learning Research 9:371–421 (2008).
A standard assumption in machine learning is the exchangeability of data, which is equivalent to assuming that the examples are generated from the same probability distribution independently. This paper is devoted to testing the assumption of exchangeability on-line: the examples arrive one by one, and after receiving each example we would like to have a valid measure of the degree to which the assumption of exchangeability has been falsified. Such measures are provided by exchangeability martingales. We extend known techniques for constructing exchangeability martingales and show that our new method is competitive with the martingales introduced before. Finally we investigate the performance of our testing method on two benchmark datasets, USPS and Statlog Satellite data; for the former, the known techniques give satisfactory results, but for the latter our new more flexible method becomes necessary. This article appears in the Proceedings of ICML 2012.
Conformal predictors are automatically valid in the sense of having coverage probability equal to or exceeding a given confidence level. Inductive conformal predictors are a computationally efficient version of conformal predictors satisfying the same property of validity. However, inductive conformal predictors have been only known to control unconditional coverage probability. This paper explores various versions of conditional validity and various ways to achieve them using inductive conformal predictors and their modifications. These are the data set (the Spambase data set contributed by George Forman to the UCI Machine Learning Repository with column names) and R programs used in the experimental section of the conference version of the paper. The conference version of this article is published in the Proceedings of ACML 2012 (JMLR: Workshop and Conference Proceedings 25:475–490, 2012). The journal version is published in Machine Learning (ACML 2012 Special Issue) 92:349–376 (2013).
This note introduces the method of cross-conformal prediction, which is a hybrid of the methods of inductive conformal prediction and cross-validation, and studies its validity and predictive efficiency empirically. The journal version is published in the Special Issue of Annals of Mathematics and Artificial Intelligence on Conformal Prediction.
This paper continues study, both theoretical and empirical, of the method of Venn prediction, concentrating on binary prediction problems. Venn predictors produce probability-type predictions for the labels of test objects which are guaranteed to be well calibrated under the standard assumption that the observations are generated independently from the same distribution. We give a simple formalization and proof of this property. We also introduce Venn-Abers predictors, a new class of Venn predictors based on the idea of isotonic regression, and report promising empirical results both for Venn-Abers predictors and for their more computationally efficient simplified version. For the code that we used for running the experiments, click here. The conference version is published in the UAI 2014 Proceedings.
This paper, published in the COPA 2013 Proceedings, discusses a transductive version of conformal predictors. This version is computationally inefficient for big test sets, but it turns out that apparently crude "Bonferroni predictors" are about as good in their information efficiency and vastly superior in computational efficiency. Journal version: International Journal on Artificial Intelligence Tools (Special Issue devoted to COPA 2013) 24(6):1560001 (2015).
This paper, which is also published in the COPA 2013 Proceedings, discusses conformal prediction for hypergraphical models. Journal version: International Journal on Artificial Intelligence Tools (Special Issue devoted to COPA 2013) 24(6):1560003 (2015).
Conformal prediction can be applied on top of a wide range of prediction algorithms, which often make strong assumptions about the data-generating mechanism. For the method to be really useful it is desirable that in the case where the assumptions of the underlying algorithm are satisfied, the conformal predictor loses little in efficiency as compared with the underlying algorithm (whereas being a conformal predictor, it produces valid results even when those assumptions are not satisfied, provided the data are IID). This paper (published in the COLT 2014 Proceedings) explores the degree to which this additional requirement of efficiency is satisfied in the case of Bayesian ridge regression.
We study optimal conformity measures for various criteria of efficiency in an idealized setting. This leads to an important class of criteria of efficiency that we call probabilistic; it turns out that the most standard criteria of efficiency used in literature are not probabilistic. The conference version is published in the proceedings of COPA 2016. The journal version is published in the Annals of Mathematics and Artificial Intelligence and is part of the COPA 2016 Special Issue.
This paper proposes a new method of probabilistic prediction, which is based on conformal prediction. The method is applied to the standard USPS data set and gives encouraging results. Published in the Proceedings of COPA 2014.
This paper introduces computationally efficient versions of Venn-Abers predictors. When the imprecise probabilities that they output are merged into precise probabilities, the resulting predictors, while losing the theoretical property of perfect calibration, are consistently more accurate than the existing methods in empirical studies. Published in the Proceedings of NIPS 2015.
We construct universal prediction systems in the spirit of Popper's falsifiability and Kolmogorov complexity and randomness. These prediction systems do not depend on any statistical assumptions (but under the IID assumption they dominate, to within the usual accuracy, conformal prediction). Our constructions give rise to a theory of algorithmic complexity and randomness of time containing analogues of several notions and results of the classical theory of Kolmogorov complexity and randomness. The conference version is published in the proceedings of COPA 2016. The journal version is published in the COPA 2016 Special Issue of the Annals of Mathematics and Artificial Intelligence.
The relation between the randomness (IID) and exchangeability models plays an important role in conformal prediction. In the case of binary sequences, this relation was clarified in a 1986 note, but its results were published without proofs. This working paper comprises a new English translation of the 1986 note, a brief description of its historical background, and the proofs.
This paper starts careful analysis of the notion of p-values, including randomized p-values, which are particularly important in conformal prediction. See also the on-line prediction wiki. Published in the Proceedings of COPA 2019.
This paper extends conformal prediction to predictive distributions in the case of regression. Its focus is on the Least Squares Prediction Machine. The conference version is published in the Proceedings of COPA 2017 and the journal version in the Special Issue of Machine Learning on Conformal Prediction.
This paper describes a simple universally consistent algorithm for probability forecasting that satisfies a natural property of small-sample validity. Published in the Proceedings of COPA 2019. Journal version: Universal predictive systems, Pattern Recognition 126:108536 (2022).
This paper explains how conformal predictive distributions can be used for the purpose of decision making. It is published in the Proceedings of COPA 2018.
We kernelize the main algorithms of Working Paper 17, arriving at KRRPM (Kernel Ridge Regression Prediction Machines). Published in: Braverman Readings in Machine Learning. Key Ideas from Inception to Current State. For the Python code used for the figures, click here.
This paper explores the degree to which the validity of cross-conformal prediction (as described in Working Paper 6) can be broken (as in Working Paper 6, Appendix A, and the experiments by Henrik Linusson et al.). But perhaps its main value is in applying recent developments in mathematical finance, specifically robust risk aggregation, to the general problem of merging p-values. Published in Biometrika 107:791–808 (2020).
We discuss two computationally efficient versions of conformal predictive systems, which we call split conformal predictive systems and cross-conformal predictive systems, and discuss their advantages and limitations. The paper is published in the Proceedings of COPA 2018 (under the title "Cross-conformal predictive distributions"). Journal version: Neurocomputing 397 292–308 (2020).
Most existing examples of conformal predictive systems impose severe restrictions on the adaptation of predictive distributions to the object in hand. In this paper we develop split conformal and cross-conformal predictive systems that are fully adaptive. Published in the Proceedings of COPA 2020.
This paper poses the problem of investigating the power and limitations of conformal martingales as a means of detecting deviations from randomness. It gives several preliminary propositions in this direction and discusses connections between randomness, exchangeability, and conformal martingales (cf. Working Paper 15). Published in Statistical Science 36:595–611 (2021).
This paper proposes an alternative language for expressing results of the algorithmic theory of randomness. The language is more precise in that it does not involve unspecified additive or multiplicative constants, making mathematical results, in principle, applicable in practice. Our main testing ground for the proposed language is the problem of defining Bernoulli sequences (see Working Papers 15 and 24) and, more generally, connections between statistical randomness and exchangeability. This paper is also listed as Working Paper 1 in the project "Hypothesis testing with e-values" and published in Gurevich Festschrift 2020.
This paper reviews conformal e-prediction, which is the modification of conformal prediction in which p-values are replaced by e-values. In particular, it draws the reader's attention to two advantages of conformal e-prediction: conditional validity becomes easier to achieve and cross-conformal prediction becomes provably valid.
The main topic of this paper is multiple hypothesis testing, but it is somewhat related to conformal prediction. First, it was motivated by cross-conformal prediction (see Working Paper 21), and second, in its empirical studies section it uses conformalized versions of permutation e-values, which are exactly (and not just approximately) valid. This paper is also listed as Working Paper 3 in the project "Hypothesis testing with e-values". Published in Statistical Science 38:329–354 (2023).
Efficiency criteria for conformal prediction (see Working Paper 11), such as observed fuzziness (i.e., the sum of p-values associated with false labels), are commonly used to evaluate the performance of given conformal predictors. Here, we investigate whether it is possible to exploit efficiency criteria to learn classifiers, both conformal predictors and point classifiers, by using such criteria as training objective functions. Published in the Proceedings of COPA 2020.
We adapt conformal e-prediction (see Working Paper 26) to testing exchangeability and to quickest detection of lack of exchangeability. In particular, we define and study analogues of the Shiryaev-Roberts and CUSUM procedures for detecting violations of exchangeability.
This working paper presents conformal predictive distributions as an approach to nonparametric fiducial prediction. It is a written version of a talk at the Sixth BFF meeting (Duke University, 2019), its intended audience is the BFF (Bayes–frequentist–fiducial) community, and it has been published in the Handbook of Bayesian, Fiducial, and Frequentist Inference.
This note continues study of exchangeability martingales. Violations of the IID assumption are sometimes referred to as dataset shift, and dataset shift is sometimes subdivided into concept shift, covariate shift, etc. Our primary interest is in concept shift, but we will also discuss exchangeability martingales that decompose perfectly into two components one of which detects concept shift and the other detects what we call label shift.
We argue for supplementing the process of training a prediction algorithm by setting up a scheme for detecting the moment when the distribution of the data changes and the algorithm needs to be retrained. Our proposed schemes are based on conformal test martingales. Published in the Proceedings of COPA 2021.
The topic of this note is computational evaluation of the performance of conformal testing in a model situation in which IID binary observations generated from a Bernoulli distribution are followed by IID binary observations generated from another Bernoulli distribution, with the parameters of the distributions and changepoint unknown. Existing conformal test martingales can be used for this task and work well in simple cases, but their efficiency can be improved greatly. Published in the Proceedings of COPA 2021.
This paper proposes a procedure for protecting probabilistic regression algorithms against changes in the distribution of the incoming data. It is inspired by the success of the recent conformal test martingales.
This paper proposes a way of protecting probability forecasting rules against changes in the data distribution, concentrating on the case classification and paying particular attention to binary classification. This is important in applications of machine learning, where the quality of a trained predictor may drop significantly in the process of its exploitation. Our techniques are based on recent work on conformal test martingales and older work on prediction with expert advice, namely tracking the best expert. For the Python code for this paper, click here.
We continue study of conformal testing in binary model situations. In this note we consider Markov alternatives to the null hypothesis of exchangeability. We propose two new classes of conformal test martingales; one class is statistically efficient in our experiments, and the other class partially sacrifices statistical efficiency to gain computational efficiency. Published in the Proceedings of COPA 2022.
The second edition of the book (December 2022) incorporated many of the results in these working papers (major exceptions are Working Papers 26 and 29 about conformal e-prediction and conformal e-testing, respectively). The following working papers have appeared after the publication of the second edition:
This paper places conformal testing in a general framework of statistical hypothesis testing. A standard approach to testing a composite null hypothesis H is to test each of its elements and to reject H when each of its elements is rejected. My discussion is centred on the observation that we can fully cover conformal testing using this approach only if we allow forgetting some of the data. Published in the Proceedings of COPA 2023.
This paper deals with testing exchangeability in the batch mode (as, e.g., Working Paper 15). The null hypothesis of exchangeability is formalized as a Kolmogorov-type compression model, and the Bayes mixture of the Markov model with respect to the uniform prior is taken as simple alternative hypothesis. Using e-values instead of p-values leads to a computationally efficient testing procedure. For the Python code for this paper, click here.
This note states a simple property of optimality of the Bayes-Kelly algorithm for conformal testing and poses a related open problem. It is a comment on the RSS discussion paper by Peter Grunwald, Rianne de Heide, and Wouter Koolen presented on 24 January 2024.
A very simple example demonstrates that Fisher's application of the conditionality principle to regression ("fixed x regression"), endorsed by Sprott and many other followers, makes prediction impossible in the context of statistical learning theory. On the other hand, relaxing the requirement of conditionality makes it possible via, e.g., conformal prediction.
For the old working papers, mainly superseded by the 1st edition of the book, follow this link.
For other research on conformal prediction and testing, see the on-line prediction wiki. See also the events page maintained by Khuong Nguyen; for a list of events related to conformal prediction that happened before December 2022 (the publication of the second edition), click here.