No Free Lunch Theorem - Proof and Discussion

No Free Lunch Theorem

Formal Statement. Let $A$ be any learning algorithm for the task of binary classification with the $0$-$1$ loss (the loss is $0$ if the prediction is correct and $1$ otherwise) over a domain $\mathcal{X}$. Let $m \le |\mathcal{X}|/2$ represent a training set size. Then, there exists a distribution $\mathcal{D}$ over $\mathcal{X} \times \{0, 1\}$ such that:

  • There exists a function $f : \mathcal{X} \to \{0, 1\}$ with $L_{\mathcal{D}}(f) = 0$ (i.e., $f$ perfectly classifies the data). Here, $L_{\mathcal{D}}(h)$ denotes the expected loss of hypothesis $h$ under distribution $\mathcal{D}$.
  • With probability at least $1/7$ over the choice of training set $S \sim \mathcal{D}^m$, we have $L_{\mathcal{D}}(A(S)) \ge 1/8$.

Interpretation. For any learning algorithm you choose, there will always be some distribution such that the final error has a high probability of being large, even if there does exist a perfect classifier for that distribution.

Proof

Most existing proofs are tedious and involve a ton of notation. Here, we provide a rigorous but much shorter proof (I tried my best to keep it simple).

Functions and Distributions. Let $C \subseteq \mathcal{X}$ be a subset of size $2m$. Then, there exist $T = 2^{2m}$ mappings from $C$ to $\{0, 1\}$. We denote these mappings as $f_1, \dots, f_T$. For each $f_i$, we can easily construct a distribution $\mathcal{D}_i$ over $\mathcal{X} \times \{0, 1\}$ that is uniform over the pairs $(x, f_i(x))$ for $x \in C$ and zero elsewhere. This ensures that $L_{\mathcal{D}_i}(f_i) = 0$.
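This construction is small enough to check exhaustively. Below is a minimal sketch (with a hypothetical toy value $m = 2$; the variable names are my own, not from the proof) that enumerates all $2^{2m}$ labelings of $C$ and verifies that each $f_i$ achieves zero loss under its own distribution $\mathcal{D}_i$:

```python
from itertools import product

m = 2                                         # toy training set size (assumption)
C = list(range(2 * m))                        # subset of the domain, |C| = 2m
funcs = list(product([0, 1], repeat=len(C)))  # all mappings from C to {0, 1}
assert len(funcs) == 2 ** (2 * m)             # there are T = 2^(2m) of them

def loss_under(D, h):
    """Expected 0-1 loss of hypothesis h under a uniform distribution D,
    given as the list of (x, y) pairs in its support."""
    return sum(h(x) != y for x, y in D) / len(D)

for f in funcs:
    Di = [(x, f[x]) for x in C]               # support of D_i, uniform weights
    assert loss_under(Di, lambda x: f[x]) == 0.0   # L_{D_i}(f_i) = 0
```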

Training Sets. We consider all sequences of $m$ elements from $C$. There are $k = (2m)^m$ such sequences, which we denote as $S_1, \dots, S_k$. For each $f_i$, we can obtain the training set $S_j^i$ by sampling some $S_j$ uniformly from these sequences, and then labeling it according to $f_i$; this is exactly sampling $S \sim \mathcal{D}_i^m$.

Building the Error Matrix. For each $i$ and $j$, we train the algorithm on the training set $S_j^i$ (i.e., $S_j$ labeled by $f_i$) and evaluate the error of the returned hypothesis on the points of $C$ that do not appear in $S_j$. We denote this error as $M_{i,j}$. Importantly, if we evaluate the hypothesis on the entire set $C$, then the resulting error must be at least $M_{i,j}/2$, since $|C \setminus S_j| \ge |C|/2 = m$. Therefore, bounding $M_{i,j}$ from below also bounds the error on the entire set $C$: $L_{\mathcal{D}_i}(A(S_j^i)) \ge M_{i,j}/2$.

Property of the Error Matrix. Consider any column $j$ of the matrix $M$. We claim that the average of the entries in this column is $1/2$. This is because, for any function $f_i$ and any point $x \in C \setminus S_j$, there exists exactly one function $f_{i'}$ that disagrees with $f_i$ on $x$ and agrees on all the rest of $C$. Since we only show $S_j$ to the algorithm (and $f_i$, $f_{i'}$ label $S_j$ identically), it must produce the same output on $x$ for both $f_i$ and $f_{i'}$, leading to an error on exactly one of them. This means that, for each $x \in C \setminus S_j$, half of the functions will incur an error on $x$. Averaging over all $x \in C \setminus S_j$ gives us the desired result: $\frac{1}{T} \sum_{i=1}^{T} M_{i,j} = \frac{1}{2}$.
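The column-average property holds for any learner, which we can verify exhaustively. The sketch below assumes the same toy $m = 2$ and a deliberately naive learner (memorize training labels, predict $0$ elsewhere); both are illustrative choices, not part of the theorem. It builds the full matrix $M$ and checks that every column averages exactly $1/2$:

```python
from itertools import product

m = 2                                         # toy training set size (assumption)
C = list(range(2 * m))                        # |C| = 2m
funcs = list(product([0, 1], repeat=len(C)))  # all 2^(2m) labelings f_i
seqs = list(product(C, repeat=m))             # all (2m)^m training sequences S_j

def learn(sample):
    """Toy learner (illustrative): memorize training labels, predict 0 elsewhere."""
    memory = dict(sample)
    return lambda x: memory.get(x, 0)

# M[i][j] = error of the hypothesis trained on S_j labeled by f_i,
# measured on the points of C outside S_j
M = [[0.0] * len(seqs) for _ in funcs]
for i, f in enumerate(funcs):
    for j, S in enumerate(seqs):
        h = learn([(x, f[x]) for x in S])
        unseen = [x for x in C if x not in S]
        M[i][j] = sum(h(x) != f[x] for x in unseen) / len(unseen)

# Every column of M averages 1/2, no matter which learner is plugged in above.
for j in range(len(seqs)):
    col_avg = sum(M[i][j] for i in range(len(funcs))) / len(funcs)
    assert abs(col_avg - 0.5) < 1e-9
```

Swapping `learn` for any other deterministic learner leaves the column averages unchanged, which is the heart of the argument.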

Conclusion. Since the average of each column is $1/2$, the average of all entries of $M$ is $1/2$, and hence there must exist at least one row with average at least $1/2$. The following inequality holds for this row $i$:

$$\frac{1}{k} \sum_{j=1}^{k} M_{i,j} \ge \frac{1}{2}.$$

Let $\theta = L_{\mathcal{D}_i}(A(S))$ with $S \sim \mathcal{D}_i^m$, a random variable taking values in $[0, 1]$. Then, since the error on the entire set $C$ is at least half the error on the unseen points, we have:

$$\mathbb{E}[\theta] \ge \frac{1}{2} \cdot \frac{1}{k} \sum_{j=1}^{k} M_{i,j} \ge \frac{1}{4}.$$

Hence, by the reverse Markov inequality for a $[0, 1]$-valued random variable,

$$\Pr\!\left[\theta \ge \frac{1}{8}\right] \ge \frac{\mathbb{E}[\theta] - 1/8}{1 - 1/8}.$$

Solving for $\Pr[\theta \ge 1/8]$ using $\mathbb{E}[\theta] \ge 1/4$ gives us

$$\Pr\!\left[\theta \ge \frac{1}{8}\right] \ge \frac{1/4 - 1/8}{7/8} = \frac{1}{7}.$$
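The conclusion can also be confirmed end-to-end on a toy instance: pick the worst row for a concrete learner and check both bounds by exhaustive enumeration. The learner and $m = 2$ below are illustrative assumptions, not the only ones the theorem covers:

```python
from itertools import product

m = 2                                         # toy training set size (assumption)
C = list(range(2 * m))
funcs = list(product([0, 1], repeat=len(C)))  # all labelings f_i of C
seqs = list(product(C, repeat=m))             # all (2m)^m training sequences S_j

def learn(sample):
    """Toy learner (illustrative): memorize training labels, predict 0 elsewhere."""
    memory = dict(sample)
    return lambda x: memory.get(x, 0)

worst_avg, worst_thetas = -1.0, None
for f in funcs:
    # theta_j = error of the trained hypothesis over the whole of C,
    # one value per equally likely training sequence S_j
    thetas = []
    for S in seqs:
        h = learn([(x, f[x]) for x in S])
        thetas.append(sum(h(x) != f[x] for x in C) / len(C))
    avg = sum(thetas) / len(thetas)
    if avg > worst_avg:
        worst_avg, worst_thetas = avg, thetas

assert worst_avg >= 1 / 4                     # E[theta] >= 1/4 for the worst f_i
prob = sum(t >= 1 / 8 for t in worst_thetas) / len(worst_thetas)
assert prob >= 1 / 7                          # P[theta >= 1/8] >= 1/7
```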

Discussion

The No Free Lunch Theorem highlights a fundamental limitation of machine learning: every algorithm can fail on some distributions. However, we have seen the remarkable success of certain algorithms in practice. So what is going on here?

A shallow answer is that, while such "bad" distributions exist, they do not occur in real-world scenarios. A deeper answer is that, in practice, we often have some prior knowledge or assumptions (inductive biases) about the data distribution, like:

  • Nearby pixels in images are correlated (CNNs exploit this).
  • The next word is related to the previous words (RNNs and Transformers exploit this).

In other words, real-world data is not arbitrary; it has patterns and structures that certain algorithms can exploit effectively.

If we go back to the claim "the algorithm must make an error on at least one of $f_i$ and $f_{i'}$" in the proof, we can see that the one incurring the error is often the one whose pattern is not aligned with the algorithm's inductive bias. Therefore, the "row" we pick in the conclusion step typically corresponds to a bizarre function that does not occur in practice at all.