We provide an easy-to-understand proof of the No Free Lunch Theorem, along with a discussion on why it doesn't contradict the success of machine learning in practice.
No Free Lunch Theorem
Formal Statement. Let $A$ be any learning algorithm for the task of binary classification with the $0\text{-}1$ loss (the loss is $0$ if the prediction is correct and $1$ otherwise) over a domain $\mathcal{X}$. Let $m \le |\mathcal{X}|/2$ represent a training set size. Then, there exists a distribution $\mathcal{D}$ over $\mathcal{X} \times \{0, 1\}$ such that:

1. There exists a function $f: \mathcal{X} \to \{0, 1\}$ with $L_{\mathcal{D}}(f) = 0$ (i.e., $f$ perfectly classifies the data). Here, $L_{\mathcal{D}}(h)$ denotes the expected loss of $h$ under distribution $\mathcal{D}$.
2. With probability at least $1/7$ over the choice of training set $S \sim \mathcal{D}^m$, we have $L_{\mathcal{D}}(A(S)) \ge 1/8$.
Interpretation. For any learning algorithm you choose, there will always be some distribution $\mathcal{D}$ such that the final error has a high probability of being large, even though there exists a perfect classifier for that distribution.
Proof
Most existing proofs are tedious and involve a ton of notation.
Here, we provide a rigorous but much shorter proof (I tried my best to
keep it simple).
Functions and Distributions. Let $C \subseteq \mathcal{X}$ be a subset of size $2m$. Then, there exist $T = 2^{2m}$ mappings from $C$ to $\{0, 1\}$. We denote these mappings as $f_1, \dots, f_T$. For each $f_i$, we can easily construct a distribution $\mathcal{D}_i$ over $C \times \{0, 1\}$ that is uniform over the pairs $(x, f_i(x))$ for $x \in C$ and $0$ elsewhere. This ensures that $L_{\mathcal{D}_i}(f_i) = 0$.
Training Sets. We consider all sequences of $m$ elements from $C$. There are $k = (2m)^m$ such sequences, which we denote as $S_1, \dots, S_k$. For each $f_i$, we can obtain the training set $S_j^i$ by sampling some $S_j = (x_1, \dots, x_m)$ uniformly from these sequences, and then labeling it according to $f_i$: $S_j^i = \big((x_1, f_i(x_1)), \dots, (x_m, f_i(x_m))\big)$.
Building the Error Matrix. For each $i$ and $j$, we train the algorithm on the training set $S_j^i$ labeled by $f_i$ and evaluate the error of the resulting hypothesis $A(S_j^i)$ on the unseen points $C \setminus S_j$. We denote this error as $E_{ij}$. Importantly, if we evaluate the hypothesis on the entire set $C$, then the resulting error $L_{\mathcal{D}_i}(A(S_j^i))$ must be at least $E_{ij}/2$, since $|C \setminus S_j| \ge |C| - m = |C|/2$. Therefore, bounding $E_{ij}$ from below also bounds the error on the entire set $C$.
Property of the Error Matrix. Consider any column $j$ of the matrix $E$. We claim that the average of the entries in this column is $1/2$. This is because, for any function $f_i$ and any point $x \in C \setminus S_j$, there exists exactly one function $f_{i'}$ that disagrees with $f_i$ on $x$ and agrees on all the rest. Since we only show $S_j^i$ (which equals $S_j^{i'}$, as $f_i$ and $f_{i'}$ agree on $S_j$) to the algorithm, it must produce the same output on $x$ for both $f_i$ and $f_{i'}$, leading to an error on exactly one of them. This means that for each $x \in C \setminus S_j$, half of the functions will incur an error on $x$. Averaging over all $x \in C \setminus S_j$ gives us the desired result.
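The column-averaging claim can be checked by brute force on a tiny domain. Below is a minimal sketch (not from the original proof) that builds the full error matrix for $m = 2$, $|C| = 4$; the learner plugged in here is a hypothetical "memorize the training set, guess 0 elsewhere" rule, but the pairing argument above guarantees the check passes for any deterministic learner you substitute into `learn`:

```python
import itertools
from fractions import Fraction

m = 2                      # training set size
C = list(range(2 * m))     # the subset C of the domain, |C| = 2m

# All T = 2^(2m) labelings f: C -> {0, 1}, each stored as a tuple (f[x] for x in C).
functions = list(itertools.product([0, 1], repeat=len(C)))

# All k = (2m)^m training sequences S_j of m points from C.
sequences = list(itertools.product(C, repeat=m))

def learn(labeled_examples):
    """Toy memorizing learner: repeat seen labels, guess 0 on unseen points."""
    seen = dict(labeled_examples)
    return lambda x: seen.get(x, 0)

def error(f, seq):
    """E_ij: fraction of the unseen points C \\ S_j misclassified by A(S_j^i)."""
    h = learn([(x, f[x]) for x in seq])
    unseen = [x for x in C if x not in seq]
    return Fraction(sum(h(x) != f[x] for x in unseen), len(unseen))

E = [[error(f, s) for s in sequences] for f in functions]

# The claim: every column of the error matrix averages to exactly 1/2.
for j in range(len(sequences)):
    col_avg = sum(E[i][j] for i in range(len(functions))) / len(functions)
    assert col_avg == Fraction(1, 2)
print("every column averages to 1/2")
```

Exact rationals (`Fraction`) are used so the equality with $1/2$ holds exactly rather than up to floating-point error.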
Conclusion. Since the average of each column is $1/2$, the average over all entries of $E$ is also $1/2$, so there must exist at least one row $i$ with average at least $1/2$. The following inequality holds for this row $i$:

$$\mathbb{E}_{S \sim \mathcal{D}_i^m}\big[L_{\mathcal{D}_i}(A(S))\big] = \frac{1}{k} \sum_{j=1}^{k} L_{\mathcal{D}_i}(A(S_j^i)) \ge \frac{1}{2} \cdot \frac{1}{k} \sum_{j=1}^{k} E_{ij} \ge \frac{1}{4}.$$

Let $\theta = L_{\mathcal{D}_i}(A(S))$, a random variable taking values in $[0, 1]$ with $\mathbb{E}[\theta] \ge 1/4$. Then, applying Markov's inequality to $1 - \theta$, we have:

$$\Pr[\theta < 1/8] = \Pr[1 - \theta > 7/8] \le \frac{\mathbb{E}[1 - \theta]}{7/8} \le \frac{1 - 1/4}{7/8} = \frac{6}{7}.$$

Hence, solving for $\Pr[\theta \ge 1/8]$ gives us

$$\Pr[\theta \ge 1/8] \ge 1 - \frac{6}{7} = \frac{1}{7},$$

and setting $\mathcal{D} = \mathcal{D}_i$ completes the proof.
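As a sanity check, the full chain of inequalities can also be verified numerically in a toy setting: a domain of size $2m = 4$ and a hypothetical "memorize the training set, guess 0 elsewhere" learner (both are illustrative choices, not part of the proof). The sketch below finds the worst labeling function for this learner and confirms that its expected full-domain loss is at least $1/4$ and that the loss is at least $1/8$ with probability at least $1/7$:

```python
import itertools
from fractions import Fraction

m = 2
C = list(range(2 * m))
functions = list(itertools.product([0, 1], repeat=len(C)))   # all f_i
sequences = list(itertools.product(C, repeat=m))              # all S_j

def learn(labeled_examples):
    """Toy memorizing learner: repeat seen labels, guess 0 elsewhere."""
    seen = dict(labeled_examples)
    return lambda x: seen.get(x, 0)

def full_loss(f, seq):
    """L_{D_i}(A(S)): error over ALL of C, since D_i is uniform on C."""
    h = learn([(x, f[x]) for x in seq])
    return Fraction(sum(h(x) != f[x] for x in C), len(C))

# For each f_i, the expected loss over a uniformly drawn training sequence S_j.
expected = [sum(full_loss(f, s) for s in sequences) / len(sequences)
            for f in functions]
worst = max(range(len(functions)), key=lambda i: expected[i])

assert expected[worst] >= Fraction(1, 4)   # E[L_D(A(S))] >= 1/4 for the worst row
bad = sum(full_loss(functions[worst], s) >= Fraction(1, 8) for s in sequences)
assert Fraction(bad, len(sequences)) >= Fraction(1, 7)   # P(loss >= 1/8) >= 1/7
print(f"worst f_i: expected loss {expected[worst]}, "
      f"P(loss >= 1/8) = {bad}/{len(sequences)}")
```

For this particular learner the worst function is the all-ones labeling (the learner's "guess 0" bias is maximally misaligned with it), which foreshadows the discussion below.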
Discussion
The No Free Lunch Theorem highlights a fundamental limitation in
machine learning: every algorithm can fail on some distributions.
However, we have seen the remarkable success of certain algorithms in
practice -- so what is going on here?
A shallow answer is that, while such "bad" distributions exist, they
do not occur in real-world scenarios. A deeper answer is that, in
practice, we often have some prior knowledge or assumptions (inductive
biases) about the data distribution, like:
Local pixel correlations in images (CNNs exploit this).
The next word is related to the previous words (RNNs and Transformers
exploit this).
In other words, real-world data is not arbitrary; it has patterns and
structures that certain algorithms can exploit effectively.
If we go back to the claim "the algorithm must make an error on at
least one of $f_i$ and $f_{i'}$" in the proof, we can see that
the one making the error is often the one whose pattern is not aligned
with the algorithm's inductive bias. Therefore, the "row" we pick in the
conclusion step is typically the most bizarre function, one that does not
occur in practice at all.