Commit 0c9d9b4

committed
[ci skip] MTN Add section about noise at the end of the notebook on overfit, generalization and underfit (#818) d365ddc
1 parent 6963515 commit 0c9d9b4

File tree

3 files changed (+93 -1 lines changed)


_sources/python_scripts/cross_validation_validation_curve.py

+48
@@ -165,6 +165,54 @@
# small compared to their respective values, and therefore the conclusions above
# are quite clear. This is not necessarily always the case.

# %% [markdown]
# ## What is noise?
#
# In this notebook, we talked about the fact that datasets can contain noise.
#
# There can be several kinds of noise, among which we can identify:
#
# - measurement imprecision from a physical sensor (e.g. temperature);
# - reporting errors by human collectors.
#
# Those unpredictable data acquisition errors can happen either in the input
# features or in the target variable (in which case we often call it label
# noise).
#
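# %% [markdown]
# As a small illustrative sketch (not part of the original notebook), the next
# cell simulates label noise: the recorded target differs from the true value
# by a random measurement error, so even the true underlying function cannot
# predict the recorded labels exactly. All variable names and values below are
# made up for the example.

# %%
import numpy as np
from sklearn.metrics import mean_absolute_error

rng = np.random.RandomState(0)
n_samples = 1_000

# A clean, fully deterministic relationship between a feature and the target.
x = rng.uniform(0, 10, size=n_samples)
y_clean = 3 * x + 5

# Label noise: what we record is the true value plus a random measurement error.
y_noisy = y_clean + rng.normal(scale=2.0, size=n_samples)

# Even predicting with the exact generating function leaves an error floor set
# by the noise level, not by the model.
print(f"MAE of a perfect model on noisy labels: {mean_absolute_error(y_noisy, y_clean):.2f}")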
# In practice, the **most common source of "noise" is not necessarily
# real noise**, but rather **the absence of the measurement of a relevant
# feature**.
#
# Consider the following example: when predicting the price of a house, the
# surface area will surely impact the price. However, the price will also be
# influenced by whether the seller is in a rush and decides to sell the house
# below the market price. A model will be able to make predictions based on the
# former but not the latter, so "seller's rush" is a source of noise since it
# won't be present in the features.
#
# Since this missing/unobserved feature varies randomly from one sample to
# the next, it appears as if the target variable was changing because of the
# impact of a random perturbation or noise, even if there were no significant
# errors made during the data collection process (besides not measuring the
# unobserved input feature).
#
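# %% [markdown]
# The following cell is a minimal sketch (not from the original notebook) of
# that situation: the simulated price depends on an observed feature (the
# surface) and on an unobserved one (the seller's rush). A model trained on the
# observed feature alone sees the contribution of the missing feature as
# irreducible noise. The feature names and coefficients are invented.

# %%
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(42)
n_samples = 1_000

surface = rng.uniform(30, 200, size=(n_samples, 1))   # observed feature
sellers_rush = rng.binomial(1, 0.2, size=n_samples)   # unobserved feature

# The price depends on both, but only the surface is given to the model: a
# seller in a rush accepts a 30% discount with respect to the market price.
price = 3_000 * surface.ravel() * (1 - 0.3 * sellers_rush)

model = LinearRegression()
print(
    "R2 without the 'rush' feature:",
    cross_val_score(model, surface, price, cv=5).mean().round(3),
)

# If the missing feature could be observed, the apparent noise would disappear.
X_full = np.column_stack([surface.ravel(), sellers_rush])
print(
    "R2 with all relevant features: ",
    cross_val_score(model, X_full, price, cv=5).mean().round(3),
)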
# One extreme case could happen if there were samples in the dataset with
# exactly the same input feature values but different values for the target
# variable. That is very unlikely in real-life settings, but could be the case
# if all features are categorical or if the numerical features were discretized
# or rounded naively. In our example, we can imagine two houses having
# the exact same features in our dataset, but having different prices because
# of the (unmeasured) seller's rush.
#
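# %% [markdown]
# A toy sketch (not part of the original notebook) of that extreme case: after
# rounding the surface to coarse bins, several houses share identical feature
# values but have different prices, so no deterministic model can fit them all
# exactly. The numbers are made up for illustration.

# %%
import pandas as pd

houses = pd.DataFrame(
    {
        "surface_bin": [100, 100, 150, 150],  # surface rounded to the nearest 50 m²
        "price": [290_000, 310_000, 440_000, 460_000],
    }
)
# Within a group of identical feature values, the spread of the target is an
# irreducible error for any model that only sees `surface_bin`.
print(houses.groupby("surface_bin")["price"].agg(["mean", "std"]))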
# Apart from these extreme cases, it's hard to know for sure what should or
# should not qualify as noise, and which kind of "noise" as introduced above is
# dominating. But in practice, the best ways to make our predictive models
# robust to noise are to avoid overfitting by:
#
# - selecting models that are simple enough or with tuned hyper-parameters, as
#   explained in this module;
# - collecting a larger number of labeled samples for the training set (see the
#   sketch below).

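# %% [markdown]
# A minimal sketch (not part of the original notebook) of those two remedies on
# simulated noisy data: constraining the model's flexibility and adding more
# training samples both reduce the impact of the noise. The chosen estimator
# and hyper-parameter values are arbitrary and only serve the illustration.

# %%
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)

def make_noisy_data(n_samples):
    # Target = smooth signal + unpredictable noise.
    X = rng.uniform(0, 10, size=(n_samples, 1))
    y = np.sin(X.ravel()) + rng.normal(scale=0.5, size=n_samples)
    return X, y

for n_samples in (100, 1_000):
    X, y = make_noisy_data(n_samples)
    for max_depth in (None, 3):  # fully grown tree vs. constrained tree
        score = cross_val_score(
            DecisionTreeRegressor(max_depth=max_depth, random_state=0), X, y, cv=5
        ).mean()
        print(f"n_samples={n_samples:>5}, max_depth={max_depth}: R2 = {score:.2f}")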

# %% [markdown]
# ## Summary:
#

python_scripts/cross_validation_validation_curve.html

+44
@@ -695,6 +695,7 @@ <h2> Contents </h2>
<ul class="visible nav section-nav flex-column">
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#overfitting-vs-underfitting">Overfitting vs. underfitting</a></li>
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#validation-curve">Validation curve</a></li>
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#what-is-noise">What is noise?</a></li>
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#summary">Summary:</a></li>
</ul>
</nav>
@@ -882,6 +883,48 @@ <h2>Validation curve<a class="headerlink" href="#validation-curve" title="Link t
small compared to their respective values, and therefore the conclusions above
are quite clear. This is not necessarily always the case.</p>
</section>
<section id="what-is-noise">
<h2>What is noise?<a class="headerlink" href="#what-is-noise" title="Link to this heading">#</a></h2>
<p>In this notebook, we talked about the fact that datasets can contain noise.</p>
<p>There can be several kinds of noise, among which we can identify:</p>
<ul class="simple">
<li><p>measurement imprecision from a physical sensor (e.g. temperature);</p></li>
<li><p>reporting errors by human collectors.</p></li>
</ul>
<p>Those unpredictable data acquisition errors can happen either in the input
features or in the target variable (in which case we often call it label
noise).</p>
<p>In practice, the <strong>most common source of “noise” is not necessarily
real noise</strong>, but rather <strong>the absence of the measurement of a relevant
feature</strong>.</p>
<p>Consider the following example: when predicting the price of a house, the
surface area will surely impact the price. However, the price will also be
influenced by whether the seller is in a rush and decides to sell the house
below the market price. A model will be able to make predictions based on the
former but not the latter, so “seller’s rush” is a source of noise since it
won’t be present in the features.</p>
<p>Since this missing/unobserved feature varies randomly from one sample to
the next, it appears as if the target variable was changing because of the
impact of a random perturbation or noise, even if there were no significant
errors made during the data collection process (besides not measuring the
unobserved input feature).</p>
<p>One extreme case could happen if there were samples in the dataset with
exactly the same input feature values but different values for the target
variable. That is very unlikely in real-life settings, but could be the case
if all features are categorical or if the numerical features were discretized
or rounded naively. In our example, we can imagine two houses having
the exact same features in our dataset, but having different prices because
of the (unmeasured) seller’s rush.</p>
<p>Apart from these extreme cases, it’s hard to know for sure what should or
should not qualify as noise, and which kind of “noise” as introduced above is
dominating. But in practice, the best ways to make our predictive models
robust to noise are to avoid overfitting by:</p>
<ul class="simple">
<li><p>selecting models that are simple enough or with tuned hyper-parameters, as
explained in this module;</p></li>
<li><p>collecting a larger number of labeled samples for the training set.</p></li>
</ul>
</section>
<section id="summary">
<h2>Summary:<a class="headerlink" href="#summary" title="Link to this heading">#</a></h2>
<p>In this notebook, we saw:</p>
@@ -959,6 +1002,7 @@ <h2>Summary:<a class="headerlink" href="#summary" title="Link to this heading">#
<ul class="visible nav section-nav flex-column">
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#overfitting-vs-underfitting">Overfitting vs. underfitting</a></li>
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#validation-curve">Validation curve</a></li>
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#what-is-noise">What is noise?</a></li>
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#summary">Summary:</a></li>
</ul>
</nav></div>

searchindex.js

+1 -1
Some generated files are not rendered by default.

0 commit comments
