Commit 0c9d9b4

committed
[ci skip] MTN Add section about noise at the end of the notebook on overfit, generalization and underfit (#818) d365ddc
1 parent 6963515 commit 0c9d9b4

File tree

3 files changed (+93 -1 lines changed)


_sources/python_scripts/cross_validation_validation_curve.py

+48
@@ -165,6 +165,54 @@
# small compared to their respective values, and therefore the conclusions above
# are quite clear. This is not necessarily always the case.

# %% [markdown]
# ## What is noise?
#
# In this notebook, we talked about the fact that datasets can contain noise.
#
# There can be several kinds of noise, among which we can identify:
#
# - measurement imprecision from a physical sensor (e.g. temperature);
# - reporting errors by human collectors.
#
# Those unpredictable data acquisition errors can happen either in the input
# features or in the target variable (in which case we often call it label
# noise).
#
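# %% [markdown]
# As a small illustrative sketch (not part of the original notebook), the next
# cell simulates label noise: the recorded target differs from the true value
# by a random measurement error, so even the true underlying function cannot
# predict the recorded labels exactly. All variable names and values below are
# made up for the example.

# %%
import numpy as np
from sklearn.metrics import mean_absolute_error

rng = np.random.RandomState(0)
n_samples = 1_000

# A clean, fully deterministic relationship between a feature and the target.
x = rng.uniform(0, 10, size=n_samples)
y_clean = 3 * x + 5

# Label noise: what we record is the true value plus a random measurement error.
y_noisy = y_clean + rng.normal(scale=2.0, size=n_samples)

# Even predicting with the exact generating function leaves an error floor set
# by the noise level, not by the model.
print(f"MAE of a perfect model on noisy labels: {mean_absolute_error(y_noisy, y_clean):.2f}")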
# In practice, the **most common source of "noise" is not necessarily
# real noise**, but rather **the absence of the measurement of a relevant
# feature**.
#
# Consider the following example: when predicting the price of a house, the
# surface area will surely impact the price. However, the price will also be
# influenced by whether the seller is in a rush and decides to sell the house
# below the market price. A model will be able to make predictions based on the
# former but not the latter, so "seller's rush" is a source of noise since it
# won't be present in the features.
#
# Since this missing/unobserved feature varies randomly from one sample to
# the next, it appears as if the target variable was changing because of the
# impact of a random perturbation or noise, even if there were no significant
# errors made during the data collection process (besides not measuring the
# unobserved input feature).
#
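# %% [markdown]
# The following cell is a minimal sketch (not from the original notebook) of
# that situation: the simulated price depends on an observed feature (the
# surface) and on an unobserved one (the seller's rush). A model trained on the
# observed feature alone sees the contribution of the missing feature as
# irreducible noise. The feature names and coefficients are invented.

# %%
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(42)
n_samples = 1_000

surface = rng.uniform(30, 200, size=(n_samples, 1))   # observed feature
sellers_rush = rng.binomial(1, 0.2, size=n_samples)   # unobserved feature

# The price depends on both, but only the surface is given to the model: a
# seller in a rush accepts a 30% discount with respect to the market price.
price = 3_000 * surface.ravel() * (1 - 0.3 * sellers_rush)

model = LinearRegression()
print(
    "R2 without the 'rush' feature:",
    cross_val_score(model, surface, price, cv=5).mean().round(3),
)

# If the missing feature could be observed, the apparent noise would disappear.
X_full = np.column_stack([surface.ravel(), sellers_rush])
print(
    "R2 with all relevant features: ",
    cross_val_score(model, X_full, price, cv=5).mean().round(3),
)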
# One extreme case could happen if there were samples in the dataset with
# exactly the same input feature values but different values for the target
# variable. That is very unlikely in real-life settings, but could be the case
# if all features are categorical or if the numerical features were discretized
# or rounded naively. In our example, we can imagine two houses having
# the exact same features in our dataset, but having different prices because
# of the (unmeasured) seller's rush.
#
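# %% [markdown]
# A toy sketch (not part of the original notebook) of that extreme case: after
# rounding the surface to coarse bins, several houses share identical feature
# values but have different prices, so no deterministic model can fit them all
# exactly. The numbers are made up for illustration.

# %%
import pandas as pd

houses = pd.DataFrame(
    {
        "surface_bin": [100, 100, 150, 150],  # surface rounded to the nearest 50 m²
        "price": [290_000, 310_000, 440_000, 460_000],
    }
)
# Within a group of identical feature values, the spread of the target is an
# irreducible error for any model that only sees `surface_bin`.
print(houses.groupby("surface_bin")["price"].agg(["mean", "std"]))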
# Apart from these extreme cases, it's hard to know for sure what should or
# should not qualify as noise, and which kind of "noise" as introduced above is
# dominating. But in practice, the best ways to make our predictive models
# robust to noise are to avoid overfitting by:
#
# - selecting models that are simple enough or with tuned hyper-parameters, as
#   explained in this module;
# - collecting a larger number of labeled samples for the training set (see the
#   sketch below).

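# %% [markdown]
# A minimal sketch (not part of the original notebook) of those two remedies on
# simulated noisy data: constraining the model's flexibility and adding more
# training samples both reduce the impact of the noise. The chosen estimator
# and hyper-parameter values are arbitrary and only serve the illustration.

# %%
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)

def make_noisy_data(n_samples):
    # Target = smooth signal + unpredictable noise.
    X = rng.uniform(0, 10, size=(n_samples, 1))
    y = np.sin(X.ravel()) + rng.normal(scale=0.5, size=n_samples)
    return X, y

for n_samples in (100, 1_000):
    X, y = make_noisy_data(n_samples)
    for max_depth in (None, 3):  # fully grown tree vs. constrained tree
        score = cross_val_score(
            DecisionTreeRegressor(max_depth=max_depth, random_state=0), X, y, cv=5
        ).mean()
        print(f"n_samples={n_samples:>5}, max_depth={max_depth}: R2 = {score:.2f}")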

# %% [markdown]
# ## Summary:
#

python_scripts/cross_validation_validation_curve.html

+44
@@ -695,6 +695,7 @@ <h2> Contents </h2>
<ul class="visible nav section-nav flex-column">
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#overfitting-vs-underfitting">Overfitting vs. underfitting</a></li>
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#validation-curve">Validation curve</a></li>
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#what-is-noise">What is noise?</a></li>
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#summary">Summary:</a></li>
</ul>
</nav>
@@ -882,6 +883,48 @@ <h2>Validation curve<a class="headerlink" href="#validation-curve" title="Link t
small compared to their respective values, and therefore the conclusions above
are quite clear. This is not necessarily always the case.</p>
</section>
<section id="what-is-noise">
<h2>What is noise?<a class="headerlink" href="#what-is-noise" title="Link to this heading">#</a></h2>
<p>In this notebook, we talked about the fact that datasets can contain noise.</p>
<p>There can be several kinds of noise, among which we can identify:</p>
<ul class="simple">
<li><p>measurement imprecision from a physical sensor (e.g. temperature);</p></li>
<li><p>reporting errors by human collectors.</p></li>
</ul>
<p>Those unpredictable data acquisition errors can happen either in the input
features or in the target variable (in which case we often call it label
noise).</p>
<p>In practice, the <strong>most common source of “noise” is not necessarily
real noise</strong>, but rather <strong>the absence of the measurement of a relevant
feature</strong>.</p>
<p>Consider the following example: when predicting the price of a house, the
surface area will surely impact the price. However, the price will also be
influenced by whether the seller is in a rush and decides to sell the house
below the market price. A model will be able to make predictions based on the
former but not the latter, so “seller’s rush” is a source of noise since it
won’t be present in the features.</p>
<p>Since this missing/unobserved feature varies randomly from one sample to
the next, it appears as if the target variable was changing because of the
impact of a random perturbation or noise, even if there were no significant
errors made during the data collection process (besides not measuring the
unobserved input feature).</p>
<p>One extreme case could happen if there were samples in the dataset with
exactly the same input feature values but different values for the target
variable. That is very unlikely in real-life settings, but could be the case
if all features are categorical or if the numerical features were discretized
or rounded naively. In our example, we can imagine two houses having
the exact same features in our dataset, but having different prices because
of the (unmeasured) seller’s rush.</p>
<p>Apart from these extreme cases, it’s hard to know for sure what should or
should not qualify as noise, and which kind of “noise” as introduced above is
dominating. But in practice, the best ways to make our predictive models
robust to noise are to avoid overfitting by:</p>
<ul class="simple">
<li><p>selecting models that are simple enough or with tuned hyper-parameters, as
explained in this module;</p></li>
<li><p>collecting a larger number of labeled samples for the training set.</p></li>
</ul>
</section>
<section id="summary">
<h2>Summary:<a class="headerlink" href="#summary" title="Link to this heading">#</a></h2>
<p>In this notebook, we saw:</p>
@@ -959,6 +1002,7 @@ <h2>Summary:<a class="headerlink" href="#summary" title="Link to this heading">#
<ul class="visible nav section-nav flex-column">
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#overfitting-vs-underfitting">Overfitting vs. underfitting</a></li>
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#validation-curve">Validation curve</a></li>
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#what-is-noise">What is noise?</a></li>
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#summary">Summary:</a></li>
</ul>
</nav></div>

searchindex.js

+1 -1
Some generated files are not rendered by default.

0 commit comments
