@@ -695,6 +695,7 @@ <h2>Contents</h2>
<ul class="visible nav section-nav flex-column">
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#overfitting-vs-underfitting">Overfitting vs. underfitting</a></li>
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#validation-curve">Validation curve</a></li>
+ <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#what-is-noise">What is noise?</a></li>
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#summary">Summary:</a></li>
</ul>
</nav>
@@ -882,6 +883,48 @@ <h2>Validation curve<a class="headerlink" href="#validation-curve" title="Link t
small compared to their respective values, and therefore the conclusions above
are quite clear. This is not necessarily always the case.</p>
</section>
+ <section id="what-is-noise">
+ <h2>What is noise?<a class="headerlink" href="#what-is-noise" title="Link to this heading">#</a></h2>
+ <p>In this notebook, we talked about the fact that datasets can contain noise.</p>
+ <p>There can be several kinds of noise, among which we can identify:</p>
+ <ul class="simple">
+ <li><p>measurement imprecision from a physical sensor (e.g. temperature);</p></li>
+ <li><p>reporting errors by human collectors.</p></li>
+ </ul>
+ <p>Those unpredictable data acquisition errors can happen either in the input
+ features or in the target variable (in which case we often call this label
+ noise).</p>
+ <p>In practice, the <strong>most common source of “noise” is not necessarily
+ real noise</strong>, but rather <strong>the absence of a measurement of a relevant
+ feature</strong>.</p>
+ <p>Consider the following example: when predicting the price of a house, the
+ surface area will surely impact the price. However, the price will also be
+ influenced by whether the seller is in a rush and decides to sell the house
+ below the market price. A model can make predictions based on the former but
+ not the latter, so the “seller’s rush” is a source of noise since it is not
+ present in the features.</p>
+ <p>Since this missing/unobserved feature varies randomly from one sample to
+ the next, it appears as if the target variable were changing under the
+ impact of a random perturbation or noise, even if no significant errors
+ were made during the data collection process (besides not measuring the
+ unobserved input feature).</p>
+ <p>One extreme case could happen if there were samples in the dataset with
+ exactly the same input feature values but different values for the target
+ variable. That is very unlikely in real-life settings, but could be the case
+ if all features are categorical or if the numerical features were naively
+ discretized or rounded. In our example, we can imagine two houses having
+ exactly the same features in our dataset, but different prices because of
+ the (unmeasured) seller’s rush.</p>
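This extreme case can be made concrete with a small sketch (hypothetical numbers, assuming a simple linear pricing rule and a fixed rush discount): two houses with identical observed features end up with different target values.

```python
# Hypothetical generating process: price depends on the observed surface
# area, plus an unobserved "seller's rush" discount acting as label noise.
def price(area_m2, in_a_rush):
    return 3_000 * area_m2 - 30_000 * in_a_rush

# Two houses with exactly the same observed feature (area = 100 m^2)...
house_a = price(100, in_a_rush=0)
house_b = price(100, in_a_rush=1)

# ...get different target values, since the rush is not in the features.
print(house_a, house_b)  # 300000 270000
```

From the model's point of view, which only sees the area, these two samples are contradictory, so no deterministic model can fit both exactly.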
+ <p>Apart from this extreme case, it is hard to know for sure what should or
+ should not qualify as noise, and which kind of “noise” as introduced above
+ is dominating. In practice, the best ways to make our predictive models
+ robust to noise are to avoid overfitting by:</p>
+ <ul class="simple">
+ <li><p>selecting models that are simple enough or with tuned hyper-parameters
+ as explained in this module;</p></li>
+ <li><p>collecting a larger number of labeled samples for the training set.</p></li>
+ </ul>
+ </section>
<section id="summary">
<h2>Summary:<a class="headerlink" href="#summary" title="Link to this heading">#</a></h2>
<p>In this notebook, we saw:</p>
@@ -959,6 +1002,7 @@ <h2>Summary:<a class="headerlink" href="#summary" title="Link to this heading">#
<ul class="visible nav section-nav flex-column">
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#overfitting-vs-underfitting">Overfitting vs. underfitting</a></li>
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#validation-curve">Validation curve</a></li>
+ <li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#what-is-noise">What is noise?</a></li>
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#summary">Summary:</a></li>
</ul>
</nav></div>