Robust approach to machine learning models comparison - Dmitry Larko, Sr. Data Scientist, H2O.ai
Dmitry Larko
Sr. Data Scientist @ H2O.ai
@DmitryLarko / dmitry@h2o.ai / h2o.ai
Robust approach to ML models comparison
Problem and data
• Competition: Mercedes-Benz Greener Manufacturing
• The target – car testing time (sec)
• Evaluation metric – R²
• Train 4029 rows
• Test 4029 rows: 81% private, 19% public
• Features (378 columns):
• Binary – tests' characteristics (369 columns)
• Categorical – car's characteristics (8 columns)
• ID – numerical order
What could possibly go wrong?
Leaderboard shake-up stats
• Biggest improvement: 3808 places (3923 ⇒ 115)
• Second biggest improvement: 2838 places (2843 ⇒ 5)
• Biggest fall: 3564 places (I won't point anyone out)
Source: https://www.kaggle.com/c/mercedes-benz-greener-manufacturing/discussion/36103
Leaderboard shake-up
[Leaderboard screenshots omitted; source: https://www.kaggle.com/c/mercedes-benz-greener-manufacturing/discussion/36103]
Why?
• Small public test:
• Test 4029 rows: 81% private, 19% public
• Outliers:
• Dropping them, or stratifying with respect to them, did not help
• The metric is very sensitive to outliers:
• The 5-fold cross-validation std of R² is 0.068
• Most competitors overfit to the public leaderboard
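A quick, self-contained illustration (my own, not from the original slides) of how fragile R² is: one extreme target that the model misses can drag the score down sharply, even though every other prediction is unchanged. The numbers below are synthetic:

```python
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
y_true = rng.normal(100, 10, size=500)        # e.g. typical testing times
y_pred = y_true + rng.normal(0, 3, size=500)  # a reasonably good model
print(r2_score(y_true, y_pred))               # ~0.91 on the clean data

# One extreme target the model misses changes the picture completely,
# even though the other 500 predictions are untouched.
y_true_out = np.append(y_true, 265.0)
y_pred_out = np.append(y_pred, 110.0)
print(r2_score(y_true_out, y_pred_out))       # drops to roughly 0.6
```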
Better approach to cross-validation
• 10 different splits into 5 folds – 50 scores for each model
• To compare two models we use the t-test for related samples
(scipy.stats.ttest_rel)
• We assess the significance of the difference between the two models:

$$T(X_1^n, X_2^n) = \frac{\bar{X}_1 - \bar{X}_2}{S / \sqrt{n}}$$

$X_1, X_2$ – out-of-fold values of $R^2$ for the respective folds ($\bar{X}_1, \bar{X}_2$ are their means),
$S$ – standard deviation of the pairwise differences of $X_1$ and $X_2$, $S = \sqrt{\mathrm{Var}(X_1 - X_2)}$,
$n$ – number of paired scores (50 in our case)
Wiki: http://en.wikipedia.org/wiki/T-test#Dependent_t-test
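Here is a minimal sketch of the scheme above, assuming scikit-learn's RepeatedKFold for the 10×5 splits; the dataset and the two models are placeholders:

```python
import numpy as np
from scipy.stats import ttest_rel
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import RepeatedKFold, cross_val_score

X, y = make_regression(n_samples=1000, n_features=20, noise=10, random_state=0)

# 10 different splits into 5 folds -> 50 out-of-fold R^2 scores per model.
cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=42)

scores_a = cross_val_score(Ridge(), X, y, scoring="r2", cv=cv)
scores_b = cross_val_score(
    RandomForestRegressor(n_estimators=100, random_state=0),
    X, y, scoring="r2", cv=cv,
)

# The scores are paired (same folds for both models), hence the
# t-test for related samples.
t_stat, p_value = ttest_rel(scores_a, scores_b)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```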
Better approach to cross-validation
• The main strategy is to test fewer hypotheses
• When changing a hyper-parameter, we should observe how the t-statistic changes. If the t-statistic changes smoothly and has an optimum where p < 0.05, then the difference is statistically significant
• If the t-statistic changes erratically with small changes of the hyper-parameter, there is no real difference
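A sketch of that hyper-parameter check, reusing the setup from the previous snippet; Ridge's alpha stands in for whatever hyper-parameter you are tuning:

```python
import numpy as np
from scipy.stats import ttest_rel
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import RepeatedKFold, cross_val_score

X, y = make_regression(n_samples=1000, n_features=20, noise=10, random_state=0)
cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=42)

# Baseline model with the current hyper-parameter value.
baseline = cross_val_score(Ridge(alpha=1.0), X, y, scoring="r2", cv=cv)

# Sweep the hyper-parameter (skipping the baseline value itself) and
# watch how the paired t-statistic moves along the grid.
for alpha in [0.01, 0.1, 10.0, 100.0, 1000.0]:
    candidate = cross_val_score(Ridge(alpha=alpha), X, y, scoring="r2", cv=cv)
    t_stat, p_value = ttest_rel(candidate, baseline)
    print(f"alpha={alpha:>7}: t={t_stat:+.2f}  p={p_value:.3f}")

# A smooth trend with an optimum where p < 0.05 suggests a real effect;
# erratic jumps between neighbouring values suggest noise.
```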
Why is testing a lot of hypotheses a bad thing?
• The more hypothesis tests you run, the higher the probability of getting a significant score by pure chance (bad luck)
• There are many approaches to avoid this, but the easiest one is to run fewer tests and think carefully about what to test 🙂
*https://xkcd.com/882/
- Is that enough?
- Well, no 🙁
Outliers
• ML algorithms get distracted by outliers and spend a large part of the learning process trying to predict them
• Assuming we can identify the outliers, there are two ways to fight them:
• Remove the outliers from the dataset, and train and validate your models without them. An overview of this approach can be found here: https://www.kaggle.com/c/sberbank-russian-housing-market/discussion/35684. That team took 1st place using it.
• Keep the outliers in the train part but remove them from validation, so your model is trained with outliers, but the decision about performance is made without considering their impact. Check the cross_validation_score_statement function in this repo: https://github.com/Danila89/cross_validation_custom (a sketch of the idea follows the diagrams below)
Dealing with outliers. Approach 1.
[Diagram: remove the outliers from the dataset first, then split the remaining rows into train and validation.]
Dealing with outliers. Approach 2.
[Diagram: split the dataset as usual; outliers stay in the train part but are removed from validation.]
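A minimal sketch of Approach 2 (my own illustration of the idea, not the actual cross_validation_score_statement function from the repo above): fit on the full training folds, but score each fold's R² on non-outlier rows only. The outlier rule here is a deliberately naive placeholder:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=1000, n_features=20, noise=10, random_state=0)

# Placeholder outlier rule: flag extreme targets; use whatever fits your data.
is_outlier = np.abs(y - y.mean()) > 3 * y.std()

scores = []
for train_idx, valid_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = Ridge().fit(X[train_idx], y[train_idx])  # outliers stay in train...
    keep = valid_idx[~is_outlier[valid_idx]]         # ...but leave validation
    scores.append(r2_score(y[keep], model.predict(X[keep])))

print(np.mean(scores), np.std(scores))
```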
Which approach to choose?
That depends, but as a rule of thumb:
• If the outliers were introduced by mistake (all records in millions, a few in thousands) – Approach 1 is preferable
• If the outliers are defined by the nature of the data – use Approach 2
Wrapping up
• I did not participate in this competition, but I found these approaches promising enough to share with you.
• All kudos go to Danila Savenkov, who placed 11th in this competition and shared his insights here (he covers much more than this presentation):
• https://www.kaggle.com/c/mercedes-benz-greener-manufacturing/discussion/36242
• https://www.youtube.com/watch?v=0qHXNeuNOAE (English subtitles available)
• slides: https://github.com/yandexdataschool/ml-training-website/raw/gh-pages/presentations/Savenkov_KaggleMercedes_2017_eng.pdf