XGBoost: A Scalable Tree Boosting System

Authors:

Tianqi Chen, Carlos Guestrin

KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Pages 785 - 794

https://doi.org/10.1145/2939672.2939785

Published: 13 August 2016

Abstract

Tree boosting is a highly effective and widely used machine learning method. In this paper, we describe a scalable end-to-end tree boosting system called XGBoost, which is used widely by data scientists to achieve state-of-the-art results on many machine learning challenges. We propose a novel sparsity-aware algorithm for sparse data and weighted quantile sketch for approximate tree learning. More importantly, we provide insights on cache access patterns, data compression and sharding to build a scalable tree boosting system. By combining these insights, XGBoost scales beyond billions of examples using far fewer resources than existing systems.
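The sparsity-aware idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes the standard second-order split gain with an L2 penalty `lam` and a per-split penalty `gamma`, and the function names `split_gain` and `best_split_with_default` are hypothetical. For one feature, only the rows where the value is present are scanned, and the missing rows are sent as a block to whichever side gives the larger gain.

```python
def split_gain(g_l, h_l, g_r, h_r, lam=1.0, gamma=0.0):
    """Second-order split gain (assumed form) for left/right gradient and Hessian sums."""
    def score(g, h):
        return g * g / (h + lam)
    return 0.5 * (score(g_l, h_l) + score(g_r, h_r)
                  - score(g_l + g_r, h_l + h_r)) - gamma


def best_split_with_default(values, grads, hess, lam=1.0, gamma=0.0):
    """Return (gain, threshold, default_direction) for a single feature.

    values[i] is None when the feature is missing for row i.
    Returns (0.0, None, None) if no split has positive gain.
    """
    g_total, h_total = sum(grads), sum(hess)
    # Only non-missing entries are sorted and scanned.
    present = sorted(
        ((v, g, h) for v, g, h in zip(values, grads, hess) if v is not None),
        key=lambda t: t[0],
    )
    g_present = sum(g for _, g, _ in present)
    h_present = sum(h for _, _, h in present)

    best = (0.0, None, None)
    g_acc = h_acc = 0.0
    for i in range(len(present) - 1):
        v, g, h = present[i]
        g_acc += g
        h_acc += h
        nxt = present[i + 1][0]
        if nxt == v:
            continue  # no threshold between equal values
        threshold = (v + nxt) / 2.0

        # Option 1: missing rows default to the right child.
        gain_r = split_gain(g_acc, h_acc,
                            g_total - g_acc, h_total - h_acc, lam, gamma)
        # Option 2: missing rows default to the left child.
        g_right = g_present - g_acc
        h_right = h_present - h_acc
        gain_l = split_gain(g_total - g_right, h_total - h_right,
                            g_right, h_right, lam, gamma)

        for gain, direction in ((gain_r, "right"), (gain_l, "left")):
            if gain > best[0]:
                best = (gain, threshold, direction)
    return best
```

For example, with `values = [1.0, 2.0, None, None]`, `grads = [-1.0, 1.0, 1.0, 1.0]` and unit Hessians, the sketch picks the threshold 1.5 and sends the missing rows to the right, since their gradients match the row with value 2.0. The cost of the scan depends on the number of present entries rather than the total number of rows, which is the point of treating sparsity explicitly.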

Index Terms

XGBoost: A Scalable Tree Boosting System
1. Computing methodologies
  1. Machine learning
2. Information systems
  1. Information systems applications
    1. Data mining

Published In

KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

August 2016

2176 pages

ISBN:9781450342322

DOI:10.1145/2939672

General Chairs: Balaji Krishnapuram (IBM), Mohak Shah (Bosch)

Program Chairs: Alex Smola (Amazon), Charu Aggarwal (IBM), Dou Shen (Baidu), Rajeev Rastogi (Amazon)

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tag

large-scale machine learning

Qualifiers

Research-article

Conference

KDD '16

KDD '16: The 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

August 13 - 17, 2016

San Francisco, California, USA

Acceptance Rates

KDD '16 Paper Acceptance Rate 66 of 1,115 submissions, 6%;

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Article Metrics

Total Citations: 21,820
Total Downloads: 164,506
Downloads (last 12 months): 44,327
Downloads (last 6 weeks): 5,278

Reflects downloads up to 21 Sep 2024
