DOI: 10.1145/2939672.2939785

XGBoost: A Scalable Tree Boosting System

Authors: Tianqi Chen, Carlos Guestrin
KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Pages 785 - 794
Published: 13 August 2016
    Abstract

    Tree boosting is a highly effective and widely used machine learning method. In this paper, we describe a scalable end-to-end tree boosting system called XGBoost, which is used widely by data scientists to achieve state-of-the-art results on many machine learning challenges. We propose a novel sparsity-aware algorithm for sparse data and weighted quantile sketch for approximate tree learning. More importantly, we provide insights on cache access patterns, data compression and sharding to build a scalable tree boosting system. By combining these insights, XGBoost scales beyond billions of examples using far fewer resources than existing systems.

    Supplementary Material

    MP4 File (kdd2016_chen_boosting_system_01-acm.mp4)

    Published In

    KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
    August 2016
    2176 pages
    ISBN:9781450342322
    DOI:10.1145/2939672
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tag

    1. large-scale machine learning

    Qualifiers

    • Research-article

    Acceptance Rates

    KDD '16 Paper Acceptance Rate 66 of 1,115 submissions, 6%;
    Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
