
Latent Dirichlet Allocation

Authors: David M. Blei, Andrew Y. Ng, Michael I. Jordan
Pages 993 - 1022
Published: 01 March 2003

    Abstract

    We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.
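The three levels named in the abstract (corpus-level parameters, per-document topic proportions, per-word topic assignments) can be illustrated by sampling from the generative process. The following is a minimal sketch, not the authors' code: the function name `generate_corpus`, the fixed document length, and the toy values of `alpha` and `beta` are assumptions made for illustration, and inference (the variational EM procedure the abstract mentions) is not shown.

```python
import numpy as np

def generate_corpus(n_docs, doc_len, alpha, beta, seed=0):
    """Sample documents from an LDA-style generative process.

    alpha: shape (k,), Dirichlet parameter over topic proportions (corpus level).
    beta:  shape (k, V), row i is topic i's distribution over V words (corpus level).
    Assumption: every document has the same fixed length doc_len.
    """
    rng = np.random.default_rng(seed)
    k, vocab_size = beta.shape
    docs, thetas = [], []
    for _ in range(n_docs):
        # Document level: draw this document's topic proportions.
        theta = rng.dirichlet(alpha)
        # Word level: draw a topic for each word position ...
        z = rng.choice(k, size=doc_len, p=theta)
        # ... then draw each word from its assigned topic.
        words = np.array([rng.choice(vocab_size, p=beta[t]) for t in z])
        docs.append(words)
        thetas.append(theta)
    return docs, np.array(thetas)

# Toy setup (hypothetical values): 3 topics over an 8-word vocabulary.
alpha = np.full(3, 0.5)
beta = np.random.default_rng(1).dirichlet(np.ones(8), size=3)
docs, thetas = generate_corpus(n_docs=5, doc_len=20, alpha=alpha, beta=beta)
```

Each row of `thetas` is the low-dimensional topic representation of one document, which is the sense in which the abstract says the topic probabilities provide an explicit representation of a document.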


    Published In

    The Journal of Machine Learning Research  Volume 3, Issue
    3/1/2003
    1437 pages
    ISSN:1532-4435
    EISSN:1533-7928

    Publisher

    JMLR.org

    Publication History

    Published: 01 March 2003
    Published in JMLR Volume 3

    Qualifiers

    • Article

    Article Metrics

    • Downloads (last 12 months): 2,772
    • Downloads (last 6 weeks): 371
    Reflects downloads up to 28 Jul 2024.

    Cited By

    • (2024) A Natural Language Processing Model for Automated Organization and Analysis of Intangible Cultural Heritage. Journal of Organizational and End User Computing, 36:1 (1-27). DOI: 10.4018/JOEUC.349736. Online publication date: 6-May-2024
    • (2024) Hotel Rating Prediction System Based on Time Factors. Journal of Organizational and End User Computing, 36:1 (1-29). DOI: 10.4018/JOEUC.342129. Online publication date: 15-May-2024
    • (2024) A Machine Learning and Large Language Model-Integrated Approach to Research Project Evaluation. Journal of Database Management, 35:1 (1-14). DOI: 10.4018/JDM.345400. Online publication date: 7-Jan-2024
    • (2024) Community Detection on Social Networks With Sentimental Interaction. International Journal on Semantic Web & Information Systems, 20:1 (1-23). DOI: 10.4018/IJSWIS.341232. Online publication date: 9-Apr-2024
    • (2024) Characteristics of students’ learning behavior preferences — an analysis of self-commentary data based on the LDA model. Journal of Intelligent & Fuzzy Systems: Applications in Engineering and Technology, 46:2 (4495-4509). DOI: 10.3233/JIFS-232971. Online publication date: 14-Feb-2024
    • (2024) A review on network representation learning with multi-granularity perspective. Intelligent Data Analysis, 28:1 (3-32). DOI: 10.3233/IDA-227328. Online publication date: 1-Jan-2024
    • (2024) Frontiers in Operations. Manufacturing & Service Operations Management, 26:4 (1286-1305). DOI: 10.1287/msom.2022.0641. Online publication date: 1-Jul-2024
    • (2024) Product Development in Crowdfunding. Manufacturing & Service Operations Management, 26:2 (701-721). DOI: 10.1287/msom.2022.0344. Online publication date: 1-Mar-2024
    • (2024) Can AI Help in Ideation? A Theory-Based Model for Idea Screening in Crowdsourcing Contests. Marketing Science, 43:1 (54-72). DOI: 10.1287/mksc.2023.1434. Online publication date: 1-Jan-2024
    • (2024) Frontiers. Marketing Science, 43:4 (709-722). DOI: 10.1287/mksc.2023.0306. Online publication date: 1-Jul-2024
