
The InPhO DataBlog

Topic Modeling Tutorial at JCDL 2015

Posted on May 14, 2015 by Jaimie Murdock No Comments

Join the HathiTrust Research Center (HTRC) and InPhO Project for a half-day tutorial on HathiTrust data access and topic modeling at JCDL 2015 in Knoxville, TN on Sunday, June 21, 2015, 9am-12pm!
Topic Exploration with the HTRC Data Capsule for Non-Consumptive Research 
Organizers: Jaimie Murdock, Jiaan Zeng and Robert McDonald
Abstract: In this half-day tutorial, we will show 1) how the HathiTrust Research Center (HTRC) Data Capsule can be used for non-consumptive research over a collection of texts and 2) how integrated tools for LDA topic modeling and visualization can be used to drive the formulation of new research questions. Participants will be given an account in the HTRC Data Capsule and taught how to use the workset manager to create a corpus, and then to use the VM’s secure mode to download texts and analyze their contents. [tutorial paper]

We draw your attention to the astonishingly low half-day tutorial fees:

Half-Day Tutorial/Workshop Early Registration (by May 22!)
ACM/IEEE/SIG/ASIS&T Members – $70
Non-ACM/IEEE/SIG/ASIS&T Members – $95
ACM/IEEE/SIG/ASIS&T Student – $20
Non-member Student – $40

Half-Day Tutorial/Workshop Late/Onsite Registration
ACM/IEEE/SIG/ASIS&T Members – $95
Non-ACM/IEEE/SIG/ASIS&T Members – $120
ACM/IEEE/SIG/ASIS&T Student – $40
Non-member Student – $60

Hope to see you there!

The InPhO Topic Explorer

Posted on August 10, 2014 by Jaimie Murdock No Comments

This summer, the InPhO Project launched the topic explorer. This visualization shows the similarity of articles in the Stanford Encyclopedia of Philosophy, as determined by LDA topic models.

InPhO Topic Explorer for the SEP entry on Animal Consciousness. Click to go to the interactive visualization.

The color bands within each article’s row show the topic distribution within that article, and the relative sizes of the bands indicate the weight of each topic in the article. The full width of each row indicates the similarity to the focus article. Each topic’s label and color are arbitrarily assigned, but are consistent across articles in the browser.

Display options include topic normalization, alphabetical sort, and topic sort. When topics are normalized, each bar expands to the full width of the row, so topic weights can be compared across documents. Clicking a topic reorders the documents according to that topic’s weight and reorders the topic bars according to the topic weights in the highest-weighted document.
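
For readers curious about the mechanics, here is a minimal sketch of what normalization and topic sort do to a document-topic matrix. The matrix and variable names are illustrative assumptions, not the explorer’s actual source code:

```python
import numpy as np

# Hypothetical document-topic matrix: one row per document, one
# column per topic (weights scaled by similarity to the focus
# article, so rows need not sum to 1).
doc_topics = np.array([[0.60, 0.25, 0.15],
                       [0.10, 0.15, 0.35],
                       [0.05, 0.60, 0.10]])

# "Normalize topics": rescale each row to full width so that
# topic proportions can be compared across documents.
normalized = doc_topics / doc_topics.sum(axis=1, keepdims=True)

# "Topic sort": clicking topic k reorders the documents by that
# topic's weight, highest first.
k = 1
order = np.argsort(-normalized[:, k])
print(normalized[order])
```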

By varying the number of topics, one can obtain a finer- or coarser-grained analysis of the areas discussed in the articles. The visualization currently offers models with 20, 40, 60, 80, 100, and 120 topics for the Stanford Encyclopedia of Philosophy.

In early explorations, the visualization already highlights some interesting phenomena:

  • For central articles, such as kant (40 topics), one finds that a single topic (topic 30) comprises much of the article. When the number of topics is increased, as in kant (120 topics), topic 77 captures the “kant”-ness of the article, but several other components can now be explored. This shows the value of having multiple topic models.
  • For creationism (120 topics), one can see that the particular blend of topics generating that article is a true outlier: the probability of generating the next-closest document is only just over 0.5. Compare this to the distribution of top articles related to animal-consciousness (120 topics) or kant (120 topics). Can you find other outliers in the SEP?

The underlying dataset was generated using the InPhO VSM module’s LDA implementation. See Wikipedia: Latent Dirichlet Allocation for more on the LDA topic modeling approach or “Probabilistic Topic Models” (Blei, 2012) for a recent review.
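
For readers who want to experiment without the InPhO toolkit, here is a minimal sketch of the same general approach, training LDA models at several topic counts, using gensim rather than the VSM module; the toy corpus and parameter choices are assumptions for illustration, not the explorer’s actual configuration:

```python
from gensim import corpora, models

# Toy corpus standing in for the tokenized SEP articles.
docs = [["kant", "critique", "reason", "judgment"],
        ["animal", "consciousness", "mind", "perception"],
        ["creationism", "evolution", "design", "science"]]

dictionary = corpora.Dictionary(docs)
bow = [dictionary.doc2bow(doc) for doc in docs]

# One model per topic count, mirroring the explorer's multiple
# models (20-120 topics there; tiny values here for the toy data).
ldas = {k: models.LdaModel(bow, num_topics=k, id2word=dictionary,
                           passes=10, random_state=42)
        for k in (2, 3)}

# Each document is a distribution over topics.
print(ldas[2].get_document_topics(bow[0]))
```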

Source code and issue tracking are available at GitHub.

Please share any notes in the comments below!

Network visualization of LDA models through topic similarity

Posted on December 15, 2013 by Doori Lee No Comments

In machine learning, a topic model is a type of statistical model for discovering the “topics” that occur in a corpus of documents. Latent Dirichlet Allocation (LDA) is one of the most commonly used topic models; it represents each document in the corpus as a mixture of topics, and here we visualize its topics as a network.

I have been using the LDA model to see how specific philosophical topics relate to each other in a selection of 1315 volumes in the HathiTrust library. LDA assumes that a given textual corpus contains K topics and that each document in the corpus is a mixture of topics. A “topic” is defined as a probability distribution over words and is often represented as a list of its most probable words. The number of topics is selected by the user when the model is trained, so the LDA model can be trained over the same textual corpus with different numbers of topics.
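
As a concrete illustration of “a list of its most probable words,” the snippet below trains a tiny model and prints each topic’s top words. It uses gensim rather than the toolkit used in this post, and the corpus is a made-up stand-in:

```python
from gensim import corpora, models

# Made-up documents standing in for the 1315 HathiTrust volumes.
docs = [["church", "god", "faith", "scripture", "people"],
        ["social", "individual", "life", "society", "state"],
        ["church", "god", "society", "life", "people"]]

dictionary = corpora.Dictionary(docs)
bow = [dictionary.doc2bow(doc) for doc in docs]

# K is chosen by the user; the same corpus can be re-trained
# with a different K to get broader or narrower topics.
lda = models.LdaModel(bow, num_topics=2, id2word=dictionary,
                      random_state=0)

# Represent each topic by its 5 most probable words.
for t in range(lda.num_topics):
    print(t, [word for word, prob in lda.show_topic(t, topn=5)])
```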

The number of topics is important in topic modeling, as it determines the granularity of the topics. Previous research suggests there is a natural number of topics for a given corpus (“On Finding the Natural Number of Topics with Latent Dirichlet Allocation: Some Observations”, R. Arun et al., 2010). However, depending on the task, a smaller or larger number of topics, that is, broader or more specific topics, may be suitable.

By visualizing topic networks, we investigate the connections between LDA models trained over the same corpus with different numbers of topics.

In this experiment, we compare LDA models trained over the same corpus with different numbers of topics to investigate the relationships between models. We train models with K=20, 40, and 160 topics and find similar topics between models using similarity functions from the Indiana Philosophy Ontology (InPhO) project’s vector space model toolkit. For example, for every topic in the 20- and 40-topic models we find the most similar topics in the 160-topic model. Each pair of models (e.g., the 20- and 160-topic models) is then combined into a graph using Gephi. The graphs below show the networks of topics, with clusters color-coded by modularity.
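
A rough sketch of the pairing step follows. The actual experiment used the InPhO toolkit’s similarity functions; plain cosine similarity over random stand-in topic-word matrices is assumed here for illustration:

```python
import numpy as np
import networkx as nx

rng = np.random.default_rng(0)
# Stand-in topic-word matrices over a shared 500-word vocabulary:
# each row is one topic's probability distribution over words.
topics_20 = rng.dirichlet(np.ones(500), size=20)     # K=20 model
topics_160 = rng.dirichlet(np.ones(500), size=160)   # K=160 model

def unit_rows(m):
    """Scale each row to unit length for cosine similarity."""
    return m / np.linalg.norm(m, axis=1, keepdims=True)

# Cosine similarity between every K=20 topic and every K=160 topic.
sims = unit_rows(topics_20) @ unit_rows(topics_160).T

# Link each K=20 topic (labeled T#, as in the graphs) to its 8 most
# similar K=160 topics, then export for layout and modularity-based
# clustering in Gephi.
G = nx.Graph()
for i, row in enumerate(sims):
    for j in np.argsort(row)[-8:]:
        G.add_edge(f"T{i}", str(j), weight=float(row[j]))
nx.write_gexf(G, "topic_network.gexf")
```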

In the graphs, topics from the K=20 and K=40 models appear as T# and topics from the K=160 model as plain numbers. Modules (clusters) are distinguished by color; a module groups similar topics, as measured by the density of connections inside the module relative to the rest of the graph.

In Graph 1, each topic in the K=20 LDA model is mapped to its 8 most similar topics in the K=160 model; the 20 topics are grouped into 9 clusters. In Graph 2, each topic in the K=40 LDA model is mapped to its 4 most similar topics in the K=160 model, and the 40 topics are grouped into 15 clusters.

The tables below show a sample of topic clusters from each network graph. The first two rows are topics from the K=20 and K=40 models (labeled T#), and the following rows are from the K=160 model. In these tables, a topic is labeled with an arbitrary identifying number and represented by the 5 words that occur most frequently in it. The topics in bold blue are those from the K=160 model that are similar to all topics in the cluster.

In Table 1, the related topics concern church, gods, and people. In Table 2, Topics 17 and 35 from the K=40 model share 3 common topics from the K=160 model, relating to ‘social’, ‘individual’, and ‘life’.

Through visualizing topic networks, we observe that topics from models trained with various numbers of topics can be grouped into clusters by modularity or semantic similarity. Further research could compare clustering algorithms against the LDA models themselves; for example, comparing the semantic closeness of N topic clusters from an LDA model with K > N topics to the topics of an LDA model with K = N topics could help us obtain high-quality topics.

InPhO and Open Access

Posted on March 11, 2013 by Jaimie Murdock 5 Comments

Recent blog posts by Lisa Spiro at Digital Scholarship in the Humanities and by Stefan Heßbrüggen-Walter at Early Modern Thought Online have raised several interesting questions about the position of philosophy in general and the InPhO project in particular with respect to Digital Humanities.

However, we need to address some serious misrepresentations in the latter, especially the claims that we are committed to a “closed development model” and that our “reuse of the Stanford Encyclopedia of Philosophy is not based on liberal licensing, but apparently on special arrangements.” It is also incorrect of Heßbrüggen-Walter to say that “Web scraping and data mining outside the project is prevented by Copyright. So it may not come as a surprise that the ontology developed by InPho is not licensed for reuse (though you can use an API to search it programmatically).”

More….

The Shape of Philosophy (pt. 2)

Posted on December 11, 2012 by junotk 2 Comments

As I suggested in the previous post, one interesting use of Isomap is its iterative application — as we focus on a particular region of the map, we obtain finer details and new coordinates emerge.

In the overall map reproduced at right, let us first focus on the area in the red rectangle. This area roughly contains analytic philosophers in the 20th century, and is replotted below (a zoomable version of the image can be reached by clicking on it):

More….

The Shape of Philosophy (pt.1)

Posted on December 3, 2012 by junotk 9 Comments

In my previous posts (pt. 1, 2) I gave some graphical representations of Beagle models of the Stanford Encyclopedia of Philosophy (SEP) and the Internet Encyclopedia of Philosophy (IEP). The shape of those two graphs was basically determined by the ForceAtlas algorithm in Gephi. Although they look cool, the coordinates produced by the algorithm have no intrinsic meaning; we thus cannot interpret the relative locations of philosophers or the x/y axes. More….

French and English Philosophy (Part 3)

Posted on November 18, 2012 by bkievitk 2 Comments

In the previous two posts, we looked at different visualizations of the fr.wiki (100 articles about philosophers), en.wiki (100 articles about the same philosophers), Stanford Encyclopedia of Philosophy (SEP), and Internet Encyclopedia of Philosophy (IEP) corpora.

Visualization is a powerful tool for understanding large data sets and helps to direct continued studies, but it is also important to validate our intuitive understanding of the visualizations with quantitative data.

More….