Abstract: Automatic recognition of named entities such as people, places, organizations, books, and movies across the entire web presents a number of challenges, both of scale and scope. Data for training general named entity recognizers is difficult to come by, and efficient machine learning methods are required once we have found hundreds of millions of labeled observations. We present an implemented system that addresses these issues, including a method for automatically generating training data, and a multi-class online ...
Journal of The American Society for Information Science and Technology, 2007
Most text analysis and retrieval work to date has focused on the topic of a text; that is, what it is about. However, a text also contains much useful information in its style, or how it is written. This includes information about its author, its purpose, feelings it is meant to evoke, and more. This article develops a new type of lexical feature for use in stylistic text classification, based on taxonomies of various semantic functions of certain choice words or phrases. We demonstrate the usefulness of such features for the stylistic text classification tasks of determining author identity and nationality, the gender of literary characters, a text's sentiment (positive/negative evaluation), and the rhetorical character of scientific journal articles. We further show how the use of functional features aids in gaining insight about stylistic differences among different kinds of texts.
Systemic features use linguistically-derived language models as a basis for text classification. The graph structure of these models allows for feature representations not available with traditional bag-of-words approaches. This paper explores the set of possible representations, and proposes feature selection methods that aim to produce the most compact and effective set of attributes for a given classification problem. We show that small sets of systemic features can outperform larger sets of word-based features in the task of identifying financial scam documents.
We present a new collection of training corpora for evaluation of language-independent named entity recognition systems. For the five languages included in this initial release, Basque, Dutch, English, Korean, and Spanish, we provide an analysis of the relative difficulty of the NER task for both the language in general, and as a supervised task using these corpora. We construct three strongly language-independent systems, each using only orthographic features, and compare their performance on both seen and unseen data. We achieve improved results through combining these classifiers, showing that ensemble approaches are suitable when dealing with language-independent problems.
1 Introduction: Identification of named entities is an increasingly important task with applications in many areas of human language technology, including information extraction and machine translation. There has been a move away from hand-coded systems toward machine ...
Papers by Casey Whitelaw