We describe an annotation scheme for syntactic information in the CHILDES database (MacWhinney, 2000), which contains several megabytes of transcribed dialogs between parents and children. The annotation scheme is based on grammatical relations (GRs) that are composed of bilexical dependencies (between a head and a dependent) labeled with the name of the relation involving the two words (such as subject, object and adjunct). We also discuss automatic annotation using our syntactic annotation scheme.
This paper discusses the process of parsing adult utterances directed to a child, in an effort to produce a syntactically annotated corpus of the verbal input to a human language learner. In parsing the Eve corpus of the CHILDES database, we encountered several challenges relating to parser coverage and ambiguity, for which we describe solutions that result in a system capable of analyzing almost 80% of the adult utterances in the corpus correctly. We describe characteristics of the language in the corpus that make ...
We describe the design, evolution, and development of the user interface components of the NESPOLE! speech-to-speech translation system. The NESPOLE! system was designed for users with medium-to-low levels of computer literacy and Web expertise. The user interface was designed to effectively combine Web browsing, real-time sharing of graphical information and multi-modal annotations using a shared whiteboard, and real-time multilingual speech communication, all within an e-commerce scenario. Data collected in ...
Performance and usability of real-world speech-to-speech translation systems, like the one developed within the Nespole! project, are affected by several aspects that go beyond the pure translation quality provided by the Human Language Technology components of the system. In this paper we describe these aspects as viewpoints along which we have evaluated the Nespole! system. Four main issues are investigated: (1) assessing system performance under various network traffic conditions; (2) a study on the ...
ABSTRACT We introduce BLANC, a family of dynamic, trainable evaluation metrics for machine translation. Flexible, parametrized models can be learned from past data and automatically optimized to correlate well with human judgments for different criteria (e.g. adequacy, fluency) using different correlation measures. Towards this end, we discuss ACS (all common skip-ngrams), a practical algorithm with trainable parameters that estimates reference-candidate translation overlap by computing a weighted sum of all common skip-ngrams in polynomial time. We show that the BLEU and ROUGE metric families are special cases of BLANC, and we compare correlations with human judgments across these three metric families. We analyze the algorithmic complexity of ACS and argue that it is more powerful in modeling both local meaning and sentence-level structure, while offering the same practicality as the established algorithms it generalizes.
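As a rough illustration of how common skip-ngrams can be counted in polynomial time, the sketch below uses a standard dynamic program over two token sequences. It is not the ACS algorithm from the paper: it is unweighted (every match counts as 1), and the function name `count_common_skip_ngrams` is a hypothetical name chosen for this example. The trainable weights that the abstract mentions are not modeled here.

```python
def count_common_skip_ngrams(reference, candidate):
    """Count matched pairs of non-empty, order-preserving token subsequences.

    Unweighted illustration only (an assumption of this sketch); the
    abstract describes a weighted sum with trainable parameters.
    Runs in O(len(reference) * len(candidate)) time.
    """
    n, m = len(reference), len(candidate)
    # counts[i][j] = number of matched subsequence pairs drawn from
    # reference[:i] and candidate[:j].
    counts = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Pairs that do not use both the i-th and j-th tokens
            # (inclusion-exclusion over the two shorter prefixes).
            total = counts[i - 1][j] + counts[i][j - 1] - counts[i - 1][j - 1]
            if reference[i - 1] == candidate[j - 1]:
                # Pairs ending in this match: the match alone, or the match
                # appended to any pair from the shorter prefixes.
                total += counts[i - 1][j - 1] + 1
            counts[i][j] = total
    return counts[n][m]


ref = "the cat sat".split()
cand = "the cat sat".split()
print(count_common_skip_ngrams(ref, cand))
```

For two identical three-token sequences with distinct tokens, every non-empty subset of positions matches itself, so the count is 2^3 - 1 = 7.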
Producing machine translation (MT) for the many minority languages in the world is a serious challenge. Minority languages typically have few resources for building MT systems. For many minor languages there is little machine readable text, few knowledgeable linguists, and little money available for MT development. For these reasons, our research programs on minority language MT have focused on leveraging to the maximum extent two resources that are available for minority languages: linguistic structure and bilingual ...
This Book Chapter is brought to you for free and open access by the School of Computer Science at Research Showcase. It has been accepted for inclusion in Computer Science Department by an authorized administrator of Research Showcase. For more information, please contact researchshowcase@andrew.cmu.edu.
In this paper we address the issue of efficiently and effectively handling the problem of extragrammaticality in a large-scale spontaneous spoken language system. We propose and argue in favor of ROSE, a domain independent parse-and-repair approach to the problem of interpreting extragrammaticalities in spontaneous language input. We argue that in order for an approach to robust interpretation to be practical, it must be domain independent, efficient, and effective.
Ambiguity packing is a well known technique for enhancing the efficiency of context-free parsers. However, in the case of unification-augmented context-free parsers where parsing is interleaved with feature unification, the propagation of feature structures imposes difficulties on the ability of the parser to effectively perform ambiguity packing. We demonstrate that a clever heuristic for prioritizing the execution order of grammar rules and parsing actions can achieve a high level of ambiguity packing that is provably optimal.
Over the past six years the AVENUE project at the Language Technologies Institute at Carnegie Mellon University has worked with native informants and the governments of Chile and Peru to produce a variety of language tools for two indigenous South American languages: Mapudungun, spoken by fewer than 1 million people in Chile and Argentina, and Quechua, spoken by approximately 10 million people in Peru, Bolivia, and northern Chile. Electronic resources for both Quechua and Mapudungun are scarce.
Abstract Producing machine translation (MT) for the many minority languages in the world is a serious challenge. Minority languages typically have few resources for building MT systems. For many minor languages there is little machine readable text, few knowledgeable linguists, and little money available for MT development.
1.1 Problem Statement In the area of parsing spontaneous speech, most work so far has primarily focused on dealing with texts within a narrow, well-defined domain. The main reasons behind this restriction have been to avoid having to maintain very large and complex grammars on the one hand, and large semantic knowledge sources on the other hand.
Abstract We describe the design, evolution, and development of the user interface components of the NESPOLE! speech-to-speech translation system. The NESPOLE! system was designed for users with medium-to-low levels of computer literacy and Web expertise. The user interface was designed to effectively combine Web browsing, real-time sharing of graphical information and multi-modal annotations using a shared whiteboard, and real-time multilingual speech communication, all within an e-commerce scenario.
Abstract We investigate an aspect of the relationship between parsing and corpus-based methods in NLP that has received relatively little attention: coverage augmentation in rule-based parsers.
Abstract This paper describes a semi-automatic paraphrasing task for English-Arabic machine translation conducted using Amazon Mechanical Turk. The method for automatically extracting paraphrases is described, as are several human judgment tasks completed by Turkers. An ideal task type, revised specifically to address feedback from Turkers, is shown to be sophisticated enough to identify and filter problem Turkers while remaining simple enough for non-experts to complete.
Abstract Morphologically rich languages pose a challenge for statistical machine translation (SMT). This challenge is magnified when translating into a morphologically rich language. In this work we address this challenge in the framework of a broad-coverage English-to-Arabic phrase-based statistical machine translation (PBSMT) system. We explore the largest-to-date set of Arabic segmentation schemes ranging from full word form to fully segmented forms and examine the effects on system performance.
Abstract Many contemporary language technology systems are characterized by long pipelines of tools with complex dependencies. Too often, these workflows are implemented by ad hoc scripts; or, worse, tools are run manually, making experiments difficult to reproduce. These practices are difficult to maintain in the face of rapidly evolving workflows while they also fail to expose and record important details about intermediate data.
- Simple (naïve) modeling of the language translation problem
- Cannot model and generate the correct translation for many linguistic phenomena across languages, both common and rare
- Doesn't generalize well: models are purely lexical
- Performance varies widely across language pairs and domains
- These issues are particularly severe for languages with rich morphology and languages with highly-divergent syntax and semantics
Abstract With three adaptations we significantly improve the performance of ParaMor, the unsupervised morphology induction algorithm we first proposed in Monson et al. (2007). Our extensions boost ParaMor's performance in all language tracks, in both the linguistic evaluation and the task-based information retrieval (IR) evaluation of the peer-operated competition Morpho Challenge 2007 (Kurimo et al., 2008a; Kurimo et al., 2008b).
Abstract A key concern in building syntax-based machine translation systems is how to improve coverage by incorporating more traditional phrase-based SMT phrase pairs that do not correspond to syntactic constituents. At the same time, it is desirable to include as much syntactic information in the system as possible in order to carry out linguistically motivated reordering, for example.
Abstract We present recent advances from our efforts in increasing coverage, robustness, generality and speed of JANUS, CMU's speech-to-speech translation system. JANUS is a speaker-independent system translating spoken utterances in English and also in German into one of German, English or Japanese. The system has been designed around the task of conference registration (CR). It has initially been built based on a speech database of 12 read dialogs, encompassing a vocabulary of around 500 words.
Abstract To evaluate theoretical proposals regarding the course of child language acquisition, researchers often need to rely on the processing of large numbers of syntactically parsed utterances, both from children and from their parents. Because it is so difficult to do this by hand, there are currently no parsed corpora of child language input data. To automate this process, we developed a system that combined the MOR tagger, a rule-based parser, and statistical disambiguation techniques.
                                      Baseline   Lower
  log p5(is)                        = −2.63      −2.30 = log p1
  log p5(one | is)                  = −2.03      −1.92 = log p2
  log p5(of | is one)               = −0.24      −0.08 = log p3
  log p5(the | is one of)           = −0.47      −0.21 = log p4
+ log p5(few | is one of the)       = −1.26      −1.26 = log p5
= log p5(is one of the few)         = −6.62      −5.77 = log pLow
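The arithmetic in the figures above can be checked with a short sketch. The per-word values are copied from the entry; the variable names and the `total` helper are hypothetical. Summing the rounded baseline entries gives −6.63 rather than the listed −6.62, which is presumably because each entry was rounded separately (an assumption, since the source does not say).

```python
# Per-word log probabilities for "is one of the few", copied from the entry above.
baseline = [-2.63, -2.03, -0.24, -0.47, -1.26]  # log p5 with the available context
lower = [-2.30, -1.92, -0.08, -0.21, -1.26]     # log p1 ... log p5


def total(log_probs):
    """Sum log probabilities, rounded to two decimals to match the entry's precision."""
    return round(sum(log_probs), 2)


print("baseline total:", total(baseline))
print("lower total:", total(lower))
```

Because the values are log probabilities, adding them corresponds to multiplying the underlying conditional probabilities for the whole phrase.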
Abstract The NESPOLE! System is a speech communication system designed to support multilingual interaction between common users and providers of e-commerce services over the Internet. The core of the system is a distributed interlingua-based speech-to-speech translation system, which is supported by multimodal capabilities that allow the two parties participating in the communication to share Web pages and graphical content which can be annotated using gestures.
Janus is a multi-lingual speech translation system currently operating in the domain of meeting scheduling. Translating spontaneous speech requires a high degree of robustness to overcome the disfluencies of spoken language as well as errors in speech recognition. In this system description, we focus on the robust speech translation components in Janus—the skipping GLR* parser, the segmentation of full utterances into semantic dialogue units (SDUs), and the late-stage disambiguation of utterances.
Abstract We propose a novel language-independent framework for inducing a collection of morphological inflection classes from a monolingual corpus of full form words. Our approach involves two main stages. In the first stage, we generate a large data structure of candidate inflection classes and their interrelationships. In the second stage, search and filtering techniques are applied to this data structure, to identify a select collection of "true" inflection classes of the language.
Abstract. We summarize the strong performance of ParaMor, an unsupervised morphology induction system, at Morpho Challenge 2008. When ParaMor's morphological analyses, which specialize in identifying inflectional morphology, are added to the analyses from the general-purpose unsupervised morphology induction system, Morfessor, the combined system identifies the morphemes of all five Morpho Challenge languages at recall scores higher than those of any other system which competed in the Challenge.
Abstract The classification of speech genre is not yet an established task in language technologies. However, we believe that it is a task that will become fairly important as large amounts of audio (and video) data become widely available. The technological capability to easily transmit and store all human interactions in audio and video could have a radical impact on our social structure. The major open question is how this information can be used in practical and beneficial ways.
Abstract We investigate the possibility of translating continuous spoken conversations in a cross-talk environment. This is a task known to be difficult for human translators due to several factors. It is characterized by rapid and even overlapping turn taking, a high degree of coarticulation, and fragmentary language. We describe experiments using both push-to-talk and cross-talk recording conditions. Our results indicate that conversational speech recognition and translation is possible, even in a free cross-talk environment.
Abstract Syntax-based approaches to statistical MT require syntax-aware methods for acquiring their underlying translation models from parallel data. This acquisition process can be driven by syntactic trees for either the source or target language, or by trees on both sides. Work to date has demonstrated that using trees for both sides suffers from severe coverage problems.
Abstract Attempts at discourse processing of spontaneously spoken dialogue face several difficulties: multiple hypotheses that result from the parser's attempts to make sense of the output from the speech recognizer, ambiguity that results from segmentation of multi-sentence utterances, and cumulative error (errors in the discourse context which cause further errors when subsequent sentences are processed).
Abstract The paper is concerned with the analysis of automatic transcription of spoken input into an interlingua formalism for a speech-to-speech machine translation system. This process is based on two sub-tasks: (1) the recognition of the domain action (a speech act and a sequence of concepts); (2) the extraction of arguments consisting of feature-value information. Statistical models are used for the former, while a knowledge-based approach is employed for the latter.