
    Luisa Coheur

    The task of Statistical Machine Translation depends on large amounts of training corpora. Despite the availability of several parallel corpora, these are typically composed of declarative sentences, which may not be appropriate when the goal is to translate other types of sentences, e.g., interrogatives. There have been efforts to create corpora of questions, especially in the context of the evaluation of Question-Answering systems. One of these corpora is the UIUC dataset, composed of nearly 6,000 questions and widely used in the task of Question Classification. In this work, we make available the Portuguese version of the UIUC dataset, which we manually translated, as well as the translation guidelines. We show the impact of this corpus on the performance of a state-of-the-art SMT system when translating questions. Finally, we present a taxonomy of translation errors, according to which we analyze the output of the automatic translation before and after using the corpus as training data.
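    As a hedged illustration of how such an impact study can be scored, the sketch below compares a baseline system with one retrained on the question corpus, using corpus-level BLEU via sacrebleu. The file names are hypothetical, and this is not the paper's actual evaluation code.

        # Hypothetical evaluation sketch: corpus-level BLEU before and after
        # adding the translated question corpus to the SMT training data.
        import sacrebleu

        def load_lines(path):
            with open(path, encoding="utf-8") as f:
                return [line.strip() for line in f]

        references = load_lines("uiuc_test.pt")        # manual Portuguese translations
        baseline = load_lines("baseline_output.pt")    # trained on declaratives only
        retrained = load_lines("retrained_output.pt")  # trained with the question corpus

        for name, hyps in [("baseline", baseline), ("with question corpus", retrained)]:
            bleu = sacrebleu.corpus_bleu(hyps, [references])
            print(f"{name}: BLEU = {bleu.score:.2f}")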
    As a linguistic phenomenon, collocations have been the subject of numerous studies, both in theoretical and descriptive linguistics and, more recently, in Natural Language Processing. In the area of Machine Translation there are still improvements to be made, as major translation engines do not handle collocations appropriately and end up producing unsatisfactory literal translations. Taking as a starting point our previous work on machine translation error analysis (Costa et al., 2015), in this article we present a corpus annotated with collocation errors and their classification. We believe that, in order to clearly understand the difficulties that collocations pose to Machine Translation engines, a detailed linguistic analysis of their errors is necessary.
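    To make the annotation task concrete, here is a hypothetical record structure for one annotated collocation error; the paper's actual annotation scheme may differ.

        # Hypothetical schema for a collocation-error annotation record.
        from dataclasses import dataclass

        @dataclass
        class CollocationError:
            source: str      # source-language collocation
            mt_output: str   # literal machine translation produced
            reference: str   # acceptable target-language rendering
            error_type: str  # error class from the classification

        example = CollocationError(
            source="prestar atenção",
            mt_output="lend attention",   # literal, unsatisfactory
            reference="pay attention",
            error_type="literal translation of the collocate",
        )
        print(example)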
    We focus on the task of linking topically related segments in a collection of documents. In this scope, an existing corpus of learning materials was annotated with links between its segments. Using this corpus, we evaluate clustering, topic models, and graph-community detection algorithms in an unsupervised approach to the linking task. We propose several schemes to weight the word co-occurrence graph in order to discover word communities, as well as a method for assigning segments to the discovered communities. Our experimental results indicate that the graph-community approach might be more suitable for this task.
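    The following minimal sketch illustrates the graph-community approach, assuming networkx, a simple co-occurrence count as the edge weight, and word-overlap assignment; the paper evaluates several weighting schemes and a more careful assignment method.

        # Build a word co-occurrence graph, discover word communities, and
        # assign each segment to the community with most shared words.
        from itertools import combinations
        from collections import Counter
        import networkx as nx
        from networkx.algorithms.community import greedy_modularity_communities

        segments = [
            ["sorting", "algorithm", "complexity"],
            ["graph", "traversal", "algorithm"],
            ["complexity", "analysis", "sorting"],
        ]  # toy tokenized segments

        cooc = Counter()
        for seg in segments:
            for u, v in combinations(sorted(set(seg)), 2):
                cooc[(u, v)] += 1  # within-segment co-occurrence counts

        G = nx.Graph()
        for (u, v), w in cooc.items():
            G.add_edge(u, v, weight=w)

        communities = list(greedy_modularity_communities(G, weight="weight"))
        for i, seg in enumerate(segments):
            best = max(range(len(communities)),
                       key=lambda c: len(communities[c] & set(seg)))
            print(f"segment {i} -> community {best}")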
    Several cases of autistic children successfully interacting with virtual assistants such as Siri or Cortana have been reported recently. In this demo we describe ChatWoz, an application that can be used as a Wizard of Oz to collect real data for dialogue systems, but also to allow children to interact with their caregivers through it, as it is based on a virtual agent. ChatWoz is composed of an interface controlled by the caregiver, which establishes what the agent will utter in a synthesised voice. Several elements of the interface can be controlled, such as the emotions shown on the agent's face. In this paper we focus on the child-caregiver interaction scenario and detail the features implemented to support it.
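    A minimal sketch of the wizard-to-agent control flow is given below; the field names are assumptions for illustration, not ChatWoz's actual interface.

        # Hypothetical shape of a message the caregiver's interface sends
        # to the embodied agent for rendering.
        from dataclasses import dataclass

        @dataclass
        class WizardMessage:
            utterance: str  # text the agent speaks via synthesised voice
            emotion: str    # facial emotion shown by the virtual agent

        msg = WizardMessage(utterance="Hello! How are you today?", emotion="happy")
        print(msg)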
    This paper describes a system to identify entailment and quantify semantic similarity between pairs of Portuguese sentences. The system relies on a corpus to build a supervised model, and employs the same features regardless of the task. Our experiments cover two types of features, contextualized embeddings and lexical features, which we evaluate separately and in combination. The model is derived from a voting strategy on an ensemble of distinct regressors, for similarity measurement, or calibrated classifiers, for entailment detection. Applying such a system to other languages mainly depends on the availability of corpora, since all features are either multilingual or language independent. We obtain competitive results on a recent Portuguese corpus, where our best result is obtained by joining embeddings with lexical features.
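    A minimal sketch of the two model flavours, assuming scikit-learn and a precomputed feature matrix (embeddings concatenated with lexical features), is shown below; the concrete regressors, classifiers and features used in the paper differ.

        # Similarity: voting over distinct regressors.
        # Entailment: a calibrated classifier.
        import numpy as np
        from sklearn.ensemble import VotingRegressor
        from sklearn.linear_model import Ridge, LogisticRegression
        from sklearn.svm import SVR
        from sklearn.calibration import CalibratedClassifierCV

        X = np.random.rand(100, 20)           # stand-in feature matrix
        y_sim = np.random.rand(100) * 5       # similarity scores on a 0-5 scale
        y_ent = np.random.randint(0, 2, 100)  # binary entailment labels

        similarity_model = VotingRegressor([("ridge", Ridge()), ("svr", SVR())])
        similarity_model.fit(X, y_sim)

        entailment_model = CalibratedClassifierCV(LogisticRegression(max_iter=1000))
        entailment_model.fit(X, y_ent)

        print(similarity_model.predict(X[:3]))
        print(entailment_model.predict_proba(X[:3])[:, 1])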
    Although current Question Generation systems can be used to automatically generate questions for students’ assessments, these need validation and, often, manual corrections. However, this information is never used to improve the performance of QG systems, where it can play an important role. In this work, we present a system, GEN, that learns from such (implicit) feedback in an online learning setting. Following an example-based approach, it takes as input a small set of sentence/question pairs and creates patterns, which are then applied to learning materials. Each generated question, after being corrected by the teacher, is used as a new seed in the next iteration, so more patterns are created each time. We also take advantage of the teacher’s corrections to score the patterns and thus rank the generated questions. We measure the teacher’s post-editing effort and show that GEN improves over time, reducing the average corrections needed per question from 70% to 30%.
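    The online loop can be pictured as follows; the scoring rule and names below are assumptions for illustration, not GEN's actual implementation.

        # Each teacher correction updates the score of the pattern that
        # produced the question; the corrected question becomes a new seed.
        from difflib import SequenceMatcher

        pattern_scores = {}  # pattern id -> running score
        seeds = [("The sun is a star.", "What is the sun?")]

        def update_score(pattern_id, generated, corrected):
            # Reward patterns whose questions need little post-editing.
            similarity = SequenceMatcher(None, generated, corrected).ratio()
            prev = pattern_scores.get(pattern_id, 0.0)
            pattern_scores[pattern_id] = 0.8 * prev + 0.2 * similarity

        generated = "What the sun is?"   # question produced by pattern "p1"
        corrected = "What is the sun?"   # teacher's post-edited version
        update_score("p1", generated, corrected)
        seeds.append(("The sun is a star.", corrected))  # new seed for next iteration
        print(pattern_scores)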
    We present a simple approach to create a “persona” conversational agent. First, we take advantage of a large collection of subtitles to train a generative model based on neural networks. Second, we handcraft a small corpus of interactions that specify our character (from now on, the “persona corpus”). Third, we enrich a retrieval-based engine with this corpus. Finally, we combine both into a single agent. A preliminary evaluation shows that the generative model can hardly implement a coherent “persona”, but can successfully complement the retrieval model.
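    A minimal sketch of the combination strategy: retrieval over the persona corpus with a confidence threshold, falling back to the generative model. The TF-IDF engine and threshold below are assumptions; the paper's actual components differ.

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        persona_corpus = {
            "what is your name": "I'm Filipe, nice to meet you!",
            "where do you live": "I live in Lisbon.",
        }
        questions = list(persona_corpus)
        vectorizer = TfidfVectorizer()
        index = vectorizer.fit_transform(questions)

        def generative_reply(user_input):
            return "<reply from the subtitle-trained generative model>"  # placeholder

        def answer(user_input, threshold=0.5):
            sims = cosine_similarity(vectorizer.transform([user_input]), index)[0]
            best = sims.argmax()
            if sims[best] >= threshold:          # persona question: use retrieval
                return persona_corpus[questions[best]]
            return generative_reply(user_input)  # otherwise fall back to generation

        print(answer("what is your name"))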
    We present JUST.ASK, a freely and publicly available Question Answering system. Its architecture is composed of the usual Question Processing, Passage Retrieval and Answer Extraction components. Several details on the information generated and manipulated by each of these components are also provided to the user when interacting with the demonstration. Since JUST.ASK also learns to answer new questions based on users’ feedback, the user is invited to identify the correct answers, which are then used to retrieve answers to future questions.
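    Schematically, the three-stage pipeline looks as follows; the function bodies are placeholders, not JUST.ASK's implementation.

        def process_question(question):
            # e.g. classify the expected answer type and extract keywords
            return {"type": "PERSON", "keywords": ["wrote", "Os Lusíadas"]}

        def retrieve_passages(analysis):
            # e.g. query a search engine with the extracted keywords
            return ["Os Lusíadas was written by Luís de Camões in 1572."]

        def extract_answers(analysis, passages):
            # e.g. select and rank candidates matching the expected answer type
            return ["Luís de Camões"]

        analysis = process_question("Who wrote Os Lusíadas?")
        passages = retrieve_passages(analysis)
        print(extract_answers(analysis, passages))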
    Portuguese Sign Language (LGP), like Portuguese itself, evolved naturally, acquiring grammatical characteristics distinct from those of Portuguese. Thus, developing a translator between the two does not consist merely of mapping each word to a sign (signed Portuguese), but of guaranteeing that the resulting signs satisfy the grammar of Portuguese Sign Language and that the translations are semantically correct. Previous work relies exclusively on manual translation rules and is very limited in the range of grammatical phenomena covered, producing little more than signed Portuguese. In this article we present the first translation system from Portuguese to Portuguese Sign Language, PE2LGP, which, in addition to manual rules, relies on translation rules built automatically from a reference corpus. Given a Portuguese sentence, the system returns a sequence of glosses with markers that identify expressions...
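    A toy illustration of why gloss translation is more than word-for-word mapping: LGP, for instance, tends to front temporal expressions and drop some function words. The lexicon and rule below are simplified assumptions, not PE2LGP's corpus-derived rules.

        lexicon = {"eu": "EU", "vou": "IR", "à": None, "escola": "ESCOLA", "amanhã": "AMANHÃ"}
        TIME_GLOSSES = {"AMANHÃ"}

        def translate(sentence):
            glosses = [g for w in sentence.lower().split() if (g := lexicon.get(w))]
            # Reordering rule: move temporal expressions to the front.
            time = [g for g in glosses if g in TIME_GLOSSES]
            rest = [g for g in glosses if g not in TIME_GLOSSES]
            return " ".join(time + rest)

        print(translate("Eu vou à escola amanhã"))  # -> "AMANHÃ EU IR ESCOLA"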
    Two sentences can be related in many different ways. Distinct tasks in natural language processing aim to identify different semantic relations between sentences. We developed several models for natural language inference and semantic textual similarity for the Portuguese language. We took advantage of pre-trained models (BERT); additionally, we studied the role of lexical features. We tested our models on several datasets (ASSIN, SICK-BR and ASSIN2), and the best results were usually achieved with ptBERT-Large, trained on a Brazilian corpus and fine-tuned on these datasets. Besides obtaining state-of-the-art results, this is, to the best of our knowledge, the most comprehensive study of natural language inference and semantic textual similarity for the Portuguese language.
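    A minimal inference sketch with Hugging Face Transformers is shown below; the model id refers to BERTimbau (a Portuguese BERT) and stands in for the paper's ptBERT-Large, and the classification head and label count are assumptions.

        import torch
        from transformers import AutoTokenizer, AutoModelForSequenceClassification

        model_name = "neuralmind/bert-large-portuguese-cased"  # assumed stand-in
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

        # Encode a premise/hypothesis pair as one sequence, as in standard NLI setups.
        batch = tokenizer("Um homem toca guitarra.", "Alguém toca um instrumento.",
                          return_tensors="pt")
        with torch.no_grad():
            logits = model(**batch).logits
        print(logits.softmax(dim=-1))  # class scores (head still untrained here)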
    Abstract. We present a syntax/semantics interface that was developed having in mind a set of problems identified in the Edite system, which was based on a traditional syntax/semantics interface. In our syntax/semantics interface, syntactic and semantic rules are independent, semantic rules are hierarchically organized, and partial analyses can be produced. Keywords. Syntax/semantics interface, partial results, hierarchically organized semantic rules.
    Abstract. This report addresses the problem of maintaining linguistic data collections adequate to the needs of different applications. We posit that when developing NLP applications, one has to manage not only the software development process, but also the linguistic data: handling them separately will reduce the complexity of the process as a whole, thereby increasing the overall quality. Data consistency is also improved since there is only one collection to manage. We present two illustrative experiments that benefitted ...
    There are several tools for the Portuguese language. However, due to different choices underlying these tools' behaviour (different preprocessing, different labels, etc.), it is difficult to get an idea of each one's comparative performance. In this work, we propose an evaluation of publicly available, free tools that perform Part-of-Speech Tagging and Named Entity Recognition for the Portuguese language. We evaluate twelve different models for the first task and eight for the second. All the resources used in this evaluation (mapping tables between labels, testing corpora, etc.) will be made available, allowing the results presented here to be replicated/fine-tuned. We also present a qualitative analysis of two dependency parsers. To the best of our knowledge, no recent work considering the recently available tools has been carried out for the Portuguese language.
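    Since each tool ships its own tagset, scoring requires mapping its labels onto a common one first; the sketch below shows the idea with a toy mapping, not the paper's published tables.

        MAPPING = {"SUBST": "N", "NOUN": "N", "V-FIN": "V", "VERB": "V"}  # tool tag -> common tag

        def accuracy(gold, predicted):
            gold = [MAPPING.get(tag, tag) for tag in gold]
            predicted = [MAPPING.get(tag, tag) for tag in predicted]
            return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

        gold_tags = ["N", "V", "N"]
        tool_tags = ["SUBST", "V-FIN", "NOUN"]  # another tool's native tagset
        print(f"accuracy after mapping: {accuracy(gold_tags, tool_tags):.2f}")  # 1.00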
    An algorithm for text analysis is presented: the leaves analysis algorithm. Our approach, inscribed in the 5P methodology, is characterised by several points. The declarative source of the algorithm is completely separated from the linguistic descriptions; here we set ourselves apart from approaches based on unification grammars, where the grammar itself has the double function of expressing the linguistic descriptions while serving as the algorithm's declarative source. We can choose the fineness of the analysis by extracting more or less information from the descriptions, according to the functionality we want to provide. We also emphasise the general principles governing language behaviour (diabolic transition).
