Dan Cristea


    This paper describes the ongoing work carried out within the CoBiLiRo (Bimodal Corpus for Romanian Language) research project, part of ReTeRom (Resources and Technologies for Developing Human-Machine Interfaces in Romanian). Data annotation finds increasing use in speech recognition and synthesis, with the goal of supporting learning processes. In this context, a variety of annotation systems for Speech and Text Processing environments have been presented. Even though many designs for the data annotation workflow have emerged, the process of handling metadata in order to manage complex user-defined annotations is not sufficiently covered. We propose a format design intended to serve as an annotation standard for bimodal resources, one that facilitates searching, editing and statistical analysis operations over them. The design and implementation of an infrastructure that houses the resources are also presented. The goal is widening the dissemination of bimodal corpora for resea...
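    The format mentioned above is meant to let bimodal (audio plus transcript) resources be searched, edited and analysed statistically through their metadata. As a purely illustrative sketch, not the actual CoBiLiRo schema, the Python fragment below models one such metadata record; every field name in it is a hypothetical assumption.
```python
from dataclasses import dataclass, field, asdict
from typing import List

# Hypothetical record for one bimodal (audio + transcript) resource;
# the real CoBiLiRo metadata schema may differ in names and structure.
@dataclass
class BimodalRecord:
    recording_id: str
    audio_file: str            # e.g. a .wav path
    transcript_file: str       # e.g. an .xml or .txt path
    speaker: str = "unknown"
    duration_seconds: float = 0.0
    annotation_layers: List[str] = field(default_factory=list)

def find_by_layer(records, layer):
    """Simple metadata search: keep records carrying a given annotation layer."""
    return [r for r in records if layer in r.annotation_layers]

if __name__ == "__main__":
    recs = [
        BimodalRecord("r001", "r001.wav", "r001.xml", speaker="spk1",
                      duration_seconds=12.4, annotation_layers=["phoneme", "word"]),
        BimodalRecord("r002", "r002.wav", "r002.xml", annotation_layers=["word"]),
    ]
    print([r.recording_id for r in find_by_layer(recs, "phoneme")])
    print(asdict(recs[0]))
```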
    Themes: current work on annotation tools, interaction of human and automatic annotation tools. We present an annotation tool, called GLOSS, that has the following features: it accepts as input SGML source documents and/or their database images and produces as output SGML source documents as well as the associated database images; it allows several documents to be open simultaneously; it can collapse independent annotation views of the same original document, which also allows for a layer-by-layer annotation process in different annotation sessions and by different annotators, including automatic ones; it offers an attractive user interface; and it permits discourse structure annotation by offering a pair of building operations (adjoining and substitution) and remaking operations (undo, delete parent-child link and tree dismember). Finally, we present an example that shows how GLOSS is employed to validate, against a corpus, a theory of global discourse. A demo can be offered on a PC platform run...
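    The building operations named above, adjoining and substitution, together with the remaking operations, are what GLOSS offers for discourse-structure annotation. The fragment below is a minimal, simplified sketch of what adjoining on a right frontier and substituting into an open slot could look like; it is not the GLOSS implementation, and the classes and labels are assumptions.
```python
# Minimal, simplified sketch of tree-building operations of the kind GLOSS
# offers (not its actual code): "substitute" fills an open slot left for
# expected material, "adjoin" attaches a new unit under a right-frontier node.
class Node:
    def __init__(self, label, children=None, open_slot=False):
        self.label = label
        self.children = children or []
        self.open_slot = open_slot   # an expectation still to be satisfied

def right_frontier(root):
    """Nodes on the path from the root down to its rightmost leaf."""
    frontier = [root]
    while frontier[-1].children:
        frontier.append(frontier[-1].children[-1])
    return frontier

def substitute(root, new_node):
    """Fill the first open slot found on the right frontier."""
    for node in right_frontier(root):
        if node.open_slot:
            node.label, node.children, node.open_slot = new_node.label, new_node.children, False
            return True
    return False

def adjoin(root, target_label, new_node):
    """Attach new_node under the right-frontier node carrying target_label."""
    for node in right_frontier(root):
        if node.label == target_label:
            node.children.append(new_node)
            return True
    return False

if __name__ == "__main__":
    tree = Node("R", [Node("u1"), Node("expected", open_slot=True)])
    substitute(tree, Node("u2"))          # the expected unit arrives
    adjoin(tree, "R", Node("u3"))         # a new unit adjoined at the root
    print([c.label for c in tree.children])   # ['u1', 'u2', 'u3']
```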
    The paper argues in favour of an electronic form of the thesaurus dictionary of the Romanian language, the dictionary edited by the Romanian Academy in two editions since 1913. Preliminary steps like scanning, optical character recognition, and pre-processing operations have already been done. The paper presents a prototype for the correction of the digital form of the dictionary. The numerous advantages of the digital thesaurus dictionary are discussed, as a basis for future work in Romanian lexicography and, more generally, in language processing. Key words: resources.
    Editorial Board of the Volume: Eneko Agirre, Christian Boitet, Nicoletta Calzolari, John Carroll, Kenneth Church, Dan Cristea, Walter Daelemans, Barbara Di Eugenio, Claire Gardent, Alexander Gelbukh, Gregory Grefenstette, Eva Hajicova, Yasunari Harada, Eduard Hovy, Nancy Ide, Diana Inkpen, Aravind Joshi, Dimitar Kazakov, Alma Kharrat, Adam Kilgarriff, Alexander Koller, Sandra Kuebler, Hugo Liu, Aurelio Lopez Lopez, Diana McCarthy, Igor Mel'cuk, Rada Mihalcea, Masaki Murata, Nicolas Nicolov, Kemal Oflazer, Constantin Orasan, Manuel Palomar, Ted ...
    The present paper examines a variety of ways in which the Corpus of Contemporary Romanian Language (CoRoLa) can be used. A multitude of examples is intended to highlight the wide range of interrogation possibilities that CoRoLa opens for different types of users. The querying of CoRoLa displayed here is supported by the KorAP frontend, through the Poliqarp query language. Interrogations address annotation layers, such as the lexical, morphological and, in the near future, the syntactic layer, as well as the metadata. Other issues discussed are how to build a virtual corpus, how to deal with errors, how to find expressions and how to identify expressions.
    This paper presents the almost final results of a priority project of the Romanian Academy, the Corpus of Contemporary Romanian Language (CoRoLa). The Corpus includes data in both the written and the spoken forms of the language. The textual collection is made up of publications covering the period from the Second World War to the present day, while the spoken collection includes only recent recordings.
    In this article we present a method for the automatic extraction of syntactic patterns, which are used to develop a dependency parsing method. The patterns have been extracted from a corpus automatically annotated for tokens, sentence borders, parts of speech and noun phrases, and manually annotated for dependency relations between words. The evaluation shows promising results for a language with a free word order.
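    As an illustration of the kind of pattern extraction described above, the sketch below collects (head POS, dependent POS, direction, relation) tuples from a toy dependency-annotated sentence and ranks them by frequency; it is a hypothetical simplification, not the authors' actual procedure.
```python
from collections import Counter

# Hypothetical sketch: collect dependency patterns as
# (head POS, dependent POS, direction, relation) tuples from one toy
# annotated sentence, then rank them by frequency.
# Token format: (index, word, pos, head_index, relation); head 0 = root.
sentence = [
    (1, "copilul", "NOUN", 2, "nsubj"),
    (2, "citește", "VERB", 0, "root"),
    (3, "o",       "DET",  4, "det"),
    (4, "carte",   "NOUN", 2, "obj"),
]

def extract_patterns(tokens):
    by_index = {t[0]: t for t in tokens}
    for idx, _word, pos, head, rel in tokens:
        if head == 0:                      # skip the root
            continue
        head_pos = by_index[head][2]
        direction = "head-first" if head < idx else "dependent-first"
        yield (head_pos, pos, direction, rel)

counts = Counter(extract_patterns(sentence))
for pattern, n in counts.most_common():
    print(pattern, n)
```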
    The way in which discourse features express connections back to the previous discourse has been described in the literature in terms of adjoining at the right frontier of the discourse structure. But this does not allow for discourse features that express expectations about what is to come in the subsequent discourse. After characterizing these expectations and their distribution in text, we show how an approach that makes use of substitution as well as adjoining on a suitably defined right frontier can be used both to process expectations and to constrain discourse processing in general.
    The quality of discourse structure annotations is negatively influenced by the numerous difficulties that occur in the analysis process. In contrast, referential annotation resources are considerably more reliable, given the high precision of existing anaphora resolution systems. We present an approach based on Veins Theory (Cristea, Ide, Romary, 1998), in which successful reference annotations of texts are exploited in order to improve arbitrary structural analyses; in this way, the large amount of corpora annotated at the reference level can be used for the acquisition of discourse structure annotation resources.
    Preface: Big cultural heritage data present an unprecedented opportunity for the humanities that is reshaping conventional research methods. However, digital humanities have grown past the stage where the mere availability of digital data was enough as a demonstrator of possibilities. Knowledge resource modeling, development, enrichment and integration are crucial for associating relevant information in pools of digital material which are not only scattered across various archives, libraries and collections, but also often lack relevant metadata. Within this research framework, NLP approaches originally stemming from lexico-semantic information extraction and knowledge resource representation, modeling, development and reuse have a pivotal role to play. From the NLP perspective, applications of knowledge resources for the Socio-Economic Sciences and Humanities present numerous interesting research challenges that relate, among others, to the development of historical lexico-semantic...
    In this paper we present the methodology employed in the creation of an aligned speech-to-text Romanian Corpus. The corpus uses recordings from the AMPER-ROM and AMPRom projects as well as ad-hoc recordings of continuous speech. The protocol for speech recording and labelling, as well as the manual annotation procedure, are described. The corpus is intended to be used for training a speech segmentation module and an automatic speech-to-text aligner module.
    As is known, on the political scene the success of a speech can be measured by the degree to which the speaker is able to change attitudes, opinions, feelings and political beliefs in his audience. We suggest a range of analysis tools, all belonging to semiotics, from lexical-semantic to syntactic and rhetorical, which, integrated into the exploratory panoply of discursive weapons of a political speaker, could influence the impact of her/his speeches on a receptive audience. Our approach is based on the assumption that semiotics, in its quality of methodology and meta-language, can capitalize on a situational analysis of the political discourse. Such an analysis assumes establishing the communication situation, in our case the Parliament's vote in favour of suspending the Romanian President, through which we can describe an action of communication. We describe a platform, the Discourse Analysis Tool (DAT), which integrates a range of natural language processing tools with the ...
    The paper presents the architecture and behaviour of a system that integrates several ideas from artificial intelligence and natural language processing in order to build a semantic representation of discourse. It is shown how modules that can contribute different kinds of expertise (syntactic, semantic, common-sense inference, discourse planning, anaphora resolution, cue words and temporal) can be placed around a skeleton made up of a POS/morphological tagger and an incremental discourse parser. The performance of the system is affected by, but not vitally dependent on, any of the contributing expert modules.
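    To illustrate the idea of expert modules placed around a core skeleton, and of a system that degrades gracefully when an expert is missing, the sketch below wires stub "experts" around a stubbed pipeline; the module names and interfaces are invented for illustration and do not reproduce the system described in the paper.
```python
# Illustrative sketch (invented names, not the paper's system): a core
# skeleton produces a shared representation, and optional "expert"
# modules refine it; removing an expert degrades but does not break it.
class Expert:
    name = "generic"
    def contribute(self, representation):
        return representation

class AnaphoraExpert(Expert):
    name = "anaphora"
    def contribute(self, representation):
        representation.setdefault("links", []).append("toy coreference link")
        return representation

def run_discourse_pipeline(text, experts):
    # Skeleton: tagging and incremental parsing are only stubbed here.
    representation = {"tokens": text.split(), "tree": None}
    for expert in experts:
        representation = expert.contribute(representation)
    return representation

print(run_discourse_pipeline("Maria citește o carte .", [AnaphoraExpert()]))
print(run_discourse_pipeline("Maria citește o carte .", []))   # still works without experts
```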
    Evaluation campaigns have become an established way to evaluate automatic systems which tackle the same task. This paper presents the first edition of the Anaphora Resolution Exercise (ARE) and the lessons learnt from it. This first edition focused only on English pronominal anaphora and NP coreference, and was organised as an exploratory exercise where various issues were investigated. ARE proposed four different tasks: pronominal anaphora resolution and NP coreference resolution on a predefined set of entities, pronominal anaphora resolution and NP coreference resolution on raw texts. For each of these tasks different inputs and evaluation metrics were prepared. This paper presents the four tasks, their input data and evaluation metrics used. Even though a large number of researchers in the field expressed their interest to participate, only three institutions took part in the formal evaluation. The paper briefly presents their results, but does not try to interpret them because i...
    We compare the potential of two classes of linear and hierarchical models of discourse to determine co-reference links and resolve anaphors. The comparison uses a corpus of thirty texts, which were manually annotated for co-reference and discourse structure.
    In this paper, we outline a theory of referential accessibility called Veins Theory (VT). We show how VT addresses the problem of "left satellites", currently a problem for stack-based models, and show that VT can be used to significantly reduce the search space for antecedents. We also show that VT provides a better model for determining domains of referential accessibility, and discuss how VT can be used to address various issues of structural ambiguity.
    We describe an encoding scheme for discourse structure and reference, based on the TEI Guidelines and the recommendations of the Corpus Encoding Specification (CES). A central feature of the scheme is a CES-based data architecture enabling the encoding of and access to multiple views of a marked-up document. We describe a tool architecture that supports the encoding scheme, and then show how we have used the encoding scheme and the tools to perform a discourse analytic task in support of a model of global discourse cohesion called Veins Theory (Cristea, Ide and Romary, forthcoming).
    In this paper we describe the preliminary steps towards a recursive reconstruction of the origin of Romanian words, together with the positioning of their loans within a time frame, as reflected in the European Linguistic Thesauri. A pilot application accepts as input a Romanian word and accesses online linguistic resources, such as eDTLR – the Thesaurus Dictionary of the Romanian Language in electronic form – displaying etymological information. The etymology of a word is subsequently searched in foreign sources (for the time being only French and Italian online dictionaries), in order to compute its etymological trajectory. Import years, where available, are used to place the approximate time of the borrowings on the time axis. The research intends to outline a methodological framework on which a future full-scale investigation could be anchored.
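    The etymological trajectory described above is essentially a chain of dictionary lookups followed from language to language. The sketch below replays that idea over a toy in-memory table; the entries, years and function names are hypothetical, and the real pilot queries eDTLR and online French and Italian dictionaries instead.
```python
# Toy in-memory stand-in for the chained etymological lookups; the real
# pilot queries eDTLR and online French/Italian dictionaries instead.
# Entry format: (language, word) -> (source language, source word, import year or None).
TOY_ETYMOLOGY = {
    ("ro", "exemplu"): ("fr", "exemple", 1800),   # hypothetical data
    ("fr", "exemple"): ("la", "exemplum", None),
}

def etymological_trajectory(lang, word, max_depth=5):
    """Follow borrowing links until no further source is known."""
    trajectory = [(lang, word, None)]
    for _ in range(max_depth):
        entry = TOY_ETYMOLOGY.get((lang, word))
        if entry is None:
            break
        lang, word, year = entry
        trajectory.append((lang, word, year))
    return trajectory

for lang, word, year in etymological_trajectory("ro", "exemplu"):
    print(lang, word, year if year is not None else "-")
```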
    The paper proposes a legislative initiative for acquiring large-scale language resources. It advocates a broad awareness campaign that would allow the storage and preservation, for research purposes and in electronic form, of all textual documents that go to print in a country.
    The aim of this paper is to give an intuitive look at that part of Veins Theory – a theory of discourse structure and cohesion – that deals with the relationship between discourse structure and referentiality. It formalizes the central notion of vein, used to identify domains of referential accessibility for anaphors in the discourse. One application of the theory is to make corrections to the discourse structure based on firm referential links.
    The paper investigates difficult problems that can arise in anaphora resolution and proposes some solutions within the framework of a general anaphora resolution solver. Departing from the current research settings, which deal with anaphora resolution on contiguous corpora, our investigation uses instead a collection of carefully hand-chosen examples. The research is motivated by the belief that the interpretation of free language in modern applications, especially those related to the semantic web, requires more and more sophisticated tools.
    The paper presents a proposal for correlating human performance in discourse coherence with a linear model of immediate memory. We begin by estimating experimentally the discourse coherence produced by humans, using a measure based on Centering transitions. Then we introduce a parametrised model of immediate memory and propose a simple access-cost model, which mimics cognitive effort during discourse processing. We show that an agent equipped with the most economical model of immediate memory, and manifesting a greedy behaviour in choosing the focus at each step, produces discourses of a quality similar to those produced by humans.
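    The following sketch only illustrates the general idea of an immediate-memory buffer with a position-based access cost and a greedy choice of focus; the capacity, the cost function and the examples are assumptions, not the parametrised model evaluated in the paper.
```python
from collections import deque

# Schematic immediate-memory buffer: recently mentioned entities sit at
# the front; accessing an older entity costs more, so a greedy speaker
# prefers to keep the focus on recent entities. Capacity and cost are assumptions.
class ImmediateMemory:
    def __init__(self, capacity=3):
        self.buffer = deque(maxlen=capacity)

    def mention(self, entity):
        if entity in self.buffer:
            self.buffer.remove(entity)
        self.buffer.appendleft(entity)

    def access_cost(self, entity):
        try:
            return list(self.buffer).index(entity) + 1
        except ValueError:
            return self.buffer.maxlen + 1     # forgotten entity: maximal cost

    def cheapest_focus(self, candidates):
        return min(candidates, key=self.access_cost)

mem = ImmediateMemory(capacity=3)
for entity in ["Maria", "carte", "Ion"]:
    mem.mention(entity)
print(mem.cheapest_focus(["Maria", "Ion"]))   # 'Ion', the most recently mentioned
```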
    Facial expressions are very important in human communication. Creating them programmatically, using a muscle-based system, can greatly reduce the amount of time needed to produce an animation. The main advantage of the muscle approach is that the only thing that has to be done before starting to animate a new character is to tailor the muscle system to fit the new facial features. We describe a mass-spring system used for the physical simulation of the structure and dynamics of the facial muscles and the skin.
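    A mass-spring simulation of the kind mentioned above advances each node by computing a spring force (Hooke's law) plus damping and integrating it over a small time step. The fragment below shows a generic one-dimensional step of this type; the constants are illustrative and the paper's actual three-dimensional facial model is not reproduced here.
```python
# Generic 1-D mass-spring step: F = -k * (x - rest) - c * v, integrated
# with a small explicit step. Constants are illustrative; a facial model
# would use a 3-D network of many such springs over skin and muscle nodes.
def spring_step(x, v, rest_length, k=40.0, c=0.8, mass=1.0, dt=0.01):
    force = -k * (x - rest_length) - c * v
    acceleration = force / mass
    v_new = v + acceleration * dt
    x_new = x + v_new * dt
    return x_new, v_new

x, v = 1.2, 0.0                      # a stretched spring, initially at rest
for step in range(5):
    x, v = spring_step(x, v, rest_length=1.0)
    print(f"step {step}: x = {x:.4f}, v = {v:.4f}")
```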
    The paper proposes a scheme for the hierarchical representation of XML annotation standards. The representation allows individual work on documents whose markings fit a standard only partially, the mixing of annotated documents that do or do not observe the same standard, as well as concurrent annotation. The approach allows access to different annotations of a corpus with minimal representation overhead, which also facilitates the accommodation of different, even incompatible, annotations of the same data. Two methods to build a hierarchical representation of annotation standards are shown, one allowing explicit declarations and the other inferring the hierarchy from a set of consistently annotated documents. Merging and extraction operations, which produce derived documents from existing ones, are described. A system that implements the formal declarations of the hierarchy and the operations over it is presented.
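    To give a concrete feel for a hierarchy of annotation standards and an extraction operation over it, the sketch below orders a few invented "standards" by a parent relation and keeps, from a document, only the layers a chosen standard knows about; the tag names and structure are assumptions, not the proposed scheme.
```python
# Invented illustration of a hierarchy of annotation standards: each
# standard extends a parent with extra element names, and an extraction
# keeps, from a document, only the annotations a standard knows about.
STANDARDS = {
    "base":      {"parent": None,   "elements": {"seg"}},
    "morpho":    {"parent": "base", "elements": {"w", "msd"}},
    "discourse": {"parent": "base", "elements": {"edu", "rel"}},
}

def known_elements(standard):
    """All element names visible to a standard, inherited from its ancestors."""
    elements = set()
    while standard is not None:
        elements |= STANDARDS[standard]["elements"]
        standard = STANDARDS[standard]["parent"]
    return elements

def extract(document, standard):
    """Derive a document containing only the layers the standard understands."""
    keep = known_elements(standard)
    return [(tag, span) for tag, span in document if tag in keep]

doc = [("seg", (0, 40)), ("w", (0, 5)), ("edu", (0, 20)), ("rel", (0, 40))]
print(extract(doc, "morpho"))     # [('seg', (0, 40)), ('w', (0, 5))]
print(extract(doc, "discourse"))  # the discourse layers plus the shared 'seg'
```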
    We propose a research methodology intended to establish empirically, using quantitative methods, whether the modern Romanian language was influenced by the two historical union events (1859 and 1918). The changes in language that lend themselves to computational approaches concern the lexicon, morphology, grammatical structure and semantics. Of these manifestations of language use, in this preparatory study we concentrate only on the lexicon and, partly, on semantics. The study is restricted to the written language.
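    For the lexical part of such a study, one common quantitative step is to compare relative word frequencies between texts written before and after an event. The sketch below does this with a smoothed log-ratio over toy word lists; the data and the particular measure are illustrative assumptions, not the project's methodology.
```python
from collections import Counter
import math

# Toy comparison of relative word frequencies between two periods; the
# word lists stand in for period-stamped corpus samples.
before = ["plug", "carte", "sat", "carte", "biserică"]
after = ["carte", "tren", "fabrică", "ziar", "tren"]

freq_before, freq_after = Counter(before), Counter(after)
n_before, n_after = len(before), len(after)

def log_ratio(word):
    """Smoothed log2 ratio of relative frequencies (positive = more frequent after)."""
    p_before = (freq_before[word] + 0.5) / (n_before + 1)
    p_after = (freq_after[word] + 0.5) / (n_after + 1)
    return math.log2(p_after / p_before)

for word in sorted(set(before) | set(after), key=lambda w: -abs(log_ratio(w))):
    print(f"{word:10s} {log_ratio(word):+.2f}")
```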
    Written corpus, general, multilingual, parallel; domain: literature; 110,000 tokens; For EN-RO: discourse segments, co-references (partially), FDG (partially), RST (partially)
    COROLA stands for the computational COrpus of contemporary ROmanian LAnguage (referring to the period after the 2nd World War) and is designed to include both written and spoken forms of the language. To record and search the huge quantity of electronic resources that make up the Corpus, metadata for describing the linguistic content are needed. Metadata are also essential for organizing the way in which the corpus will be processed and are the primary support for different types of statistics. Only a small part of the metadata in our corpus is automatically generated, while the majority of elements should still be added manually. Our paper presents the process of filling in the metadata that describe the primary documents, simultaneously with cleaning the data, the platform supporting these operations in an interactive and collaborative manner, and the current stage of our project. COROLA is a project started in 2014, as a collaboration between two institutes of the Romanian Academy: the "Mihai Drăgănescu" Romanian Academy Center for Artificial Intelligence in Bucharest (RACAI) and the Institute of Computer Science, Romanian Academy, Iaşi Branch (ICS). By the end of the project, in 2017, the corpus will have a collection of 500 million words of written texts and 300 hours of speech records.
    Human head representations are usually based on the morphological and structural components of a real model. Over time it has become more and more necessary to build complete virtual models that comply very rigorously with the specifications of human anatomy. Still, making and using a model perfectly fitted to the real anatomy is a difficult task, because it requires large hardware resources and significant processing time. That is why it is necessary to choose the best compromise solution, one which keeps the right balance between the perfection of the details and the consumption of resources, in order to obtain facial animations with real-time rendering. We present here the way in which we achieved such a 3D system, which we intend to use as a starting point for creating facial animations with real-time rendering, used in medicine to find and identify different types of pathologies.
    We describe two group projects carried out by Masters-level students at the University of Osnabrück. These projects gave students the opportunity to learn not only about the structure of language resources (LRs) but also to handle LRs and thus to gain professional knowledge of them. In the KoKs project (2001) a bilingual aligned German/English corpus was created and used for contrastive collocation extraction. In the MAPA project students applied the GermaNet lexical resource to the construction of a vocabulary trainer for German second-language learners. In such projects, groups of approximately 8 students spend 12 months developing software applications. They invest more than 30% of their time on the project over this period and are awarded four times the number of credit points (ECTS) as for standard courses. Such projects are an integral part of student training in Osnabrück, providing an environment in which the students can acquire the practical skills essential to succes...
    In this paper we investigate two fundamental issues related to the production of coherent discourse by intelligent agents: a cohesion property and a fluency property. The cohesion aspects of discourse production relate to the use of pronominal anaphora: whether, and under what conditions, intelligent agents can acquire pronouns as a means of expressing recently mentioned entities. We show that the acquisition of pronouns in the vocabulary of an agent is conditioned by the existence of a memory channel recording the object previously in focus. The approach follows an evolutionary paradigm of language acquisition. Experiments show that pronouns spontaneously appear in the vocabulary of a community of 10 agents dialoguing about a static scene and that, generally, the use of pronouns enhances communication success. The processing-load experiments address the fluency of discourse, measured in terms of Centering transitions. Contrary to previous findings, this side of discourse coherence seems to...
    The emergence of the WWW as the main means of distributing content opened the floodgates of information. The sheer volume and diversity of this content necessitate an approach that will reinvent the way it is analysed. The quantitative route to processing information, which relies on content management tools, provides structural analysis. The challenge we address is to evolve from the process of streamlining data to a level of understanding that assigns value to content. The solution we present incorporates human language technologies in the process of multilingual web content management. i-Librarian is a website built with ATLAS, an open-source software platform. ATLAS complements a content management software-as-a-service component, used for creating, running and managing dynamic content-driven websites, with a linguistic platform. The platform enriches the content of these websites with revealing details and reduces the manual work of classification editors by automatically categorisi...
    This paper reviews notions related to Language Resources and Technologies (LRT), including a brief overview of some resources developed worldwide, with a special focus on the Romanian language. It then describes a joint Romanian, Moldavian and English initiative aimed at developing electronically coded resources for the Romanian language, tools for their maintenance and usage, as well as for the creation of applications based on these resources.
    In most cases, the COLLECT module determines an LPA by enumerating all antecedents in a window of text that precedes the anaphor under scrutiny (Hobbs, 1978; Lappin and Leass, 1994; Mitkov, 1997; Kameyama, 1997; Ge et al., 1998). This window can be as small as two or three sentences or as large as the entire preceding text. The FILTER module usually imposes semantic constraints by requiring that the anaphor and potential antecedents have the same number and gender, that selectional restrictions are obeyed, etc. The PREFERENCE module imposes preferences on potential antecedents on the basis of their grammatical roles, parallelism, frequency, proximity, etc. In some cases, anaphora resolution systems implement these modules explicitly (Hobbs, 1978; Lappin and Leass, 1994; Mitkov, 1997; Kameyama, 1997). In other cases, these modules are integrated by means of statistical (Ge et al., 1998) or uncertainty reasoning techniques (Mitkov, 1997).
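    The three modules described above are easy to picture as a pipeline: COLLECT gathers candidate antecedents from a preceding window, FILTER discards those that violate agreement or selectional constraints, and PREFERENCE ranks what remains. The sketch below is a generic toy version of that pipeline; the window size, features and ranking are illustrative assumptions.
```python
# Generic toy version of the COLLECT / FILTER / PREFERENCE pipeline.
# Mention format: (sentence index, text, gender, number).
mentions = [
    (0, "Maria",  "fem",  "sg"),
    (0, "cartea", "fem",  "sg"),
    (1, "Ion",    "masc", "sg"),
]
anaphor = (2, "ea", "fem", "sg")   # the Romanian pronoun "ea" ("she")

def collect(mentions, anaphor, window=2):
    """Candidates from a window of sentences preceding the anaphor."""
    return [m for m in mentions if 0 <= anaphor[0] - m[0] <= window]

def filter_agreement(candidates, anaphor):
    """Keep candidates that match the anaphor in gender and number."""
    return [m for m in candidates if m[2] == anaphor[2] and m[3] == anaphor[3]]

def prefer(candidates):
    """Toy ranking: the most recent candidate; ties broken by list order."""
    return max(candidates, key=lambda m: m[0]) if candidates else None

candidates = filter_agreement(collect(mentions, anaphor), anaphor)
print(prefer(candidates))   # (0, 'Maria', 'fem', 'sg') under this toy ranking
```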

    And 121 more