Spanning a period of two years (2007-2009), the project for developing a multifunctional lexical database is led by the Université Cheikh Anta Diop de Dakar (UCAD) in Senegal, in collaboration with the linguistics research centre ...
The MULTEXT-East dataset includes the morpho-syntactic specifications, morphosyntactic lexica, and a parallel corpus, the novel "1984" by George Orwell, which is ... Sharoff 2005), which is comparable to the BNC Sampler in its size and accuracy of annotation, and HANCO (Kopotev and ...
Natural language processing (NLP) has become a vital requirement in a wide range of applications, including machine translation, information retrieval, and text classification. The development and evaluation of NLP models for various languages have received significant attention in recent years, but relatively little work has been done on comparing the performance of different language models on Romanian data. In particular, the comparative evaluation of Romanian language models against multilingual models has barely been studied. In this paper, we address this gap by evaluating eight NLP models on two Romanian datasets, XQuAD and RoITD. Our experiments and results show that bert-base-multilingual-cased and bert-base-multilingual-uncased perform best on both XQuAD and RoITD tasks, while the RoBERT-small and DistilBERT models perform the worst. We also discuss the implications of our findings and outline directions for future work in this area.
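To make the evaluation setup concrete, below is a minimal sketch (not the paper's exact protocol) of scoring a multilingual BERT checkpoint on Romanian XQuAD examples with the Hugging Face libraries; the model id is a placeholder and the exact-match scoring is a crude simplification.

    # Minimal evaluation sketch; assumes a checkpoint already fine-tuned for
    # extractive QA (the id below is a placeholder, not a model released by the authors).
    from datasets import load_dataset
    from transformers import pipeline

    MODEL_ID = "bert-base-multilingual-cased"  # placeholder; use a QA-fine-tuned checkpoint
    qa = pipeline("question-answering", model=MODEL_ID)
    xquad_ro = load_dataset("xquad", "xquad.ro", split="validation")

    hits, sample = 0, 50
    for ex in xquad_ro.select(range(sample)):      # small sample, for illustration only
        pred = qa(question=ex["question"], context=ex["context"])
        hits += int(pred["answer"].strip() in ex["answers"]["text"])  # crude exact match
    print(f"Exact match on sample: {hits / sample:.2%}")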
The availability of parallel corpora is limited, especially for under-resourced languages and narrow domains. On the other hand, the number of comparable documents in these areas that are freely available on the Web is continuously increasing. Algorithmic approaches to identifying these documents on the Web are needed in order to automatically build comparable corpora for these under-resourced languages and domains. How do we identify these comparable documents? What approaches should be used in collecting these comparable documents from different Web sources? In this chapter, we first present a review of previous techniques that have been developed for collecting comparable documents from the Web. Then we describe in detail three new techniques to gather comparable documents from three different types of Web sources: Wikipedia, news articles, and narrow domains.
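As an illustration of the simplest family of such techniques (not the chapter's own algorithms), the sketch below scores the comparability of a candidate document pair by projecting source-language terms through a small bilingual lexicon and computing cosine similarity of bag-of-words vectors; the lexicon entries are toy examples.

    from collections import Counter
    import math

    def cosine(a: Counter, b: Counter) -> float:
        num = sum(a[t] * b[t] for t in set(a) & set(b))
        den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
        return num / den if den else 0.0

    def comparability(src_tokens, tgt_tokens, lexicon):
        """Project source tokens into the target language, then compare the two bags of words."""
        projected = Counter(lexicon[t] for t in src_tokens if t in lexicon)
        return cosine(projected, Counter(tgt_tokens))

    lexicon = {"casa": "house", "rosie": "red"}          # toy Romanian-English entries
    print(comparability(["casa", "rosie"], ["the", "red", "house"], lexicon))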
This paper describes RACAI’s word sense alignment system, which participated in the Monolingual Word Sense Alignment shared task organized at the GlobaLex 2020 workshop. We discuss the system architecture and some of the challenges that we faced, and present our results on several of the languages available for the task.
We present here the largest publicly available corpus of Romanian. Its written component contains 1,257,752,812 tokens, distributed, in an unbalanced way, across several language styles (legal, administrative, scientific, journalistic, imaginative, memoirs, blog posts), four domains (arts and culture, nature, society, science) and 71 subdomains. The oral component consists of almost 152 hours of recordings, with associated transcribed texts. All files have CMDI metadata associated with them. The written texts are automatically sentence-split, tokenized, part-of-speech tagged and lemmatized; part of them are also syntactically annotated. The oral files are aligned with their corresponding transcriptions at the word-phoneme level. The transcriptions are also automatically part-of-speech tagged, lemmatized and syllabified. CoRoLa contains original, IPR-cleared texts and is representative of the contemporary phase of the language, covering mostly the last 20 years. Its written component can be queried…
This article presents the current outcomes of the MARCELL CEF Telecom project, which aims to collect and deeply annotate a large comparable corpus of legal documents. The MARCELL corpus includes 7 monolingual sub-corpora (Bulgarian, Croatian, Hungarian, Polish, Romanian, Slovak and Slovenian) containing the total body of the respective national legislative documents. These sub-corpora are automatically sentence-split, tokenized, lemmatized and morphologically and syntactically annotated. The monolingual sub-corpora are complemented by a thematically related parallel corpus (Croatian-English). The metadata and the annotations are provided uniformly for each language-specific sub-corpus. Besides the standard morphosyntactic analysis plus named entity and dependency annotation, the corpus is enriched with IATE and EUROVOC labels. The file format is CoNLL-U Plus, containing the ten columns specific to the CoNLL-U format and four extra columns specific to our corpora. The MARCELL corpo…
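A minimal reader for such CoNLL-U Plus files is sketched below; the names of the project-specific columns are taken from the file's "# global.columns" header rather than hard-coded, and the file name is hypothetical.

    def read_conllu_plus(path):
        # Standard CoNLL-U columns, used as a fallback if no global.columns header is present.
        columns = ["ID", "FORM", "LEMMA", "UPOS", "XPOS", "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]
        sentences, current = [], []
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                line = line.rstrip("\n")
                if line.startswith("# global.columns"):
                    columns = line.split("=", 1)[1].split()
                elif line.startswith("#"):
                    continue                       # other comments / sentence metadata
                elif not line:
                    if current:
                        sentences.append(current)
                        current = []
                else:
                    current.append(dict(zip(columns, line.split("\t"))))
        if current:
            sentences.append(current)
        return sentences

    for sentence in read_conllu_plus("sample.conllup")[:1]:    # hypothetical file
        print([token["FORM"] for token in sentence])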
This paper presents RELATE (http://relate.racai.ro), a high-performance natural language platform designed for the Romanian language. It is meant both for demonstrating the available services, from text-span annotations to syntactic dependency trees, as well as playing or automatically synthesizing Romanian words, and for the development of new annotated corpora. It also incorporates the search engines for the large CoRoLa reference corpus of contemporary Romanian and for the Romanian wordnet. It integrates multiple text and speech processing modules and exposes their functionality through a web interface designed for the linguist researcher. It makes use of a scheduler-runner architecture, allowing processing to be distributed across multiple computing nodes. A series of input/output converters allows large corpora to be loaded, processed and exported according to user preferences.
The article describes the current status of a large national project, CoRoLa, aiming at building a reference corpus for the contemporary Romanian language. Unlike many other national corpora, CoRoLa contains only IPR-cleared texts and speech data, obtained from some of the country’s most representative publishing houses, broadcasting agencies, editorial offices, newspapers and popular bloggers. For the written component 500 million tokens are targeted, and for the oral one 300 hours of recordings. The choice of texts is made according to their functional style, domain and subdomain, also with an eye to international practice. A metadata file (following the CMDI model) is associated with each text file. Collected texts are cleaned and transformed into a format compatible with the tools for automatic processing (segmentation, tokenization, lemmatization, part-of-speech tagging). The paper also presents up-to-date statistics about the structure of the corpus almost two years before it…
We present the Romanian legislative corpus, which is a valuable linguistic asset for the development of machine translation systems, especially for under-resourced languages. The knowledge that can be extracted from this resource is necessary for a deeper understanding of how law terminology is used and how it can be made more consistent. At this moment the corpus contains more than 140k documents representing the legislative body of Romania. This corpus is processed and annotated at different levels: linguistically (tokenized, lemmatized and POS-tagged), dependency parsed, chunked, with named entities identified and labeled with IATE terms and EUROVOC descriptors. Each annotated document is stored in CoNLL-U Plus format consisting of 14 columns: in addition to the standard 10-column format, four other types of annotations were added. Moreover, the repository will be periodically updated as new legislative texts are published. These will be automatically collected and transmitted to the processing…
This article reports on the on-going CoRoLa project, aiming at creating a reference corpus of contemporary Romanian (from 1945 onwards), open for free online exploitation by researchers in linguistics and language processing, teachers of Romanian, and students. We invest serious efforts in persuading large publishing houses and other owners of IPR on relevant language data to join us and contribute to the project with selections of their text and speech repositories. The CoRoLa project is coordinated by two Computer Science institutes of the Romanian Academy, but enjoys the cooperation of, and consulting from, professional linguists from other institutes of the Romanian Academy. We foresee a written component of the corpus of more than 500 million word forms, and a speech component of about 300 hours of recordings. The entire collection of texts (covering all functional styles of the language) will be pre-processed and annotated at several levels, and also documented with standardized metadata…
The central objective of the Metanet4u project is to contribute to the establishment of a pan-European digital platform that makes available language resources and services, encompassing both datasets and software tools, for speech and language processing, and supports a new generation of exchange facilities for them.
One of the fundamental tasks in natural-language processing is the morpho-lexical disambiguation of words occurring in text. Over the last twenty years or so, approaches to part-of-speech tagging based on machine learning techniques have been developed or ported to provide high-accuracy morpho-lexical annotation for an increasing number of languages. Due to recent increases in computing power, together with improvements in tagging technology and the extension of language typologies, part-of-speech tags have become significantly more complex. The need to address multilinguality more directly in the web environment has created a demand for interoperable, harmonized morpho-lexical descriptions across languages. Given the large number of morpho-lexical descriptors for a morphologically complex language, one has to consider ways to avoid the data sparseness threat in standard statistical tagging, yet ensure that full lexicon information is available for each word form in the output. The ...
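One way to reconcile these two requirements (a sketch of the general idea only, not necessarily the method of the paper) is to tag with a reduced tagset to avoid sparseness and recover the fine-grained description from a word-form lexicon afterwards. Both the tag names and the MSD-to-reduced-tag mapping below are invented for the example.

    # Toy illustration: tag with a reduced tagset, then recover the fine-grained MSD
    # from a morphological lexicon when the reduced tag maps back unambiguously.
    MSD_TO_REDUCED = {"Ncmsrn": "NSR", "Ncfsrn": "NSR", "Vmip3s": "V3"}

    def recover_msd(word, reduced_tag, lexicon):
        """lexicon: word-form -> set of possible MSDs (from a morphological dictionary)."""
        candidates = [msd for msd in lexicon.get(word, set())
                      if MSD_TO_REDUCED.get(msd) == reduced_tag]
        return candidates[0] if len(candidates) == 1 else candidates

    lexicon = {"casa": {"Ncfsrn"}, "merge": {"Vmip3s"}}
    print(recover_msd("casa", "NSR", lexicon))     # -> 'Ncfsrn'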
The MULTEXT-East morphosyntactic lexicons have a simple structure, where each line is a lexical entry with three tab-separated fields: (1) the word-form, the inflected form of the word; (2) the lemma, the base-form of the word; (3) the MSD, the morphosyntactic description of the word-form, i.e., its fine-grained PoS tag, as defined in the MULTEXT-East morphosyntactic specifications. This submission contains the freely available MULTEXT-East lexicons, while a separate submission (http://hdl.handle.net/11356/1042) gives those that are available only for non-commercial use.
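Given the three-field format described above, loading such a lexicon is straightforward; the sketch below is a minimal reader (the file name is hypothetical, and '=' is treated as the common shorthand for a lemma identical to the word-form).

    from collections import defaultdict

    def load_lexicon(path):
        lex = defaultdict(set)
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                line = line.rstrip("\n")
                if not line:
                    continue
                wordform, lemma, msd = line.split("\t")
                if lemma == "=":          # shorthand: lemma identical to the word-form
                    lemma = wordform
                lex[wordform].add((lemma, msd))
        return lex

    lex = load_lexicon("ro-lexicon.tbl")   # hypothetical file name
    print(lex.get("casele"))               # e.g. {('casă', 'Ncfpry')}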
Efforts to develop large Lexical DataBases (LDBs) are just emerging in most of the CE countries. CONCEDE is an EU project aiming at harmonising the methodologies, tools and (to a lesser extent) resources for building LDBs for six CE languages: Bulgarian, Czech, Estonian, Hungarian, Romanian and Slovene. The paper addresses the specific problems concerning the process of TEI encoding of the Romanian Explanatory Dictionary. The Romanian Explanatory Dictionary (DEX, second edition, 1996) is the reference dictionary of Romanian; it was developed by the Institute for Linguistics of the Romanian Academy and published by the Univers Enciclopedic Publishing House. DEX is meant for a wide public and therefore its lexicographic content is relatively rich, with head-words belonging to what is generally called the basic vocabulary of a language, including regional variants, but also containing various technical or specialised terms as well as various neologisms and recent…
Multilingualism is a cultural cornerstone of Europe and is firmly anchored in the European treaties, including full language equality. However, language barriers impacting business, cross-lingual and cross-cultural communication are still omnipresent. Language Technologies (LTs) are a powerful means to break down these barriers. While the last decade has seen various initiatives that created a multitude of approaches and technologies tailored to Europe's specific needs, there is still an immense level of fragmentation. At the same time, AI has become an increasingly important concept in the European Information and Communication Technology area. For a few years now, AI, including many opportunities and synergies but also misconceptions, has been overshadowing every other topic. We present an overview of the European LT landscape, describing funding programmes, activities, actions and challenges in the different countries with regard to LT, including the current state of play in industry…
… Evaluation Forum) has played a leading role in stimulating research and innovation in a wide range …
Within the larger ROBIN project, focused on the development of software and services for human-robot interaction, we present here a set of activities focused on the creation and enhancement of the language resources necessary for making dialogue possible between humans and the robot Pepper. More precisely, we describe the preparatory activities for turning the robot Pepper into a dialogue partner for a human using the Romanian language. The language resources that have been created are a lexicon, a language model, and an acoustic model. They have been enhanced by using other language resources, such as the Romanian wordnet and word embeddings extracted from the large Romanian reference corpus CoRoLa. They will ensure oral communication within some envisaged microworlds and are thus specific to these microworlds. These resources have been carefully curated and evaluated.
MEBA is a lexical aligner, implemented in C#, based on an iterative algorithm that uses several pre-processing steps: sentence alignment (SAL, http://www.clarin.eu/tools/sal-sentence-aligner), tokenization, POS-tagging and lemmatization (through TTL, http://www.clarin.eu/tools/ttl-tokenizing-tagging-and-lemmatizing-free-running-texts), and sentence chunking. Similarly to the YAWA aligner, MEBA generates the links step by step, beginning with the most probable ones (anchor links). The links added at any later step are supported or restricted by the links created in the previous iterations. The aligner uses different weights and different significance thresholds for each feature and iteration. Each of the iterations can be configured to align different categories of tokens (named entities, dates and numbers, content words, functional words, punctuation) in decreasing order of statistical evidence. MEBA has an individual F-measure of 81.71% and it is currently integrated in the platform http://www…
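The sketch below is a heavily simplified illustration of this anchor-first, iterative linking strategy, not MEBA itself: the highest-scoring word pairs are locked in first, and weaker links are admitted in later passes only if they fall close to already-accepted anchors. The scorer is a toy stand-in for the real feature-weighted scores.

    def iterative_align(src, tgt, score, thresholds=(0.8, 0.5, 0.3), window=2):
        links = []
        for th in thresholds:                              # one pass per confidence level
            for i, sw in enumerate(src):
                for j, tw in enumerate(tgt):
                    if any(i == a or j == b for a, b in links):
                        continue                           # 1:1 sketch: one link per token
                    near_anchor = not links or any(abs(i - a) <= window and abs(j - b) <= window
                                                   for a, b in links)
                    if near_anchor and score(sw, tw) >= th:
                        links.append((i, j))
        return sorted(links)

    # toy scorer: identical tokens (e.g. names, numbers) count as sure links
    score = lambda a, b: 1.0 if a == b else 0.0
    print(iterative_align(["Ana", "are", "mere"], ["Ana", "has", "apples"], score))   # [(0, 0)]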
YAWA is a four-stage lexical aligner that uses bilingual translation lexicons produced by TREQ (http://www.clarin.eu/tools/translation-equivalents-extractor) and phrase boundary detection to align the words of a given bitext. Using this alignment, in stage 2 a language-dependent module takes over and produces alignments of the remaining lexical tokens within aligned chunks. Stage 3 is specialized in aligning blocks of consecutive unaligned tokens, and stage 4 deletes alignments that are likely to be wrong. Developed in Perl, YAWA is language independent, except for the modules that realise alignments specific to the pairs of aligned languages. So far, it works only for the Ro-En language pair. It requires a parallel corpus in XCES format (http://www.xces.org), morpho-syntactically annotated and lemmatized (using TTL, http://www.clarin.eu/tools/ttl-tokenizing-tagging-and-lemmatizing-free-running-texts), and translation dictionaries produced by http://www.clarin.eu/tools/transla…
The biomedical domain provides a large amount of linguistic resources usable for biomedical text mining. While most of the resources used in biomedical Natural Language Processing are available for English, for other languages, including Romanian, access to language resources is not straightforward. In this paper, we present the biomedical corpus of the Romanian language, which is a valuable linguistic asset for biomedical text mining. This corpus was collected in the context of the CoRoLa project, the reference corpus for the contemporary Romanian language. We also provide informative statistics about the corpus and a description of its composition. The annotation process of the corpus is also presented. Furthermore, we present the fraction of the corpus which will be made publicly available to the community without copyright restrictions.
With the increasing usage of Natural Language Processing to facilitate the interactions between humans and machines, automatic speech recognition systems have become increasingly popular as a result of their utility in a wide range of applications. In this paper we explore well-known open-source speech-to-text engines, namely CMUSphinx, DeepSpeech, and Kaldi, to build a baseline of models to transcribe Romanian speech. These engines employ various underlying methods, from hidden Markov models to deep neural networks that also integrate language models, thus providing a solid baseline for comparison. Unfortunately, Romanian is still a low-resource language, and six datasets of various qualities were merged to obtain 104 hours of speech. To further increase the size of the gathered corpora, our experiments consider data augmentation techniques, specifically SpecAugment, applied on the most promising model. Besides using existing corpora, we publicly release a dataset of 11.5 hours generated from governmental transcripts. The best-performing model is obtained using the Kaldi architecture, considers a hybrid structure with a Deep Neural Network, and achieves a WER of 3.10% on the test partition.
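For reference, the word error rate (WER) quoted above is the standard Levenshtein-based measure over words; a minimal implementation (not the evaluation code used in the paper) is sketched below.

    def wer(reference: str, hypothesis: str) -> float:
        ref, hyp = reference.split(), hypothesis.split()
        # dp[i][j] = minimum edit distance between ref[:i] and hyp[:j]
        dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dp[i][0] = i
        for j in range(len(hyp) + 1):
            dp[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                               dp[i][j - 1] + 1,          # insertion
                               dp[i - 1][j - 1] + cost)   # substitution
        return dp[len(ref)][len(hyp)] / max(len(ref), 1)

    print(f"{wer('ana are mere', 'ana are pere'):.2%}")   # 33.33%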
The application, developed in C#, automatically identifies the language of a text written in one of the 21 European Union languages. Using training texts in the different languages (approx. 1.5 MB of text for each language), a training module counts the prefixes (the first 3 characters) and the suffixes (the 4-character endings) of all the words in the texts, for each language. For every language two models are constructed, containing the weights (percentages) of the prefixes and suffixes in the texts representing that language. In the prediction phase, two models are built on the fly for a new text in a similar manner. These models are then compared with the stored models representing each language for which the application was trained. Using comparison functions, the best model is chosen. More detailed descriptions are available in the following papers (http://www.racai.ro/~tufis/papers): Dan Tufis, Radu Ion, Alexandru Ceausu, and Dan Ştefănescu (2008). RACAI's Linguistic Web Servic…
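A minimal re-implementation of this prefix/suffix profiling scheme is sketched below; the training file names are hypothetical and the comparison function is one simple choice among many.

    from collections import Counter

    def profile(text):
        """Build normalised 3-character-prefix and 4-character-suffix weight models."""
        words = [w for w in text.lower().split() if len(w) >= 4]
        prefixes = Counter(w[:3] for w in words)
        suffixes = Counter(w[-4:] for w in words)
        tp, ts = sum(prefixes.values()) or 1, sum(suffixes.values()) or 1
        return ({k: v / tp for k, v in prefixes.items()},
                {k: v / ts for k, v in suffixes.items()})

    def similarity(model_a, model_b):
        overlap = lambda x, y: sum(min(x[k], y.get(k, 0.0)) for k in x)
        return overlap(model_a[0], model_b[0]) + overlap(model_a[1], model_b[1])

    # hypothetical training files, one per language
    models = {lang: profile(open(f"train.{lang}", encoding="utf-8").read())
              for lang in ("ro", "en", "fr")}

    def identify(text):
        return max(models, key=lambda lang: similarity(profile(text), models[lang]))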
TREQ exploits the knowledge embedded in parallel corpora and produces a set of translation equivalents (a translation lexicon), based on a 1:1 mapping hypothesis. The program uses almost no linguistic knowledge, relying on statistical evidence and some simplifying assumptions. The extraction follows a generate-and-test approach: it first generates a list of translation equivalent candidates and then successively extracts the most likely translation equivalence pairs. It does not require a pre-existing bilingual lexicon for the considered languages. Yet, if such a lexicon exists, it can be used to eliminate spurious candidate translation equivalence pairs and thus to speed up the process and increase its accuracy. The algorithm relies on some pre-processing of the bitext: a sentence aligner, a tokeniser (using MtSeg, http://www.lpl.univaix.fr/projects/multext/MtSeg), a collocation extractor (unaware of translation equivalence), a POS-tagger and a lemmatiser. More detailed descriptions…
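A toy illustration of the generate-and-test idea is sketched below, using the Dice coefficient over sentence-level co-occurrence as the statistical evidence; TREQ's actual candidate generation and filtering are more elaborate.

    from collections import Counter
    from itertools import product

    def extract_equivalents(bitext, min_dice=0.5):
        """bitext: iterable of (source_tokens, target_tokens) aligned sentence pairs."""
        src_freq, tgt_freq, pair_freq = Counter(), Counter(), Counter()
        for src_sent, tgt_sent in bitext:
            src_freq.update(set(src_sent))
            tgt_freq.update(set(tgt_sent))
            pair_freq.update(product(set(src_sent), set(tgt_sent)))
        best = {}
        for (s, t), c in pair_freq.items():
            dice = 2 * c / (src_freq[s] + tgt_freq[t])
            if dice >= min_dice and dice > best.get(s, ("", 0.0))[1]:
                best[s] = (t, dice)
        return {s: t for s, (t, _) in best.items()}    # the 1:1 translation lexicon

    bitext = [(["casa", "rosie"], ["the", "red", "house"]),
              (["casa", "mare"], ["the", "big", "house"])]
    print(extract_equivalents(bitext))                  # e.g. {'casa': 'house', ...}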
The novel "1984" by George Orwell is the central component of the MULTEXT-East corpus. This parallel and sentence aligned corpus contains the novel in the English original (about 100,000 words in length), and its translations... more
The novel "1984" by George Orwell is the central component of the MULTEXT-East corpus. This parallel and sentence aligned corpus contains the novel in the English original (about 100,000 words in length), and its translations into a number of languages. This version of the corpus contains structurally annotated texts only, which contain elements such as the paragraph, the footnote, and highlighted text. In terms of linguistic annotations, the text contain names and sentences. The linguistically annotated texts are a separate submission (http://hdl.handle.net/11356/1043) also with somewhat different languages.
Extracting parallel units (e.g. sentences or phrases) from comparable corpora in order to enrich existing statistical translation models is an avenue that has attracted a lot of research in recent years. There are experiments that convincingly show how parallel sentences extracted from comparable corpora are able to improve statistical machine translation (SMT). Yet, the existing body of research on the subject takes into account neither the degree of comparability of the corpus being processed nor the computation time it takes to extract translationally similar pairs from a corpus of a given size. We will show that the performance of a parallel unit extractor crucially depends on the degree of comparability, such that it is more difficult to mine for parallel data in a weakly comparable corpus than in a strongly comparable one.
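The basic mining loop at issue is sketched below (a simplification, not the extractors evaluated in the paper): every source sentence is paired with its best-scoring target-side candidate and kept only above a similarity threshold. sim() stands for any cross-lingual sentence similarity, and the exhaustive search over candidates is precisely where the computation-time concern comes from.

    def mine_parallel(src_sents, tgt_sents, sim, threshold=0.7):
        """Keep (source, target) sentence pairs whose similarity exceeds the threshold."""
        pairs = []
        for s in src_sents:
            best = max(tgt_sents, key=lambda t: sim(s, t))
            if sim(s, best) >= threshold:
                pairs.append((s, best))
        return pairs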
Large-scale pre-trained language representations and their promising performance in various downstream applications have become an area of interest in the field of natural language processing (NLP). There has been huge interest in further increasing models' size in order to outperform the best previously obtained results. However, at some point, increasing a model's parameters may lead to reaching its saturation point due to the limited capacity of GPUs/TPUs. In addition, such models are mostly available in English or in a shared multilingual structure. Hence, in this paper, we propose a lite BERT trained on a large corpus solely in the Romanian language, which we called "A Lite Romanian BERT (ALR-BERT)". Based on comprehensive empirical results, ALR-BERT produces models that scale far better than the original Romanian BERT. Alongside presenting the performance on downstream tasks, we detail the analysis of the training process and its parameters. We also intend to distribute…
The article reports on research and developments pursued by the Research Institute for Artificial Intelligence "Mihai Draganescu" of the Romanian Academy in order to narrow the gaps identified by the deep analysis of the European languages presented in the Meta-Net white papers published by Springer in 2012. Except for English, all the European languages needed significant research and development in order to reach an adequate technological level, in line with the expectations and requirements of the knowledge society.
The paper describes a general method (as well as its implementation and evaluation) for deriving mapping systems between the different tagsets available in existing training corpora (gold standards) for a specific language. For each pair of corpora (tagged with different tagsets), one such mapping system is derived. This mapping system is then used to improve the tagging of each of the two corpora with the tagset of the other (a process we call cross-tagging). By reapplying the algorithm to the newly obtained corpora, the accuracy of the underlying training corpora can also be improved. Furthermore, comparing the results with the gold standards makes it possible to assess the distributional adequacy of the various tagsets used in processing the language in question. Unlike other methods, such as those reported in (Brants, 1995) or (Tufiş & Dragomirescu, 2004), which assume a subsumption relation between the considered tagsets and as such aim at minimizing the tagsets by eliminating …
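A much-reduced sketch of how a tag-to-tag correspondence can be read off a text carrying both annotations (e.g. a gold corpus re-tagged with the other corpus's tagger) is given below; the paper's mapping systems are richer than this single dominant correspondence per tag.

    from collections import Counter, defaultdict

    def derive_mapping(tags_a, tags_b):
        """tags_a, tags_b: parallel per-token tag sequences in the two tagsets."""
        cooc = defaultdict(Counter)
        for a, b in zip(tags_a, tags_b):
            cooc[a][b] += 1
        # keep the dominant correspondence for each tag of the first tagset
        return {a: counts.most_common(1)[0][0] for a, counts in cooc.items()}

    tags_a = ["NN", "VB", "NN", "DT"]          # toy sequences, not real annotations
    tags_b = ["Ncms", "Vmip", "Ncfp", "Dd"]
    print(derive_mapping(tags_a, tags_b))      # {'NN': 'Ncms', 'VB': 'Vmip', 'DT': 'Dd'}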
The benefits of using comparable corpora for improving the translation quality of statistical machine translation systems have already been shown by various researchers. The usual approach starts with a baseline system, trained on out-of-domain parallel corpora, followed by its adaptation to the domain in which new translations are needed. The adaptation to a new domain, especially a narrow one, is based on data extracted from comparable corpora from the new domain or from one as close as possible. This article reports on a slightly different approach: building an SMT system entirely from comparable data for the domain of interest. Certainly, the approach is feasible only if the comparable corpora are large enough to yield SMT-useful data in sufficient quantities for reliable training. The more comparable corpora, the better the results are. Wikipedia is definitely a very good candidate for such an experiment. We report on mass experiments showing significant improvements over a bas…
The paper proposes a paradigmatic approach to morphological knowledge acquisition. It addresses the problem of learning from examples rules for word-form analysis and synthesis. These rules, established by generalizing the training data sets, are effectively used by a built-in interpreter which consequently acts as a morphological processor within the architecture of a natural language question-answering system. The PARADIGM system has no a priori knowledge which should restrict it to a particular natural language, but instead builds up the morphological rules based only on the examples provided, be they in Romanian, English, French, Russian, Slovak and the like.
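As a toy illustration of this learning-from-examples idea (PARADIGM's rule generalisation over full paradigms is richer than this), the sketch below induces suffix-rewriting rules from (word-form, lemma) pairs and reuses them to analyse unseen forms.

    def induce_rule(form, lemma):
        """Return (suffix to strip, suffix to add) mapping the word-form onto its lemma."""
        i = 0
        while i < min(len(form), len(lemma)) and form[i] == lemma[i]:
            i += 1
        return form[i:], lemma[i:]

    def learn(examples):
        return {induce_rule(form, lemma) for form, lemma in examples}

    def analyse(form, rules):
        """Propose candidate lemmas for an unseen word-form."""
        return {form[: len(form) - len(strip)] + add
                for strip, add in rules
                if form.endswith(strip)}

    rules = learn([("walked", "walk"), ("talked", "talk")])
    print(analyse("jumped", rules))    # {'jump'}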
