
    Ljiljana Dolamic

    ABSTRACT This deliverable reports on the update of the automatic categorization and presents advances in readability and trustability detection within task T.1.5 Categorization, Trustability and Readability of the KHRESMOI WP1 Large Scale Biomedical Text Mining and Search. In this report we describe and evaluate the tools developed within the scope of this work package. The document categorization tool presented here, and its implementation within the search engine, enables the end user to filter the results returned by the search engine according to a predetermined set of classes, such as Cancer or Alcohol. Since not all users have the same level of medical literacy, two different approaches have been taken to propose suitable documents to the end user. Firstly, documents have been classified by readability level and labeled as either easy or difficult to read; integrating this tool within the search engine lets the end user access only the documents he or she judges suitable. Secondly, a method for predicting the user's expertise has been proposed, with the goal of presenting the user with documents adapted to his or her literacy level. Whether the user can trust the information on the pages returned by the search engine is another issue we have tackled in this research. To determine the level of trustability, a machine learning tool, whose results are presented here, has been developed and integrated into the crawling process. Integrating its results into the search engine will enable the visualisation of the level of trust assigned to each source by the system.
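    As a rough illustration of the readability component above, the following minimal sketch trains a binary easy/difficult text classifier. The pipeline, feature choice and example sentences are assumptions for illustration only, not the deliverable's actual implementation.

```python
# Minimal sketch of a binary easy/difficult readability classifier,
# assuming a small labeled document set; features, model and example
# sentences are illustrative, not the deliverable's implementation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = ["Take this medicine twice a day with food.",            # hypothetical data
        "Hepatic clearance of the compound is CYP3A4-mediated."]
labels = ["easy", "difficult"]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(docs, labels)
print(model.predict(["Prophylaxis of myocardial infarction with anticoagulants."]))
```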
    This paper describes and evaluates a public health web page classification model based on key phrase extraction and matching. Easily extensible both to new classes and to new languages, this method proves to be a good solution for text classification in the face of a total lack of training data. To evaluate the proposed solution we used a small collection of public health related web pages created by double-blind manual classification. Our experiments have shown that by choosing an adequate threshold value, the desired precision or recall can be achieved.
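    The classification-by-matching idea lends itself to a compact sketch: score each class by how many of its key phrases occur in the page, and accept classes that score above a threshold. The phrase lists below are hypothetical; raising the threshold trades recall for precision, as the abstract notes.

```python
# Sketch of classification by key-phrase matching with a tunable
# threshold; the classes and phrase lists are illustrative only.
CLASS_PHRASES = {
    "vaccination": {"vaccine", "immunization", "booster dose"},
    "nutrition":   {"balanced diet", "vitamin", "calorie intake"},
}

def classify(text, threshold=1):
    text = text.lower()
    labels = []
    for cls, phrases in CLASS_PHRASES.items():
        score = sum(phrase in text for phrase in phrases)  # count matched phrases
        if score >= threshold:   # higher threshold -> more precision, less recall
            labels.append(cls)
    return labels

print(classify("A balanced diet and vitamin supplements support recovery."))
```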
    The authors evaluated supervised automatic classification algorithms for determining health-related web page compliance with individual HONcode criteria of conduct, using varying-length character n-gram vectors to represent healthcare web page documents. The training/testing collection comprised web page fragments extracted by HONcode experts during the manual certification process. The authors compared the automated classification performance of n-gram tokenization to that of document words and Porter-stemmed document words, using a Naive Bayes classifier and DF (document frequency) dimensionality reduction metrics. The study attempted to determine whether the automated, language-independent approach might safely replace word-based classification. Using 5-grams as document features, the authors also compared the baseline DF reduction function to Chi-square and Z-score dimensionality reductions. Overall study results indicate that n-gram tokenization provided a potentially viable alternative to document word stemming.
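    A minimal sketch of the compared setup, assuming a tiny labeled collection: character 5-gram features, a Naive Bayes classifier, and a chi-square feature-selection step standing in for the dimensionality-reduction metrics studied (DF and Z-score would swap in as different scoring functions).

```python
# Sketch: character 5-gram features + feature selection + Naive Bayes.
# The fragments and labels are hypothetical stand-ins for the HONcode data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

fragments = ["This site is reviewed by a licensed physician.",
             "Buy miracle pills now, no prescription needed!"]
compliant = [1, 0]   # compliance with one HONcode criterion (illustrative)

model = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(5, 5)),  # 5-grams as features
    SelectKBest(chi2, k=20),                               # keep top-scoring features
    MultinomialNB(),
)
model.fit(fragments, compliant)
print(model.predict(["Content reviewed by a physician before publication."]))
```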
    We aim to test the semantic difference between French epistemic devoir and Italian epistemic dovere in the conditional mood (the former seeming to convey a stronger indication of necessity than the latter). By means of quantitative analysis, we take into account the textual environments of the modal verb devoir/dovere as well as its degree of representativeness in corpora that exhibit genre and diachronic variation in the two languages considered.
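    The quantitative measure implied here can be sketched as a relative-frequency count over a corpus. The conditional-mood forms listed below are standard conjugations of devoir/dovere; the corpus handling is a placeholder, not the study's pipeline.

```python
# Toy relative-frequency count (occurrences per million tokens) of
# conditional-mood forms of devoir/dovere; corpus input is a stand-in.
COND_FORMS = {"devrait", "devraient", "dovrebbe", "dovrebbero"}

def per_million(tokens):
    hits = sum(token.lower() in COND_FORMS for token in tokens)
    return 1_000_000 * hits / len(tokens)

corpus = "Il dovrebbe arrivare domani .".split()  # stand-in for a real corpus
print(per_million(corpus))
```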
    Mapping the technology landscape is crucial for market actors to take informed investment decisions. However, given the large amount of data on the Web and the resulting information overload, manually retrieving information is an ineffective and incomplete approach. In this work, we propose an end-to-end recommendation-based retrieval approach to support the automatic retrieval of technologies and their associated companies from raw Web data. This is a two-task setup involving (i) technology classification of entities extracted from a company corpus, and (ii) technology and company retrieval based on the classified technologies. Our framework approaches the first task by leveraging DistilBERT, a state-of-the-art language model. For the retrieval task, we introduce a recommendation-based technique that simultaneously supports retrieving related companies, technologies related to a specific company, and companies relevant to a technology. To evaluate these tasks...
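    For the first task, a hedged sketch of a DistilBERT-based technology classifier using the Hugging Face transformers library is shown below; the label set, input text and fine-tuning details are assumptions, not the paper's exact configuration.

```python
# Sketch of technology classification of an extracted entity with
# DistilBERT; the number of labels and the input are hypothetical.
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=3)   # e.g. AI / robotics / other

inputs = tokenizer("Edge inference accelerator for vision models",
                   return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits            # untuned head; fine-tuning omitted
print(logits.argmax(dim=-1))
```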
    Our first objective in participating in this domain-specific evaluation campaign is to propose and evaluate various indexing and search strategies for the German, English and Russian languages, in an effort to obtain better retrieval effectiveness than that of the language-independent approach (n-grams). To do so we evaluate the GIRT-4 test collection using the Okapi model, various IR models derived from the Divergence from Randomness (DFR) paradigm, and the statistical language model (LM), together with the classical tf.idf vector-processing scheme.
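    Of the models named, Okapi is the most compactly stated; a sketch of its BM25 term weight follows, with the common default parameters k1 = 1.2 and b = 0.75 rather than any campaign-specific tuning.

```python
# Okapi BM25 weight of one query term in one document; parameter values
# are common defaults, not the evaluated systems' tuned settings.
import math

def bm25_weight(tf, doc_len, avg_doc_len, df, num_docs, k1=1.2, b=0.75):
    idf = math.log((num_docs - df + 0.5) / (df + 0.5) + 1.0)
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm

# one query term occurring 3 times in a 120-token document
print(bm25_weight(tf=3, doc_len=120, avg_doc_len=150, df=40, num_docs=10_000))
```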
    In this paper, we present the systems submitted by our team from the Institute of ICT (HEIG-VD / HES-SO) to the Unsupervised MT and Very Low Resource Supervised MT task. We first study the improvements brought to a baseline system by techniques such as back-translation and initialization from a parent model. We find that both techniques are beneficial and suffice to reach performance that compares with more sophisticated systems from the 2020 task. We then present the application of this system to the 2021 task for low-resource supervised Upper Sorbian (HSB) to German translation, in both directions. Finally, we present a contrastive system for HSB-DE in both directions, and for unsupervised German to Lower Sorbian (DSB) translation, which uses multi-task training with various training schedules to improve over the baseline.
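    Back-translation, the first of the techniques studied, reduces to a simple loop: translate monolingual target-side text with the reverse model and train on the resulting synthetic pairs. The sketch below is schematic; translate() and the model objects are stand-ins, not the submitted system's code.

```python
# Schematic back-translation for HSB->DE, assuming trained MT models
# exist; translate() and the model objects are hypothetical stand-ins.
def back_translate(monolingual_de, model_de_to_hsb):
    """Turn monolingual German sentences into synthetic HSB->DE pairs."""
    synthetic_pairs = []
    for sentence in monolingual_de:
        source = model_de_to_hsb.translate(sentence)  # synthetic HSB source
        synthetic_pairs.append((source, sentence))    # genuine DE target
    return synthetic_pairs

# The HSB->DE model is then trained on real parallel data plus the
# synthetic pairs, e.g.:
# model_hsb_to_de.train(parallel_pairs + back_translate(mono_de, model_de_to_hsb))
```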
    In participating in this first FIRE evaluation campaign, we design and evaluate stopword lists and light stemming strategies for the Hindi, Bengali and Marathi languages. As members of the Indo-European language family, they tend to have similar syntax and morphology, and also related writing systems. Our second objective is to obtain a better picture of the relative merit of various search engines in exploring Hindi, Bengali and Marathi documents. To evaluate these solutions we use our various IR models, including the Okapi, Divergence from Randomness (DFR) and statistical language model (LM) together with the classical tf·idf vector-processing approach. Our various experiments with these three languages tend to demonstrate that the I(ne)C2 or the PB2 model derived
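    A light stemmer of the kind described can be sketched in a few lines: strip a small set of frequent inflectional endings, longest match first. The Hindi endings below are illustrative examples only, not the actual rule set evaluated in the campaign.

```python
# Toy light stemmer: remove one frequent inflectional ending, longest
# first, keeping a minimum stem length. Suffix list is illustrative.
SUFFIXES = sorted(["ियों", "ाओं", "ों", "ें", "ी", "े", "ा"],
                  key=len, reverse=True)

def light_stem(word, min_stem=2):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]   # strip at most one suffix
    return word

print(light_stem("लड़कों"))   # oblique plural of "boy" -> stemmed form
```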
    In our participation in this CLEF evaluation campaign, the first objective is to propose and evaluate various indexing and search strategies for the Czech language, in order to produce better retrieval effectiveness than that of the language-independent approach (n-grams). Based on our stemming strategy used with other languages, we propose two light stemmers for this Slavic language and a third one based on a more aggressive suffix-stripping scheme that removes some derivational suffixes. Our second objective is to obtain a better picture of the relative merit of various search engines in exploring Hungarian and Bulgarian documents. Moreover, for the Bulgarian language we developed a new and more aggressive stemmer. To evaluate these solutions we use our various IR models, including the Okapi, Divergence from Randomness (DFR) and statistical language model (LM) together with the classical tf.idf vector-processing approach. Our experiments tend to show that for the Bulgarian ...
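    The light-versus-aggressive distinction drawn here can be illustrated schematically: the light pass removes inflectional (case/number) endings only, while the aggressive pass also strips derivational suffixes. The Czech suffix lists below are illustrative, not the published stemmers' rules.

```python
# Schematic contrast between light (inflectional-only) and aggressive
# (inflectional + derivational) stemming; suffix lists are illustrative.
INFLECTIONAL = sorted(["ové", "ách", "ami", "ech", "ům", "y", "a", "u"],
                      key=len, reverse=True)
DERIVATIONAL = ["ost", "ník"]   # e.g. abstract nouns, agent nouns

def strip_one(word, suffixes, min_stem=3):
    for s in suffixes:
        if word.endswith(s) and len(word) - len(s) >= min_stem:
            return word[: -len(s)]
    return word

def light_stem(word):
    return strip_one(word, INFLECTIONAL)

def aggressive_stem(word):
    return strip_one(light_stem(word), DERIVATIONAL)

print(light_stem("radostech"), aggressive_stem("radostech"))
```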
    Explainability is a key requirement for text classification in many application domains ranging from sentiment analysis to medical diagnosis or legal reviews. Existing methods often rely on “attention” mechanisms for explaining classification results by estimating the relative importance of input units. However, recent studies have shown that such mechanisms tend to mis-identify irrelevant input units in their explanation. In this work, we propose a hybrid human-AI approach that incorporates human rationales into attention-based text classification models to improve the explainability of classification results. Specifically, we ask workers to provide rationales for their annotation by selecting relevant pieces of text. We introduce MARTA, a Bayesian framework that jointly learns an attention-based model and the reliability of workers while injecting human rationales into model training. We derive a principled optimization algorithm based on variational inference with efficient updat...
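    A much-simplified sketch of injecting rationales into attention-based training: a supervision term pulls the model's attention toward annotator-marked tokens. This illustrates the general idea only; MARTA's Bayesian modeling of worker reliability and its variational-inference updates are not reproduced here.

```python
# Simplified rationale supervision: add a KL term between the model's
# attention and the normalized human-rationale distribution. All tensors
# below are hypothetical stand-ins for a real model's outputs.
import torch
import torch.nn.functional as F

def rationale_loss(attention, rationale_mask):
    target = rationale_mask / rationale_mask.sum(dim=-1, keepdim=True)
    return F.kl_div(attention.log(), target, reduction="batchmean")

attn = torch.tensor([[0.1, 0.6, 0.2, 0.1]])   # model attention weights
mask = torch.tensor([[0.0, 1.0, 1.0, 0.0]])   # tokens a worker marked relevant
logits = torch.tensor([[1.2, -0.3]])          # classifier output for the text
label = torch.tensor([0])

# joint objective: classification loss + weighted rationale term
loss = F.cross_entropy(logits, label) + 0.5 * rationale_loss(attn, mask)
print(loss)
```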
    Finding companies’ websites is important when building business databases. However, automatically finding a company’s website based on its name or its official entry in a registry is challenging, as companies often have similar names, acronyms, or descriptions. In this context, we built a system to evaluate different features and classifiers to automatically identify a company’s website from unstructured content.
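    A hedged sketch of such a feature-and-classifier setup: simple features comparing a company name against a candidate domain and page, fed to an off-the-shelf classifier. The features, data and model choice are illustrative assumptions, not the evaluated system.

```python
# Toy candidate-website classifier: hand-crafted similarity features plus
# a random forest; feature set and training pairs are illustrative only.
from difflib import SequenceMatcher
from sklearn.ensemble import RandomForestClassifier

def features(company, domain, page_text):
    name = company.lower()
    return [
        SequenceMatcher(None, name.replace(" ", ""), domain).ratio(),  # name-domain similarity
        int(name in page_text.lower()),                                # name mentioned on page
        len(page_text),                                                # crude page-size signal
    ]

X = [features("Acme Robotics", "acmerobotics.com", "Acme Robotics builds arms"),
     features("Acme Robotics", "acme-recipes.net", "Best pie recipes")]
y = [1, 0]   # correct website vs. not
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
```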
