We present an extensive experimental study evaluating the behaviour of Recursive ECOC (RECOC) [1] learning machines based on Low Density Parity Check (LDPC) coding structures. We show that, owing to the iterative decoding algorithms behind LDPC codes, RECOC multiclass learning is achieved progressively. This learning behaviour confirms the existence of a new boosting dimension, the one provided by the coding space. We present a method for finding potentially good RECOC codes among LDPC ones. Starting from a properly selected LDPC code, we assess the effect of boosting on both weak and strong binary learners. For nearly all domains, we find that boosting a strong learner like a Decision Tree is as effective as boosting a weak one like a Decision Stump. This surprising result supports the hypothesis that weakening strong classifiers by boosting has a decorrelation effect, which can be used to improve RECOC learning.
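For readers unfamiliar with the ECOC family, the following minimal Python sketch shows the generic encode/train/decode loop that RECOC generalizes. The toy 4-class, 6-bit codebook and the nearest-codeword (Hamming) decoding are illustrative assumptions; RECOC itself decodes iteratively over an LDPC code's Tanner graph rather than by nearest codeword.

    # Minimal ECOC sketch: multiclass learning via binary learners and codewords.
    # Illustrative only; RECOC replaces the nearest-codeword decoding below with
    # iterative message passing over an LDPC code's Tanner graph.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def ecoc_fit(X, y, codebook):
        """Train one binary learner per codeword bit (column of the codebook)."""
        learners = []
        for bit in range(codebook.shape[1]):
            z = codebook[y, bit]          # relabel each sample by its class's bit
            learners.append(DecisionTreeClassifier(max_depth=3).fit(X, z))
        return learners

    def ecoc_predict(X, learners, codebook):
        """Decode by Hamming distance to the nearest class codeword."""
        bits = np.column_stack([clf.predict(X) for clf in learners])
        dists = (bits[:, None, :] != codebook[None, :, :]).sum(axis=2)
        return dists.argmin(axis=1)

    # Hypothetical codebook: 4 classes encoded with 6 redundant bits.
    codebook = np.array([[0, 0, 0, 1, 1, 1],
                         [0, 1, 1, 0, 0, 1],
                         [1, 0, 1, 0, 1, 0],
                         [1, 1, 0, 1, 0, 0]])
    # Usage (X is a feature matrix, y holds integer class labels 0..3):
    # learners = ecoc_fit(X, y, codebook)
    # y_hat = ecoc_predict(X, learners, codebook)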
A Methodological Proposal for Multiagent Systems Development Extending CommonKADS. Carlos A. Iglesias, Mercedes Garijo, José C. González and Juan R. Velasco. Dep. de Teoría de la Señal, Comunicaciones e Ing. ...
This paper describes the development of TASS, an experimental evaluation workshop within SEPLN to foster research in the field of sentiment analysis in social media, specifically focused on the Spanish language. The main objective is to promote the design of new techniques and algorithms, and the application of existing ones, for the implementation of complex systems able to perform sentiment analysis on short text opinions extracted from social media (specifically Twitter). This paper describes the proposed tasks; the content, format and main statistics of the generated corpus; the participants and the different approaches submitted; and the overall results achieved. Keywords: TASS, reputation analysis, sentiment analysis, social media.
This paper describes the participation of MIRACLE in the NTCIR 2005 CLIR task. Although our group has a strong background and long expertise in Computational Linguistics and Information Retrieval applied to European languages using Latin and Cyrillic alphabets, this was our first attempt at East Asian languages. Our main goal was to study the particularities and distinctive characteristics of Japanese, Chinese and Korean, especially focusing on the similarities and differences with European languages, and to carry out research on CLIR tasks that include those languages. The basic idea behind our participation in NTCIR is to test whether the same familiar linguistic-based techniques may also be applicable to East Asian languages, and to study the necessary adaptations.
This paper describes our participation at the SemEval-2014 sentiment analysis task, in both contextual and message polarity classification. Our idea was to compare two different techniques for sentiment analysis: first, a machine learning classifier specifically built for the task using the provided training corpus; second, a lexicon-based approach using natural language processing techniques, developed for a generic sentiment analysis task with no adaptation to the provided training corpus. Results, though far from the best runs, prove that the generic model is more robust, as it achieves a more balanced evaluation for message polarity across the different test sets.
Abstract. This paper presents the 2006 MIRACLE team's approaches to the AdHoc and Geographical Information Retrieval tasks. A first set of runs was obtained using a set of basic components. Then, by putting together special combinations of these runs, an extended set was obtained. With respect to previous campaigns, some improvements have been introduced in our system: an entity recognition prototype is integrated into our tokenization scheme, and the performance of our indexing and retrieval engine has been improved. For GeoCLEF, we tested retrieval using geo-entity and textual references separately, and then combined them with different approaches.
Abstract. ImageCLEF is a pilot experiment run at CLEF 2003 for cross-language image retrieval using textual captions related to image contents. In this paper, we describe the participation of the MIRACLE research team (Multilingual Information RetrievAl at CLEF), detailing the different experiments and discussing their preliminary results.
This paper presents an automatic system to collect, store, analyze and visualize, in an aggregated way, information published in the media about certain organizations, together with the opinions expressed about them by users on social networks. This system makes it possible to automate the production of a complete and detailed reputation analysis, along different dimensions and in real time, allowing an organization to know its position in the market, measure its evolution, compare itself with its competitors, and detect problematic situations as early as possible in order to take corrective measures. Keywords: reputation, information extraction, semantic analysis, sentiment analysis, classification, opinion, social networks, RSS.
For the MIRACLE participation in WebCLEF 2005, a set of independent indexes was constructed for each top-level domain of the EuroGOV collection. Each of these indexes contains information extracted from the document, such as URL, title, keywords, detected named entities or HTML headers. These indexes are queried to obtain partial document rankings, which are combined with various relative weights to test the value of each index. The trie-based indexing and retrieval engine developed by the MIRACLE team is now fully functional and has been adapted to the WebCLEF environment and employed in this campaign. Other tools, such as the named entity recognizer based on a finite automaton, have also been developed.
This paper presents the 2005 MIRACLE team's approach to Monolingual Information Retrieval. The goal for this year's experiments was twofold: to continue testing the effect of combination approaches on information retrieval tasks, and to improve our basic processing and indexing tools, adapting them to new languages with unusual encoding schemes. The starting point was a set of basic components: stemming, transforming, filtering, proper noun extraction, paragraph extraction, and pseudo-relevance feedback. Some of these basic components were used in different combinations and orders of application for document indexing and for query processing. Second-order combinations were also tested, by averaging or selectively combining the documents retrieved by different approaches for a particular query.
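As an illustration of such a second-order combination, the sketch below averages max-normalized scores across runs. The dictionary layout and the normalization are assumptions for the example, not the exact MIRACLE implementation.

    # Sketch of a second-order combination: average normalized scores across runs.
    # Hypothetical data layout: each run maps doc_id -> retrieval score.
    def combine_runs(runs):
        combined = {}
        for run in runs:
            top = max(run.values()) or 1.0      # normalize by the run's top score
            for doc, score in run.items():
                combined[doc] = combined.get(doc, 0.0) + score / top
        n = len(runs)
        return sorted(((d, s / n) for d, s in combined.items()),
                      key=lambda x: -x[1])

    run_a = {"d1": 9.0, "d2": 4.5, "d3": 1.0}
    run_b = {"d2": 7.0, "d4": 6.0}
    print(combine_runs([run_a, run_b])[:3])     # d2 rises by agreement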
This paper describes the participation of the MIRACLE-GSI research consortium at the ImageCLEF 2009 Photo Retrieval Task. For this campaign, the main purpose of our experiments was to compare the performance of a “standard” clustering algorithm, based on the k-Medoids algorithm, against a simpler classification technique that makes use of the cluster assignment that was provided for a subset of topics by the task organizers. First, a common baseline algorithm was used in all experiments to process the document collection: text extraction, tokenization, conversion to lowercase, filtering, stemming and, finally, indexing and retrieval. Then this baseline algorithm is combined with these two different result-reranking techniques. As expected, results show that either reranking method outperforms a standard non-clustering image search baseline in terms of cluster recall. In addition, using the information of cluster assignments leads to the best results.
This paper presents the 2005 MIRACLE team's participation in the ImageCLEFmed task of ImageCLEF 2005. This task certainly requires the use of image retrieval techniques and therefore it is mainly aimed at image analysis research groups. Although our areas of expertise don't include image analysis research, we decided to make the effort to participate in this task to promote and encourage multidisciplinary participation in all aspects of information retrieval, no matter whether it is text or content based. We resort to a publicly available image retrieval system (GIFT [1]) when needed.
This paper presents the 2006 MIRACLE team's approach to the AdHoc Information Retrieval track. The experiments for this campaign continue testing our IR approach. First, a baseline set of runs is obtained, including standard components: stemming, transforming, filtering, entity detection and extraction, and others. Then, an extended set of runs is obtained using several types of combinations of these baseline runs. The improvements introduced for this campaign are few: we have integrated an entity recognition and indexing prototype tool into our tokenization scheme, and we have run more combining experiments for the robust multilingual case than in previous campaigns. However, no significant improvements have been achieved. For this campaign, runs were submitted for the following languages and tracks: - Monolingual: Bulgarian, French, Hungarian, and Portuguese. - Bilingual: English to Bulgarian, French, Hungarian, and Portuguese; Spanish to French and Portuguese; and Fren...
This paper describes our participation at the RepLab 2014 reputation dimensions scenario. Our idea was to evaluate the best combination strategy of a machine learning classifier with a rule-based algorithm based on logical expressions of terms. Results show that our baseline experiment using just Naive Bayes Multinomial with a term vector model representation of the tweet text is ranked second among runs from all participants in terms of accuracy.
This paper describes the participation of the DAEDALUS team at the WebPS-3 Task 1, regarding Web People Search. The focus of our research is to evaluate and compare the computational requirements and results achieved by different solutions based on the minimization of cost functions applied to clustering algorithms. Our clustering technique is based on an implementation of the k-Medoids algorithm, run over a sparse term-document matrix built with the terms of the pages associated to each of the person names. We define an empty cluster that holds all the individuals that are not part of any other cluster. Based on the results obtained, we can conclude that although clustering techniques play a very relevant role in the resolution of the problem of name homonymy in a set of web pages, there is a previous challenge still to solve: how to determine which contents are relevant for describing the person in that webpage, and which are not part of the other navigational information contai...
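A minimal sketch of the clustering step, assuming a precomputed cosine-distance matrix over the term-document vectors; the PAM-style medoid update and the 0.9 distance fence used to populate the empty cluster are illustrative choices, not the DAEDALUS parameters.

    # PAM-style k-Medoids over pairwise distances, with an extra "empty
    # cluster" (label -1) for pages too far from every medoid. The 0.9
    # cosine-distance threshold is a hypothetical parameter.
    import numpy as np

    def k_medoids(D, k, iters=20, far=0.9):
        """D: precomputed pairwise cosine-distance matrix, shape (n, n)."""
        n = D.shape[0]
        medoids = np.random.default_rng(0).choice(n, k, replace=False)
        for _ in range(iters):
            labels = D[:, medoids].argmin(axis=1)        # assign to nearest medoid
            for c in range(k):
                members = np.flatnonzero(labels == c)
                if len(members):
                    # new medoid = member minimizing total distance to the cluster
                    costs = D[np.ix_(members, members)].sum(axis=0)
                    medoids[c] = members[costs.argmin()]
        labels = D[:, medoids].argmin(axis=1)
        labels[D[:, medoids].min(axis=1) > far] = -1     # send outliers to the empty cluster
        return labels, medoids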
This paper presents the 2005 MIRACLE team's approach to Cross-Language Geographical Retrieval (GeoCLEF). The main goal of the MIRACLE team's GeoCLEF participation was to test the effect that geographical information retrieval techniques have on information retrieval. The baseline approach is based on the development of named entity recognition and geospatial information retrieval tools, and on their combination with linguistic techniques to perform indexing and retrieval tasks.
This paper describes the participation of DAEDALUS at the LogCLEF lab in CLEF 2011. This year, the objectives of our participation are twofold. The first is to analyze whether there is any measurable effect on the success of the search queries when the native language and the interface language chosen by the user differ. The idea is to determine whether this difference may condition the way in which the user interacts with the search application. The second is to analyze the user context and his/her interaction with the system in the case of successful queries, to discover any relation among the user's native language, the language of the resource involved, and the interaction strategy adopted by the user to find that resource. Only 6.89% of queries are successful out of the 628,607 queries in the 320,001 sessions with at least one search query in the log. The main conclusion that can be drawn is that, in general for all languages, whether the native language matches the...
This paper describes the participation of DAEDALUS at the ImageCLEF 2011 Plant Identification task. The task is evaluated as a supervised classification problem over 71 tree species from the French Mediterranean area used as class labels, based on visual content from scan, scan-like and natural photo images. Our approach to this task is to build a classifier based on the detection of keypoints extracted from the images using Lowe's Scale Invariant Feature Transform (SIFT) algorithm. Although our overall classification score is very low compared to other participant groups, the main conclusion that can be drawn is that SIFT keypoints seem to work significantly better for photos than for the other image types, so our approach may be a feasible strategy for the classification of this kind of visual content.
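A brief OpenCV sketch of the SIFT-plus-matching idea, assuming OpenCV >= 4.4 (where SIFT_create is available in the main module); the species names, file paths and the 0.75 ratio-test value are hypothetical, not the system's actual configuration.

    # Sketch of SIFT keypoint matching as a classification score.
    import cv2

    sift = cv2.SIFT_create()
    matcher = cv2.BFMatcher(cv2.NORM_L2)

    def descriptors(path):
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, des = sift.detectAndCompute(img, None)
        return des

    def match_score(query_des, train_des, ratio=0.75):
        """Count Lowe's-ratio-test matches between two descriptor sets."""
        matches = matcher.knnMatch(query_des, train_des, k=2)
        return sum(1 for m, n in matches if m.distance < ratio * n.distance)

    # Classify a query image by the best-matching training species (hypothetical):
    # train = {"Quercus ilex": descriptors("oak.jpg"),
    #          "Pinus halepensis": descriptors("pine.jpg")}
    # q = descriptors("query.jpg")
    # best = max(train, key=lambda sp: match_score(q, train[sp]))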
In this paper, we describe the algorithm that has been used to carry out our plagiarism detection within the context of the PAN10 competition. Our system is based on the Lempel-Ziv distance, which is applied to extract structural information from texts. The algorithm then tries to find outliers in the vector of distances between each fragment of the text and the whole document itself.
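A sketch of the idea under stated assumptions: the LZ78-style phrase count and the normalization below are common choices for a Lempel-Ziv-style distance, not necessarily the exact definition used by this system, and outliers are flagged with a simple z-score test.

    # Lempel-Ziv-style fragment-vs-document distance with a z-score outlier test.
    def lz_phrases(s):
        """Number of distinct phrases in an LZ78-style parse of s."""
        seen, phrase, count = set(), "", 0
        for ch in s:
            phrase += ch
            if phrase not in seen:
                seen.add(phrase)
                count += 1
                phrase = ""
        return count + (1 if phrase else 0)

    def lz_distance(fragment, document):
        """How little the document helps to 'compress' the fragment."""
        return (lz_phrases(document + fragment)
                - lz_phrases(document)) / lz_phrases(fragment)

    def outliers(fragments, document, z=2.0):
        d = [lz_distance(f, document) for f in fragments]
        mu = sum(d) / len(d)
        sd = (sum((x - mu) ** 2 for x in d) / len(d)) ** 0.5 or 1.0
        return [i for i, x in enumerate(d) if (x - mu) / sd > z]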
In this paper, a highly configurable, real-time analysis system to automatically record, analyze and visualize high-level aggregated information on user interventions in Twitter is described. The system is designed to provide public entities with a powerful tool to rapidly and easily understand citizen behaviour trends and citizens' opinions about city services, events, and so on, and it may also be used as a primary alert system that can improve the efficiency of emergency systems. The citizen is here observed as a proactive city sensor capable of generating huge amounts of very rich, high-level and valuable data through social media platforms which, after being properly processed, summarized and annotated, allow city administrators to better understand citizen needs. The architecture and component blocks are described, and some key details of the design, implementation and application scenarios are discussed.
This paper describes the participation of the MIRACLE research consortium at the NTCIR-7 Multilingual Opinion Analysis Task, our first attempt at sentiment analysis and second at East Asian languages. We took part in the main mandatory opinionated-sentence judgment subtask (deciding whether each sentence expresses an opinion or not) and in the optional relevance and polarity judgment subtasks (deciding whether a given sentence is relevant to the given topic, and the polarity of the expressed opinion). Our approach combines a semantic, language-dependent tagging of the terms of the sentence and the topic with three different ad-hoc classifiers, run in cascade, that provide the specific annotation for each subtask. These models have been trained with the corpus provided in the NTCIR-6 Opinion Analysis pilot task.
This paper describes the participation of DAEDALUS at the ImageCLEF 2011 Medical Retrieval task. We have focused on multimodal (or mixed) experiments that combine textual and visual retrieval. The main objective of our research has been to evaluate the effect on the medical retrieval process of the existence of an extended corpus that is annotated with the image type, associated to both the image itself and also to its textual description. For this purpose, an image classifier has been developed to tag each document with its class (1st level of the hierarchy: Radiology, Microscopy, Photograph, Graphic, Other) and subclass (2nd level: AN, CT, MR, etc.). For the textual-based experiments, several runs using different semantic expansion techniques have been performed. For the visual-based retrieval, different runs are defined by the corpus used in the retrieval process and the strategy for obtaining the class and/or subclass. The best results are achieved in runs that make use of t...
This paper presents the participation of the MIRACLE team at the ImageCLEF interactive search task. Basically, queries consisting of several terms can be processed by combining their words using either an AND function or an OR function. The AND approach forces the user to use precise vocabulary, and query terms must exactly match the terms in the index for the target to be found. However, this is quite difficult to integrate in cross-lingual systems with automatic translation, as many terms can turn out to be ambiguous and accept different translation options. The OR approach allows less precise vocabulary and more ambiguous translations, and relevance feedback can also be used to achieve the search goals. From the user's point of view, the AND approach seems more intuitive because the system responses can be made as precise as wanted, simply by adding more words to the query. On the other hand, with the OR approach, the more terms are included in the query, the more images are pr...
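The two query semantics can be made concrete with a toy inverted index; this is a didactic sketch, not the MIRACLE engine, and the captions are invented examples.

    # Toy inverted index illustrating AND vs. OR query semantics.
    from collections import defaultdict

    index = defaultdict(set)              # term -> set of image/caption ids

    def add(doc_id, caption):
        for term in caption.lower().split():
            index[term].add(doc_id)

    def search(terms, mode="AND"):
        postings = [index[t] for t in terms]
        if not postings:
            return set()
        op = set.intersection if mode == "AND" else set.union
        return op(*postings)

    add(1, "red bus in London"); add(2, "red telephone box")
    print(search(["red", "bus"], "AND"))   # {1}: precise, must match all terms
    print(search(["red", "bus"], "OR"))    # {1, 2}: recall grows with each term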
This work has been partially funded by the Ciudad2020 project, "Hacia un nuevo modelo de ciudad inteligente sostenible" (INNPRONTA IPT-20111006).
This paper describes the participation of the MIRACLE research consortium at the ImageCLEFmed task of ImageCLEF 2008. The main goal of our participation this year is to compare different topic expansion approaches: methods based on linguistic information such as thesauri or knowledge bases, and statistical techniques based on term frequency. Thus we focused on runs using text features only. First, a common baseline algorithm was used in all experiments to process the document collection: text extraction, medical-vocabulary recognition, tokenization, conversion to lowercase, filtering, stemming, and indexing and retrieval. Then this baseline algorithm is combined with different expansion techniques. For the semantic expansion, the MeSH concept hierarchy using UMLS entities as basic root elements was used. The statistical method consisted of expanding the topics using the apriori algorithm. Relevance-feedback techniques were also used.
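A minimal sketch of the statistical expansion idea: the run used the apriori algorithm proper, whereas the simplified version below just keeps terms whose co-occurrence support alongside the topic exceeds a hypothetical min_support threshold.

    # Co-occurrence-based topic expansion in the spirit of apriori expansion.
    # min_support is a hypothetical parameter, not the value used in the runs.
    from collections import Counter

    def expand(topic_terms, documents, min_support=0.05):
        hits = [set(d.lower().split()) for d in documents]
        hits = [h for h in hits if h & set(topic_terms)]   # docs mentioning the topic
        counts = Counter(t for h in hits for t in h)
        n = len(hits) or 1
        return {t for t, c in counts.items()
                if c / n >= min_support and t not in topic_terms}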
There is a great number of business rules engines on the market, each one with its own rules language. As a result, the need for a standard rules definition language is obvious. In this demo paper, we explain the application of a combination of standard business rules languages to develop a business rules management tool, K-Site Rules. The main goals of this tool are, first, to integrate the development process for business rules into the software development lifecycle and, second, to make this development independent of rule engine products. K-Site Rules does not include its own rule engine implementation but acts as a broker to the preferred rule engine. In this way, business rules can be reused across different rule engines (which can change for economic or performance reasons) and across different projects in the company. The K-Site Rules tool is part of ITECBAN, a project devoted to the definition of a new platform core for banking applications.
This paper describes our participation at the PAN 2014 author profiling task. Our idea was to define, develop and evaluate a simple machine learning classifier able to guess the gender and the age of a given user based on his/her texts, which could become part of the solution portfolio of the company. We were interested in finding not the best possible classifier that achieves the highest accuracy, but the optimum balance between performance and throughput, using the simplest strategy and the one least dependent on external systems. Results show that our software, using Naive Bayes Multinomial with a term vector model representation of the text, ranks quite well among the rest of participants in terms of accuracy.
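A baseline close to the one described can be assembled in a few lines of scikit-learn; the toy texts and the combined gender/age labels below are illustrative assumptions, not the PAN corpus.

    # Term-vector features fed to Multinomial Naive Bayes: the simple,
    # dependency-light baseline described above.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    clf = make_pipeline(CountVectorizer(), MultinomialNB())
    texts = ["i love this phone", "worst movie ever"]    # toy training data
    labels = ["female_18-24", "male_25-34"]              # hypothetical classes
    clf.fit(texts, labels)
    print(clf.predict(["this phone is great"]))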
One of the proposed tasks of the ImageCLEF 2005 campaign has been an Automatic Annotation Task. The objective is to provide the classification of a given set of 1,000 previously unseen medical (radiological) images according to 57 predefined categories covering different medical pathologies. 9,000 classified training images are given which can be used in any way to train a classifier. The Automatic Annotation task uses no textual information, but image-content information only. This paper describes our participation in the automatic annotation task of ImageCLEF 2005.
This paper describes the participation of DAEDALUS at the ImageCLEF 2010 Wikipedia Retrieval task. The main focus of our experiments is to evaluate the impact in the image retrieval process of the incorporation of semantic information extracted only from the textual information provided as metadata of the image itself, as compared to expanding with contextual information gathered from the document where the image is referred. For the semantic annotation, the DBpedia ontology and the YAGO classification schema are used. As expected, the obtained results show that, in general, the textual information attached to a given image is not able to fully represent certain features of the image. Furthermore, the use of semantic information in the process of multimedia information extraction poses two hard challenges still to solve: how to automatically extract the high-level features associated to a multimedia resource, and, once the resource has been semantically tagged, which features must be ...
Non-binary learning problems can be broken down into a redundant set of binary ones by means of RECOC schemes, a generalization of Dietterich's ECOC learning models involving recursive error correcting codes. The use of recursive codes allows the modeling of distributed learning strategies by means of Tanner graphs and general message passing algorithms on them. In this paper, RECOC learning based on Product Accumulated (PA) codes is analyzed.
This paper describes TASS, an experimental evaluation workshop within SEPLN to foster research in the field of sentiment analysis in social media, specifically focused on the Spanish language. The main objective is to promote the application of existing state-of-the-art algorithms and techniques, and the design of new ones, for the implementation of complex systems able to perform sentiment analysis based on short text opinions extracted from social media messages (specifically Twitter) published by representative personalities. The paper presents the proposed tasks; the contents, format and main statistics of the generated corpus; the participant groups and their different approaches; and, finally, the overall results achieved.
This paper discusses a novel hybrid approach for text categorization that combines a machine learning algorithm, which provides a base model trained with a labeled corpus, with a rule-based expert system, which is used to improve the results provided by the previous classifier by filtering false positives and dealing with false negatives. The main advantage is that the system can be easily fine-tuned by adding specific rules for those noisy or conflicting categories that have not been successfully trained. We also describe an implementation based on k-Nearest Neighbor and a simple rule language to express lists of positive, negative and relevant (multiword) terms appearing in the input text. The system is evaluated in several scenarios, including the popular Reuters-21578 news corpus for comparison to other approaches, and categorization using IPTC metadata, the EUROVOC thesaurus and others. Results show that this approach achieves a precision that is comparable to top-ranked methods, ...
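A compact sketch of the hybrid idea, assuming hypothetical rule contents and toy training data; the real system uses a richer rule language with relevant multiword terms, which is omitted here.

    # A trained classifier proposes a category; a per-category rule with
    # positive/negative term lists vetoes likely false positives.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline

    rules = {"sports": {"positive": {"match", "league"},   # hypothetical rule
                        "negative": {"stock"}}}

    def hybrid_predict(model, text):
        label = model.predict([text])[0]
        rule, words = rules.get(label), set(text.lower().split())
        if rule and (words & rule["negative"] or not words & rule["positive"]):
            return None          # filtered out as a likely false positive
        return label

    model = make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=1))
    model.fit(["the match went to extra time", "stock markets fell today"],
              ["sports", "finance"])          # toy training corpus
    print(hybrid_predict(model, "league match tonight"))   # 'sports' passes the rule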
This paper describes the participation of the MIRACLE research consortium at the ImageCLEF Medical Image Annotation task of ImageCLEF 2008. A lot of effort was invested this year to develop our own image analysis system, based on MATLAB, to be used in our experiments. This system extracts a variety of global and local features including histogram, image statistics, Gabor features, fractal dimension, DCT and DWT coefficients, Tamura features and co-occurrence matrix statistics. Then a k-Nearest Neighbour algorithm analyzes the extracted image feature vectors to determine the IRMA code associated to a given image. The focus of our experiments is mainly to test and evaluate this system in depth and to compare diverse configuration parameters, such as the number of images used for relevance feedback in the classification module.
This paper describes our participation at the RepLab 2012 profiling scenario, in both the polarity classification and filtering subtasks. Our approach is based on 1) the information provided by a semantic model that includes rules and resources annotated for sentiment analysis, 2) a detailed morphosyntactic analysis of the input text that lemmatizes and divides the text into segments, in order to control the scope of semantic units and perform a fine-grained detection of negation in clauses, and 3) the use of an aggregation algorithm to calculate the global polarity value of the text based on the local polarity values of the different segments, which includes an outlier filter. The system, experiments and results are presented and discussed in the paper.
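A sketch of step 3 under an assumption: the 1.5-IQR fence below is one plausible outlier filter before averaging segment polarities, not necessarily the one used in the actual system.

    # Aggregate segment polarity scores into a global value, dropping outliers.
    def global_polarity(segment_scores):
        s = sorted(segment_scores)
        q1, q3 = s[len(s) // 4], s[(3 * len(s)) // 4]
        fence = 1.5 * (q3 - q1)                 # hypothetical IQR-based filter
        kept = [x for x in s if q1 - fence <= x <= q3 + fence]
        return sum(kept) / len(kept)

    print(global_polarity([0.4, 0.2, 0.3, -3.0, 0.5]))   # the -3.0 outlier is dropped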
This paper discusses several techniques for the automatic generation of rules to be used in a novel hybrid method for text categorization. This approach combines a machine learning algorithm with different rule-based expert systems in cascade, used to filter and rerank the output of the base model provided by the previous classifier. This paper describes an implementation based on the kNN algorithm and a basic rule language that expresses lists of terms appearing in the text. The popular Reuters-21578 news corpus is used for testing. Results show that the proposed methods for automatic rule generation achieve precision values that are very similar to the ones achieved by manually defined rule sets, and that this hybrid approach achieves a precision that is comparable to other top state-of-the-art methods.
This paper describes the participation of the MIRACLE-GSI research consortium at the ImageCLEFphoto task of ImageCLEF 2008. For this campaign, the main purpose of our experiments was to evaluate different strategies for topic expansion in a pure textual retrieval context. Two approaches were used: methods based on linguistic information such as thesauri, and statistical methods that use term frequency. First, a common baseline algorithm was used in all experiments to process the document collection: text extraction, tokenization, conversion to lowercase, filtering, stemming and finally, indexing and retrieval. Then this baseline algorithm is combined with different expansion techniques. For the semantic expansion, we used WordNet to expand topic terms with related terms. The statistical method consisted of expanding the topics using Agrawal's apriori algorithm. Relevance-feedback techniques were also used. Last, the result list is reranked using an implementation of k-Medoids clust...
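A minimal sketch of the WordNet-based semantic expansion, assuming NLTK with the wordnet corpus already downloaded; the max_terms cap is an arbitrary choice for the example.

    # Expand a topic term with WordNet lemma names (synonyms).
    from nltk.corpus import wordnet as wn

    def expand_topic(term, max_terms=5):
        related = set()
        for synset in wn.synsets(term):
            related.update(l.name().replace("_", " ") for l in synset.lemmas())
        related.discard(term)
        return sorted(related)[:max_terms]

    print(expand_topic("beach"))   # e.g. adds 'seashore', depending on the corpus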
This paper describes the participation of DAEDALUS at the LogCLEF task. The focus of our experiments was to study whether the difference between the native language of the user and the interface language could affect the way in which the user interacts with the search application and the success of the search queries. First, the provided log data was parsed into 194,040 sessions containing the set of sequential actions carried out by the same user. Then, only those sessions that include at least one search query were selected, 16% of the total number of sessions. Within that session set, a total number of 388,272 queries were run, only 6.45% of which were successful, i.e. returned any result, thus resulting in 10.6% of successful sessions. After a statistical correlation analysis of these figures, the main conclusion that can be drawn is that, in the general case, the fact that the native language is used or not as the interface language doesn't seem to affect the success...
The main objective of the designed experiments is to test the effects of geographical information retrieval from documents that contain geographical tags. In the designed experiments we try to isolate geographical retrieval from textual retrieval by replacing all geo-entity textual references in topics with associated tags, and by splitting the retrieval process into two phases: textual retrieval from the textual part of the topic without geo-entity references, and geographical retrieval from the tagged text generated by the topic tagger. Textual and geographical results are combined applying different techniques: union, intersection, difference, and external-join-based combinations.
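The set-based combinations can be sketched directly; the run layout (doc_id to score dictionaries) and the score-sum ordering are assumptions for the example.

    # Combine a textual and a geographical ranking with set operations.
    def combine(text_run, geo_run, mode="union"):
        t, g = set(text_run), set(geo_run)
        docs = {"union": t | g, "intersection": t & g, "difference": t - g}[mode]
        return sorted(docs,
                      key=lambda d: -(text_run.get(d, 0) + geo_run.get(d, 0)))

    text_run = {"doc1": 0.9, "doc2": 0.4}
    geo_run  = {"doc2": 0.8, "doc3": 0.7}
    print(combine(text_run, geo_run, "intersection"))   # ['doc2']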
The hypothesis boosting concept can be understood as a kind of divide-and-conquer strategy for the design of low complexity classifiers. The aim of this paper is to show the feasibility of boosting algorithms in high dimension feature spaces (HDFS). A recursive learning model inspired by the design of recursive error correcting codes is proposed, with the main focus on the binary classification problem.
Recursive ECOC (RECOC) classifiers effectively deal with microarray data complexity by encoding multiclass labels with codewords taken from Low Density Parity Check (LDPC) codes. Not all good LDPC codes result in good microarray data RECOC classifiers. A general scoring method for the identification of promising LDPC codes in the RECOC sense is presented.
This paper discusses a novel method for text categorization that combines a machine learning algorithm, able to build a base model with low effort from an available labeled corpus, with a rule-based expert system in cascade used to filter and rerank the output of the previous classifier. The model can be fine-tuned by adding specific rules for those difficult classes that have not been successfully trained. We describe an implementation based on the kNN algorithm and a basic rule language that expresses lists of terms appearing in the text. The system is trained and evaluated in different scenarios, including the popular Reuters-21578 news corpus for comparison to other approaches, and the IPTC and EUROVOC models. Results show that this approach achieves a precision that is comparable to other top state-of-the-art methods.
