Supplemental material, sj-pdf-1-smx-10.1177_00811750211053370 for Language Models in Sociological Research: An Application to Classifying Large Administrative Data and Measuring Religiosity by Jeffrey L. Jensen, Daniel Karell, Cole Tanigawa-Lau, Nizar Habash, Mai Oudah and Dhia Fairus Shofia Fani in Sociological Methodology
In this paper, we propose a hybrid named entity recognition (NER) approach that combines the advantages of rule-based and machine learning-based approaches in order to improve overall system performance and overcome the knowledge elicitation bottleneck and the lack of resources for underdeveloped languages that require deep language processing, such as Arabic. The complexity of Arabic poses special challenges to researchers of Arabic NER, which is essential for both monolingual and multilingual applications. We used the hybrid approach to develop an Arabic NER system that is capable of recognizing 11 types of Arabic named entities: Person, Location, Organization, Date, Time, Price, Measurement, Percent, Phone Number, ISBN and File Name. Extensive experiments were conducted using decision trees, Support Vector Machines and logistic regression classifiers to evaluate the system performance. The empirical results indicate that the hybrid approach outperforms both the rule-based and th...
Figure S1. The PCoA plot of The Human Microbiome Project Consortium (2012) dataset, generated with the beta_diversity_through_plots.py script available in QIIME
Figure S2. The PCoA plot provided in the meta-analysis of environmental microbiomes conducted by Henschel et al. (2015)
Figure S3. The PCoA plot of the combined CRC dataset
Figure S4. Comparison between the baseline and HFE confusion matrices when applied to the CRC1 dataset (Zeller et al., 2014) for Cancer vs. Normal classification
Figure S5. Comparison between the baseline and HFE confusion matrices when applied to the CRC2 dataset (Zackular et al., 2014) for Cancer vs. Normal classification
Figure S6. Comparison between the baseline and HFE confusion matrices when applied to the CRC1 + 2 dataset for Cancer vs. Normal classification
Figure S7. Comparison between the baseline and HFE confusion matrices when applied on CRC1 + 2
Figure S8. The taxonomic tree of all the informative features extracted by the HFE method for Cancer v...
Named Entity Recognition (NER) is an essential task for many natural language processing systems, which makes use of various linguistic resources. NER becomes more complicated when the language in use is morphologically rich and structurally complex, such as Arabic. This language has a set of characteristics that makes it particularly challenging to handle. In previous work, we proposed an Arabic NER system that follows the hybrid approach, i.e., it integrates both rule-based and machine learning-based NER approaches. Our hybrid NER system is the state of the art in Arabic NER according to its performance on standard evaluation datasets. In this article, we discuss a novel methodology for overcoming the coverage drawback of rule-based NER systems in order to improve their performance and allow for automated rule updates. The presented mechanism utilizes the recognition decisions made by the hybrid NER system in order to identify the weaknesses of the rule-based component and deriv...
International Journal of Hypertension | Research Article, 2022
Emerging studies have revealed a strong link between the gut microbiome and several human diseases. Since the human gut microbiome mirrors variations in lifestyle and environment, whether associations between disease conditions and the gut microbiome are consistent across populations, particularly in communities practicing traditional subsistence strategies whose microbiomes differ markedly from those of industrialized populations, remains unknown. Cardiovascular diseases are the leading cause of mortality in India, affecting 55 million people, and high blood pressure is one of the primary risk factors for cardiovascular diseases. We examined associations between the gut microbiome and blood pressure, along with 14 other variables associated with lifestyle, dietary habits, disease conditions, and clinical blood markers, in three Assamese populations. Our analysis reveals a robust link between gut microbiome diversity and composition and systolic blood pressure. Moreover, several genera previously associated with hypertension in non-Indian populations were also associated with systolic blood pressure in this cohort, and these genera were predictors of elevated blood pressure in these populations. These findings offer opportunities to design personalized, preventative, and targeted interventions harnessing the gut microbiome to tackle the burden of cardiovascular diseases in India.
Neural networks have become the state-of-the-art approach for machine translation (MT) in many languages. While linguistically motivated tokenization techniques were shown to have significant effects on the performance of statistical MT, it remains unclear whether those techniques are well suited for neural MT. In this paper, we systematically compare neural and statistical MT models for Arabic-English translation on data preprocessed by various prominent tokenization schemes. Furthermore, we consider a range of data and vocabulary sizes and compare their effect on both approaches. Our empirical results show that the best choice of tokenization scheme is largely based on the type of model and the size of data. We also show that we can gain significant improvements using a system selection that combines the output from neural and statistical MT.
A common bottleneck for developing machine translation (MT) systems for some language pairs is the lack of direct parallel translation data sets, both in general and in certain domains. Alternative solutions such as zero-shot models or pivoting techniques succeed in providing a strong baseline, but often fall short of systems for better-supported language pairs. In this paper, we focus on Arabic-Japanese machine translation, a less studied language pair, and we work with a unique parallel corpus of Arabic news articles that were manually translated to Japanese. We use this parallel corpus to adapt a state-of-the-art domain/genre-agnostic neural MT system via a simple automatic post-editing technique. Our results and detailed analysis suggest that this approach is quite viable for less supported language pairs in specific domains.
We present CAMeL Tools, a collection of open-source tools for Arabic natural language processing in Python. CAMeL Tools currently provides utilities for pre-processing, morphological modeling, Dialect Identification, Named Entity Recognition and Sentiment Analysis. In this paper, we describe the design of CAMeL Tools and the functionalities it provides.
Papers by Mai Oudah