PheneBank

About

Free-text scientific literature is a potentially valuable source of data for uncovering the often hidden relationships between genes, diseases and phenotypes. Phenotypic descriptions cover abnormalities in anatomical structures, processes and behaviours, for example 'growth delay' and 'body weight loss'. Such descriptions form the basis for determining the existence and treatment of a disease but, because of their inherent complexity, have previously received less attention from the text mining community. In recent years, a small number of expert curators have invested significant effort in creating coding systems for phenotypes (called "ontologies"), such as the Human Phenotype Ontology (HP) and the Mammalian Phenotype Ontology (MP). The PheneBank project proposes to support and speed up curation using terms discovered directly from the literature and to automatically integrate them with such standard ontologies.

The project seeks to mine texts for statistically significant associations between phenotypes, diseases and genes. Earlier approaches did not provide deep semantic descriptions of the relations they targeted, so their association scores merge genetic, pharmacological, epidemiological and other relations without distinction. Our parsing-based approach attempts to overcome this issue by discovering more precise relationships. The approach builds on groundbreaking research at the European Bioinformatics Institute by the PI (Collier) and at the Wellcome Trust Sanger Institute by the Co-investigator (Smedley), including terminology alignment of phenotypes using pairwise scoring of the conceptual elements that make up the phenotype.

Publications

  • Verspoor, K., Oellrich, A., Collier, N., Groza, T., Rocca-Serra, P., Soldatova, L., ... & Shah, N. (2016), "Thematic issue of the Second combined Bio-ontologies and Phenotypes Workshop", Journal of Biomedical Semantics, 7(1), 60. [html]
  • Le, H.Q., Tran, M.V., Dang, T.H., Ha, Q.T. and Collier, N. (2016), "Sieve-based coreference resolution enhances semi-supervised learning model for chemical-induced disease relation extraction", in Database, Oxford University Press, vol. 2016: article ID baw102; DOI: 10.1093/database/baw102. [pdf]
  • Pilehvar, M. T., Camacho-Collados, J., Navigli, R. and Collier, N. (2017), "Towards a Seamless Integration of Word Senses into Downstream NLP Applications", in Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL 2017), Vancouver, Canada, August (in press).
  • Pilehvar, M. T. and Collier, N. (2017), "Inducing Representations for Rare Words by Leveraging Lexical Resources", in Proceedings of the European Chapter of the Association for Computational Linguistics (EACL), Valencia, Spain (in press) [pdf].
  • Pilehvar, M. T. and Collier, N. (2016), "De-conflating semantic representations of words by exploiting knowledge from semantic networks", in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2016), Austin, USA, November 1st to 5th [pdf].
  • Pilehvar, M. T. and Collier, N. (2016), "Improved Semantic Representation for Domain-Specific Entities", in Proc. BioNLP 2016 at the 2016 Annual Meeting of the Association for Computational Linguistics (ACL 2016), Berlin, Germany, August 12th to 13th [pdf] [slides].

Data Sets

Outreach

  • List of text mining tools
  • Invited talk given at the EPSRC HealTex Network Launch Event (November 2016)
  • PhenoDay 2016 was a joint workshop held with Bio-Ontologies at ISMB (July 2016)
  • Seminar at the European Bioinformatics Institute, Cambridge (September 2015)
  • Invited talk given at the Big Data in Medicine Workshop, Cancer Research UK Cambridge Institute (June 2015)

Collaborators

  • Damian Smedley (Queen Mary University London)
  • Anna Korhonen (University of Cambridge)
  • Bill Skarnes (Wellcome Trust Sanger Institute)
  • Dietrich Rebholz-Schuhmann (INSIGHT, National University of Ireland at Galway)
  • Peter Robinson (The Jackson Laboratory)
  • Hoang Quynh Le (University of Vietnam)
  • Jo McEntyre (EMBL-EBI)
  • Paul Lasco (McGill University)
  • Lawrence Hunter (University of Colorado at Denver)
  • Robert Stevens (University of Manchester)

Funding

PheneBank is funded by a Medical Research Council grant (MR/M025160/1).