Elicitation of Protein-Protein Interactions from Biomedical Literature Using Association Rule Discovery

Description:

Extracting information from a stack of data is a tedious task and the scenario is no different in proteomics. Volumes of research papers are published about study of various proteins in several species, their interactions with other proteins and identification of protein(s) as possible biomarker in causing diseases. It is a challenging task for biologists to keep track of these developments manually by reading through the literatures. Several tools have been developed by computer linguists to assist identification, extraction and hypotheses generation of proteins and protein-protein interactions from biomedical publications and protein databases. However, they are confronted with the challenges of term variation, term ambiguity, access only to abstracts and inconsistencies in time-consuming manual curation of protein and protein-protein interaction repositories. This work attempts to attenuate the challenges by extracting protein-protein interactions in humans and elicit possible interactions using associative rule mining on full text, abstracts and captions from figures available from publicly available biomedical literature databases. Two such databases are used in our study: Directory of Open Access Journals (DOAJ) and PubMed Central (PMC). A corpus is built using articles based on search terms. A dataset of more than 38,000 protein-protein interactions from the Human Protein Reference Database (HPRD) is cross-referenced to validate discovered interactive pairs. A set of an optimal size of possible binary protein-protein interactions is generated to be made available for clinician or biological validation. A significant change in the number of new associations was found by altering the thresholds for support and confidence metrics. This study narrows down the limitations for biologists in keeping pace with discovery of protein-protein interactions via manually reading the literature and their needs to validate each and every possible interaction.

Creator(s): Samuel, Jarvie John
Creation Date: August 2010
Partner(s):
UNT Libraries
Collection(s):
UNT Theses and Dissertations
Usage:
Total Uses: 260
Past 30 days: 3
Yesterday: 0
Creator (Author):
Publisher Info:
Publisher Name: University of North Texas
Place of Publication: Denton, Texas
Date(s):
  • Creation: August 2010
Description:

Extracting information from a stack of data is a tedious task and the scenario is no different in proteomics. Volumes of research papers are published about study of various proteins in several species, their interactions with other proteins and identification of protein(s) as possible biomarker in causing diseases. It is a challenging task for biologists to keep track of these developments manually by reading through the literatures. Several tools have been developed by computer linguists to assist identification, extraction and hypotheses generation of proteins and protein-protein interactions from biomedical publications and protein databases. However, they are confronted with the challenges of term variation, term ambiguity, access only to abstracts and inconsistencies in time-consuming manual curation of protein and protein-protein interaction repositories. This work attempts to attenuate the challenges by extracting protein-protein interactions in humans and elicit possible interactions using associative rule mining on full text, abstracts and captions from figures available from publicly available biomedical literature databases. Two such databases are used in our study: Directory of Open Access Journals (DOAJ) and PubMed Central (PMC). A corpus is built using articles based on search terms. A dataset of more than 38,000 protein-protein interactions from the Human Protein Reference Database (HPRD) is cross-referenced to validate discovered interactive pairs. A set of an optimal size of possible binary protein-protein interactions is generated to be made available for clinician or biological validation. A significant change in the number of new associations was found by altering the thresholds for support and confidence metrics. This study narrows down the limitations for biologists in keeping pace with discovery of protein-protein interactions via manually reading the literature and their needs to validate each and every possible interaction.

Degree:
Level: Master's
Discipline: Computer Science
Physical Description:

vii, 82 p. : ill.

Language(s):
Subject(s):
Keyword(s): association rule mining | text mining | protein-protein interactions | information extraction
Contributor(s):
Partner:
UNT Libraries
Collection:
UNT Theses and Dissertations
Identifier:
  • OCLC: 700932915 |
  • UNTCAT: b3914510 |
  • ARK: ark:/67531/metadc30508
Resource Type: Thesis or Dissertation
Format: Text
Rights:
Access: Public
License: Copyright
Holder: Samuel, Jarvie John
Statement: Copyright is held by the author, unless otherwise noted. All rights reserved.