eScholarship Repository, California Digital Library

Department of Statistics, UCLA
University of California, Los Angeles

Extracting Sequence Features to Predict Protein-DNA Interactions: A Comparative Study
Qing Zhou, UCLA Department of Statistics
Jun S. Liu, Harvard University

Download the Paper (964 K, PDF file) - January 9, 2008

ABSTRACT:
Predicting how and where proteins, especially transcription factors (TFs), interact with DNA is an important problem in biology. We present a systematic study of predictive modeling approaches to the TF-DNA binding problem, which have frequently been shown to be more efficient than methods based only on position-specific weight matrices (PWMs). In these approaches, a statistical relationship between genomic sequences and gene expression or ChIP-binding intensities is inferred through a regression framework, and influential sequence features are identified by variable selection. We examine several state-of-the-art learning methods: stepwise linear regression, multivariate adaptive regression splines (MARS), neural networks, support vector machines, boosting, and Bayesian additive regression trees (BART). These methods are applied to simulated datasets and to two whole-genome ChIP-chip datasets, one each for the TFs Oct4 and Sox2, in human embryonic stem cells. We find that, with proper learning methods, predictive modeling approaches can significantly improve predictive power over the PWM approach and identify more biologically interesting features, such as TF-TF interactions. In particular, BART and boosting show the best and most robust overall performance among all the methods.
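
As a rough illustration of the regression-plus-variable-selection idea described in the abstract, the sketch below runs a greedy forward stepwise linear regression on simulated data. It is not the authors' implementation. The feature columns stand in for hypothetical motif scores, the response stands in for a ChIP-binding intensity, and the coefficients, noise level, sample size, and stopping threshold `min_gain` are all assumptions chosen for the example.

```python
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / m[i][i]
    return x


def ols_rss(columns, y):
    """Fit y on an intercept plus the given columns; return the residual sum of squares."""
    n = len(y)
    design = [[1.0] * n] + columns  # each entry is one column of the design matrix
    p = len(design)
    xtx = [[sum(design[i][k] * design[j][k] for k in range(n)) for j in range(p)]
           for i in range(p)]
    xty = [sum(design[i][k] * y[k] for k in range(n)) for i in range(p)]
    beta = solve_linear_system(xtx, xty)
    return sum((y[k] - sum(beta[i] * design[i][k] for i in range(p))) ** 2
               for k in range(n))


def forward_stepwise(features, y, min_gain=0.05):
    """Add, one at a time, the feature that most reduces the RSS.

    Stops when the best relative RSS reduction falls below min_gain
    (an assumed, illustrative stopping rule).
    """
    selected = []
    current = ols_rss([], y)
    while len(selected) < len(features):
        candidates = [j for j in range(len(features)) if j not in selected]
        best_rss, best_j = min(
            (ols_rss([features[i] for i in selected + [j]], y), j)
            for j in candidates
        )
        if (current - best_rss) / current < min_gain:
            break
        selected.append(best_j)
        current = best_rss
    return selected


# Simulated data (all values hypothetical): 300 sequences, 6 motif-score features,
# of which only features 0 and 3 influence the simulated binding intensity.
random.seed(0)
n_seq, n_feat = 300, 6
features = [[random.gauss(0.0, 1.0) for _ in range(n_seq)] for _ in range(n_feat)]
y = [2.0 * features[0][k] - 1.5 * features[3][k] + random.gauss(0.0, 0.5)
     for k in range(n_seq)]

selected = forward_stepwise(features, y)
print("selected feature indices:", selected)
```

Stepwise linear regression is only one of the methods the abstract lists, and it can capture only additive linear effects. Finding TF-TF interactions with it would require adding product terms as explicit candidate features.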

SUGGESTED CITATION:
Qing Zhou and Jun S. Liu, "Extracting Sequence Features to Predict Protein-DNA Interactions: A Comparative Study" (January 9, 2008). Department of Statistics, UCLA. Department of Statistics Papers. Paper 2008010902.
http://repositories.cdlib.org/uclastat/papers/2008010902

 
eScholarship is a service of the California Digital Library.