Research Overview

My principal area of research is the prediction of transcription factor binding sites in DNA sequences. This incorporates a broad spectrum of disciplines, including:
Algorithm Design and Evaluation

There is a vast number of algorithms and software packages, at last count well over 200, that aim to predict transcription factor binding sites. No two algorithms are exactly alike, as each method has its own particular approach to the prediction problem. This can include simply scanning promoter regions with known transcription factor binding site matrices, looking for shared binding sites amongst a set of genes and/or looking for binding sites that occur only within sequence regions conserved across the same promoter region in multiple species. Each algorithm has its own particular strengths and weaknesses. I strongly believe that no single algorithm will be able to reliably predict binding sites in promoters and that a hybrid approach, combining multiple algorithms, will prove the most fruitful.

Towards this goal I have been working on a software framework that integrates a number of transcription factor binding site prediction algorithms. Rather than re-inventing the wheel by implementing yet another prediction algorithm, the framework uses existing off-the-shelf algorithms. One of the major challenges of constructing the framework is that virtually every algorithm reports results in its own format, often one designed to be read by humans rather than parsed by software. Designing wrappers around the algorithms allows a common output format (in this case GFF) to be used, so that the performance of different algorithms can be easily evaluated.

The software framework is called Mogul and is written in Perl. More details on the approach can be read here and a subset of the algorithms is available via a web interface as MotifMogul. No source code has been publicly released as yet but I am pushing hard to get accompanying papers and documentation completed early in 2008.
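
The wrapper idea can be illustrated with a minimal sketch. This is not Mogul's code (Mogul is written in Perl and its source is unreleased); it is written in Python purely for illustration. The two report formats, the tool names `tool_a` and `tool_b`, and the `motif=` attribute are all hypothetical. The only fixed point is the target: the nine tab-separated GFF columns.

```python
import re
from dataclasses import dataclass


@dataclass
class Site:
    """One predicted binding site, using 1-based inclusive coordinates."""
    seqname: str
    source: str
    start: int
    end: int
    score: float
    strand: str
    motif: str

    def to_gff(self) -> str:
        # The nine GFF columns: seqname, source, feature, start, end,
        # score, strand, frame, attributes.
        fields = [self.seqname, self.source, "TF_binding_site",
                  str(self.start), str(self.end), f"{self.score:.2f}",
                  self.strand, ".", f"motif={self.motif}"]
        return "\t".join(fields)


# Hypothetical human-readable report line from an imaginary "tool_a".
TOOL_A_LINE = re.compile(
    r"Site (\w+) found in (\S+) at (\d+)-(\d+) "
    r"\(([+-]) strand\), score (\d+(?:\.\d+)?)")


def parse_tool_a(line: str) -> Site:
    match = TOOL_A_LINE.search(line)
    if match is None:
        raise ValueError(f"unrecognised tool_a line: {line!r}")
    motif, seq, start, end, strand, score = match.groups()
    return Site(seq, "tool_a", int(start), int(end), float(score), strand, motif)


def parse_tool_b(line: str) -> Site:
    # Hypothetical comma-separated record from an imaginary "tool_b":
    # seq,offset,length,strand,score,motif with a 0-based offset.
    seq, offset, length, strand, score, motif = line.strip().split(",")
    start = int(offset) + 1              # convert to 1-based
    end = start + int(length) - 1        # inclusive end
    return Site(seq, "tool_b", start, end, float(score), strand, motif)


if __name__ == "__main__":
    print(parse_tool_a(
        "Site TATAAA found in promoter1 at 12-17 (+ strand), score 8.3").to_gff())
    print(parse_tool_b("promoter1,11,6,+,7.9,TATAAA").to_gff())
```

Once every tool's output passes through a wrapper of this kind, predictions from different algorithms for the same region line up column for column and can be compared directly.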
Parameter Optimization

While the goal of the Mogul project above is to integrate a number of prediction algorithms into a common framework, the question of what the optimal parameter settings are for the individual algorithms remains. Most off-the-shelf algorithms run with sets of default parameter values, which generate results that vary widely when they are compared against each other. An attempt to address this issue was instigated by Martin Tompa. Using the Mogul framework I am attempting to address the same issue, but with a larger number of algorithms and by using multiple algorithms together, rather than evaluating the performance of a single algorithm. I have been using a simulated annealing approach thus far to identify optimal sets of parameters (a genetic algorithm approach is also on the back-burner).
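
For readers unfamiliar with simulated annealing, the following is a generic sketch of the technique in Python, not the implementation used with Mogul. The parameter names (`threshold`, `window`), the toy scoring function and the geometric cooling schedule are all assumptions made for illustration. In practice the score would come from running the prediction algorithms with the candidate parameters and measuring agreement with known binding sites.

```python
import math
import random


def anneal(score, initial, neighbour, steps=2000,
           t_start=1.0, t_end=0.01, seed=0):
    """Maximise score(params) by simulated annealing; returns (best, best_score)."""
    rng = random.Random(seed)
    current, current_score = dict(initial), score(initial)
    best, best_score = dict(current), current_score
    for i in range(steps):
        # Geometric cooling from t_start down to t_end (requires steps > 1).
        t = t_start * (t_end / t_start) ** (i / (steps - 1))
        candidate = neighbour(current, rng)
        candidate_score = score(candidate)
        delta = candidate_score - current_score
        # Always accept improvements; accept worse moves with a
        # probability that shrinks as the temperature falls.
        if delta >= 0 or rng.random() < math.exp(delta / t):
            current, current_score = candidate, candidate_score
            if current_score > best_score:
                best, best_score = dict(current), current_score
    return best, best_score


def toy_score(params):
    # Stand-in for "run the algorithms and evaluate the predictions".
    # Hypothetical optimum at threshold=0.8, window=12.
    return -((params["threshold"] - 0.8) ** 2
             + (params["window"] - 12) ** 2 / 100)


def toy_neighbour(params, rng):
    # Perturb one parameter at a time, staying within assumed bounds.
    new = dict(params)
    if rng.random() < 0.5:
        new["threshold"] = min(1.0, max(0.0, new["threshold"] + rng.gauss(0, 0.05)))
    else:
        new["window"] = min(30, max(4, new["window"] + rng.choice([-1, 1])))
    return new


if __name__ == "__main__":
    start = {"threshold": 0.5, "window": 20}
    best, best_score = anneal(toy_score, start, toy_neighbour)
    print(best, best_score)
```

Accepting occasional worse moves early on is what lets the search escape local optima, which matters when parameters of several algorithms interact.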
Integration/Fusion of Genomic and Biological Data from Multiple, Distributed Sources

The computational analysis of a gene's promoter can be thought of as providing a complete, fully-connected wiring diagram of which transcription factors play a role in regulating the gene. Knowing that a transcription factor binds to an upstream region of a gene with high probability/confidence does not, however, provide any insight as to when the corresponding site is functionally active. Incorporating additional biological data helps to elucidate the functional significance of the prediction, or perhaps even to rule the site out as a false positive. Data can be in the form of multiple overlapping prediction algorithms and other sequence-derived evidence, such as ChIP-on-chip, multi-species conservation and chromatin binding. In two collaborations I have been investigating how to integrate such data sources using probabilistic methods and machine learning techniques (neural networks, SVMs).

I am also interested in the fusion of high-throughput and small-scale experimental data. It can be relatively straightforward to generate a large set of predictions from high-throughput data, e.g. clustering micro-array data and scanning promoters for transcription factor binding sites. Such predictions can then be supported or refuted by small-scale, single-gene experiments. Unfortunately this means that there is no substitute for actually reading papers! However, the curation of such specific, complementary data means that it is possible to drill down on highly probable binding sites to verify experimentally. I'm interested in how such an approach can be made less ad hoc and more reproducible.
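
As one simple illustration of probabilistic evidence integration, the sketch below combines independent lines of evidence for a predicted site using likelihood ratios (a naive Bayes style update). This is an assumed, textbook technique chosen for illustration, not necessarily the method used in the collaborations mentioned above. The prior and the likelihood ratios are invented numbers, and the independence assumption is a strong one for real genomic data.

```python
import math


def posterior_probability(prior, likelihood_ratios):
    """Combine a prior probability with independent likelihood ratios.

    Each ratio is P(evidence | functional site) / P(evidence | not functional);
    values above 1 support the site, values below 1 count against it.
    """
    log_odds = math.log(prior / (1.0 - prior))
    log_odds += sum(math.log(ratio) for ratio in likelihood_ratios)
    return 1.0 / (1.0 + math.exp(-log_odds))


if __name__ == "__main__":
    # Hypothetical evidence for one predicted binding site.
    evidence = {
        "predicted by several algorithms": 3.0,
        "conserved across species": 4.0,
        "no ChIP-on-chip signal": 0.5,
    }
    print(posterior_probability(0.05, evidence.values()))
```

Working in log-odds keeps the arithmetic stable when many sources are combined and makes each source's contribution easy to inspect.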
Genome Annotation and Visualization

In a previous life I was a member of the EnsEMBL team at the EBI in Cambridge, UK. I still retain a keen interest in the display and visualization of genome annotations, primarily in layering on transcription factor binding site predictions and ChIP-on-chip data. I am particularly interested in how to represent the promoters of multiple genes, such that it is easy to see whether the upstream regions of genes share common transcription factors and how those sites are arranged spatially. This is a slightly different question from the typically single-gene-centric view of traditional genome browsers. My efforts in this area have involved customizing versions of genome browsers such as Apollo and Argo. (An ISB colleague and I worked out how to launch Argo via Java Web Start.) I have also dabbled with some of the visualization tools in BioPerl and have implemented my own Perl-based annotation image generator that is used by other group members at ISB.
Effective Strategies for Running Disparate Computational Analyses on High-performance Compute Clusters

At the ISB I am fortunate to have access to a number of compute clusters to run analyses. These are invaluable when it comes to performing genome-scale analyses and highly-repetitive analyses. My primary uses of the clusters are:

Integration of Software Tools into Clean, Biologist-friendly Websites

Without delving into the psychology of designing slick website UIs, I by necessity get involved in how to present both genomic data and access to computational algorithms via simple, straightforward web interfaces. This is ultimately a compromise between providing enough features with which to modify searches and analyses and not overwhelming the user with a multitude of options. Websites that I've created or helped design include:

Any truth is better than indefinite doubt.
Sherlock Holmes, The Yellow Face, Sir Arthur Conan Doyle

Last Update: December 2007