A Boost for Analyzing Biological Sequences
UA computer scientists John Kececioglu and Dan DeBlasio are developing improved software that provides biologists with much more accurate results when analyzing sequence data.
Imagine trying to construct a brick building with fewer than the requisite number of bricks and without a detailed blueprint.
Welcome to the world of computational biologists.
When biologists study proteins, DNA, or other biological molecules that are represented in the computer as sequences, they rely on known information but also must predict missing data. Given that reality, obtaining accurate results is a major challenge.
At the University of Arizona, computer scientist John Kececioglu and his collaborators have spent years developing improved computer software to aid biologists in obtaining more accurate analyses.
"It's a very common problem: How does a biologist find the right values to use when they run their analysis software? Sometimes you have hundreds of parameters that must be set," said Kececioglu, a UA associate professor of computer science and a BIO5 Institute member.
The aim: to remove the guessing game involved in tuning parameter values, while also improving the ability to obtain the best sequence alignment without having all information available.
And there is good news – Kececioglu and one of his graduate students, Dan DeBlasio, have developed a new technique that automatically tunes parameters and improves accuracy by 27 percent using an approach the team has termed "parameter advising."
With parameter advising, the software Kececioglu and DeBlasio co-developed frees biologists from having to rely on the default settings in the software tools they use for sequence analysis.
The newly developed software quickly analyzes those settings, along with information about the sequences provided by the biologist, to identify the best parameters.
"It's somewhat like setting the configuration in your Internet browser to give yourself the best experience," said DeBlasio, a third-year doctoral student in the computer science department. "We're automatically finding the configuration that gives a biologist the best results when they run alignment software."
For their work, and the potential of their new techniques, Kececioglu recently received a National Science Foundation grant of nearly $500,000 through September 2015 to advance the technology further.
The underlying approach is remarkably general, and Kececioglu and DeBlasio will make their system available as open source software for widespread use – not merely for those scientists doing biological sequence alignment.
"There are so many different models of scientific phenomena, and they are invariably optimization problems – you are trying to minimize or maximize a function that usually has many parameters whose values need to be set," Kececioglu said. "If the parameters are not set to the right values, your model won’t be sound – it won't actually correspond to what you are trying to model."
Today, biologists frequently analyze proteins represented in terms of linear sequences of amino acids, searching for similar and dissimilar regions. It is a time-consuming process, and one that does not always lead to robust results.
Also, proteins that perform the same function in humans and other animals are not identical in sequence. That adds a challenge, because software is not always reliable enough to identify the shared and dissimilar regions that trace back to the distant evolutionary past of the species.
"If the species are evolutionarily distantly related, the computed results can often be inaccurate," Kececioglu said. "For a computational model to behave correctly, it needs correct values for the model’s parameters."
The algorithm that Kececioglu and DeBlasio discovered does not try parameter values at random. Instead, it carefully chooses the next value to use, learning the best values far more quickly than a human could.
"Conventional methods of alignment just use one default parameter setting, and take a one-size-fits-all approach when computing an alignment," said DeBlasio, who is funded by the UA's Integrative Graduate Education and Research Traineeship, or IGERT, in genomics.
"What we're doing is to increase the accuracy of the alignment that is computed by choosing from multiple parameter settings that work well," he said. "So we're opening up the space of possible solutions using a careful estimate of the quality of those alignments."
The research was well received when DeBlasio presented an initial paper on the project and its findings at the 2012 International Conference on Research in Computational Molecular Biology in Barcelona, Spain. The full paper has since been accepted for future publication in the Journal of Computational Biology.
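For readers who want a concrete picture, the idea DeBlasio describes, computing alignments under several parameter settings and keeping the one with the best estimated quality, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the team's software: the aligner is a textbook global alignment with a simple gap cost, the quality estimate is a crude stand-in (the fraction of columns where the two sequences agree), and every name in it (`Params`, `align`, `estimate_accuracy`, `advise`) is hypothetical.

```python
from typing import Callable, Iterable, NamedTuple, Tuple


class Params(NamedTuple):
    """One candidate parameter setting (hypothetical, for illustration)."""
    match: int     # score for two identical letters
    mismatch: int  # score for two different letters
    gap: int       # score for a letter aligned to a gap


Alignment = Tuple[str, str]


def align(a: str, b: str, p: Params) -> Alignment:
    """Toy global alignment (Needleman-Wunsch with a linear gap cost)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * p.gap
    for j in range(1, m + 1):
        score[0][j] = j * p.gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = p.match if a[i - 1] == b[j - 1] else p.mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + p.gap,
                              score[i][j - 1] + p.gap)
    # Trace back from the bottom-right corner to recover the two rows.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = p.match if a[i - 1] == b[j - 1] else p.mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                top.append(a[i - 1])
                bottom.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + p.gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))


def estimate_accuracy(aln: Alignment) -> float:
    """Stand-in quality estimate: fraction of columns with identical letters.

    An assumption made for this sketch; the article does not describe
    how the real estimator works.
    """
    top, bottom = aln
    if not top:
        return 0.0
    identical = sum(x == y and x != "-" for x, y in zip(top, bottom))
    return identical / len(top)


def advise(a: str, b: str, candidates: Iterable[Params],
           estimator: Callable[[Alignment], float] = estimate_accuracy
           ) -> Tuple[Params, Alignment]:
    """Align under each candidate setting; keep the best-estimated result."""
    best = None
    for p in candidates:
        aln = align(a, b, p)
        est = estimator(aln)
        if best is None or est > best[0]:
            best = (est, p, aln)
    if best is None:
        raise ValueError("no candidate parameter settings given")
    return best[1], best[2]


if __name__ == "__main__":
    settings = [Params(1, -1, -1), Params(2, -1, -2), Params(1, -3, -1)]
    chosen, (row1, row2) = advise("GATTACA", "GCATGCU", settings)
    print(chosen)
    print(row1)
    print(row2)
```

The point of the sketch is the shape of `advise`: the aligner and the estimator are interchangeable pieces, which fits the article's description of the approach as general rather than tied to sequence alignment.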
DeBlasio and Kececioglu affirm that the general software system they are developing will ultimately lead to improved speed and accuracy for scientific models outside the field of biology – which could provide a tremendous boost to scientific research.
"This enables a scientist to work with much more sophisticated models for the phenomena they are studying, that have huge numbers of parameters but are more accurate and more expressive," Kececioglu said. "You can now capture aspects you couldn't represent before, and work with models that would simply be impossible to tune by hand."