Sepp Hochreiter, Klaus Obermayer, Martin Heusel,
"Fast Model-based Protein Homology Detection without Alignment",
in Bioinformatics, Vol. 23, No. 14, Oxford University Press, pp. 1728-1736, 2007, ISSN: 1460-2059
Original title:
Fast Model-based Protein Homology Detection without Alignment
Language of the title:
English
Original abstract:
As more genomes are sequenced, the demand for fast gene classification techniques is increasing. To analyze a newly sequenced genome, first the genes are identified and translated into amino acid sequences which are then classified into structural or functional classes. The best-performing protein classification methods are based on protein homology detection using sequence alignment methods. Alignment methods have recently been enhanced by discriminative methods like support vector machines (SVMs) as well as by position-specific scoring matrices (PSSM) as obtained from PSI-BLAST.
However, alignment methods are time-consuming if a new sequence must be compared to many known sequences; the same holds for SVMs. Constructing a PSSM for the new sequence is even more time-consuming. The best-performing methods would take about 25 days on present-day computers to classify the sequences of a new genome (20 000 genes) as belonging to just one specific class, and there are hundreds of classes.
Another shortcoming of alignment algorithms is that they do not build a model of the positive class but measure the mutual distance between sequences or profiles. Only multiple alignments and hidden Markov models are popular classification methods which build a model of the positive class but they show low classification performance. The advantage of a model is that it can be analyzed for chemical properties common to the class members to obtain new insights into protein function and structure.
We propose a fast model-based recurrent neural network for protein homology detection, the ‘Long Short-Term Memory’ (LSTM). LSTM automatically extracts indicative patterns for the positive class, but in contrast to profile methods it also extracts negative patterns and uses correlations between all detected patterns for classification. LSTM can automatically extract useful local and global sequence statistics like hydrophobicity, polarity, volume, polarizability and ...