You may want to download the article on SSCP (PostScript, 'tar'ed and 'compress'ed).
References:
1. Eisenhaber F., Imperiale F., Argos P., Froemmel C.
"Prediction of Secondary Structural Content of Proteins from Their
Amino Acid Composition Alone.
I. New Analytic Vector Decomposition Methods"
Proteins: Struct., Funct., Design, 25 (1996) N2, 157-168
2. Eisenhaber F., Froemmel C., Argos P.
"Prediction of Secondary Structural Content of Proteins from Their
Amino Acid Composition Alone.
II. The Paradox with Secondary Structural Class"
Proteins: Struct., Funct., Design, 25 (1996) N2, 169-179
3. Eisenhaber F., Persson B., Argos P.
"Prediction of Protein Structure. Recognition of Primary, Secondary,
and Tertiary Structural Features from Amino Acid Sequence"
Critical Reviews in Biochemistry & Molecular Biology, 30 (1995)
N1, 1-94
In the first technique, the amino acid composition of a query protein is represented by the best (in the least-squares sense) linear combination of the characteristic amino acid compositions of the three secondary structural types, computed from a learning set of protein tertiary structures. The calculated weight coefficients are then used to predict the secondary structural content.
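The decomposition described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the published implementation: the function names, the toy four-letter alphabet and the reference compositions are all hypothetical, and the weights are left unconstrained (the text does not say whether any constraint, such as summing to one, is imposed).

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Augmented matrix, so row operations act on both sides at once.
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("reference compositions are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def decompose_composition(query, references):
    """Weights w minimising |query - sum_k w[k] * references[k]|^2.

    Solved through the normal equations G w = b, where G holds the dot
    products between reference vectors and b their dot products with query.
    """
    gram = [[dot(a, b) for b in references] for a in references]
    rhs = [dot(a, query) for a in references]
    return solve_linear_system(gram, rhs)


# Hypothetical reference compositions over a toy 4-letter alphabet; real use
# would have 20 amino acid types and values taken from a learning set.
helix = [0.40, 0.30, 0.20, 0.10]
sheet = [0.10, 0.20, 0.30, 0.40]
coil = [0.25, 0.40, 0.10, 0.25]
references = [helix, sheet, coil]

# A query built as an exact 50/30/20 mixture, so the weights are recoverable.
true_weights = [0.5, 0.3, 0.2]
query = [sum(w * ref[i] for w, ref in zip(true_weights, references))
         for i in range(4)]

weights = decompose_composition(query, references)
```

For a real query the fit is not exact, and the residual measures how poorly the three reference compositions explain the query. With NumPy available, `numpy.linalg.lstsq` would replace the hand-written solver.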
The second method is a generalization of the first one. It also takes into account the variation in the amino acid type frequencies, separately for each secondary structural type, as well as possible compositional couplings between any two types of amino acids. The mathematical formulation of this approach leads to the computation of the eigenvalues and eigenvectors of the second moment matrix describing the amino acid compositional fluctuations of secondary structural types in various proteins of a learning set. The eigenvalues are used for normalizing the coordinate differences between the query and the average compositions in the eigenvector space. Possible correlations of the principal directions of the eigenspaces with physical properties of the amino acids were also checked. For example, the first two eigenvectors of the alpha-helical eigenspace appear to correlate with the size and hydrophobicity of the residue types, respectively.
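One way to write down the construction just described is given below. It is a sketch in notation introduced here, not necessarily the authors' exact formulation: s labels a secondary structural type, c_p^(s) is the composition of type s in protein p of a learning set of N proteins, c-bar^(s) is its average, and q is the query composition. Taking the second moments about the mean is an assumption.

```latex
\begin{aligned}
M^{(s)} &= \frac{1}{N}\sum_{p=1}^{N}
  \bigl(\mathbf{c}^{(s)}_{p}-\bar{\mathbf{c}}^{(s)}\bigr)
  \bigl(\mathbf{c}^{(s)}_{p}-\bar{\mathbf{c}}^{(s)}\bigr)^{\mathsf T}, \\
M^{(s)}\,\mathbf{e}^{(s)}_{i} &= \lambda^{(s)}_{i}\,\mathbf{e}^{(s)}_{i},
  \qquad i = 1,\dots,20, \\
d_{s}^{2}(\mathbf{q}) &= \sum_{i:\ \lambda^{(s)}_{i}>0}
  \frac{\bigl[(\mathbf{q}-\bar{\mathbf{c}}^{(s)})\cdot\mathbf{e}^{(s)}_{i}\bigr]^{2}}
       {\lambda^{(s)}_{i}} .
\end{aligned}
```

Under this reading, the off-diagonal elements of M^(s) carry the compositional couplings between pairs of amino acid types, and dividing by the eigenvalues down-weights directions in which compositions naturally fluctuate strongly. If every composition vector is normalized to sum to one, M^(s) has at least one zero eigenvalue, which is why the sum is restricted to positive eigenvalues. How the normalized coordinates are converted into predicted contents is not specified in this text and is left open here.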
The average amino acid compositions, as well as the second moment matrices of compositional fluctuations that characterize the secondary structural types considered in this work, were calculated from learning sets of protein tertiary structures obtained with various resolution thresholds (1.8 Å, 2.0 Å, 2.5 Å, and 3.0 Å). The best prediction results were obtained with the 2.0 Å dataset. The accuracy of the prediction algorithm has been validated with a "jackknife" procedure (prediction of one protein against the database of all other proteins). The consideration of compositional couplings improves the prediction accuracy, albeit not dramatically.
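The jackknife procedure mentioned above amounts to leave-one-out validation, which can be sketched generically as follows. The function names, the placeholder predictor (which simply returns the mean of the training values) and the toy data are assumptions made for illustration; they stand in for the actual composition-based predictors.

```python
def jackknife(proteins, fit, predict):
    """Leave-one-out validation: predict each protein from all the others.

    proteins: list of (features, observed_value) pairs
    fit:      builds a model from a list of such pairs
    predict:  maps (model, features) to a predicted value
    Returns the list of absolute errors, one per protein.
    """
    errors = []
    for i, (features, observed) in enumerate(proteins):
        training = proteins[:i] + proteins[i + 1:]  # all proteins except i
        model = fit(training)
        errors.append(abs(predict(model, features) - observed))
    return errors


# Placeholder predictor (illustration only): always predict the mean helix
# fraction of the training proteins, ignoring the features entirely.
def fit_mean(training):
    return sum(observed for _, observed in training) / len(training)


def predict_mean(model, features):
    return model


# Hypothetical data: (composition placeholder, observed helix fraction).
toy_set = [(None, 0.2), (None, 0.4), (None, 0.6)]
errors = jackknife(toy_set, fit_mean, predict_mean)
mean_error = sum(errors) / len(errors)
```

The key point is that the reference quantities are recomputed for every left-out protein, so no protein contributes to its own prediction.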
Our prediction results show that, for a majority of proteins, the secondary structural content is determined mainly by the amino acid composition. Much more complex secondary structure prediction methods, when used for the same purpose of predicting secondary structural content, achieve accuracies very similar to those of the present analytic techniques. This implies that all the information beyond the amino acid composition is, in fact, used mainly for positioning the secondary structural states in the sequence and not for determining the overall number of residues in a secondary structural type. Though the sequence per se determines the detailed tertiary protein structure, certain sequence properties, especially the amino acid composition, are of primary importance in their effect on the structure.
The prediction of the secondary structural content (i.e., the percentage of residues in different secondary structural states) might be considered a first step in analyzing a new protein sequence and predicting its tertiary architecture. A similar prediction might be attempted using experimentally determined amino acid compositions. Surprisingly, only a few researchers have directly addressed this problem. Early multiple linear regression analyses have been repeated and improved by Muskal and Kim, relying on a much larger database of protein structures. The amino acid composition, the molecular weight and the existence of heme in the protein were correlated with the contents of helix and strand. Muskal and Kim have also presented neural network systems for secondary structural content prediction with the same input parameters. Heme as an input parameter has been traditional since Krigbaum and Knutton and reflects a bias in the protein set toward globins and cytochromes (heme correlates positively with helix content and negatively with sheet content). Other prosthetic groups were not considered. At the same time, for protein sequences obtained in genome projects and also for experimentally determined amino acid compositions of proteins, it is not obvious which type of cofactor is required for biological function.
Nishikawa et al. found that folding type, amino acid composition, biological function (enzyme or non-enzyme), intra- or extracellular location, and the number of disulphide bonds are related. The structural class (folding type) of a protein is mainly defined by ranges of secondary structural content (see section III.D.2 and Table 1 of Eisenhaber et al.). Subsequently, both analytical distance criteria in the amino acid composition space and neural network methods have been applied for the jury decision between three to five folding types (alpha- and beta-proteins, one or two types of mixed structures, and sometimes a group of irregular forms). It was found that sequence properties such as simple hydrophobic patterns (used in combination with the amino acid composition) add almost nothing to the accuracy of the structural class prediction.
Whereas the folding type prediction appears simpler than the estimation of the secondary structural content, the classical task of secondary structure prediction, which locates particular conformational states within the sequence, is more complicated. At the same time, the output of secondary structure prediction algorithms may also be interpreted in terms of secondary structural content.
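Interpreting a per-residue prediction in terms of content is a matter of counting states. A minimal sketch follows; the three-letter state alphabet H/E/C (helix, strand, coil) is a common convention assumed here, and the function name and example string are hypothetical.

```python
from collections import Counter


def structural_content(states, alphabet="HEC"):
    """Fraction of residues in each state of a per-residue assignment string."""
    if not states:
        raise ValueError("empty state string")
    counts = Counter(states)
    unknown = set(counts) - set(alphabet)
    if unknown:
        raise ValueError(f"unexpected state symbols: {sorted(unknown)}")
    # Counter returns 0 for states that never occur, so absent states get 0.0.
    return {state: counts[state] / len(states) for state in alphabet}


# Hypothetical 10-residue prediction: 5 helix, 3 strand, 2 coil residues.
content = structural_content("HHHHHEEECC")
```

The positional information in the string is discarded in this step, which is why a positional method and a composition-only method can be compared on the same content measure.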
In this work, we investigate the impact of the amino acid composition on the secondary structural content (alpha-helix, beta-sheet, coil) of a protein. Our approach is the first to rely purely on the amino acid composition of the query protein. We present two methods for prediction of the secondary structural content. Both are based on easily comprehended analytic vector decomposition methods. In contrast to the first method, the second takes into account different weighting of amino acids as well as possible compositional couplings between pairs of residue types. The average amino acid composition of a secondary structure type, together with the characteristic amino acid compositional fluctuations, was determined from a learning set of protein structures. We show that the consideration of weighting of amino acid types and of coupling between them improves prediction accuracy.
Surprisingly, the prediction accuracy of our simple approach is similar to that of single sequence secondary structure prediction methods applied for the task of predicting the secondary structural content. This result implies that all information additional to the knowledge of amino acid composition is utilized by these methods mainly for positioning the secondary structural state in the sequence but not for determining the absolute number of residues in a specific secondary structural state.
Last modified Feb. 6, 1998