Computational Biology and Bioinformatics

Structure and Function Analysis of Amino Acid Sequences via InterNet
Statistical analysis of dependence of functional property on amino acid composition in a peptide set
Functional pattern recognition

Credits: V.M. Morozov, Chemical Enzymology Division, Chemistry Department, Moscow State University.

Last modified: April 1996

NEW STATISTICAL APPROACH TO ANALYSIS OF DEPENDENCE OF FUNCTIONAL PROPERTY (ACTIVITY, STABILITY, ED50, KS, ETC.) ON AMINO ACID COMPOSITION IN A PEPTIDE SET

The approach examines the relationships between protein activity and the physico-chemical characteristics (or amino acid composition) of different regions in the proteins' primary and tertiary structures (3D QSAR). The structure-activity analysis is based on aligned protein amino acid sequences, data on their activity (pK, ED50, Km, or any other), and the 3D structure of at least one of these proteins. The approach is also useful when protein families are divided by evolutionary, functional, or other criteria. The following methods are implemented:

- empirical energy calculations;
- spatial site moments calculations;
- discriminant analysis;
- multiple linear regression;
- analysis of variations;
- ANOVA;
- cluster analysis.

The methods are incorporated into the ProAnalyst package for MS-DOS, which (file: prana$.exe, self-extracting) is freely available from ftp.ebi.ac.uk /pub/soft/dos/ or NetServ@EBI.AC.UK. Data visualization (regression plots, 3D pictures, drawings of various physico-chemical profiles for sequences and 3D structures) helps the user inspect protein data visually. The package makes it possible to find protein sites that are conservative in the variation of physico-chemical characteristics (candidates for functionally important regions), as well as regions with high or low values of these characteristics. All results, including profiles, can be saved to disk.
References:

Eroshkin A.M., Fomin V.I., Zhilkin P.A., Ivanisenko V.V., Kondrakhin Y.V. "PROANAL version 2: multifunctional program for analysis of multiple protein sequence alignments and for studying the structure-activity relationships in protein families." Comput. Appl. Biosci. 11:39-44 (1995).
Eroshkin A.M., Zhilkin P.A., Fomin V.I. "Algorithm and computer program Pro_Anal for analysis of relationship between structure and activity in a family of proteins or peptides." Comput. Appl. Biosci. 9:491-497 (1993).

MULTIPLE LINEAR REGRESSION ANALYSIS.

Multiple linear regression lets the user estimate the dependencies between the variables and test hypotheses about the nature of modulating centers that:

- include different parts of the protein structure (discrete centers);

- have more than one key amino acid property influencing protein activity (e.g. charge and volume).

As a result, the user can find possible relationships and predict activities from the values of the independent variables.
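The regression step can be illustrated with a minimal sketch. It is not the ProAnalyst implementation: the function names, the two descriptors (site charge and site volume), and all numbers are hypothetical, and ordinary least squares is solved through the normal equations for brevity.

```python
def fit_linear_regression(rows, activities):
    """Ordinary least squares: activity ~ b0 + b1*x1 + ... + bk*xk.

    rows holds one tuple of descriptor values per protein.
    Returns the coefficients [b0, b1, ..., bk].
    """
    # Design matrix with a leading 1 for the intercept.
    X = [[1.0] + [float(v) for v in row] for row in rows]
    p = len(X[0])
    # Normal equations (X^T X) b = X^T y, stored as an augmented matrix.
    A = [[sum(x[i] * x[j] for x in X) for j in range(p)]
         + [sum(x[i] * y for x, y in zip(X, activities))]
         for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot_row = max(range(col, p), key=lambda r: abs(A[r][col]))
        if abs(A[pivot_row][col]) < 1e-12:
            raise ValueError("descriptors are collinear; cannot fit")
        A[col], A[pivot_row] = A[pivot_row], A[col]
        pivot_value = A[col][col]
        A[col] = [v / pivot_value for v in A[col]]
        for r in range(p):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] for i in range(p)]


def predict(coefficients, row):
    """Predicted activity for one protein from its descriptor values."""
    return coefficients[0] + sum(c * v for c, v in zip(coefficients[1:], row))


# Hypothetical data: (charge, volume) of a candidate site in five proteins,
# with activities constructed as 2 + 0.5*charge + 0.01*volume.
descriptors = [(0, 100), (1, 120), (-1, 90), (2, 150), (0, 140)]
activities = [3.0, 3.7, 2.4, 4.5, 3.4]
coefficients = fit_linear_regression(descriptors, activities)
```

With more than one descriptor in the model, the same fit covers both hypotheses listed above: descriptors may come from different parts of the structure, or be different properties of one site.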

DISCRIMINANT ANALYSIS.

Discriminant analysis is used when protein activities are given only qualitatively (Klecka, 1986). With this analysis it is possible to identify the most important physico-chemical factors for activity-modulating centers. The coefficients of the canonical discriminant functions obtained can be used to classify proteins.
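As an illustration only, the sketch below uses Fisher's linear discriminant for two groups and two descriptors, which is the simplest special case of a canonical discriminant function. The function names, the descriptors, and the data are hypothetical and are not taken from ProAnalyst.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def fisher_discriminant(group_a, group_b):
    """Two-group Fisher discriminant for two descriptors per protein.

    Returns (weights, cutoff). A protein with descriptors x is assigned to
    group_a when dot(weights, x) > cutoff, otherwise to group_b.
    """
    def mean(rows):
        return [sum(r[k] for r in rows) / len(rows) for k in range(2)]

    def scatter(rows, m):
        # 2x2 matrix of summed products of deviations from the group mean.
        s = [[0.0, 0.0], [0.0, 0.0]]
        for r in rows:
            d = [r[0] - m[0], r[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
        return s

    ma, mb = mean(group_a), mean(group_b)
    sa, sb = scatter(group_a, ma), scatter(group_b, mb)
    sw = [[sa[i][j] + sb[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    if abs(det) < 1e-12:
        raise ValueError("within-group scatter matrix is singular")
    diff = [ma[0] - mb[0], ma[1] - mb[1]]
    # weights = inverse(sw) * (ma - mb), written out for the 2x2 case.
    weights = [(sw[1][1] * diff[0] - sw[0][1] * diff[1]) / det,
               (-sw[1][0] * diff[0] + sw[0][0] * diff[1]) / det]
    cutoff = 0.5 * (dot(weights, ma) + dot(weights, mb))
    return weights, cutoff


def classify(weights, cutoff, x):
    return "active" if dot(weights, x) > cutoff else "inactive"


# Hypothetical (charge, relative volume) descriptors of one site.
active = [(1.0, 1.2), (2.0, 1.0), (1.5, 1.5), (2.5, 1.3)]
inactive = [(-1.0, 1.1), (0.0, 1.4), (-0.5, 0.9), (-1.5, 1.2)]
weights, cutoff = fisher_discriminant(active, inactive)
```

The relative size of the weights hints at which descriptor separates the groups, but the weights are only comparable when the descriptors are on a common scale.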

CROSS GROUPS VARIATION and VARIATION IN CURRENT GROUP.

Alphabetical analysis is used in the first stage of the search for activity-modulating regions. Let the whole protein family be divided into several groups of proteins with similar activities. To calculate the inter-group variability index, protein sites (sequential or spatial) are compared. We count the number of protein pairs (the two proteins taken from different groups) that have the same content of amino acid residues in the given site. This number is then divided by the total number of possible pairs of proteins. The result is a number (ranging from 0 to 1) that characterizes the site variability.
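The pair counting can be sketched as follows. The function name and the toy sequences are hypothetical. The sketch assumes that "all possible pairs" means all pairs whose two members come from different groups, and that site positions are 0-based columns of the alignment.

```python
from itertools import combinations


def same_content_fraction(groups, positions):
    """Fraction of inter-group protein pairs with identical site content.

    groups: list of groups, each a list of aligned sequences.
    positions: 0-based alignment columns that make up the site.
    """
    def site(sequence):
        return tuple(sequence[p] for p in positions)

    same = 0
    total = 0
    for group_a, group_b in combinations(groups, 2):
        for seq_a in group_a:
            for seq_b in group_b:
                total += 1
                if site(seq_a) == site(seq_b):
                    same += 1
    if total == 0:
        raise ValueError("at least two non-empty groups are required")
    return same / total


# Hypothetical aligned sequences in two activity groups.
groups = [["ACDE", "ACDF"], ["ACGE", "TCDE"]]
fraction_site_01 = same_content_fraction(groups, [0, 1])
fraction_site_1 = same_content_fraction(groups, [1])
```

In this toy set the site made of columns 0 and 1 is shared by half of the inter-group pairs, while column 1 alone is identical in every pair.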

The variability index is estimated from pairwise comparison of proteins from different groups:

$$I = 1 - \log_{10}\left( \frac{9}{N} \sum_{i=1}^{N} R_i + 1 \right), \qquad R_i = \prod_{j=1}^{M} r_{ij}$$

where $N$ is the number of protein pairs, $M$ is the number of positions in the site, and $r_{ij}$ is the element of the amino acid similarity matrix for pair $i$ at position $j$. The logarithm is decimal. If $r_{ij}$ varies in the interval [0,1], the values of $I$ lie in the interval [0,1]. If the position is conservative, $I = 0$.
The following matrices are implemented in the program: ONE, a uniform matrix; MACLACL, the physico-chemical matrix of McLachlan; MDM78, Dayhoff's matrix for detecting distant protein relationships; ESAB, a matrix of closely related amino acids.

The intra-group variability index is calculated by the same procedure, but the pairs of proteins are taken from within a single group.
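A minimal sketch of the index calculation is given below. All names and sequences are hypothetical. The identity scoring function is only a placeholder for the similarity matrices of the program, and the logarithm is taken as decimal so that a fully conserved site gives an index of 0.

```python
import math
from itertools import combinations


def identity_similarity(residue_a, residue_b):
    """Placeholder similarity: 1 for identical residues, 0 otherwise."""
    return 1.0 if residue_a == residue_b else 0.0


def inter_group_pairs(groups):
    """All pairs whose two proteins belong to different groups."""
    return [(a, b)
            for group_a, group_b in combinations(groups, 2)
            for a in group_a
            for b in group_b]


def intra_group_pairs(group):
    """All pairs of proteins taken from one group."""
    return list(combinations(group, 2))


def variability_index(pairs, positions, similarity):
    """I = 1 - log10(9 * mean(R_i) + 1), R_i = product of r over the site."""
    if not pairs:
        raise ValueError("no protein pairs to compare")
    total = 0.0
    for seq_a, seq_b in pairs:
        r_i = 1.0
        for p in positions:
            r_i *= similarity(seq_a[p], seq_b[p])
        total += r_i
    return 1.0 - math.log10(9.0 * total / len(pairs) + 1.0)


# Hypothetical aligned sequences in two activity groups.
groups = [["ACDE", "ACDF"], ["ACGE", "TCDE"]]
between = inter_group_pairs(groups)
index_conserved = variability_index(between, [1], identity_similarity)
index_variable = variability_index(between, [0], identity_similarity)
index_within = variability_index(intra_group_pairs(groups[0]), [3],
                                 identity_similarity)
```

Passing a different `similarity` function with values in [0,1] (for example one backed by a substitution matrix) changes only the scoring, not the formula.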

CLUSTER ANALYSIS.

Briefly, the proteins are divided into clusters with the same amino acid content in the given site (sequential or spatial). The result of the calculation is either one or zero.

0 - when the difference between the activities of any two proteins from different clusters is no more than the threshold value. The site under analysis is a good candidate for an activity-modulating site when the threshold value is less than or equal to the measurement error of protein activity.

1 - in all other cases.
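The rule can be sketched as follows, reading it literally as stated here (pairs are drawn from different clusters). The function names, sequences, and activity values are hypothetical, and site positions are assumed to be 0-based alignment columns.

```python
from collections import defaultdict
from itertools import combinations


def cluster_by_site(sequences, positions):
    """Group protein indices by the residues found at the site positions."""
    clusters = defaultdict(list)
    for index, sequence in enumerate(sequences):
        clusters[tuple(sequence[p] for p in positions)].append(index)
    return dict(clusters)


def cluster_indicator(sequences, activities, positions, threshold):
    """Return 0 if every pair of proteins from different clusters differs
    in activity by no more than threshold, and 1 otherwise."""
    clusters = list(cluster_by_site(sequences, positions).values())
    for cluster_a, cluster_b in combinations(clusters, 2):
        for i in cluster_a:
            for j in cluster_b:
                if abs(activities[i] - activities[j]) > threshold:
                    return 1
    return 0


# Hypothetical aligned sequences and activity values.
sequences = ["ACDE", "ACDF", "TCDE", "TCDF"]
activities = [540.0, 541.0, 560.0, 561.0]
clusters_col0 = cluster_by_site(sequences, [0])
strict = cluster_indicator(sequences, activities, [0], threshold=2.0)
loose = cluster_indicator(sequences, activities, [0], threshold=25.0)
```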

STATISTIC MAX R SQUARE (ANOVA).

The procedure for calculating the max R-square statistic is taken from (Draper and Smith, 1986). Briefly, the proteins are divided into groups with the same amino acid content in the given site (sequential or spatial). The variation of protein activity is calculated for each group; the variation is the sum of squared differences between the average activity of the group and the activity of each member. The sum of these variations is then divided by the total variation of the whole protein set.
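The ratio described above can be sketched as follows. The function names and data are hypothetical. The sketch also reports R-square as one minus the ratio, which is the usual one-way ANOVA relation; whether the program reports the ratio or its complement is an assumption here.

```python
from collections import defaultdict


def sum_of_squares(values):
    """Sum of squared deviations from the mean of values."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def within_group_ratio(sequences, activities, positions):
    """Within-group variation divided by the total variation.

    Proteins are grouped by their residues at the 0-based site positions.
    """
    groups = defaultdict(list)
    for sequence, activity in zip(sequences, activities):
        groups[tuple(sequence[p] for p in positions)].append(activity)
    total = sum_of_squares(activities)
    if total == 0:
        raise ValueError("all activities are identical")
    within = sum(sum_of_squares(values) for values in groups.values())
    return within / total


def r_square(sequences, activities, positions):
    """Share of activity variation explained by the site content."""
    return 1.0 - within_group_ratio(sequences, activities, positions)


# Hypothetical aligned sequences and activity values.
sequences = ["ACDE", "ACDF", "TCDE", "TCDF"]
activities = [540.0, 542.0, 560.0, 562.0]
ratio_col0 = within_group_ratio(sequences, activities, [0])
ratio_col3 = within_group_ratio(sequences, activities, [3])
```

In this toy set column 0 leaves almost no variation inside its groups, so it explains nearly all of the activity variation, while column 3 explains almost none.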

In all methods of alphabetical analysis, amino acids can be compared using matrices of amino acid similarity (MDM78, minimal mutational distance, etc.). In all of these methods the sites can be defined from the protein sequence as well as from the tertiary structure. On entering this section, the user selects the amino acid similarity matrix (from the catalog of names offered) and then enters the threshold value for estimating site similarity.

ANALYSIS OF THE DEPENDENCE OF BIOLUMINESCENCE SPECTRA MAXIMUM ON BEETLE LUCIFERASE SEQUENCES: IDENTIFICATION OF AMINO ACIDS RESPONSIBLE FOR COLOUR DIFFERENCES

A statistical approach was applied to the analysis of the dependence of bioluminescence color on amino acid changes in beetle luciferase sequences. The data set consisted of click beetle luciferase sequences (4 wild-type enzymes and 27 of their hybrids) and bioluminescence maximum positions (BMP). It was found that amino acid replacements in any pair joining a 223/238 position (amino acid changes in these positions are correlated) and a 247/352/358 position are responsible for 90% of the BMP variation. A statistically significant (R=99.6%) correlation was observed between BMP and overall polarizability at positions 223-224 and 247 (major influence) and 351-352 (minor influence). It was suggested that these sites are located close to the emitter.
References:

V.M. Morozov "Localization of protein site responsible for luminescence color in beetle luciferases." Bioluminescence & Chemiluminescence: Current Status / Eds. Szalay A. et al. N.Y.: John Wiley, 1994, pp. 536-539.
V.M. Morozov, V.V. Ivanisenko, A.M. Eroshkin, N.N. Ugarova "Computer analysis of the dependence of bioluminescence spectra maximum on beetle luciferase sequences: Identification of amino acids responsible for colour differences" (admitted to the journal "Molecular Biology").

CONSERVED MOTIFS IN ADENYLATING PROTEINS

A search for conserved motifs was performed in the adenylating protein superfamily. The superfamily includes enzymes that activate a carboxyl group by forming an acyl adenylate, using ATP as the AMP donor. Besides a known motif, we found a second conserved motif. Screening the SWISS-PROT database for occurrences of the motifs showed that both are highly characteristic and occur in all adenylating proteins. The predicted secondary structure of the domain flanked by these motifs is similar in two adenylating proteins that have no statistically significant sequence homology. This may suggest that motif 2, like motif 1, is involved in adenylation and belongs to a domain whose tertiary structure is conserved.

Reference:

V.M. Morozov, N.N. Ugarova "Conserved Motifs in Adenylating Proteins" (admitted to the journal "Biochemistry").