Credits: V.M. Morozov, CHEMICAL ENZYMOLOGY DIVISION, Chemistry Department, Moscow State University.
Last modified: April 1996
The approach examines the relationships between protein activity and the physico-chemical characteristics (or amino acid composition) of different regions in the primary and tertiary structures of proteins (3D QSAR). The structure-activity analysis is based on aligned protein amino acid sequences, data on their activity (pK, ED50, Km or any other measure) and the 3D structure of at least one of these proteins. The approach is also useful when protein families are divided by evolutionary, functional or other criteria. The following methods are implemented:
The methods are incorporated into the ProAnalyst package for MS-DOS, which is freely available (file: prana$.exe, self-extracting) from ftp.ebi.ac.uk /pub/soft/dos/ or NetServ@EBI.AC.UK.
Data visualization (regression plots, 3D pictures, plots of various physico-chemical profiles for sequences and 3D structures) helps the user to inspect protein data visually.
The package makes it possible to search for protein sites that are conserved with respect to variations in physico-chemical characteristics (candidates for functionally important regions) and for regions with high or low values of these characteristics. All results, including profiles, can be saved to disk.
References:
MULTIPLE LINEAR REGRESSION ANALYSIS.
Multiple linear regression allows the user to estimate the dependencies between the variables and to test hypotheses on the nature of modulating centers that:
DISCRIMINANT ANALYSIS.
Discriminant analysis is used when protein activities are given only qualitatively (Klecka, 1986). It makes it possible to identify the most important physico-chemical factors for activity-modulating centers. The coefficients obtained for the canonical discriminant functions can be used to classify proteins.
CROSS GROUPS VARIATION and VARIATION IN CURRENT GROUP.
Alphabetical analysis is used in the first stage of the search for activity-modulating regions. Let the whole protein family be divided into N groups of proteins with similar activities. To calculate the inter-group variability index, protein sites (sequential or spatial) are compared. We count the number of protein pairs (with the two members taken from different groups) that have the same amino acid content in the given site. This number is then divided by the total number of possible protein pairs, giving a number between 0 and 1 that characterizes the site variability.
The variability indexes are estimated from pairwise comparison of proteins from different groups.
$$I = 1 - \log\left[\frac{9}{N}\sum_{i=1}^{N} R_i + 1\right], \qquad R_i = \prod_{j=1}^{M} r_{ij},$$
where N is the number of protein pairs, M is the number of positions in the site, and r_{ij} is the element of the amino acid similarity matrix for position j of pair i. If the r_{ij} vary in the interval [0,1], then the values of I lie in the interval [0,1]. If the position is conserved, I=0.
The following matrices are implemented in the program: ONE - uniform matrix, MACLACL - physico-chemical matrix of McLachlan, MDM78 - Dayhoff's matrix for detection of distant protein relationships, ESAB - matrix of closely related amino acids.
The intra-group variability index is calculated by the same procedure, but the pairs of proteins are taken from within the same group.
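The index defined above can be sketched in a few lines of Python. This is only an illustration, not the ProAnalyst code: the function names are hypothetical, the logarithm is assumed to be base 10 (the choice that makes I run from 0 to 1 when every r lies in [0,1]), and the simple identity scorer is a stand-in for the similarity matrices shipped with the program, whose actual values are not given here.

```python
import math
from itertools import combinations, product

def identity(a, b):
    """Stand-in similarity scorer (assumption): 1 for identical residues, else 0."""
    return 1.0 if a == b else 0.0

def site_similarity(site_a, site_b, sim):
    """R_i: product over the M site positions of the pairwise similarities r_ij."""
    r = 1.0
    for a, b in zip(site_a, site_b):
        r *= sim(a, b)
    return r

def inter_group_pairs(groups):
    """All protein pairs whose two members come from different groups."""
    return [(a, b)
            for g1, g2 in combinations(groups, 2)
            for a, b in product(g1, g2)]

def intra_group_pairs(groups):
    """All protein pairs whose two members come from the same group."""
    return [pair for g in groups for pair in combinations(g, 2)]

def variability_index(pairs, sim):
    """I = 1 - log10(9 * sum(R_i) / N + 1), with N the number of pairs."""
    if not pairs:
        raise ValueError("at least one protein pair is required")
    total = sum(site_similarity(a, b, sim) for a, b in pairs)
    return 1.0 - math.log10(9.0 * total / len(pairs) + 1.0)

# Each protein is represented only by its residues in the site (two positions).
groups = [["AK", "AK"], ["AK", "AR"]]
inter = variability_index(inter_group_pairs(groups), identity)
intra = variability_index(intra_group_pairs(groups), identity)
print(inter, intra)
```

With the identity scorer, the average of the R_i is simply the fraction of pairs whose site content is identical, so the index reduces to a rescaled version of the pair-counting measure described earlier. A graded similarity matrix would be supplied by passing a different `sim` function.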
CLUSTER ANALYSIS.
Briefly, the proteins are divided into clusters with the same amino acid content in the given site (sequential or spatial). The result of the calculation is either one or zero.
0 - when the difference between the activities of any two proteins from different clusters is no more than the threshold value. The site under analysis is a good candidate for an activity-modulating site when the threshold value is less than or equal to the error of the protein activity measurement.
STATISTIC MAX R SQUARE (ANOVA).
The procedure for calculating the max R-square statistic is taken from (Draper and Smith, 1986). Briefly, the proteins are divided into groups with the same amino acid content in the given site (sequential or spatial). The variation of protein activities is calculated for each group; the variation is the sum of squared differences between the average activity of the group and the activity of each member. The variations obtained are then summed and divided by the total variation of the whole protein set.
In all methods of alphabetical analysis, amino acids can be compared using matrices of amino acid similarity (MDM78, minimal mutational distance, etc.). In all of the methods, the sites can be determined from the protein sequence as well as from the tertiary structure. On entering this section, the user needs to select the amino acid similarity matrix (from the suggested catalog of names) and then input the threshold value for estimating site similarity.
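The ratio described above can be illustrated with a short Python sketch. It is a minimal example under stated assumptions, not the program's implementation: the function names are hypothetical, and proteins are grouped by exact identity of their site content rather than by a similarity matrix with a threshold.

```python
from collections import defaultdict

def sum_of_squares(values):
    """Sum of squared deviations of the values from their mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def within_group_ratio(site_contents, activities):
    """Summed within-group variation divided by the variation of the whole set.

    site_contents[k] is the residue string of protein k in the site,
    activities[k] is its measured activity.
    """
    groups = defaultdict(list)
    for content, activity in zip(site_contents, activities):
        groups[content].append(activity)   # exact-match grouping (assumption)
    within = sum(sum_of_squares(g) for g in groups.values())
    total = sum_of_squares(activities)
    if total == 0:
        raise ValueError("all activities are equal; the ratio is undefined")
    return within / total

sites = ["A", "A", "G", "G"]
ratio_explained = within_group_ratio(sites, [1.0, 1.0, 3.0, 3.0])
ratio_unexplained = within_group_ratio(sites, [1.0, 3.0, 1.0, 3.0])
print(ratio_explained, ratio_unexplained)
```

In the usual one-way ANOVA sense, R-square equals one minus this ratio, so a ratio near 0 means that the site content accounts for almost all of the variation in activity. Whether the program reports the ratio itself or its complement is not stated here.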
ANALYSIS OF THE DEPENDENCE OF BIOLUMINESCENCE SPECTRA MAXIMUM ON BEETLE LUCIFERASE SEQUENCES: IDENTIFICATION OF AMINO ACIDS RESPONSIBLE FOR COLOUR DIFFERENCES
A statistical approach was applied to the analysis of the dependence of bioluminescence color on amino acid changes in beetle luciferase sequences. The data set consisted of click beetle luciferase sequences (4 wild-type enzymes and 27 of their hybrids) and bioluminescence maximum positions (BMP). It was found that amino acid replacements in any pair combining a 223/238 position (amino acid changes in these positions are correlated) with a 247/352/358 position are responsible for 90% of the BMP variation. A statistically significant (R=99.6%) correlation was observed between BMP and overall polarizability at positions 223-224 and 247 (major influence) and 351-352 (minor influence). It was suggested that these sites are located in proximity to the emitter.
Reference:
CONSERVED MOTIFS IN ADENYLATING PROTEINS
A search for conserved motifs was performed in the adenylating protein superfamily. The superfamily includes enzymes that activate a carboxyl group by forming an acyladenylate, using ATP as the AMP donor. Besides a known motif, we found a second conserved motif. Screening the SWISS-PROT database for occurrences of the motifs showed that both motifs are highly characteristic and occur in all adenylating proteins. The predicted secondary structure of the domain flanked by these motifs is similar in two adenylating proteins without statistically significant sequence homology. This may suggest that motif 2, like motif 1, is involved in adenylation and belongs to a domain whose tertiary structure is conserved.
Reference:
V.M. Morozov, N.N. Ugarova "Conserved Motifs in Adenylating Proteins" (admitted to the journal "Biochemistry")