SAPS evaluates a wide variety of protein sequence properties by statistical criteria. Properties considered include compositional biases; clusters and runs of charge and other amino acid types; different kinds and extents of repetitive structures; locally periodic motifs; and anomalous spacings between identical residue types. The statistics are computed for any single (or appropriately concatenated) protein sequence input. Statistically significant sequence features highlighted by SAPS in the input sequence may suggest promising regions for experimental investigation. The program is also used to describe conserved features of protein families and to address the inverse problem of deriving protein groupings from sequence features.

Short sequences are subject to larger statistical fluctuations than longer sequences. The statistical evaluations of SAPS are reliable only for sequences of at least about 200 residues. Shorter sequences may in some cases be appropriately concatenated and analyzed as a representative combined sequence (e.g., histones, or Ras family proteins).
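
The length guideline above can be checked before a sequence is submitted. The following is a minimal sketch, not part of SAPS: the function names are hypothetical, and a plain end-to-end join is assumed as the way to concatenate, which may not be appropriate for every protein family.

```python
# Hypothetical helpers, not part of SAPS.
MIN_RELIABLE_LENGTH = 200  # approximate threshold stated in the text


def concatenate_sequences(sequences):
    """Join short sequences end to end, dropping whitespace and uppercasing.

    Assumption: a simple join is an acceptable way to build the combined
    sequence; whether it is appropriate depends on the proteins involved.
    """
    return "".join("".join(seq.split()).upper() for seq in sequences)


def is_long_enough(sequence, minimum=MIN_RELIABLE_LENGTH):
    """Return True if the sequence meets the length guideline."""
    return len(sequence) >= minimum
```

A combined sequence that still falls short of the threshold should be interpreted with the same caution as any other short input.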

The SAPS program was developed in the group of Prof. Samuel Karlin at Stanford University. The program is available via anonymous ftp from gnomic.stanford.edu. Correspondence relating to SAPS should be addressed to Volker Brendel at the Department of Mathematics, Stanford University, Stanford CA 94305, U.S.A.; phone: (415) 723-9256; fax: (415) 725-2040; email: volker@gnomic.stanford.edu. Users of the program should cite the following reference:

Brendel, V., Bucher, P., Nourbakhsh, I., Blaisdell, B.E., Karlin, S. (1992) Methods and algorithms for statistical analysis of protein sequences. Proc. Natl. Acad. Sci. USA 89: 2002-2006.

The input sequence can be pasted into the box, or a sequence file can be uploaded through the web interface. The sequence file should consist of a single sequence of at most 10,000 residues in any of the following formats: Raw, Plain, EMBL, SwissProt, Genbank, PIR, Fasta, NBRF, GCG. The web interface runs the 'fmtseq' program to convert the sequence into EMBL format. This web interface to SAPS has the following options:

Output type
Normal - regular output
Terse - Limited output confined to the analysis of the charge distribution and of high scoring segments.
Verbose - A more detailed output providing additional details for several of the analysis functions.
Documented - A completely documented output that annotates each part of the program; this should be selected when SAPS is used for the first time as it provides helpful explanations with respect to the statistics being used and the layout of the output.

Species
Uses the specified species table for quantile comparisons. The residue composition of the input protein may be evaluated relative to standard sets of proteins grouped by species, size class, subcellular location, function, or other criteria. Specifically, the composition of the input protein is compared with the quantile table of residue usage for the user-specified standard set. Extremal usages that fall in the tails of the reference distribution are indicated for individual amino acids and for charged and hydrophobic residues. For each reference set, only proteins at least 200 residues long were included; redundant entries were culled. If no species is selected, the input sequence is evaluated with the quantile table 'swp23', a random sample of proteins from SwissProt Release 23. Available species are:
human
mouse
rat
chicken
xenopus (frog)
drosophila (Drosophila melanogaster)
yeast (Saccharomyces cerevisiae)
E.coli (Escherichia coli)
B.subt. (Bacillus subtilis)
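
The quantile comparison described above can be illustrated with a short sketch. It is not the SAPS implementation: the function names are hypothetical, and the reference table is supplied by the caller as a pair of low and high cutoffs (in percent) per residue. The actual cutoffs in 'swp23' and the species tables are not reproduced here.

```python
from collections import Counter


def residue_percentages(sequence):
    """Return the percentage usage of each residue present in the sequence."""
    counts = Counter(sequence.upper())
    total = sum(counts.values())
    return {aa: 100.0 * n / total for aa, n in counts.items()}


def flag_extremes(percentages, quantile_table):
    """Flag residues whose usage falls outside the reference cutoffs.

    quantile_table maps residue -> (low_cutoff, high_cutoff) in percent.
    Assumption: the table layout is hypothetical and the caller provides
    the values; residues absent from the sequence count as 0 percent.
    """
    flags = {}
    for aa, (low, high) in quantile_table.items():
        value = percentages.get(aa, 0.0)
        if value < low:
            flags[aa] = "low"
        elif value > high:
            flags[aa] = "high"
    return flags
```

Residues whose usage lies between the two cutoffs are simply left out of the result, which mirrors the idea that only usages in the tails of the reference distribution are reported.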

Positive residues
By default, SAPS treats only lysine (K) and arginine (R) as positively charged residues. Alternatively, histidine (H) can also be treated as positively charged in all parts of the program involving the charge alphabet.
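
The effect of this option can be pictured as a change in how the sequence is translated into a charge alphabet. The sketch below is illustrative only: the function name and the '+', '-', '0' symbols are hypothetical, and the treatment of aspartate (D) and glutamate (E) as the negatively charged residues is an assumption not stated in this section.

```python
def charge_alphabet(sequence, histidine_positive=False):
    """Translate a protein sequence into '+', '-' and '0' symbols.

    K and R are always positive; H is positive only when requested.
    Assumption: D and E are taken as the negative residues.
    """
    positive = {"K", "R"}
    if histidine_positive:
        positive.add("H")
    negative = {"D", "E"}
    symbols = []
    for aa in sequence.upper():
        if aa in positive:
            symbols.append("+")
        elif aa in negative:
            symbols.append("-")
        else:
            symbols.append("0")
    return "".join(symbols)
```

For the sequence KHDEA, the default setting marks only K as positive, while the alternative setting marks both K and H.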


Questions? Problems? Send email to webtools@helix.nih.gov
Helix Systems, CIT, NIH.

Analysis of specified amino acid distribution
Clusters of particular amino acid types may be evaluated by means of the same tests that are used to detect clustering of charged residues (binomial model and scoring statistics). These tests are invoked by setting the `-a' flag; for example, to test (separately) for clusters of alanine (A) and serine (S), set `-a AS'. The binomial test is also programmed for certain combinations of amino acids: AG (flag `-a a'), PEST (flag `-a p'), QP (flag `-a q'), ST (flag `-a s').
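
The idea behind a binomial cluster test can be sketched as follows. This is a simplified illustration and not the SAPS procedure: the function names are hypothetical, the window is chosen by the caller, and the probability is not corrected for the many windows that a scan of the whole sequence would examine.

```python
from math import comb


def binomial_tail(k, n, p):
    """Probability of at least k successes in n trials with success rate p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def window_cluster_pvalue(sequence, residues, start, length):
    """Binomial tail probability for the residue count seen in one window.

    residues is a string such as "A" or "ST"; the success rate p is the
    overall frequency of those residues in the whole sequence.
    Assumption: a single uncorrected window, for illustration only.
    """
    seq = sequence.upper()
    wanted = set(residues.upper())
    p = sum(1 for aa in seq if aa in wanted) / len(seq)
    window = seq[start:start + length]
    k = sum(1 for aa in window if aa in wanted)
    return binomial_tail(k, len(window), p)
```

A small value indicates that the window holds more of the chosen residues than the overall composition would predict. Passing a combination such as "ST" pools the two residue types, in the spirit of the combined tests listed above.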