Give it a try and simply copy one of the queries below, open CoSMoS motif search in a new window, paste it into the "Sequence Motif" field and press the "Submit Motif" button.
Read the How to use CoSMoS motif search section for a more detailed description of the regular expressions used in these examples.
| Query | Sequence Motif |
|---|---|
| CGPC | CGPC (Thioredoxin motif) |
| CXXC or C[A-Z][A-Z]C or C[A-Z]{2}C | CXXC |
| C[A-Z]C[A-Z]{30}C[A-Z]{2}C | CXCX30CXXC (Hsp33 zinc binding motif) |
| C[A-Z]C[A-Z]{25,35}C[A-Z]{2}C | CXCX25-35CXXC |
| C[A-Z]{2,4}C[A-Z]{3}[LIVMFYWC][A-Z]{8}H[A-Z]{3,5}H | CX2-4CX3[L,I,V,M,F,Y,W or C]X8HX3-5H (C2H2 zinc finger) |
| ^[A-Z]{0,20}[ST]RR[A-Z]FL | N-term.-X0-20[S or T]RRXFL (Twin Arginine Transport signal) |
| DPC[A-Z]{2}C[A-Z]{2}[HR][A-Z]{0,40}$ | DPCXXCXX[H or R]X0-40-C-term. (cleavage site of NiFe hydrogenase specific C-Terminal endopeptidases) |
CoSMoS searches for protein sequence motifs in the proteome and ranks them by conservation. Currently, Escherichia coli K12 is implemented in our database. Let's look at the first example query: if you know the exact sequence of your motif, e.g. if you are looking for a CGPC (Thioredoxin) motif, just enter CGPC into the "Sequence Motif" field and submit the data.
You will get a table of all occurrences of the four consecutive amino acids CGPC in the E. coli proteome. The table is sorted by conservation, with the most conserved CGPC motif first.
In this case it is the CGPC-motif in TrxA. This protein has the NCBI RefSeq ID 16131637 (the NCBI RefSeq ID links directly to the entry for TrxA in the NCBI RefSeq database). A click on the show alignment link displays the alignment (with the motif highlighted). This alignment was used to calculate the values in the "Identical Amino Acids" column. The Protein info link displays the CoSMoS protein info entry for that protein.
Second in the list is the CGPC motif in TrxC (Thioredoxin 2). The ranking of the motifs is based on two factors: the absolute conservation and the relative conservation. The absolute conservation is the left value in the "Conservation Score" column; it is calculated by dividing the sum of the values in the "Identical Amino Acids" column by the number of amino acids in the motif (in this case 4), according to the weighing string. The relative conservation is the right value in the "Conservation Score" column; it is the absolute conservation divided by the number of sequences the protein was compared to in the alignment ("Comp. to Seq. in Aln." column). The proteins are ranked separately by both parameters, and the numerically higher of the two ranks is used to calculate the new rank. If the resulting ranks are equal, the motif with the numerically lower relative conservation rank wins.
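The ranking rule can be sketched in a few lines of Python. This is a minimal illustration and not the CoSMoS implementation: the function names, the input layout, and the numbers in the usage note are hypothetical, and the absolute score assumes a weighing string in which every residue of the motif counts.

```python
def conservation_scores(identical_counts, n_sequences):
    """Return (absolute, relative) conservation for one motif hit."""
    absolute = sum(identical_counts) / len(identical_counts)
    relative = absolute / n_sequences
    return absolute, relative


def rank_hits(hits):
    """Order motif hits by the combined rank described in the text.

    hits maps a label to (identical_counts, n_sequences).
    """
    scores = {name: conservation_scores(*data) for name, data in hits.items()}

    def ranks(index):
        # Rank 1 goes to the highest score of the chosen component.
        ordered = sorted(scores, key=lambda name: scores[name][index], reverse=True)
        return {name: position for position, name in enumerate(ordered, start=1)}

    abs_rank, rel_rank = ranks(0), ranks(1)
    # The numerically higher (worse) of the two ranks decides the order;
    # ties go to the hit with the numerically lower relative rank.
    return sorted(
        scores,
        key=lambda name: (max(abs_rank[name], rel_rank[name]), rel_rank[name]),
    )
```

With the made-up input `{"a": ([900] * 4, 1000), "b": ([950] * 4, 2000), "c": ([100] * 4, 500)}`, hits `a` and `b` both receive a combined rank of 2, and `a` is listed first because its relative conservation rank is lower.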
Further information about the motifs includes the sequence indices (column "Seq. Index") of the amino acids (column "AA"), the number of sequences the proteins were compared to in the alignment file (1736 in the case of TrxA and 1814 in the case of TrxC) (column "Comp. to Seq. in Aln.") and the number of sequences that do not have a gap at the position of the given amino acid (column "Comp. to Seq. at Pos.").
First, CoSMoS checks whether the query string consists only of the 20 standard amino acids and the wildcard "X". If so, the query is reformatted to a regular expression: every single letter is parenthesized to create a "backreference", and the resulting query is sent to the database. So if you enter CGPC, the resulting regular expression query will be (C)(G)(P)(C).
The parentheses defining the backreferences are needed to later calculate the "Conservation Score". You can think of them as separators that each enclose a single part of the regular expression (you can learn more about this in the regular expression chapter). You will also find this regular expression in the "Searching CoSMoS for regular expression:" field of the output page. Then, a weighing string of the amino acids is generated. This weighing string influences the order of your output. By default, the weighing string has one W for each captured backreference consisting of a defined amino acid, resulting in WWWW,
which means, each of the 4 residues is accounted for equally in the determination of the "Conservation Score". In the case of WWWW, the absolute conservation is the average of the sum of the values in the "Identical Amino Acids" column. Because all of the letters are uppercase, each amino acid of the motif will be displayed. The weighing string is also displayed on the top of the output page. If you want to use a different weighing string (e.g. to exclude certain residues from influencing the conservation score or to account for similar amino acids), use the advanced search mode.
"X" is the wildcard character. It stands for any of the 20 standard amino acids in a given sequence motif. Let's look at the second example query CXXC.
As you may have noticed, the search for this motif took the server a lot longer than it did to find the CGPC motif. There are also a lot more CXXC motifs (409) in the E. coli proteome than there are CGPC motifs (5).
If an "X" is found in the query string, it will be reformatted to the regular expression [A-Z]. The query CXXC will result in the regular expression (C)([A-Z])([A-Z])(C).
If the wildcard is too unspecific for you, use regular expressions to narrow down the search.
The default weighing string is WNNW: one W for each specific amino acid and one N for each wildcard. This means that amino acids that match the X (any amino acid) will not influence the conservation score. The absolute conservation in this case is the average of the first and the last value in the "Identical Amino Acids" column. To change this behaviour, use the advanced search mode.
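The reformatting of a simple query can be sketched as follows. This is a hedged illustration based only on the rules stated above (one parenthesized group per letter, X expanded to [A-Z], W for specific amino acids, N for wildcards); the function names are hypothetical and the real server code may differ.

```python
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def simple_query_to_regex(query):
    """Parenthesize every letter and expand the wildcard X to [A-Z]."""
    if not re.fullmatch(f"[{AMINO_ACIDS}X]+", query):
        raise ValueError("not a simple query; treat it as a regular expression")
    return "".join("([A-Z])" if letter == "X" else f"({letter})" for letter in query)


def default_weighing_string(query):
    """One W per specific amino acid, one N per wildcard."""
    return "".join("N" if letter == "X" else "W" for letter in query)


# Each parenthesized group captures exactly one residue of a hit.
# The sequence is a made-up toy string, not an E. coli protein.
match = re.search(simple_query_to_regex("CXXC"), "MACGPCK")
captured = match.groups()
```

Because every letter becomes its own group, `captured` holds one residue per backreference, which is what allows a weighing string to address the residues individually.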
Regular expressions give you the possibility to specify the occurrence of an amino acid. Look at example 3:
| Regular Expression | Meaning |
|---|---|
| [A-Z] | Any amino acid (literally: any uppercase letter from A to Z), identical to the wildcard X |
| [A-Z]{30} | Exactly 30 times any amino acid |
| [A-Z]{2} | Exactly 2 times any amino acid |
You will find another way to specify occurrence in example 4:
| Regular Expression | Meaning |
|---|---|
| [A-Z]{25,35} | 25 to 35 occurrences of any amino acid |
In example 5 you can see how to specify amino acids that are unspecific, but not as unspecific as the wildcard:
| Regular Expression | Meaning |
|---|---|
| [LIVMFYWC] | Either L, I, V, M, F, Y, W or C |
More examples:
| Regular Expression | Meaning |
|---|---|
| Q{5} | 5 times Q in a row |
| Q{3,8} | 3 to 8 times Q in a row |
| Q{5,} | at least 5 times Q in a row |
| Q{0,5} | at most 5 times Q in a row |
| [QN] | Q or N |
| [QN]{3,8} | 3 to 8 times Q or N |
| [^Q] | any amino acid but Q (literally ^ inside of [ ] brackets means any letter other than the following) |
| [^NQ] | any amino acid but N or Q |
| [^NQ]{3,8} | 3 to 8 times any amino acid but N or Q |
| ^M | N-terminal M (literally the ^ outside of [ ] brackets means "beginning of a line" and the computer treats a protein just as a long line of text) |
| ^[A-Z]{0,20}C[A-Z]{2}C | An N-terminal CXXC motif (literally: 0 to 20 random letters at the beginning of the line followed by C, 2 random letters and C). This search will generate a much more effective output if you use an appropriate weighing string (nWSSW) in the advanced search mode. |
| L$ | C-terminal L (literally the $ means "end of a line") |
| C[A-Z]{2}C[A-Z]{0,20}$ | A C-terminal CXXC motif (literally: C followed by 2 random letters, C and by 0 to 20 random letters at the end of the line). This search will generate a much more effective output if you use an appropriate weighing string (WSSWn) in the advanced search mode. |
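The patterns in the tables above can be tried out locally with Python's `re` module, whose syntax for these constructs behaves the same way. The sequence below is a made-up toy string rather than a real E. coli protein, and `find_all` is a hypothetical helper that reports 1-based positions, as a sequence index would.

```python
import re

# Made-up toy sequence used only for illustration.
sequence = "MKCAACQQQQQNDPCL"


def find_all(pattern, text):
    """Return (1-based start position, matched text) for every hit."""
    return [(m.start() + 1, m.group()) for m in re.finditer(pattern, text)]


cxxc = find_all(r"C[A-Z]{2}C", sequence)                      # CXXC
five_q = find_all(r"Q{5}", sequence)                          # 5 times Q in a row
q_or_n = find_all(r"[QN]{3,8}", sequence)                     # 3 to 8 times Q or N
not_nq = find_all(r"[^NQ]{3,8}", sequence)                    # anything but N or Q
n_term_m = find_all(r"^M", sequence)                          # N-terminal M
c_term_l = find_all(r"L$", sequence)                          # C-terminal L
n_term_cxxc = find_all(r"^[A-Z]{0,20}C[A-Z]{2}C", sequence)   # anchored CXXC
```

Note that the anchored pattern returns the N-terminal prefix together with the motif (here `MKCAAC`), which is why the table suggests hiding that part with a lowercase letter in the weighing string.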
By using regular expressions you can tell CoSMoS to search for sequence motifs that do not have a defined length. A search for
creates the regular expression
which returns the CGPC-motif of TrxA as the 1st as well as the CGC-motif of YadR at the 2nd position. The absolute "Conservation Score" is based on the average of the sum of the values in the "Identical Amino Acids" and "Matching Amino Acids" column on the result page according to the weighing string, which in this case is by default
so the "Conservation Score" and the ordering account for the different lengths of the motifs: in the case of the CGC motif, the values of "Identical Amino Acids" for the Cs are added to the "Matching Amino Acids" for the Gs and/or the Ps and divided by 3, in the case of CGPC divided by 4. Since regular expressions can contain character sets with repetition counts, the length of the motif is no longer predefined. This is the reason why the parentheses are needed to create backreferences. The weighing string has one instruction for each backreference (and not for each amino acid in the motif, because the number of amino acids in the motif is not known prior to the search).
| Weighing string value | Meaning |
|---|---|
| W | The value in the "Identical Amino Acids" column corresponding to the regular expression backreference enters the "Conservation Score". Please note that there might be more than one value per regular expression backreference. |
| S | The value in the "Matching Amino Acids" column corresponding to the regular expression backreference enters the "Conservation Score". The "Matching Amino Acids" column counts all amino acids that match the similarity group for the amino acid at that position in the E. coli sequence. Note that the "Conservation Score" will therefore also be influenced by the setting of the amino acid similarity groups. The "Matching Amino Acids" column only appears when amino acid similarity groups are specified or the weighing string value "R" is used. |
| R | The value in the "Matching Amino Acids" column corresponding to the regular expression backreference enters the "Conservation Score". The "Matching Amino Acids" column counts all amino acids that match the regular expression used in the motif search. The "Matching Amino Acids" column only appears when this weighing string value is used or amino acid similarity groups are specified. |
| N | The regular expression backreference is not accounted for in the "Conservation Score". |
| w | Like W but the amino acids matching this backreference of the regular expression will not be displayed in the result table. |
| s | Like S but the amino acids matching this backreference of the regular expression will not be displayed in the result table. |
| r | Like R but the amino acids matching this backreference of the regular expression will not be displayed in the result table. |
| n | Like N but the amino acids matching this backreference of the regular expression will not be displayed in the result table. |
The default weighing strings that CoSMoS uses are the following: W for each specific amino acid, N for each wildcard X or [A-Z], S for each regular expression that would match multiple amino acids, e.g. [PG]. Note that the latter will also create default similarity groups.
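How a weighing string selects the values that enter the absolute "Conservation Score" can be sketched as below. This is an illustration under stated assumptions, not the CoSMoS source: the function name and input layout are hypothetical, and the denominator is assumed to be the number of residues that actually contribute, which is consistent with the averages described for WNNW and for variable-length motifs.

```python
def absolute_conservation(groups, weighing_string):
    """Average the column values selected by the weighing string.

    groups holds one list per backreference; each entry is an
    (identical, matching) pair for one residue captured by that group.
    """
    if len(groups) != len(weighing_string):
        raise ValueError("need one weighing letter per backreference")
    values = []
    for residues, letter in zip(groups, weighing_string):
        mode = letter.upper()  # lowercase only hides residues in the output
        if mode == "W":
            values.extend(identical for identical, _ in residues)
        elif mode in ("S", "R"):
            # S and R both read the "Matching Amino Acids" column; they
            # differ only in how that column was computed beforehand.
            values.extend(matching for _, matching in residues)
        elif mode != "N":
            raise ValueError(f"unknown weighing letter: {letter}")
    return sum(values) / len(values) if values else 0.0


# Made-up values for a CXXC hit: one residue per backreference.
cxxc_hit = [[(100, 100)], [(10, 40)], [(20, 30)], [(80, 90)]]
default_score = absolute_conservation(cxxc_hit, "WNNW")
whole_motif_score = absolute_conservation(cxxc_hit, "WWWW")
```

A backreference with a repetition count simply contributes several pairs in its list, so a three-letter weighing string can still score a four-residue hit.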
If those defaults do not suit your particular needs, it is a good idea to provide a weighing string for the CoSMoS search in the advanced motif search mode. There you can specify, for each regular expression backreference, how it should influence the "Conservation Score".
Example: when you are looking for CX1-5C motifs enter the query
Per default, CoSMoS would assign the weighing string
Click here to display the output for this example in a new window.
If you are interested in the conservation of the whole motif and not just the conservation of the cysteines, enter the weighing string
Click here to display the output for this example in a new window.
If you want to focus on the first cysteine, and the second cysteine could have been replaced by a similar amino acid during the course of evolution, provide CoSMoS with the weighing string
Click here to display the output for this example in a new window.
Please keep in mind that the weighing string will not influence which motifs are found; it will only affect the ordering of the output by evolutionary significance according to your specifications. To also include similar amino acids in the motif search, refine your search with regular expressions.
Let's get back to the example:
in this case the part ^[A-Z]{0,20} only serves as an "anchor" for the actual CXXC motif, that should be located not far (0-20 amino acids) from the N-terminus. Here you might find it desirable to include those N-terminal amino acids neither in the calculation of the "Conservation Score" nor in the output table. A useful weighing string could look like this:
Click here to display the output for this example in a new window.
whereas the default weighing string
Click here to display the output for this example in a new window.
creates a more prolix result table.
To be able to enter similarity groups, you must use the advanced motif search feature of CoSMoS.
Similarity groups influence, in combination with the weighing string, the ordering of your output when using the "advanced search" feature of CoSMoS. The similarity groups tell CoSMoS which amino acids should be considered similar when calculating the "Matching AA" column.
If, for example, the amino acid in the motif is A, CoSMoS will look up the amino acids in the group "Similar to A" and count appearances of those amino acids as "Matching Amino Acids".
Note that the value in the "Matching Amino Acids" column also accounts for identical amino acids (in this case A), so if you need the value for similar amino acids without identical matches, subtract the value of the "Identical Amino Acids" column.
When you are using the advanced search feature, you will find that there are already default similarity groups entered, which take some physicochemical properties of the side chains into account (like aromaticity, charge, polarity). Change them to suit your needs.
The similarity groups will also influence the highlighting of the motif in the alignment. Identical amino acids are shown in red and similar amino acids in orange. Non-matching amino acids and gaps are grey.
Default similarity groups will also be created if you use the "simple" motif search and use certain regular expressions of the type [ACD]. This will create a similarity group for A containing CD, similarity group for C containing AD and a similarity group for D containing AC. To override this behavior, use the advanced motif search feature of CoSMoS.
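The default groups derived from a character class, and the resulting "Identical" and "Matching" counts for one alignment position, can be sketched as follows. Both function names are hypothetical, the alignment column is made up, and the sketch assumes a plain class such as [ACD] (negated classes and ranges are not handled).

```python
def default_similarity_groups(character_class):
    """Map each member of a class like '[ACD]' to the other members."""
    members = character_class.strip("[]")
    return {
        aa: "".join(other for other in members if other != aa)
        for aa in members
    }


def count_identical_and_matching(reference_aa, column, groups):
    """Count identical and matching residues in one alignment column.

    column is a string with one character per aligned sequence;
    '-' marks a gap. Matching includes identical residues.
    """
    similar = groups.get(reference_aa, "")
    identical = sum(1 for aa in column if aa == reference_aa)
    matching = sum(1 for aa in column if aa == reference_aa or aa in similar)
    return identical, matching


groups = default_similarity_groups("[ACD]")
identical, matching = count_identical_and_matching("A", "AACD-GA", groups)
similar_only = matching - identical  # similar residues without identical ones
```

The final subtraction mirrors the note above: the "Matching Amino Acids" value already contains the identical residues, so the purely similar ones are obtained by subtracting the "Identical Amino Acids" value.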
| Output column | Description |
|---|---|
| NCBI RefSeq ID¹ | The ID for the NCBI RefSeq entry (Release 9). The ID numbers link directly to the corresponding protein entry. |
| Gene Name¹ | The gene name according to the RefSeq database. |
| Alignment¹ | Links to the alignment used to calculate the "Conservation Score". |
| Protein info¹ | Links to the CoSMoS protein info page for that gene. |
| AA | Amino acid in the motif. |
| Seq. Ind. | Sequence index of the amino acid in the protein. |
| Conservation Score¹ | Absolute conservation score (left value): average of the "Identical AA" or "Matching AA" column entries specified by the weighing string. Relative conservation score (right value): absolute conservation score divided by the number of sequences in the alignment. |
| Identical Amino Acids | The number of sequences in the alignment that have an amino acid identical to the one found in the E. coli protein at that position. This value is used to calculate the "Conservation Score" if the weighing string is set to W. |
| Matching Amino Acids² | The value in this column depends on the similarity groups you can set in the advanced search mode. It represents the number of sequences in the alignment that have an amino acid identical or similar (according to the similarity groups) to the one found in the E. coli protein at that position. This value is used to calculate the "Conservation Score" if the weighing string is set to S. |
| Comp. to Seq. at Pos. | Number of sequences in the alignment that align with the protein in question at that position. |
| Comp. to Seq. in Aln.¹ | Total number of sequences in the alignment. |
¹ These are global values for the motif, which means they do not change from position to position.
² This column only appears when similarity groups are defined.
To access a CoSMoS protein info page, you need to know either the gene name or the RefSeq ID of the protein you are looking for. Choose the appropriate setting in the "Search For:" field and enter the gene name or RefSeq ID in the "Gene Name / RefSeq ID:" field. The default setting is searching by gene name. If you are looking e.g. for the E. coli chaperone HSP70, you have to enter its gene name dnaK or its RefSeq ID 16128008.
On the top of the output page you will find a link to a tab-delimited version of the gene info that can be used for export into a spreadsheet program. You will also find a link to the alignment file that is the basis for the gene info.
The gene name and the according RefSeq ID are displayed. A click on the RefSeq ID will open the RefSeq entry for dnaK.
Next, the number of sequences in the alignment file is stated, as well as the average number of identical amino acids per residue in this protein. The latter value is the sum of the values in the "identical AA" column divided by the number of amino acids in the protein.
Below this, you will find a table with one row for each amino acid in the chosen protein. The rows are colored according to the conservation of the amino acid. Highly conserved amino acids are colored red, averagely conserved amino acids are colored green and highly variable amino acids are of light grey color according to this coloring scheme:
| Color Legend | |
|---|---|
| 60 % more | conserved than the average amino acid in this protein |
| 30 % more | |
| equally | |
| 30 % less | |
| 60 % less | |
In the case of dnaK you see a highly conserved stretch at the N-terminus (G6-T11). This is actually the ATP binding site. When you scroll further down, you will find other highly conserved stretches, e.g. D194-D201, a part of the ATPase Domain. At the bottom of the page you will find the color legend displayed above.
| Output column | Description |
|---|---|
| AA | Amino acid in the protein. |
| Seq. Index | Sequence index of the amino acid in the protein. |
| identical AA | The number of sequences in the alignment that have an amino acid identical to the one found in the E. coli protein at that position. |
| Conservation Score | The number of identical AA divided by the total number of sequences in the alignment |
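The per-residue values of the protein info table can be reproduced with a short sketch. The function names, the toy sequence, and the counts are hypothetical; the percentage helper is only an assumption about how "more or less conserved than the average amino acid" could be quantified, since the exact cut-offs between the colour classes are not stated here.

```python
def protein_info_rows(sequence, identical_counts, n_sequences):
    """Build one row per residue, mirroring the output columns above."""
    rows = []
    for index, (aa, identical) in enumerate(
        zip(sequence, identical_counts), start=1
    ):
        rows.append({
            "AA": aa,
            "Seq. Index": index,
            "identical AA": identical,
            "Conservation Score": identical / n_sequences,
        })
    return rows


def average_identical(identical_counts):
    """Average number of identical amino acids per residue."""
    return sum(identical_counts) / len(identical_counts)


def percent_vs_average(identical_counts):
    """Percent difference of each residue from the protein average."""
    average = average_identical(identical_counts)
    return [(count - average) / average * 100 for count in identical_counts]


# Made-up three-residue example with 200 sequences in the alignment.
rows = protein_info_rows("MGK", [50, 100, 150], 200)
differences = percent_vs_average([50, 100, 150])
```

In this toy example the average is 100 identical amino acids per residue, so the three residues lie 50 % below, at, and 50 % above the protein average.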
The alignments for the E. coli proteins can be viewed by clicking on the alignment links on the motif search output or on the protein info page. The default view shows the alignments compressed to the 20 most diverse proteins with the end gaps trimmed. You can change the view options to see more or fewer sequences by entering the appropriate numbers in the "Compress view to" and "Of those, view top" fields.
As an example: if there are 1000 sequences in the alignment file and you enter 10 in the "Compress view to" field and 10 in the "Of those, view top" field, every 100th sequence in the alignment is displayed, showing you essentially the 10 most diverse sequences. If, on the other hand, you would like to see only the 20 most similar sequences, you should enter 1000 in the "Compress view to" field and 20 in the "Of those, view top" field. If you would like to trim all gaps or do not want to trim gaps at all, you can specify this by selecting the appropriate radio button.
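The effect of the two view fields can be sketched as below. This is a guess at the selection logic based only on the example above: the function name is hypothetical, and the sketch assumes that the alignment is ordered from most to least similar to the E. coli protein and that the first sequence of each stride is the one displayed.

```python
def compress_view(sequences, compress_to, view_top):
    """Pick evenly spaced sequences, then keep the first view_top of them."""
    step = max(1, len(sequences) // compress_to)
    compressed = sequences[::step][:compress_to]
    return compressed[:view_top]


# Stand-in for an alignment with 1000 sequences, identified by position.
alignment = list(range(1000))
diverse = compress_view(alignment, compress_to=10, view_top=10)
most_similar = compress_view(alignment, compress_to=1000, view_top=20)
```

With these settings, `diverse` holds every 100th sequence (positions 0, 100, ..., 900), while `most_similar` holds the first 20 sequences of the alignment.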