STEPdb: Sub-cellular Topologies of E.coli Polypeptides.
Information about STEPdb server
Index:
1. Updates
2. Search
3. Glossary
4. Topology and orientation of IM proteins
5. Internal Connections
6. Color Code
7. Multifun Terms
8. MatureP classifier
1. Updates:
STEPdb has been updated: version 2.0 (STEPdb2.0) has been available since February 2019. To see the changes in the E. coli K-12 proteome, please consult the downloads section. For more information, see Loos et al., 2019.
2. Search:
In STEP, search is performed locally on the displayed proteome or sub-proteome, that is, only within the current table
(e.g. a user browsing the Complexome can search only for entries within that group of proteins).
Search is currently basic. Logical operators such as OR and AND are not supported
by the search tool, and the query text is matched intact
(e.g. words separated by a space are not searched separately).
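The search behaviour described above can be sketched as a plain substring match over the rows of the current table. This is a minimal illustration, not the server's code: the function name `search_table`, the row layout, the toy entries and the case-insensitive matching are all assumptions.

```python
def search_table(rows, query):
    """Return the rows of the current table that contain the query as one intact string."""
    needle = query.strip().lower()  # assumption: matching ignores case
    if not needle:
        return list(rows)
    return [
        row for row in rows
        if any(needle in str(value).lower() for value in row.values())
    ]


# Toy stand-in for the table currently displayed (hypothetical entries).
current_table = [
    {"name": "toy complex one", "location": "A"},
    {"name": "copper / silver efflux transport system", "location": "B&H"},
]

# The whole query is matched as a single string, spaces included.
hits = search_table(current_table, "efflux transport")

# "OR" is not an operator here, so this literal string matches nothing.
missed = search_table(current_table, "efflux OR toy")
```

Because only `current_table` is passed in, entries from other tables can never be returned, which mirrors the local scope of the search.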
3. Glossary:
Basic Proteome of E.coli K-12 (MG1655)
STEP defines the "basic" proteome of E.coli K-12 (MG1655) as the fraction of the proteome that is devoid of gene products that are
likely not produced, or that result from genomic insertions enriched in defective prophages, transposons, pseudogenes, integrases and
mobile elements. The "basic" proteome was defined by manual annotation based on EcoGene
(Rudd, 2000), Uniprot
(Dimmer et al, 2012) and other studies
(McClelland et al, 2001;
Ochman et al, 2000). F1 plasmid-encoded proteins
were also removed from the Uniprot data of the E.coli K-12 proteome used (release version of November 2010).
According to our analysis, the essential protein-coding sequences devoid of these elements encompass only 3899 proteins.
Core Proteome of E.coli K-12
The common core proteome of all E.coli strains consists of 2073 proteins. These were derived from genome comparison with
Multi-Genome Homology Comparison Tool (Davidsen et al, 2010)
of 15 E.coli strains (Papanastasiou et al, submitted).
Protein Solubility
Protein structure is encoded in the primary amino acid sequence. However, the folding of
a protein to its functional form has, in some cases, to overcome misfolded states that can lead to the formation of
inclusion bodies.
Niwa et al 2009 calculated protein solubility
for 3173 Escherichia coli proteins in a chaperone-free reconstituted translation system.
The aggregation propensity of each protein was examined by a centrifugation assay. Solubility is defined as the index
of aggregation propensity, expressed as the proportion of the supernatant fraction obtained after
centrifugation of a translation mixture relative to the uncentrifuged total protein. Solubility is therefore a percentage
that ranges from 0% to 100%.
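As a small worked illustration of the definition above, solubility is the supernatant amount divided by the uncentrifuged total, expressed as a percentage. The function name and the input checks are assumptions added for the sketch; they are not taken from Niwa et al.

```python
def solubility_percent(supernatant, total):
    """Solubility = supernatant fraction / uncentrifuged total, as a percentage."""
    if total <= 0:
        raise ValueError("total protein must be positive")
    if not 0 <= supernatant <= total:
        raise ValueError("supernatant must lie between 0 and the total")
    return 100.0 * supernatant / total


# A protein with a quarter of its signal left in the supernatant.
example = solubility_percent(supernatant=25.0, total=100.0)
```

Both quantities must be measured in the same units, since only their ratio matters.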
Sub-cellular localization special characters and formalisms
To denote multiple localizations that have been experimentally established we introduced the comma "," formalism,
whereas a slash "/" denotes two or more possible sub-cellular locations that have not yet been experimentally determined.
To denote the sub-cellular location of protein complexes that span both cellular membranes we introduced the "&" formalism;
the corresponding complexes (e.g. the copper / silver efflux transport system)
are thus annotated as "B&H".
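The three formalisms can be read mechanically, as in the sketch below. The function name, the returned labels and the example annotations "A,F1" and "A/G" are hypothetical; the sketch also assumes that a single annotation uses only one kind of separator.

```python
def parse_localization(annotation):
    """Split a localization annotation into its classes and the formalism used."""
    text = annotation.replace(" ", "")
    if "&" in text:
        # complex spanning both cellular membranes
        return {"kind": "spans", "classes": text.split("&")}
    if "," in text:
        # several locations, each experimentally established
        return {"kind": "established", "classes": text.split(",")}
    if "/" in text:
        # alternative locations, not yet experimentally determined
        return {"kind": "undetermined", "classes": text.split("/")}
    return {"kind": "single", "classes": [text]}


spanning = parse_localization("B&H")
multiple = parse_localization("A,F1")   # hypothetical annotation
possible = parse_localization("A/G")    # hypothetical annotation
```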
Exportome and subclasses (Secretome, non-classical secretion etc.)
We define as exportome those proteins that are localized within the inner membrane and beyond (e.g. lipoproteins, extra-cellular proteins).
This includes the STEPdb sub-cellular classes: B, I, E, F2, F3, G, X, F4.
We divide the exportome into two subclasses, the membranome and the secretome. The membranome contains proteins that are embedded in the inner membrane,
whereas the secretome refers to proteins that are fully translocated across the inner membrane.
Membranome and secretome can be further divided into proteins that are substrates of the Sec and Tat secretion pathways. Finally, the secretome includes a
particular class of non-classical secretory proteins, which are secreted without the presence of an apparent signal peptide motif.
4. Topology and orientation of IM proteins
Transmembrane regions of integral membrane proteins were predicted using Phobius.
In cases where Phobius failed to identify any transmembrane region, the prediction of
TMHMM was used instead. The predicted orientation of each polypeptide sequence,
which corresponds to the location of its C-terminus (cytoplasmic or periplasmic), was re-evaluated based on the experimentally
verified C-terminus location of 734 transmembrane proteins (Daley et al, 2005).
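The order of precedence described above (Phobius first, TMHMM as fallback, experimental C-terminus data overriding the predicted orientation) can be sketched as follows. The sketch does not run Phobius or TMHMM; it assumes their outputs were already parsed into dictionaries of a hypothetical shape, and all names and toy values are illustrative.

```python
def resolve_topology(protein_id, phobius, tmhmm, experimental_c_term):
    """Combine pre-parsed predictions for one protein (hypothetical data layout).

    phobius / tmhmm: dict mapping protein id to
        {"tm_regions": [(start, end), ...], "c_term": "cytoplasmic" or "periplasmic"}
    experimental_c_term: dict mapping protein id to the verified C-terminus location
    """
    prediction, source = phobius.get(protein_id), "Phobius"
    if prediction is None or not prediction["tm_regions"]:
        # Phobius found no transmembrane region: fall back to TMHMM.
        prediction, source = tmhmm.get(protein_id), "TMHMM"
    if prediction is None or not prediction["tm_regions"]:
        return None  # neither tool predicts a transmembrane region

    verified = protein_id in experimental_c_term
    c_term = experimental_c_term[protein_id] if verified else prediction["c_term"]
    return {
        "tm_regions": prediction["tm_regions"],
        "c_term": c_term,
        "source": source,
        "c_term_verified": verified,
    }


# Toy inputs (made up for illustration).
phobius = {"p1": {"tm_regions": [], "c_term": "cytoplasmic"}}
tmhmm = {"p1": {"tm_regions": [(10, 32)], "c_term": "cytoplasmic"}}
experimental = {"p1": "periplasmic"}

result = resolve_topology("p1", phobius, tmhmm, experimental)
```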
5. Internal Connections
There are internal connections between some of the tables. Proteins listed in the E.coli K-12 proteome
table link to the list of the complexes in which they participate. Furthermore, from the Complexome table the user can
view the schematic representation of each complex and, through this schematic, link directly to the K-12 table.
6. Color Code
STEPdb follows a specific color code to represent proteins in the various sub-cellular locations.
The E.coli K-12 export systems and the Peripherome are drawn as cartoons in which each
protein is represented as a filled circle following the color code below. Additionally, in the "Complexome" page,
each complex is drawn dynamically upon clicking the "draw" button. The protein subunits of each complex also follow STEPdb's color code.
| Nucleoid (N) |
| Cytoplasmic (A) |
| Ribosomal (r) |
| Peripherally associated with the plasma membrane facing the cytoplasm (F1) |
| Inner membrane protein (B) |
| Peripherally associated with the plasma membrane facing the periplasm (F2) |
| Inner membrane lipoprotein (E) |
| Periplasmic (G) |
| Outer membrane lipoprotein (I) |
| Peripherally associated with the outer membrane facing the periplasm (F3) |
| Outer membrane β-barrel protein (H) |
| Peripherally associated with the outer membrane facing the extra-cellular space (F4) |
| Other |
7. Multifun Terms
Peripheral inner membrane proteins were classified into eight major categories of cellular function, mainly based on MultiFun terms
(Serres & Riley, 2000). These are summarized in the table below.
| Metabolism | MultiFun:1 Metabolism | GO:metabolism | GO:0008152 |
| DNA-related | MultiFun:2.1 DNA related | GO:DNA metabolism | GO:0006259 |
| RNA-related | MultiFun:2.2 RNA related | GO:RNA metabolism | GO:0016070 |
| Protein-related | MultiFun:2.3 Protein related | GO:protein biosynthesis | GO:0006412 |
| Transport | MultiFun:4 Transport | GO:transport | GO:0006810 |
| Cell division | MultiFun:5.1 Cell division | GO:cytokinesis | GO:0000910 |
| Response to stress | MultiFun:5.5 Adaptation to stress | GO:response to stress | GO:0006950 |
| Cell structure | MultiFun:6 Cell structure | GO:cellular_component | GO:0005575 |
8. MatureP classifier
Methods
The MatureP classifier distinguishes Sec secretory proteins from cytoplasmic ones.
Two methods are provided:
1. MatureP, a classifier that accepts only the mature sequences of potential secretory or
cytoplasmic proteins (i.e. known or potential signal peptide sequences must be removed).
2. SP-MatureP, a combinatorial classifier that takes into account both the MatureP and the pre-protein classifiers.
In the SP-MatureP method, the pre-protein classifier first predicts the existence of a signal peptide sequence and then the
MatureP classifier tests the validity of the mature sequence. SP-MatureP decides whether a sequence is “cytoplasmic”,
a mature or a secretory pre-protein sequence or, more interestingly, “non-secretory” (i.e. possessing
a signal peptide but having a non-compatible mature sequence).
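One way to read the SP-MatureP decision described above is as a two-step rule that combines the signal peptide call with the MatureP call on the mature sequence. The sketch below is an interpretation, not the published implementation: the function name, the exact label strings and the use of a score threshold of zero as the default are assumptions.

```python
def sp_maturep_label(has_signal_peptide, mature_score, threshold=0.0):
    """Combine a signal peptide call with a MatureP score on the mature sequence."""
    mature_is_secretory = mature_score > threshold
    if has_signal_peptide:
        # Signal peptide present: the mature domain must also look secretory.
        return "secretory pre-protein" if mature_is_secretory else "non-secretory"
    # No signal peptide: either an already mature secretory sequence or cytoplasmic.
    return "secretory mature" if mature_is_secretory else "cytoplasmic"


# Toy inputs (made-up scores).
labels = [
    sp_maturep_label(True, 1.2),    # signal peptide, compatible mature sequence
    sp_maturep_label(True, -0.8),   # signal peptide, non-compatible mature sequence
    sp_maturep_label(False, 0.4),   # no signal peptide, secretory-like mature sequence
    sp_maturep_label(False, -2.0),  # no signal peptide, cytoplasmic-like sequence
]
```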
MatureP score threshold
MatureP is a linear classifier that explores a variety of features derived from the amino acid sequence, such as amino acids,
di-peptides and tri-peptides, or pairwise interaction energy. MatureP assigns a classification score to each provided
sequence. The final decision of the classifier depends on the selected score threshold, above which proteins are considered
to be secretory. The most commonly used threshold is zero, in which case positively scored sequences are predicted as
secretory, but a different threshold can be chosen. Using the scores of the training samples we can draw the hit rate (y-axis)
versus score (x-axis) curves for both the positive and the negative classes, that is, the percentage of correctly predicted positive/negative
samples when a given score is used as the classification threshold. When the hit rate of the positive class increases, the hit rate of
the negative class decreases. For every classifier there exists a score threshold at which the two hit rates are equal.
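The hit rate curves and the threshold at which they meet can be illustrated with a few lines of code. The scores below are made up, the function names are hypothetical, and with a finite sample the two hit rates may only be approximately equal, so the sketch picks the candidate threshold where they are closest.

```python
def hit_rates(pos_scores, neg_scores, threshold):
    """Fraction of correctly classified positives and negatives at a threshold."""
    pos_hit = sum(s > threshold for s in pos_scores) / len(pos_scores)
    neg_hit = sum(s <= threshold for s in neg_scores) / len(neg_scores)
    return pos_hit, neg_hit


def balanced_threshold(pos_scores, neg_scores):
    """Observed score at which the two hit rates are closest to each other."""
    candidates = sorted(set(pos_scores) | set(neg_scores))

    def gap(threshold):
        pos_hit, neg_hit = hit_rates(pos_scores, neg_scores, threshold)
        return abs(pos_hit - neg_hit)

    return min(candidates, key=gap)


# Toy training scores (made up): secretory = positive, cytoplasmic = negative.
pos = [2.0, 1.0, 0.5, -0.5]
neg = [-2.0, -1.0, 0.2, -0.3]

at_zero = hit_rates(pos, neg, 0.0)
best = balanced_threshold(pos, neg)
at_best = hit_rates(pos, neg, best)
```

Evaluating `hit_rates` over a grid of thresholds gives the two curves; lowering the threshold raises the positive hit rate and lowers the negative one.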
Datasets
Escherichia coli
505 Sec-dependent secretory and 2365 cytoplasmic sequences of the Escherichia coli K-12 proteome (STEPdb) were used
in the machine learning analysis. The class of secretory proteins includes eight sub-cellular categories of STEPdb (see table below).
Only proteins that use the Sec secretion system for their translocation from the cytoplasm to the periplasm were included.
39 proteins that carry a Tat signal peptide or use the flagellar Type III system were excluded.
The cleavage sites of the type I signal peptides (e.g. periplasmic proteins excluding lipoproteins) were predicted using
SignalP 4.0 and Phobius.
The cleavage sites of the type II signal peptides (i.e. inner and outer membrane lipoproteins) were predicted with
LipoP.
| Peripheral inner membrane protein facing the periplasm | F2 | 10 |
| Inner Membrane Lipoprotein | E | 21 |
| Periplasmic | G | 295 |
| Peripheral outer membrane protein facing the periplasm | F3 | 8 |
| Outer Membrane Lipoprotein | I | 94 |
| Outer Membrane β-barrel protein | H | 64 |
| Peripheral outer membrane protein facing the extra-cellular space | F4 | 12 |
| Extra-cellular | X | 1 |
| Total | | 505 |
| Cytoplasmic | A | 1851 |
| Peripheral proteins | F1 | 514 |
| Total | | 2365 |
Other Gram-negative and Gram-positive bacteria
To test whether the features that MatureP selects are universal, we measured its effectiveness in predicting secretory proteins
from 25 Gram-negative and 10 Gram-positive bacteria from various phyla (7120 and 1361 secretory proteins, respectively; see table below).
These were identified as Sec secretory proteins by combining SignalP 4.0,
LipoP and PRED-TAT.
| 1 | Salmonella bongori N268-08 | - | 1197719 | 4751 | 376 | 265 | 111 |
| 2 | Yersinia pestis bv. Antiqua (strain Antiqua) | - | 360102 | 4136 | 331 | 246 | 85 |
| 3 | Citrobacter freundii UCI 31 | - | 1400136 | 4932 | 472 | 351 | 121 |
| 4 | Klebsiella pneumoniae (strain 342) | - | 507522 | 5739 | 518 | 383 | 135 |
| 5 | Pseudomonas fluorescens | - | 294 | 7426 | 783 | 565 | 218 |
| 6 | Acinetobacter baumannii (strain ACICU) | - | 405416 | 3746 | 359 | 219 | 140 |
| 7 | Coxiella burnetii (strain RSA 331 / Henzerling II) | - | 360115 | 1892 | 90 | 55 | 35 |
| 8 | Legionella pneumophila subsp. pneumophila (strain Philadelphia 1 / ATCC 33152 / DSM 7513) | - | 272624 | 2950 | 225 | 167 | 58 |
| 9 | Haemophilus influenzae | - | 727 | 3568 | 349 | 257 | 92 |
| 10 | Neisseria gonorrhoeae | - | 485 | 6015 | 561 | 389 | 172 |
| 11 | Bordetella pertussis (strain CS) | - | 1017264 | 3275 | 286 | 223 | 63 |
| 12 | Ralstonia pickettii (strain 12J) | - | 402626 | 4891 | 490 | 374 | 116 |
| 13 | Bartonella quintana JK 19 | - | 1134507 | 1308 | 27 | 0 | 27 |
| 14 | Brucella melitensis biotype 2 (strain ATCC 23457) | - | 546272 | 3125 | 216 | 173 | 43 |
| 15 | Candidatus Liberibacter asiaticus str. Ishi-1 | - | 931202 | 1068 | 31 | 10 | 21 |
| 16 | Helicobacter pylori (strain HPAG1) | - | 357544 | 1542 | 107 | 69 | 38 |
| 17 | Campylobacter lari (strain RM2100 / D67 / ATCC BAA-1060) | - | 306263 | 1545 | 96 | 58 | 38 |
| 18 | Campylobacter coli 2548 | - | 887315 | 1809 | 95 | 50 | 45 |
| 19 | Chlamydia trachomatis (strain D/UW-3/Cx) | - | 272561 | 897 | 46 | 27 | 19 |
| 20 | Chlamydophila pneumoniae | - | 83558 | 4081 | 205 | 119 | 86 |
| 21 | Bacteroides fragilis (strain YCH46) | - | 295405 | 4598 | 856 | 385 | 471 |
| 22 | Capnocytophaga canimorsus (strain 5) | - | 860228 | 2395 | 324 | 126 | 198 |
| 23 | Mycoplasma pneumoniae 309 | - | 1112856 | 708 | 50 | 5 | 45 |
| 24 | Streptococcus equinus (Streptococcus bovis) | - | 1335 | 1996 | 59 | 27 | 32 |
| 25 | Borrelia hermsii YOR | | | 1591 | 168 | 17 | 151 |
| Total | | | | | 7120 | 4560 | 2560 |
| 1 | Enterococcus faecalis (strain 62) | + | 936153 | 3011 | 154 | 81 | 73 |
| 2 | Streptococcus pneumoniae (strain 70585) | + | 488221 | 2179 | 74 | 31 | 43 |
| 3 | Streptococcus uberis (strain ATCC BAA-854 / 0140J) | + | 218495 | 1761 | 59 | 25 | 34 |
| 4 | Staphylococcus aureus (strain NCTC 8325) | + | 93061 | 2892 | 112 | 54 | 58 |
| 5 | Listeria ivanovii WSLC3009 | + | 1457190 | 2773 | 160 | 83 | 77 |
| 6 | Bacillus cereus (strain ATCC 10987) | + | 222523 | 5835 | 292 | 126 | 166 |
| 7 | Clostridium tetani (strain Massachusetts / E88) | + | 212717 | 2416 | 117 | 47 | 70 |
| 8 | Mycobacterium tuberculosis (strain ATCC 25177 / H37Ra) | + | 419947 | 3993 | 125 | 62 | 63 |
| 9 | Mycobacterium paratuberculosis (strain ATCC BAA-968 / K-10) | + | 262316 | 4321 | 135 | 65 | 70 |
| 10 | Mycobacterium bovis | + | 1765 | 4353 | 133 | 67 | 66 |
Non redundant datasets
According to Nielsen et al., the training and test sets should be non-redundant,
and similar (homologous) sequences should be discarded to avoid overestimating the predictive performance of the classifiers.
We performed redundancy reduction on the original dataset (above) following the procedure used by
SignalP, with the algorithm of
Hobohm, which relies on iterative position-specific alignments.
The BLAST+ suite of NCBI was used: the makeblastdb command to convert the input FASTA files into BLAST database files, and the psiblast
command, which implements the position-specific iterative basic local alignment search of Altschul et al.
This resulted in a non-redundant dataset of 1070 cytoplasmic proteins, 207 preproteins and 247 mature domain sequences.
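The idea behind redundancy reduction can be illustrated with a greedy selection in the spirit of Hobohm's first algorithm: walk through the sequences in order and keep one only if it is not similar to any sequence already kept. Which Hobohm variant and which similarity cutoff were used is not stated here, so the sketch simply assumes that the pairs of similar sequences (`similar_pairs`, for example derived from alignment hits) have already been computed; all names and toy identifiers are hypothetical.

```python
def greedy_nonredundant(sequence_ids, similar_pairs):
    """Keep a sequence only if it is not similar to any sequence kept so far."""
    neighbours = {sid: set() for sid in sequence_ids}
    for a, b in similar_pairs:
        if a != b and a in neighbours and b in neighbours:
            neighbours[a].add(b)
            neighbours[b].add(a)

    kept = []
    for sid in sequence_ids:  # the input order decides which homologue survives
        if neighbours[sid].isdisjoint(kept):
            kept.append(sid)
    return kept


# Toy example: p1~p2 and p2~p3 are similar, p4 is unrelated.
ids = ["p1", "p2", "p3", "p4"]
pairs = [("p1", "p2"), ("p2", "p3")]
reduced = greedy_nonredundant(ids, pairs)
```

In this toy case p2 is dropped because p1 was kept, while p3 survives because its only homologue (p2) is no longer in the set.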