STEPdb: Sub-cellular Topologies of E.coli Polypeptides.

Information about STEPdb server

  • Index:

         1. Updates
         2. Search
         3. Glossary
         4. Topology and orientation of IM proteins
         5. Internal Connections
         6. Color Code
         7. Multifun Terms
         8. MatureP classifier

  • 1. Updates:

    STEPdb has been updated: version STEPdb2.0 has been available since February 2019. To see the changes in the E. coli K-12 proteome, please consult the downloads section. For more information, see Loos et al., 2019.

  • 2. Search:

    In STEPdb the search function operates locally on the displayed proteome or sub-proteome, so a search covers only the current table (e.g. a user browsing the Complexome can search only for entries within that group of proteins).

    Search is currently not very sophisticated. Logical operators such as OR and AND are not supported, and the query text is searched intact (e.g. words separated by a space are not searched separately).
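
    The search behaviour described above can be pictured with a minimal sketch. The function name, the row layout and the case-insensitive matching are illustrative assumptions, not the server's actual implementation:

```python
def search_table(rows, query):
    """Return rows in which any field contains the query as one intact string."""
    needle = query.lower()  # case-insensitive matching is an assumption
    return [
        row for row in rows
        if any(needle in str(value).lower() for value in row.values())
    ]


# Hypothetical rows standing in for the currently displayed table.
complexome = [
    {"gene": "secA", "description": "protein translocase subunit SecA"},
    {"gene": "secY", "description": "protein translocase subunit SecY"},
]

# The whole phrase is matched as a single substring.
print(search_table(complexome, "translocase subunit"))
# "OR" is treated as literal text, so this finds nothing.
print(search_table(complexome, "secA OR secY"))
```

    Because the query is never split, reordering the words of a phrase also yields no match in this sketch.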

  • 3. Glossary:

      Basic Proteome of E.coli K-12 (MG1655)

      STEPdb defines the "basic" proteome of E.coli K-12 (MG1655) as the fraction of the proteome that is devoid of gene products that are likely not produced, or that result from genomic insertions enriched in defective prophages, transposons, pseudogenes, integrases and mobile elements. The "basic" proteome was defined by manual annotation, based on EcoGene (Rudd, 2000), Uniprot (Dimmer et al, 2012) and other studies (McClelland et al, 2001; Ochman et al, 2000). F1 plasmid-encoded proteins were also removed from the Uniprot data of the E.coli K-12 proteome used (release version of November 2010). According to our analysis, the essential protein-coding sequences devoid of these elements encompass only 3899 proteins.

      Core Proteome of E.coli K-12

      The common core proteome of all E.coli strains consists of 2073 proteins. These were derived from a genome comparison of 15 E.coli strains using the Multi-Genome Homology Comparison Tool (Davidsen et al, 2010) (Papanastasiou et al, submitted).

      Protein Solubility

      Protein structure is encoded in the primary amino acid sequence. However, when folding to its functional form, a protein must in some cases overcome misfolded states that can lead to inclusion bodies. Niwa et al, 2009 calculated protein solubility for 3173 Escherichia coli proteins in a chaperone-free reconstituted translation system. The aggregation propensity of each protein was examined by a centrifugation assay. Solubility is defined as the index of aggregation propensity, expressed as the proportion of the supernatant fraction obtained after centrifugation of a translation mixture relative to the uncentrifuged total protein. Solubility is therefore a percentage and ranges from 0% to 100%.
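
      As a small worked illustration of this definition, the ratio can be written as a function. The function and argument names are hypothetical, and the two quantities are assumed to be measured in the same units:

```python
def solubility_percent(supernatant_amount, total_amount):
    """Solubility as the supernatant fraction of the uncentrifuged total, in percent."""
    if total_amount <= 0:
        raise ValueError("total protein amount must be positive")
    if not 0 <= supernatant_amount <= total_amount:
        raise ValueError("supernatant amount must lie between 0 and the total")
    return 100.0 * supernatant_amount / total_amount


# Example: a quarter of the translated protein stays in the supernatant.
print(solubility_percent(25.0, 100.0))
```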

      Manual Curation and non-experimental qualifiers

      We use the same non-experimental qualifiers as Uniprot:

        Potential: There is some logical or conclusive evidence that the given annotation could apply. This non-experimental qualifier is often used to present results from protein sequence analysis software tools, which are only annotated if the result makes sense in the biological context of a given protein.
        Probable: Indicates stronger evidence than the qualifier "Potential". This qualifier implies that there must be at least some experimental evidence indicating that the information is expected to be found in the natural environment of a protein.
        By similarity: When some biological information was experimentally obtained for a given protein (or part of it), it may be transferred to other protein family members within a certain taxonomic range, dependent on the biological event or characteristic.

      Sub-cellular localization special characters and formalisms

      To denote multiple localization possibilities that have been experimentally established we introduced the comma "," formalism, whereas a slash "/" denotes two or more possible sub-cellular locations that have not yet been experimentally determined.

      To denote the sub-cellular location of protein complexes that span both cellular membranes we introduced the "&" formalism; the corresponding complexes (e.g. the copper/silver efflux transport system) are thus annotated as "B&H".
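
      The three formalisms can be read mechanically, as in the following minimal sketch. The function name and the returned labels are illustrative, and the sketch assumes that an annotation uses only one kind of separator:

```python
def parse_location(annotation):
    """Split a location annotation into (interpretation, list of class codes)."""
    separators = (
        ("&", "complex spanning both membranes"),
        (",", "multiple locations, experimentally established"),
        ("/", "possible locations, not yet experimentally determined"),
    )
    for separator, meaning in separators:
        if separator in annotation:
            codes = [part.strip() for part in annotation.split(separator)]
            return meaning, codes
    return "single location", [annotation.strip()]


print(parse_location("B&H"))
print(parse_location("A,F1"))
print(parse_location("G/X"))
print(parse_location("G"))
```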

      Exportome and subclasses (Secretome, non-classical secretion etc.)

      We define the exportome as those proteins that are localized within the inner membrane and beyond (e.g. lipoproteins, extra-cellular proteins). This includes the STEPdb sub-cellular classes B, I, E, F2, F3, G, X and F4. We divide the exportome into two subclasses, the membranome and the secretome. The membranome contains proteins that are embedded in the inner membrane, whereas the secretome refers to proteins that are fully translocated across the inner membrane. Membranome and secretome can be further divided into proteins that are substrates of the Sec and Tat secretion pathways. Finally, the secretome includes a particular class of non-classical secretory proteins, which are secreted without an apparent signal peptide motif.
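
      A minimal sketch of exportome membership, using only the class list given above. The helper names and the example gene-to-class mapping are hypothetical:

```python
# Sub-cellular classes listed in the text as belonging to the exportome.
EXPORTOME_CLASSES = frozenset({"B", "I", "E", "F2", "F3", "G", "X", "F4"})


def is_exportome(location_code):
    """True if a single STEPdb class code belongs to the exportome."""
    return location_code in EXPORTOME_CLASSES


def split_by_exportome(proteins):
    """Partition a {name: class code} mapping into exportome and other names."""
    exported = sorted(name for name, code in proteins.items() if is_exportome(code))
    other = sorted(name for name, code in proteins.items() if not is_exportome(code))
    return exported, other


# Hypothetical example entries; the class codes here are for illustration only.
example = {"protein_1": "G", "protein_2": "A", "protein_3": "F2", "protein_4": "r"}
print(split_by_exportome(example))
```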

  • 4. Topology and orientation of IM proteins

    Transmembrane regions of integral membrane proteins were predicted using Phobius. Where Phobius failed to identify any transmembrane region, the TMHMM prediction was used instead. The predicted orientation of each polypeptide, which corresponds to the location of the C-terminus (cytoplasmic or periplasmic), was revised on the basis of the experimentally verified C-terminus location of 734 transmembrane proteins (Daley et al, 2005).
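
    The fallback and override logic described above might be sketched as follows. The predictors are passed in as plain callables; the stub functions below are hypothetical stand-ins and do not call Phobius or TMHMM:

```python
def predict_tm_regions(sequence, primary, fallback):
    """Use the primary predictor; fall back only when it finds no region."""
    regions = primary(sequence)
    if regions:
        return regions, "primary"
    return fallback(sequence), "fallback"


def c_terminus_location(predicted, experimental=None):
    """Prefer an experimentally verified C-terminus location when one exists."""
    return experimental if experimental is not None else predicted


# Stub predictors standing in for Phobius and TMHMM (hypothetical wrappers).
def stub_phobius(sequence):
    return []  # pretend no transmembrane region was found


def stub_tmhmm(sequence):
    return [(5, 25)]  # pretend one region spanning residues 5-25


regions, source = predict_tm_regions("M" * 40, stub_phobius, stub_tmhmm)
print(regions, source)
print(c_terminus_location("cytoplasmic", experimental="periplasmic"))
```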

  • 5. Internal Connections

    Some of the tables are connected internally. Proteins listed in the E.coli K-12 proteome table link to the list of complexes in which they participate. Furthermore, from the Complexome table the user can view the schematic representation of each complex and, through this schematic, link directly to the K-12 table.

  • 6. Color Code

    STEPdb follows a specific color code to represent proteins in the various sub-cellular locations. The E.coli K-12 export systems and the Peripherome are drawn as cartoons in which each protein is represented as a filled circle following the color code below. Additionally, on the "Complexome" page, each complex is drawn dynamically upon clicking the "draw" button. The protein subunits of each complex also follow STEPdb's color code.

    Protein Symbol Protein Localization
    Nucleoid (N)
    Cytoplasmic (A)
    Ribosomal (r)
    Peripherally associated with the plasma membrane facing the cytoplasm (F1)
    Inner membrane protein (B)
    Peripherally associated with the plasma membrane facing the periplasm (F2)
    Inner membrane lipoprotein (E)
    Periplasmic (G)
    Outer membrane lipoprotein (I)
    Peripherally associated with the outer membrane facing the periplasm (F3)
    Outer membrane β-barrel protein (H)
    Peripherally associated with the outer membrane facing the extra-cellular space (F4)

  • 7. Multifun Terms

    Peripheral inner membrane proteins were classified in eight major categories of cellular function mainly based on Multifun Terms (Serres & Riley, 2000). These are summarized in the table below.

    Cellular Process    Multifun term                      GO term                  GO id
    Metabolism          MultiFun:1 Metabolism              GO:metabolism            GO:0008152
    DNA-related         MultiFun:2.1 DNA related           GO:DNA metabolism        GO:0006259
    RNA-related         MultiFun:2.2 RNA related           GO:RNA metabolism        GO:0016070
    Protein-related     MultiFun:2.3 Protein related       GO:protein biosynthesis  GO:0006412
    Transport           MultiFun:4 Transport               GO:transport             GO:0006810
    Cell division       MultiFun:5.1 Cell division         GO:cytokinesis           GO:0000910
    Response to stress  MultiFun:5.5 Adaptation to stress  GO:response to stress    GO:0006950
    Cell structure      MultiFun:6 Cell structure          GO:cellular_component    GO:0005575

  • 8. MatureP classifier


      The MatureP classifier distinguishes Sec secretory proteins from cytoplasmic ones. Two methods are provided:

        1. MatureP, a classifier that accepts only the mature sequences of potential secretory or cytoplasmic proteins (i.e. known or potential signal peptide sequences must be removed).
        2. SP-MatureP, a combinatorial classifier that takes into account both the MatureP and the pre-protein classifiers.

      In the SP-MatureP method, the pre-protein classifier first predicts the existence of a signal peptide sequence and the MatureP classifier then tests the validity of the mature sequence. SP-MatureP decides whether a sequence is “cytoplasmic”, a mature sequence or a secretory pre-protein sequence or, more interestingly, whether it is “non-secretory” (i.e. it possesses a signal peptide but has a non-compatible mature sequence).
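
      One possible reading of how the two classifier outputs combine is sketched below. The function name, the label strings and the mapping of the four cases are assumptions made for illustration; they are not taken from the SP-MatureP implementation:

```python
def sp_maturep_label(has_signal_peptide, mature_score, threshold=0.0):
    """Combine a signal peptide call with a mature-domain score (illustrative only)."""
    # Assumption: a mature-domain score above the threshold means "secretory".
    mature_is_secretory = mature_score > threshold
    if has_signal_peptide:
        if mature_is_secretory:
            return "secretory pre-protein"
        # Signal peptide present but the mature sequence is not compatible.
        return "non-secretory"
    if mature_is_secretory:
        return "secretory mature sequence"
    return "cytoplasmic"


for case in [(True, 1.2), (True, -0.7), (False, 0.4), (False, -2.0)]:
    print(case, sp_maturep_label(*case))
```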

      MatureP score threshold

      MatureP is a linear classifier that explores a variety of features derived from the amino acid sequence, such as amino acids, di-peptides and tri-peptides, or pairwise interaction energy. MatureP assigns a classification score to each provided sequence. The final decision of the classifier depends on the selected score threshold, above which proteins are considered to be secretory. The most commonly used threshold is zero, in which case positively scored sequences are predicted as secretory, but a different threshold can be chosen. Using the scores of the training samples we can draw the hit rate curves (y-axis) versus score (x-axis) for both the positive and the negative classes (press the button below to draw the hit rate distributions of MatureP). The hit rate is the percentage of correctly predicted positive or negative samples when a given score is used as the classification threshold. As the hit rate of the positive class increases, the hit rate of the negative class decreases. For every classifier there exists a score threshold at which the two hit rates are equal.
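
      The relationship between the score threshold and the two hit rates can be illustrated with a short sketch. The scores are made-up numbers, not MatureP training scores, and the function names are hypothetical:

```python
def hit_rates(positive_scores, negative_scores, threshold):
    """Fraction of positives scored above, and negatives at or below, the threshold."""
    positive_rate = sum(s > threshold for s in positive_scores) / len(positive_scores)
    negative_rate = sum(s <= threshold for s in negative_scores) / len(negative_scores)
    return positive_rate, negative_rate


def balanced_threshold(positive_scores, negative_scores):
    """Observed score at which the two hit rates are closest to each other."""
    candidates = sorted(set(positive_scores) | set(negative_scores))

    def gap(threshold):
        positive_rate, negative_rate = hit_rates(positive_scores, negative_scores, threshold)
        return abs(positive_rate - negative_rate)

    return min(candidates, key=gap)


# Made-up scores for secretory (positive) and cytoplasmic (negative) samples.
secretory = [2.0, 1.0, 0.5, -0.5]
cytoplasmic = [-2.0, -1.0, -0.2, 0.3]

print(hit_rates(secretory, cytoplasmic, 0.0))
print(balanced_threshold(secretory, cytoplasmic))
```

      Raising the threshold in this sketch lowers the positive hit rate and raises the negative one, which is the trade-off the hit rate curves display.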


      Escherichia coli
      505 Sec-dependent secretory and 2365 cytoplasmic sequences of the Escherichia coli K-12 proteome (STEPdb) were used in the machine learning analysis. The class of secretory proteins includes eight sub-cellular categories of STEPdb (see table below). Only proteins that use the Sec secretion system for their translocation from the cytoplasm to the periplasm were included; 39 proteins with a Tat signal peptide or that use the flagellar Type III system were excluded. The cleavage sites of type I signal peptides (e.g. periplasmic proteins, excluding lipoproteins) were predicted using SignalP 4.0 and Phobius. The cleavage sites of type II signal peptides (i.e. inner and outer membrane lipoproteins) were predicted with LipoP.

      Sub-cellular Location                                              STEPdb nomenclature  # Proteins
      Sec Secretory proteins
      Peripheral inner membrane protein facing the periplasm             F2                   10
      Inner Membrane Lipoprotein                                         E                    21
      Periplasmic                                                        G                    295
      Peripheral outer membrane protein facing the periplasm             F3                   8
      Outer Membrane Lipoprotein                                         I                    94
      Outer Membrane β-barrel protein                                    H                    64
      Peripheral outer membrane protein facing the extra-cellular space  F4                   12
      Extra-cellular                                                     X                    1
      Total                                                                                   505
      Cytoplasmic                                                        A                    1851
      Peripheral proteins                                                F1                   514
      Total                                                                                   2365

      Other Gram-negative and Gram-positive bacteria

      To test whether the features that MatureP selects are universal, we measured its effectiveness in predicting secretory proteins from 25 Gram- and 10 Gram+ bacteria from various phyla (7120 and 1361 secretory proteins, respectively; see tables below). These were identified as Sec secretory proteins by combining SignalP 4.0, LipoP and PRED-TAT.

      # Strain (Uniprot) Gram class Organism ID (Uniprot) # Proteins # Secretory # Proteins with Type I signal peptides # proteins with Type II signal peptides
      1 Salmonella bongori N268-08 - 1197719 4751 376 265 111
      2 Yersinia pestis bv. Antiqua (strain Antiqua) - 360102 4136 331 246 85
      3 Citrobacter freundii UCI 31 - 1400136 4932 472 351 121
      4 Klebsiella pneumoniae (strain 342) - 507522 5739 518 383 135
      5 Pseudomonas fluorescens - 294 7426 783 565 218
      6 Acinetobacter baumannii (strain ACICU) - 405416 3746 359 219 140
      7 Coxiella burnetii (strain RSA 331 / Henzerling II) - 360115 1892 90 55 35
      8 Legionella pneumophila subsp. pneumophila (strain Philadelphia 1 / ATCC 33152 / DSM 7513) - 272624 2950 225 167 58
      9 Haemophilus influenzae - 727 3568 349 257 92
      10 Neisseria gonorrhoeae - 485 6015 561 389 172
      11 Bordetella pertussis (strain CS) - 1017264 3275 286 223 63
      12 Ralstonia pickettii (strain 12J) - 402626 4891 490 374 116
      13 Bartonella quintana JK 19 - 1134507 1308 27 0 27
      14 Brucella melitensis biotype 2 (strain ATCC 23457) - 546272 3125 216 173 43
      15 Candidatus Liberibacter asiaticus str. Ishi-1 - 931202 1068 31 10 21
      16 Helicobacter pylori (strain HPAG1) - 357544 1542 107 69 38
      17 Campylobacter lari (strain RM2100 / D67 / ATCC BAA-1060) - 306263 1545 96 58 38
      18 Campylobacter coli 2548 - 887315 1809 95 50 45
      19 Chlamydia trachomatis (strain D/UW-3/Cx) - 272561 897 46 27 19
      20 Chlamydophila pneumoniae - 83558 4081 205 119 86
      21 Bacteroides fragilis (strain YCH46) - 295405 4598 856 385 471
      22 Capnocytophaga canimorsus (strain 5) - 860228 2395 324 126 198
      23 Mycoplasma pneumoniae 309 - 1112856 708 50 5 45
      24 Streptococcus equinus (Streptococcus bovis) - 1335 1996 59 27 32
      25 Borrelia hermsii YOR 1591 168 17 151
      Total 7120 4560 2560

      # Strain (Uniprot) Gram class Organism ID (Uniprot) # Proteins # Secretory # Proteins with Type I signal peptides # proteins with Type II signal peptides
      1 Enterococcus faecalis (strain 62) + 936153 3011 154 81 73
      2 Streptococcus pneumoniae (strain 70585) + 488221 2179 74 31 43
      3 Streptococcus uberis (strain ATCC BAA-854 / 0140J) + 218495 1761 59 25 34
      4 Staphylococcus aureus (strain NCTC 8325) + 93061 2892 112 54 58
      5 Listeria ivanovii WSLC3009 + 1457190 2773 160 83 77
      6 Bacillus cereus (strain ATCC 10987) + 222523 5835 292 126 166
      7 Clostridium tetani (strain Massachusetts / E88) + 212717 2416 117 47 70
      8 Mycobacterium tuberculosis (strain ATCC 25177 / H37Ra) + 419947 3993 125 62 63
      9 Mycobacterium paratuberculosis (strain ATCC BAA-968 / K-10) + 262316 4321 135 65 70
      10 Mycobacterium bovis + 1765 4353 133 67 66
      Total 1361 641 720

      Non redundant datasets

      According to Nielsen et al., training and test sets should be non-redundant, and similar (homologous) sequences should be discarded to avoid overestimating the predictive performance of the classifiers. We performed redundancy reduction on the original dataset (above) following the procedure used by SignalP, which applies the algorithm of Hobohm to the results of iterative position-specific alignments. The blast+ suite of NCBI was used: the makeblastdb command converts the input fasta files into blast database files, and the psiblast command implements the position-specific iterative basic local alignment search of Altschul et al. This resulted in a non-redundant dataset of 1070 cytoplasmic proteins, 207 preproteins and 247 mature domain sequences.
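
      The selection step can be sketched as a greedy filter in the spirit of Hobohm's first algorithm. The text does not say which Hobohm variant was used, so this choice is an assumption. The sketch also assumes that the pairs of similar sequences have already been extracted from the psiblast output with some similarity cutoff, and all identifiers are hypothetical:

```python
def greedy_nonredundant(sequence_ids, similar_pairs):
    """Keep a sequence only if it is not similar to any sequence already kept."""
    neighbours = {}
    for first, second in similar_pairs:
        neighbours.setdefault(first, set()).add(second)
        neighbours.setdefault(second, set()).add(first)

    kept = []
    kept_set = set()
    for sequence_id in sequence_ids:  # input order decides which homologue survives
        if neighbours.get(sequence_id, set()).isdisjoint(kept_set):
            kept.append(sequence_id)
            kept_set.add(sequence_id)
    return kept


# Hypothetical identifiers and similarity pairs (e.g. parsed from alignment hits).
ids = ["seq_a", "seq_b", "seq_c", "seq_d"]
pairs = [("seq_a", "seq_b"), ("seq_b", "seq_c")]
print(greedy_nonredundant(ids, pairs))
```

      In this example seq_b is dropped because it is similar to seq_a, while seq_c is kept because its only similar sequence was already discarded.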


 ©2017 Copyright KU Leuven and FORTH/ICE-HT. Last Update: February 2019.