Redox Proteomics of Oxidative Stress-Induced Cell Death
Many drug molecules induce cell death through oxidative stress. We have carried out metabolomic and redox proteomic analyses to explore the mechanism of oxidative stress-induced cell death. We have found that oxidative stress induces cell-specific apoptosis or necrosis. Using a combination of double labeling, LC-MS/MS, and bioinformatic analysis, we identified more than 300 redox-sensitive proteins. The amino acid residue most frequently found immediately before or after the reactive cysteine in these proteins is the non-polar, neutral leucine, valine, or alanine. The thiol/disulfide ratios depend on the level of cellular GSH, which is regulated by reactive oxygen species. Our data provide a valuable resource for deciphering the redox regulation of proteins and for understanding oxidative stress-induced apoptosis.
Informatics Development for Mammalian Protein Turnover Studies
Protein synthesis and degradation, or protein turnover, as an essential part of living metabolism is highly regulated or controlled. The regulation of the rates of synthesis and degradation of any given protein depends on its sub-cellular functional requirements, localization, temporal participation in cellular machines and their trafficking. Gaining a global knowledge of protein dynamics on a proteome wide scale would provide a systematic basis for understanding healthy development and well-being of organisms as well as the potential causes and/or progression of diseases.
We have utilized organism-wide 15N isotopic labeling to measure the turnover rate of proteins in liver, blood and brain tissues of mice1. Utilizing high resolution LCMSMS, the 15N incorporation curves vs incorporation time points were accurately measured for tens of thousands of tryptic peptides. The measured turnover rates of over a thousand unique proteins from liver, blood, and brain tissues of mice correlate well with known biological properties of proteins in the tissues. In order to efficiently extract protein turnover data from our 15N metabolic labeling experiment, a sophisticated data processing pipeline has been developed to automatically process a large number of LCMSMS data files. The software pipeline includes established software tools such as a mass spectrometry database search engine together with several additional, novel data processing modules specifically developed for 15N metabolic labeling.
At this time, all proteome-scale studies of protein turnover rely on a simple single-exponential mechanism for stable isotope label incorporation. Such a simplistic model for protein turnover in mammalian tissues has several deficiencies: (1) the single-exponential functional forms derived from the simplistic incorporation mechanism do not fit the experimental data well, even when high-quality isotope incorporation data are available, and (2) protein turnover rate constants obtained with the simple model are contaminated by other dynamic processes, such as the turnover of free amino acids. To alleviate these deficiencies, we have established a formal connection between data obtained from elemental isotope labeling experiments and well-known compartment modeling, and we demonstrate that an appropriate application of a compartment model to the turnover of proteins from mammalian tissues can indeed lead to a better fit of the experimental data. In addition to individual protein turnover rate constants, important dynamic information, such as the values of free amino acid turnover rate constants, can be extracted simultaneously from the experimental isotope incorporation data.
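The contrast between the two models can be illustrated with a minimal sketch. It assumes the simplest possible compartment model, which is not necessarily the one used in this work: a single free amino acid pool that becomes labeled with rate constant k_a and feeds a protein pool that turns over with rate constant k_p. The function names, rate constants and time points are hypothetical.

```python
import math

def single_exponential(t, k_p):
    """Labeled fraction when the precursor pool is assumed fully labeled at t = 0."""
    return 1.0 - math.exp(-k_p * t)

def two_compartment(t, k_p, k_a):
    """Labeled protein fraction when the precursor (free amino acid) pool
    itself labels with rate k_a: solves df/dt = k_p * (a(t) - f) with
    a(t) = 1 - exp(-k_a * t) and f(0) = 0."""
    if math.isclose(k_p, k_a):
        # Limiting form when both rate constants coincide.
        return 1.0 - (1.0 + k_p * t) * math.exp(-k_p * t)
    return 1.0 - (k_a * math.exp(-k_p * t) - k_p * math.exp(-k_a * t)) / (k_a - k_p)

def fit_single_exponential(times, fractions):
    """Least-squares fit of k_p by a plain grid search (0.001 to 2.0 per time unit)."""
    best_k, best_sse = None, float("inf")
    for i in range(1, 2001):
        k = i * 0.001
        sse = sum((single_exponential(t, k) - f) ** 2 for t, f in zip(times, fractions))
        if sse < best_sse:
            best_k, best_sse = k, sse
    return best_k, best_sse

# Hypothetical labeling time points (days) and rate constants (per day).
times = [1, 2, 4, 8, 16, 32]
true_k_p, true_k_a = 0.30, 0.50
observed = [two_compartment(t, true_k_p, true_k_a) for t in times]

# Fitting the simple model to data that follow the compartment model
# leaves a residual and returns a rate constant below the true k_p.
k_fit, sse = fit_single_exponential(times, observed)
print(f"true k_p = {true_k_p}, single-exponential fit = {k_fit:.3f}, SSE = {sse:.4f}")
```

In this toy setting the single-exponential fit absorbs the delay caused by precursor labeling into an apparently slower protein turnover, which is the kind of contamination described in point (2).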
Protein Inference and Protein Quantification: Two Sides of the Same Coin
Motivation: In mass spectrometry-based shotgun proteomics, protein quantification and protein identification are two major computational problems. To quantify protein abundance, a list of proteins must first be inferred from the sample. The relative or absolute protein abundance is then estimated with quantification methods such as spectral counting. Until now, researchers have dealt with these two processes separately. In fact, they are two sides of the same coin, in the sense that truly present proteins are those with non-zero abundances. One interesting question is therefore: if we regard the protein inference problem as a special protein quantification problem, is it possible to achieve better protein inference performance?
Contribution: In this paper, we investigate the feasibility of using protein quantification methods to solve the protein inference problem. Protein inference determines whether each candidate protein is present in the sample or not. Protein quantification calculates the abundance of each protein. Naturally, absent proteins should have zero abundance. Thus, we argue that the protein inference problem can be viewed as a special case of the protein quantification problem: present proteins are those with non-zero abundances. Based on this idea, we use three very simple protein quantification methods to solve the protein inference problem effectively.
Results: The experimental results on six datasets show that these three methods are competitive with previous protein inference algorithms. This demonstrates that it is plausible to treat the protein inference problem as a special case of protein quantification, which opens the door to devising more effective protein inference algorithms from a quantification perspective.
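The idea of inference by quantification can be sketched as follows. The abstract does not name the three quantification methods, so this example uses one illustrative choice: spectral counting in which counts of shared peptides are iteratively redistributed in proportion to the current protein abundances. The peptide and protein names, the counts and the presence threshold are all hypothetical.

```python
def quantify_by_spectral_counts(peptide_counts, peptide_to_proteins, iterations=50):
    """Distribute each peptide's spectral count among its parent proteins in
    proportion to their current abundance estimates, and iterate."""
    proteins = sorted({p for parents in peptide_to_proteins.values() for p in parents})
    abundance = {p: 1.0 for p in proteins}
    for _ in range(iterations):
        updated = {p: 0.0 for p in proteins}
        for peptide, count in peptide_counts.items():
            parents = peptide_to_proteins[peptide]
            total = sum(abundance[p] for p in parents)
            if total == 0.0:
                continue
            for p in parents:
                updated[p] += count * abundance[p] / total
        abundance = updated
    return abundance

def infer_present(abundance, threshold=0.5):
    """Report a protein as present when its estimated abundance is effectively
    non-zero (the threshold, in spectral counts, is a hypothetical cutoff)."""
    return {p for p, a in abundance.items() if a > threshold}

# Hypothetical data: P2 is supported only by a peptide it shares with P1.
peptide_counts = {"pepA": 10, "pepB": 6, "pepC": 4}
peptide_to_proteins = {"pepA": ["P1"], "pepB": ["P1", "P2"], "pepC": ["P3"]}

abundance = quantify_by_spectral_counts(peptide_counts, peptide_to_proteins)
present = infer_present(abundance)
print(abundance, present)
```

In this toy example the abundance of P2 shrinks toward zero because its only evidence is explained by P1, so the quantification step doubles as the inference step.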
Chemical Proteomics-based Approach for Drug Target Discovery in Living Systems
High-throughput drug discovery methods typically focus on protein targets that are screened in vitro against existing compounds for high specificity and affinity. This strategy, however, can result in unexpected or undetected off-target effects, leading to high attrition rates in the later stages of drug development. Ideally, unbiased identification of proteins and associated complexes that bind to a drug or drug candidate under physiological conditions would provide a direct evaluation and would therefore be more appealing, offering valuable insight into target cellular functions. Typically, bead-based affinity chromatography can capture potential protein targets only in vitro, not in living systems. It is therefore highly desirable to establish a general in situ approach to probe intracellular protein targets. Here we present a soluble nanopolymer-based approach to probe drug targets in vitro and in living cells. The new strategy highlights chemical and technological approaches that seek to increase the quality of information obtained from high-throughput experiments. We anticipate broad applications of this novel strategy in many important biological systems.
Predicting Seasonal Influenza Vaccine Strains
Seasonal influenza prevention and control rely largely on the availability of effective vaccines. However, timely and accurate recommendation of vaccine strains is quite challenging, as evidenced by frequent antigenic mismatches between the recommended vaccine strains and circulating strains. Recently, large-scale sequencing of influenza viruses has become routine in influenza surveillance. Therefore, the development of sequence-based computational approaches to modeling the antigenic evolution of the influenza virus holds great promise for more effective vaccine strain recommendation. We have developed several approaches that can effectively model influenza antigenic patterns from viral sequences. We have also demonstrated that coupling computational modeling with large-scale HA sequencing in influenza surveillance can lead to an effective vaccine recommendation strategy.
CAPER: a Chromosome-Assembled human Proteome browsER
Proteomic datasets are precious resources for the complete annotation of the human genome. High-throughput mass spectrometry experiments have generated a large number of proteomic datasets. Visualizing these proteomic data on a chromosome basis can help us effectively integrate, organize and utilize them. Here we develop a web-based, user-friendly Chromosome-Assembled human Proteome browsER (CAPER) to visualize human proteomic datasets together with related annotation information. To display proteomic datasets and related annotations as comprehensively as possible, we use two visualization strategies: a track view, mainly used to show sequence and site information and the correspondence between proteome, transcriptome, genome and chromosome, and a heatmap view, mainly used to show qualitative and quantitative functional annotation information. CAPER supports data browsing at multiple scales through Google Maps-like smooth navigation, zooming and positioning with chromosomes as the reference coordinate, and users can switch between the track view and the heatmap view, providing a high-quality interactive experience. CAPER will contribute to the complete annotation and functional interpretation of the human genome by proteomic approaches, benefiting the Chromosome-Centric Human Proteome Project, the Human Proteome Project, and research on human physiology and pathology. CAPER can be accessed at http://www.bprc.ac.cn/CAPE.
CanProVar 2.0: Updated Database of Human Cancer Proteome Variation
Identification and annotation of mutated genes or proteins involved in oncogenesis and tumor progression are crucial for both cancer biology and clinical applications. We developed a human Cancer Proteome Variation Database (CanProVar) in a previous study. Because international collaborative projects that scan for variation in cancer genomes, such as TCGA, are rapidly releasing large amounts of data, we updated and improved the database in CanProVar 2.0, which is designed to store and display single amino acid alterations in human cancers by integrating information on protein sequence variations from various public resources. The new version of CanProVar contains a total of 70,298 cancer-related variations (crVARs) in 23,816 proteins from a variety of data sources and 825,106 non-cancer-specific variations (ncsVARs) in 75,137 proteins derived from the dbSNP database. Besides the residue alterations in the protein sequence, the database also provides cancer sample, protein differential expression, site conservation, protein-protein interaction and other functional information. Users may search any reported crVARs and ncsVARs by protein/gene ID or name. Moreover, CanProVar 2.0 can be queried by cancer type, chromosome location, or pathway.
With the CanProVar database, mutated peptides in cancer samples can be identified by mass spectrometry-based proteomics. A searchable protein database containing crVARs and ncsVARs is also available for download.
The Transition from Discovery to Hypothesis-Driven Science: The Role of Spectral Libraries in Proteomics
In the past decade, research in proteomics technology has been dominated by the shotgun approach using the LC-MS/MS platform. Breathtaking progress has been made in the high-throughput identification of proteins in complex samples; nowadays one can routinely identify hundreds of proteins down to femtomolar concentrations in a single hour-long MS run. This type of discovery-oriented experiment is reminiscent of genome sequencing, in which the goal is to create a complete parts list of a living system. Parallel to the historical development in genomics, now that proteomes of interest have been mapped out, a transition from a discovery phase to a hypothesis-driven phase has gradually taken hold in proteomics in the past few years. There is an increasing emphasis on MS-based, high-throughput quantitative assays of proteins that can be used to answer biological questions. The main informatic challenge of this transition involves collecting, cataloguing, combining and condensing the huge amount of shotgun proteomics data gathered all over the world in all shapes and forms, making them easy to retrieve and query for meaningful reuse. This large collection of data can be captured in spectral libraries, which play a central enabling role in reducing the complexity of proteomics data analysis by spectral matching and in the development of assays based on selected reaction monitoring (SRM). This talk will first discuss the technical details of compiling and searching proteomics spectral libraries, then shift gears to discuss their potential applications in biological and medical research.
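Spectral matching against a library can be sketched in a few lines. The example below is a generic illustration rather than the method of any particular library search tool: it bins peaks to unit m/z, applies a square-root intensity scaling (a common but assumed choice), and scores candidates with a normalized dot product. The library entries, peptide names, peak lists and tolerances are hypothetical.

```python
import math

def bin_spectrum(peaks, bin_width=1.0):
    """Convert (m/z, intensity) pairs into a sparse vector of binned,
    square-root-scaled intensities."""
    binned = {}
    for mz, intensity in peaks:
        index = int(mz / bin_width)
        binned[index] = binned.get(index, 0.0) + math.sqrt(intensity)
    return binned

def cosine_similarity(a, b):
    """Normalized dot product between two sparse vectors."""
    dot = sum(value * b.get(index, 0.0) for index, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def search_library(query_peaks, query_precursor_mz, library, precursor_tol=0.5):
    """Return (peptide, score) of the best-scoring library entry whose
    precursor m/z lies within the tolerance, or None if there is none."""
    query_vector = bin_spectrum(query_peaks)
    best = None
    for entry in library:
        if abs(entry["precursor_mz"] - query_precursor_mz) > precursor_tol:
            continue
        score = cosine_similarity(query_vector, bin_spectrum(entry["peaks"]))
        if best is None or score > best[1]:
            best = (entry["peptide"], score)
    return best

# Hypothetical library and query spectrum.
library = [
    {"peptide": "PEPTIDEK", "precursor_mz": 500.30,
     "peaks": [(175.1, 100), (276.2, 400), (389.3, 900), (502.3, 225)]},
    {"peptide": "PEPTLDEK", "precursor_mz": 500.40,
     "peaks": [(147.1, 900), (262.1, 100), (375.2, 400), (488.3, 100)]},
    {"peptide": "SAMPLEK", "precursor_mz": 620.80,
     "peaks": [(175.1, 100), (276.2, 400)]},
]
query = [(175.1, 81), (276.2, 441), (389.3, 841), (502.3, 225), (300.0, 16)]

best_match = search_library(query, 500.35, library)
print(best_match)
```

Restricting candidates by precursor m/z before scoring is what keeps the search space small compared with searching all theoretical peptides.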
Beyond Database Search: Identifying Mutations, PTMs, and Novel Peptides
The traditional database search tools, typified by Mascot, have been the primary method for peptide identification with tandem mass spectrometry, and have made a dramatic contribution to the emergence of proteomics as a scientific field. Recent developments in peptide identification have focused on two aspects: (1) making the database search more sensitive and more accurate; and (2) identifying peptides that are not included in the database. The latter include peptides with mutations and PTMs, as well as novel peptides. In this talk, some of the latest developments in these two directions are reviewed. A workflow for identifying both database peptides and non-database peptides from the same mass spectrometry dataset is introduced.
Nematode Sperm Secrete Serpin to Coordinate both Sperm Motility and Sperm Competition
Sperm competition, which was first recognized in the 1970s, is a topic of broad interest and intense activity and has been widely recognized as one of the most potent driving forces in the evolution of physiological, morphological, and behavioral adaptations. However, attention to the mechanism of sperm competition has focused primarily on physical traits of sperm, such as size, quality, morphology, speed, and the number of sperm inseminated, and on the seminal fluid produced by several accessory glands in the male body. Completion of secretory events in sperm and the subsequent release of sperm components are required for reproductive success in most animal species; however, whether these released components are involved in sperm competition remains unknown. Here we show that sperm of the nematode Ascaris suum can actively participate in sperm competition by secreting a protease inhibitor into the seminal fluid. Specifically, using de novo sequencing, we identified a serine protease inhibitor (As_SRP-1, a member of the serpin superfamily) that is secreted by spermatids during sperm activation and has two well-coordinated functions. On the one hand, As_SRP-1 functions in cis to support Major Sperm Protein (MSP)-based cytoskeletal assembly in the spermatid that releases it, thereby facilitating sperm motility and enhancing the competitiveness of the resulting spermatozoon. On the other hand, As_SRP-1 released from an activated sperm inhibits, in trans, the activation of surrounding spermatids by blocking, through a suicide-substrate mechanism, As_TRY-5 (also identified by de novo sequencing), a trypsin-like serine protease that is secreted by the vas deferens and is necessary for spermiogenesis.
Our data provide the first evidence that, besides components secreted from accessory glands in the male body, sperm also contribute a protein (As_SRP-1) to the seminal fluid, and that this protein can alter the immediate environment around a sperm to enhance its competitiveness and outcompete its rivals. These findings suggest that sperm play an active role in sperm competition and that spermiogenesis and sperm competition are mechanistically connected in nematodes. Given that the exocytosis of essential components in sperm vesicle(s) is necessary to create fertilization-competent sperm in many animals, sperm-secreted components might mediate sperm competition more widely among animals than was previously appreciated.
Comparative Analysis of Different Label-Free Mass Spectrometry Based Protein Abundance Estimates and Their Correlation with RNA-Seq Gene Expression Data
An increasing number of studies involve integrative analysis of gene and protein expression data taking advantage of new technologies such as next-generation transcriptome sequencing (RNA-Seq) and highly sensitive mass spectrometry (MS) instrumentation. Thus, it becomes interesting to revisit the correlative analysis of gene and protein expression data using more recently generated data sets. Furthermore, within the proteomics community there is a substantial interest in comparing the performance of different label-free quantitative proteomic strategies. Gene expression data can be used as an indirect benchmark for such protein-level comparisons. In this work we use publicly available mouse data to perform a joint analysis of genomic and proteomic data obtained on the same organism. First, we perform a comparative analysis of different label-free protein quantification methods (intensity based and spectral count based and using various associated data normalization steps) using several software tools on the proteomic side. Similarly, we perform correlative analysis of gene expression data derived using microarray and RNA-Seq methods on the genomic side. We also investigate the correlation between gene and protein expression data, and various factors affecting the accuracy of quantitation at both levels. It is observed that spectral count based protein abundance metrics, which are easy to extract from any published data, are comparable to intensity based measures with respect to correlation with gene expression data. The results of this work should be useful for designing robust computational pipelines for extraction and joint analysis of gene and protein expression data in the context of integrative studies.
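A correlative analysis of this kind can be sketched as follows. The abstract does not specify the abundance metric or the correlation measure, so the example makes two assumptions: spectral counts are normalized by protein length (the NSAF measure), and agreement with RNA-Seq is summarized with the Spearman rank correlation. Gene names and all numbers are hypothetical.

```python
import math

def nsaf(spectral_counts, lengths):
    """Normalized spectral abundance factor: (count / length), rescaled to sum to 1."""
    saf = {p: spectral_counts[p] / lengths[p] for p in spectral_counts}
    total = sum(saf.values())
    return {p: value / total for p, value in saf.items()}

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman correlation computed as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical spectral counts, protein lengths and RNA-Seq expression values.
genes = ["A", "B", "C", "D", "E"]
counts = {"A": 50, "B": 20, "C": 5, "D": 80, "E": 10}
lengths = {"A": 500, "B": 400, "C": 250, "D": 400, "E": 1000}
rna_seq = {"A": 120.0, "B": 15.0, "C": 30.0, "D": 300.0, "E": 2.0}

protein_abundance = nsaf(counts, lengths)
rho = spearman([protein_abundance[g] for g in genes], [rna_seq[g] for g in genes])
print(f"Spearman correlation: {rho:.2f}")
```

A rank-based measure is used here because protein and transcript abundances span several orders of magnitude; a Pearson correlation on log-transformed values would be a reasonable alternative.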
Identification of three new protein post-translational lysine modifications
Protein post-translational modifications (PTMs) at the lysine residue, such as lysine methylation, acetylation, and ubiquitination, are diverse, abundant, and dynamic. They play a key role in the regulation of diverse cellular physiology. By employing mass spectrometry-based proteomics technologies, we identified and verified three new forms of protein lysine modification: lysine succinylation (Ksucc), lysine malonylation (Kmal) and lysine crotonylation (Kcr). The peptide candidates bearing these new lysine modifications were initially identified by mass spectrometry and protein sequence alignment analyses. These new PTM candidates were then comprehensively validated by MS/MS and HPLC co-elution of their synthetic counterparts, Western blot analysis, and in vivo isotopic labeling. We further show that Ksucc, Kmal and Kcr are evolutionarily conserved PTMs and respond to different physiological conditions. In addition, we demonstrate that Sirt5, a member of the class III lysine deacetylases, can catalyze lysine demalonylation and lysine desuccinylation reactions both in vitro and in vivo. This result suggests the possibility of non-deacetylation activities in other class III lysine deacetylases, especially those without obvious acetylated protein substrates. We further reveal Ksucc, Kmal and Kcr as novel post-translational modifications on histones. The unique structure and genomic localization of Kcr suggest that it is mechanistically and functionally different from histone lysine acetylation. Specifically, in both human somatic and mouse male germ cell genomes, histone Kcr marks either active promoters or potential enhancers. In male germinal cells immediately following meiosis, Kcr is enriched on sex chromosomes and specifically marks testis-specific genes, including a significant proportion of X-linked genes that escape sex chromosome inactivation in haploid cells.
Taken together, our results therefore suggest that lysine succinylation, malonylation and crotonylation are likely to play important roles in cellular functions.
Eat Raw & Fresh: Introducing isotopic Mass-to-charge ratio and Envelope Fingerprinting (iMEF) and ProteinGoggle for Protein Database Search
A novel biomolecule database search algorithm, isotopic Mass-to-charge ratio (m/z) and Envelope Fingerprinting (iMEF), together with a new set of related metrics, has been created and implemented in the search engine ProteinGoggle for protein database search. iMEF is the combination of isotopic Mass-to-charge ratio (m/z) Fingerprinting (iMF) and isotopic Envelope Fingerprinting (iEF). The "isotopic mass-to-charge ratio (m/z)" here is specifically the m/z of the most abundant isotopic peak within the isotopic envelope of a parent or fragment ion, and iMF is used to "fish" parent ion or fragment ion candidates out of the database. The "isotopic envelope" consists of the m/z and relative abundance of all isotopic peaks of a parent or fragment ion, and iEF is used to confirm the identity of the parent or fragment ion. Because it is the isotopic envelopes, not the masses (monoisotopic or average), of both parent and fragment ions that are directly measured in mass spectrometry, the "deisotoping" step adopted in other search engines is bypassed. The working principles of iMEF and ProteinGoggle for protein database search and the definitions of the related metrics are illustrated in this talk with the dissociation of ubiquitin in an Orbitrap XL mass spectrometer.
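The two-stage logic described above (fish candidates by the most abundant isotopic peak, then confirm with the whole envelope) can be sketched roughly as follows. This is only an illustration of the idea, not ProteinGoggle's implementation: the abstract does not define the actual metrics, so the tolerances, the matching rule, the precomputed theoretical envelopes and all numbers are hypothetical.

```python
def ppm_error(observed_mz, theoretical_mz):
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6

def most_abundant_mz(envelope):
    """m/z of the most abundant peak in a list of (m/z, relative abundance)."""
    return max(envelope, key=lambda peak: peak[1])[0]

def imf_candidates(observed_peaks, database, tol_ppm=10.0):
    """iMF-like step: keep database ions whose most abundant theoretical
    isotopic peak matches the most abundant observed peak."""
    top = most_abundant_mz(observed_peaks)
    return [entry for entry in database
            if abs(ppm_error(top, most_abundant_mz(entry["envelope"]))) <= tol_ppm]

def ief_confirm(observed_peaks, envelope, tol_ppm=10.0,
                abundance_tol=0.3, min_fraction=0.8):
    """iEF-like step: require most theoretical peaks to be observed with a
    matching m/z and a relative abundance within a fractional tolerance."""
    matched = 0
    for theo_mz, theo_rel in envelope:
        for obs_mz, obs_rel in observed_peaks:
            if (abs(ppm_error(obs_mz, theo_mz)) <= tol_ppm
                    and abs(obs_rel - theo_rel) <= abundance_tol * theo_rel):
                matched += 1
                break
    return matched / len(envelope) >= min_fraction

# Hypothetical theoretical envelopes (abundances relative to the top peak).
database = [
    {"name": "ion_A", "envelope": [(1000.500, 0.55), (1001.003, 1.00),
                                   (1001.505, 0.80), (1002.006, 0.45)]},
    {"name": "ion_B", "envelope": [(1000.676, 0.30), (1001.010, 1.00),
                                   (1001.344, 0.90), (1001.678, 0.50)]},
    {"name": "ion_C", "envelope": [(1200.100, 0.60), (1200.600, 1.00)]},
]
# Hypothetical observed isotopic envelope taken directly from a raw spectrum.
observed = [(1000.501, 0.50), (1001.004, 1.00), (1001.506, 0.85), (1002.008, 0.40)]

candidates = imf_candidates(observed, database)
confirmed = [entry["name"] for entry in candidates
             if ief_confirm(observed, entry["envelope"])]
print([entry["name"] for entry in candidates], confirmed)
```

Note that no monoisotopic mass is ever computed from the observed peaks, which mirrors the point that the deisotoping step is bypassed.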
A systemic proteomic analysis of maize seedling during the process of de-etiolation
Abstract: To better understand light regulation of the development of the C4 plant maize, we investigated dynamic proteomic differences between etiolated seedlings and etiolated seedlings illuminated for 6 h, 12 h, and 24 h. The proteins extracted from these samples were analyzed using a quantitative proteomic approach based on isobaric tags for relative and absolute quantitation (iTRAQ). Among more than 4703 proteins identified, 1712 were significantly altered during etiolated maize seedling greening. Of these 1351 proteins, 108 were identified as membrane proteins that have seldom been identified with proteomic methods, indicating the power of our method for membrane protein identification. The significantly changed proteins are involved in photosynthesis, metabolism, nitrogen assimilation, and the chloroplastic translational machinery. These results thus define the differential regulation of distinct biological systems during greening in maize and demonstrate the usefulness of comprehensive and comparative proteome analysis for the characterization of biological processes in plant cells, particularly in C4 plants.
Proteomic identification of key regulators of heterotrophy in Synechocystis sp. PCC6803
The photosynthetic model organism Synechocystis sp. PCC6803 (hereafter Synechocystis), a cyanobacterium, contains an abundance of functionally important thylakoid membranes and is thus ideal for membrane proteomics studies. Synechocystis can grow photoautotrophically using CO2 as the carbon source, or heterotrophically using exogenous glucose as the sole carbon source. We recently generated a heterotrophy-defective mutant in which the gene encoding a hypothetical peripheral membrane protein, Slr0110, was knocked out by insertional mutation. The slr0110-deficient mutant cultured with glucose had a significantly lower growth rate than the wild type (WT) under the same growth conditions. However, the mutant and WT did not show an obvious difference in growth phenotype under autotrophic growth conditions, suggesting that the photosynthetic machinery is not impaired in the mutant. The mutant and WT cells grown with or without glucose under normal light intensity were fractionated into membrane and soluble fractions, from which 1781 and 1530 proteins were identified, respectively, including 1160 proteins found in both fractions. TMHMM analysis revealed that 448 proteins contain at least one transmembrane (TM) domain, which is more than 50% of the total predicted TM proteins encoded by the Synechocystis genome and 2 times more than the total number of TM proteins identified for this organism in the last decade. Bioinformatic analysis revealed that many metabolism-related pathways, including amino acid biosynthesis and cofactor biosynthesis, but not photosynthesis, were down-regulated in WT cells grown without glucose as well as in slr0110-deficient cells grown with or without glucose, compared with WT cells grown with glucose. Importantly, the proteomics and western blot data both revealed that the 12-TM domain-containing glucose transporter (GlcP), a protein that plays a key role in glucose utilization in Synechocystis, was significantly down-regulated in the mutant.
The GlcP-deficient mutant displayed the same growth phenotype as the slr0110-deficient mutant. Though the mechanism is not clear, these data strongly suggest that Slr0110 controls heterotrophic growth of Synechocystis through direct or indirect regulation of GlcP expression.
Improved proteomic analysis pipeline for LC-ETD-MS/MS
Electron transfer dissociation (ETD) is a useful and complementary activation method for peptide fragmentation in mass spectrometry. However, ETD spectra of 2+ ions typically receive relatively low identification scores. To overcome this challenge, we systematically interrogated, for the first time, the benefits of combining ion charge-enhancing methods (dimethylation, guanidination, m-nitrobenzyl alcohol (m-NBA) addition, or Lys-C digestion) with different search algorithms (Mascot, Sequest, OMSSA, pFind and X!Tandem). A simple sample (BSA) and a complex sample (AMJ2 cell lysate) were selected for benchmark tests. Clearly distinct outcomes were observed with different experimental protocols. In the analysis of the AMJ2 cell lysate, X!Tandem and pFind recovered 92.65% of the identified spectra; m-NBA adduction led to a 5–10% increase in average charge state and the most significant increase in the number of successful identifications, and Lys-C treatment generated peptides carrying mostly triple charges. Based on the complementary identification results, we suggest that a combination of the m-NBA and Lys-C strategies, accompanied by X!Tandem and pFind, can greatly improve ETD identification.
Computational analysis of phosphoproteomic data
Phosphorylation is one of the most essential post-translational modifications of proteins; it regulates a variety of cellular signaling pathways and at least partially determines biological diversity. Recent progress in phosphoproteomics has led to the identification of more than 100,000 phosphorylation sites. However, extracting useful information from this flood of data is still a great challenge. During the past seven years, we have made great efforts in the computational analysis of phosphoproteomic data. We developed the GPS (Group-based Prediction System, http://gps.biocuckoo.org, MCP, 2008, 7:1598-608) algorithm, which can predict kinase-specific phosphorylation sites for 408 human protein kinases (PKs) in a hierarchy. Together with this sequence-based algorithm, we further adopted protein-protein interaction information as a major contextual filter to reduce false-positive hits. With this strategy, we developed iGPS (http://igps.biocuckoo.org) to predict 188,288 site-specific kinase-substrate relations (ssKSRs) between 9,247 targets and 1,079 PKs for 44,290 phosphorylation sites from the phosphoproteomic data, and we modeled the protein phosphorylation networks (PPNs) in five eukaryotic organisms. Based on the results, we observed that eukaryotic phospho-regulation is poorly conserved at the site and substrate levels, but preferentially conserved at the pathway level, e.g., ribosome organization. Our analysis of the DNA damage response (DDR)-associated PPN suggested that repair processes, but not apoptosis, are activated immediately after DNA damage. Furthermore, an apoptosis/anti-apoptosis balance in the human liver PPN is demonstrated. We also predicted phospho-binding proteins of Polo-like kinases (Plks) from the phosphoproteomic data by developing GPS-Polo (http://polo.biocuckoo.org/). Taken together, our efforts provide a powerful toolbox for analyzing phosphoproteomic data.
An Integrated High-throughput Workflow for Identification of Cross-linked Peptides from Complex Samples
Chemical cross-linking of proteins coupled with mass spectrometry analysis (CXMS) can provide valuable information about protein folding and protein-protein interactions. It can help determine the overall architecture of a large protein complex by identifying direct binding partners within the complex and localizing the binding interface. Protein samples need not be highly purified for cross-linking, let alone crystallized as is required in crystallography; thus CXMS has the potential to be used as a routine method in every biology lab.
We developed a CXMS workflow featuring the use of BS3, a commercially available and time-tested cross-linker. We optimized all the components of CXMS, including sample preparation, HPLC, and mass spectrometry (MS) methods. A software program called pLink was designed specifically for CXMS. To extract the fragmentation features of cross-linked peptides to build into the program, we generated a large standard dataset containing 2077 non-redundant cross-link spectra from synthetic peptides of known sequences. pLink can search a large dataset against a large database containing thousands or tens of thousands of proteins (e.g., E. coli and C. elegans protein databases) for cross-link identifications, and at the same time provide reliable FDR estimation. Automated annotation and display of cross-link spectra are also provided by pLink itself and by another program called pLabel, which allows users to label a spectrum.
We demonstrate that our integrated CXMS workflow is a powerful tool for cross-link identification, and it is a substantial and timely contribution to the fast growing proteomics field and to biomedical research in general. The CXMS results provide valuable information about protein folding, protein complex assembly or disassembly, and can be used to identify direct binding partners of a protein of interest.
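The FDR estimation mentioned above is not described in detail in this abstract. As a generic illustration only, the sketch below uses the common target-decoy approach for ordinary peptide-spectrum matches, in which the FDR at a score cutoff is estimated as decoys/targets above that cutoff; it is not pLink's procedure, and cross-link searches generally need a more elaborate estimate. All scores are hypothetical.

```python
def score_threshold_at_fdr(matches, max_fdr=0.05):
    """matches: list of (score, is_decoy). Walk down the score-ranked list and
    return (threshold, targets, decoys) for the largest accepted set whose
    estimated FDR (decoys / targets) does not exceed max_fdr, or None."""
    ranked = sorted(matches, key=lambda match: match[0], reverse=True)
    targets = decoys = 0
    best = None
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets > 0 and decoys / targets <= max_fdr:
            best = (score, targets, decoys)
    return best

# Hypothetical search results: (score, matched a decoy sequence?).
matches = [
    (9.5, False), (9.1, False), (8.7, False), (8.2, False), (7.9, False),
    (7.5, False), (7.0, False), (6.4, False), (6.1, False), (5.8, False),
    (5.5, True), (5.2, False), (4.9, True), (4.4, False), (4.0, True),
]

result = score_threshold_at_fdr(matches, max_fdr=0.05)
print(result)
```

Matches scoring at or above the returned threshold would be reported; everything below it is discarded because the decoy hits indicate too many random matches.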
Detailed Conformational Dynamics of Juxtamembrane Region and Activation Loop in c-Kit Kinase Activating Process
The stem-cell factor receptor (c-Kit) plays critical roles in initiating cell growth and proliferation. Abnormal function of its kinase is thought to be associated with some human cancers. Regulation of the kinase activity is achieved by the phosphorylation of residues Tyr568 and Tyr570 in the juxtamembrane region (JMR) and subsequent conformational changes of the JMR and the activation loop (A-Loop) of c-Kit. However, the detailed conformational changes of the JMR and A-Loop are far from clear, in particular whether these changes are coupled during the kinase activity transition. In this investigation, the whole conformational transition pathway is explored using a series of nanosecond conventional molecular dynamics and targeted molecular dynamics (TMD) simulations [programs used: Insight II (Accelrys Inc.) and NAMD]. The dynamics simulation results show that phosphorylation of residues Tyr568 and Tyr570 in the JMR weakens the interaction between the JMR and the kinase domain, thus initiating the kinase activating process. During the autoinhibitory-to-activated TMD simulation, the JMR first departs from the autoinhibitory binding site, then undergoes a rapid dissociation process and ultimately drifts into the solvent. A large conformational change of the A-Loop and the orientational flexibility of helix αC in the N-lobe of the kinase domain take place after the dissociation of the JMR, which indicates that the conformational transition of the A-Loop is not coupled to the changes of the JMR. Our results might be helpful for rational drug discovery targeting c-Kit and other related kinases.
New data processing methods facilitate the identification of PTM peptides
The traditional database searching strategy always uses the whole proteome database, which is very time-consuming for phosphopeptidome searches because of the huge search space resulting from the high redundancy of the database and the dynamic modifications set during searching. The large search space also increases false positive identifications and decreases identification sensitivity. To tackle this challenge, a focused database searching strategy based on an in-house collected human serum pro-peptidome database was used to search human serum peptidome samples. With this strategy, search time was reduced 14-fold and identification sensitivity was also improved. By combining size-selective Ti(IV)-MCM-41 enrichment, RP-RP off-line multidimensional separation, and complementary CID and ETD fragmentation with the new searching strategy, 175 unique endogenous phosphopeptides and 135 phosphosites were identified from human serum with high reliability. Although glycopeptides can be enriched with various methods and over 1000 glycosylation sites can be identified, the concurrent identification of glycopeptide sequences and glycan structures is very challenging. To tackle this problem, a new strategy was developed. In this strategy, the enriched glycopeptides were divided into two aliquots. One aliquot was first deglycosylated by PNGase F and then subjected to LC-MS/MS analysis with CID, and the other aliquot was directly subjected to LC-MS/MS analysis with HCD. A fully automated data processing method was then developed to integrate the two datasets. This strategy was applied to identify the membrane glycosylation of human embryonic kidney cell lines (HEK293T). In total, 238 N-glycopeptides from 101 different N-glycosites were identified with glycan structure by integrating the MS/MS spectra of native glycopeptides obtained by HCD with the N-glycosite identifications obtained by CID.
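One way the two glycopeptide datasets could be linked is by mass: an intact glycopeptide should weigh the same as a CID-identified deglycosylated peptide, minus the Asn-to-Asp conversion introduced by PNGase F, plus a glycan. The sketch below illustrates only that idea and is not the automated method described here. The monosaccharide residue masses, the deamidation shift and the N-X-S/T sequon (X not proline) are standard values; the glycan list, the peptide sequences and the peptide masses are hypothetical placeholders (the masses are not computed from the sequences).

```python
import re

HEXNAC, HEX = 203.07937, 162.05282   # monosaccharide residue masses (Da)
DEAMIDATION = 0.98402                # Asn -> Asp mass shift left by PNGase F (Da)

# Hypothetical short list of candidate N-glycan compositions.
GLYCANS = {
    "HexNAc2Hex5": 2 * HEXNAC + 5 * HEX,
    "HexNAc2Hex9": 2 * HEXNAC + 9 * HEX,
    "HexNAc4Hex5": 4 * HEXNAC + 5 * HEX,
}

SEQUON = re.compile(r"(?=N[^P][ST])")  # lookahead so overlapping sequons are found

def sequon_positions(peptide):
    """0-based positions of Asn residues in N-X-S/T sequons (X is not Pro)."""
    return [match.start() for match in SEQUON.finditer(peptide)]

def pair_glycopeptides(intact_masses, deglyco_peptides, tol_da=0.02):
    """Pair each intact glycopeptide mass (HCD run) with a deglycosylated
    peptide (CID run) and a glycan whose masses add up within tol_da."""
    pairs = []
    for intact in intact_masses:
        for peptide, deglyco_mass in deglyco_peptides.items():
            native_peptide_mass = deglyco_mass - DEAMIDATION
            for glycan, glycan_mass in GLYCANS.items():
                if abs(intact - (native_peptide_mass + glycan_mass)) <= tol_da:
                    pairs.append((intact, peptide, glycan))
    return pairs

# Placeholder neutral masses (Da) of deglycosylated peptides from the CID run.
deglyco_peptides = {"LNVTEK": 703.3900, "GANASSK": 620.3000}
# Placeholder neutral masses (Da) of intact glycopeptides from the HCD run.
intact_masses = [1918.8300, 2241.9000]

pairs = pair_glycopeptides(intact_masses, deglyco_peptides)
for intact, peptide, glycan in pairs:
    print(intact, peptide, sequon_positions(peptide), glycan)
```

In practice the HCD fragment spectrum would then be used to confirm the glycan assigned by mass, since different glycan structures can share a composition.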
Revealing molecular mechanisms of human disease through 3D interactome network analysis
It is of considerable importance to understand the molecular mechanisms of human disease and to determine the genes responsible. Recently, researchers have begun to use complex cellular networks to address these problems. However, most analyses model proteins as graph-theoretical nodes, ignoring the structural details of individual proteins and the spatial constraints of their interactions. Here, we developed a novel algorithm to investigate, on a genomic scale, the underlying molecular mechanisms of human genetic disease by integrating 3D atomic-level protein structural genomics information with high-quality large-scale protein interaction data. The resulting 3D network consists of 4,222 high-quality binary protein-protein interactions with their atomic-resolution interfaces. We systematically examine relationships between 3,949 genes, 62,663 mutations and 3,453 associated disorders within the framework of this 3D network. We find that in-frame mutations (missense mutations and in-frame insertions and deletions) are enriched on the interaction interfaces of proteins associated with the corresponding disorders, and that the disease specificity of different mutations of the same gene can be explained by their location within an interface. We also predict 292 candidate genes for 694 unknown disease-to-gene associations with proposed molecular mechanism hypotheses. This work indicates that knowledge of how in-frame disease mutations alter specific interactions is critical to understanding pathogenesis. Structurally resolved interaction networks should be valuable tools for interpreting the wealth of data being generated by large-scale structural genomics and disease association studies.
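One simple way to quantify a statement such as "mutations are enriched on interaction interfaces" is to compare the observed fraction of mutations at interface residues with the fraction expected if mutations fell uniformly along the protein. The sketch below does this with an enrichment ratio and a one-sided exact binomial test. The abstract does not state which statistical test was used, so the choice of test, the function names, and all counts here are assumptions for illustration only.

```python
from math import comb


def enrichment_ratio(mut_on_interface, mut_total, interface_residues, total_residues):
    """Observed interface mutation fraction divided by the expected fraction."""
    observed = mut_on_interface / mut_total
    expected = interface_residues / total_residues
    return observed / expected


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), computed by direct summation."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Made-up counts: 30 of 100 mutations fall on interface residues,
# while interface residues make up 200 of 1000 residues overall.
ratio = enrichment_ratio(30, 100, 200, 1000)
p_value = binomial_upper_tail(30, 100, 200 / 1000)
print(f"enrichment = {ratio:.2f}, one-sided p = {p_value:.4f}")
```

A ratio above 1 together with a small p-value would indicate that more mutations sit on interfaces than a uniform placement would predict.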
FANSe2 mapping algorithm strengthens next-generation sequencing and facilitates chromosome-centric proteomics with translatome sequencing
A prerequisite for applying next-generation sequencing is to map the sequencing reads to reference sequences both accurately and robustly. However, most of the widely used mapping algorithms lose mappable reads disproportionately and thus create significant bias in identification and quantification. We addressed this need by developing FANSe2, a notably fast mapping algorithm (a billion reads per hour on ordinary office computers) with robust, very high accuracy (theoretical miss rate < 10^-5). These features allow maximal use of the data obtained from deep sequencing. We then applied the FANSe2 algorithm to translatome sequencing data, revealing the quantitative correlation between mature mRNA and translating mRNA. Moreover, the high sensitivity and robustness of the algorithm and the translatome sequencing strategy fundamentally increase the gene coverage and completeness of the proteome atlas. This greatly facilitates chromosome-centric proteome studies.
Chemical Bioinformatics: From Evolution to Drug Discovery
In this omics era, informatics pervades all of biology. Since life consists not only of macromolecules (e.g., DNA, RNA and proteins) but also of small molecules (e.g., metabolites and metals), chemoinformatics, in addition to bioinformatics, should find important use in the current life sciences. However, the value of chemoinformatics has been largely overlooked by biologists. Over the past decade, we have been engaged in data mining through the combined use of chemoinformatics and bioinformatics, which we term chemical bioinformatics. Following the line of thought "starting from small molecules and beginning with fundamental questions", we have carried out fairly systematic research in this new discipline. The major findings include: i) Based on the power-law distribution of small-molecule ligands in the protein universe, we proposed a model for protein origin and evolution in which the birth of new protein architectures was facilitated (induced and/or selected) by the binding of small-molecule ligands (including metals) (Ji et al. 2007). ii) By using protein structures as molecular fossils and employing the geological age information of small molecules, we established molecular clocks of protein folds and superfamilies, which were used to trace some critical events in the evolutionary history of life and earth, such as the evolution of amino acid biosynthesis, the rise of oxygen, and the origin of aerobic metabolism (Wang et al. 2011; Kim et al. 2012). iii) By means of chemoinformatics, we revealed novel molecular links between the rise of oxygen and biological evolution, and found a chemical basis for the organization of the metabolic network (Zhu et al. 2011; Jiang et al. 2012). iv) Through bioinformatic analyses, we revealed evolutionary features of drug targets, which allowed us to evaluate the druggability of 500+ targets-in-research (Wang et al. 2012). Furthermore, we proposed the concept of "evolution-inspired drug discovery" to help find anticancer, antioxidant and antibacterial drugs (Zhang et al. 2010).
Sharing ppb-level accuracy in common LC-MS analysis
Proteomics research is benefiting from the high precision and high resolution brought by advances in instruments and technology. However, simply improving the hardware performance of mass spectrometers is far from enough to acquire accurate data. Optimized experimental design and routine maintenance and calibration of equipment are more important. In this work, we demonstrate that post-calibration is also an important way to improve data accuracy. Through an in-depth investigation of the recalibration problem in large-scale proteomics datasets, we implemented an online machine learning model to improve the accuracy of mass spectrometry (MS) data in common experiments. By introducing many new parameters and performing feature selection automatically, this new model can calibrate the error of most observed mass-to-charge ratios (m/z) to the ppb level and can estimate a specific m/z error tolerance for each parent ion. Based on this new model, we updated our previously developed tool FTDR (Fourier-transform data recalibration) to version 2.0, which can perform automatic calibration for Thermo LTQ-FT and LTQ-Orbitrap data generated in common experiments. The results of FTDR 2.0 can be used directly in common database search and quantification processing.
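To make the idea of post-calibration concrete, the sketch below computes the relative m/z error in parts per million and removes a single global systematic shift estimated from confidently identified ions. This is a deliberately minimal stand-in: it is not the online machine learning model used in FTDR 2.0, which the abstract describes as using many parameters with automatic feature selection. The function names and the example m/z values are hypothetical.

```python
from statistics import median


def ppm_error(observed_mz: float, theoretical_mz: float) -> float:
    """Relative m/z error in parts per million (1 ppm = 1000 ppb)."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def fit_global_shift(pairs) -> float:
    """Estimate one systematic shift (in ppm) as the median error over
    (observed, theoretical) m/z pairs from confident identifications."""
    return median(ppm_error(obs, theo) for obs, theo in pairs)


def recalibrate(observed_mz: float, shift_ppm: float) -> float:
    """Remove the estimated systematic shift from an observed m/z."""
    return observed_mz / (1 + shift_ppm * 1e-6)


# Hypothetical identifications that all carry a +3 ppm systematic error.
pairs = [(t * (1 + 3e-6), t) for t in (400.0, 600.0, 800.0)]
shift = fit_global_shift(pairs)
residuals = [ppm_error(recalibrate(obs, shift), theo) for obs, theo in pairs]
print(f"estimated shift: {shift:.3f} ppm")
print("residual errors (ppm):", residuals)
```

In real data the systematic error typically varies with factors such as m/z, retention time, and signal intensity, which is the motivation for a richer model than a single global shift.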
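The spatial structure of a protein is the basis of its biological function, so understanding protein structure is a major part of protein research. With the development of gene sequencing technology, more and more gene and protein sequences have been determined, but the experimental techniques commonly used to solve protein structures, such as X-ray crystallography and nuclear magnetic resonance (NMR), are relatively expensive, and a large number of proteins are difficult to crystallize and therefore difficult to solve experimentally. As a result, the number of proteins with known structures is far smaller than the number with known sequences, and the gap keeps widening, which has spurred research on predicting structures by computational methods. Anfinsen (1973) proposed a hypothesis (also known as the thermodynamic hypothesis, later confirmed by a large number of in vitro protein folding experiments) that the spatial structure of a protein is uniquely determined by its amino acid sequence. This hypothesis laid the theoretical foundation for predicting protein spatial structure by computation, and protein structure prediction has gradually become one of the most challenging research directions in computational biology, attracting experts from biology, physics, chemistry, computer science and other fields to study the problem in depth. In 1994 the American scientist John Moult proposed holding a worldwide protein structure prediction competition, CASP (Critical Assessment of Techniques for Protein Structure Prediction, http://predictioncenter.org/), to assess objectively and effectively the state of protein structure prediction technology. The CASP results show that over the past two decades protein structure prediction methods have made major progress in remote homology detection, sequence-to-structure alignment, 3D structure modeling, model quality assessment, and model refinement. However, current computational methods still cannot fully meet practical needs; for example, the accuracy of structure prediction for remotely homologous sequences is far below that of X-ray or high-resolution NMR structures. This report focuses on the basic concepts of protein structure prediction, the important computational problems, and the history and current state of research on these problems, and also briefly introduces the important databases and commonly used software tools in the field.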
Data integration across omics landscapes
Advancements in high-throughput omics technologies have provided an unprecedented opportunity for biological and biomedical studies. At the same time, advanced technologies have led to an increasing gap between data generation and investigators' ability to interpret the data. To enable biologically meaningful and efficient analysis of these high-throughput data, we develop data integration approaches around basic molecular and systems biology principles. Specifically, we exploit the information flow from DNA to RNA to protein as defined by the central dogma of molecular biology and the interactions between molecules as defined by network biology to facilitate data integration. In this talk, I will give examples of the application of these approaches to protein identification in shotgun proteomics studies and to multidimensional omics data integration in cancer studies.