Gerstein Lab FAQs: 2008

Monday, November 3, 2008

I have studied your article "The role of disorder in interaction networks: a structural analysis". I am trying to get a list of hub (party and date) and non-hub proteins, so I am trying to get the datasets you used; however, I can't find them. Would it be possible for you to guide me to the datasets?

Thank you for your interest in our paper. To answer your question, we employed the datasets provided as supplemental material by Han et al., Bertin et al. and Batada et al. The references for those papers are:

Han JDJ et al. (2004) Nature 430:88-93

Bertin N et al. (2007) PLoS Biol 5(6):e153

Batada NN et al. (2007) PLoS Biol 5(6):e154

From Han et al., Supplementary table S1 includes date and party hub information. In our paper, this is referred to as FYI 2004. From Bertin et al., we used the filtered-HC protein-protein interaction dataset, provided as Supplementary table S1 and called FYI 2007 in our paper.

From the Batada et al. paper, we used the High-Confidence Interaction Dataset provided as Dataset S1 in the Supplementary Information. As described in our paper, hubs are then defined as ORFs with more than 10 interacting partners. Finally, to determine whether a hub is a party or date hub, we computed the co-expression correlation with its partners. Party hubs have a correlation higher than 0.25 with their interacting partners.

In order to perform the correlation analysis, we employed the compendium dataset by Hughes et al. [Cell (2000) 102:109-126].
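The hub definition described in this answer can be sketched in Python as follows. This is only an illustration, not the code used for the paper: the function names, the input layout (a list of ORF pairs and a dictionary of expression profiles) and the choice to average the Pearson correlation over all partners are assumptions. Only the two thresholds (more than 10 partners, correlation higher than 0.25) come from the answer.

```python
from collections import defaultdict
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # flat profile: treat as uncorrelated
    return cov / (var_x * var_y) ** 0.5


def classify_hubs(interactions, expression, min_partners=10, party_cutoff=0.25):
    """Label each ORF as 'non-hub', 'party hub' or 'date hub'.

    interactions: iterable of (orf_a, orf_b) pairs (hypothetical layout)
    expression:   dict mapping ORF -> list of expression values
    """
    partners = defaultdict(set)
    for a, b in interactions:
        if a != b:  # ignore self-interactions
            partners[a].add(b)
            partners[b].add(a)

    labels = {}
    for orf, nbrs in partners.items():
        # Hubs have MORE than min_partners interacting partners.
        if len(nbrs) <= min_partners:
            labels[orf] = "non-hub"
            continue
        corrs = [
            pearson(expression[orf], expression[n])
            for n in nbrs
            if orf in expression and n in expression
        ]
        if not corrs:
            labels[orf] = "hub (no expression data)"
        elif mean(corrs) > party_cutoff:  # assumption: average over partners
            labels[orf] = "party hub"
        else:
            labels[orf] = "date hub"
    return labels
```

With the interaction pairs from one of the supplementary tables and the expression profiles from the compendium loaded into these two structures, `classify_hubs(pairs, profiles)` returns one label per ORF.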

Monday, August 11, 2008

Where can I get access to the data driving PubNet?

PubNet is based exclusively on PubMed, which you can access via query or bulk download from the NCBI.
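As a rough illustration of the "query" route, the snippet below builds a search URL for the NCBI E-utilities ESearch endpoint. This is an assumption about how one might query PubMed programmatically today, not a description of how PubNet itself retrieves data; the helper name and the example search term are hypothetical. The URL is only constructed here, not fetched.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint (not mentioned in the original answer).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_search_url(term, retmax=20):
    """Return an ESearch URL that lists PubMed IDs matching `term`."""
    params = {
        "db": "pubmed",      # search the PubMed database
        "term": term,        # query string, same syntax as the PubMed web site
        "retmax": retmax,    # maximum number of IDs to return
        "retmode": "json",   # ask for a JSON response
    }
    return EUTILS_ESEARCH + "?" + urlencode(params)


print(pubmed_search_url("Gerstein M[Author]", retmax=5))
```

The resulting URL can be opened in a browser or fetched with any HTTP client. For large-scale use, the bulk download mentioned above is the better fit.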

Thursday, August 7, 2008

Can you provide an example script for the integrated system for studying residue coevolution in proteins?

Regarding the integrated system for studying residue coevolution in proteins, can you provide an example script that takes one input file on the command line and writes the result to an output file? Also, can you send me the Matlab code for the SCA implementation?

The system was implemented in Java. It requires the Java virtual machine, plus some additional packages in order to run locally. If you are interested in installing it on your own machine, I can help sort out the steps. The Matlab code for SCA is unfortunately licensed and we cannot redistribute it. You may contact Rama Ranganathan for it. I think they are also releasing a newer version of SCA.

What is the Protein in the Logo at the Top of the Molmovdb Page?

I've been using your molecular movement database to find candidate proteins for a functionalization experiment that my group at Oregon State University is working on. The logo at the very top of the browsing page caught my attention (http://www.molmovdb.org/images/ProtMotDB.lrg-logo.gif). Would you happen to know what protein this is?

Lactoferrin. It's also part of Figure 4 in the original database publication from a very long time ago:

http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pubmed&pubmedid=9722650

This has references to the specific structure publications.

Wednesday, August 6, 2008

Where is the Source Code for the Dysregulated Pathways cited in "Integrative microarray analysis of pathways dysregulated in metastatic prostate..."?

I came across your paper, "Integrative Microarray Analysis of Pathways Dysregulated in Metastatic Prostate Cancer". I am having trouble finding an appropriate algorithm to find dysregulated pathways from large clinical expression arrays and was impressed with the method you implemented. I was wondering if you happened to have the source code you used to implement IMAP and if it was possible for me to take a look at it. If so, that would be of enormous assistance to me.

Unfortunately there is no code floating around to perform the analyses you want to do. I did the calculations for the paper you cited but I did them three years ago and the computations were done mostly interactively in an R terminal.

Tuesday, August 5, 2008

How does one get the corresponding frequencies for the normal modes for which you have generated the movies?

Regarding http://molmovdb.org/nma/, how does one get the corresponding frequencies for the normal modes for which you have generated the movies?

No normal modes are extracted in the process of making the movies.

Sunday, August 3, 2008

What are the datasets for Hinge Atlas and Hinge Gold?

I found your lab's database at http://www.molmovdb.org/cgi-bin/browse.cgi very useful; it includes many data sets of conformations. My current program can read the PDB files.

Could you send me the database?

You might find the Hinge Atlas and Hinge Atlas Gold datasets the best curated and easiest to download. You can find the download links and instructions at: http://molmovdb.org/cgi-bin/sets.cgi.

Please cite our Hinge Atlas and/or FlexOracle papers if you use these datasets:

http://papers.gersteinlab.org/e-print/HingeAtlas/preprint.pdf
http://papers.gersteinlab.org/e-print/flexoracle/preprint.pdf

Tuesday, July 15, 2008

Where is the Data Used for the PseudoPipe Paper?

Can you provide me with the data you used for the paper "A computational approach for identifying pseudogenes in the ENCODE region"?

The paper should point you to all the input data, which can be downloaded from the Ensembl website. If you want to know more details about the computational process, you might find them in the following paper too.

http://bioinformatics.oxfordjournals.org/cgi/content/full/22/12/1437

In fact, it might be easy for you to reproduce the computational pipeline described in the above paper.

Monday, June 30, 2008

Is there a program that prints out the coordinates of the grid representing the surface?

I recently came across your programs that calculate solvent accessible surfaces. I am interested in a program that would print out the coordinates of the grid representing that surface. The MSMS program by Sanner does something like that, but it prints the solvent excluded surface. I briefly looked at your programs, but I see that they will only output the surface, not the grid. I wonder if you have or know where to get such a program?

It will output the entire volume grid if you specify EZD format, which can be converted to CCP4 and is viewable in UCSF Chimera and PyMol. I would be happy to modify the program to suit your needs, if possible.

Note that my program deals with volumes, not surfaces. The MSMS program outputs coordinates for triangles representing the surface, whereas my program deals with voxels (3D pixels).

The solvent accessible surface is actually relatively easy to generate mathematically; the solvent excluded surface, on the other hand, is quite tricky. So, if you just want a bunch of points representing the SA surface, I have a Perl script that will do that by creating a bunch of spheres (VDW + probe radius) and then asking whether or not each point is inside another sphere. This works great for SA surfaces, but not SE surfaces.
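The sphere-based approach described above can be sketched in Python as follows. This is not the Perl script mentioned in the answer; it is a minimal illustration in the spirit of the Shrake-Rupley method, and the function names, the golden-spiral point placement, the default probe radius of 1.4 Angstroms and the number of points per atom are all assumptions.

```python
import math


def sphere_points(n):
    """Return n roughly evenly spaced unit vectors (golden-spiral layout)."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for i in range(n):
        z = 1.0 - (2.0 * i + 1.0) / n
        r = math.sqrt(max(0.0, 1.0 - z * z))
        theta = golden_angle * i
        points.append((r * math.cos(theta), r * math.sin(theta), z))
    return points


def sa_surface_points(atoms, probe=1.4, n_per_atom=100):
    """Return points on the solvent accessible surface.

    atoms: list of (x, y, z, vdw_radius) tuples, all in Angstroms.
    A point on an atom's expanded sphere (VDW + probe radius) is kept
    only if it is not inside the expanded sphere of any other atom.
    """
    unit = sphere_points(n_per_atom)
    surface = []
    for i, (xi, yi, zi, ri) in enumerate(atoms):
        big_r = ri + probe
        for ux, uy, uz in unit:
            px, py, pz = xi + big_r * ux, yi + big_r * uy, zi + big_r * uz
            buried = False
            for j, (xj, yj, zj, rj) in enumerate(atoms):
                if j == i:
                    continue
                limit = rj + probe
                d2 = (px - xj) ** 2 + (py - yj) ** 2 + (pz - zj) ** 2
                if d2 < limit * limit:
                    buried = True
                    break
            if not buried:
                surface.append((px, py, pz))
    return surface
```

The inner loop compares every point against every atom, which is fine for small structures. For a full protein a neighbor grid or cell list would be the usual way to speed it up.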

Wednesday, April 23, 2008

Human genome build for hg16 Tiling HMM

I read your 2006 paper that describes a supervised hidden Markov model framework for tiling array data in transcriptional experiments. In order to gain some insight, I downloaded the transcriptional experimental data from the website. But I wonder which human genome version the positions in these data should be mapped to. I would appreciate it if you could tell me which human genome build it is.

hg16 (NCBIv34)

Monday, April 21, 2008

Do you call mutations at key residues of a protein pseudogenes?

If a gene is transcribed and translated, but there are some mutations at key residues of the protein that make the protein non-functional or very unstable (with a very short half-life), do we call this kind of gene a pseudogene? If not, it will be less meaningful for me to classify them, because they are not under the same evolutionary pressure as functional genes. If so, computer algorithms may have difficulty identifying them.

This is a very interesting question. The definition of "gene" and "pseudogene" is extremely fuzzy. Currently, our pipeline will not call this a pseudogene because we primarily look for frame-shifts and nonsense mutations. We can identify processed pseudogenes which don't have frame-shifts or nonsense mutations, specifically pseudogenes of multi-exon genes that will appear as a single-exon retrotransposed gene. We don't have a clear way of differentiating between a functional retrogene and a pseudogene. We simply flag it as a processed pseudogene. If the gene of interest has many exons and this structure is retained in the non-functional entities, we will not call it a pseudogene unless we detect a frame-shift or a nonsense mutation. But we are constantly adding new features to our pipeline and will have a discussion with Prof. Gerstein and the rest of the team about this aspect.
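To make the two criteria concrete, here is a deliberately simplified Python sketch. It is not the lab's pipeline, which works from alignments to the parent protein; it only assumes that a candidate sequence is given in the parent gene's reading frame, starting at the first codon. The function name and return layout are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_disablements(cds):
    """Report simple signs of disablement in a candidate coding sequence.

    cds: nucleotide string, assumed to start in frame at position 0.
    Returns a dict with:
      length_not_multiple_of_3 -- crude hint of a net frame-shift
      premature_stops          -- 0-based offsets of in-frame stop codons
                                  that occur before the last full codon
    """
    seq = cds.upper().replace("U", "T")
    n_codons = len(seq) // 3
    premature_stops = []
    for k in range(n_codons):
        codon = seq[3 * k:3 * k + 3]
        # A stop in the final codon is the normal terminator, not a disablement.
        if codon in STOP_CODONS and k < n_codons - 1:
            premature_stops.append(3 * k)
    return {
        "length_not_multiple_of_3": len(seq) % 3 != 0,
        "premature_stops": premature_stops,
    }
```

A sequence with missense mutations at key residues, as in the question, would pass both checks untouched, which is exactly why such cases are not called pseudogenes by criteria of this kind.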

How good is Pseudogene Identification?

How good is the current algorithm at identifying pseudogenes? For the maize example, how can we know whether thousands of copies of RVP genes on transposons are functional or not?

1. Our pipeline has specific criteria for identifying pseudogenes and the first step involves filtering out exons annotated as protein coding. Therefore, if the underlying genome annotation is incorrect, then we will miss some pseudogenes. The scenario you have described is similar to ribosomal protein pseudogenes where we observe several retrotransposed pseudogenes. In this case, we specifically modified the pipeline to not mask the exons as most of the ribosomal proteins were misannotated in databases.

2. I am not very familiar with work on the maize genome or pseudogenes in plants. I will discuss more with my colleagues and get back to you if there are new insights. But based on my experience with ribosomal protein pseudogenes, most such processed pseudogenes are non-functional. While one can never be sure if something is non-functional, there are a few things that one could do:

a. Compare multiple genomes at various distances from the maize genome to see if that region is conserved. If it is, there is some biological preference for retaining those pseudogenes.

b. Look to see if there are known promoter elements upstream of these regions which could potentially enable transcription/translation.

You might want to refer to a paper we recently published on ribosomal protein pseudogenes, "Comparative analysis of processed ribosomal protein pseudogenes in four mammalian genomes".

Saturday, April 12, 2008

How do I get data for the Bayesian Networks Paper?

I am a graduate student at the University of Science and Technology of China majoring in bioinformatics. I am going to perform some experiments concerning PPI networks, and I found your article "A Bayesian networks approach for predicting protein-protein interactions from genomic data" very useful for my work. I need the various datasets preprocessed by your group. Will you send me a copy or tell me the site where I can download the data?

You can find the data on the supplementary website: http://networks.gersteinlab.org/intint/supplementary.htm

Thursday, April 10, 2008

How to obtain Standalone version of Calc-surface?

I am wondering if I could get a stand-alone version of the calc-surface program. I cannot find it on the software page of Prof. Gerstein's lab.

I don't release binaries, because I am not good enough at building them to make them compatible with most systems. I would be happy to compile it for you, but I need to know what type of system you have: Linux, Mac, 32bit, 64bit, G4, G5. I am sorry to say I can't do Windows. Please contact the lab.

Wednesday, April 9, 2008

How do I modify source code of Calc-Surface?

I want to use calc-surface with a large probe of 14 Angstroms, but calc-surface seg faults when the probe is larger than 3 to 4 Å. I would like to modify the code to handle larger probes. Can you point me in the right direction as to where to edit the code? I am using version 2.3.1.

The code you should be editing is called calc-surface.main.c, which is in:

libproteingeometry-2.3.1/src-prog

I would also be careful about other scripts or subroutines that this program calls. Most of them are in the same directory. All the programs are in one of these three directories: src-prog, src-pro2 and src-pro3.

Wednesday, April 2, 2008

We have a new integrin ectodomain structure with two molecules in the asymmetric unit. There is a small amount of breathing at what we call the headpiece-tailpiece interface when the two molecules are compared. This is very important to some molecular dynamics simulations that we are doing. The movement between the two molecules is small, a few degrees, but involves large units of the molecule. We would like to use the morph server to extrapolate, rather than interpolate, this motion. I have looked at your description of frodo lite and this seems a good approach. So, following up our meeting at Yale, we would appreciate some help with this.

By "extrapolate" I believe you mean that you want to predict an unknown conformation from one or two known ones. FRODA can in fact be run in "undirected" mode and this will sample the accessible phase space consistent with sterics and the hydrogen bonding pattern. However it does not pick out the desired conformer from the large number of generated conformers. Certain assumptions are also made about the hydrogen bonding pattern which may result in the desired conformation not being present at all in the generated ensemble.

Instead of FRODA, perhaps you want to try our soon-to-be-announced motion prediction tool, the Conformation Explorer. It is specifically designed to predict the motion of domains. Domains are often too large and slow moving to be dynamically characterized by MD. Further complicating matters, the motions are stochastic and the MD force fields are far from perfect; therefore the motion may not be observed even when the trajectory has been computed for a period of time experimentally known to be sufficient. We have been successful in predicting large scale domain hinge bending motions for five proteins, including biotin carboxylase, glutamine binding protein, and MurA.

We do need to know something about the conformation which is to be predicted, however, in order to pick the right conformer out of the ensemble. If it binds a small ligand, for instance, we can find the holo structure given the apo by computing stability, free energy of ligand binding, gyration radius, and other quantities. Your use of the word "extrapolate" suggests you may have some geometric information about the target conformer which we can use.

Thursday, February 21, 2008

NucProt Calculator Error in RNA Analysis

I was trying to use the NucProt calculator program for an RNA sequence of 7.4 kb. I always get an error message instead of an answer. What is the maximum sequence length allowed as input to the program?

It is actually quite a simple script. It only counts how many A, C, G and U bases there are and uses the following formulas:

"Exact" calculation:

volume (cubic Angstroms) = #A*315.45 + #C*291.285 + #G*323.028 + #U*291.285
mass (Daltons) = #A*329.2 + #C*305.2 + #G*345.2 + #U*306.2

For an RNA of that size, the per-base average might be just as good:

7.4 kb * 304 A^3 ~= 2.25 million cubic Angstroms
7.4 kb * 321.45 Da ~= 2.38 MDa (megadaltons)
PSV is something like volume/mass*0.6022 ~= 0.57
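The calculation above is easy to reproduce locally. The following Python sketch applies the per-base coefficients quoted in the answer; it is not the NucProt script itself, and the function names are hypothetical. Characters other than A, C, G and U are simply ignored, which is an assumption about how unexpected input should be treated.

```python
# Per-base coefficients quoted in the answer above.
VOLUME_A3 = {"A": 315.45, "C": 291.285, "G": 323.028, "U": 291.285}
MASS_DA = {"A": 329.2, "C": 305.2, "G": 345.2, "U": 306.2}


def rna_volume_and_mass(seq):
    """Return (volume in cubic Angstroms, mass in Daltons) for an RNA sequence."""
    seq = seq.upper()
    counts = {base: seq.count(base) for base in "ACGU"}
    volume = sum(counts[b] * VOLUME_A3[b] for b in "ACGU")
    mass = sum(counts[b] * MASS_DA[b] for b in "ACGU")
    return volume, mass


def partial_specific_volume(volume, mass):
    """Convert cubic Angstroms per Dalton to cm^3/g.

    The factor 0.6022 is Avogadro's number times 1e-24 cm^3 per cubic Angstrom.
    """
    return volume / mass * 0.6022


volume, mass = rna_volume_and_mass("ACGU")
print(volume, mass, partial_specific_volume(volume, mass))
```

Because the script only counts bases, sequence length by itself should not be a limiting factor for a local calculation like this one.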

Wednesday, February 13, 2008

How do I submit homology models to FlexOracle?

Does FlexOracle predict the hinge points in GPCR homology models? And has anyone done such a submission?

FlexOracle, as well as hNMb and stonehinge, assumes the protein is solvated -- this is less than ideal for you since GPCR is a membrane protein. FlexOracle works by splitting the protein into two fragments at every possible pair of points, and computing the stability of the fragments using the FoldX force field. However FoldX will not guess at the effect of lipids, it can only estimate the effect of a solvent environment. That having been said, I have had some luck with membrane proteins, if only because often people are interested in some cytosolic domain. This may be the case for you.

TLSMD, on the other hand, uses the crystallographic B-factors and so it will tell you how it flexes in the crystal, which is often similar to how it will flex in vivo. All of these analyses are done for you when you submit to the HingeMaster server on molmovdb.org, as I encourage you to do. Bear in mind that the analysis will be done on one chain only. For particularly large proteins you may have to wait a couple of days. If you don't hear back from me when you get your results you are welcome to email me and I will help you interpret the results.

Threaded models have an additional strike against them in that the structure is almost certainly not properly equilibrated and thus is likely to be full of residual strains. This further limits the ability of FlexOracle to probe stability of fragments. hNMb is less sensitive to such details and so may be more useful, especially for the cytosolic domains. TLSMD will actually not return any results, since there will be no B-factors in a theoretical structure.

It still costs nothing to make a submission to our server, and despite these various issues you may still learn something.

Monday, February 11, 2008

How Do You Save or Download pdb file?

I need a PDB file, such as 119l or the morph named "Small G-protein Arf6", with a movie. I didn't find an option to save or download it. Also, what software is needed to show the protein motion in a PowerPoint presentation?

A small movie is made on the old morph page. This is the easiest option, but you can't control the perspective and you have limited rendering options. More sophisticated movie making is available on our polyview3D page, administered by Alexey. Both are linked to from the top of the morph page.

Saturday, February 9, 2008

Resolving Conflicts in More than One Complex in the Same PFam Family

If there were two known complexes in the same Pfam family, how did you use the combined information? And if homologs were involved, did you just assign the aligned residues to the interface based on what interfacial residues were identified in the X-ray structure?

If we have known yeast complexes, these always take precedence over any information in iPfam. If there is a known interaction between two proteins which share one Pfam domain _and_ the Pfam domain has been shown to bind itself in a crystal structure (i.e. in iPfam), we annotate that these two proteins will bind each other through the interface that is seen in that structure. In the case of asymmetric binding between the same domain, the assignment is ad hoc.

Regarding your second question: yes, we use Pfam assignments as a form of homology mapping. We then assign the interface based on what is seen in the crystal structure.

Monday, January 21, 2008

PARE Perl Script

I would appreciate it if you could provide me with the Perl script source code of PARE.

http://proteomics.gersteinlab.org/PARE_download/PARE.tar

Saturday, January 12, 2008

Missing information regarding mutation rates

I am interested in using the substitution rate results you published in the paper "Patterns of nucleotide substitution, insertion and deletion in the human genome inferred from pseudogenes".
I need to use your results for my research; however, some information is missing from your paper. Specifically, in Figure 2-A the following mutation rates are missing:

G-> C with neighboring of C

G->T with neighboring of C

G->A with neighboring of C

Another point is whether the dinucleotide effect rates were computed in CpG-rich, CpG-poor or general areas. Also, would it be possible for you to send me a table of these numbers, so that I can get them more accurately than by extracting them from the graph?

We excluded CpG dinucleotides from our analysis since they are known to have much higher mutation rates than other dinucleotides, due to the mechanism of methylation–deamination of cytosine. In fact, mammalian genomes are depleted of CpGs, except in CpG islands.

We also mentioned this in the paper.

Friday, January 4, 2008

Trouble Running Program to Reproduce Results in Defective Clique Paper

To test out your algorithm in the paper "Predicting interactions in protein networks by completing defective cliques.", I was running the dcc code on C, and this is what I am assuming to pass in order: (repetition if needed), minimal overlap size, max size of non-overlapping parts, the network you want to find cliques, the negative Gold standard binfile, and the positive gold standard binfile, and output file to write results.

So I had some trouble reproducing the results you had in your paper. I got the negative and positive gold standard from this website: http://networks.gersteinlab.org/intint/supplementary.htm to pass in. The G+ matched up (8250), though my G- was off. These are the results I got running your large data set with the above mentioned gold standard: Initial G+ = 8250, G- = 2697594, bogus(?) = 7573, new edges = 388, initial max cliques = 4934, 61 new pos interactions, -37 neg interactions (total 98), and LR = -539.08. The results you got in your paper were as follows: G- = 2708622, new edges = 437, edges detected in gold standard = 73, in neg = 21; thus LR = 1141.3. So I definitely am passing in the wrong variables/datasets.

Can you please direct me to the datasets on gold standard you are using for the large dataset? I looked at your additional websites/supplements mentioned in your paper, but was unable to find it. If I am doing this correctly, can you then explain why I am getting different numbers? That would be very kind of you and greatly, greatly appreciated.

Furthermore, on running the 56by56 network, if I understand correctly, you just run it without a negative gold standard? So all you are doing in your paper is comparing the maximum cliques created after clique completion, so I do not have to worry about LR results, right?

Thanks a lot for your interest in my paper. You are doing the right things. I got the same numbers as you did using the same gold standard sets. My only explanation is that this part was done by another co-author (Valery). I guess he must have used slightly different sets. Unfortunately, I have not been able to get in touch with him (hence the delay in my response). But the numbers are in the ballpark. It does not in any way diminish the effectiveness of our method.

As for the 56x56 network, you don't need a negative set and you don't need to worry about the LR results.
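For readers comparing their own runs, both LR values quoted in this exchange are consistent with a likelihood ratio of the form (hits in G+ / size of G+) divided by (hits in G- / size of G-). That formula is inferred here from the quoted figures rather than taken from the paper or the dcc code, and the function name below is hypothetical.

```python
def likelihood_ratio(pos_hits, neg_hits, n_pos, n_neg):
    """Fraction of the positive gold standard recovered, divided by the
    fraction of the negative gold standard recovered."""
    return (pos_hits / n_pos) / (neg_hits / n_neg)


# Figures reported for the paper: 73 hits in G+ (8250), 21 hits in G- (2708622).
print(round(likelihood_ratio(73, 21, 8250, 2708622), 1))

# Figures from the reader's run, using the negative count exactly as reported (-37).
print(round(likelihood_ratio(61, -37, 8250, 2697594), 2))
```

The first call gives about 1141.3 and the second about -539.08, matching the numbers in the question. The sign of the second value comes entirely from the negative count of -37 that was reported.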

Wednesday, January 2, 2008

Rice Genome Pseudogenes

I do not know how to identify pseudogenes of some gene families in rice, although I have read many papers about it. I was very happy to see your paper "Pseudogene.org: a comprehensive database and comparison platform for pseudogene annotation", but the Eukaryote Data databases do not include the rice genome. I am not able to download the PseudoPipe program from http://www.pseudogene.org as one paper mentioned. In fact, I cannot find any program on the webserver. Would you be so kind as to give me the PseudoPipe program?

Unfortunately, we have not run our PseudoPipe program on the rice genome. The PseudoPipe software itself is rather complex, computationally intensive, and designed to run on a computing cluster. If you wish to try to install the software anyway, the source is located at http://www.pseudogene.org/DOWNLOADS/pipeline_codes/