Gerstein Lab FAQs: January 2008

Monday, January 21, 2008

PARE Perl Script

I would appreciate it if you could provide me with the Perl source code for PARE.

http://proteomics.gersteinlab.org/PARE_download/PARE.tar

Saturday, January 12, 2008

Missing information regarding mutation rates

I am interested in using the substitution rates you published in the paper "Patterns of nucleotide substitution, insertion and deletion in the human genome inferred from pseudogenes".
I need these results for my research; however, some information is missing from the paper. Specifically, the following mutation rates are missing from Figure 2A:

G->C with a neighboring C

G->T with a neighboring C

G->A with a neighboring C

Another point: were the dinucleotide effect rates computed in CpG-rich regions, CpG-poor regions, or across the genome in general? Also, could you send me a table of these numbers, so that I have more accurate values than I can extract from the graph?

We excluded CpG dinucleotides from our analysis because they are known to have higher mutation rates than any other dinucleotide, owing to the methylation–deamination of cytosine. In fact, mammalian genomes are depleted of CpGs, except in CpG islands.

We also mentioned this in the paper.

Friday, January 4, 2008

Trouble Running Program to Reproduce Results in Defective Clique Paper

To test the algorithm in the paper "Predicting interactions in protein networks by completing defective cliques", I have been running the dcc code in C. I am assuming the arguments are passed in this order: (repetition if needed), minimal overlap size, maximum size of the non-overlapping parts, the network in which to find cliques, the negative gold standard binfile, the positive gold standard binfile, and the output file for the results.

I had some trouble reproducing the results in your paper. I took the negative and positive gold standards from this website: http://networks.gersteinlab.org/intint/supplementary.htm. The G+ matched (8250), but my G- was off. These are the results I got running your large dataset with the gold standards mentioned above: initial G+ = 8250, G- = 2697594, bogus(?) = 7573, new edges = 388, initial max cliques = 4934, 61 new pos interactions, -37 neg interactions (total 98), and LR = -539.08. The results in your paper were as follows: G- = 2708622, new edges = 437, edges detected in gold standard = 73, in neg = 21; thus LR = 1141.3. So I am definitely passing in the wrong variables or datasets.

Can you please direct me to the gold standard datasets you used for the large dataset? I looked at the additional websites and supplements mentioned in your paper, but was unable to find them. If I am doing this correctly, can you explain why I am getting different numbers? That would be very kind of you and greatly appreciated.

Furthermore, for the 56x56 network, if I understand correctly, you just run it without a negative gold standard? So all you are doing in your paper is comparing the maximum cliques created after clique completion, and I do not have to worry about the LR results, right?

Thanks a lot for your interest in my paper. You are doing the right things. I got the same numbers as you did using the same gold standard sets. My only explanation is that this part was done by another co-author (Valery), who I guess must have used slightly different sets. Unfortunately, I have not been able to get in touch with him (hence the delay in my response). But the numbers are in the same ballpark, and the difference does not in any way diminish the effectiveness of our method.

As for the 56x56 network, you don't need a negative set, and you don't need to worry about the LR results.
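
For readers comparing the two sets of numbers: the LR values quoted in this thread are consistent with a likelihood ratio of the form (new positives / G+) / (new negatives / G-). That definition is an assumption inferred from the figures above, not taken from the dcc source, and the function name below is hypothetical. A minimal sketch under that assumption:

```python
def likelihood_ratio(new_pos, gold_pos, new_neg, gold_neg):
    """Assumed LR: fraction of G+ recovered divided by fraction of G- hit."""
    if new_neg == 0:
        raise ValueError("LR is undefined when no negative-set edges are predicted")
    pos_rate = new_pos / gold_pos   # share of the positive gold standard recovered
    neg_rate = new_neg / gold_neg   # share of the negative gold standard hit
    return pos_rate / neg_rate


# Figures reported in the paper, as quoted in the question
paper_lr = likelihood_ratio(73, 8250, 21, 2708622)
# Figures from the rerun with the downloaded gold standards
rerun_lr = likelihood_ratio(61, 8250, 37, 2697594)

print(f"paper: {paper_lr:.1f}")   # paper: 1141.3
print(f"rerun: {rerun_lr:.2f}")   # rerun: 539.08
```

Under this reading, the rerun value has the same magnitude as the reported -539.08, which suggests the negative sign comes from the "-37" count printed by the program rather than from a different formula.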

Wednesday, January 2, 2008

Rice Genome Pseudogenes

I do not know how to identify the pseudogenes of a particular gene family in rice, although I have read many papers on the subject. I was very happy to see your paper "Pseudogene.org: a comprehensive database and comparison platform for pseudogene annotation", but the Eukaryote Data databases do not include the rice genome. I have not been able to download the PseudoPipe program from http://www.pseudogene.org, as one paper mentioned; in fact, I cannot find any program on the web server. Would you be so kind as to send me the PseudoPipe program?

Unfortunately, we have not run our PseudoPipe program on the rice genome. The PseudoPipe software itself is rather complex, computationally intensive, and designed to run on a computing cluster. If you wish to try to install the software anyway, the source is located at http://www.pseudogene.org/DOWNLOADS/pipeline_codes/
