Gerstein Lab FAQs: 2009

Monday, October 26, 2009

Few Questions about PeakSeq and STAT1 Data

I was hoping to use your STAT1 peak results to test a binding site
prediction program that I am developing as part of my PhD project.
I have looked at the files on your source website. For the moment, I am hoping for STAT1 and Control files, which indicate peak location, height, and width, and what processing has been done to achieve those values.
1) What is the meaning of the file names within the Mapped_Sequence_Reads directory, e.g. file names FC203E3*s_*, FC305JN_s_*, FC30*_s_*, and FC5817_s_*? What sort of processing was done on these mapped files, e.g. masking tags with multiple locations?
2) Within the Scored_Results directory for STAT1 there are two text files, and another directory (Extended/) containing a text file. What is the functional difference between these files?
3) What are the column headings for the STAT1 mapped files and scored files?

1. The individual files are the Eland-aligned reads for each lane of data. Each lane from each flowcell (flowcell IDs are FC203E...) is kept separate, even though biological replicate samples were split over multiple lanes. The reads were aligned with the Illumina software Eland using the default settings, so a location is reported only for reads whose best alignment (fewest mismatches) is unique in the genome. Reads that align to multiple locations are still in the files, but their locations aren't reported.
2. In the directory for STAT1 results (http://archive.gersteinlab.org/proj/PeakSeq/Scoring_ChIPSeq/Results/STAT1/STAT1_Targets/) there are two files. STAT1.final.txt contains all the binding site locations sorted by q-value. The other file, STAT1.with_peak_locations.txt, contains the same information plus one additional field giving the nucleotide position of the peak in each binding site region. I explain the fields in these files in the response to question 3. The file in the Extended/ directory contains all the potential binding site regions (sorted by genomic location), even those that aren't enriched relative to the Input DNA control. These two additional files were added in response to requests I had received by email.
3. For the mapped files the fields are as follows:
1. Sequence name (derived from file name and line number if format is not Fasta)
2. Sequence
3. Type of match:
NM - no match found.
QC - no matching done: QC failure (too many Ns basically).
RM - no matching done: repeat masked (may be seen if repeatFile.txt was specified).
U0 - Best match found was a unique exact match.
U1 - Best match found was a unique 1-error match.
U2 - Best match found was a unique 2-error match.
R0 - Multiple exact matches found.
R1 - Multiple 1-error matches found, no exact matches.
R2 - Multiple 2-error matches found, no exact or 1-error matches.
4. Number of exact matches found.
5. Number of 1-error matches found.
6. Number of 2-error matches found.
Rest of fields are only seen if a unique best match was found (i.e. the match code in field 3 begins with "U").
7. Genome file in which match was found.
8. Position of match (bases in file are numbered starting at 1).
9. Direction of match (F=forward strand, R=reverse).
10. How N characters in read were interpreted ("."=not applicable, "D"=deletion, "I"=insertion).
Rest of fields are only seen in the case of a unique inexact match (i.e. the match code was U1 or U2).
11. Position and type of first substitution error (e.g. 12A: base 12 was A, not whatever it was in the read).
12. Position and type of second substitution error, as above.
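
As a rough illustration of the field layout above, here is a minimal Python sketch that parses one line of a mapped file. It assumes the fields are tab-separated and appear in the order listed; the names `ElandRead` and `parse_eland_line` are hypothetical and are not part of Eland or PeakSeq.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ElandRead:
    """One line of an Eland mapped file (field order as listed above)."""
    name: str
    sequence: str
    match_type: str                  # NM, QC, RM, U0, U1, U2, R0, R1 or R2
    exact_matches: int = 0
    one_error_matches: int = 0
    two_error_matches: int = 0
    genome_file: Optional[str] = None
    position: Optional[int] = None   # 1-based position in the genome file
    strand: Optional[str] = None     # "F" or "R"
    n_interpretation: Optional[str] = None
    substitutions: Tuple[str, ...] = ()

    @property
    def is_unique(self) -> bool:
        return self.match_type.startswith("U")


def parse_eland_line(line: str) -> ElandRead:
    # Assumption: fields are separated by tabs.
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 fields, got {len(fields)}")
    # Fields 4-6 (match counts); treated as 0 if a line omits them.
    counts = [int(x) for x in fields[3:6]]
    counts += [0] * (3 - len(counts))
    read = ElandRead(fields[0], fields[1], fields[2], *counts)
    # Fields 7-10 are only present for unique best matches (U0, U1, U2).
    if read.is_unique and len(fields) >= 10:
        read.genome_file = fields[6]
        read.position = int(fields[7])
        read.strand = fields[8]
        read.n_interpretation = fields[9]
        # Fields 11-12 are only present for inexact unique matches (U1, U2).
        read.substitutions = tuple(fields[10:12])
    return read
```

Keeping only the reads for which `is_unique` is true would give the set of reads that have a reported location.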

The fields for the scored files are as follows:
1. chromosome
2. region begin
3. region end
4. tag count from ChIP sample in binding region
5. tag count from control sample in binding region (after normalization)
6. Enrichment (ratio of 4 to 5)
7. Excess tag count (difference between 4 and 5)
8. q-value (Benjamini-Hochberg corrected p-value)
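
The scored-file fields can be read in a similar way. The sketch below is illustrative only: it assumes whitespace-separated columns with no header line, and the names `ScoredRegion`, `parse_scored_line` and `derived_fields_consistent` are hypothetical. It also checks that fields 6 and 7 agree with the ratio and difference of fields 4 and 5, within a tolerance chosen here to allow for rounding in the file.

```python
from typing import NamedTuple


class ScoredRegion(NamedTuple):
    chrom: str
    start: int
    end: int
    chip_tags: float
    control_tags: float   # control tag count after normalization
    enrichment: float     # field 4 / field 5
    excess: float         # field 4 - field 5
    q_value: float        # Benjamini-Hochberg corrected p-value


def parse_scored_line(line: str) -> ScoredRegion:
    # Assumption: columns are whitespace-separated. Any fields after the
    # eighth (e.g. the peak position in the with_peak_locations file)
    # are ignored by this sketch.
    f = line.split()
    if len(f) < 8:
        raise ValueError(f"expected at least 8 fields, got {len(f)}")
    return ScoredRegion(f[0], int(f[1]), int(f[2]),
                        *(float(x) for x in f[3:8]))


def derived_fields_consistent(r: ScoredRegion, rel_tol: float = 0.01) -> bool:
    """Check that enrichment and excess match fields 4 and 5."""
    if r.control_tags == 0:
        return False  # the ratio is undefined
    ratio = r.chip_tags / r.control_tags
    diff = r.chip_tags - r.control_tags
    ratio_ok = abs(r.enrichment - ratio) <= rel_tol * max(1.0, abs(ratio))
    diff_ok = abs(r.excess - diff) <= rel_tol * max(1.0, abs(diff))
    return ratio_ok and diff_ok
```

A list of parsed regions can then be filtered on `q_value`; the cutoff to use depends on the analysis and is not specified here.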

Sunday, July 19, 2009

Submitting confidential files to Morph Server

I would like to submit two confidential files to the protein Morph Server which contain heteroatoms – the cofactors FAD and NADP+. The server notified me to contact the site maintainers directly for this service. I’d like to calculate 20 frames between the structures.

We have the option of a private site on the submission page (http://molmovdb.org/cgi-bin/submit.cgi). In that case the created web page won't be added to any databases and won't be searchable. Only someone who knows the created link will be able to see it. If the submitter does not receive the email with the results, they should try opening the page http://molmovdb.org/cgi-bin/morph.cgi?ID=given_id manually, where given_id is the ID given at submission. For example: http://molmovdb.org/cgi-bin/morph.cgi?ID=585109-19896

Thursday, February 26, 2009

How to reconcile between the Pseudogene Family Database and Eukaryote Database?

I found something misleading: the size of the pseudogene protein families database is quite different from that of the Eukaryote Database. For example, gene ENSPTRG00000021298 is contained in Pseudogene Families in Chimp (http://www.pseudogene.org/FAMILY/genome_seq_show.php?genome_ac=9598), but not in the Eukaryote Database (http://tables.pseudogene.org/chimp). I wonder whether a gene that has an Ensembl ID is a pseudogene or not.
Which database should I depend on?

For the latest pseudogene families, you may want to take a look at our Pseudofam database, recently published in NAR (http://nar.oxfordjournals.org/cgi/content/abstract/gkn758v1).

However, pseudogene families were built upon the parent proteins of the pseudogenes (which means they use the Ensembl Peptide/Protein ID rather than the Gene ID). Also, pseudogene families only contain pseudogenes that can be classified into families.

As a result, if you have a set of gene IDs and you wish to see if they have any pseudogenes, I recommend downloading the chimp pseudogene set available at Pseudogene.org (http://tables.pseudogene.org/flatfiles/chimp.txt) and searching for the gene ID annotation.
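
A minimal Python sketch of such a search is shown below. It makes no assumption about the column layout of chimp.txt and simply reports every line in which one of the IDs appears as a whole token; the function name `find_gene_ids` is hypothetical.

```python
import re
from typing import Dict, Iterable, List


def find_gene_ids(lines: Iterable[str], gene_ids: Iterable[str]) -> Dict[str, List[str]]:
    """Map each gene ID to the flat-file lines that mention it."""
    ids = list(gene_ids)
    hits: Dict[str, List[str]] = {gene_id: [] for gene_id in ids}
    if not ids:
        return hits
    # Match IDs as whole tokens so one ID cannot match inside a longer one.
    pattern = re.compile(r"\b(" + "|".join(re.escape(i) for i in ids) + r")\b")
    for line in lines:
        # A set, so a line is recorded once per ID even if the ID repeats.
        for gene_id in set(pattern.findall(line)):
            hits[gene_id].append(line.rstrip("\n"))
    return hits
```

After downloading the file, it could be used as `with open("chimp.txt") as fh: hits = find_gene_ids(fh, my_ids)`, where `my_ids` is your own list of IDs. An ID with an empty list has no line mentioning it in the set.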

For the details of our pseudogene identification, you might want to read our paper published previously in Bioinformatics:
http://bioinformatics.oxfordjournals.org/cgi/content/full/22/12/1437

Sunday, February 15, 2009

Huge discrepancy in the numbers in "Modeling ChIP Sequencing in Silico with Applications"

In your article "Modeling ChIP Sequencing in Silico with Applications", you mention 2,915,382 initial sequence reads obtained in Robertson's experiments, but when I check this number against the original paper, the total number of sequenced reads is 24.1M, which is significantly different from your figure. Could you clarify, please?

For our ChIP-seq analysis, we used the read sequences that Robertson et al. sent to us at our request, which we made well before the publication of their paper.

Tuesday, January 6, 2009

How do you submit private data to the morph server?

There is now a "Private" check box on the morph submission form. If checked, the submitter will still receive an email upon submission but the general public will have no way to find the morph. It will not appear when people use the search box on the front page, nor will it appear among the user-submitted morphs on the movies page. Search engines do not index morph pages since they are dynamically generated. The only way third parties could know of its existence is if they were somehow able to intercept the email sent from the server to the submitter. Thus it is highly unlikely that third parties would ever know of or find it. We trust this will provide sufficient privacy.
