Monday, October 26, 2009

Few Questions about PeakSeq and STAT1 Data

I was hoping to use your STAT1 peak results to test a binding site
prediction program that I am developing as part of my PhD project.
I have looked at the files on your source website. For the moment, I am hoping for STAT1 and Control files, which indicate peak location, height, and width, and what processing has been done to achieve those values.
1) What is the meaning of the file names within the Mapped_Sequence_Reads directory, e.g. FC203E3*s_*, FC305JN_s_*, FC30*_s_*, and FC5817_s_*? What sort of processing was done on these mapped files, e.g. masking tags with multiple locations?
2) Within the Scored_Results directory for STAT1 there are two text files, and another directory (Extended/) containing a text file. What is the functional difference between these files?
3) What are the column headings for the STAT1 mapped files and scored files?

1. The individual files are the Eland-aligned reads for each lane of data. Each lane from each flowcell (flowcell IDs are FC203E...) is kept separate, even though biological replicate samples were split over multiple lanes. The reads were aligned with the Illumina software Eland using the default settings, so a location is reported only for reads whose best alignment (least number of mismatches) is unique in the genome. Reads that align to multiple locations are in the files; however, their locations aren't reported.
2. In the directory for STAT1 results there are two files. One contains all the binding site locations sorted by q-value. The other file, STAT1.with_peak_locations.txt, contains the same information with one additional field giving the nucleotide position of the peak in each binding site region. I explain the fields in my answer to question 3. The file in the Extended/ directory contains all the potential binding site regions (sorted by genomic location), even those that aren't enriched relative to the Input DNA control. These two additional files were added in response to email requests I received.
3. For the mapped files the fields are as follows:
1. Sequence name (derived from file name and line number if format is not Fasta)
2. Sequence
3. Type of match:
NM - no match found.
QC - no matching done: QC failure (too many Ns basically).
RM - no matching done: repeat masked (may be seen if repeatFile.txt was specified).
U0 - Best match found was a unique exact match.
U1 - Best match found was a unique 1-error match.
U2 - Best match found was a unique 2-error match.
R0 - Multiple exact matches found.
R1 - Multiple 1-error matches found, no exact matches.
R2 - Multiple 2-error matches found, no exact or 1-error matches.
4. Number of exact matches found.
5. Number of 1-error matches found.
6. Number of 2-error matches found.
The remaining fields are only seen if a unique best match was found (i.e. the match code in field 3 begins with "U").
7. Genome file in which match was found.
8. Position of match (bases in file are numbered starting at 1).
9. Direction of match (F=forward strand, R=reverse).
10. How N characters in read were interpreted: ("."=not applicable, "D"=deletion, "I"=insertion).
The remaining fields are only seen in the case of a unique inexact match (i.e. the match code was U1 or U2).
11. Position and type of first substitution error (e.g. 12A: base 12 was A, not whatever it was in the read).
12. Position and type of second substitution error, as above (U2 matches only).
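To make the mapped-file layout concrete, here is a minimal Python sketch that splits one line into the fields listed above. It is only an illustration: the name parse_eland_line and the ElandRead class are hypothetical, the fields are assumed to be whitespace-separated, and lines that stop after the match code are assumed to have zero counts.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ElandRead:
    name: str                       # field 1
    sequence: str                   # field 2
    match_code: str                 # field 3 (NM, QC, RM, U0-U2, R0-R2)
    exact: int                      # field 4
    one_error: int                  # field 5
    two_error: int                  # field 6
    genome_file: Optional[str] = None       # field 7, unique matches only
    position: Optional[int] = None          # field 8, 1-based
    strand: Optional[str] = None            # field 9, "F" or "R"
    n_interpretation: Optional[str] = None  # field 10
    substitutions: Tuple[str, ...] = ()     # fields 11-12, U1/U2 only


def parse_eland_line(line: str) -> ElandRead:
    """Parse one mapped-read line (assumes whitespace-separated fields)."""
    f = line.split()
    if len(f) < 3:
        raise ValueError("expected at least name, sequence and match code")
    # Assumption: if the count fields are missing, treat them as zero.
    counts = [int(x) for x in f[3:6]]
    counts += [0] * (3 - len(counts))
    read = ElandRead(f[0], f[1], f[2], *counts)
    # Location fields are present only when the best match is unique.
    if read.match_code.startswith("U") and len(f) >= 10:
        read.genome_file = f[6]
        read.position = int(f[7])
        read.strand = f[8]
        read.n_interpretation = f[9]
        read.substitutions = tuple(f[10:12])
    return read
```

Keeping only the reads whose match_code starts with "U" then gives the uniquely aligned tags, which are the only ones that carry a genomic position.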
The fields for the scored files are as follows:
1. chromosome
2. region begin
3. region end
4. tag count from ChIP sample in binding region
5. tag count from control sample in binding region (after normalization)
6. Enrichment (ratio of field 4 to field 5)
7. Excess tag count (field 4 minus field 5)
8. q-value (Benjamini-Hochberg corrected p-value)
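
As a rough illustration of how the scored-file fields relate to each other, the following Python sketch reads one line and checks that fields 6 and 7 agree with fields 4 and 5. The names ScoredRegion, parse_scored_line and is_consistent are hypothetical. The sketch assumes whitespace-separated fields with no header line, and it ignores anything after the eighth field, such as the extra peak-position field in STAT1.with_peak_locations.txt (whose column position I am assuming to be last).

```python
import math
from typing import NamedTuple


class ScoredRegion(NamedTuple):
    chrom: str           # field 1
    begin: int           # field 2
    end: int             # field 3
    chip_tags: float     # field 4
    control_tags: float  # field 5 (normalized, so it may be fractional)
    enrichment: float    # field 6
    excess: float        # field 7
    q_value: float       # field 8


def parse_scored_line(line: str) -> ScoredRegion:
    """Parse one scored-region line (assumes whitespace-separated fields)."""
    f = line.split()
    if len(f) < 8:
        raise ValueError("expected at least 8 fields")
    return ScoredRegion(f[0], int(f[1]), int(f[2]),
                        *(float(x) for x in f[3:8]))


def is_consistent(r: ScoredRegion, tol: float = 0.01) -> bool:
    """Check enrichment = field 4 / field 5 and excess = field 4 - field 5."""
    ok_excess = math.isclose(r.excess, r.chip_tags - r.control_tags,
                             rel_tol=tol, abs_tol=tol)
    # Skip the ratio check when the control count is zero.
    ok_ratio = r.control_tags == 0 or math.isclose(
        r.enrichment, r.chip_tags / r.control_tags, rel_tol=tol)
    return ok_excess and ok_ratio
```

The tolerance is there because the values in the files are presumably rounded when written out, so an exact comparison would be too strict.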