Gerstein Lab FAQs: pseudogene

Thursday, February 26, 2009

How to reconcile between the Pseudogene Family Database and Eukaryote Database?

I found it confusing that the size of the pseudogene protein families database differs considerably from that of the Eukaryote Database. For example,
gene ENSPTRG00000021298 is contained in Pseudogene Families in Chimp (http://www.pseudogene.org/FAMILY/genome_seq_show.php?genome_ac=9598) but not in the Eukaryote Database (http://tables.pseudogene.org/chimp). I wonder whether a gene that has an Ensembl ID is a pseudogene or not.
Which database should I rely on?

For the latest pseudogene families, you may want to take a look at our
Pseudofam database, recently published in NAR (http://nar.oxfordjournals.org/cgi/content/abstract/gkn758v1).

Note, however, that pseudogene families are built from the parent proteins of the pseudogenes (that is, they use Ensembl Peptide/Protein IDs rather than Gene IDs). Also, pseudogene families only contain pseudogenes that
can be classified into families.

As a result, if you have a set of gene IDs and wish to see whether they have any pseudogenes, I recommend downloading the chimp pseudogene set available at Pseudogene.org (http://tables.pseudogene.org/flatfiles/chimp.txt) and searching for the
gene ID annotation.
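
The search described above could be scripted roughly as follows. This is only a sketch: it assumes the flat file is tab-delimited and that an ID of interest appears as a complete field somewhere in a row. The function name, column names, and IDs in the toy data are invented for illustration and do not reflect the real layout of chimp.txt.

```python
def find_pseudogenes(flatfile_lines, query_ids):
    """Return the rows (as lists of fields) that contain any query ID as a field."""
    wanted = set(query_ids)
    hits = []
    for line in flatfile_lines:
        fields = line.rstrip("\n").split("\t")  # assumption: tab-delimited rows
        if wanted.intersection(fields):
            hits.append(fields)
    return hits


# Made-up rows for demonstration; the real file will look different.
sample_lines = [
    "ID\tChromosome\tStart\tEnd\tParent\n",
    "PG1\t1\t100\t400\tHYPOTHETICAL_ID_1\n",
    "PG2\t2\t500\t900\tHYPOTHETICAL_ID_2\n",
]
hits = find_pseudogenes(sample_lines, ["HYPOTHETICAL_ID_2"])
for row in hits:
    print("\t".join(row))
```

For the real file, pass an open file handle instead of `sample_lines`, and check the header line first to see which ID type (gene or protein) the annotation column actually holds.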

For details of our pseudogene identification, you might want to read our paper published previously in Bioinformatics:
http://bioinformatics.oxfordjournals.org/cgi/content/full/22/12/1437

Monday, April 21, 2008

Do you call mutations at key residues of a protein pseudogenes?

If a gene is transcribed and translated, but mutations at key residues make the protein non-functional or very unstable (with a very short half-life), do we call this kind of gene a pseudogene? If not, it will be less meaningful for me to classify such genes, because they are not under the same evolutionary pressure as functional genes. If so, computer algorithms may have difficulty identifying them.

This is a very interesting question. The definition of "gene" and "pseudogene" is extremely fuzzy. Currently, our pipeline will not call this a pseudogene because we primarily look for frame-shifts and nonsense mutations. We can identify processed pseudogenes that don't have frame-shifts or nonsense mutations, specifically pseudogenes of multi-exon genes that appear as a single-exon retrotransposed gene. We don't have a clear way of differentiating between a functional retrogene and a pseudogene; we simply flag it as a processed pseudogene. If the gene of interest has many exons and this structure is retained in the non-functional entities, we will not call it a pseudogene unless we detect a frame-shift or a nonsense mutation. But we are constantly adding new features to our pipeline and will have a discussion with Prof. Gerstein and the rest of the team about this aspect.
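
To make the two signals mentioned above concrete, here is a toy sketch of checking a putative coding sequence for them. It is not our pipeline: the function name is hypothetical, and treating "length not a multiple of three" as a frame-shift is only a crude proxy for what would in practice be inferred from an alignment against the parent protein. It also shows why a gene with only missense changes at key residues would pass unnoticed.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_disablements(cds):
    """Report crude signs of disruption in a putative coding sequence.

    Returns (frameshift_suspected, premature_stops), where premature_stops
    lists the 0-based codon indices of in-frame stop codons that occur
    before the final complete codon.
    """
    cds = cds.upper()
    # Crude proxy: an intact coding sequence has a length divisible by 3.
    frameshift_suspected = len(cds) % 3 != 0
    # Split into complete codons, ignoring any trailing partial codon.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    premature_stops = [i for i, codon in enumerate(codons[:-1])
                       if codon in STOP_CODONS]
    return frameshift_suspected, premature_stops


print(find_disablements("ATGAAACCCTAA"))     # intact toy sequence
print(find_disablements("ATGAAATAGCCCTAA"))  # in-frame stop at codon index 2
print(find_disablements("ATGAACCCTAA"))      # one base deleted
```

A sequence carrying only amino-acid substitutions returns `(False, [])` here, which mirrors why such cases are not called pseudogenes by criteria of this kind.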

How good is Pseudogene Identification?

How good is the current algorithm at identifying pseudogenes? For the maize example, how can we know whether the thousands of copies of RVP genes on transposons are functional or not?

1. Our pipeline has specific criteria for identifying pseudogenes and the first step involves filtering out exons annotated as protein coding. Therefore, if the underlying genome annotation is incorrect, then we will miss some pseudogenes. The scenario you have described is similar to ribosomal protein pseudogenes where we observe several retrotransposed pseudogenes. In this case, we specifically modified the pipeline to not mask the exons as most of the ribosomal proteins were misannotated in databases.

2. I am not very familiar with work on the maize genome or pseudogenes in plants. I will discuss more with my colleagues and get back to you if there are new insights. But based on my experience with ribosomal protein pseudogenes, most such processed pseudogenes are non-functional. While one can never be sure that something is non-functional, there are a few things that one could do:

a. Compare multiple genomes at various evolutionary distances from the maize genome to see if that region is conserved. If it is, there is some biological preference for retaining those pseudogenes.

b. Look to see if there are known promoter elements upstream of these regions which could potentially enable transcription/translation.

You might want to refer to a paper we recently published on ribosomal protein pseudogenes, "Comparative analysis of processed ribosomal protein pseudogenes in four mammalian genomes".
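
As a rough illustration of suggestion (a), one simple conservation measure is the percent identity between the region and its counterpart in another genome. The sketch below assumes the two sequences have already been aligned by some external tool and use "-" for gaps; the function name is hypothetical, and a real conservation analysis would use more than a single pairwise score.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of matching positions between two pre-aligned sequences.

    Columns where either sequence has a gap are skipped.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == gap or y == gap:
            continue
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Toy alignment: 4 of the 5 ungapped columns match.
print(percent_identity("ACGT-A", "ACCTTA"))
```

Computing this against several genomes at increasing distance gives a first impression of whether identity drops off as expected for neutral sequence or stays unusually high.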

Wednesday, August 22, 2007

Compiled Sequences for Human and Chimp

I am seeking sequences for all the pseudogenes listed in the flat files for (at minimum) human and chimp (9606.71.gtf and 9598.2.gtf). I looked at the assembled sets on the website but found compiled sequences only for processed or putative pseudogenes, not for duplicated pseudogenes. Are there files somewhere on the site that have sequence data for all pseudogenes listed in the species gtf files?

Sorry, we don't have that.

Monday, June 18, 2007

Adding Original Genes with Gene Names to Pseudogenes Website

On the Pseudogenes website, only a few (~300) of the 16K pseudogene hits could be linked to a gene in the RefSeq gene list file. Would it be possible to add a list of all the original genes with their genome locations to your website?

Most of the original genes are listed by Ensembl ID. You can look up their information at http://www.ensembl.org/

Tuesday, June 5, 2007

Pseudogene Sequence Data Download

We know there is a way to download the pseudogenes of each organism; the downloaded file comes with the name of the pseudogene, the start and end positions, etc. But we wanted to download the sequence of each pseudogene of each organism directly, and we didn't find a way to do that in the database. Is it possible to download the sequences? Or do we have to write a program that, given the genome of the organism and the start/end positions of each pseudogene, extracts the corresponding sequence?

None of the flatfiles contain the raw sequence information. On an individual pseudogene basis, however, you can query the system for either the amino acid or nucleotide sequence. Simply search for the pseudogene you're looking for and, on the results page, click either the red or yellow button.

(Example results page: http://www.pseudogene.org/cgi-bin/search-results.cgi?tax_id=9606&set_search=63&criterion0=&operator0=&searchValue0=&all=View+All+Pseudogenes&sort=1&output=html )

To get the sequence information for a large set of pseudogenes, however, it would probably be best to write the program you suggested.
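
A minimal sketch of such a program is shown below. It assumes the genome is available as FASTA, that the start/end positions are 1-based and inclusive, and that minus-strand entries should be reverse-complemented; please verify these conventions against the flat file you downloaded. The function names and the toy genome are made up for illustration.

```python
def read_fasta(lines):
    """Parse FASTA text into a dict mapping sequence name to sequence."""
    seqs, name, chunks = {}, None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                seqs[name] = "".join(chunks)
            # Use the first word of the header as the sequence name.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        seqs[name] = "".join(chunks)
    return seqs


_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def extract(genome, chrom, start, end, strand="+"):
    """Return genome[chrom] from start to end (assumed 1-based, inclusive)."""
    seq = genome[chrom][start - 1:end]
    if strand == "-":
        # Reverse complement for features on the minus strand.
        seq = seq.translate(_COMPLEMENT)[::-1]
    return seq


# Toy genome; in practice pass an open FASTA file handle to read_fasta.
genome = read_fasta([">chr1 toy", "ACGTACGTAC", "GGGTTT"])
print(extract(genome, "chr1", 11, 14))       # GGGT
print(extract(genome, "chr1", 11, 14, "-"))  # ACCC
```

Loading a whole genome into memory this way is fine for a sketch; for large genomes an indexed FASTA reader would be a more practical choice.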
