<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art><ui>1745-6150-4-38</ui><ji>1745-6150</ji><fm>
<dochead>Discovery notes</dochead>
<bibl>
<title>
<p>Strong association between pseudogenization mechanisms and gene sequence length</p>
</title>
<aug>
<au ca="yes" id="A1"><snm>Khachane</snm><mi>N</mi><fnm>Amit</fnm><insr iid="I1"/><email>amit.khachane@mail.mcgill.ca</email></au>
<au id="A2"><snm>Harrison</snm><mi>M</mi><fnm>Paul</fnm><insr iid="I1"/><email>paul.harrison@mcgill.ca</email></au>
</aug>
<insg>
<ins id="I1"><p>Department of Biology, McGill University, Stewart Biology Building, 1205 Docteur Penfield Ave, Montreal, QC, H3A 1B1, Canada</p></ins>
</insg>
<source>Biology Direct</source>
<issn>1745-6150</issn>
<pubdate>2009</pubdate>
<volume>4</volume>
<issue>1</issue>
<fpage>38</fpage>
<url>http://www.biology-direct.com/content/4/1/38</url>
<xrefbib><pubidlist><pubid idtype="pmpid">19807910</pubid><pubid idtype="doi">10.1186/1745-6150-4-38</pubid></pubidlist></xrefbib>
</bibl>
<history><rec><date><day>21</day><month>9</month><year>2009</year></date></rec><acc><date><day>6</day><month>10</month><year>2009</year></date></acc><pub><date><day>6</day><month>10</month><year>2009</year></date></pub></history>
<cpyrt><year>2009</year><collab>Khachane and Harrison; licensee BioMed Central Ltd.</collab><note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note></cpyrt>
<abs>
<sec>
<st>
<p>Abstract</p>
</st>
<p>Pseudogenes arise from the decay of gene copies following either RNA-mediated duplication (processed pseudogenes) or DNA-mediated duplication (nonprocessed pseudogenes). Here, we show that long protein-coding genes tend to produce more nonprocessed pseudogenes than short genes, whereas the opposite is true for processed pseudogenes. Protein-coding genes longer than 3000 bp are 6 times more likely to produce nonprocessed pseudogenes than processed ones.</p>
</sec>
<sec>
<st>
<p>Reviewers</p>
</st>
<p>This article was reviewed by Dr. Dan Graur and Dr. Craig Nelson (nominated by Dr. J Peter Gogarten).</p>
</sec>
</abs>
</fm><meta>
<classifications>
<classification id="endnote" subtype="user_supplied_xml" type="bmc"/>
</classifications>
</meta><bdy>
<sec>
<st>
<p>Background</p>
</st>
<p>Pseudogenes are defective copies of genes that evolve neutrally. Pseudogenes originating from protein-coding genes lack the ability to code for proteins and bear features of coding sequence decay, such as: <it>i) </it>the presence of premature stop codon/frameshift mutations, <it>ii) </it>nonsynonymous/synonymous substitution rate ratios (Ka/Ks) of ~1.0, and <it>iii) </it>truncation of protein domains. Pseudogenes are broadly classified into two types: <it>i) </it>'Processed' or retrotransposed pseudogenes, which arise following an RNA-mediated duplication (retrotransposition) <abbrgrp>
<abbr bid="B1">1</abbr>
<abbr bid="B2">2</abbr>
<abbr bid="B3">3</abbr>
</abbrgrp> and, <it>ii) </it>'Nonprocessed' pseudogenes, which arise following a DNA-mediated duplication <abbrgrp>
<abbr bid="B4">4</abbr>
</abbrgrp>. Unlike nonprocessed pseudogenes, gene copies that arise following retrotransposition do not retain promoter regions of their parent genes. These copies are generally considered to be functionless at the time of birth ('dead on arrival') <abbrgrp>
<abbr bid="B1">1</abbr>
<abbr bid="B3">3</abbr>
</abbrgrp>. Over time, some of these are able to recruit new promoters and become functional <abbrgrp>
<abbr bid="B5">5</abbr>
<abbr bid="B6">6</abbr>
</abbrgrp>. Hence, in this study, we considered retrotransposition as a distinct pseudogenization mechanism.</p>
<p>A basic and intriguing question that remains unanswered is whether sequence length plays any role in the evolution of pseudogenes. If so, is such an effect common to both basic categories of pseudogenes (<it>i.e</it>., processed and nonprocessed)? Here, we address this question for the annotated processed and nonprocessed pseudogenes of the human and mouse genomes.</p>
</sec>
<sec>
<st>
<p>Results and Discussion</p>
</st>
<p>The proportion of protein-coding genes that produced nonprocessed pseudogenes was found to increase with parental gene length (Fig. <figr fid="F1">1</figr>), with an unexplained decrease in the mid-range in human (Fig. <figr fid="F1">1b</figr>), suggesting that following a DNA-mediated duplication event, longer protein-coding genes are generally more likely to become pseudogenes than their shorter counterparts. In contrast, the proportion of protein-coding gene transcripts that produced processed pseudogenes was found to decrease with sequence length (Fig. <figr fid="F2">2</figr>), in agreement with an earlier report that reverse-transcribed gene copies in human are shorter in length <abbrgrp>
<abbr bid="B3">3</abbr>
</abbrgrp>. The trend in the category of processed pseudogenes is the same for the human and mouse genomes when analyzed separately (data not shown). Within the processed pseudogene category, only 67 cases (human and mouse combined) have a parental gene length &gt;1000 amino acids (aa), compared with 421 cases in the nonprocessed category, suggesting that longer protein-coding genes are ~6 times more likely to produce nonprocessed pseudogenes than processed ones.</p>
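<p>The ~6-fold figure follows directly from the two counts quoted above. The short sketch below only reproduces that arithmetic; the variable names are illustrative, and the ratio is one of raw case counts, not of normalized percentages.</p>

```python
# Counts quoted in the text: human and mouse combined,
# parental length above 1000 aa.
nonprocessed_cases = 421
processed_cases = 67

# Ratio of raw case counts (no normalization applied here).
ratio = nonprocessed_cases / processed_cases
print(f"nonprocessed : processed = {ratio:.2f}")
```

<p>The result, about 6.3, is what the text rounds to ~6.</p>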
<fig id="F1"><title><p>Figure 1</p></title><caption><p>Percentage of protein-coding genes producing nonprocessed pseudogenes in the various length categories</p></caption><text>
   <p><b>Percentage of protein-coding genes producing nonprocessed pseudogenes in the various length categories</b>. (a) For human and mouse combined, (b) for human, and (c) for mouse.</p>
</text><graphic file="1745-6150-4-38-1"/></fig>
<fig id="F2"><title><p>Figure 2</p></title><caption><p>Percentage of parental proteins (human+mouse) associated with processed pseudogenes (retropseudogenes) in the various length categories</p></caption><text>
   <p><b>Percentage of parental proteins (human+mouse) associated with processed pseudogenes (retropseudogenes) in the various length categories</b>.</p>
</text><graphic file="1745-6150-4-38-2"/></fig>
<p>These trends can be explained as follows. Under a neutral evolutionary scenario, longer sequences are more likely to accumulate deleterious mutations than shorter ones. This seems to be the case for nonprocessed pseudogenes. A similar effect has been noted in protein-coding genes associated with hereditary diseases <abbrgrp>
<abbr bid="B7">7</abbr>
</abbrgrp>. In the case of retropseudogenes, additional evolutionary forces seem to play a role. This may have to do with the higher propensity of shorter genes to undergo retrotransposition <abbrgrp>
<abbr bid="B3">3</abbr>
</abbrgrp>. Because the probability of interruption during transcription of the parent gene and the subsequent reverse transcription in a retrotransposition event is higher for longer genes than for shorter ones, we expect a larger proportion of successfully retrotransposed sequences to derive from shorter genes. The abundance of transcripts may also influence the number of retropseudogenes arising from a gene. It has been shown that genes with retropseudogenes tend to be expressed in several tissues rather than being tissue-specific <abbrgrp>
<abbr bid="B3">3</abbr>
</abbrgrp>.</p>
</sec>
<sec>
<st>
<p>Conclusion</p>
</st>
<p>This study demonstrates that the occurrence of pseudogenized gene copies is a function of gene length. Parental genes encoding proteins longer than 1000 aa are ~6 times more likely to produce nonprocessed pseudogenes than processed ones.</p>
</sec>
<sec>
<st>
<p>Methods</p>
</st>
<p>The annotations of pseudogenes were obtained from pseudogene.org <abbrgrp>
<abbr bid="B8">8</abbr>
</abbrgrp> in November 2007, human proteins from ENSEMBL release 47 <url>http://www.ensembl.org</url> and mouse proteins from ENSEMBL release 31 (which was also used for the annotation of pseudogenes). The total number of sequences in each category is as follows: human nonprocessed pseudogenes (1494), human processed pseudogenes (2858), human proteins (47550) and human protein-coding genes (23944); mouse nonprocessed pseudogenes (1753), mouse processed pseudogenes (2393), mouse proteins (31535) and mouse protein-coding genes (24461). The number of nonprocessed pseudogenes in each length category was normalized by the number of protein-coding genes, whereas the number of processed pseudogenes was normalized by the number of transcript/protein sequences, because transcripts are the direct precursors of retrotransposed copies.</p>
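<p>The per-category normalization described above can be sketched as follows. This is a minimal illustration and not the authors' code: the bin edges, the function names and the toy lengths are all assumptions, and it assumes that each length category is scored as the percentage of background sequences (genes for nonprocessed, transcripts/proteins for processed pseudogenes) that are parents of a pseudogene.</p>

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical bin edges in amino acids; the paper does not state its boundaries.
BIN_EDGES = [0, 250, 500, 1000, 2000]


def bin_index(length, edges=BIN_EDGES):
    """Return the index of the length bin; the last bin is open-ended."""
    return bisect_right(edges, length) - 1


def percent_per_bin(parent_lengths, background_lengths, edges=BIN_EDGES):
    """Percentage of background sequences in each bin that are pseudogene parents.

    parent_lengths: lengths of sequences that produced a pseudogene.
    background_lengths: lengths of all sequences used for normalization
        (genes for nonprocessed, transcripts/proteins for processed pseudogenes).
    """
    parents = Counter(bin_index(n, edges) for n in parent_lengths)
    background = Counter(bin_index(n, edges) for n in background_lengths)
    # Keys are the lower edge of each bin; bins with no parents score 0.0.
    return {
        edges[i]: 100.0 * parents[i] / background[i]
        for i in sorted(background)
    }


# Toy data, invented purely for illustration.
toy_background = [100, 200, 300, 400, 600, 800, 1200, 1500, 2500, 3000]
toy_parents = [100, 600, 1200, 1500, 2500, 3000]
result = percent_per_bin(toy_parents, toy_background)
print(result)
```

<p>Dividing by a per-bin background count, instead of comparing raw counts, keeps a length category from appearing enriched simply because more genes or transcripts fall into it.</p>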
</sec>
<sec>
<st>
<p>Competing interests</p>
</st>
<p>The authors declare that they have no competing interests.</p>
</sec>
<sec>
<st>
<p>Authors' contributions</p>
</st>
<p>ANK performed the analyses. ANK and PMH interpreted the results and wrote the paper.</p>
</sec>
<sec>
<st>
<p>Reviewers' comments</p>
</st>
<sec>
<st>
<p>Reviewer's report 1</p>
</st>
<p>Dr. Dan Graur</p>
<p>Department of Biology and Biochemistry, University of Houston, USA</p>
<p>Accepted for publication with some stylistic suggestions (not for publication).</p>
</sec>
<sec>
<st>
<p>Reviewer's report 2</p>
</st>
<p>Dr. Craig Nelson (nominated by Dr J Peter Gogarten, University of Connecticut).</p>
<p>Molecular &amp; Cell Biology, University of Connecticut, USA.</p>
<p>In this study the authors describe a relationship between the protein coding length of a gene and number of RNA and DNA mediated duplicate pseudogenes derived from each gene. They find that long genes tend to produce fewer RNA-generated pseudogenes than do shorter genes.</p>
<p>The data presented appears sound and the core finding valid. I recommend accepting for publication following minor revision.</p>
<p>Several suggestions follow for possible improvements to the manuscript.</p>
<p>Major suggestions:</p>
<p>1) Both RNA-mediated and DNA-mediated duplication events give rise to duplicate genes that may become pseudogenized over time. Referring to DNA-mediated events as duplications and RNA-mediated events as something other than duplications does not reflect this fact. I urge the authors to change the way this is presented in the text.</p>
<p>
<b>
<it>Author's response: </it>
</b>For the sake of clarity, we have now introduced the above suggested terms.</p>
<p>Unlike gene copies that arise from DNA-mediated duplication, copies that arise following an RNA-mediated duplication or retrotransposition are essentially functionless at the time of birth, because they do not retain the parental promoter regions for their immediate transcription. Hence, in this study, we considered retrotransposition as a distinct event generating retrotransposed pseudogenes (retropseudogenes). Only some of the retrotransposed copies are able to recruit new promoters over time to become functional.</p>
<p>2) "Processed" and "Non-processed" are not intuitive terms for those outside the field and, while these terms are correct, I suggest that the authors adopt more descriptive terms like RNA-mediated and DNA-mediated duplications, and/or retrotransposed pseudogenes.</p>
<p>
<b>
<it>Author's response: </it>
</b>We have now refined the text to make it more understandable.</p>
<p>3) No clear distinction is made between the duplication event and the pseudogenization event. Some discussion about which of these events are detected and analyzed by the authors and what impact this might have on the core finding would be welcome.</p>
<p>
<b>
<it>Author's response: </it>
</b>We have discussed the above issue in the Results and Discussion section (second paragraph). Also, refer to response to comment 1. We considered RNA-mediated duplication (retrotransposition) <it>per se </it>as an event contributing to the birth of pseudogenes. In this work, we were interested in studying whether sequence length plays any role in the evolution of the two distinct classes of pseudogenes.</p>
<p>4) In the Materials and Methods section, more specific data sources and preprocessing methods should be specified. For example, which Ensembl release was used, and was any filter for pseudogenes and protein-coding genes applied? The numbers of pseudogenes and protein-coding genes in the paper are quite different from the pseudogene data from Pseudogene.org and the protein-coding genes from Ensembl.org. For example, the protein-coding genes listed in Ensembl (release 55) are around 22,000 but the number in the text is 46,689.</p>
<p>
<b>
<it>Author's response: </it>
</b>We have now mentioned the Ensembl release number in the Methods section. We downloaded pseudogene data in November 2007 from Pseudogene.org. The site has recently been updated. In the pseudogene.org database, some pseudogenes are marked as 'unclassified'; we have included only pseudogenes that are annotated as processed or nonprocessed.</p>
<p>The figure 46,689 is the number of human proteins of length &lt;=2000 aa. We have now included cases with sequence length &gt;2000 aa and have corrected the number in the text accordingly.</p>
<p>5) Is the trend the same when the human and mouse genomes are analyzed separately? Is there any reason to put them together?</p>
<p>
<b>
<it>Author's response: </it>
</b>Individually, they show similar trends, rising percentage values with increasing sequence length in the case of nonprocessed pseudogenes (Fig <figr fid="F1">1</figr>) and falling percentage values with increasing sequence length in the case of processed pseudogenes (Fig <figr fid="F2">2</figr>).</p>
<p>Minor suggestions:</p>
<p>1) From the figures, it is not easy to see that the longer parental genes (&gt;1000 aa) are 6 times more prone to produce nonprocessed pseudogenes than processed ones. A figure or table might help.</p>
<p>
<b>
<it>Author's response: </it>
</b>We have now discussed the above in the text.</p>
<p>2) In Fig <figr fid="F1">1</figr> and <figr fid="F2">2</figr>, are there any parental genes longer than 2000 aa? And corresponding pseudogenes?</p>
<p>
<b>
<it>Author's response: </it>
</b>Yes, there are. We have now included them in the analysis.</p>
<p>3) In the discussion the authors mention: "Because the probability of interruption in the transcription of parent genes and subsequent reverse transcription during a retrotransposition event is higher for longer genes than for shorter ones". It might be worth mentioning here that transcript abundance also has a large effect on this probability and that transcript abundance, gene length, and the abundance of retrotransposed pseudogenes are all correlated.</p>
<p>
<b>
<it>Author's response: </it>
</b>We agree with the comments and have included them in the discussion.</p>
</sec>
</sec>
</bdy><bm>
<ack>
<sec>
<st>
<p>Acknowledgements</p>
</st>
<p>A.N.K. and P.M.H. gratefully acknowledge funding support from the Natural Sciences and Engineering Research Council of Canada (NSERC), and from <it>Le Fonds Qu&#233;b&#233;cois de la Recherche sur la Nature et les Technologies </it>(FQRNT).</p>
</sec>
</ack>
<refgrp><bibl id="B1"><title><p>Millions of years of evolution preserved: a comprehensive catalog of the processed pseudogenes in the human genome</p></title><aug><au><snm>Zhang</snm><fnm>Z</fnm></au><au><snm>Harrison</snm><fnm>PM</fnm></au><au><snm>Liu</snm><fnm>Y</fnm></au><au><snm>Gerstein</snm><fnm>M</fnm></au></aug><source>Genome Res</source><pubdate>2003</pubdate><volume>13</volume><issue>12</issue><fpage>2541</fpage><lpage>2558</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1101/gr.1429003</pubid><pubid idtype="pmcid">403796</pubid><pubid idtype="pmpid" link="fulltext">14656962</pubid></pubidlist></xrefbib></bibl><bibl id="B2"><title><p>Analysis of the role of retrotransposition in gene evolution in vertebrates</p></title><aug><au><snm>Yu</snm><fnm>Z</fnm></au><au><snm>Morais</snm><fnm>D</fnm></au><au><snm>Ivanga</snm><fnm>M</fnm></au><au><snm>Harrison</snm><fnm>PM</fnm></au></aug><source>BMC Bioinformatics</source><pubdate>2007</pubdate><volume>8</volume><fpage>308</fpage><xrefbib><pubidlist><pubid idtype="doi">10.1186/1471-2105-8-308</pubid><pubid idtype="pmcid">2048973</pubid><pubid idtype="pmpid" link="fulltext">17718914</pubid></pubidlist></xrefbib></bibl><bibl id="B3"><title><p>Nature and structure of human genes that generate retropseudogenes</p></title><aug><au><snm>Goncalves</snm><fnm>I</fnm></au><au><snm>Duret</snm><fnm>L</fnm></au><au><snm>Mouchiroud</snm><fnm>D</fnm></au></aug><source>Genome Res</source><pubdate>2000</pubdate><volume>10</volume><issue>5</issue><fpage>672</fpage><lpage>678</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1101/gr.10.5.672</pubid><pubid idtype="pmcid">310883</pubid><pubid idtype="pmpid" link="fulltext">10810090</pubid></pubidlist></xrefbib></bibl><bibl id="B4"><title><p>Molecular fossils in the human genome: identification and analysis of the pseudogenes in chromosomes 21 and 22</p></title><aug><au><snm>Harrison</snm><fnm>PM</fnm></au><au><snm>Hegyi</snm><fnm>H</fnm></au><au><snm>Balasubramanian</snm><fnm>S</fnm></au><au><snm>Luscombe</snm><fnm>NM</fnm></au><au><snm>Bertone</snm><fnm>P</fnm></au><au><snm>Echols</snm><fnm>N</fnm></au><au><snm>Johnson</snm><fnm>T</fnm></au><au><snm>Gerstein</snm><fnm>M</fnm></au></aug><source>Genome Res</source><pubdate>2002</pubdate><volume>12</volume><issue>2</issue><fpage>272</fpage><lpage>280</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1101/gr.207102</pubid><pubid idtype="pmcid">155275</pubid><pubid idtype="pmpid" link="fulltext">11827946</pubid></pubidlist></xrefbib></bibl><bibl id="B5"><title><p>Transcribed processed pseudogenes in the human genome: an intermediate form of expressed retrosequence lacking protein-coding ability</p></title><aug><au><snm>Harrison</snm><fnm>PM</fnm></au><au><snm>Zheng</snm><fnm>D</fnm></au><au><snm>Zhang</snm><fnm>Z</fnm></au><au><snm>Carriero</snm><fnm>N</fnm></au><au><snm>Gerstein</snm><fnm>M</fnm></au></aug><source>Nucleic Acids Res</source><pubdate>2005</pubdate><volume>33</volume><issue>8</issue><fpage>2374</fpage><lpage>2383</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/nar/gki531</pubid><pubid idtype="pmcid">1087782</pubid><pubid idtype="pmpid" link="fulltext">15860774</pubid></pubidlist></xrefbib></bibl><bibl id="B6"><title><p>Assessing the genomic evidence for conserved transcribed pseudogenes under selection</p></title><aug><au><snm>Khachane</snm><fnm>AN</fnm></au><au><snm>Harrison</snm><fnm>PM</fnm></au></aug><source>BMC Genomics</source><pubdate>2009</pubdate><volume>10</volume><issue>1</issue><fpage>435</fpage><xrefbib><pubidlist><pubid idtype="doi">10.1186/1471-2164-10-435</pubid><pubid idtype="pmcid">2753554</pubid><pubid idtype="pmpid" link="fulltext">19754956</pubid></pubidlist></xrefbib></bibl><bibl id="B7"><title><p>Genome-wide identification of genes likely to be involved in human genetic disease</p></title><aug><au><snm>Lopez-Bigas</snm><fnm>N</fnm></au><au><snm>Ouzounis</snm><fnm>CA</fnm></au></aug><source>Nucleic Acids Res</source><pubdate>2004</pubdate><volume>32</volume><issue>10</issue><fpage>3108</fpage><lpage>3114</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/nar/gkh605</pubid><pubid idtype="pmcid">434425</pubid><pubid idtype="pmpid" link="fulltext">15181176</pubid></pubidlist></xrefbib></bibl><bibl id="B8"><title><p>Pseudogene.org: a comprehensive database and comparison platform for pseudogene annotation</p></title><aug><au><snm>Karro</snm><fnm>JE</fnm></au><au><snm>Yan</snm><fnm>Y</fnm></au><au><snm>Zheng</snm><fnm>D</fnm></au><au><snm>Zhang</snm><fnm>Z</fnm></au><au><snm>Carriero</snm><fnm>N</fnm></au><au><snm>Cayting</snm><fnm>P</fnm></au><au><snm>Harrison</snm><fnm>P</fnm></au><au><snm>Gerstein</snm><fnm>M</fnm></au></aug><source>Nucleic Acids Res</source><pubdate>2007</pubdate><issue>35 Database</issue><fpage>D55</fpage><lpage>60</lpage><xrefbib><pubidlist><pubid idtype="doi">10.1093/nar/gkl851</pubid><pubid idtype="pmcid">1669708</pubid><pubid idtype="pmpid" link="fulltext">17099229</pubid></pubidlist></xrefbib></bibl></refgrp>
</bm></art>
