Drosophila Annotation Goals: Final Presentation and Written Reports

Introduction

This document describes the primary annotation goals to be included in the final oral presentation and written report for students enrolled in the Bio4342 course at Washington University in St. Louis. The final will be made up of two parts, a 10-minute oral presentation and a written report.

Oral Presentation

Give an individual final presentation in which you briefly discuss the most interesting features of your Drosophila project(s), potentially including some (probably not all) of the putative genes and non-coding RNAs, the assignment of potential Transcription Start Sites (TSS), distribution of repetitious elements, repeat element analysis, local and synteny block analysis with Drosophila melanogaster, and any other curious observations found in your region. For the talk focus on your own results but do include a summary of the results of your group and the overall annotation of your entire region. Your paper will discuss your best determination of the exact nature of each feature (possible gene) and the reasoning that led you to this conclusion. However, given the 10-minute limit on your talk you will probably want to focus on those genes and/or TSS assignments that were particularly interesting or difficult to annotate. Remember there is no right answer; this is the first time this sequence has ever been looked at by anyone. The goal is to come to the most reasonable interpretation of the data based on observations, and be able to articulate the reasoning behind your hypothesis. Show us your evidence! Critique of the protocols we used, are not required but certainly of interest. Remember that you can use this time to seek feedback on your analysis which may be useful in the final written report.

You may use PowerPoint (recommended) or other presentation tools but you will need a final file that you can submit to a Canvas assignment. Figures of some kind are required for your presentation, as they always play an important role in showing your evidence and getting your point across to your audience.

Written Report

A major part of the final will be your complete written report. Several major topics (grouped as you see fit) are recommended: Abstract, Introduction, Genes, TSS Estimates, Repeats, Synteny, and Discussion/Conclusions. TSS estimates can be reported at the end of the report on the intron/exon structure of the gene, rather in a separate section. Each section should end with a brief statement and/or table and/or figure summarizing your findings and conclusions. The annotation of the first gene you worked on should be reported in detail; subsequent cases can be reported succinctly, with reference back to the methods used in the first case, but should include the results of your search for conservation of coding sequences (table or possibly other), a table with your final exon coordinates and characteristics (reading frame, splice site phases, etc.), and Gene Model Checker output (dot plot AND aa alignment). Note that the final report should include revised versions of the first gene report and the TSS/Repeats report you have submitted earlier in the semester. We also ask that you submit an Appendix with appropriate data files (see below). Your report should be submitted to the Canvas assignment as a file upload of a Word document. You may also include a PDF version of the report as part of the submission (see the “Addendum on Writing” section for details).

Remember that your job is to communicate your findings clearly, and to convince the reader of the validity of your conclusions! The report should integrate text with figures, making a single, coherent document. The reader must be able to see the data that you present, and to understand the source. Zoom in as needed, and add arrows, boxes, etc. to the figure to highlight the features of interest for the reader. (Remember that you can use the “configure” functionality in the UCSC Genome Browser to increase the font size and reduce the image width in order to make the Browser images easier to read.)

When you discuss BLAST results, remember to specify the BLAST program, the query sequence, and the subject sequence used in the search; this information should also be in the Figure legend where you show results. Don’t leave out supportive data — cite all sources, particularly in describing the general procedure (your first gene). Be as specific and precise as possible when you describe the evidence used to support your final gene model. This document will be the basis for future research on this topic by following Bio 4342 students, will be the source of information for our joint publication, and may be used (with your permission) to demonstrate what undergraduates can accomplish in a challenging research endeavor.

Abstract

The abstract should provide a brief statement of the goals and a report of your findings. This should be similar to and about the same length as a typical abstract found at the beginning of a scientific paper (~300 words). Your abstract should be a summary of your results, NOT the process by which you achieved those results. For the D. willistoni annotation project, be sure to include the size of the project, number of Genscan predictions, protein-coding genes documented, and other verified features (e.g., novel repeats, non-coding RNAs) or unusual findings (e.g., changes in exon structure, changes in the number of isoforms), etc. Be specific in reporting what you have found and your resulting conclusions.

Introduction

One to three paragraphs: Why are we studying this problem? Why is comparative genomics a powerful approach for analyzing the features of genes and chromosomes? What kinds of results are we generating right now, and what do we hope to derive in the future aggregate analysis? Try to keep this section concise and informative but give some context — what are the goals of the overall study, and what are your goals in particular. Include a figure covering the entire sequence of your project(s) that indicates the size and position of each putative feature you will investigate. A screenshot of the GEP UCSC Genome Browser for your project region may provide a good starting point for this figure. Box and number the features that you will investigate/report on for ease of communication.

Genes

This section will be significantly larger (reflecting the number of features in your project), and it should be subdivided with a sub-section for each putative feature. You should present here a more detailed analysis of any genes, pseudogenes or partial genes you find in your project. Use the Genscan predictions (or similar) as an organizer for your presentation. Remember that while in D. melanogaster pseudogenes are rare, this may not be the case in other Drosophila species so keep an eye out for evidence suggesting the presence of pseudogenes in your project. Because the projects are partitioned into sub-regions, you may find only part of a gene at one end or the other of your project. Also be aware that no published genome is perfect — sequencing errors may occur in the genome assembly. Note that Genscan and other gene finders may miss some candidate genes, so a complete analysis will not be limited to just gene predictions; look at the other evidence tracks (homology search with D. melanogaster, RNA-seq, etc.) and examine the orthologous region of the D. melanogaster genome as annotated by the curators at FlyBase. In particular, Genscan will not predict non-coding RNAs (ncRNAs); be sure to investigate those posted for D. melanogaster, looking for conserved sequences in your species (e.g., via a blastn search of the non-coding RNA against your project region). When investigating a feature, you should include a discussion of all the information available to you: gene predictions, results from BLAST searches, RNA-seq data (e.g., splice junction predictions and assembled transcripts), and conservation tracks for the other Drosophila species (see the walk-throughs and practice examples). Be sure to discuss all the important sources of evidence. If the results are negative, ambiguous, or absent, simply mention that and move on.

For each gene in your D. willistoni projects, start by confirming the D. melanogaster orthologue; include the FlyBase ID and the name of any D. melanogaster gene that you consider likely to be orthologous in your report. After making approximate assignments based on evidence of conservation as detected using the “Align two or more sequences” functionality in BLAST, determine the exact location of each coding exon using all available evidence. Construct a model for each putative isoform. Use the Gene Model Checker to confirm your basic decisions; be sure to check both the dot plot and the peptide sequence alignment comparison to D. melanogaster. Every difference between your model and the D. melanogaster orthologue represents an assumption of evolutionary change that underlies your gene model. Based on the GEP annotation protocol, the best gene model should minimize these assumptions; as such, you should be able to account for every difference and convince yourself that there is no other acceptable gene model with fewer differences. [If possible, these differences should be supported by biological evidence (e.g., RNA-Seq data) or sequence conservation with other Drosophila species.] The dot plot and peptide sequence alignment, with a discussion of important differences (for each isoform), should be part of your report. You do not need to show multiple dot plots and alignments for isoforms that are highly similar. You may wish to include these in a “Supplemental Materials” section at the end of your final report.

Note that information from high-throughput sequencing of RNA, the RNA-Seq track and its derivatives, will be very useful. RNA-Seq sequences were generated by producing many short reads (100-125 bases) from expressed RNAs using the Illumina sequencing technology. These reads have been mapped back to the genome sequences and the number of reads that overlap each base plotted in the “RNA-Seq Coverage” tracks. In addition to using the RNA-Seq results to help you with your annotations, you should also compare the RNA-Seq data with your final gene model. Relevant questions include: How well does the RNA-Seq coverage track correspond to the locations of the individual exons in your annotation? Can you reliably use RNA-Seq coverage to identify untranslated regions (5' and 3' UTRs) for each of your genes? For every intron in your gene model, how many were supported and how many were unsupported by results in the splice junction and transcript tracks? Once you have mapped all the genes in your project region, be sure to investigate all regions that have high RNA-Seq coverage, especially in regions that do not correspond to the genes you have annotated. (However, some of the regions with high RNA-Seq coverage could be spurious if they overlap with transposon remnants or other repeats, so they should be investigated but treated skeptically.) Note that genes that are expressed only at low levels or in a restricted set of tissues or developmental stages may not display significant RNA-Seq data for our species. You can assess the expression status of the orthologous gene in detail from the data available for D. melanogaster on FlyBase.

First Gene Analysis

As described above, you should provide a detailed description of your general approach with the description of the first gene discussed in the paper. Be sure to include screenshots documenting the following:

Verification of the D. melanogaster orthologue by performing a blastp search of the Genscan prediction against the “Annotated proteins” database at FlyBase;
Exon-by-exon blastx (or tblastn) search results;
Close-up of the splice donor and acceptor sites (~20 nts), demarcating the possible splice sites you have considered and your final annotation, indicating the reading frame and phase (remember to verify that the orientation of the Base Position track is correct);
- If your first gene contains more than 6 introns, you can show in detail the supporting evidence for the annotation of the first 3 introns. For the remaining introns, you can focus on any introns that are more challenging to annotate, but be sure to include all CDS in your final model table (see below) and in the checking of your model in the gene model checker.
Translation start and stop sites.

Include the final model in a table, showing the exact exon coordinates, size, percent identity with D. melanogaster, reading frame, phase of the donor and acceptor sites. While you should include a dot plot and protein alignment from Gene Model Checker, a screenshot of the Gene Model Checker checklist (i.e., green check marks) should not be shown if every test passes. Figures like this do not hold informational content that cannot simply be described in the text (i.e., “all tests passed in Gene Model Checker”) and is evident from the dot plot and protein alignments. However, you should include a discussion if the Gene Model Checker checklist contains any warnings or errors (e.g., non-canonical splice donor), along with the evidence you used to support these exceptions in your final model. Be sure to include a detailed justification in your report if the proposed gene model for your species differs substantially from the D. melanogaster orthologue. There will be differences — evolution happens! But you should be able to explain any substantial gaps in the dot plot, particularly large vertical or horizontal gaps (which correspond to large insertions or deletions compared to the D. melanogaster orthologue, respectively) and discuss why your model is better than any alternatives.

Once you have established the general approach, you can summarize your results for subsequent genes. Focus on anything unusual or different in the results of subsequent annotations. Each gene summary should include a summary table as described above and a final dot plot and protein alignment. Do not include multiple dot plots and alignments if the coding exons are identical, just describe which isoforms are identical and mention that the dot plot and alignment figures represent the results for all of them.

Transcription Start Sites (TSS)

As we have discussed, we would like you to use the TSS Analysis Protocol to define, given the data, the smallest possible TSS search region(s) that is likely to contain the TSS. Do this for all the genes in your project region. You should give details for those promoters you have analyzed, and provide a short summary of the TSS’s for the other genes in your region (simply refer to the other reports from the group for more details). If there are multiple TSS’s annotated in D. melanogaster, you should analyze your results with the assumption that the same structure exist in your orthologous genes. If your final results indicate a change in the number of promoters, your analysis should include your best evidence to support this conclusion.

For any promoter, begin your TSS analysis by characterizing the number and shape of the promoter(s) for the orthologous gene in D. melanogaster based on the TSRchitect results. Then, use all available evidence to define the broad genomic region to search for the TSS’s in your project based on the gene structure of the ortholog in D. melanogaster, the organization of neighboring genes, the blastn search result, and any available RNA-Seq data. Remember the goal in the written report is not to give a historical record of the analysis, but to write the results in a way that best describe the evidence that pertains to your TSS annotations and logically flows from observations to your conclusion. Negative results should, at a minimum, be discussed briefly in your report to assure the reader that all sources of evidence were examined in your analysis.

Repeats

Following the Basic Instructions for Repeat Analysis, report on the repeat landscape in your individual analysis region. Outline the steps you took and the results you found in a way that convinces the reader to trust your results. Discuss the overall data for both the D. melanogaster F Element and the F Element region in D. willistoni. Look over the data for patterns or differences between the D. melanogaster F Element and the portion of D. willistoni chromosome 3 that contains most of the F Element genes.

Synteny

Start with your local synteny analysis of the genes in your project region. Generate a simplified map of your D. willistoni project region, including genes and possibly any other features you have found. Analyze the genes from your project, and assess synteny with D. melanogaster. Using one of the D. willistoni genes in your project region, identify the location of the corresponding ortholog on the D. melanogaster F element. Generate a map of the D. melanogaster genome centered around this gene which covers approximately the same number of protein-coding genes as your D. willistoni project region. Note which Muller element each gene is on in D. melanogaster.

Using these two maps, compare the relative gene order and orientation in D. willistoni and D. melanogaster. Remember that the orientation of your D. willistoni project is arbitrary (based on computer generated files), so you should focus on the relative orientations of the genes when analyzing synteny. (The project region is considered to be syntenic with D. melanogaster if the reverse complement of the D. willistoni project has the same gene order and orientation as D. melanogaster.)

If synteny has been preserved, the genes in your project will all be from the same region of the D. melanogaster genome, and your comparison will show a one-to-one map of the orthologs from one genome to the other (two horizontal lines). If the regions are completely syntenic, the genes will be in the same relative order and orientation in the two species. If synteny has not been preserved, please include a comparison based on each gene in your project to the same gene and flanking regions in D. melanogaster (several horizontal lines, stacked up). Look for evidence of events (such as inversions, transpositions, etc.) that could have easily occurred during evolution that could explain the changes in synteny. If there is no set of simple events, then consider the changes as “complex” and report as such.

The second part will delve into the synteny “Block” analysis. Write up the analysis of your specific region in detail — summarizing the results as you proceed through the protocol to the end results for your project region. Then, examine all the data in the shared Excel Online workbook to develop observations that you think are relevant to the initial question: "how likely is it that the F Element syntenic block would stay intact for over 15 million years?". Big questions that are related and can be addressed in the discussion section of your final report are: how relevant are the observations of these syntenic blocks? What are the assumptions behind this experimental design? How could you change the protocol to improve quality and reliability of the results?

Further Investigations

As time permits, we encourage you to explore one of the genes or features in your project in more detail. What is the presumed function(s) of one of the genes? Can you generate evidence in support of these functions? Are key regions of the predicted protein conserved? Looking at the “Gene Ontology” and “Protein Domains” sections of the FlyBase report, and the Clustal Omega analysis results may be informative. Is this gene part of a known biological pathway in D. melanogaster (see the “Pathways” section of the FlyBase Gene Report)? Anything is possible here but not required.

Discussion

One to three paragraphs. Start with a “final map” in this section. Use custom tracks to show the entire project region with all the final annotations of all CDS-based gene models for all members of the group. Point out the genes and TSS’s you were directly involved in annotating. Summarize the accomplishments of the group. In subsequent paragraphs, put your work in context. How do your findings support or differ from prior analyses? Refer in particular to the results of Leung et al., 2017 and questions raised in the synteny block analysis.

Given that every gene is different, your ability to use the above format may vary. If your particular results are not compatible with the above recommended format, consult with the instructors on how best to adjust your writing to fit your results. If you have any questions regarding the best way to present your observations and conclusions for your particular project, be sure to ask. Good luck and good hunting!

Addendum on Writing

As always, high standards are expected. Clarity of exposition is most important. The reader should be able to follow your reasoning without difficulty. Include screenshots with key evidence supporting your conclusions. Use highlighting, circles and arrows in your figures to focus the reader’s attention on what’s relevant. Remember that logic, not history, should dictate your presentation! Avoid “stream of consciousness” writing. Scientific writing has little in common with Faulkner. Show us the evidence you need to convince us your conclusions are correct. You do not need to show us experiments that failed, or describe ideas that did not yield interesting results, unless that excursion resolves an obvious question — but do present interesting challenges that you were able to resolve. Use the past tense to describe your actions, but the present tense to describe your project, and the D. willistoni genome — they are still in existence. Use words correctly and precisely. BLAST is not a verb! Avoid the temptation to make up a new verb to describe a whole process. Avoid slang. Remember that some of your readers do not speak English as their first language.

We use figures and tables — not graphs, charts, etc. Tables have rows/columns of numbers, and everything else is classified as a figure. Make sure that the figures are legible, specifically that one can read the labels for each axis and other printed information that is important to interpreting the figure without magnification. Include a figure title and an informative figure legend. A figure showing results of a BLAST search should always identify the query and the subject of the search, as well as the BLAST program (e.g., blastp, tblastn) used. Figure legends should be very specific; the readers should NOT have to guess what they are looking at!

Please upload the MS Word version of your report to Canvas. You might want to also upload a PDF version of the report, as multi-component figures (such as those with added arrows or boxes) can shift or come apart depending on the platform and version of Microsoft Word. Alternatively, you can create the multi-component figures in a different program (e.g., PowerPoint, Adobe Illustrator), and then insert the composite image into the Word document.

Please use the following format for references:

Smith, JD, EF Jones, and JJ Green (2010) All the world’s a phage. Science 210: 33 – 44.

Write out the names of all authors up to 12 before switching to et al. References can be cited in the text by number, or by Smith et al. (2010), or by (Smith et al. 2010). Find the papers you need on PubMed or cited in your readings; cite the journal as above; do NOT cite PubMed or the URL.

Specific Style Guide for this Paper

Fly species and gene names are always italicized, protein names are not.
Fly gene names are case sensitive — make sure that the gene names match the official FlyBase gene symbol! Fly protein names are not italicized and should be capitalized in all cases irrespective of the gene name.
Spell out the abbreviation the first time (e.g., “Drosophila melanogaster”), and then you can use abbreviations thereafter (e.g., D. melanogaster). The abstract and text are separate for this purpose.
Specify the species for the sequence you are using.
Be very careful with the use of exon names, particularly in cases of alternative splicing. I prefer “2a, 2b…” etc. rather than “3_10903_1,” although admittedly the latter is more precise, and may be necessary if the gene has a large number of alternative exons and isoforms.
Please use “splice donor” and “splice acceptor” rather than designating 5' or 3', since the latter can be confused between introns and exons and can be confounded by the strand.
Each page should have your last name and the page number in either the header or the footer.

In Scientific Writing in General

Do not be general when you can be specific (say “4 out of 5 cases” not “most cases”).
Do not use “value-loaded” language; words like “good” or “bad” have little meaning in science — support your argument with facts and observations, with estimates of error as needed.
Keep the language impersonal.
Avoid slang, jargon, and other language shortcuts. Remember that when you publish in the scientific literature, many of your readers will be people who do not speak English as their first language.

Appendix

In addition to your written report, you should also complete the GEP Annotation Report that provides the documentations for your coding regions and TSS annotations, as well as the Repeat Analysis Worksheet. The GEP Annotation Report and the Repeat Analysis Worksheet are available on Canvas.

You should also submit the combined GFF, transcript, and peptide sequence files for all the gene models in your project as part of the appendix. These files will be needed in the subsequent analysis of the compiled results. There will be a separate assignment on Canvas where you can submit these appendix files.

For any predicted gene isoform, you should have three files:

A fasta formatted file of the protein sequence.
A fasta formatted file with the nucleic acid sequence which codes for the protein (make sure it translates!)
A tab-delimited file in GFF format for every isoform of every gene found in your project.

All three files are generated by the Gene Model Checker (under the “Downloads” tab) for each isoform. Remember that you should generate GFF, transcript, and peptide sequence files for all isoforms, irrespective of whether they have the same coding regions. You can combine the individual files for each annotated gene into a single file using the Annotation Files Merger tool. If you believe you have found an error in the consensus sequence, please discuss with Wilson how to use the Sequence Updater to generate a VCF file and include this file in the appendix.

These files should be submitted to the appendix files assignment on Canvas.