Exercise in gene finding

The purpose of this exercise is to try different gene finding methods on a provided piece of DNA. As a bonus, you may also end up with the correct gene structure.

Part 1: Get the DNA

The DNA originates from Caenorhabditis elegans. This is an invertebrate, more precisely a nematode (a roundworm, not an earthworm). It is a favoured experimental organism because the adult has only around 1000 somatic cells, including about 300 neurons, and because the animal is transparent, so individual cells can be seen even in the adult. All of the cells and all of the neurons have been mapped, as well as the complete cellular development from zygote to adult nematode. More interesting to us: the entire genome has been sequenced. If you want to read more about C. elegans (for your own pleasure) you can visit the C. elegans WWW server.

Get the DNA from here.

Part 2: tRNA finding

Your piece of DNA may contain a tRNA gene. To check for tRNAs use the programme tRNAscan-SE. This programme can be found at the Pasteur Institute (go to RNA analysis). You can also find it at the home page of Sean Eddy's lab. Try to run the programme. What happened?

If you found a tRNA: Congratulations, you did not mindlessly run the programme with default settings but studied the parameters and changed the ones that needed changing.

If you did not find a tRNA: Study the parameters, change the ones that need changing and try again. Continue until you succeed in finding a tRNA.

Lesson: Always try to understand what the parameters do (and change them if necessary), and read the documentation to learn about them, especially for parameters you do not understand.

By the way, what amino acid does this tRNA carry?
How can tRNAscan know that?

Part 3: Gene finding using ab initio methods

Now try to find genes in the piece of DNA using different methods. REMEMBER to look at the settings and change any parameter when necessary, especially if your first try did not give any result. Try HMMgene (Hidden Markov Model) and Netgene2 (Neural Network) at the Center for Biological Sequence Analysis, Genebuilder (statistical approach) at the Institute of Advanced Biomedical Technologies in Italy, and lastly try GenScan, which is widely used in genome annotation, e.g. of human and Arabidopsis. As you cannot find C. elegans or invertebrate among the organisms in GenScan, choose vertebrate. You can also run GenScan at the Pasteur web site.

DATA TREATMENT. To facilitate a comparison between the different results, and the elucidation of the correct gene structure, store all the nucleotide positions for exon start and end sites. You can make a table for this purpose, and it could look like this.
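If you prefer to keep the table in a script rather than on paper, the following is a minimal sketch of one way to do it. The function name `format_exon_table` and the coordinates are hypothetical and only for illustration. Exons are lined up by their order in each prediction, which is an assumption that breaks down when one method misses or splits an exon, so check the rows by eye.

```python
def format_exon_table(predictions):
    """Return a plain-text table: one row per exon, one column per method.

    predictions maps a method name to a list of (start, end) positions,
    in the order the exons were reported by that method.
    """
    methods = list(predictions)
    n_rows = max((len(exons) for exons in predictions.values()), default=0)
    rows = [["Exon"] + methods]
    for i in range(n_rows):
        row = [str(i + 1)]
        for method in methods:
            exons = predictions[method]
            # "-" marks a method that reported fewer exons than the others
            row.append(f"{exons[i][0]}-{exons[i][1]}" if i < len(exons) else "-")
        rows.append(row)
    # Pad every column to the width of its longest cell
    widths = [max(len(r[c]) for r in rows) for c in range(len(rows[0]))]
    return "\n".join(
        "  ".join(cell.ljust(w) for cell, w in zip(r, widths)) for r in rows
    )


# Hypothetical coordinates, NOT real results for the exercise sequence
example = {
    "HMMgene": [(120, 310), (455, 702)],
    "GenScan": [(120, 310)],
}
print(format_exon_table(example))
```

Add one entry per programme as you go, and a further one for the EST results in Part 4.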

Compare the results and consider why they are different.

Remember to save your results (you will soon need them).

If for some reason the prediction programmes do not run properly, the results can be found here: HMMgene, Netgene2 and GeneBuilder. You have to make GenScan work.

Part 4: Gene finding using EST searches

Perform a BlastN at NCBI (if it takes too long you can also do it at the Swiss EMBnet node) against C. elegans (or invertebrate) ESTs.

Use only the first 11 kb of your DNA as the query for the Blast, to avoid possible hits to ESTs from the tRNA gene.
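Cutting out the first 11 kb can be done by hand in a text editor, but a few lines of code avoid miscounting. This is a minimal sketch that assumes the file holds a single FASTA record and that 11 kb means 11000 bases. The function names are hypothetical.

```python
def read_fasta(text):
    """Parse FASTA text holding a single record; return (header, sequence)."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("input does not look like FASTA")
    return lines[0][1:], "".join(lines[1:])


def first_n_bases(text, n=11000, width=60):
    """Return a FASTA record with only the first n bases of the input."""
    header, seq = read_fasta(text)
    sub = seq[:n]
    # Wrap the sequence at a fixed line width, as most tools expect
    body = "\n".join(sub[i:i + width] for i in range(0, len(sub), width))
    return f">{header} 1-{len(sub)}\n{body}\n"
```

Paste the returned text into the Blast query box. The header records the range that was kept, so positions in the Blast hits still refer to the original numbering.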

From the results of your search find the nucleotide positions of the starts and ends of the exons and write these into your table. Then select ESTs covering as much as possible of your genomic DNA and try to reconstruct the entire gene (if this is possible). Retrieve your ESTs from the Blast results. You do this simply by selecting the ESTs with the check boxes.

You can use the CAP programme to do the alignment and assembly of the ESTs. If the alignment does not work, try to remove the last part from your ESTs, and then try again.

Remember to save your ESTs in fasta format (the starting DNA was in the correct format).

Beware of possible frame shifts!

Translate the assembled CDS, for example at EBI.

Where the sequences overlap, are they always identical?
Do the resulting contigs show perfect ORFs without stop codons? Why not?
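To check a contig for stop codons yourself, you can translate it in the three forward reading frames and count the stops. The sketch below uses the standard genetic code and is only an illustration, not a replacement for the translation tool. The function names are hypothetical, and codons containing ambiguous bases such as N are reported as X.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(seq, frame=0):
    """Translate a DNA sequence from the given frame (0, 1 or 2); '*' is a stop."""
    seq = seq.upper()
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def internal_stops(seq):
    """Map each forward frame to the number of stops before its last codon."""
    return {frame: translate(seq, frame)[:-1].count("*") for frame in range(3)}
```

A frame shift in an EST typically shows up as the open frame switching from one frame to another part-way through the contig.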

Part 5: Finding the correct CDS

Go to your table. Look at the different exon start sites and exon end sites.

Are the five predictions identical?
Which of the five do you trust the most?
Why?
Did any of the gene finding methods arrive at the correct sequence?
Where do the gene finding methods fail?

From the results choose the exon starts and exon ends you trust the most and write them in the last column of the table.

Give reasons for the choices you made.
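One way to see where the predictions agree is to count how many methods support each exon start and each exon end. The following is a minimal sketch with a hypothetical function name and hypothetical coordinates. Note that agreement between programmes does not by itself make a boundary correct, so the count is an aid to your reasoning, not a substitute for it.

```python
from collections import Counter


def boundary_support(predictions):
    """Count, per position, how many methods predict an exon start or end there.

    predictions maps a method name to a list of (start, end) positions.
    Returns two Counters: (starts, ends).
    """
    starts, ends = Counter(), Counter()
    for exons in predictions.values():
        # Sets ensure that one method cannot vote twice for the same position
        starts.update({start for start, _ in exons})
        ends.update({end for _, end in exons})
    return starts, ends


# Hypothetical coordinates, NOT real results for the exercise sequence
example = {
    "HMMgene": [(120, 310), (455, 702)],
    "GenScan": [(120, 310), (460, 702)],
    "ESTs": [(120, 310), (455, 702)],
}
starts, ends = boundary_support(example)
for position, votes in sorted(starts.items()):
    print("start", position, "supported by", votes, "method(s)")
```

Positions supported by only one method are the ones to examine most closely against the EST alignments.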

Part 6: Analysing the CDS

Perform a BlastP against the non-redundant protein databases. You can use the GeneBuilder result or the GenScan result directly for this.

What kind of protein is this?

Many thanks to the Swiss Institute of Bioinformatics, where most of this exercise has been developed.


KVL Bioinformatics 2004. Kristian Axelsen (axe@biobase.dk) 2004-11-01