The purpose of the B exercises is to experiment with algorithms and programs for multiple alignment of sequences. Here we consider only amino acid sequences, but the ClustalX program can also be used for nucleotide sequences. You should do exercise B1, at least one of exercises B2-B4, exercise B5, and exercise B6; if you have time, it is also very instructive to do B7 and B8.
B represents the case where there is ambiguity between aspartate and asparagine, and
Z is the case where there is ambiguity between glutamate and glutamine.
X represents an unknown or nonstandard amino acid.
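The three ambiguity codes can be written as a small lookup table. The Python sketch below is illustrative only: the names AMBIGUITY and residue_matches are made up for this example, and treating X as compatible with any of the twenty standard residues is an assumption (X may also stand for a nonstandard amino acid).

```python
# The twenty standard one-letter amino acid codes.
STANDARD = set("ACDEFGHIKLMNPQRSTVWY")

# Ambiguity codes mapped to the residues they may stand for.
AMBIGUITY = {
    "B": {"D", "N"},   # aspartate or asparagine
    "Z": {"E", "Q"},   # glutamate or glutamine
    "X": STANDARD,     # unknown (assumption: any standard residue)
}

def residue_matches(code, residue):
    """Return True if the (possibly ambiguous) code is compatible with residue."""
    allowed = AMBIGUITY.get(code, {code})
    return residue in allowed

print(residue_matches("B", "N"))
print(residue_matches("Z", "D"))
```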
However, it may be more convenient to download and install it on your local computer, so that's what you will do in this exercise. ClustalX is available for a number of platforms from ftp://ftp-igbmc.u-strasbg.fr/pub/ClustalX/ at the Université Louis Pasteur, Strasbourg. Here are some instructions for Microsoft Windows:
cd \ClustalX
unz531xN (unpack unzip)
unzip clustalx.1.83.zip (unzip ClustalX)
Use Trees | Draw NJ Tree to save the Neighbour-Joining tree to a file globin.ph (after doing the alignment). You can also `bootstrap' the tree to see how well the various branching points are supported.
To view the tree you need to open the tree file from the program C:\ClustalX\njplot. From this program you can also convert the tree to Postscript for printout by choosing File | Save Plot and naming the file globintree.ps or similar.
Take a look at the Alignment | Alignment Parameters menu. You'll see that the Gonnet substitution matrices are used by default, and that you can adjust the gap penalties, among other settings.
Shown below are the amino acid sequence for an `unknown' bacteriocin (carnobacteriocin A9b) from lactic acid bacteria, and five sequences for some `known' (published) bacteriocins. Note the X's, which stand for `not determined'. The `unknown' bacteriocin kills Listeria monocytogenes, which is a human-pathogenic, food-borne bacterium. The illness listeriosis attacks people who have a weak immune system, as well as foetuses (by infecting the bearer of the foetus), and the mortality rate is approximately 25 per cent. In Denmark, approximately 40 people are affected by listeriosis per year.
> carnobacteriocin A9b
V N Y G N G V S X X K K X X
> V1a piscicocin
K Y Y G N G V S C N K N G C
> V1b piscicocin
A I S Y G N G V Y C N K E K C
> bacteriocin B2
V N Y G N G V S C S K T K C
> leucocin A
K Y Y G N G V H C T K S G C
> mesentericin Y105
K Y Y G N G V H C T K S G C
Download the file bacteriocin.fasta containing these sequences. Use ClustalX to align these sequences (trivial to do, even manually).
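Because these sequences are so short and similar, the shared core can also be found without an alignment program. The Python sketch below is not what ClustalX does; it simply searches for the longest stretch of residues that occurs contiguously in every sequence. The function name is made up for this example, and X is treated as a literal character rather than as a wildcard.

```python
def longest_common_substring(seqs):
    """Return the longest string that occurs contiguously in every sequence."""
    shortest = min(seqs, key=len)
    # Try candidate lengths from longest to shortest, so the first hit wins.
    for length in range(len(shortest), 0, -1):
        for start in range(len(shortest) - length + 1):
            candidate = shortest[start:start + length]
            if all(candidate in s for s in seqs):
                return candidate
    return ""

# The six sequences listed above, with the spaces removed.
bacteriocins = {
    "carnobacteriocin A9b": "VNYGNGVSXXKKXX",
    "V1a piscicocin": "KYYGNGVSCNKNGC",
    "V1b piscicocin": "AISYGNGVYCNKEKC",
    "bacteriocin B2": "VNYGNGVSCSKTKC",
    "leucocin A": "KYYGNGVHCTKSGC",
    "mesentericin Y105": "KYYGNGVHCTKSGC",
}

print(longest_common_substring(list(bacteriocins.values())))
```

An exhaustive search like this is only practical because the sequences are a handful of residues long, and it cannot handle gaps.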
The protein family has a name. What do you think it is? Find out by searching SwissProt using protein Blast on one of the known sequences. If you search the nr (non-redundant) database instead, you may need to increase the Expect value to 1000 or 10000 to make Blast less fastidious; otherwise you will get no hits. This is because the nr database is much larger than SwissProt, and the query sequences are very short.
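The effect of database size and query length can be made explicit with the standard Karlin-Altschul estimate of the Expect value. It is given here as background, not as a description of exactly what the Blast server computes (Blast uses corrected, effective lengths). Here m is the query length, n the total length of the database, S the raw alignment score, and K and lambda are constants that depend on the scoring system:

```latex
E = K \, m \, n \, e^{-\lambda S}
```

For a fixed score S, E grows in proportion to n, so the same hit looks less significant in the larger nr database. A query of about 14 residues can only reach a small S, so the exponential factor stays relatively large and E easily exceeds the default threshold.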
(Thanks to Lilian Nilsson and Lone Gram, Danish Institute for Fisheries Research).
The sequences of 12 beta-lactamases can be found in file bcII-all.fasta. Download them and align them in ClustalX. Choose Alignment | Output Format Options | Output Order | Input to prevent ClustalX from rearranging the sequences.
There are a few sites at which an amino acid is conserved in all twelve sequences. How many? (Look at the Column Score Profile, or look for a star above the alignment).
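The count can also be checked programmatically once the alignment has been saved. The Python sketch below assumes that the aligned sequences are already available as equal-length strings with '-' marking gaps; the function name and the toy alignment are made up for this example and are not taken from bcII-all.fasta.

```python
def conserved_columns(aligned):
    """Return the 0-based columns where every sequence has the same residue.

    `aligned` is a list of equal-length strings with '-' marking gaps.
    Columns containing a gap are never counted as conserved.
    """
    if len(set(len(s) for s in aligned)) != 1:
        raise ValueError("aligned sequences must all have the same length")
    columns = []
    for i, column in enumerate(zip(*aligned)):
        residues = set(column)
        if len(residues) == 1 and "-" not in residues:
            columns.append(i)
    return columns

# Hypothetical toy alignment of three short sequences.
toy = ["MK-HCA", "MKLHCA", "MR-HCG"]
print(conserved_columns(toy))  # [0, 3, 4]
```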
The twelve proteins fall in three families, B1, B2, and B3. The B1 family consists of the six first proteins (in the original file order), the B2 family consists of the next two proteins, and the B3 family consists of the last four proteins.
Check that this is also what ClustalX inferred, by saving and then viewing the Neighbour-Joining tree.
Print out the alignment produced by ClustalX and compare it to the alignment shown in the paper by Galleni et al. (available only on paper). Reordering the sequences to match the order in the Galleni paper will make this task much easier.
(Thanks to Moreno Galleni, University of Liège, Belgium, and Lasse Hemmingsen, KVL. See Galleni, Moreno; Lamotte-Brasseur, Josette; Rossolini, Gian Maria; Spencer, Jim; Dideberg, Otto; Frère, Jean-Marie: Standard Numbering Scheme for Class B beta-Lactamases (guest commentary). Antimicrobial Agents and Chemotherapy 2001, volume 45, issue 3, 660-663).
The file carboxypep.fasta contains 18 sequences for carboxypeptidases from humans, cows, rats, pigs, etc. Download the file and align the sequences using ClustalX.
How well conserved are the sequences? Are there any sequences that seem to be outliers (more distantly related)? Does the Neighbour-Joining tree support your notion of outlier? Try removing the outliers using the Edit menu in ClustalX, and then redo the alignment.
(Thanks to Morten Bjerrum, Department of Chemistry, KVL).
The file chloroperoxidase.fasta contains eight sequences for proteins that are chloroperoxidases, or related to chloroperoxidases. (Thanks to former bioinformatics student Line Albertsen and others).
Use ClustalX to find a multiple alignment for these sequences. Because some sequences are short and others very long, ClustalX fails to find the common motif in the sequences. For instance, look at the multiple alignment at positions 509ff. Here apparently there is a PAYPSGHAT motif in three of the sequences (Curvularia, Drechslera, Embellisia). Then look at position 604ff. Here the highly similar PSYPSGHAT motif appears, this time in the other sequences Ascophyllum, Fucus and Deinococcus, and to a lesser extent in Nostoc and Corallina.
The positions indicated above are for the downloadable version of ClustalX. The ClustalX server at EBI produces a slightly different alignment, perhaps because it runs ClustalX 1.82 instead of 1.83, but it contains the same mistake.
Now use T-COFFEE instead of ClustalX to make a multiple alignment of the same eight sequences. You can run T-COFFEE at http://igs-server.cnrs-mrs.fr/Tcoffee/. Inspect the HTML or PDF version of the alignment. You'll see that T-COFFEE finds the P[AS]YPSGHAT motif without problems. (A somewhat bizarre feature of T-COFFEE is that high-similarity regions in the multiple alignment are marked yellow/red in the HTML format, and blue/green in the PDF format).
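The notation P[AS]YPSGHAT happens to be a valid regular expression, so the motif can be located directly in the unaligned sequences. The Python sketch below uses the standard re module; the function name and the toy sequences are hypothetical and are not taken from chloroperoxidase.fasta.

```python
import re

# The motif from the exercise: P, then A or S, then YPSGHAT.
MOTIF = re.compile(r"P[AS]YPSGHAT")

def find_motif(sequences):
    """Map each sequence name to the 1-based start positions of the motif."""
    return {
        name: [m.start() + 1 for m in MOTIF.finditer(seq)]
        for name, seq in sequences.items()
    }

# Hypothetical toy sequences for illustration only.
toy = {
    "toyA": "MKTPAYPSGHATLL",
    "toyB": "GGPSYPSGHATQ",
    "toyC": "MKTAYPSGHAA",
}
print(find_motif(toy))  # {'toyA': [4], 'toyB': [3], 'toyC': []}
```

The positions reported are positions in each unaligned sequence, so they will differ from the alignment columns (509ff and 604ff) mentioned above.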
If you want, you can download T-COFFEE and install it on your local computer. See http://igs-server.cnrs-mrs.fr/~cnotred/Projects_home_page/t_coffee_home_page.html.
How many new significant database hits do you get in the second iteration? How many carboxypeptidases are there among them? How many new significant hits do you get in the third iteration?
Note that in window (B) you can tick a box to determine which hits should be used to build the next iteration's position-specific score matrix. Thus you can ensure biologically more meaningful results by leaving out irrelevant hits (if you know what you are doing).
Note also that PSI-Blast can give you the position-specific score matrix (after the second iteration and later); choose PSSM on the Format page. You can save the matrix on your computer and later feed it back into PSI-Blast (under Options, in the PSSM text area).
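To make the idea of a position-specific score matrix concrete, here is a deliberately simplified Python sketch. It is not PSI-Blast's method: it assumes a uniform background frequency of 1/20, a flat pseudocount, and no sequence weighting, and all names in it are made up for this example.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def simple_pssm(aligned, pseudocount=1.0):
    """Build a log-odds matrix: one dict of residue -> score per column.

    Assumes a uniform background frequency of 1/20 and ignores gaps.
    """
    background = 1.0 / len(ALPHABET)
    matrix = []
    for column in zip(*aligned):
        counts = Counter(r for r in column if r in ALPHABET)
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        scores = {}
        for residue in ALPHABET:
            freq = (counts[residue] + pseudocount) / total
            scores[residue] = math.log2(freq / background)
        matrix.append(scores)
    return matrix

def score(matrix, segment):
    """Score an ungapped segment of the same length as the matrix."""
    return sum(col[res] for col, res in zip(matrix, segment))

# Hypothetical toy alignment of three tripeptides.
pssm = simple_pssm(["ACD", "ACE", "ACD"])
print(round(score(pssm, "ACD"), 2), round(score(pssm, "WWW"), 2))
```

Residues that are frequent in a column get a positive score there and rare ones a negative score, which is why the matrix is more sensitive than a single substitution matrix applied uniformly along the query.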
For your convenience I've compiled an executable for MS Windows; it requires a DLL called cygwin1.dll. Download both of them to the directory C:\windows\.
Here are three sample sequences in FASTA format:
>Sequence 1
VLSPADKTNVKAAWGKVGAHAGEYGAEALE
RMFLSFPTTKTYFPHFDLSHGSAQVKGHGK
KVADALTNAVAHVDDMPNALSALSDLHAHK
LRVDPVNFKLLSHCLLVTLAAHLPAEFTPA
VHASLDKFLASVSTVLTSKYR
>Sequence 2
VHLTPEEKSAVTALWGKVNVDEVGGEALGR
LLVVYPWTQRFFESFGDLSTPDAVMGNPKV
KAHGKKVLGAFSDGLAHLDNLKGTFATLSE
LHCDKLHVDPENFRLLGNVLVCVLAHHFGK
EFTPPVQAAYQKVVAGVANALAHKYH
>Sequence 3
GLSDGEWQLVLNVWGKVEADIPGHGQEVLI
RLFKGHPETLEKFDKFKHLKSEDEMKASED
LKKHGATVLTALGGILKKKGHHEAEIKPLA
QSHATKHKIPVKYLEFISECIIQVLQSKHP
GDFGADAQGAMNKALELFRKDMASNYKELG
FQG
Download the file SampleData.fasta containing these sample sequences to directory C:\tmp, say.
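The FASTA layout used in this file (a header line starting with >, followed by one or more lines of sequence) is easy to read programmatically, for instance to check sequence lengths before running MSA. The Python sketch below is a minimal reader written for this handout, not part of MSA or ClustalX, and the function name is an assumption.

```python
def read_fasta(path):
    """Return a list of (header, sequence) pairs from a FASTA file."""
    records = []
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                # A new header closes the previous record, if any.
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:].strip(), []
            elif header is not None:
                # Sequence lines may be wrapped; join them without spaces.
                chunks.append(line.replace(" ", ""))
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

# Example use, assuming the file was saved to the directory suggested above:
# for name, seq in read_fasta(r"C:\tmp\SampleData.fasta"):
#     print(name, len(seq))
```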
Now you should be able to run MSA inside a DOS-box:
cd C:\tmp
msa SampleData.fasta
To capture the output, redirect it to a file (e.g. result.txt), using
msa SampleData.fasta > result.txt
Then look at the file C:\tmp\result.txt using an editor (e.g. Wordpad).
Using a text editor (Wordpad), select the six closely related B1 family sequences from the beta-lactamase file above, and save to a new file bcII-B1.fasta. Save in text format, not in Word format, otherwise your file will be incomprehensible to MSA. Run MSA on the file. (Note that the friendly Microsoft programmers decided that a text file saved by Wordpad should have the extension .txt, even if you have already given it the extension .fasta). This alignment should take between 2 and 10 seconds depending on your computer.
Save the alignment to a file. Align the same six sequences using ClustalX, and compare the alignment found by ClustalX (which is not necessarily optimal) with that found by MSA (which is optimal). Edit the alignments using Wordpad and print them out in a tiny fixed-width font (such as Courier) if that makes the comparison easier.
This exercise serves to demonstrate that optimal multiple alignment may work for a limited number of closely related sequences, but is not feasible in general. This is true not only for the MSA program, but for all attempts at optimal multiple alignment. Therefore researchers developed ClustalX and T-COFFEE and similar approximate and non-optimal but fast programs.
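A rough calculation shows why. Full dynamic programming over k sequences fills a k-dimensional lattice with one cell per combination of prefix lengths, and each interior cell looks at 2^k - 1 predecessor cells. The Python sketch below just counts these; the sequence length of 250 is a hypothetical round number, and the function names are made up for this example.

```python
from math import prod

def dp_lattice_size(lengths):
    """Number of cells in the full dynamic-programming lattice."""
    return prod(n + 1 for n in lengths)

def predecessors_per_cell(k):
    """Each interior cell has 2**k - 1 predecessor cells to examine."""
    return 2 ** k - 1

# Assumption: every sequence is 250 residues long.
for k in (2, 3, 6, 12):
    cells = dp_lattice_size([250] * k)
    print(k, f"{cells:.2e}", predecessors_per_cell(k))
```

Programs for optimal alignment avoid visiting the whole lattice by bounding the region that can contain the optimal path, which is presumably why six closely related sequences are still tractable; for many or divergent sequences the bounded region itself becomes too large.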
Use the mouse to slide (drag) sequences left and right, thus inserting (or closing) gaps at the residue you point at. You'll find it pretty tiresome to try to do better than ClustalX.
Jalview can also be downloaded from http://www.jalview.org/ if you want to install it locally.
Then go back to the Prosite entry page and click on ScanProsite. Take one of the sequences and paste it into the `Scan a protein for Prosite matches' field, check the `Exclude patterns with a high probability of occurrence' checkbox, and start the scan. Hopefully you will get exactly the two Zinc carboxypeptidase patterns. You will also get a Graphical summary showing the two matching sequence fragments. Move the red vertical rulers so that they exactly enclose a fragment, and then press Zoom. Then you'll get the exact matching subsequences.
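Prosite patterns use a compact notation: elements are separated by -, x means any residue, [..] lists allowed residues, {..} lists forbidden residues, (n) or (n,m) gives a repeat count, and < or > anchors the pattern to the N- or C-terminus. The Python sketch below translates the common forms of this notation into a regular expression. It is an illustration, not ScanProsite's implementation, and it does not handle rarer forms such as an anchor inside brackets. The example patterns are made up for this handout (the first is the chloroperoxidase motif from the earlier exercise rewritten in this notation); they are not real Prosite entries.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.rstrip(".")
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):
            prefix, element = "^", element[1:]
        if element.endswith(">"):
            suffix, element = "$", element[:-1]
        # Split off an optional repetition such as (3) or (2,4).
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"    # forbidden residues
        # A bracketed set [..] or a single letter is already valid regex.
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + suffix)
    return "".join(parts)

print(prosite_to_regex("P-[AS]-Y-P-S-G-H-A-T"))  # P[AS]YPSGHAT
print(prosite_to_regex("C-x(2,4)-{P}-H."))       # C.{2,4}[^P]H
```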
Instead of just Prosite you may search Interpro which integrates Prosite, Pfam, Prints, and several other protein motif databases. Use the server at http://www.ebi.ac.uk/interpro/: click Sequence Search, then enter the sequence and your email address. In addition to the two Prosite patterns you now get hits from Pfam and Prints.
KVL Bioinformatics. Peter Sestoft (email@example.com) 2001-06-22, 2004-09-17