Instructions:
1. We will use the VAST+ program to find homologs of the SARS-CoV-2 Spike Protein
(PDB ID 6VXX). Go to https://www.ncbi.nlm.nih.gov/Structure/vastplus/vastplus.cgi and
enter 6VXX for the PDB ID. The NCBI structural citation for this PDB file will come up.
For one of these hits, click the little "+" box near the PDB ID. Click Visualize 3D
Structure Superposition and then iCn3D.
2.
3. Take a look at some of the tools for analysis and visualization. You can learn quite a bit
about the relationship between the aligned structures using these tools.
4. Go back to your search and use the filters to find a match around 80% identity, and one
with around 30% identity.
5. Save a picture of each of your superimposed alignments. To do this, go to Style and set
the background to white. Then click the little camera icon (or go to File, then Save File, then
iCn3D PNG image). Paste the image inline into this homework.
Questions:
1. What were the name, sequence identity, and RMSD of each of the hits you chose?
2. How did they differ both in terms of sequence and in terms of structure?
3. Is there anything special about the regions of the protein that have weaker structural
alignment?
4. How might a tool like this be used to learn something about the coronavirus?
5. Why is structural alignment different from sequence alignment and why might it be
useful?
6. Starting with this page:
https://www.ncbi.nlm.nih.gov/Structure/vastplus/docs/vastplus_help.html, do some
research and explain how VAST+ works (in your own words; do not copy and paste
anything!).
Part 1: Smith-Waterman Algorithm
Instructions:
1. Copy the "STS protein query sequence" from the week 3 links page. (This is the one-letter code for the protein encoded by the resveratrol synthase gene; this enzyme catalyzes the final step in the resveratrol synthesis pathway.) Be sure to include the ">STS query" on the first line, or the program won't accept it.
2. Go to https://www.ebi.ac.uk/Tools/sss/ and choose SSEARCH, then "protein". This will do a Smith-Waterman local alignment on your protein sequence.
3. Choose the UniProtKB/TrEMBL database (under Step 1 on the page).
4. Paste the STS protein query sequence into the paste window.
5. Choose SSEARCH under step 3, then click "More options". Here you will find a number of parameters, including the substitution matrix. Search UniProt using three different substitution matrices:
• BLOSUM 50
• PAM 120
• PAM 250
Be patient, the calculation will take a while.
Questions:
1. What is the name and score of the best hit for each matrix?
2. What is the e-value of the best hit for each matrix?
3. Why do you think the results may have changed?
4. What do e-values mean and how do we interpret them?
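To get a feel for what a Smith-Waterman local alignment computes, here is a minimal sketch of the scoring recurrence. It is an illustration only and makes simplifying assumptions that SSEARCH does not: a flat match/mismatch score stands in for a BLOSUM or PAM matrix, gaps use a single linear penalty, and the function name and default values are hypothetical.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Assumptions for illustration: simple match/mismatch scoring and a
    linear gap penalty (real searches use a substitution matrix and
    affine gap penalties).
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                    # start a new local alignment here
                H[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                H[i - 1][j] + gap,    # gap in b
                H[i][j - 1] + gap,    # gap in a
            )
            best = max(best, H[i][j])
    return best


if __name__ == "__main__":
    # The shared "ACG" segment is found even though the flanks differ.
    print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

The floor at zero is what makes the alignment local: a poorly matching region cannot drag down the score of a good region elsewhere in the sequences.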
Part I: Learning about molecular phylogenies
1. What is the basic assumption underlying a molecular phylogeny?
2. Why must we distinguish between gene trees and species trees?
3. Why don't genes always evolve by a series of bifurcations (i.e., by a series of single base changes)?
4. What are the four steps to constructing a molecular phylogeny?
5. What is an orthologous sequence?
6. What is a paralogous sequence?
7. What is a xenologous sequence?
8. Which type of sequences should you use for a species phylogeny?
9. What is the difference between multiple sequence alignments to discover motifs, etc., vs for constructing phylogenies?
10. Why is Clustal W not a very good choice for constructing species phylogenies?
11. Please use the supplemental material on the links page to answer the following questions.
• What is a phylogenetic tree composed of?
• What is the difference between rooted and unrooted phylogenetic trees?
• What are the two major groups of analyses used to examine phylogenetic relationships?
• What is a paraphyletic grouping?
• What happens if a multiple alignment is poor?
• What is the best way to deal with parts of an alignment that are uncertain due to gaps?
• What sorts of phylogenies are best constructed using DNA sequence alignments?
• What sorts of phylogenies are best constructed using protein alignments?
• What sorts of phylogenies are best constructed using ribosomal RNA sequence alignments?
• What is a homoplasy?
• Why can't we simply construct all possible trees, score each one, then pick the one with the best score?
Part III: Maximum parsimony methods
1. What is the key assumption of maximum parsimony methods?
2. How does this differ from distance matrix methods?
3. What are the advantages of maximum parsimony methods?
4. What are the disadvantages of maximum parsimony methods?
Part V: Tree evaluation
1. What are the three basic ways to resample the data for tree-building?
2. What is jackknife resampling?
3. What is bootstrap resampling?
4. How does it differ from jackknife resampling?
5. Recreate one of your ML trees except use Bootstrap Resampling as a method of tree evaluation.
6. How did your tree change?
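As a rough picture of what a tree-building program does internally when bootstrap evaluation is selected, the sketch below builds one bootstrap pseudo-replicate of a toy alignment by drawing columns with replacement. The function name, the dictionary layout, and the toy sequences are all hypothetical; this is not the code any particular program uses.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one bootstrap pseudo-replicate of an alignment.

    alignment: dict mapping sequence name -> aligned sequence
               (all sequences must have the same length).
    rng:       a random.Random instance, passed in so results are repeatable.
    """
    length = len(next(iter(alignment.values())))
    # Draw `length` column indices with replacement: some columns appear
    # several times in the replicate, others not at all.
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in cols)
            for name, seq in alignment.items()}


if __name__ == "__main__":
    toy = {"seqA": "ACGTAC", "seqB": "ACGAAC", "seqC": "TCGAAT"}  # made-up data
    replicate = bootstrap_alignment(toy, random.Random(1))
    for name, seq in replicate.items():
        print(name, seq)
```

A tree is then built from each replicate, and the support for a branch is the fraction of replicate trees in which that branch appears.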
Part II: Distance matrix methods
1. Answer the following questions:
a. What is the general approach used by distance matrix methods to construct a phylogeny?
b. What are the main differences between UPGMA and neighbor-joining methods?
2. Take your protein sequences from the links page and import them into MEGA.
3. Align them via Clustal W and save the alignment.
4. Use your alignment to construct UPGMA and Neighbor-Joining trees.
a. What are the differences and similarities between the trees?
5. Repeat the above with an alignment based on MUSCLE instead of Clustal W.
a. How does this change the results?
b. Why do you think the results are different?
6. Include labeled screenshots of your different trees.
Part II: Needleman-Wunsch Algorithm
Instructions:
1. Go to https://www.ebi.ac.uk/Tools/sss/ and select GGSEARCH, then protein. This will do a Needleman-Wunsch global alignment on your protein sequence.
2. Choose UniProtKB/TrEMBL as your database (step 1).
3. Paste in your STS protein query sequence (step 2).
4. Click More options... on step 3 and choose BLOSUM50 as your scoring matrix, and then click Submit.
Questions:
1. What is the name and score of your top hit?
2. How do these results differ from what you got with the Smith-Waterman algorithm (SSEARCH)?
3. Why do you think the results differed?
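For intuition about global alignment, the following minimal sketch fills a Needleman-Wunsch score matrix. As with any toy version, it assumes a flat match/mismatch score and a linear gap penalty rather than BLOSUM50 with affine gaps, and the function name and defaults are hypothetical, not GGSEARCH's.

```python
def needleman_wunsch_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the optimal global alignment score between strings a and b.

    Assumptions for illustration: simple match/mismatch scoring and a
    linear gap penalty.
    """
    rows, cols = len(a) + 1, len(b) + 1
    F = [[0] * cols for _ in range(rows)]
    # Unlike a local alignment, leading gaps are penalized, so the first
    # row and column are filled with cumulative gap costs.
    for i in range(1, rows):
        F[i][0] = i * gap
    for j in range(1, cols):
        F[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(
                F[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                F[i - 1][j] + gap,    # gap in b
                F[i][j - 1] + gap,    # gap in a
            )
    # The score is read from the bottom-right corner: both sequences
    # must be aligned end to end.
    return F[-1][-1]


if __name__ == "__main__":
    print(needleman_wunsch_score("ACGT", "ACT"))
```

Because every residue of both sequences has to be accounted for, scores can go negative here, which never happens in a local alignment that resets to zero.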
3. RNAfold also uses partition-function methods (also known as thermodynamic ensemble methods). What is a partition function and how is it related to free energy? What are the advantages and disadvantages of calculating partition functions?
Part III: FASTA Algorithm
Instructions:
1. Go to https://www.ebi.ac.uk/Tools/sss/ and select FASTA, then protein.
2. Choose UniProtKB/TrEMBL as your database (step 1).
3. Paste in your STS protein query sequence (step 2).
4. Click More options... on step 3 and run your queries for the following choices of parameters:

Scoring Matrix   BLOSUM 50   BLOSUM 80   BLOSUM 80   BLOSUM 80
Gap Open         -10         -10         0           -64
Gap Extend       -2          -2          0           -16

Questions:
1. Did these calculations take as long as the Smith-Waterman search? If so, why?
2. Were the results different from the Smith-Waterman search? Why do you think this happens?
3. What is the effect of changing to a higher cutoff BLOSUM matrix (i.e. from 50 to 80) and what does it mean?
4. What is the effect of changing the Gap Open and Gap Extend parameters? Why do you think you observed what you did?
Recombinant DNA
You have received a PCR product from an unknown organism that caused fever and a strange rash in a young child. The family of the child has two kittens, and the physician treating the child suspects an infection with Bartonella. Strangely, the growth characteristics do not match those of known Bartonella species. The clinical lab has amplified an 825 base pair fragment of the unknown organism's genome and has asked you to use the PCR product to investigate the identity of the unknown bacteria. Once you received the DNA, you decided to have the PCR product sent out for sequencing.
Bioinformatics
1) Using a BLASTN search, determine the top 3 most similar DNA sequences, and using BLASTX, determine the top 3 most similar proteins. What gene did the clinical lab PCR amplify?
• What is the function of the protein?
• Based on the results from this gene, what organism did you isolate? And what species is your organism most similar to?
PCR/Primer Design and Cloning
Now that you have learned a little bit about the genetics of the unknown organism, you have been asked to create a real-time PCR assay to detect the unknown in other patients' blood samples using the gene that was amplified by the clinical lab. To create your standards for your real-time PCR assay, you need to clone your DNA fragment into a plasmid (pUC19) using standard PCR techniques.
2) Using the 825 bp sequence for your unknown fragment, please create a primer pair that will amplify a 500 base pair fragment of the gene. Additionally, please add restriction sites to the ends of your primers. Please provide the following:
• Sequences of the primers
• Primer Tm
• The location of the primers in the sequence
• The restriction sites you used.
Remember: Check to make sure that the restriction sites will only cut the primers and are not found anywhere else in the DNA sequence but are found in pUC19.
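The two checks in this exercise (a primer Tm estimate and a scan for unwanted restriction sites) can be sketched in a few lines. This is a rough illustration, not a substitute for a primer design tool: the Tm uses the simple Wallace rule, 2 °C per A/T plus 4 °C per G/C, which is only a crude approximation, and the function names and toy sequences are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    comp = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(comp[base] for base in reversed(seq.upper()))


def wallace_tm(primer):
    """Estimate primer Tm in degrees C with the Wallace rule.

    Assumption: this rule of thumb is used only for illustration;
    nearest-neighbor methods give more realistic values.
    """
    p = primer.upper()
    return 2 * (p.count("A") + p.count("T")) + 4 * (p.count("G") + p.count("C"))


def count_site(seq, site):
    """Count occurrences of a recognition site on either strand of seq."""
    seq, site = seq.upper(), site.upper()
    # A non-palindromic site on the opposite strand shows up as its
    # reverse complement on the strand we are scanning.
    patterns = {site, reverse_complement(site)}
    n = len(site)
    return sum(1 for i in range(len(seq) - n + 1) if seq[i:i + n] in patterns)


if __name__ == "__main__":
    fragment = "ATGGCTAAGCGTACCGGTTTAGCA"  # made-up stand-in for the insert
    ecori = "GAATTC"                       # EcoRI recognition sequence
    print("Tm estimate:", wallace_tm("ATGGCTAAGCGTACCGG"))
    print("EcoRI sites inside fragment:", count_site(fragment, ecori))
```

A count of zero inside the fragment, combined with the site being present in the pUC19 cloning region, is the condition the reminder above asks you to confirm.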