PRE-LAB CONCEPTS

Phylogenetics

Phylogenetics is the study of evolutionary relatedness among biological organisms. Phylogenetic trees are generally based on molecular data (DNA or amino acid sequence) and use tree-like branching patterns to illustrate evolutionary histories, as shown below for Wolbachia. The tips of each branch, often referred to as leaves, represent the operational taxonomic units (OTUs) that are being compared. These might include genes, individuals, species, or populations. If the OTUs represent a formal taxonomic group, such as a species, they are termed taxa (singular: taxon).  Each node on the tree represents the common ancestor for all taxa branching out of that node. Clusters of taxa that originate from the same ancestral node are called clades.

Types of Trees

Rooted trees feature a distinct node, or root, that serves as the ancestral group for all taxa in the tree. The most common way to root a tree is by using an ancestral outgroup, a taxon that is known to be more distantly related than all other taxa in the tree. Unrooted trees, however, are necessary when ancestry is unknown (Fig 1). In the case of Wolbachia Supergroups, the ancestral strain is unknown so most trees will be unrooted. We can, however, include genera such as Ehrlichia or Anaplasma as outgroups because they are closely related yet outside the group of interest (Wolbachia). While this may not provide concise ancestral information (a true root), it will create a meaningful tree showing the relationship of all Wolbachia taxa relative to closely related taxa (Fig 2). Finally, unrooted trees are sometimes midpoint rooted (Fig 3).  The hypothetical root can be placed midpoint in the tree if (i) the tree is balanced, and closely related clades are separated by a long branch or (ii) taxa are evolving at the same rate.

Alignments

The length of each branch on a phylogenetic tree represents the evolutionary distance between sequences. There are various approaches to constructing phylogenetic trees based on the dataset being analyzed; many methods begin with a distance estimation between nucleotide or amino acid sequences. The first step in distance-based phylogeny is the calculation of the pairwise distance, or p-distance, between two sequences. In simplest forms, this can be calculated by aligning two DNA sequences and dividing the number of nucleotide differences (nd) by sequence length, or the total number of nucleotide sites in the sequence (n).

p = nd/n

In the example below, there is one base pair substitution (C – T) across 10 nucleotide sites.

p = 1/10 = 0.10

Typically, an alignment would assess multiple sequences, each with hundreds to thousands of nucleotides.

A higher p-distance (e.g., 0.30) would indicate that the sequences are more divergent, or different, relative to the above pairwise alignment. On the other hand, a p-distance of 0 would indicate that the sequences are identical with 100% nucleotide similarity.

The above alignment features representatives from four Wolbachia Supergroups (labeled A-D). Within the B-Supergroup, notice unique base pair substitutions at positions 2 and 19 relative to all other Supergroups. These mutations, or changes in the DNA sequence, are specific to the B-Supergroup. Position 13, however, is variable within Supergroup B.

 

Point Mutations

Beyond the basic p-distance, many other variables in an alignment can inform evolutionary analyses. For example, point mutations are genetic mutations in which a single nucleotide in a genome is altered. Nucleotides can be substituted, inserted, or deleted in a genome.

Nucleotide Substitutions

Transitions involve the interchange of a purine with a purine (A ⇄ G) or a pyrimidine with a pyrimidine (C ⇄ T). Transversions, on the other hand, involve the interchange of a purine and a pyrimidine. As shown in the figure below, transversions have a greater impact on genomic structure because a two-ring structure is exchanged with a one-ring structure, or vice versa.

Both types of nucleotide substitutions – transitions and transversions – can influence the resulting protein sequence. Nonsynonymous substitutions are point mutations resulting in a different amino acid sequence. If the point mutation encodes the same amino acid, it is termed a synonymous substitution. The figure below illustrates the variation in amino acid sequences encoded by two different DNA sequences. Each codon, a sequence of three nucleotides that correlates to a specific amino acid, is marked #1-10 (see Codon Table tab). Codon #3 encodes a synonymous substitution because the correlating amino acid, threonine (Thr), remains the same even though the coding sequences are slightly different (ACA vs ACG) between the two nucleotide sequences. Codons #5, #7, and #10, however, encode nonsynonymous substitutions because the change in nucleotide sequence has an impact on the resulting amino acid products. A simple change in the amino acid sequence can impact both the structure and function of the resulting protein.

Nucleotide Insertions & Deletions

Insertions and deletions in a nucleotide alignment are often referred to as indels (insertions-deletions) because the term does not require information about the direction of mutation. When comparing two nucleotide sequences, for example, it is likely unknown if the indel is due to an insertion (gain of a nucleotide) in one sequence or a deletion (loss of a nucleotide) in the aligned sequenced. Indels are incredibly important to evolutionary genetics because they influence phenotypic traits, including human disease.  

Indels will result in a frameshift mutation unless all three nucleotides in a codon are similarly inserted or deleted. As the term implies, the reading frame of the nucleotide sequence is shifted such that the amino acid sequence is altered. The figure below illustrates an indel at codon #6. Without comprehensive phylogenetic information, it is unknown whether this indel represents a deletion in the top sequence or an insertion in bottom sequence. In scenario (a), a single indel has drastic consequences for the downstream amino acid sequence and will result in a different protein product. In scenario (b), the insertion/deletion of an entire codon (3 nucleotides) results in only one amino acid change across the sequence. The phenotypic impact of this seemingly minor indel will depend on the specific role of the amino acid (glutamic acid) in overall structure and/or function of the protein.

Together, point mutations occur at different evolutionary rates and have profoundly different impacts on the genome. When building phylogenetic trees, bioinformaticians must weigh the cost/benefit of using an advanced method with long computational time that better accounts for evolutionary variation vs. a simpler, less robust method that delivers fast results. For the purpose of this lab, we will focus on the latter.

Codon Table

The genetic code of DNA is transcribed into messenger RNA (mRNA) and then translated into proteins via codons, three consecutive nucleotides that correlate to specific amino acids. The table below summarizes the association between codons (from the coding strand of DNA) and amino acids.

  1. ATG (methionine) is the most common start codon; therefore, methionine is often the first amino acid in a protein sequence.
  2. TAA, TAG, and TGA encode stop codons. This terminates translation of the protein. If a frameshift mutation prematurely encodes a stop codon, the protein will be truncated.
  3. Many codon tables display the genetic code from mRNA to amino acid (translation). In this case, thymine (T) would be replaced with uracil (U).
  4. The genetic code is nearly universal across all forms of life. Some organisms, such as bacteria, feature slight deviations in the genetic code and may utilize alternative start codons.

Sanger Sequencing Files

The phylogenetic trees constructed in this lab activity are based on the insect barcoding gene, mitochondrial cytochrome c oxidase I (CO1). All sequences have been obtained by DNA extractions from individual arthropods followed by Sanger sequencing. A few simple steps, covered in Activity 2, are needed to convert the raw data file (.ab1) to a usable format (.fasta) for phylogenetic analysis. Before we begin, let’s review some basic concepts:

File Extensions

A file extension typically comes after the period in a file name and indicates the format and/or software program that generated the file. Below are a few common bioinformatics file extensions:

  • .ab1 (ABI sequencer data file): Known as the trace file, it includes raw data that has been output from Applied Biosystems’ Sequencing Analysis Software. .ab1 files include quality information about the base calls, the chromatogram (also called the electropherogram), and the DNA sequence.
  • .scf (Standard Chromatogram Format): Like .ab1 files, .scf files are also trace files that include quality information about the base calls, the chromatogram (also called the electropherogram), and the DNA sequence.
  • .fastaA text-based format for representing either nucleotide or amino acid sequences. The file often starts with a description or header line that begins with ‘>’ and provides information about the sequence.

Chromatograms

Sanger sequencing (.ab1) files are visualized as chromatograms. Each peak corresponds with a unique base call, or nucleotide, and quality score. Chromatograms can be trimmed and modified to generate a text-based FASTA file.

Quality Scores

Quality scores indicate the probability that an individual base, or nucleotide, is called incorrectly during DNA sequencing. For this lab, we recommend a Q score ≥ 40.

FASTA Files

In bioinformatics, FASTA is a text-based format for representing either nucleotide (DNA/RNA) or amino acid (protein) sequences. The file must have a top line that begins with ‘>’ and include a sequence name and/or short description. The actual sequence comprises the rest of the file. For example:

  • This file contains DNA sequence information for the red imported fire ant. The first line contains ‘>’ followed by the name of the sequence. A line break separates the name from the actual DNA sequence.

  • This file contains three separate DNA sequences. The ‘>’ symbol designates the beginning of a new sequence within a FASTA file. Names, or descriptors, are located on the top line, followed by a line break and the corresponding DNA sequence.

FASTA File Name

Just as PDF documents are identified with a .pdf file extension, FASTA files use .fasta  or .fa at the end of the file name.

Creating and Modifying a FASTA File

Any bioinformatics program (such as MEGA, SnapGene, or Geneious) can create and modify FASTA files. Alternatively, a FASTA file may be manually edited using a basic text editing program (i.e., TextEdit for Mac or Notepad for PC). Text can be added and deleted as long as it retains the FASTA format (above).

PRE-LAB ACTIVITY

LAB GUIDES

DNA SEQUENCES

Introduction to Bioinformatics

Example DNA Chromatograms

Phylogeny (.fasta file)

NCBI Taxonomy & BLAST Searching

 

FAQs

A. Many variables might contribute to a low-quality Sanger run. These include, but are not limited to:

  • Non-specific primer binding
  • Contamination with other samples during DNA extraction and/or PCR
  • Arthropod-specific: Amplification of both the COI gene and nuclear mitochondrial pseudogenes (numts)
  • Wolbachia-specific: The arthropod is infected with more than one Wolbachia strain (co-infection)
  • Not enough DNA template
  • DNA degradation
  • Inhibitory contaminants

If both directions are low-quality, it is recommended that you search the literature for your specific arthropod to confirm if: (i) numts are a known issue; (ii) optimized primers have been developed; (iii) Wolbachia co-infections have been described.

A. This depends on the overall goal of your project.

  • If the sequences are intended for an online data repository (such as GenBank) or publication, both forward and reverse reactions are highly recommended – but not required – to validate quality. You can align the two sequences to generate a consensus sequence.
  • If the sequences are part of an informal and/or pilot study, sequence data from one direction is recommended to reduce costs.

A. This is most likely caused by the sequence not aligning well in the Multiple Sequence Alignment step. To confirm, visually inspect the alignment to make sure that all nucleotides are properly aligned across the length of the sequence. If it is does not match the other sequences, it may be oriented in the reverse direction. You can use a program such as MAFFT (see Appendix B.3 in the Introduction to Bioinformatics Project Guide) and select the option to orient all sequences based on the first sequence in the file. This will make sure that all sequences are in the same orientation/direction. If this still doesn’t fix the problem, BLAST the sequence (using NCBI BLAST) to make sure that it is a reciprocal best BLAST to your other sequences. It should match the same portion of the gene/genome as the other sequences. As a last resort, you may need to remove the sequence from your input FASTA file if it is not critically important to the phylogeny.