Processing PURC Clusters¶

The crunch_clusters script takes the output from PURC and will determine the haplotype configurations for your samples using the sizes of the resulting clusters. These clusters are the output from PURC. To see all of the options for running the script, type crunch_clusters -h.

During this part of the pipeline, we use the taxon-ploidy table to determine the number of haplotypes that should be output (e.g., a diploid should have 2, a tetraploid 4, etc.). We also use the per locus error rates table to calculate the probability that a given cluster is a sequencing error. The Haplotyping Tutorial provides details on the mathematical model that we use for this step.

cd output-PURC/

crunch_clusters --input_fasta loc1_clustered_reconsensus.afa \
                --species_table ../../output-taxon-table.txt \
                --error_rates ../../output-locus-err.txt --locus_name loc1

We can also treat the locus as haploid by specifying the --haploid flag. This can be used for chloroplast or mitochondrial loci, as well as for nuclear loci when all we want is the primary cluster.

To go through all of the clustered loci, we can use a bash script and a for loop to analyze each locus. If the loci under consideration are haploid, add the --haploid flag.

# List all of the loci using the error rates file
for l in $(tail +2 ../../output-locus-err.txt | awk '{print $1}')
do
  crunch_clusters -i ${l}_clustered_reconsensus.afa -s ../../output-taxon-table.txt \
                  -e ../../output-locus-err.txt -l $l
done

Some other useful options during this step include realigning the sequences using Mafft (just add --realign to your command). We can remove gappy sites using Phyutility as well. This can be done by adding the --clean <%> flag. Just substitute the percent of gaps allowed per site that you want to use for cleaning. We also also have an option to only return unique haplotypes using the --unique_haps flag.

Note

We have run into some issues with species’ names when using Phyutility. First, it doesn’t like the semicolons that PURC uses to delimit the different parts of the sequence identifier (species names, cluster #, cluster size). We use sed to substitute underscores for the semicolons automatically. Other characters such as dashes have created issues as well because they get automatically substituted for underscores and no longer match the original names. If you are running into issues with not getting any output from the crunch_clusters script, make sure you check the names of the species in all of the files it writes to make sure that things are being inadvertently changed.