Quick Start¶

This Quick Start Tutorial will walk you through every step of downloading, installing, and running the Fluidigm2PURC pipeline. The details of each step can be found in the main documentation.

Requirements:

Python (we suggest using Miniconda)
Python modules: pandas, numpy, biopython, cython
C, C++ compilers (Linux should be good, Mac OSX needs Xcode and the Command Line Tools)
zlib (needed to compile Sickle; may already be present)
PURC (available on Bitbucket)

Note

We have tested our scripts on Python 2.7, 3.5, and 3.6. However, PURC has only been tested with Python 2.7. We have also worked with others researchers who had trouble getting things run with Python 3. Therefore, we recommend using Python 2.7.

1. Downloading and Installation¶

Python¶

The code below will walk you through downloading and installing a Python distribution using Miniconda, as well as all of the Python packages that needed to use Fluidigm2PURC.

# Get Miniconda for your operating system (Mac or Linux)
# Answer yes to the questions the Installer asks
# These commands will download Python 2.7 for Mac OSX
curl -O https://repo.continuum.io/miniconda/Miniconda2-latest-MacOSX-x86_64.sh
bash Miniconda2-latest-MacOSX-x86_64.sh

# Install packages with conda or pip command
conda install numpy pandas biopython cython
# pip install numpy pandas biopython cython

PURC¶

PURC is available on Bitbucket and can be cloned and installed using the code below.

git clone https://bitbucket.org/crothfels/purc.git
cd purc && ./install_dependencies.sh

# while in the PURC directory, add it to your PATH
# It's best to add the PATH to your .bash_profile
export PATH=$(pwd):$PATH

If you are on a Linux computer, you may have to run the install_dependencies_linux.sh script instead. The Bitbucket repository for PURC has more details about installation as well.

We have also included a modified version of the purc_recluster.py script as part of our pipeline (purc_recluster2.py). The only difference is that it conducts fewer iterations of the chimera detection and clustering steps. If you would like to use it, make sure that move or copy it from the Fluidigm2PURC folder into the main PURC folder.

Note

For the PURC scripts to work, they need to be present in the main PURC folder that was cloned from Bitbucket. The reason for this is that the scripts reference all its dependencies using file paths that are relative to the main PURC folder. These scripts also need to be made available in your bash PATH variable (see code above).

Fluidigm2PURC¶

Fluidigm2PURC is available on GitHub and can be cloned and installed using the code below.

git clone https://github.com/pblischak/fluidigm2purc.git
cd fluidigm
make && sudo make install

The haplotyping script, crunch_clusters, can optionally call the programs Mafft and Phyutility. If you would like to use these tools, make sure that you install them on them your machine and add them to your PATH.

2. Running fluidigm2purc¶

The fluidigm2purc script will process a set of paired-end FASTQ files that have been demultiplexed using the program dbcAmplicons and will output a single FASTA file for each locus present using sequence header information in the format required by PURC. As an example, let’s say that we have our paired-end data in the files FluidigmData_R1.fastq.gz and FluidigmData_R2.fastq.gz. To run these data through the script, all we would need to run is:

fluidigm2purc -f FluidigmData

This will filter/trim the reads using the program Sickle, merge the paired-ends (if possible) using FLASH2, and then write everything to a FASTA file in a new directory named output-FASTA/. If we want to tweak some of the settings for the parameters that are used to filter/merge reads, we can specify them using command line flags (type fluidigm2purc -h to see options). In addition to the FASTA files, the fluidigm2purc script outputs two other files: (1) a table containing all individuals where their ploidy level can be specified (output-taxon-table.txt) and (2) a table with per locus error rates (output-locus-err.txt).

3. Running PURC¶

If we cd into the output-FASTA directory, we can run PURC using its purc_recluster.py script to do sequence clustering and PCR chimera detection. If you want to use the purc_recluster2.py script, make sure you move or copy it into the main PURC folder. Also, because purc_recluster2.py only does three iterations of chimera detection and clustering, it only requires that two clustering thresholds be specified using the -c argument (rather than the usual four).

The code below will loop through all of the FASTA files in the output-FASTA directory and will write all of the output to a new directory named output-PURC/.

cd output-FASTA

for f in *.fasta
do
  purc_recluster.py -f $f -o output-PURC \
                    -c 0.975 0.99 0.995 0.997 -s 2 5 --clean
done

4. Processing PURC clusters¶

The script to infer haplotypes from the clusters returned by PURC is called crunch_cluster. If you cd into the directory where we wrote all of the PURC output, you can loop through each locus and analyze each one in turn. If you know the ploidy levels for your organism, you can add them to the output-taxon-table.txt file.

The code below will use the locus names in the output-locus-err.txt file to loop through all of the output files from PURC to infer haplotypes. It will also realign the sequences clustering Mafft (--realign), clean the sequences using Phyutility (--clean 0.4), and will only return unique haplotypes for each sample.

cd output-PURC

for l in $(tail +2 ../../output-locus-err.txt | awk '{print $1}')
do
  crunch_clusters -i ${l}_clustered_reconsensus.afa -s ../../output-taxon-table.txt \
                  -e ../../output-locus-err.txt -l $l --realign --clean 0.4 --unique_haps
done

5. Downstream¶

Once all of the loci have been haplotyped, some of them may still contain an excessive amount of gaps from being aligned to bad clusters (or because reads never merged). We can use Phyutility to clean these up one more time.

Example:

# Remove sites with more than 40% gaps
phyutility -clean 0.4 loc1_crunched_clusters.fasta