#!/bin/bash
## FASTAptamer Cluster groups all sequences from a single Count file into families ("clusters") of closely related sequences. Once a candidate sequence is identified, other sequences in the same cluster can be examined as potential candidates too.
## Three options control clustering: the number of mismatches and/or indels allowed for a sequence to join an existing cluster (distance), a "filter" cut-off so that only sequences with more than a given number of reads are clustered (specified in reads, not RPM), and the maximum number of clusters to generate. The last two can be used to keep the script from running for too long.
## This is likely the most computationally costly step of the entire analysis, and run time varies widely with how heterogeneous or homogeneous the pool is (more unique, unrelated sequences take longer). Typical run times are many hours, if not days. Raising the "filter" to skip the lowest-abundance sequences shortens run times, but those sequences will then be absent from any further analysis that uses these Cluster files.
## Requires installation of the FASTAptamer program, found here: https://burkelab.missouri.edu/fastaptamer.html
## FASTAptamer Publication: https://doi.org/10.1038/mtna.2015.4

## FASTAptamer Cluster Variables
distance=7               ## Number of allowed indels/mismatches to join a cluster.
filter=0                 ## Cut-off to only cluster seqs > this number of reads (NOT RPM).
max_clusters=100000      ## Max number of clusters (default is 10^17).
fastapt_dir=/full/path/to/dir           ## Directory where FASTAptamer Perl Scripts are located.
clust_input_path=/full/path/to/file.count.fasta    ## Input File (must be a Count file).
clust_output_path=/full/path/to/file.clust.fasta   ## Output File for FASTAptamer Cluster.
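
## Optional sketch (not part of FASTAptamer): preview how many sequences would pass the
## read filter before committing to a long run. Assumes Count-file headers have the form
## ">rank-reads-RPM" (e.g. ">1-5000-1234.56"); count_seqs_above_filter is a hypothetical helper.
count_seqs_above_filter() {
    local count_file=$1 min_reads=$2
    # Split each header on "-" and count entries whose reads field exceeds the cut-off.
    awk -F'-' -v min="$min_reads" '/^>/ && ($2 + 0) > (min + 0) { n++ } END { print n + 0 }' "$count_file"
}
## Example use: count_seqs_above_filter "$clust_input_path" "$filter"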

perl "$fastapt_dir/fastaptamer_cluster" -i "$clust_input_path" -o "$clust_output_path" -d "$distance" -f "$filter" -c "$max_clusters"
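
## Optional sketch (not part of FASTAptamer): report how many clusters were written, e.g. to
## see whether the max_clusters limit was reached. Assumes Cluster-file headers have the form
## ">rank-reads-RPM-cluster-rankInCluster-editDistance"; count_clusters is a hypothetical helper.
count_clusters() {
    local cluster_file=$1
    # Field 4 of each header is taken to be the cluster number; count its distinct values.
    awk -F'-' '/^>/ && !seen[$4]++ { n++ } END { print n + 0 }' "$cluster_file"
}
## Example use: count_clusters "$clust_output_path"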