INPUT FILE INFORMATION:
Test_data.fa (alias Sequence_DATA.fa) → raw data file produced by the 454 sequencer (FASTA format)
Enzyme.tsv (alias Restriction_Enzme.tsv) → tab-separated values: vector name, sample name, and restriction motif
Demultiplexing_Trimming_blunt_GTAC.tsv (alias Sequence_Motifs.tsv) → tab-separated MID sequences and vector sequences
representative_results
All text in the above files should be uppercase
Do not use special characters in names
initial_count_without_homopolymer_correction
The following are needed to run the scripts.
Minimum system requirements: 8-core CPU and 32 GB RAM
The scripts are designed for Linux and written in Python 2.7
Download the mouse genome (version mm9) and the human genome (version hg19) from the UCSC web server
Remove the random chromosomes from both genomes so that reads are not multiply mapped to them
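
The removal of random chromosomes can be sketched as follows. This is a minimal sketch, not part of the pipeline: it assumes the unwanted records have "random" in their FASTA header (as in UCSC names such as chr1_random), and the function name drop_random_chromosomes and the output file name in the usage comment are hypothetical.

```python
def drop_random_chromosomes(in_path, out_path, marker="random"):
    """Copy a FASTA file, skipping records whose header contains `marker`.

    Returns the number of records written.
    """
    kept = 0
    keep_record = False
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                # Header line: decide whether to keep the whole record.
                keep_record = marker not in line
                if keep_record:
                    kept += 1
            if keep_record:
                dst.write(line)
    return kept

# Usage (paths are placeholders):
# drop_random_chromosomes("/location/of/hg19.fa", "/location/of/hg19_no_random.fa")
```

Check the header names in the downloaded genome first; if other unplaced contigs should also be excluded, the marker test needs to be extended accordingly.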
Install blast
(sudo apt-get install blast2)
Python modules required by the scripts:
sqlite3
pandas
numpy
biopython
joblib
pygr (https://pypi.python.org/pypi/pygr/0.8.2) 
h5py
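
Before running the pipeline, it may help to check that the modules listed above can be imported. The following is a minimal sketch; the helper name missing_modules is hypothetical, and the import name "Bio" is used for biopython.

```python
import importlib

# Import names for the modules listed above; biopython is imported as "Bio".
REQUIRED = ["sqlite3", "pandas", "numpy", "Bio", "joblib", "pygr", "h5py"]

def missing_modules(names=REQUIRED):
    """Return the names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing

# Usage: an empty list means every required module was found.
# print(missing_modules())
```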


STEP 1:- DEMULTIPLEXING STEPS
At the terminal, change to the script directory (i.e. Test_script).
To start demultiplexing the data, run the shell script pipe2runer.sh with the following commands at the terminal

../Test_script$ chmod 777 pipe2runer.sh
 
../Test_script$ ./pipe2runer.sh

Once the script has finished, run the following commands

../Test_script$ python pipe03_gather_add_counts.py all_reads/raw_data all_reads/

../Test_script$ python pipe04_assign_sample_identifier.py

To calculate the read count per sample, run the following script

../Test_script$ python pipe04_assign_sample_identifier_statistics.py all_reads/pipe4_out > stats_pipe4.txt

The text file stats_pipe4.txt contains the number of sequences present in each sample.

Now run the following command to separate the BLT and IPSC sequences.

../Test_script$ python pipe05_gather_sequences_two.py

This creates three FASTA (.fa) files at the following locations
/Test_script/BLT_data/BLT_data.fa
/Test_script/IPSC_data/IPSC_data.fa
/Test_script/all_reads.fa
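
As a quick sanity check on the three output files, the number of sequences in each can be counted. This is a minimal sketch that counts FASTA header lines; the function name count_fasta_records is hypothetical, and the paths in the usage comment are placeholders to adjust to where the files were actually written.

```python
def count_fasta_records(path):
    """Count sequences in a FASTA file by counting header lines ('>')."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))

# Usage (adjust the paths to your output locations):
# for path in ["BLT_data/BLT_data.fa", "IPSC_data/IPSC_data.fa", "all_reads.fa"]:
#     print(path + "\t" + str(count_fasta_records(path)))
```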

STEP 2:- MAPPING STEP
For BLT data, run blat with the following command
../Test_script$ ./blat -t=dna -q=dna -stepSize=5 -repMatch=2253 -noHead -minScore=15 -minIdentity=30 /location/of/hg19.fa all_reads/BLT_data/BLT_data.fa all_reads/BLT_data/BLT_data.psl

For IPSC data, run blat with the following command
../Test_script$ ./blat -t=dna -q=dna -stepSize=5 -repMatch=2253 -noHead -minScore=15 -minIdentity=30 /location/of/mm9.fa all_reads/IPSC_data/IPSC_data.fa all_reads/IPSC_data/IPSC_data.psl
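
To inspect the blat output before counting, the .psl files can be read directly. The sketch below assumes the standard 21-column, tab-separated PSL layout with no header lines (as produced with -noHead), in which the query name is the 10th column. It only tallies alignment rows per read; how Full_script_Left_Right.py itself decides between single and multiple hits is not described here, so treat the numbers as a rough check. The function names are hypothetical.

```python
from collections import Counter

# 0-based position of the query name (qName) in a standard PSL row.
Q_NAME = 9
PSL_COLUMNS = 21

def hits_per_query(psl_path):
    """Return a Counter mapping each query name to its number of PSL rows."""
    counts = Counter()
    with open(psl_path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < PSL_COLUMNS:
                continue  # skip blank or malformed lines
            counts[fields[Q_NAME]] += 1
    return counts

def summarize(counts):
    """Return (reads with exactly one row, reads with more than one row)."""
    single = sum(1 for n in counts.values() if n == 1)
    multi = sum(1 for n in counts.values() if n > 1)
    return single, multi

# Usage:
# print(summarize(hits_per_query("all_reads/BLT_data/BLT_data.psl")))
```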

Once the mapping step is complete, run the sequence counting script.

STEP 3:- COUNTING CLONES
Run the sequence counting Python script as follows
../Test_script$ python Full_script_Left_Right.py Demultiplexing_Trimming_file Enzyme_file ref_genome_fasta blat_output_file demultiplexed_fasta_file

For example
../Test_script$ python Full_script_Left_Right.py Demultiplexing_Trimming_blunt_GTAC.tsv Enzyme.tsv /location/of/hg19.fa all_reads/BLT_data/BLT_data.psl all_reads/BLT_data/BLT_data.fa

This script creates three files in the directory where the demultiplexed FASTA file is stored (for the test data, this is all_reads/BLT_data/)
1) proceesedpsl_db
This is an SQL database that stores all the intermediate output of the counting script
2) initial_count_without_homopolymer_correction.txt
This file contains sequence mapping, VIS location, and count information. The sequence counts provided in this file are not corrected for homopolymer errors
3) Final_count.txt
This file contains sequence mapping, VIS location, and count information. The sequence counts provided in this file are corrected for homopolymer errors
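
To illustrate why homopolymer errors matter for the counts: 454 reads often report the wrong length for runs of identical bases, so reads from the same site can differ only in run length and be counted separately. The sketch below is an illustration only, under the assumption that comparing sequences after collapsing every run to a single base is an acceptable approximation; it is not taken from Full_script_Left_Right.py and may differ from the correction that script applies. The function names are hypothetical.

```python
from itertools import groupby

def collapse_homopolymers(seq):
    """Reduce every run of identical bases to a single base."""
    return "".join(base for base, _run in groupby(seq.upper()))

def same_up_to_homopolymers(seq_a, seq_b):
    """True if the two sequences differ only in homopolymer run lengths."""
    return collapse_homopolymers(seq_a) == collapse_homopolymers(seq_b)

# Usage:
# same_up_to_homopolymers("ACGGGT", "ACGGT")  # differ only in the length of the G run
```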
 
Information about the columns in initial_count_without_homopolymer_correction.txt and Final_count.txt
Column name		Description
Unnamed:0 	row index values; this column can be ignored
CLONE_ID	identification code given to each VIS
VECTOR_ID 	vector identification code
R_TYPE		right side type (Single hit - Single; R|Multi/NoHit/NoGoodSpn; blank if the sequence is from the left side)
R_STRAND	orientation of the query sequence in the genome (+/- strand of the mapped sequence; blank if not mapped or multi-mapped)
R_QLEN		length of the query sequence
R_QSEQ		query sequence
R_GLEN		the expected length of the integration site sequence, calculated from the distance between the insertion site and the nearest available GTAC motif in the genome. Sequences with R_GLEN < 450 bp qualify for quantitative analyses
R_GSEQ	  	genomic sequence to which the query sequence mapped; the sequence end is determined by the restriction site
L_TYPE		left side type (Single hit - Single; L|Multi/NoHit/NoGoodSpn; blank if the sequence is from the right side)
L_STRAND	orientation of the query sequence in the genome (+/- strand of the mapped sequence; blank if not mapped or multi-mapped)
L_QLEN		length of the query sequence
L_QSEQ		query sequence
L_GLEN		the expected length of the integration site sequence, calculated from the distance between the insertion site and the nearest available GTAC motif in the genome. Sequences with L_GLEN < 450 bp qualify for quantitative analyses
L_GSEQ	  	genomic sequence to which the query sequence mapped; the sequence end is determined by the restriction site
TOTAL_COUNT 	total sequence count summed over all the samples
CHR_NO		chromosome name/number
CHR_SITE1	VIS start location on the chromosome
CHR_SITE2	VIS end location on the chromosome

These initial columns are fixed. They are followed by one column per sample, each giving the VIS count in that sample.
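
The count files can be post-processed with a short script. The sketch below makes several assumptions that should be checked against the real files: that they are tab-delimited, that the header uses the column names listed above, and that a VIS qualifies when at least one of R_GLEN or L_GLEN is below 450 (whether one or both sides must qualify is not stated here). The function names are hypothetical.

```python
import csv

# Fixed columns described above; every other column is treated as a sample.
FIXED_COLUMNS = {
    "CLONE_ID", "VECTOR_ID",
    "R_TYPE", "R_STRAND", "R_QLEN", "R_QSEQ", "R_GLEN", "R_GSEQ",
    "L_TYPE", "L_STRAND", "L_QLEN", "L_QSEQ", "L_GLEN", "L_GSEQ",
    "TOTAL_COUNT", "CHR_NO", "CHR_SITE1", "CHR_SITE2",
}

def is_fixed(name):
    """True for the fixed columns and for the leading row-index column."""
    return name in FIXED_COLUMNS or name.replace(" ", "").startswith("Unnamed:")

def as_number(text):
    """Convert a cell to float, or return None for blank/non-numeric cells."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None

def quantifiable_rows(handle, max_glen=450, delimiter="\t"):
    """Return (sample column names, rows with R_GLEN or L_GLEN < max_glen)."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    sample_cols = [c for c in reader.fieldnames if not is_fixed(c)]
    rows = []
    for row in reader:
        glens = [as_number(row.get(key)) for key in ("R_GLEN", "L_GLEN")]
        if any(g is not None and g < max_glen for g in glens):
            rows.append(row)
    return sample_cols, rows

# Usage:
# with open("all_reads/BLT_data/Final_count.txt") as handle:
#     samples, rows = quantifiable_rows(handle)
```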