INPUT FILE INFORMATION:

Test_data.fa alias Sequence_DATA.fa → raw data file produced by the 454 sequencer (FASTA format)
Enzyme.tsv alias Restriction_Enzme.tsv → tab-separated values of vector name, sample name, and restriction motif
Demultiplexing_Trimming_blunt_GTAC.tsv alias Sequence_Motifs.tsv → tab-separated information on MID sequences and vector sequences

representative_results
All the text in the about file should be uppercase
Do not use special characters for names
initial_count_without_homopolymer_correction

The following are needed to run the scripts:

Minimum system requirement: 8-core CPU and 32 GB RAM
The scripts are designed for Linux and written in Python 2.7
Download mouse genome version mm9 and human genome version hg19 from the UCSC web server; remove the random chromosomes to avoid multiple mapping to them
Install blast (sudo apt-get install blast2)

Python modules required by the scripts:

sqlite3
pandas
numpy
biopython
joblib
pygr (https://pypi.python.org/pypi/pygr/0.8.2)
h5py

STEP 1:- DEMULTIPLEXING STEPS

At the terminal, change directory to the script directory (i.e. Test_script).

To start demultiplexing the data, run the shell script pipe2runer.sh with the following commands at the terminal:

../Test_script$ chmod 777 pipe2runer.sh
../Test_script$ ./pipe2runer.sh

Once the script has finished, run the following commands:

../Test_script$ python pipe03_gather_add_counts.py all_reads/raw_data all_reads/
../Test_script$ python pipe04_assign_sample_identifier.py

To calculate the read count per sample, run the following script:

../Test_script$ python pipe04_assign_sample_identifier_statistics.py all_reads/pipe4_out >stats_pipe4.txt

The text file stats_pipe4.txt contains the number of sequences present in each sample.

Now run the following command to separate the BLT and IPSC sequences:
../Test_script$ python pipe05_gather_sequences_two.py

This creates three .fa (FASTA) files at the following locations:

/Test_script/BLT_data/BLT_data.fa
/Test_script/IPSC_data/IPSC_data.fa
/Test_script/all_reads.fa

STEP 2:- MAPPING STEP

For BLT data, run blat with the following command:

../Test_script$ ./blat -t=dna -q=dna -stepSize=5 -repMatch=2253 -noHead -minScore=15 -minIdentity=30 /location/of/hg19.fa all_reads/BLT_data/BLT_data.fa all_reads/BLT_data/BLT_data.psl

For IPSC data:

../Test_script$ ./blat -t=dna -q=dna -stepSize=5 -repMatch=2253 -noHead -minScore=15 -minIdentity=30 /location/of/mm9.fa all_reads/IPSC_data/IPSC_data.fa all_reads/IPSC_data/IPSC_data.psl

Once the mapping step is complete, run the sequence counting script.

STEP 3:- COUNTING CLONES

Run the sequence counting Python script as follows:

../Test_script$ python Full_script_Left_Right.py Demultiplexing_Trimming_file Enzyme_file ref_genome_fasta blat_output_file demultiplexed_fasta_file

For example:

../Test_script$ python Full_script_Left_Right.py Demultiplexing_Trimming_blunt_GTAC.tsv Enzyme.tsv /location/of/hg19.fa all_reads/BLT_data/BLT_data.psl all_reads/BLT_data/BLT_data.fa

This script creates three files in the directory where the demultiplexed FASTA file is stored (for our test script, the location is all_reads/BLT_data/):

1) proceesedpsl_db
This is an SQL database that stores all the intermediate output of the counting script.

2) initial_count_without_homopolymer_correction.txt
This file contains the sequence mapping, VIS location, and count information. Sequence counts provided in this file are not corrected for homopolymer errors.

3) Final_count.txt
This file contains the sequence mapping, VIS location, and count information.
Sequence counts provided in this file are corrected for homopolymer errors.

Information about the columns in initial_count_without_homopolymer_correction.txt and Final_count.txt

Column name    Description
Unnamed:0      Row index values; this column can be ignored
CLONE_ID       Identification code given to each VIS
VECTOR_ID      Vector identification code
R_TYPE         Right side type (Single hit - Single; R|Multi/NoHit/NoGoodSpn; blank if the sequence is from the left side)
R_STRAND       Orientation of the query sequence in the genome (+/- strand of mapped sequences; blank if not mapped or multi-mapped)
R_QLEN         Length of the query sequence
R_QSEQ         Query sequence
R_GLEN         Expected length of the integration site sequence, calculated from the distance between the insertion site and the nearest available GTAC motif in the genome. Sequences with R_GLEN < 450 bp qualify for quantitative analyses
R_GSEQ         Genomic sequence to which the query sequence mapped; the sequence end is determined by the restriction site
L_TYPE         Left side type (Single hit - Single; L|Multi/NoHit/NoGoodSpn; blank if the sequence is from the right side)
L_STRAND       Orientation of the query sequence in the genome (+/- strand of mapped sequences; blank if not mapped or multi-mapped)
L_QLEN         Length of the query sequence
L_QSEQ         Query sequence
L_GLEN         Expected length of the integration site sequence, calculated from the distance between the insertion site and the nearest available GTAC motif in the genome. Sequences with L_GLEN < 450 bp qualify for quantitative analyses
L_GSEQ         Genomic sequence to which the query sequence mapped; the sequence end is determined by the restriction site
TOTAL_COUNT    Total sequence count summed over all the samples
CHR_NO         Chromosome name/number
CHR_SITE1      VIS start location on the chromosome
CHR_SITE2      VIS end location on the chromosome

These initial columns are fixed. The sample columns follow them; each gives the VIS count in that sample.
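As an illustration of how the column layout above can be used downstream, the following is a minimal sketch (not part of the pipeline scripts) that reads a count table, keeps the rows whose R_GLEN or L_GLEN is below 450 bp, and sums the per-sample counts. It rests on several assumptions: the file is tab-separated with a header row (this README does not state the delimiter), every column after CHR_SITE2 is a sample column, and a row qualifies if either of its GLEN values is below the cutoff. The function names parse_count_table, glen_qualifies, sample_totals and summarize are hypothetical.

```python
from __future__ import print_function  # keeps the sketch usable on Python 2.7

MAX_GLEN = 450  # cutoff in bp, taken from the R_GLEN/L_GLEN description


def parse_count_table(lines, sep="\t"):
    """Return (header, rows); each row is a dict keyed by column name.

    Assumption: the table is tab-separated with one header line.
    """
    lines = [ln.rstrip("\r\n") for ln in lines if ln.strip()]
    header = lines[0].split(sep)
    rows = [dict(zip(header, ln.split(sep))) for ln in lines[1:]]
    return header, rows


def glen_qualifies(row, max_glen=MAX_GLEN):
    """True if R_GLEN or L_GLEN holds a number below max_glen.

    Assumption: one qualifying side is enough; blank values are skipped.
    """
    for key in ("R_GLEN", "L_GLEN"):
        value = row.get(key, "").strip()
        if not value:
            continue
        try:
            if float(value) < max_glen:
                return True
        except ValueError:
            pass  # non-numeric entry, treat as not qualifying
    return False


def sample_totals(header, rows):
    """Sum the counts of every sample column (those after CHR_SITE2)."""
    sample_cols = header[header.index("CHR_SITE2") + 1:]
    totals = dict((col, 0.0) for col in sample_cols)
    for row in rows:
        for col in sample_cols:
            value = row.get(col, "").strip()
            if value:
                totals[col] += float(value)
    return totals


def summarize(path):
    """Print how many rows qualify and the per-sample totals for them."""
    with open(path) as handle:
        header, rows = parse_count_table(handle)
    kept = [row for row in rows if glen_qualifies(row)]
    print(len(kept), "of", len(rows), "rows qualify")
    print(sample_totals(header, kept))
```

For the test data, this could be called as summarize("all_reads/BLT_data/Final_count.txt"). If the file turns out to use a different delimiter, pass it to parse_count_table through the sep argument.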