-- SNP identification for typing (SNIT) README --

-- DESCRIPTION
  SNIT is a simple, fast pipeline to compare a set of bacterial genomes 
to identify the nearest neighbor for each input genome. SNIT uses MUMmer 
to perform pairwise alignments between the genomes.

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
NOTE: Detailed instructions for installing the pipeline are in the README.pdf file. 
Please follow the instructions in README.pdf to install the GUI version of the
pipeline. The following instructions only describe the command line version.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! 

-- Running SNIT from the graphical interface --
To launch the graphical interface for SNIT, execute the following command:
> $SNIT_HOME/bin/snit

-- Running SNIT from the command line --

1. Preparing the input files
  - SNIT expects all the chromosomes/plasmids of a single genome in one 
    input fasta file. If each chromosome is in a separate file, concatenate 
    them into one single fasta file. 
         e.g.: "cat ecoli_str_9999_chr*.fasta > ecoli_str_9999.fasta"

  - Optionally, if quality scores are available in phred format for any of the
    input sequences, you can mask the low-quality bases by running the
    masking script in SNIT:
        perl $SNIT_HOME/src/MaskLowQualBases.pl input_seq_file input_qual_file masked_seq_file [quality_cut_off_score]
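
    For users who would rather script the concatenation step than use "cat",
    the following is a minimal Python sketch. The function name
    concatenate_fasta is hypothetical and not part of SNIT; the sketch only
    joins the files in the order given and does not validate FASTA syntax.

```python
from pathlib import Path


def concatenate_fasta(parts, combined):
    """Write every file in `parts`, in order, into the single file `combined`.

    Hypothetical helper, not shipped with SNIT.
    """
    with open(combined, "w") as out:
        for part in parts:
            text = Path(part).read_text()
            out.write(text)
            # Guard against a missing trailing newline, which would glue the
            # next FASTA header onto the last sequence line of this file.
            if text and not text.endswith("\n"):
                out.write("\n")
```

    Unlike a plain "cat", this adds a newline between files when one is
    missing, so that a header line never ends up appended to sequence data.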

2. Prepare a configuration file for your run

  - Specifying the reference:
    SNIT takes the first genome in the configuration file as the reference.
    Therefore, if you prefer to use a specific genome as the reference, list it
    on the first line of your configuration file. If you are trying to find
    the nearest neighbor for a specific genome, it is recommended that you 
    use that genome as the reference. This will ensure that no significant 
    region of that genome is ignored in computing the SNPs.
   
  - Format of the configuration file:
    The configuration file contains one line for each input genome:
       FILE=/home/user/genomes/ecoli/ecoli_str_9999.fasta, NAME=str_9999

     **Note: Make sure there are no leading spaces or other extra characters anywhere in the line.
             Also, do not include any spaces in the NAME. Refer to sample.slist in the distribution.
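
    When many genomes are involved, the configuration file can be generated
    rather than typed by hand. The sketch below is one possible way to do
    this in Python; write_config is a hypothetical helper, not part of SNIT.
    It emits the "FILE=..., NAME=..." layout shown above, places the first
    pair on the first line (so it becomes the reference), and rejects names
    containing whitespace.

```python
def write_config(genomes, config_path):
    """Write a SNIT-style configuration file.

    genomes: list of (fasta_path, name) pairs; the first pair is written on
    the first line and is therefore treated as the reference.
    Hypothetical helper, not shipped with SNIT.
    """
    lines = []
    for fasta_path, name in genomes:
        # The NAME field must not contain spaces (or any other whitespace).
        if not name or any(ch.isspace() for ch in name):
            raise ValueError(f"NAME must be non-empty without whitespace: {name!r}")
        # Each line starts directly with FILE=, so there are no leading spaces.
        lines.append(f"FILE={fasta_path}, NAME={name}")
    with open(config_path, "w") as handle:
        handle.write("\n".join(lines) + "\n")
    return lines
```

    The returned list holds the lines that were written, which is convenient
    for logging which genome was chosen as the reference.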

3. Run the SNIT pipeline

   - To view all the command-line options, run the script without arguments:

> perl $SNIT_HOME/src/SNIT.pl

This produces the following output:


USAGE
snp_analysis_pipeline.pl -i <config file> -o <output prefix>
OPTIONS
        -h                   print usage
        -i    string         configuration file name
        -o    string         output prefix
        -w    string         output (work) directory
        -r    string         output (result directory)
        -f    [0/1]          Screen input files with trf 0- no, 1 yes [default: 1]
        -t    string         trf flags default: "2 5 5 80 10 50 500 -h" (ignored if run with -f 0)
        -c    int            minimum MUMmer cluster length [default: 201 ]
        -m    int            minimum MUMmer exact match length [default: 100]
        -g    int            maximum MUMmer gap length for extension of alignments [default: 1]
        -d    int            min indel size to report as insertion/deletion [default: 100]
        -s    int            min conserved flank length on either side of a SNP [default: 100]
        -e    int            min distance of a SNP from the edges of a contig [default: 250]
        -q    [0/1]          interpret smaller case bases as low quality bases and ignore SNPs in these bases 0-no, 1-yes [default: 0]
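
When SNIT is driven from another script, the command line can be assembled
programmatically. The following is a minimal Python sketch, assuming the
-i and -o options listed in the usage above; build_snit_command and its
parameters are hypothetical names, and the function only builds the argument
list without running anything.

```python
import os


def build_snit_command(config_file, output_prefix, snit_home, **options):
    """Return the argument list for one SNIT run.

    snit_home is the SNIT installation directory ($SNIT_HOME).
    Extra keyword arguments become single-letter flags, e.g. f=0 -> "-f 0".
    Hypothetical helper, not shipped with SNIT.
    """
    script = os.path.join(snit_home, "src", "SNIT.pl")
    command = ["perl", script, "-i", config_file, "-o", output_prefix]
    for flag, value in options.items():
        command += [f"-{flag}", str(value)]
    return command


# Possible use (not executed here):
#   import subprocess
#   subprocess.run(build_snit_command("run.slist", "ecoli", os.environ["SNIT_HOME"]),
#                  check=True)
```

Passing the arguments as a list avoids shell quoting problems when file
names contain unusual characters.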


4. Output of the SNIT pipeline

   The out_prefix.tbl.final file lists all the SNPs/small indels discovered
   from the pairwise alignments. This file is tab-delimited and can be opened in
   Excel for easier viewing.
    - The 1st column gives the position of the polymorphism in the reference genome.
    - The 2nd column gives the type of variation: SNP, INS (insertion), or DEL (deletion) with respect to the reference.
    - One column for each input genome gives the variants in that genome.
    - The last column contains the id of the sequence/chromosome in the reference genome.
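
   The column layout described above can also be read programmatically. The
   Python sketch below is illustrative only: parse_variant_table and
   count_by_type are hypothetical names, and it assumes that the genome
   columns appear in the same order as the names you pass in and that any
   line whose first field is not a number (such as a possible header row)
   can be skipped. Neither assumption is confirmed by this README.

```python
from collections import Counter


def parse_variant_table(lines, genome_names):
    """Parse tab-delimited rows: position, type, one column per genome, sequence id.

    Hypothetical helper; column order of genomes is an assumption.
    """
    records = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if not fields[0].isdigit():
            continue  # assumed: blank lines or a header row, not data
        if len(fields) != len(genome_names) + 3:
            raise ValueError(f"unexpected column count: {line!r}")
        records.append({
            "position": int(fields[0]),
            "type": fields[1],
            "variants": dict(zip(genome_names, fields[2:-1])),
            "sequence_id": fields[-1],
        })
    return records


def count_by_type(records):
    """Count how many records fall into each variation type (SNP, INS, DEL)."""
    return Counter(record["type"] for record in records)
```

   Calling parse_variant_table(open("out_prefix.tbl.final"), names) and then
   count_by_type on the result gives a quick per-type summary of the run.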


   The out_prefix.tbl.dmtx file gives a (k x k) matrix in which each entry is the number of
   SNPs/small indels by which the corresponding pair of the k genomes differ. A smaller
   number indicates a closer genome.
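
   Finding the nearest neighbor of each genome from such a matrix amounts to
   taking the smallest off-diagonal entry in each row. The Python sketch
   below shows this on a matrix already loaded into memory; the exact text
   layout of the .dmtx file is not described here, so reading it is left to
   the user, and nearest_neighbors is a hypothetical name rather than part
   of SNIT.

```python
def nearest_neighbors(names, matrix):
    """Return {genome: (closest_other_genome, difference_count)}.

    names:  list of k genome names
    matrix: k x k list of lists; matrix[i][j] is the number of differences
            between genome i and genome j.
    Hypothetical helper, not shipped with SNIT.
    """
    k = len(names)
    if len(matrix) != k or any(len(row) != k for row in matrix):
        raise ValueError("matrix must be k x k, with k = len(names)")
    result = {}
    for i, name in enumerate(names):
        # Skip the diagonal: a genome is not its own neighbor.
        others = [(matrix[i][j], names[j]) for j in range(k) if j != i]
        if others:
            # min() picks the smallest count; ties go to the name that
            # sorts first alphabetically.
            count, neighbor = min(others)
            result[name] = (neighbor, count)
    return result
```

   For example, with names ["A", "B", "C"] and rows [0, 5, 9], [5, 0, 2],
   [9, 2, 0], genome A is paired with B (5 differences), and B and C are
   paired with each other (2 differences).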
	

