USEFUL COMMANDS FOR BIOINFORMATICS WORK

Converting FASTQ file to FASTA file
Convert SAM file to BAM file
Sort BAM file
Index BAM file
Get subset of sequence from FASTA file
Get particular record from multi-FASTA file
Filter records based on the sequence length in FASTQ
Local BLAST output format options
Removing new lines from multi-FASTA file
Filtering reads over 11Kb in length
Removing duplicate lines
Create a histogram of list of numbers
Convert lowercase FASTA records to uppercase
Compressing and indexing VCF file
Sorting a VCF file based on chromosome and position
Count the number of reads in FASTQ file
Docker post-installation steps

Converting FASTQ file to FASTA file

sed -n '1~4s/^@/>/p;2~4p' input.fastq > output.fasta
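The `first~step` addresses used above are a GNU sed extension, so the command may fail with other sed implementations (for example on macOS). A minimal portable sketch with awk follows, assuming every record occupies exactly four lines; the function name `fastq_to_fasta` is made up for illustration.

```shell
# fastq_to_fasta FILE: print FASTA records for a 4-line-per-record FASTQ file.
# Assumes sequence and quality strings are not wrapped across lines.
fastq_to_fasta() {
    awk '
        NR % 4 == 1 { print ">" substr($0, 2) }   # header: drop the leading @, add >
        NR % 4 == 2 { print }                     # sequence line, unchanged
    ' "$1"
}
```

Usage: `fastq_to_fasta input.fastq > output.fasta`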

Convert SAM file to BAM file

samtools view -b input.sam > output.bam

Sort BAM file

samtools sort input.bam > output.sorted.bam

Index BAM file

samtools index input.sorted.bam

NOTE: This generates the input.sorted.bam.bai file.

Get subset of sequence from FASTA file

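awk -v start="$start" -v end="$end" -v name="name_here" '$0~name{getline seq; print substr(seq,start,end-start)}' input_sequence.fasta

NOTE: set the shell variables start and end, and replace name_here with a pattern that matches the header, accordingly. The command expects the whole sequence on the single line after the header and prints end-start characters beginning at 1-based position start.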
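When the sequence is wrapped over several lines, one option is to join the record's lines before slicing. The sketch below does that and treats START and END as 1-based, inclusive coordinates; that convention and the function name `fasta_subseq` are assumptions made for this example.

```shell
# fasta_subseq ID START END FILE: print bases START..END (1-based, inclusive)
# of the record whose header ID (text after ">" up to the first blank) is ID.
fasta_subseq() {
    awk -v id="$1" -v start="$2" -v end="$3" '
        /^>/ { keep = (substr($1, 2) == id); next }   # header: is this the record?
        keep { seq = seq $0 }                          # join wrapped sequence lines
        END  { print substr(seq, start, end - start + 1) }
    ' "$4"
}
```

Usage: `fasta_subseq contig_1 100 200 input_sequence.fasta`. For large, indexed references, `samtools faidx` with a region such as `contig_1:100-200` does the same job.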

Get particular record from multi-FASTA file

awk '/^>contig_1$/ {print;getline;print}' multi.fasta

NOTE: change contig_1 accordingly. Only the header and the one line after it are printed, so the sequence must not be wrapped.
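For records whose sequence spans several lines, or whose header carries a description after the ID, a flag-based variant may be more robust. This is a sketch only; `get_record` is a hypothetical helper name.

```shell
# get_record ID FILE: print the record whose header ID (text after ">" up to
# the first blank) equals ID, together with all of its sequence lines.
get_record() {
    awk -v id="$1" '
        /^>/ { keep = (substr($1, 2) == id) }   # re-evaluate at every header line
        keep                                    # print lines while inside the record
    ' "$2"
}
```

Usage: `get_record contig_1 multi.fasta`. Because the ID is compared for equality, contig_1 does not also match contig_10.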

Filter records based on the sequence length in FASTQ

awk 'BEGIN {FS = "\t"; OFS = "\n"} {header = $0; getline seq; getline qheader; getline qseq; if (length(seq) >= 11000) { print header,seq,qheader,qseq}}' < input.fastq > filtered.fastq
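The length cutoff can also be passed in as a parameter instead of being hard-coded. A minimal sketch, assuming four lines per record; the function name `filter_fastq_by_length` is invented for this example.

```shell
# filter_fastq_by_length MIN FILE: print only the records whose sequence
# is at least MIN bases long.
filter_fastq_by_length() {
    awk -v min="$1" '
        NR % 4 == 1 { header = $0 }   # @read_id line
        NR % 4 == 2 { seq = $0 }      # sequence line
        NR % 4 == 3 { plus = $0 }     # separator line
        NR % 4 == 0 {                 # quality line completes the record
            if (length(seq) >= min) print header "\n" seq "\n" plus "\n" $0
        }
    ' "$2"
}
```

Usage: `filter_fastq_by_length 11000 input.fastq > filtered.fastq`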

Local BLAST output format options

Syntax for blastn: blastn -db {db_name} -query {query.fasta} -out {output_file} -outfmt {output_format} -num_threads {num_threads}

OUTPUT FORMAT

Alignment View Options:
0 = pairwise
1 = query-anchored showing identities
2 = query-anchored no identities
3 = flat query-anchored, show identities
4 = flat query-anchored, no identities
5 = XML Blast output
6 = tabular
7 = tabular with comment lines
8 = Text ASN.1
9 = Binary ASN.1
10 = Comma-separated values
11 = BLAST archive format (ASN.1)

Options 6, 7, and 10 can be additionally configured to produce a custom format 
specified by space delimited format specifiers. The supported format 
specifiers are:
           qseqid means Query Seq-id
              qgi means Query GI
             qacc means Query accession
          qaccver means Query accession.version
             qlen means Query sequence length
           sseqid means Subject Seq-id
        sallseqid means All subject Seq-id(s), separated by a ';'
              sgi means Subject GI
           sallgi means All subject GIs
             sacc means Subject accession
          saccver means Subject accession.version
          sallacc means All subject accessions
             slen means Subject sequence length
           qstart means Start of alignment in query
             qend means End of alignment in query
           sstart means Start of alignment in subject
             send means End of alignment in subject
             qseq means Aligned part of query sequence
             sseq means Aligned part of subject sequence
           evalue means Expect value
         bitscore means Bit score
            score means Raw score
           length means Alignment length
           pident means Percentage of identical matches
           nident means Number of identical matches
         mismatch means Number of mismatches
         positive means Number of positive-scoring matches
          gapopen means Number of gap openings
             gaps means Total number of gaps
             ppos means Percentage of positive-scoring matches
           frames means Query and subject frames separated by a '/'
           qframe means Query frame
           sframe means Subject frame
             btop means Blast traceback operations (BTOP)
          staxids means Subject Taxonomy ID(s), separated by a ';'
        sscinames means Subject Scientific Name(s), separated by a ';'
        scomnames means Subject Common Name(s), separated by a ';'
       sblastnames means Subject Blast Name(s), separated by a ';'
                (in alphabetical order)
       sskingdoms means Subject Super Kingdom(s), separated by a ';'
                (in alphabetical order) 
           stitle means Subject Title
       salltitles means All Subject Title(s), separated by a '<>'
          sstrand means Subject Strand
            qcovs means Query Coverage Per Subject
          qcovhsp means Query Coverage Per HSP
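As a sketch of how the specifiers combine, the functions below request six columns in tabular format (6) and then filter the result by percent identity. The function names `run_blast` and `best_hits` and the particular column selection are illustrative assumptions, not part of BLAST.

```shell
# Columns to request, in output order (specifiers from the list above).
fields="qseqid sseqid pident length evalue bitscore"

# run_blast DB QUERY OUT: tabular blastn output restricted to $fields.
run_blast() {
    blastn -db "$1" -query "$2" -outfmt "6 $fields" -out "$3"
}

# best_hits FILE MIN_PIDENT: keep rows whose pident (column 3 in the
# order above) is at least MIN_PIDENT. Format 6 output is tab-separated.
best_hits() {
    awk -F'\t' -v min="$2" '$3 + 0 >= min' "$1"
}
```

Usage: `run_blast my_db query.fasta hits.tsv` followed by `best_hits hits.tsv 95`, where my_db, query.fasta, and hits.tsv are placeholder names.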

Removing new lines from multi-FASTA file

awk '/^[>;]/ { if (seq) { print seq }; seq=""; print } /^[^>;]/ { seq = seq $0 } END { print seq }' input_file.fasta > outputfile.fasta

Filtering reads over 11Kb in length

awk 'BEGIN {FS = "\t" ; OFS = "\n"} {header = $0 ; getline seq ; getline qheader ; getline qseq ; if (length(seq) >= 11000) {print header,seq,qheader,qseq}}' < input.fastq > output.fastq

Removing duplicate lines

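awk '!x[$0]++' file > output_file

NOTE: the awk program must be quoted so the shell does not expand it. Use $1 in place of $0 to keep only the first line for each distinct value of the first field.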

Create a histogram of list of numbers

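awk -v size=20 '{ b=int($1/size); a[b]++; if (NR==1) bmin=bmax=b; bmax=b>bmax?b:bmax; bmin=b<bmin?b:bmin } END { for(i=bmin;i<=bmax;++i) print i*size,(i+1)*size,a[i]+0 }' file

NOTE: change the bin size (size=20) accordingly. Each output line gives the bin start, the bin end, and the count of values in that bin.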

Convert lowercase FASTA records to uppercase

awk 'BEGIN{FS=" "}{if(!/>/){print toupper($0)}else{print $1}}' input.fna > output.fna

Compressing and indexing VCF file

bgzip -c file.vcf > file.vcf.gz
tabix -p vcf file.vcf.gz

Sorting a VCF file based on chromosome and position

sort -k1,1V -k2,2n input.vcf > output.vcf

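The -k1,1V option sorts on the first column (chromosome) using "version" sort, a natural ordering of numbers embedded in text, so chr2 comes before chr10. The -k2,2n option then sorts on the second column (position) numerically. The V modifier is a GNU sort feature.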
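The plain sort command treats header lines (those starting with #) like any other line, which can reorder them. One way to keep the header intact is to print it first and sort only the records. A sketch assuming GNU sort; `sort_vcf` is a made-up function name.

```shell
# sort_vcf FILE: print header lines (starting with #) in their original
# order, then the records sorted by chromosome (version sort) and
# position (numeric). Requires a sort that supports the V modifier.
sort_vcf() {
    grep '^#' "$1"
    grep -v '^#' "$1" | sort -k1,1V -k2,2n
}
```

Usage: `sort_vcf input.vcf > output.vcf`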

Count the number of reads in FASTQ file

echo "$(( $(wc -l < your_file.fastq) / 4 ))"
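FASTQ files are often stored gzip-compressed. The sketch below wraps the same line-count arithmetic so it works on both plain and .gz files; it assumes four lines per record and picks the branch from the file extension. The name `count_reads` is hypothetical.

```shell
# count_reads FILE: number of 4-line records in a plain or gzip-compressed
# FASTQ file (compression is detected from the .gz extension only).
count_reads() {
    case "$1" in
        *.gz) lines=$(gzip -dc "$1" | wc -l) ;;
        *)    lines=$(wc -l < "$1") ;;
    esac
    echo "$(( lines / 4 ))"
}
```

Usage: `count_reads your_file.fastq.gz`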

Docker post-installation steps

Create a docker group

sudo groupadd docker

Add your user to the docker group

sudo usermod -aG docker $USER

Log out and log back in, or activate the group change in the current shell by running:

newgrp docker
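To confirm that the group change has taken effect in the current session, the active group list can be inspected. A small sketch; the helper name `in_group` is an assumption for this example.

```shell
# in_group NAME: succeed if the current session's groups include NAME.
in_group() {
    id -nG | tr ' ' '\n' | grep -qxF -- "$1"
}

if in_group docker; then
    echo "docker group is active for this session"
else
    echo "docker group not active yet: log out and back in, or run newgrp docker"
fi
```

Once the group is active, `docker run hello-world` should work without sudo.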

tayabsoomro/BINF_COMMANDS.md