Skip to content

Instantly share code, notes, and snippets.

@tayabsoomro
Last active December 20, 2023 05:13
Show Gist options
  • Star 1 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save tayabsoomro/8f2d546cb7142f5d6e4c2ca897efb8b3 to your computer and use it in GitHub Desktop.
Save tayabsoomro/8f2d546cb7142f5d6e4c2ca897efb8b3 to your computer and use it in GitHub Desktop.
Useful UNIX commands for my bioinformatics work

USEFUL COMMANDS FOR BIOINFORMATICS WORK

  1. Converting FASTQ file to FASTA file
  2. Convert SAM file to BAM
  3. Sort BAM file
  4. Index BAM file
  5. Get subset of sequence from FASTA file
  6. Get particular record from multi-FASTA file
  7. Filter records based on the sequence length in FASTQ
  8. Local BLAST output format options
  9. Removing new lines from multi-FASTA file
  10. Filtering reads over 11Kb in length
  11. Removing duplicate lines
  12. Create a histogram of list of numbers
  13. Convert lowercase FASTA records to uppercase
  14. Compressing and indexing VCF file
  15. Sorting a VCF file based on chromosome and position
  16. Count the number of reads in FASTQ file
  17. Docker post-installation steps

Converting FASTQ file to FASTA file

sed -n '1~4s/^@/>/p;2~4p' input.fastq > output.fasta

Convert SAM file to BAM file

samtools view -b input.sam > output.bam

Sort BAM file

samtools sort input.bam > output.sorted.bam

Index BAM file

samtools index input.sorted.bam NOTE: This generates input.sorted.bam.bai file.

Get subset of sequence from FASTA file

awk -v start=$start -v end=$end -v name="name_here" '$0~name{getline seq; print substr(seq,start,end-start)}' input_sequence.fasta NOTE: change values for start and end accordingly.

Get particular record from multi-FASTA file

awk '/^>contig_1$/ {print;getline;print}' multi.fasta NOTE: change contig_1 accordingly.

Filter records based on the sequence length in FASTQ

awk 'BEGIN {FS = "\t"; OFS = "\n"} {header = $0; getline seq; getline qheader; getline qseq; if (length(seq)) >= 11000) { print header,seq,qheader,qseq}}' < input.fastq > filtered.fastq

Local BLAST output format options

Syntax for blastn: blastn -db {db_name} -query {query.fasta} -out {output_file} -outfmt {output_format} -num_threads {num_threads}

OUTPUT FORMAT

Alignment View Optiosn:
0 = pairwise
1 = query-anchored showing identities
2 = query-anchored no identities
3 = flat query-anchored, show identities
4 = flat query-anchored, no identities
5 - XML Blast output
6 - tabular
7 = tabular with comment lines
8 = Text ASN.1
9 = Binary ASN.1
10 - Comma-separated values
11 = BLAST archive format (ASN.1)

Options 6, 7, and 10 can be additionally configured to produce a custom format 
specified by space delimited format specifiers. The supported format 
specifiers are:
           qseqid means Query Seq-id
              qgi means Query GI
             qacc means Query accesion
          qaccver means Query accesion.version
             qlen means Query sequence length
           sseqid means Subject Seq-id
        sallseqid means All subject Seq-id(s), separated by a ';'
              sgi means Subject GI
           sallgi means All subject GIs
             sacc means Subject accession
          saccver means Subject accession.version
          sallacc means All subject accessions
             slen means Subject sequence length
           qstart means Start of alignment in query
             qend means End of alignment in query
           sstart means Start of alignment in subject
             send means End of alignment in subject
             qseq means Aligned part of query sequence
             sseq means Aligned part of subject sequence
           evalue means Expect value
         bitscore means Bit score
            score means Raw score
           length means Alignment length
           pident means Percentage of identical matches
           nident means Number of identical matches
         mismatch means Number of mismatches
         positive means Number of positive-scoring matches
          gapopen means Number of gap openings
             gaps means Total number of gaps
             ppos means Percentage of positive-scoring matches
           frames means Query and subject frames separated by a '/'
           qframe means Query frame
           sframe means Subject frame
             btop means Blast traceback operations (BTOP)
          staxids means Subject Taxonomy ID(s), separated by a ';'
        sscinames means Subject Scientific Name(s), separated by a ';'
        scomnames means Subject Common Name(s), separated by a ';'
       sblastnames means Subject Blast Name(s), separated by a ';'
                (in alphabetical order)
       sskingdoms means Subject Super Kingdom(s), separated by a ';'
                (in alphabetical order) 
           stitle means Subject Title
       salltitles means All Subject Title(s), separated by a '&lt;&gt;'
          sstrand means Subject Strand
            qcovs means Query Coverage Per Subject
          qcovhsp means Query Coverage Per HSP

Removing new lines from multi-FASTA file

awk '/^[>;]/ { if (seq) { print seq }; seq=""; print } /^[^>;]/ { seq = seq $0 } END { print seq }' input_file.fasta > outputfile.fasta

Filtering reads over 11KB in length

awk 'BEGIN {FS = "\t" ; OFS = "\n"} {header = $0 ; getline seq ; getline qheader ; getline qseq ; if (length(seq) >= 11000) {print header,seq,qheader,qseeq}}' < input.fastq > output.fastq

Removing duplicate lines

awk !x[$1]++ file > output_file

Create a histogram of list of numbers

awk -v size=20 '{ b=int($1/size); a[b]++; bmax=b>bmax?b:bmax; bmin=b<bmin?b:bmin } END { for(i=bmin;i<=bmax;++i) print i*size,(i+1)*size,a[i] } <file> NOTE: change bin size accordingly.

Convert lowercase FASTA records to uppcase

awk 'BEGIN{FS=" "}{if(!/>/){print toupper($0)}else{print $1}}' input.fna > output.fna

Compressing and indexing VCF file

bgzip -c file.vcf > file.vcf.gz
tabix -p vcf file.vcf.gz

Sorting a VCF file based on chromosome and position

sort -k1,1V -k2,2n input.vcf > output.vcf

The -k1,1V option tells sort to sort by the first column, using "version" sort, which is natural sort of (version) numbers within text

Count the number of reads in FASTQ file

echo "$(( $(wc -l < your_file.fastq) / 4 ))"

Docker post-installation steps

  1. Create a docker group
sudo groupadd docker
  1. Add your user to the docker group
sudo usermod -aG docker $USER
  1. Log out and log back in or activate the changes to groups by running:
newgrp docker
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment