Skip to content

Instantly share code, notes, and snippets.

@elsherbini
Last active October 12, 2017 16:02
Show Gist options
  • Save elsherbini/e2e30507bb49f435deb47b157c643cd1 to your computer and use it in GitHub Desktop.
Save elsherbini/e2e30507bb49f435deb47b157c643cd1 to your computer and use it in GitHub Desktop.
cutadapt adapter trimming

A quick and somewhat untested pipeline to run cutadapt.

This will take paired reads corresponding to a single genome (already demultiplexed) and perform quality filtering and adapter trimming.

input (2 files)

  1. forward reads
  2. reverse reads

output:

  1. forward reads with no adapter found tmp/{strain}/{strain}.trim_adapter.f.fq
  2. reverse reads with no adapter found tmp/{strain}/{strain}.trim_adapter.r.fq
  3. forward reads where adpater was found or quality was low tmp/{strain}/{strain}.trim_adapter.f.discard.fq
  4. reverse reads where adpater was found or quality was low tmp/{strain}/{strain}.trim_adapter.r.discard.fq
  5. (inscrutable) log file tmp/{strain}/{strain}.trim_adapter.log.txt

###Notes:

  1. I added XXXX to the end of the adapters we were trimming to avoid internal matches. See http://cutadapt.readthedocs.io/en/stable/recipes.html#avoid-internal-adapter-matches
  2. I used --untrimmed-output and --untrimmed-paired-output to get all the reads that had no adapter found in a separate file. I did this for debugging purposes. The default, if you leave those off, is to put the trimmed reads and reads that were fine without trimming in the same file. So the file I call "discard" would actually be the file you keep, it'd contain reads where the adapter was removed and untouched reads.

folder structure:

input
|
  -> {strain}
  |
    -> {strain}.raw_reads.f.fq
Snakefile
config.yaml
cutadapt_params:
quality_cutoff: 10
quality_base: 64 # Illumina 1.5
overlap: 3 #
STRAINS = glob_wildcards("input/{strain1}/{strain2}.raw_reads.f.fq").strain1
configfile: 'config.yaml'
print(config)
rule all:
input:
expand("tmp/{strain}/{strain}.trim_adapter.{direction}.fq", strain=STRAINS, direction=['f', 'r'])
rule remove_suffix:
input:
fastq = "input/{strain}/{strain}.raw_reads.{direction}.fq"
output:
fastq = "tmp/{strain}/{strain}.remove_suffix.{direction}.fq"
shell:
"sed -e '0~4N;s/;[0-9]$//g' {input.fastq} > {output.fastq}"
rule cutadapt:
input:
f = "tmp/{strain}/{strain}.remove_suffix.f.fq",
r = "tmp/{strain}/{strain}.remove_suffix.r.fq"
output:
f = "tmp/{strain}/{strain}.trim_adapter.f.fq",
r = "tmp/{strain}/{strain}.trim_adapter.r.fq",
f_removed = "tmp/{strain}/{strain}.trim_adapter.f.discard.fq",
r_removed = "tmp/{strain}/{strain}.trim_adapter.r.discard.fq",
log = "tmp/{strain}/{strain}.trim_adapter.log.txt"
params:
overlap = config["cutadapt_params"]["overlap"],
quality_base = config["cutadapt_params"]["quality_base"],
quality_cutoff = config["cutadapt_params"]["quality_cutoff"]
shell:
"""
cutadapt -a CTGTCTCTTATACACATCTXXXXXX -a CAAGCAGAAGACGGCATACXXXXXX \\
--pair-filter=both \\
--info-file {output.log} \\
--untrimmed-output {output.f} \\
--untrimmed-paired-output {output.r} \\
--overlap={params.overlap} \\
--quality-base={params.quality_base} \\
--quality-cutoff={params.quality_cutoff} \\
-o {output.f_removed} \\
-p {output.r_removed} \\
{input.f} {input.r}
"""
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment