@kgutwin
Last active March 2, 2021 16:44
Nextflow scaling test using AWS CloudShell
ecs-get-creds.sh

#!/bin/bash
# Fetch the CloudShell session credentials from the ECS per-container
# credential endpoint and write them to ~/.aws/credentials as the
# [default] profile, so that Nextflow can pick them up.
mkdir -p ~/.aws
curl -s -H "Authorization: $AWS_CONTAINER_AUTHORIZATION_TOKEN" \
     $AWS_CONTAINER_CREDENTIALS_FULL_URI \
  | jq -r '["[default]", "aws_access_key_id = " + .AccessKeyId, "aws_secret_access_key = " + .SecretAccessKey, "aws_session_token = " + .Token] | join("\n")' \
  > ~/.aws/credentials

Nextflow on AWS CloudShell

This is a quick overview for running a Nextflow job backed by AWS Batch from within an AWS CloudShell console. It is particularly useful for smoke-testing a Genomics Workflow Core deployment, and can also be used for quick compute runs where all of your data and/or code is already stored in S3. Because of CloudShell's limits on storage size and session persistence, it is not a good solution for long-term or large-scale data analysis or job preparation.

Install Java

The CloudShell environment does not have Java pre-installed, so this must be done first:

sudo yum -y install java
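
If you want to confirm the install before moving on (an optional check, not part of the original gist), printing the Java version is sufficient:

java -version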

NOTE: Because operating system-level changes are not persisted between CloudShell sessions, you must run this command on every new CloudShell session. All of the following commands will persist between sessions.
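
As an optional convenience, and purely my own suggestion rather than part of the gist: since CloudShell does persist your home directory between sessions, you could append the install command to ~/.bashrc so it runs automatically in every new session:

echo 'sudo yum -y -q install java' >> ~/.bashrc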

Install Nextflow

Nextflow and its dependencies can easily be installed locally with another command:

wget -qO- https://get.nextflow.io | bash

Run Nextflow to verify that it installed successfully:

./nextflow -v
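
If you would like a slightly more detailed check (optional), nextflow info also reports the Java runtime and operating system that Nextflow detected:

./nextflow info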

Retrieve this gist and update the config

The contents of this gist can be downloaded via Git:

git clone https://gist.github.com/44ef541d2b0945ffa2cee2ec54cf48e9.git quick-test
cd quick-test

Next, edit nextflow.config so that process.queue points to your desired Batch queue and workDir points to your S3 bucket:

vi nextflow.config
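
If you do not remember the exact queue name, the AWS CLI bundled with CloudShell already uses your session credentials, so you can list the available Batch queues first (an extra convenience step, not part of the original gist):

aws batch describe-job-queues --query 'jobQueues[].jobQueueName' --output text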

Retrieve current AWS credentials

Nextflow is currently unable to access the credentials assigned to the CloudShell session because those credentials are provided by the ECS per-container credential service, which is not offered at the standard IMDS endpoint. To work around this problem, run the ecs-get-creds.sh script to retrieve the ECS credentials and store them in ~/.aws/credentials.

bash ecs-get-creds.sh

NOTE: For long-running sessions, you may need to re-run this command if the credentials expire.
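
A quick way to check whether the stored credentials are still valid (an optional check) is to call STS; if it returns an expired-token error, re-run the script above:

aws sts get-caller-identity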

Run the test Nextflow job

Once your credentials are installed, the Nextflow job can simply be started with nextflow run:

../nextflow run test-scale.nf

It may take some time for the first jobs to start running, because Batch/ECS needs to spin up container hosts first. The number of parallel tasks can be set with the --n_tasks parameter:

../nextflow run test-scale.nf --n_tasks 5000
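
If you would rather watch the scale-out from the command line than from the Batch console (an optional step; my-batch-queue below stands in for whatever queue you configured), you can poll the queue for jobs in a given state:

aws batch list-jobs --job-queue my-batch-queue --job-status RUNNING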

The test workflow does not normally produce any output of its own, but you can view the individual task logs either through S3 or through the Batch dashboard.
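
As a sketch of the S3 route (using the placeholder bucket and prefix from nextflow.config below), listing the work directory shows each task's staging directory along with the log and script files Nextflow wrote for it:

aws s3 ls s3://my-bucket/_nextflow/runs/ --recursive | head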

nextflow.config

workDir = "s3://my-bucket/_nextflow/runs"
process.executor = 'awsbatch'
process.queue = 'my-batch-queue'
aws.region = 'us-east-1'
aws.batch.cliPath = '/usr/local/aws-cli/v2/current/bin/aws'
test-scale.nf

// default number of tasks
params.n_tasks = 50

n_tasks = params.n_tasks

Channel.from( 1..n_tasks ).set { test_index }

process test {
    container "public.ecr.aws/amazonlinux/amazonlinux:latest"

    input:
    val(v) from test_index

    """
    echo $v
    date
    """
}