This is a quick overview for running a Nextflow job backed by AWS Batch from within an AWS CloudShell console. This is particularly useful for smoke-testing a Genomics Workflow Core deployment, and can also be used for quick compute runs where all of your data and/or code is already stored in S3. Because of CloudShell's limits on storage size and session persistence, it is not a good solution for long-term or large-scale data analysis or job preparation.
The CloudShell environment does not have Java pre-installed, so it must be installed first:
sudo yum -y install java
NOTE: Because operating system-level changes are not persisted between CloudShell sessions, you must run this command in every new CloudShell session. The effects of all of the following commands will persist between sessions, because they only modify your home directory.
Nextflow and its dependencies can then be installed into the current directory with a single command:
wget -qO- https://get.nextflow.io | bash
Run Nextflow to verify that it installed successfully:
./nextflow -v
The contents of this gist can be downloaded via Git.
git clone https://gist.github.com/44ef541d2b0945ffa2cee2ec54cf48e9.git quick-test
cd quick-test
Next, edit nextflow.config to point process.queue at your desired Batch job queue and workDir at a location in your S3 bucket.
vi nextflow.config
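The resulting configuration might look something like the following sketch; the queue name, bucket, and region here are placeholders, and the gist's actual config may contain additional settings:

```groovy
// Hypothetical example values; substitute your own queue, bucket, and region.
process.executor = 'awsbatch'
process.queue    = 'my-batch-job-queue'        // your AWS Batch job queue
workDir          = 's3://my-bucket/nf-work'    // S3 path for intermediate files
aws.region       = 'us-east-1'
```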
Nextflow is currently unable to access the credentials assigned to the CloudShell session because those credentials are provided by the ECS per-container credential service, which is not offered at the standard IMDS endpoint. To work around this problem, run the ecs-get-creds.sh script to retrieve the ECS credentials and store them in ~/.aws/credentials.
bash ecs-get-creds.sh
NOTE: For long-running sessions, you may need to re-run this command if the credentials expire.
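The script's behavior can be sketched roughly as follows. This is an illustrative reconstruction, not the gist's actual script: CloudShell advertises its session credentials through the ECS container credential endpoint via the AWS_CONTAINER_CREDENTIALS_FULL_URI and AWS_CONTAINER_AUTHORIZATION_TOKEN environment variables, and the script presumably fetches the credential JSON from that endpoint and writes it out in credentials-file format.

```shell
# Sketch of what ecs-get-creds.sh likely does; the real script may differ.

# Convert the endpoint's JSON response (AccessKeyId / SecretAccessKey / Token)
# into an AWS credentials-file profile on stdout.
json_to_profile() {
  python3 -c '
import json, sys
c = json.load(sys.stdin)
print("[default]")
print("aws_access_key_id = " + c["AccessKeyId"])
print("aws_secret_access_key = " + c["SecretAccessKey"])
print("aws_session_token = " + c["Token"])
'
}

# Fetch and install the credentials only when the ECS credential endpoint is
# present, i.e. when actually running inside CloudShell.
if [ -n "${AWS_CONTAINER_CREDENTIALS_FULL_URI:-}" ]; then
  mkdir -p ~/.aws
  curl -s -H "Authorization: $AWS_CONTAINER_AUTHORIZATION_TOKEN" \
    "$AWS_CONTAINER_CREDENTIALS_FULL_URI" | json_to_profile > ~/.aws/credentials
fi
```

Writing the credentials to ~/.aws/credentials lets Nextflow pick them up through the standard AWS SDK credential chain, bypassing the IMDS lookup that fails in CloudShell.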
Once your credentials are installed, the Nextflow job can be started with nextflow run:
../nextflow run test-scale.nf
It may take some time for the first jobs to start running, as Batch/ECS needs to spin up container hosts first. The number of parallel tasks is controlled by the --n_tasks parameter:
../nextflow run test-scale.nf --n_tasks 5000
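For context, a scale-test workflow of this kind typically just fans out many trivial tasks across the Batch queue. The following is a hypothetical DSL2 sketch of what test-scale.nf might look like; the gist's actual script may differ:

```groovy
// Illustrative sketch only — fan out n_tasks trivial jobs across Batch.
params.n_tasks = 100

process echo_task {
    input:
    val i

    script:
    """
    echo "task ${i}"
    """
}

workflow {
    Channel.of(1..params.n_tasks) | echo_task
}
```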
The workflow normally does not produce any console output, but you can view the individual task logs either in S3 (under the configured work directory) or through the Batch dashboard.