CF Documentation

This documentation is written in markdown and comes bundled with the Cluster Flow source code.

Introduction

Cluster Flow is a workflow manager designed to run bioinformatics pipelines. It is operated through a single command, cf, which can be used to launch, configure, monitor and cancel pipelines.

Tutorial Videos

To get you started with Cluster Flow, there are a few tutorial videos on YouTube. Click here to watch.

Installation

Requirements

Cluster Flow is designed to work with a computing cluster. It currently supports the Sun GRIDEngine, LSF and SLURM job managers (not PBS, Torque or others).

If you don't have a cluster with a supported manager, you can run Cluster Flow on any command-line machine in local mode. This writes a bash script and runs it as a job in the background.

To run analyses, you will also need the required tools to be installed. Cluster Flow is designed to work with the environment module system and load tools as required, but if software is available on the PATH it can work without this.

Cluster Flow itself is written in Perl. It has minimal dependencies, all of which are core Perl packages.

Environment Module

If you are a user on an HPC cluster, you may already have Cluster Flow installed on your cluster as an environment module. If so, you may be able to load it using:

module load clusterflow

Manual Installation

Cluster Flow is a collection of stand-alone scripts, mostly written in Perl.

  1. Download Cluster Flow (see the releases page)
    wget https://github.com/ewels/clusterflow/archive/v0.4.tar.gz
  2. Extract the files
    mkdir clusterflow
    tar -C clusterflow --strip-components=1 -zxvf v0.4.tar.gz
  3. Create & configure the site-wide configuration file
    cd clusterflow
    cp clusterflow.config.example clusterflow.config
    vi clusterflow.config

You must specify your environment in the config file (@cluster_environment: local, GRIDEngine, SLURM or LSF). Most other settings are optional.

The cf executable must be in your system PATH, so that you can run it easily from any directory. Make sure that you run the Configuration Wizard (described below) so that this setting is added to your ~/.bashrc file.

If you prefer, you can symlink the cf executable to ~/bin.

Configuration Wizard

Once Cluster Flow has been set up site-wide, you need to configure it for your personal use:

cf --setup

This will launch a wizard to write a config file for you, with details such as e-mail address and notification settings.

Adding Reference Genomes

Most analysis pipelines need a reference genome. This can exist in a central location or in your personal setup (or both).

If you're using the Swedish UPPMAX cluster, please see these instructions.

You can add your reference genome paths with the following wizard:

cf --add_genome

Do a test run!

That should be it! Log out of your session and in again to activate any new bash settings. Then try launching a test run:

cf --genome GRCh37 sra_bowtie ftp://ftp.ddbj.nig.ac.jp/ddbj_database/dra/sralite/ByExp/litesra/SRX/SRX031/SRX031398/SRR1068378/SRR1068378.sra

This will download SRR1068378 (Human H3K4me3 ChIP-Seq data), convert to FastQ, run FastQC, Trim Galore! and align with bowtie.

Usage

Listing what's available

Once Cluster Flow is up and running, you can list the available pipelines, modules and reference genomes using the following commands:

cf --pipelines            # List pipelines
cf --modules              # List modules
cf --genomes              # List reference genomes

Getting help

To get instructions for how to use Cluster Flow on the command line, use:

cf --help

You can also use this command to find out more information about pipelines and modules:

cf --help [module-name]
cf --help [pipeline-name]

Starting a run

In its most basic form, analyses are run as follows:

cf [pipeline] [files]

Single modules can also be specified instead of a pipeline:

cf [module] *.bam

Most pipelines and modules will need a reference genome, specified using --genome:

cf --genome GRCh37 sra_bowtie *.sra

The ID following --genome is the ID assigned when adding the reference genome to Cluster Flow. This can be seen when listing genomes with cf --genomes.

Module Parameters

The default execution of different tools can be modified by using module parameters. These can be set within pipeline scripts or on the command line. Specifying --params [example] will apply the [example] parameter to every module in the pipeline.

Different modules support different parameters. Some are flags, some are key-value pairs. To find out more, see the Modules documentation.

Typical things you can do are to set adapter trimming preferences with TrimGalore!:

cf --genome GRCh37 --params clip_r1=6 --params min_readlength=15 sra_bowtie *.sra

or run Bismark in PBAT mode:

cf --genome GRCm38 --params pbat fastq_bismark *.fastq.gz

When setting parameters in pipeline scripts, simply add them after the module names (tab-delimited). For example, this is the trim_bowtie_miRNA pipeline:

#trim_galore adapter=ATGGAATTCTCG
    #bowtie mirna

This sets a custom adapter for trimming and tells the bowtie module to use the mirna parameter.

Filename checking

When launching Cluster Flow, a number of filename checks are performed. If input files are FastQ and the filenames look like paired-end files, it launches in paired-end mode (this can be overridden with --single).

If a mixture of file types or paired end / single end FastQ files are found, Cluster Flow will show an error and exit. This step can be skipped by using the --no_fn_check parameter.

If @merge_regex is configured in the configuration file, matching input files will be merged before processing.

Downloading files

As well as supplying Cluster Flow with input files, you can give URLs. This will cause Cluster Flow to add the cf_download module to the start of your pipeline to download the data.

Cluster Flow will recognise anything starting with http, https or ftp as a URL. Downloads are processed in series to avoid overwhelming the internet connection.

If using the --file_list parameter, you can also specify a filename for each download. This should be added after the download URL, separated by a tab character. This is particularly useful when downloading arbitrarily named SRA files and is compatible with the Labrador Dataset Manager. See also the stand-alone SRA-Explorer tool if you don't have Labrador installed.
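The rules above can be sketched in a few lines of Python. This is only an illustration of the documented behaviour, not Cluster Flow's own code (which is written in Perl); the function name parse_file_list_line is hypothetical.

```python
def parse_file_list_line(line):
    """Split one file-list line into (target, is_download, save_as).

    Hypothetical helper that mirrors the documented rules only.
    """
    fields = line.rstrip("\n").split("\t")
    target = fields[0].strip()
    # Anything starting with http, https or ftp counts as a URL.
    # "https" also starts with "http", so two prefixes cover all three.
    is_download = target.startswith(("http", "ftp"))
    # An optional second tab-separated field names the downloaded file.
    save_as = None
    if is_download and len(fields) > 1 and fields[1].strip():
        save_as = fields[1].strip()
    return target, is_download, save_as


print(parse_file_list_line("ftp://example.org/SRR0000001.sra\tsample_A.sra"))
print(parse_file_list_line("local_reads.fastq.gz"))
```

Note that a plain prefix test like this would also treat a local file called http_reads.fq as a URL, which is what the wording "anything starting with" implies.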

Avoiding cluster overload

Cluster Flow has a number of features built in to avoid swamping your cluster with jobs.

Firstly, Cluster Flow limits the number of parallel runs created. Defaults are set in the config file with @split_files (default 1) and @max_runs (default 12). @split_files defines the minimum number of files per run, while @max_runs defines the maximum number of parallel runs and adds more files per run if needed.
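To make the interaction between the two settings concrete, here is a minimal Python sketch. It assumes that the number of files per run is simply rounded up until the run count fits within @max_runs; the exact arithmetic Cluster Flow uses may differ, and plan_runs is a hypothetical name.

```python
import math


def plan_runs(files, split_files=1, max_runs=12):
    """Group input files into runs (illustrative only)."""
    per_run = max(1, split_files)
    if max_runs > 0:  # 0 disables the limit
        # Add more files per run if needed to stay within max_runs.
        per_run = max(per_run, math.ceil(len(files) / max_runs))
    return [files[i:i + per_run] for i in range(0, len(files), per_run)]


runs = plan_runs([f"sample_{n}.fq" for n in range(30)])
print(len(runs), "runs, up to", max(len(r) for r in runs), "files each")
```

With 30 input files and the defaults, this gives 10 runs of 3 files rather than 30 separate runs.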

Cluster Flow also tries to intelligently limit the memory usage and number of cores each module uses. The config options @total_cores and @total_mem specify the maximum resources to be used by each Cluster Flow pipeline. These are split up amongst the maximum number of simultaneous jobs and presented to each module. The modules can then request resources, making use of optional parallelisation where available.

Command Line

Cluster Flow Command Line Reference

Cluster Flow pipelines are launched as follows:

cf [flags] <pipeline> <input-files>

These flags are used to customise run-time parameters for the pipeline that Cluster Flow will launch.

--genome

Default: none

Some pipelines which carry out a reference genome alignment require a genome directory path to be set. Requirements for format may vary between modules.

--paired

Default: Auto-detect

If specified, Cluster Flow will send two files to each run, assuming that the order that the file list is supplied in corresponds to two read files. If an odd number of files is supplied, the final file is submitted as single end.
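The pairing rule is easy to picture with a short Python sketch (illustrative only, with a hypothetical function name): consecutive files are paired in the order given, and a trailing odd file is left on its own.

```python
def pair_in_order(files):
    """Pair consecutive files; a trailing odd file stays single-end."""
    return [tuple(files[i:i + 2]) for i in range(0, len(files), 2)]


print(pair_in_order(["a_1.fq", "a_2.fq", "b_1.fq", "b_2.fq", "c.fq"]))
```

Because only the order matters here, it is worth checking that your shell expands the file list in the order you expect.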

--single

Default: Auto-detect

If specified, Cluster Flow will ignore its auto-detection of paired end input files and force the single end processing of each input file.

--no_fn_check

Default: none

Cluster Flow will make sure that all of the input files have the same file extension to avoid accidentally submitting files that aren’t part of the run. Specifying this parameter disables this check.
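A minimal Python sketch of this kind of check is shown below. It is an assumption about how such a check could work rather than a copy of Cluster Flow's Perl code; in particular, ignoring a trailing .gz when comparing extensions is a choice made for this example.

```python
from pathlib import PurePath


def file_type(name):
    """Return the lower-cased extension, ignoring a trailing .gz."""
    suffixes = PurePath(name).suffixes
    if suffixes and suffixes[-1] == ".gz":
        suffixes = suffixes[:-1]
    return suffixes[-1].lower() if suffixes else ""


def check_same_type(files):
    """Raise ValueError if the input files mix file types."""
    types = {file_type(f) for f in files}
    if len(types) > 1:
        raise ValueError(f"Mixed input file types: {sorted(types)}")


check_same_type(["a.fastq.gz", "b.fastq.gz"])  # passes silently
```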

--file_list

Default: none

If specified, you can define a file containing a list of filenames to pass to the pipeline (one per line). This is particularly useful when supplying a list of download URLs.

--params

Default: none

Pipelines and their modules are configured to run with sensible defaults. Some modules accept parameters which change their behaviour. Typically, these are set within a pipeline config file. By using --params, you can add extra parameters at run time. These will be set for every module in the pipeline (though they probably won’t all recognise them).

--split_files

Default: (config file - typically 1)

Cluster Flow generates multiple parallel runs for the supplied input files. This is typically a good thing, as the cluster is designed to run jobs in parallel. However, some jobs may involve many small tasks with a large number of input files, and 1:1 parallelisation may not be practical. In such cases, the number of input files to assign to each run can be set with this flag.

--max_runs

Default: (config file - typically 12)

It can sometimes be a pain to count the number of input files and work out a sensible number to use with --split_files. Cluster Flow can take the --max_runs value and divide the input files into this number of runs, setting --split_files automatically.

A default can be set for --max_runs in the clusterflow.config file, and this value is set to 12 if no value is found in the config files. Set to 0 to disable.

This parameter will override anything set using --split_files.

--runfile_prefix

Default: none

Optional custom prefix for run file filenames. This is useful if you are running multiple instances of Cluster Flow with the same input file in the same directory, as it avoids potential clashes / mixups. For example:

cf --runfile_prefix bt1 --genome GRCh37 fastq_bowtie1 my_sample.fq
cf --runfile_prefix bt2 --genome GRCh37 fastq_bowtie2 my_sample.fq

--ref

Default: none

Specify a reference genome without adding to the genomes.config file. Should be in the format <ref_type>=<path>, eg:

cf --ref bowtie=/path/to/bowtie/index <pipeline> <files>

--dry_run

Default: false

Do everything except for actually launching cluster jobs. Useful for testing and checking that jobs will be created properly.

Customising Behaviour

Typically, Cluster Flow settings are set in static configuration files. However, sometimes it can be useful to specify parameters on a one-off basis on the command line.

--email

Default: (config file)

Cluster Flow can send notification e-mails regarding the status of runs. Typically, the e-mail address should be set using @email in ~/clusterflow.config (see the Configuration section). This parameter allows you to override that setting on a one-off basis.

--priority

Default: (config file - typically -500)

Many cluster managers can use a priority system to manage jobs in the queue. Typically, GRIDEngine priorities can be set ranging from -1000 to 0.

--cores

Default: (config file - typically 64)

Override the maximum number of cores allowed for each Cluster Flow pipeline, typically set in the Cluster Flow config file. For more information see Avoiding cluster overload.

--mem

Default: (config file - typically 128G)

Setting --mem allows you to override the maximum amount of simultaneously assigned memory. For more information see Avoiding cluster overload.

--time

Default: (config file - typically none)

Override the maximum requested time assigned to jobs. For more information, see Avoiding cluster overload.

--project

Default: (config file - typically none)

Specify the project to use on the cluster for this run.

--qname

Default: (config file - typically none)

Specify a custom cluster queue to use for this run.

--environment

Default: (config file - custom)

Override the default environment to use for this pipeline run. This is useful for testing or small jobs, which can run using bash commands instead of being submitted as cluster jobs. For example:

cf --environment local test_pipeline *.txt

--notifications

Default: (config file - typically cea)

Cluster Flow can e-mail you notifications about the progress of your runs. There are several levels of notification that you can choose using this flag. They are:

  • c - Send a notification when all runs in a pipeline are completed
  • r - Send a notification when each run is completed
  • e - Send a notification when a cluster job ends
  • s - Send a notification if a cluster job is suspended
  • a - Send a notification if a cluster job is aborted

Setting these options at run time with the --notifications flag will override the settings present in your clusterflow.config configuration files. Note: setting the e flag when using many input files with a long pipeline may cause your inbox to be flooded, as a notification is sent every time a cluster job ends.
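As a small illustration, the letters can be expanded into the notification names used by the @notification config option. The mapping below is inferred from the matching descriptions in this documentation, and expand_notifications is a hypothetical helper written in Python, not part of Cluster Flow.

```python
NOTIFICATION_FLAGS = {
    "c": "complete",  # all runs in the pipeline have finished
    "r": "run",       # each run has finished
    "e": "end",       # a cluster job has ended
    "s": "suspend",   # a cluster job was suspended
    "a": "abort",     # a cluster job was aborted
}


def expand_notifications(flags):
    """Turn a string such as 'cea' into a list of notification names."""
    unknown = sorted(set(flags) - set(NOTIFICATION_FLAGS))
    if unknown:
        raise ValueError(f"Unknown notification flag(s): {unknown}")
    return [NOTIFICATION_FLAGS[f] for f in flags]


print(expand_notifications("cea"))  # ['complete', 'end', 'abort']
```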

Other Functions

These flags instruct Cluster Flow to do something other than submit a pipeline.

--qstat / --qstatall

When you have a lot of jobs running and queued, the qstat summary can get a little overwhelming. To combat this and show the job hierarchy in an intuitive manner, you can run cf --qstat. This parses the qstat output and displays it nicely. cf --qstatall does the same, but for all jobs by all users.

You'll probably find that you want to run this command quite a lot. To make it a little less clumsy, you can create aliases in your .bashrc or .bash_profile scripts, which run every time you log in.

alias qs='cf --qstat'
alias qsa='cf --qstatall'

To append these lines to your .bashrc script you can use the following command:

echo -e "alias qs='cf --qstat'\nalias qsa='cf --qstatall'" >> ~/.bashrc

Note: These tools don't work with LSF, as I don't have an LSF testing server to work on. Please get in touch if you can help.

--qdel

Sometimes you may be running multiple pipelines and want to stop just one. It can be a pain to find the job numbers to do this manually, so instead you can use Cluster Flow to kill these jobs. When running cf --qstat, ID values are printed for each pipeline. For example:

$ qs

======================================================================
 Cluster Flow Pipeline: fastq_bowtie
 Submitted:             17 hours, 1 minutes, 46 seconds ago
 Working Directory:     /path/to/working/dir
 Cluster Flow ID:       fastq_bowtie_1468357637
 Submitted Jobs:        29
 Running Jobs:          1
 Queued Jobs:           2 (dependencies)
 Completed Jobs:        26 (89%)
======================================================================

You can then use this Cluster Flow ID to kill all jobs within that pipeline:

cf --qdel fastq_bowtie_1468357637

--add_genome

Run the Cluster Flow interactive wizard to add new genomes.

--setup

Run the interactive setup wizard to create a configuration file for Cluster Flow.

--version

Display the currently installed version of Cluster Flow.

--check_updates

Check online for any available Cluster Flow updates.

--help

Show a help message describing the different command line flags available.

Configuration

Config file locations

Cluster Flow will search three locations for a config file every time it is run. Variables found in each file can override those read from a previous config file. They are, in order of priority:

  • <working directory>/clusterflow.config
    • A config file found in the current working directory when a pipeline is executed has top priority, trumped only by command line parameters.
  • ~/clusterflow.config
    • A config file in your home directory can be used to set parameters such as notification level and e-mail address.
  • <installation directory>/clusterflow.config
    • A config file in the Cluster Flow installation directory is ideal for common settings specific to the environment.

Config files contain key: value pairs. Syntax is as follows: @key value (tab delimited, one per line). The Cluster Flow source code comes with an example config file called clusterflow.config.example.

Typically, there will be a config file in the installation directory which contains the settings that make Cluster Flow work. Each user will then have a personal configuration file in their home directory containing settings such as a notification e-mail address.
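The file format and the priority order can be sketched together in Python. This is a simplified illustration, not Cluster Flow's parser: it keeps only the last value for each key (so repeatable keys such as @notification are not handled), and it splits on any whitespace rather than strictly on tabs. Both function names are hypothetical.

```python
def parse_config(text):
    """Parse '@key value' lines into a dict (last value wins)."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("@"):
            continue  # skip blank lines, /* comments */ and anything else
        parts = line[1:].split(None, 1)
        if parts:
            settings[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return settings


def merge_configs(install_cfg, home_cfg, cwd_cfg):
    """Merge settings from lowest to highest priority."""
    merged = {}
    for cfg in (install_cfg, home_cfg, cwd_cfg):
        merged.update(cfg)
    return merged


install = parse_config("@cluster_environment\tSLURM\n@max_runs\t12\n")
home = parse_config("@email\tme@example.com\n@max_runs\t6\n")
print(merge_configs(install, home, {}))
```

Here the personal setting for @max_runs replaces the site-wide one, while @cluster_environment is inherited unchanged.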

Environment Setup

The key things to set up when installing Cluster Flow are the variables that dictate how CF should interact with your cluster - what commands it should use to submit jobs.

Cluster Flow currently supports GRIDEngine (SGE), SLURM and LSF, as well as running locally using background bash jobs. You can specify which environment to use with @cluster_environment:

/* Options: local, GRIDEngine, SLURM or LSF */
@cluster_environment    SLURM

In most cases, that should be enough to get Cluster Flow to work! However, some people have some specific variables that need to be submitted with batch jobs (eg. project identifiers, time limits, other custom flags). If this is the case, the job submission command can be customised with the @custom_job_submit_command config variable.

To use this, enter your typical submission command with the following placeholders which will be replaced at run time:

  • {{command}}
    • The actual command which will be run to execute the module file
  • {{job_id}}
    • The unique identifier assigned to the job, which is used to set up job dependencies
  • {{outfn}}
    • The filename of the log file to capture STDOUT
  • {{cores}}
    • How many cores to assign
  • {{mem}}
    • How much memory to assign
  • {{time}}
    • How much time to assign
  • {{priority}}
    • A priority to set, defined in the config file
  • {{project}}
    • The cluster project to use
  • {{qname}}
    • The cluster queue name
  • {{email}}
    • The user's e-mail address for cluster job notifications (if set)
  • {{notifications}}
    • A string describing which notifications are to be sent (syntax depends on the environment set above)

Simply omit any variables which are not needed on your cluster. For example:

@custom_job_submit_command      sbatch  -A MY_PROJECT_ID -t 2-00:00:00 -p core -n {{cores}} --open-mode=append -o {{outfn}} -J {{job_id}} {{notifications}} --wrap="{{command}}"
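The placeholder replacement can be pictured with the Python sketch below. It is a stand-in for whatever Cluster Flow does internally: the function name and the example values are hypothetical, and replacing unset placeholders with an empty string is an assumption made for this example.

```python
import re


def fill_submit_command(template, values):
    """Replace {{name}} placeholders; unset ones become empty strings."""
    def lookup(match):
        return str(values.get(match.group(1), ""))
    return re.sub(r"\{\{(\w+)\}\}", lookup, template)


template = 'sbatch -n {{cores}} -o {{outfn}} -J {{job_id}} --wrap="{{command}}"'
values = {  # hypothetical run-time values
    "cores": 4,
    "outfn": "sample_1.log",
    "job_id": "cf_example_job_1",
    "command": "echo hello",
}
print(fill_submit_command(template, values))
```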

Cluster Flow will generate its own sensible default if this isn't set, so it's worth trying it without first.

Note: Cluster Flow will append job dependency strings to the end of your custom command. These are system-specific, so it's important that @cluster_environment is correct.

Config File reference

The following section describes the available variables that can be set in the config file. For an example, see the clusterflow.config.example file that comes bundled with Cluster Flow.

@email

Sets your e-mail address, used for e-mail notifications.

@colourful / @colorful

Set to true to make the output from cf --qstat and cf --qstatall colourful (and hopefully easier to read).

@colourful  1

@merge_regex

A regex used to automatically merge files before pipeline processing starts. This works by matching a single regex group within a filename. If multiple input files have the same matching group, they will be merged. The regex group is then used to give the output filename.

For example, given the following config regex:

@merge_regex    [1-8]_[0-9]{6}_[a-zA-Z0-9]+_(P\d+_\d+_[12]).fastq.gz

These input files:

1_160312_CDSH32SDB3889_P1234_001_1.fastq.gz
1_160312_CDSH32SDB3889_P1234_001_2.fastq.gz
2_160312_CDSH32SDB3889_P1234_001_1.fastq.gz
2_160312_CDSH32SDB3889_P1234_001_2.fastq.gz

Would give the resulting merged files:

P1234_001_1.fastq.gz
P1234_001_2.fastq.gz
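The grouping step in this example can be reproduced with a short Python sketch. It only shows which files would be grouped together; the merging itself is not shown, files that don't match are ignored, and adding .fastq.gz to the captured group to form the output name is an assumption based on the example above. The function name is hypothetical.

```python
import re
from collections import defaultdict


def group_for_merging(files, merge_regex, extension=".fastq.gz"):
    """Map each merged output filename to the input files sharing a group."""
    pattern = re.compile(merge_regex)
    groups = defaultdict(list)
    for name in files:
        match = pattern.search(name)
        if match:
            groups[match.group(1) + extension].append(name)
    return dict(groups)


merge_regex = r"[1-8]_[0-9]{6}_[a-zA-Z0-9]+_(P\d+_\d+_[12]).fastq.gz"
files = [
    "1_160312_CDSH32SDB3889_P1234_001_1.fastq.gz",
    "1_160312_CDSH32SDB3889_P1234_001_2.fastq.gz",
    "2_160312_CDSH32SDB3889_P1234_001_1.fastq.gz",
    "2_160312_CDSH32SDB3889_P1234_001_2.fastq.gz",
]
merged = group_for_merging(files, merge_regex)
for output, inputs in merged.items():
    print(output, "<-", inputs)
```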

@split_files

The default number of input files to send to each run. Typically set to 1.

@max_runs

The maximum number of parallel runs that Cluster Flow will set off in one go. Default is 12 to avoid swamping the cluster for all other users.

@total_cores

The total number of cores available to a Cluster Flow pipeline. Modules are given a recommended number of cores so that resources can be allocated without swamping the cluster.

@total_mem

The total amount of memory available to a Cluster Flow pipeline. Modules are given a recommended quota so that resources can be allocated without swamping the cluster.

@max_time

The maximum time that a job should request in a Cluster Flow pipeline. For example, to prevent jobs from requesting more than 10 days:

@max_time   10-00

@time_multiplier

If your cluster is running slowly and the default time limits specified in Cluster Flow modules are not enough, jobs will fail due to timing out. @time_multiplier is a quick and dirty way to avoid this. Setting @time_multiplier to 2 will double the requested time for every job. Note that these times will still be capped by @max_time.
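The relationship between the two settings is roughly as follows. This Python sketch works in minutes to sidestep the time string format, which is not parsed here; requested_minutes is a hypothetical name and the numbers are made up.

```python
def requested_minutes(module_minutes, time_multiplier=1, max_minutes=None):
    """Scale a module's requested time, then cap it at the maximum."""
    minutes = module_minutes * time_multiplier
    if max_minutes is not None:
        minutes = min(minutes, max_minutes)
    return int(minutes)


ten_days = 10 * 24 * 60  # 14400 minutes
print(requested_minutes(600, time_multiplier=2, max_minutes=ten_days))
print(requested_minutes(10000, time_multiplier=2, max_minutes=ten_days))
```

The first job gets its doubled request of 1200 minutes; the second would need 20000 minutes after doubling, so it is capped at 14400.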

@priority

The priority to give to cluster jobs.

@cluster_environment / @custom_job_submit_command

See the Environment Setup section above.

@ignore_modules

If you do not use environment modules on your system, you can prevent Cluster Flow from trying to use them (and giving a warning) by adding this line to your config file.

@environment_module_always

Specify an environment module to always load for every Cluster Flow pipeline. Can be used multiple times.

@environment_module_alias

If using environment modules, you may get some errors claiming that certain tools are not installed. If you think that you do have that tool installed, it could be because of a minor difference in the module name (eg. fastqc versus FastQC). You can configure aliases in your configuration file. You can also use these aliases to specify specific software versions for Cluster Flow.

Aliases are added with the @environment_module_alias tag. For example:

@environment_module_alias   fastqc  FastQC/0.11.2
@environment_module_alias   trim_galore TrimGalore

@log_highlight_string / @log_warning_string

To pull out specific highlights or warnings from log files, you can specify search strings with these tags. If found, the e-mail will be highlighted accordingly and the lines from the log file will be displayed at the top of the report e-mail.

For example:

@log_highlight_string at least one reported alignment
@log_warning_string job failed
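In Python terms, the search amounts to something like the sketch below. Whether Cluster Flow matches case-sensitively or treats the strings as plain substrings is an assumption here, and the log lines are invented sample data.

```python
def scan_log(lines, highlight_strings, warning_strings):
    """Return (highlights, warnings): log lines containing a search string."""
    highlights = [l for l in lines if any(s in l for s in highlight_strings)]
    warnings = [l for l in lines if any(s in l for s in warning_strings)]
    return highlights, warnings


log = [  # invented sample log lines
    "# reads processed: 1000",
    "# reads with at least one reported alignment: 950 (95.00%)",
    "Error: job failed with exit code 1",
]
highlights, warnings = scan_log(
    log,
    highlight_strings=["at least one reported alignment"],
    warning_strings=["job failed"],
)
print(highlights)
print(warnings)
```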

@notification

Multiple @notification key pairs can be set with the following values:

  • complete
    • A Cluster Flow e-mail notification is sent when all processing for all files has finished
  • run
    • A Cluster Flow e-mail is sent when each run finishes (each set of input files)
  • end
    • A cluster notification e-mail is sent when each cluster job ends. Likely to result in a full inbox!
  • suspend
    • A cluster notification e-mail is sent if a job is suspended
  • abort
    • A cluster notification e-mail is sent if a job is aborted

Cluster Flow sends the run and complete notifications using the cf_run_finished and cf_runs_all_finished modules. These modules handle several tasks, such as cleaning useless warning messages from log files. E-mails contain the contents of all log files, plus a section at the top of highlighted messages, specified within log messages by being prefixed with ###CF.

@check_updates

Cluster Flow can automatically check for new versions. If an update is available, it will print a notification each time you run a job. You can specify how often Cluster Flow should check for updates with this parameter. The syntax is a number followed by d, w, m or y for days, weeks, months or years. Cluster Flow will check for an update at runtime if this period or more has elapsed since you last ran it. You can disable update checks and alerts by setting @check_updates 0 in your ~/clusterflow.config file.

You can manually get Cluster Flow to check for updates by running cf --check_updates.
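The interval syntax can be read as in the following Python sketch. Treating a month as 30 days and a year as 365 days is an approximation chosen for this example, and update_interval is a hypothetical name; Cluster Flow's own handling may differ.

```python
import re
from datetime import timedelta

# Days per unit; the month and year lengths are approximations.
UNIT_DAYS = {"d": 1, "w": 7, "m": 30, "y": 365}


def update_interval(value):
    """Return the check interval, or None if checks are disabled ('0')."""
    value = value.strip()
    if value == "0":
        return None
    match = re.fullmatch(r"(\d+)([dwmy])", value)
    if not match:
        raise ValueError(f"Unrecognised @check_updates value: {value!r}")
    return timedelta(days=int(match.group(1)) * UNIT_DAYS[match.group(2)])


print(update_interval("2w"))
```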

Module Params

Many modules can have their default behaviour modified through the use of Cluster Flow --params. These are described below.

See the documentation about Module Parameters for more information about how to specify these options.

BEDTools intersectNeg

blacklistFile

Use to define a blacklist file (overrides any set as a genome reference).

cf --params blacklistFile="/path/to/file"

Bismark align

pbat

Use the Bismark --pbat flag.

cf --params pbat

unmapped

Save the unmapped reads to a file (Bismark --unmapped flag).

cf --params unmapped

bt1

Align with Bowtie1 instead of Bowtie2 (default).

cf --params bt1

single_cell

Use the --non_directional Bismark flag.

cf --params single_cell

subsample

Only align the first 1000000 reads.

cf --params subsample

Bowtie 1

mirna

Use alignment parameters suitable for miRNA alignment against miRBase references, instead of the standard Bowtie1 command. Uses -n 0 -l 15 -e 99999 -k 200 bowtie flags, instead of default -m 1 --strata.

cf --params mirna

CF merge files

regex

Override any merge regex set in the Cluster Flow configuration and use this instead.

cf --params regex="/REMOVE_([KEEP]+).fastq.gz/"

deeptools bamCoverage

fragmentLength

Set the fragment length to use for bamCoverage, instead of taking from the phantompeaktools cross correlation analysis or using the default (200).

cf --params fragmentLength=120

deeptools bamFingerprint

fragmentLength

Set the fragment length to use for bamFingerprint, instead of taking from the phantompeaktools cross correlation analysis or using the default (200).

cf --params fragmentLength=120

FastQ Screen

fastq_screen_config

Use a specific FastQ Screen config file (with --conf FastQ Screen flag).

cf --params fastq_screen_config="/path/to/config"

FastQC

nogroup

Use the --nogroup option with FastQC to prevent automatic grouping of base pair positions in plots. You can end up with some very large plots if you have long reads!

cf --params nogroup

featureCounts

stranded

Set the -s 1 flag for featureCounts.

cf --params stranded

stranded_rev

Set the -s 2 flag for featureCounts.

cf --params stranded_rev

id_tag

Specify the tag to use for counting in the GTF file. If not specified, module tries to guess by looking for a field called gene_id or ID.

cf --params id_tag="Gene"

HiCUP

longest

The longest fragment to accept (HiCUP parameter --longest). Default: 800

cf --params longest=900

shortest

The shortest fragment to accept (HiCUP parameter --shortest). Default: 100

cf --params shortest=50

re1

The restriction enzyme recognition pattern to use. Default: "A^AGCTT,HindIII"

cf --params re1="A^GATCT,BglII"

HTSeq Counts

stranded

Set the -s yes flag for HTSeq Counts. Default is to set -s no.

cf --params stranded

stranded_rev

Set the -s reverse flag for HTSeq Counts. Default is to set -s no.

cf --params stranded_rev

id_tag

Specify the tag to use for counting in the GTF file. If not specified, module tries to guess by looking for a field called gene_id or ID.

cf --params id_tag="Gene"

Kallisto

estFragmentLength

Specify the estimated fragment length (Kallisto --fragment-length option). Default: 200.

cf --params estFragmentLength=300

est_sd

Specify the fragment length standard deviation (Kallisto --sd option). Default: 20.

cf --params est_sd=30

MultiQC

template

Specify the MultiQC template to use. Default: default

cf --params template=geo

RSeQC (all modules)

keep_intermediate

Do not delete the R files used to generate the PDF figures. Useful when running downstream tools such as MultiQC, that use these intermediate files.

cf --params keep_intermediate

Samtools sort + index

byname

Sort by name instead of position (-n flag).

cf --params byname

forcesort

Don't skip the sorting step, even if the file already seems to be sorted.

cf --params forcesort

STAR

LoadAndRemove

Load and remove genome index (--genomeLoad LoadAndRemove). Default: NoSharedMemory.

cf --params LoadAndRemove

LoadAndKeep

Load and keep genome index (--genomeLoad LoadAndKeep). Default: NoSharedMemory.

cf --params LoadAndKeep

outSAMattributes

Specify SAM attributes (--outSAMattributes [attr]). Default: Standard.

cf --params outSAMattributes="attr"

TrimGalore!

min_readlength

Minimum read length for trimming to run. If the first file in each run group has reads less than this length, trimming will be skipped. Default: 50

cf --params min_readlength=30

force_trim

Force TrimGalore! to run, even if reads are below minimum read length.

cf --params force_trim

q_cutoff

Specify quality for trimming low-quality ends from reads in addition to adapter removal. Default Phred score: 20.

cf --params q_cutoff=10

stringency

Number of bases of overlap with adapter sequence required to trim a sequence. Default: 1

cf --params stringency=3

adapter

Specify an adapter sequence to trim. Default: Auto-detect (Illumina universal, Nextera transposase or Illumina small RNA adapter).

cf --params adapter=ATACAGCTAGCAGTAC

RRBS

Specifies that the input file was an MspI digested RRBS sample.

cf --params RRBS

nofastqc

Do not run FastQC after trimming is complete.

cf --params nofastqc

Specific trimming

To remove a custom number of bases from reads after adapter removal, the following parameters can be set:

  • cf --params clip_r1=<int>
    • Remove bp from the 5' end of read 1 (or single-end reads).
  • cf --params clip_r2=<int>
    • Remove bp from the 5' end of read 2 (paired-end only).
  • cf --params three_prime_clip_r1=<int>
    • Remove bp from the 3' end of read 1 AFTER adapter/quality trimming has been performed.
  • cf --params three_prime_clip_r2=<int>
    • Remove bp from the 3' end of read 2 AFTER adapter/quality trimming has been performed.

The following params are presets which are easier to remember and use:

  • cf --params trim=<int>
    • Trim from 5' of R1 and R2. Equivalent to clip_r1=<int> clip_r2=<int>.
  • cf --params pbat
    • clip_r1 6
    • clip_r2 6
  • cf --params ATAC
    • clip_r1 4
    • clip_r2 4
  • cf --params single_cell
    • clip_r1 9
    • clip_r2 9
  • cf --params epignome
    • clip_r1 7
    • clip_r2 7
    • three_prime_clip_r1 7
    • three_prime_clip_r2 7
  • cf --params accel
    • clip_r1 10
    • clip_r2 15
    • three_prime_clip_r1 10
    • three_prime_clip_r2 10
  • cf --params cegx
    • clip_r1 6
    • clip_r2 6
    • three_prime_clip_r1 2
    • three_prime_clip_r2 2

Writing Pipelines

Pipeline syntax

All pipelines conform to a standard syntax. The name of the pipeline is given by the filename, which should end in .config. The top of the file should contain a title and description surrounded by /* and */.

Variables can be set using the same @key value syntax as in clusterflow.config files.

Modules are described using # prefixes. Tab indentation denotes dependencies between modules. Syntax is #module_name parameters, where there can be any number of space separated parameters which will be passed on to the module at run time.

Example pipeline

Here is an example pipeline, which requires a genome path and uses three modules:

/*
Example Pipeline
================
This pipeline is an example of running three modules which depend on
each other. Module 2 is run with a parameter that modifies its behaviour.
This block of text is used when cf --help example_pipeline is run
*/
#module1
       #module2
       #module2 parameter
             #module3

Remember to run dos2unix on your pipeline before you run it, if you're working on a Windows machine.
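To show how tab indentation maps to dependencies, here is a minimal Python sketch that reads a pipeline file into a list of steps, each pointing at the step it depends on. It is a simplified reading of the syntax described above (variables and anything unexpected are skipped), not the parser Cluster Flow uses, and parse_pipeline is a hypothetical name.

```python
def parse_pipeline(text):
    """Return (module, params, parent_index) tuples from a pipeline file."""
    steps = []
    last_at_depth = {}  # indentation depth -> index of most recent step
    in_comment = False
    for raw in text.splitlines():
        stripped = raw.strip()
        if stripped.startswith("/*"):
            in_comment = True
        if in_comment:
            if stripped.endswith("*/"):
                in_comment = False
            continue
        if not stripped.startswith("#"):
            continue  # @key value variables and blank lines are skipped here
        depth = len(raw) - len(raw.lstrip("\t"))  # count leading tabs
        name, *params = stripped[1:].split()
        steps.append((name, params, last_at_depth.get(depth - 1)))
        last_at_depth[depth] = len(steps) - 1
    return steps


example = "/*\nExample Pipeline\n*/\n#module1\n\t#module2\n\t#module2 parameter\n\t\t#module3\n"
for step in parse_pipeline(example):
    print(step)
```

In the example, both module2 steps depend on module1 (index 0), and module3 depends on the second module2 step (index 2).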

Run files

Cluster Flow works by creating .run files for each batch of input files. These are a copy of the pipeline file, with filenames appended for each step of the pipeline. These files are used by subsequent steps in the pipeline to know which input files to use.

Inspecting run files is a quick way to see exactly what analysis was done in a directory.

Writing Modules

Overview

Modules are the heart of Cluster Flow. Each module is a wrapper around a single bioinformatics tool. Each module has three modes of operation:

  1. Specifying the required resources for the job
  2. Running the bioinformatics tool
  3. Printing a help message about the module

Modules are executed using system commands, so can be written in any language. However, most existing modules are written in Perl.

Module filenames must be in the format <module_name>.cfmod.<extension>, eg. mymod.cfmod.pl. They can be stored in the following locations (chosen in this order of preference):

  • Current working directory
  • ~/.clusterflow/modules/
  • <installation_dir>/modules/
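
The search order above could be sketched as follows. This is an illustration only: find_module_file is a hypothetical name, and the installation path is a placeholder rather than a real default.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: return the first file named <module_name>.cfmod.<extension>,
# searching the given directories in order of preference. Returns undef if none is found.
sub find_module_file {
    my ($modname, @dirs) = @_;
    foreach my $dir (@dirs) {
        opendir(my $dh, $dir) or next;   # skip directories that cannot be read
        my @hits = sort grep { /^\Q$modname\E\.cfmod\.[^.]+$/ } readdir($dh);
        closedir($dh);
        return File::Spec->catfile($dir, $hits[0]) if @hits;
    }
    return undef;
}

my $install_dir = '/path/to/clusterflow';   # placeholder installation directory
my $home        = $ENV{HOME} // '.';
my @search_dirs = ('.', "$home/.clusterflow/modules", "$install_dir/modules");

my $found = find_module_file('mymod', @search_dirs);
print defined $found ? "Using $found\n" : "Module mymod not found\n";
```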

Example module

An example module comes bundled with Cluster Flow, containing some highly commented pseudocode which you can modify for your own uses. You can see it in your modules directory: example_module.pl

Existing Perl Scripts

If you have an existing script or tool, it's tempting to try to convert it into a Cluster Flow module. However, we recommend keeping it as a standalone script and creating a Cluster Flow module that launches it. In our experience, this is much easier. It also has the advantage that your script can still be run outside Cluster Flow.

Specifying resources

At the top of every Cluster Flow module is a hash that defines the resources needed by the tool. It looks something like this:

my %requirements = (
    'cores'     => $cores,
    'memory'    => $mem,
    'modules'   => $modules,
    'references'=> $refs,
    'time'      => $time
);

Each of these variables can be specified as a string, an array specifying a range of appropriate values, or a subroutine to calculate a value based on information specific to the run.

Cores

The number of required cores can be specified either as a string or an array. If your tool always uses a fixed number of cpus (for example, 1 if it's not multi-threaded), just specify that number in quotes ('cores' => '1').

If your tool can be sped up by using multiple cpus, you can specify a minimum and maximum number in an array ('cores' => ['3','8']). Cluster Flow will then allocate a number within that range according to how many jobs are being created in parallel. This way, jobs will run as fast as possible for a handful of files, but not overwhelm the cluster if many are being run at once.

Memory

Memory works just like cores, above: specify either a string or an array with a minimum and maximum amount. Numbers with no suffix are interpreted as bytes. You can use the K, M and G suffixes to specify kilobytes, megabytes and gigabytes ('memory' => '8G').

In some cases it can be useful to use a subroutine to dynamically calculate the required memory. For example, you could inspect the filesize of a fasta genome reference to determine the required memory:

'memory'    => sub {
    my $cf = $_[0];
    if (defined($cf->{'refs'}{'fasta'}) && -e $cf->{'refs'}{'fasta'}) {
        # Multiply the reference filesize (in bytes) by 1.2
        my $mem_usage = int(1.2 * -s $cf->{'refs'}{'fasta'});
        return CF::Helpers::bytes_to_human_readable($mem_usage);
    } else {
        # Sensible default
        return '8G';
    }
},

Modules

A string or array of strings describing environment modules that should be loaded. Try to keep this as generic as possible. People can specify specific versions or naming in personal config files using @environment_module_alias.

References

Genome references and annotations are labelled with a field that describes their type. If a reference is required, you should specify its type here. This prevents Cluster Flow from being launched if the reference genome is not specified.

For example, the bowtie2 module specifies 'references'=> 'bowtie2'; the featureCounts module specifies 'references' => 'gtf'.

Time

Some HPC clusters require a time limit to be specified when launching jobs. Here you should predict approximately how long your module should run.

Some modules will always take a fixed amount of time to run, in which case this can be specified as a string. For ten minutes, specify 'time' => '10'.

The execution time for most modules will depend on how many input files they are processing. Modules often run with multiple sets of input files. To cope with this, supply a subroutine to this variable which can flexibly request an amount of time according to how many input files will be processed.

The helper function minutes_to_timestamp is useful here - it takes a number of minutes and returns a properly formatted timestamp (see below for more information about helper functions).

If a module typically takes three hours to run per input file, it could request time as follows:

'time' => sub {
    my $cf = $_[0];
    my $num_files = $cf->{'num_starting_merged_aligned_files'};
    return CF::Helpers::minutes_to_timestamp ($num_files * 3 * 60);
}

The $cf variable is a hash containing information about the job. See below for a description of the keys available.

Remember to be conservative - high time requests can delay queue priority, but low time requests will result in job failure.

Help Text

Cluster Flow can request help text from a module if called with cf --help <module_name>. You should write some text describing what the module does, including any parameters or customisation available.

my $helptext = "".("-"x15)."\n My awesome module\n".("-"x15)."\n
This module is brilliant and worked first time because the author
read all of the Cluster Flow documentation! What a hero!\n\n";

Module launch

Once the requirements hash and help text are written, we call a core helper function called module_start. If the module is being called to request resource requirements or help, the function will exit. If it is being executed in a cluster job, it will return a hash with useful information such as the input filenames.

Requirements should be passed to the function as a reference:

my %cf = CF::Helpers::module_start(\%requirements, $helptext);

The returned hash contains the following keys: (NB: Not all of these are available in request subroutines)

%cf = (
    refs => '<hash>',                # Reference annotation for the specified genome. Keys are the reference type, values are the path to the annotation.
    prev_job_files => '<array>',     # File names resulting from preceding job.
    starting_files => '<array>',     # File names for the initial files that this thread of the pipeline was started with.
    files => '<hash>',               # Hash of arrays with all files from this pipeline thread. Keys are the module job IDs, values are arrays of output files.
    cores => '<int>',                # The number of cores allocated to the module.
    memory => '<str>',               # The amount of memory allocated to the module.
    params => '<hash>',              # A hash of key: value pairs. Value is `True` if only a flag.
    config => '<hash>',              # Hash containing arbitrary key: value configuration pairs from the run file. Always contains hash with key `notifications`.
    num_starting_files => '<int>',   # Number of files that this thread of the pipeline started with.
    num_starting_merged_files => '<int>', # Number of files that this thread of the pipeline started with, after merging if matched merge regex.
    num_starting_merged_aligned_files => '<int>',  # Guess at number of files after alignment, based on whether pipeline is running in paired end mode or not.
    pipeline_id => '<str>',          # The unique Cluster Flow ID of this pipeline. Useful for generating filenames.
    pipeline_name => '<str>',        # The name of the pipeline that was launched.
    pipeline_started => '<int>',     # A unix timestamp of when the pipeline was started.
    job_id => '<str>',               # The unique Cluster Flow ID for this job.
    prev_job_id => '<str>',          # The unique Cluster Flow ID for the previous job in the pipeline.
    run_fn => '<str>',               # The filename of the run file for this thread of the pipeline.
    run_fns => '<array>',            # All run file filenames for this pipeline (summary modules only).
    modname => '<str>',              # Name of this module
    mod_fn => '<str>',               # Filename of this module
);

Checks

Although not necessary, most modules that use genome references do a sanity check to make sure that they have what they need after this point. For example, the STAR module checks that it has the required reference:

# Check that we have a genome defined
if(!defined($cf{'refs'}{'star'})){
    die "\n\n###CF Error: No star path found in run file $cf{run_fn} for job $cf{job_id}. Exiting..";
} else {
    warn "\nAligning against $cf{refs}{star}\n\n";
}

Version logging

Again not necessary, but good practice - modules log the version of software that they're about to run for future reference:

warn "---------- < module > version information ----------\n";
warn `MY_COMMAND --version`;
warn "\n------- End of < module > version information ------\n";

Parameters

Modules are able to customise the way that they run depending on the presence of custom parameters at run time. These are used for a range of reasons, such as customising bowtie alignments for miRNA data or changing trimming settings depending on library preparation type. You can use them however you like, though you'll find many modules doing this sort of thing:

my $extra_flag = (defined($cf{'params'}{'myflag'})) ? '--extra_flag' : '';
my $specific_var = '';
if(defined($cf{'params'}{'myvar'})){
    $specific_var = '--myvar '.$cf{'params'}{'myvar'};
}
# ..later..
my $cmd = "mycommand --always $extra_flag $specific_var";

Opening the run file

Each part of the pipeline has a .run file, used by the modules to track the configuration options and output filenames as the pipeline progresses.

Your module will need to open this run file in append mode so that it can add the file names of any output that it creates.

open (RUN,'>>',$cf{'run_fn'}) or die "###CF Error: Can't write to $cf{run_fn}: $!";

Command execution

Looping through files

Once you have everything ready, you'll want to actually run your tool. Remember that modules typically run with a collection of input files, so you will need to loop through these and process them in sequence.

How you do this looping depends on what input your tool expects. If your tool takes a single file and doesn't care whether it's paired end or single, you can simply loop through all files from the previous job:

foreach my $file (@{$cf{'prev_job_files'}}){
    # process $file
}

Most preprocessing and alignment tools need either one single end FastQ file or two paired end FastQ files. To handle this, you can use the is_paired_end helper function to separate the input files into single end and paired end:

my ($se_files, $pe_files) = CF::Helpers::is_paired_end(\%cf, @{$cf{'prev_job_files'}});

These files can then be looped over in separate loops:

# Go through each single end file
if($se_files && scalar(@$se_files) > 0){
    foreach my $file (@$se_files){
        # process $file
    }
}
if($pe_files && scalar(@$pe_files) > 0){
    foreach my $files_ref (@$pe_files){
        my @files = @$files_ref;
        if(scalar(@files) == 2){
            # process $files[0] and $files[1]
        } else {
            warn "\n###CF Error! Bowtie paired end files had ".scalar(@files)." input files instead of 2\n";
        }
    }
}

Building a command

Typically, Cluster Flow modules build a system command in a string. This is then printed to STDERR with the ###CFCMD prefix, where it is picked up by Cluster Flow and added to the summary HTML report and e-mail.

my $command = "my_command -i $file -o $output_fn";
warn "\n###CFCMD $command\n\n";

Running the command

Once built, the command should be executed using the Perl system function. This returns the exit status once complete, which can be checked to see whether the module has worked or not (0 is success, which evaluates to false):

if(!system ($command)){
    # command worked
} else {
    # Command returned a non-zero result, probably went wrong...
    warn "\n###CF Error! Example module (SE mode) failed for input file '$file': $? $!\n\n";
}
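
If you want more detail than a simple pass or fail, the value returned by system (also available in $?) can be decoded as described in perldoc -f system. The sketch below is illustrative only: run_and_report is a hypothetical name, and the running perl binary stands in for a real tool.

```perl
use strict;
use warnings;

# Run a command and describe how it finished, following perldoc -f system.
sub run_and_report {
    my (@command) = @_;
    my $status = system(@command);
    if ($status == 0) {
        return 'success';
    } elsif ($? == -1) {
        return "failed to start: $!";              # $! is only meaningful here
    } elsif ($? & 127) {
        return 'killed by signal ' . ($? & 127);   # low 7 bits hold the signal
    } else {
        return 'exit code ' . ($? >> 8);           # high byte holds the exit code
    }
}

# $^X is the running perl binary, used here as a stand-in for a real tool
print run_and_report($^X, '-e', 'exit 0'), "\n";
print run_and_report($^X, '-e', 'exit 3'), "\n";
```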

Adding the output to the run file

If your command ran successfully, you should have created a new output file. This should be added to the .run file along with the current job id, so that it can be used by subsequent modules in the pipeline:

if(-e $output_fn){
    print RUN "$cf{job_id}\t$output_fn\n";
} else {
    warn "\n###CF Error! Example module output file $output_fn not found..\n";
}

Job Completion

Run File Output

A run file is created by Cluster Flow for each batch of files. It describes variables to be used, the pipeline specified and the filenames used by each module. The syntax of variables and pipeline is described in Pipeline syntax.

File names are described by a job identifier followed by a tab then a filename. Each module is provided with its own job ID and the ID of the job that was run previously. By using these identifiers, the module can read which input files to use and write out the resulting filenames to the run file when complete. Example run file syntax:

first_job_938 filename_1.txt
first_job_938 filename_2.txt
second_job_375 filename_1_processed.txt
second_job_375 filename_2_processed.txt

There can be any number of extra parameters. These are specific to the module and are specified in the pipeline configuration.
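
As an illustration of the file name syntax, the sketch below collects the tab-separated lines of a run file into a hash of arrays keyed by job ID. In real modules this parsing is done for you by module_start. The subroutine name files_by_job and the example variable line are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical sketch: collect "job_id<TAB>filename" lines into a hash of arrays.
sub files_by_job {
    my ($run_text) = @_;
    my %files;
    foreach my $line (split /\n/, $run_text) {
        # Skip blank lines, @key variables and #module pipeline lines
        next if $line =~ /^\s*$/ || $line =~ /^\s*[@#]/;
        my ($job_id, $filename) = split /\t/, $line, 2;
        next unless defined $filename;   # not a file name line
        push @{ $files{$job_id} }, $filename;
    }
    return %files;
}

my $run_text = join "\n",
    '@example_key example_value',
    '#module1',
    "first_job_938\tfilename_1.txt",
    "first_job_938\tfilename_2.txt",
    "second_job_375\tfilename_1_processed.txt",
    '';
my %files = files_by_job($run_text);
print "$_: @{ $files{$_} }\n" for sort keys %files;
```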

E-mail report highlights

Any STDOUT or STDERR that your module produces will be written to the Cluster Flow log file. At the end of each run and pipeline, an e-mail will be sent to the submitter with details of the run results (if specified by the config settings). Because the log file can be very long, Cluster Flow pulls out any lines starting with ###CF. Typically, such a line should be printed when a module finishes, with a concise summary of whether it worked or not. Messages including the word Error will be highlighted and cause the final e-mail to have warning colours. The configuration options @log_highlight_string and @log_warning_string can be used to customise this reporting.

Modules should print the command that they are going to run to STDERR so that this is recorded in the log file. These are also sent in the e-mail notification and should start with ###CFCMD.
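
The filtering described in this section can be pictured with the following sketch. It is not the code that Cluster Flow uses to build its e-mails. The subroutine summarise_log is hypothetical, and its two string arguments stand in for @log_highlight_string and @log_warning_string.

```perl
use strict;
use warnings;

# Hypothetical sketch: pull highlight lines out of a log and flag warnings.
sub summarise_log {
    my ($log_text, $highlight, $warning) = @_;
    my (@highlights, @commands);
    my $has_warning = 0;
    foreach my $line (split /\n/, $log_text) {
        next unless index($line, $highlight) == 0;   # line starts with ###CF
        if ($line =~ /^\Q$highlight\ECMD\s+(.*)$/) {
            push @commands, $1;                      # ###CFCMD lines
            next;
        }
        push @highlights, $line;
        $has_warning = 1 if index($line, $warning) >= 0;
    }
    return (\@highlights, \@commands, $has_warning);
}

my $log = join "\n",
    'Some verbose tool output',
    '###CFCMD my_command -i in.fq -o out.bam',
    '###CF Example module successfully exited',
    '###CF Error! Example module failed for input file in2.fq',
    '';
my ($highlights, $commands, $warn) = summarise_log($log, '###CF', 'Error');
print "Highlight: $_\n" for @$highlights;
print "Command: $_\n"   for @$commands;
print "Warnings found\n" if $warn;
```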

Exit codes

It's likely that your cluster will continue to fire off the dependent jobs as soon as the parent jobs finish, irrespective of their output. If a module fails, the cleanest way to exit is with a success code, but without printing any resulting output filename. Subsequent modules will not find their input filenames and so should immediately exit.

Appendices

Command line flags

Cluster Flow modules are expected to respond to the following command line flags:

Flag             Step  Description
--requirements   1     Request the cluster resources needed by the module
--run_fn         2     Path to the Cluster Flow run file(s) for this pipeline
--job_id         2     Cluster job ID for this job
--prev_job_id    2     Cluster job ID for the previous job
--cores <int>    2     Number of cores allocated to the module
--mem <str>      2     Amount of memory allocated to the module
--param <str>    2     Extra parameters to be used
--help           3     Print module help

The step number refers to whether the module is being executed:

  1. By the core Cluster Flow script at pipeline launch
  2. Within a cluster job, executing the tool
  3. By the core Cluster Flow script, when cf --help <modname> is specified.
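
Perl modules get this behaviour from module_start, but a module written in another language has to handle the flags itself. As a rough guide to what is involved, here is a sketch using the core Getopt::Long package. The subroutine parse_module_flags is hypothetical, and allowing --param to be repeated is an assumption.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical sketch: parse the flags listed above and work out
# which of the three steps the module is being called for.
sub parse_module_flags {
    my (@args) = @_;
    my %opt;
    GetOptionsFromArray(
        \@args, \%opt,
        'requirements', 'help',
        'run_fn=s', 'job_id=s', 'prev_job_id=s',
        'cores=i', 'mem=s',
        'param=s@',    # assumption: may be given more than once
    ) or die "###CF Error: could not parse module flags\n";
    $opt{param} ||= [];
    $opt{step} = $opt{requirements} ? 1 : $opt{help} ? 3 : 2;
    return %opt;
}

my %opt = parse_module_flags(
    qw(--run_fn sample.run --job_id cf_job_2 --prev_job_id cf_job_1 --cores 4 --mem 8G --param myflag)
);
print "Step $opt{step}: $opt{cores} cores, $opt{mem} memory, params: @{ $opt{param} }\n";
```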

Helper Functions

If your module is written in Perl, there are some common Cluster Flow packages (perl modules) that you can use to provide some pre-written functions.

There are currently three packages available to Cluster Flow modules. Helpers contains subroutines of general use for most modules. Constants and Headnodehelpers contain subroutines primarily for use in the main cf script. You can include the Helpers package by adding the following to the top of your module file:

use FindBin qw($RealBin);
use lib "$FindBin::RealBin/../source";
use CF::Helpers;

We use the package FindBin to add the binary directory to the path (where cf is executing from).

Note that there is a Python version of the Helpers script which contains many of the same functions and works in a comparable way.

module_start

Handles the initiation of all modules. See above for a description of use.

parse_runfile

Parses .run files. Called by module_start() and not usually run directly.

load_environment_modules

Used to load environment modules into the PATH. Typically used by the main cf script, though occasionally used elsewhere for special occasions.

is_paired_end

This function takes a reference to the %cf hash followed by an array of file names. It returns a reference to an array of single end files and a reference to an array of arrays of paired end files.

First, it checks the configuration set in the .run file. If @force_paired_end is set, it sorts the files from the last job into pairs and returns them. If @force_single_end is set, it returns all previous files as single end.

If neither variable is set, it sorts the files alphabetically, then removes any occurrence of _[1-4] from each filename and compares the results. Files whose names are then identical are returned as paired end.

my ($se_files, $pe_files) = CF::Helpers::is_paired_end(\%cf, @$files);
foreach my $file (@$se_files){
    print "$file is single end\n";
}
foreach my $files_ref (@$pe_files){
    my @files = @$files_ref;
    print $files[0]." and ".$files[1]." are paired end.\n";
}
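
To make the filename heuristic described above concrete, here is a standalone sketch. It is a re-implementation for illustration rather than the code in CF::Helpers: the name guess_pairs is hypothetical and it ignores the @force_paired_end and @force_single_end settings.

```perl
use strict;
use warnings;

# Hypothetical re-implementation of the filename pairing heuristic
sub guess_pairs {
    my @files = sort @_;
    my (@se_files, @pe_files);
    while (@files) {
        my $first = shift @files;
        # Strip _1 to _4 from the names of this file and the next one
        (my $first_base = $first) =~ s/_[1-4]//g;
        my $next_base;
        if (@files) {
            ($next_base = $files[0]) =~ s/_[1-4]//g;
        }
        if (defined $next_base && $first_base eq $next_base) {
            push @pe_files, [$first, shift @files];   # matching pair
        } else {
            push @se_files, $first;                   # no partner found
        }
    }
    return (\@se_files, \@pe_files);
}

my ($se, $pe) = guess_pairs(qw(sampleB.fastq sampleA_2.fastq sampleA_1.fastq));
print "$_ is single end\n" for @$se;
print "$_->[0] and $_->[1] are paired end\n" for @$pe;
```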

is_bam_paired_end

Looks at a BAM/SAM file and tries to determine whether it was generated from paired end or single end input files. The function reads through the first 1000 reads of the file and counts how many have the 0x1 flag set (denoting a paired read). If 800 or more of those first 1000 reads are paired end, it returns true.

if(CF::Helpers::is_bam_paired_end($file)){
    ## do something with paired end BAM
} else {
    ## do something with single end BAM
}
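
The flag counting can be sketched on plain SAM text as follows. This is not the helper's implementation: sam_looks_paired is a hypothetical name, it takes SAM lines rather than a file name, and applying the same 80% proportion to files with fewer than 1000 reads is an assumption.

```perl
use strict;
use warnings;

# Hypothetical sketch: decide whether SAM records look paired end.
# Takes SAM-format lines; header lines starting with @ are skipped.
sub sam_looks_paired {
    my (@lines) = @_;
    my ($reads, $paired) = (0, 0);
    foreach my $line (@lines) {
        next if $line =~ /^@/;
        my @fields = split /\t/, $line;
        next unless @fields > 1 && $fields[1] =~ /^\d+$/;
        $reads++;
        $paired++ if int($fields[1]) & 0x1;   # 0x1: read is paired
        last if $reads >= 1000;
    }
    # Assumption: use the same proportion as 800 of 1000 reads
    return ($reads > 0 && $paired >= 0.8 * $reads) ? 1 : 0;
}

my @sam = (
    "\@HD\tVN:1.6\tSO:coordinate",
    "read1\t99\tchr1\t100\t60\t50M\t=\t300\t250\t*\t*",
    "read1\t147\tchr1\t300\t60\t50M\t=\t100\t-250\t*\t*",
);
print sam_looks_paired(@sam) ? "paired end\n" : "single end\n";
```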

fastq_encoding_type

Scans a FastQ file and tries to determine the encoding type. Returns strings integer, solexa, phred33, phred64 or 0 if too few reads to safely determine. This is done by observing the minimum and maximum quality scores.

For more details, see the Wikipedia page on FastQ encoding.

my ($encoding) = CF::Helpers::fastq_encoding_type($file);

fastq_min_length

Scans the first 100000 reads of a FastQ file and returns the longest read length that it finds.

my $min_length = CF::Helpers::fastq_min_length($file);

parse_seconds

Takes time in seconds as an input and returns a human readable string. The optional second $long variable determines whether to use h/m/s (0, false) or hours/minutes/seconds (1, true, the default).

my $time = CF::Helpers::parse_seconds($seconds, $long);

timestamp_to_minutes / minutes_to_timestamp

Functions to convert between a SLURM / HPC style timestamp and minutes. Attempts string parsing in the following order:

  1. minutes
  2. minutes:seconds
  3. hours:minutes:seconds
  4. days-hours
  5. days-hours:minutes
  6. days-hours:minutes:seconds

my $minutes = CF::Helpers::timestamp_to_minutes($timestamp);
my $timestamp = CF::Helpers::minutes_to_timestamp($minutes);
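
For readers who want to see how the six formats can be told apart, here is a standalone sketch. The subroutines to_minutes and to_timestamp are hypothetical stand-ins, not the CF::Helpers code. Rounding seconds up to a whole minute and always writing the days-hours:minutes:seconds form are assumptions.

```perl
use strict;
use warnings;
use POSIX qw(ceil);

# Hypothetical sketch of parsing the six formats listed above into minutes.
# Seconds are rounded up to the next whole minute (an assumption).
sub to_minutes {
    my ($timestamp) = @_;
    my ($days, $hours, $mins, $secs) = (0, 0, 0, 0);
    if ($timestamp =~ /^(\d+)$/) {
        $mins = $1;
    } elsif ($timestamp =~ /^(\d+):(\d+)$/) {
        ($mins, $secs) = ($1, $2);
    } elsif ($timestamp =~ /^(\d+):(\d+):(\d+)$/) {
        ($hours, $mins, $secs) = ($1, $2, $3);
    } elsif ($timestamp =~ /^(\d+)-(\d+)$/) {
        ($days, $hours) = ($1, $2);
    } elsif ($timestamp =~ /^(\d+)-(\d+):(\d+)$/) {
        ($days, $hours, $mins) = ($1, $2, $3);
    } elsif ($timestamp =~ /^(\d+)-(\d+):(\d+):(\d+)$/) {
        ($days, $hours, $mins, $secs) = ($1, $2, $3, $4);
    } else {
        return undef;   # unrecognised format
    }
    return $days * 1440 + $hours * 60 + $mins + ceil($secs / 60);
}

# Hypothetical inverse, always writing days-hours:minutes:seconds
sub to_timestamp {
    my ($minutes) = @_;
    my $days  = int($minutes / 1440);
    my $hours = int(($minutes % 1440) / 60);
    return sprintf('%d-%02d:%02d:00', $days, $hours, $minutes % 60);
}

print to_minutes('1-02:30:00'), "\n";   # 1590
print to_timestamp(180), "\n";          # 0-03:00:00
```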

human_readable_to_bytes / bytes_to_human_readable

Two functions which convert between human readable memory strings (eg. 4G or 100M) and bytes.

my $bytes = CF::Helpers::human_readable_to_bytes('3G');
my $size = CF::Helpers::bytes_to_human_readable('7728742');
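
The conversions work along these lines. This sketch is not the CF::Helpers code: the subroutine names are hypothetical, and both the use of binary multiples (1K = 1024 bytes) and the one-decimal output format are assumptions.

```perl
use strict;
use warnings;

# Hypothetical sketch, assuming binary multiples (1K = 1024 bytes)
my %multiplier = (K => 1024, M => 1024**2, G => 1024**3);

sub to_bytes {
    my ($memory) = @_;
    if ($memory =~ /^(\d+(?:\.\d+)?)([KMG])?$/i) {
        my ($number, $suffix) = ($1, uc($2 // ''));
        return int($number * ($suffix ? $multiplier{$suffix} : 1));
    }
    return undef;   # not a recognised memory string
}

sub to_human_readable {
    my ($bytes) = @_;
    # Use the largest suffix that gives a value of at least 1
    foreach my $suffix (qw(G M K)) {
        if ($bytes >= $multiplier{$suffix}) {
            return sprintf('%.1f%s', $bytes / $multiplier{$suffix}, $suffix);
        }
    }
    return $bytes;
}

print to_bytes('3G'), "\n";               # 3221225472
print to_human_readable(7728742), "\n";   # 7.4M
```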

mem_return_mbs

Takes a memory string and returns a number of megabytes, rounding up to the nearest MB.

allocate_cores

Takes the suggested number of cores to use, a minimum and maximum number and returns a sensible result.

my $cores = CF::Helpers::allocate_cores($recommended, $min, $max);
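
One plausible reading of "a sensible result" is to keep the suggestion within the allowed range. The sketch below shows that idea only. The subroutine clamp is hypothetical, and so is sharing a fixed pool of 16 cores between parallel jobs.

```perl
use strict;
use warnings;

# Hypothetical sketch: keep a suggested value within [min, max]
sub clamp {
    my ($recommended, $min, $max) = @_;
    return $min if $recommended < $min;
    return $max if $recommended > $max;
    return $recommended;
}

# Suppose 16 cores are shared between a varying number of parallel jobs
foreach my $num_jobs (1, 4, 16) {
    my $cores = clamp(int(16 / $num_jobs), 3, 8);
    print "$num_jobs jobs: $cores cores each\n";
}
```

With a range of 3 to 8, a single job is capped at 8 cores, while 16 parallel jobs still get the minimum of 3 each.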

allocate_memory

Takes the suggested amount of memory to use, a minimum and maximum amount and returns a sensible result. Input can be human readable strings or bytes. Returns a value in bytes.

my $mem = CF::Helpers::allocate_memory($recommended, $min, $max);

cf_compare_version_numbers

Function to properly compare software version numbers. Correctly returns that v0.10 is greater than v0.9.
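
A plain string or decimal comparison would rank v0.9 above v0.10, so the components have to be compared as integers. The sketch below shows one way to do that. The name compare_versions and its return convention of -1, 0 or 1 are hypothetical and may differ from cf_compare_version_numbers.

```perl
use strict;
use warnings;

# Hypothetical sketch: compare versions component by component.
# Returns -1, 0 or 1, like Perl's <=> operator.
sub compare_versions {
    my ($first, $second) = @_;
    my @first_parts  = ($first  =~ /(\d+)/g);
    my @second_parts = ($second =~ /(\d+)/g);
    while (@first_parts || @second_parts) {
        # Missing components count as zero, so v1.2 equals v1.2.0
        my $x = @first_parts  ? shift @first_parts  : 0;
        my $y = @second_parts ? shift @second_parts : 0;
        return $x <=> $y if $x != $y;
    }
    return 0;
}

print compare_versions('v0.10', 'v0.9'), "\n";   # 1
```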

Troubleshooting

Bugs and Errors

If you come across a strange looking error message or find a bug, please do let us know. You can submit new issues here: https://github.com/ewels/clusterflow/issues

Feature Requests

If you'd like Cluster Flow to do something it doesn't, log a request! The issue tracker system mentioned above can be used for enhancement requests too.

E-mail

If you don't want to set up a GitHub account, feel free to drop the author an e-mail at phil.ewels@scilifelab.se

Frequently Asked Questions

Permission Errors

A number of errors can be caused by scripts not having executable file privileges. You can check the file permissions with ls -l. You should see something like this:

$ ls -l clusterflow/modules/
total 608
-rwxrwxr-x 1 phil phil 6770 May 20 16:40 bismark_align.cfmod
-rwxrwxr-x 1 phil phil 4291 May 16 12:54 bismark_deduplicate.cfmod
-rwxrwxr-x 1 phil phil 2748 May 16 12:54 bismark_messy.cfmod
-rwxrwxr-x 1 phil phil 5652 May 16 12:54 bismark_methXtract.cfmod
-rwxrwxr-x 1 phil phil 3553 May 16 12:54 bismark_tidy.cfmod
-rwxrwxr-x 1 phil phil 7119 May 16 12:54 bowtie1.cfmod

This example is for the modules directory, where all modules should be executable by all users. The same applies to the main cf file.

DOS carriage returns

If you've edited any files, you may get problems due to Windows-based editors putting DOS-style \r carriage returns in.

Most Linux environments come with a package called dos2unix which will clean these up:

dos2unix *

ERROR:105: Unable to locate a modulefile for 'clusterflow'

This error probably means that Cluster Flow isn't installed in your environment module system, and you're trying to run module load clusterflow.

You can skip this step if you have another way of accessing the cf file, or see the Installation Instructions for details about how to set Cluster Flow up with environment modules.

Unable to run job: job rejected: the requested parallel environment "orte" does not exist.

This message means that your GRIDEngine setup doesn't have the default orte environment set up. If you have different environments set up you can list them with:

qconf -spl

You can get the details of any environment with:

qconf -sp [name]

If you find one which assigns slots to a single node (allocation_rule should be $fill_up) then you can just do a find & replace for orte to the name of your local environment and that should make things work again.

(Answered by Simon Andrews)

Unable to run job: job rejected (other reasons)

There may be other differences in the job submission requests that cause them to fail. If you see errors such as this, you can use the @custom_job_submit_command configuration variable to customise the way that jobs are requested.