Workflow Services Overview
The workflow service runs Common Workflow Language (CWL) executions. CWL is an open standard for describing analysis workflows and tools. The workflow service uses Cromwell as the orchestration layer to manage the steps within a workflow. A custom plugin was developed so that Task Service performs the step executions in a highly secure environment. You can initiate CWL executions through the LifeOmic CLI (Command Line Interface), the PHC Console Add Workflow page, or the PHC SDK for Python.
Before using the CLI or the PHC web console to run workflows, users need to log into their PHC account at https://apps.us.lifeomic.com/phc.
General concept
The workflow service requires that all CWL resource and dependency files exist within the PHC File Service. Once all resources are in place, you execute a run of the CWL through the CLI or the PHC web console. The Automation page lists all workflows with their current states, start times, and run times.
From this page, select an individual execution. This view displays a graph of the workflow using the Rabix CWL-SVG open source library for generating visualizations. It also lists the individual steps of the workflow as run by the Task Service.
Note: Click the folder icon at the top right of this view to display the workflow files. The files include all of the outputs generated by the workflow. Each output is saved in a directory with the name of the CWL step that generated it.
Use the CLI to run a basic workflow
The following steps demonstrate how to run a simple workflow that generates an index for a BAM file. CWL has a fairly broad syntax, and this is just a simple example. See Common Workflow Language for more general information. The workflow service implements only a subset of the full CWL feature set; see Workflow Service limitations for the current limitations.
To use the PHC web console to run a workflow, see Add a Workflow.
Generate and upload the CWL resources
Here is a sample master CWL file.
This file describes:
- Two inputs: a bamfile and a filename for the index of the BAM file
- One output: the index file, which will have the name provided by the input above
- One step: this gives a name to the step, index_bam, and the name of the CWL file, bamindex.cwl, that will execute the step
Generate this file, then upload it using the CLI, for example:
lo files upload ./bam_master.cwl <datasetId>
cwlVersion: v1.0
class: Workflow
inputs:
  bamfile: File
  bamindexfilename: string
outputs:
  bamindexout:
    type: File
    outputSource: index_bam/bamindexout
steps:
  index_bam:
    run: bamindex.cwl
    in:
      bamfile: bamfile
      bamindexfilename: bamindexfilename
    out: [bamindexout]
Here is a sample CWL file for the step.
This file describes:
- The type of tool used, in this case CommandLineTool
- The Docker container that will run the step
  Note: The workflow service requires that all steps use a Docker container for execution. This allows for secure execution within Task Service.
- Two inputs: the BAM file and the index filename
  Note: In this example the two inputs are also used as arguments to the baseCommand; notice the inputBinding and position values.
- One output: in this case the input filename is reused to name the output file
- The command that the container runs
Generate this file, then upload it using the CLI, for example:
lo files upload ./bamindex.cwl <datasetId>
cwlVersion: v1.0
class: CommandLineTool
hints:
  DockerRequirement:
    dockerPull: genomicpariscentre/samtools
inputs:
  bamfile:
    type: File
    inputBinding:
      position: 1
  bamindexfilename:
    type: string
    inputBinding:
      position: 2
outputs:
  bamindexout:
    type: File
    outputBinding:
      glob: $(inputs.bamindexfilename)
baseCommand: ['samtools', 'index']
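To make the role of the inputBinding position values concrete, here is a minimal Python sketch of how a CWL runner could assemble the command line for this step. It only models the ordering idea; `build_command` is a hypothetical helper, and the local name `HG00463.bam` for the staged BAM file is an assumption made for illustration.

```python
# Sketch only: models how inputBinding positions order the arguments that
# follow baseCommand. A real CWL runner handles many more cases.

def build_command(base_command, positions, values):
    """Return argv: baseCommand followed by inputs sorted by position."""
    ordered = sorted(positions.items(), key=lambda item: item[1])
    return list(base_command) + [str(values[name]) for name, _ in ordered]


# Input name -> inputBinding position, as declared in the step CWL.
positions = {"bamindexfilename": 2, "bamfile": 1}
# "HG00463.bam" is an assumed local filename for the staged BAM file.
values = {"bamfile": "HG00463.bam", "bamindexfilename": "HG00463.bam.bai"}

command = build_command(["samtools", "index"], positions, values)
print(" ".join(command))
```

Under these assumptions the BAM file lands before the index filename, which is the argument order `samtools index` expects.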
Next, a JSON file provides the inputs.
This file describes:
- A file input, using the class File and the ID of the file
- The name of the desired output file
{
  "bamfile": {
    "class": "File",
    "fileId": "805209e1-35cb-49f3-a5cc-327a93d1f72d"
  },
  "bamindexfilename": "HG00463.bam.bai"
}
Generate this file, then upload it using the CLI, for example:
lo files upload ./bam_inputs.json <datasetId>
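If the inputs file needs to be produced for many BAM files, it may be easier to generate it than to write it by hand. The following is a minimal sketch using only the Python standard library; `build_inputs` is a hypothetical helper, and the file ID is the sample value from the JSON above.

```python
import json


def build_inputs(bam_file_id, index_filename):
    """Build the inputs document: one File input plus the output filename."""
    return {
        "bamfile": {"class": "File", "fileId": bam_file_id},
        "bamindexfilename": index_filename,
    }


inputs = build_inputs("805209e1-35cb-49f3-a5cc-327a93d1f72d", "HG00463.bam.bai")
inputs_json = json.dumps(inputs, indent=2)
print(inputs_json)
```

Writing `inputs_json` to `bam_inputs.json` would then give a file that can be uploaded with the command above.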
Finally, we are ready to run the workflow using the CLI:
lo workflows create <datasetId> -n "BAM Indexing" -w <masterCwlFileId> -f <inputsFileId> -d <cwlDependenciesFileId>
The workflow is now running. Go to the Automation page to see the list of workflows, and select the one you started to view it in detail.
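The upload and create commands from the steps above could be scripted. The sketch below only builds and prints the argument lists, using the flags shown in this section; it does not execute anything. The helper names and placeholder IDs are hypothetical, and how the file IDs are obtained after an upload is not covered here.

```python
import shlex


def upload_command(path, dataset_id):
    """Argument list for uploading one file to a dataset."""
    return ["lo", "files", "upload", path, dataset_id]


def create_command(dataset_id, name, master_id, inputs_id, dependencies_id):
    """Argument list for starting a workflow run."""
    return [
        "lo", "workflows", "create", dataset_id,
        "-n", name,
        "-w", master_id,
        "-f", inputs_id,
        "-d", dependencies_id,
    ]


# Placeholder values; replace with real IDs from your project.
DATASET_ID = "my-dataset-id"

for path in ["./bam_master.cwl", "./bamindex.cwl", "./bam_inputs.json"]:
    print(shlex.join(upload_command(path, DATASET_ID)))

create = create_command(
    DATASET_ID, "BAM Indexing", "master-file-id", "inputs-file-id", "deps-file-id"
)
print(shlex.join(create))
```

Passing one of these lists to `subprocess.run` would execute it without shell quoting issues, assuming the `lo` CLI is installed and authenticated.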
Using a non-public image
Task Service can use non-public Docker images; see Using a non public image. The syntax for using one in the workflow service is as follows:
In the required DockerRequirement, prefix the name of the private container with lifeomic_private/. This tells the workflow service to handle the image as a non-public image. Then add a file input type to the CWL master and step files, treating it as any other file input.
requirements:
  DockerRequirement:
    dockerPull: lifeomic_private/my_private_image
Using an image from the Tool Registry Service
Task Service can use an image stored within the Tool Registry Service. The syntax for using one in the workflow service is as follows:
In the required DockerRequirement, prefix the name of the tool with lifeomic_tool/, then add the account that owns the image, account_owning_image/. Lastly, add the image name and optional version, my_tool_image:1.0.0. If no version is supplied, the version currently marked as default will be used. The complete path gives the workflow service all the details needed to pull the image of that name, owned by that account, from the Tool Registry Service.
requirements:
  DockerRequirement:
    dockerPull: lifeomic_tool/account_owning_image/my_tool_image:1.0.0
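The two dockerPull naming schemes described above (the lifeomic_private/ prefix and the lifeomic_tool/ prefix) can be summarized in a short Python sketch. The helper names are hypothetical; only the prefixes and the account/name:version layout come from the text.

```python
def private_image(name):
    """dockerPull value for a non-public image."""
    return f"lifeomic_private/{name}"


def tool_registry_image(account, tool, version=None):
    """dockerPull value for a Tool Registry Service image.

    Omitting the version leaves the choice to the version currently
    marked as default.
    """
    reference = f"lifeomic_tool/{account}/{tool}"
    return f"{reference}:{version}" if version else reference


print(private_image("my_private_image"))
print(tool_registry_image("account_owning_image", "my_tool_image", "1.0.0"))
print(tool_registry_image("account_owning_image", "my_tool_image"))
```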
Glob Pattern Handling
The supported syntax for handling glob patterns in output files has some limitations when the pattern includes multiple unknown directories. The following examples explain this limitation in detail.
- The pattern /tmp/**/*.txt will look for a *.txt file within any one sub directory. This is our best use case.
  - For example, pattern /tmp/**/*.txt would find output.txt given location /tmp/foo/output.txt
- The pattern /tmp/**/*.txt should also find *.txt files under multiple sub directories, but currently we are limited to one sub directory.
  - For example, pattern /tmp/**/*.txt would not find output.txt given location /tmp/foo/bar/output.txt
- If the number of sub directories is known, this pattern may be used to get through this limitation by including /** for each directory.
  - For example, pattern /tmp/**/**/*.txt will find output.txt given location /tmp/foo/bar/output.txt
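The behavior in the examples above can be modeled by treating each `**` as matching exactly one directory level. The following Python sketch is an illustrative model of that rule, not the service's actual matcher; `matches` is a hypothetical helper. The case of zero sub directories (for example /tmp/output.txt) is not described above, and this model treats it as no match, which is an assumption.

```python
from fnmatch import fnmatchcase


def matches(pattern, path):
    """Match segment by segment; `**` covers exactly one directory level."""
    pattern_parts = pattern.strip("/").split("/")
    path_parts = path.strip("/").split("/")
    if len(pattern_parts) != len(path_parts):
        return False  # a different depth can never match in this model
    return all(
        fnmatchcase(part, "*" if pat == "**" else pat)
        for pat, part in zip(pattern_parts, path_parts)
    )


print(matches("/tmp/**/*.txt", "/tmp/foo/output.txt"))
print(matches("/tmp/**/*.txt", "/tmp/foo/bar/output.txt"))
print(matches("/tmp/**/**/*.txt", "/tmp/foo/bar/output.txt"))
```

A quick local check like this may help decide how many `/**` segments an output glob needs before the workflow is submitted.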
Workflow Service limitations
The full CWL syntax is not currently supported. While some CWL Requirements are required, e.g. DockerRequirement, others most likely will not be supported due to security concerns, e.g. InlineJavascriptRequirement. However, as we add support for the other Requirements, they will be listed here. Due to the explicit nature of the file service handling, CWL secondary files (secondaryFiles) are also not supported. Each file needs to be explicitly listed as a file input, with its ID provided in the inputs.
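These limitations can be checked before uploading a step file. The sketch below inspects a CWL tool that has already been parsed into a Python dictionary (parsing the YAML is left out). It is an illustrative pre-flight check, not part of the workflow service: `check_tool` is a hypothetical helper, the requirement names follow the CWL v1.0 specification, and the sketch assumes inputs are written in map form and that a DockerRequirement under hints is accepted, as in the sample step above.

```python
# Assumptions drawn from the limitations described above.
REQUIRED = "DockerRequirement"
UNSUPPORTED = {"InlineJavascriptRequirement"}


def requirement_names(section):
    """Collect class names from a requirements or hints section."""
    if isinstance(section, dict):  # map form: {DockerRequirement: {...}}
        return set(section)
    return {entry["class"] for entry in section or []}  # list form or absent


def check_tool(tool):
    """Return a list of problems found in a parsed CWL tool description."""
    declared = requirement_names(tool.get("requirements")) | requirement_names(
        tool.get("hints")
    )
    problems = []
    if REQUIRED not in declared:
        problems.append("missing DockerRequirement")
    for name in sorted(declared & UNSUPPORTED):
        problems.append(f"unsupported requirement: {name}")
    for name, spec in tool.get("inputs", {}).items():
        if isinstance(spec, dict) and "secondaryFiles" in spec:
            problems.append(f"input '{name}' uses secondaryFiles")
    return problems


good_tool = {
    "class": "CommandLineTool",
    "hints": {"DockerRequirement": {"dockerPull": "genomicpariscentre/samtools"}},
    "inputs": {"bamfile": {"type": "File"}, "bamindexfilename": {"type": "string"}},
}
bad_tool = {
    "class": "CommandLineTool",
    "requirements": [{"class": "InlineJavascriptRequirement"}],
    "inputs": {"bamfile": {"type": "File", "secondaryFiles": [".bai"]}},
}

print(check_tool(good_tool))
print(check_tool(bad_tool))
```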
Supported Requirements
- DockerRequirement (also a required value)
Reference
- BAM - https://en.wikipedia.org/wiki/Binary_Alignment_Map
- Common Workflow Language - https://www.commonwl.org/
- Cromwell - https://cromwell.readthedocs.io/en/stable/
- Docker Overview - https://docs.docker.com/engine/docker-overview/
- Rabix CWL-SVG - https://github.com/rabix/cwl-svg
- Registry of Docker based tools and workflows defined in CWL or WDL for the sciences - https://dockstore.org