2.16. Best practices#

Below are a set of recommended good practices to keep in mind when writing a Common Workflow Language description for a tool or workflow. These guidelines are presented for consideration on a scale of usefulness: more is better, not all are required.

  • No type: string parameters for names of input or reference files/directories; use type: File or type: Directory as appropriate.

  • Include a license that allows for re-use by anyone, e.g. Apache 2.0. If possible, the license should be specified with its corresponding SPDX identifier. Construct the metadata field for the licence by providing a URL of the form https://spdx.org/licenses/[SPDX-ID] where SPDX-ID is the taken from the list of identifiers linked above. See the example snippet below for guidance. For non-standard licenses without an SPDX identifier, provide a URL to the license.

    Example of metadata field for license with SPDX identifier:

    $namespaces:
      s: https://schema.org/
    s:license: https://spdx.org/licenses/Apache-2.0
    # other s: declarations
    

    For more examples of providing metadata within CWL descriptions, see the Metadata and Authorship section of this User Guide.

  • Include attribution information for the author(s) of the CWL tool or workflow description. Use unambiguous identifiers like ORCID.

  • In tool descriptions, list dependencies using short name(s) under SoftwareRequirement.

  • Include SciCrunch identifiers for dependencies in https://identifiers.org/rrid/RRID:SCR_NNNNNN format.

  • All input and output identifiers should reflect their conceptual identity. Use informative names like unaligned_sequences, reference_genome, phylogeny, or aligned_sequences instead of foo_input, foo_file, result, input, output, and so forth.

  • In tool descriptions, include a list of version(s) of the tool that are known to work with this description under SoftwareRequirement.

  • format should be specified for all input and output Files. Bioinformatics tools should use format identifiers from EDAM. See also iana:text/plain, iana:text/tab-separated-values with $namespaces: { iana: "https://www.iana.org/assignments/media-types/" }. Full IANA media type list (also known as MIME types). For non-bioinformatics tools use or build an appropriate ontology/controlled vocabulary in the same way. Please edit this page to let us know about it.

  • Mark all input and output Files that are read from or written to in a streaming compatible way (only once, no random-access), as streamable: true.

  • Each CommandLineTool description should focus on a single operation only, even if the (sub)command is capable of more. Don’t overcomplicate your tool descriptions with options that you don’t need/use.

  • Custom types should be defined with one external YAML per type definition for re-use.

  • Include a top level short label summarising the tool/workflow.

  • If useful, include a top level doc as well. This should provide a longer, more detailed description than was provided in the top level label (see above).

  • Use type: enum instead of type: string for elements with a fixed list of valid values.

  • Evaluate all use of JavaScript for possible elimination or replacement. One common example: manipulating File names and paths? Consider whether one of the built in File properties like basename, nameroot, nameext, etc., could be used instead.

  • Give the tool description to a colleague (preferably at a different institution) to test and provide feedback.

  • Complex workflows with individual components which can be abstracted should utilise the SubworkflowFeatureRequirement to make their workflow modular and allow sections of them to be easily reused.

  • Software containers should be made to be conformant to the “Recommendations for the packaging and containerizing of bioinformatics software” (also useful to other disciplines).