Review and update part 1 materials #10

Open
wants to merge 9 commits into
base: main
from
10 changes: 6 additions & 4 deletions docs/index.md
Original file line number Diff line number Diff line change
@@ -12,10 +12,12 @@ This workshop will provide you with the foundational knowledge required to build

## Prerequisites

This is an intermediate-advanced workshop for people developing reproducible bioinformatics workflows.
This is an intermediate-advanced workshop for people developing reproducible bioinformatics workflows. It assumes some experience with the following:

* Experience working on the command line/Linux environment.
* Experience developing reproducible workflows (e.g., bash, CWL, WDL, or Snakemake).
* Experience with basic scripting (e.g. Bash).

In addition, experience with other reproducible workflow tools (e.g. CWL, WDL, or Snakemake) will be very useful, although not at all required for this workshop.

## Set up requirements

@@ -31,7 +33,7 @@ In order to foster a positive and professional learning environment we encourage
* Show courtesy and respect towards other community members
* Our full code of conduct, with incident reporting guidelines, is available here.

## Workshop schedule
## Workshop schedule

### Day 1

@@ -65,6 +67,6 @@ at the end of the workshop. Help us help you! 😁

## Credits and acknowledgements

This workshop event and accompanying materials were developed by the Sydney Informatics Hub, University of Sydney in partnership with Seqera. The workshop was enabled through the Australian BioCommons - [BioCLI Platforms Project](https://www.biocommons.org.au/biocli) (NCRIS via Bioplatforms Australia).
This workshop event and accompanying materials were developed by the Sydney Informatics Hub, University of Sydney in partnership with Seqera. The workshop was enabled through the Australian BioCommons - [BioCLI Platforms Project](https://www.biocommons.org.au/biocli) (NCRIS via Bioplatforms Australia).

![](./img/logos.png)
2 changes: 1 addition & 1 deletion docs/part1/00_intro.md
@@ -11,7 +11,7 @@ During **Part 2**, the skills and concepts you have learned in Part 1 will be ap
It is good practice to organize projects into their own folders to make it easier to track and replicate experiments over time.
We have created separate directories for each part (`~/part1/` and `~/part2/`).

!!!question "Exercise"
!!! question "Exercise"

In the VSCode terminal, move into the directory for all Part 1 activities:

6 changes: 3 additions & 3 deletions docs/part1/01_hellonextflow.md
@@ -22,21 +22,21 @@ Nextflow’s **core features** are:

## Processes, tasks, and channels

A Nextflow workflow is made by joining together **processes**. Each process can be written in any scripting language that can be executed by the Linux platform Processes can be written in any language that can be executed from the command line, such as Bash, Python, or R.
A Nextflow workflow is made by joining together **processes**. Each process can be written in any scripting language that can be executed from the command line, such as Bash, Python, or R.

Processes are executed independently (i.e., they do not share a common writable state) as **tasks** and can run in parallel, allowing for efficient utilization of computing resources. Nextflow automatically manages the data dependencies between processes, ensuring that each process is executed only when its input data is available and all of its dependencies have been satisfied.

The only way they can communicate is via asynchronous first-in, first-out (FIFO) queues, called **channels**. Simply, every input and output of a process is represented as a channel. The interaction between these processes, and ultimately the workflow execution flow itself, is implicitly defined by these input and output declarations.
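
As a loose analogy only (this is not how Nextflow implements channels), the sketch below uses a named pipe in Bash to show two independent processes that share no state and communicate solely through a first-in, first-out queue. The names `fifo_dir`, `channel`, and `sample1` to `sample3` are made up for illustration.

```bash
# Analogy only: a named pipe standing in for a Nextflow channel
fifo_dir=$(mktemp -d)
mkfifo "$fifo_dir/channel"

# Producer: writes three items into the queue in the background
printf '%s\n' sample1 sample2 sample3 > "$fifo_dir/channel" &

# Consumer: reads items in the order they were written
received=$(while read -r item; do echo "processed $item"; done < "$fifo_dir/channel")

wait                 # let the background producer finish
rm -r "$fifo_dir"    # clean up the temporary pipe
echo "$received"
```

The consumer cannot start working on an item until the producer has written it, which mirrors how a process only runs once its input channel has data available.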

![Image title](img/myworkflow.excalidraw.png)
![An example Nextflow schematic](img/myworkflow.excalidraw.png)

## Execution abstraction

While a process defines what command or script is executed, the **executor** determines how and where the script is executed.

Nextflow provides an **abstraction** between the workflow’s functional logic and the underlying execution system. This abstraction allows users to define a workflow once and execute it on different computing platforms without having to modify the workflow definition. Nextflow provides a variety of built-in execution options, such as local execution, HPC cluster execution, and cloud-based execution, and allows users to easily switch between these options using command-line arguments.

![Image title](img/abstraction.excalidraw.png)
![Execution abstraction of a Nextflow workflow](img/abstraction.excalidraw.png)

## More information

26 changes: 24 additions & 2 deletions docs/part1/02_helloworld.md
@@ -12,7 +12,7 @@ Let's demonstrate this with simple commands that you can run directly in the ter

The **`echo`** command in Linux is a built-in command that allows users to display lines of text or strings that are passed as arguments. It is commonly used in shell scripts and batch files to output status text to the screen or a file.

The most straightforward usage of the `echo` command is to display a text or string on the terminal. To do this, you simply provide the desired text or string as an argument to the `echo` command:
The most straightforward usage of the `echo` command is to display text or a string on the terminal. To do this, you simply provide the desired text or string as an argument to the `echo` command:

```bash
echo <string>
@@ -28,6 +28,10 @@ echo <string>
echo 'Hello World!'
```

``` title="Output"
Hello World!
```

## Redirect outputs

The output of the `echo` can be redirected to a file instead of displaying it on the terminal. You can achieve this by using the **`>`** operator for output redirection. For example:
@@ -36,7 +40,13 @@ The output of the `echo` can be redirected to a file instead of displaying it on
echo 'Welcome!' > output.txt
```

This will write the output of the echo command to the file name `output.txt`.
Notice that nothing is printed in the terminal.

``` title="Output"

```

Instead, this writes the output of the `echo` command to a file named `output.txt`.
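
To check both claims at once, a minimal sketch like the one below may help. It captures whatever the redirected command would have printed to the terminal and then reads the file back. The variable names `tmpdir`, `printed`, and `saved` are illustrative, and the file is written to a temporary directory so nothing in your working directory is touched.

```bash
# Write into a throwaway directory so no existing files are overwritten
tmpdir=$(mktemp -d)

# Capture anything the redirected command still sends to the terminal
printed=$(echo 'Welcome!' > "$tmpdir/output.txt")

# Read back what was written to the file
saved=$(cat "$tmpdir/output.txt")

echo "printed to terminal: '$printed'"
echo "saved in output.txt: '$saved'"
```

The first variable is empty because `>` sends the text to the file rather than to the terminal, while the second holds `Welcome!`.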

!!!question "Exercise"

@@ -48,6 +58,10 @@ This will write the output of the echo command to the file name `output.txt`.
echo 'Hello World!' > output.txt
```

``` title="Output"

```

## List files

The Linux shell command **`ls`** lists the contents of a directory, including its files and subdirectories. It can also display information about their attributes.
@@ -70,6 +84,10 @@ ls

A file named `output.txt` should now be listed in your current directory.

``` title="Output"
output.txt
```

## View file contents

The **`cat`** command in Linux is a versatile companion for various file-related operations, allowing users to view, concatenate, create, copy, merge, and manipulate file contents.
@@ -92,6 +110,10 @@ cat <file name>

You should see `Hello World!` printed to your terminal.

``` title="Output"
Hello World!
```

!!! abstract "Summary"

In this step you have learned:
18 changes: 9 additions & 9 deletions docs/part1/03_hellonf.md
@@ -7,15 +7,15 @@

Workflow languages are better than Bash scripts because they handle errors and run tasks in parallel more easily, which is important for complex jobs. They also have clearer structure, making it easier to maintain and work on with others.

Here, you're going learn more about the Nextflow language and take your first steps making a **your first pipeline** with Nextflow.
Here, you're going to learn more about the Nextflow language and take your first steps making **your first pipeline** with Nextflow.

## `hello-world.nf`
## Writing your first pipeline: `hello-world.nf`

Nextflow pipelines need to be saved as `.nf` files.
Nextflow pipelines are written inside `.nf` files. They consist of a combination of two main components: **processes** and the **workflow** itself. Each process describes a single step of the pipeline, including its inputs and expected outputs, as well as the code to run it. The workflow then defines the logic that puts all of the processes together.

The process definition starts with the keyword `process`, followed by process name, and finally the process body delimited by curly braces. The process body must contain a `script` block which represents the command or, more generally, a script that is executed by it.
A process definition starts with the keyword `process`, followed by the process name, and finally the process body delimited by curly braces. The process body must contain a `script` block, which holds the command or, more generally, the script that the process executes.

A process may contain any of the following definition blocks: `directives`, `inputs`, `outputs`, `when` clauses, and of course, `script`.
A process may contain any of the following definition blocks: `directives`, `input`, `output`, `when` clauses, and of course, `script`.

```groovy
process < name > {
@@ -41,7 +41,7 @@ A workflow is a composition of processes and dataflow logic.

The workflow definition starts with the keyword `workflow`, followed by an optional name, and finally the workflow body delimited by curly braces.

Let's review the structure of `hello-world.nf`, a toy example you will be executing and developing:
Let's review the structure of `hello-world.nf`, a toy example you will be developing and executing:

```groovy title="hello-world.nf" linenums="1"
process SAYHELLO {
@@ -101,7 +101,7 @@ As a developer you can choose how and where to comment your code.

The solution may look something like this:

```groovy title="hello-world.nf"
```groovy title="hello-world.nf" hl_lines="1-3"
/*
* Use echo to print 'Hello World!' to standard out
*/
@@ -111,7 +111,7 @@ As a developer you can choose how and where to comment your code.

Or this:

```groovy title="hello-world.nf"
```groovy title="hello-world.nf" hl_lines="1"
// Use echo to print 'Hello World!' to standard out
process SAYHELLO {
<truncated>
@@ -165,7 +165,7 @@ Hello World!
4. The first process is executed once, which means there is one task. The line starts with a unique hexadecimal value, and ends with the task completion information
5. The result string from stdout is printed

## Task directories
## Understanding the task directories

When a Nextflow pipeline is executed, a `work` directory is created. Processes are executed in isolated **task** directories. Each task uses a unique directory based on its [hash](https://www.nextflow.io/docs/latest/cache-and-resume.html#task-hash) (e.g., `4e/6ba912`) within the work directory.

29 changes: 19 additions & 10 deletions docs/part1/04_output.md
@@ -5,7 +5,7 @@
1. Utilize Nextflow process output blocks
2. Publish results from your pipeline with directives

Instead of printing 'Hello World!' to the standard output it can be saved to a file. In a "real-world" pipeline, this is like having a command that specifies an output file as part of its normal syntax.
Currently, our pipeline is simply printing 'Hello World!' to the terminal via the standard output (`stdout`). This isn't particularly useful if we want to do anything with the outputs of our processes. Instead, we can save the output of our process to a file that can be passed on to other processes later on. In a "real-world" pipeline, this is like having a command that specifies an output file as part of its normal syntax.

Here you're going to update the `script` and the `output` definition blocks to save the 'Hello World!' as an output.

@@ -36,11 +36,14 @@ The `>` operator can be used for output redirection.
}
```

## Outputs blocks
## Capturing outputs

Outputs in the output definition block typically require an **output qualifier** and a **output name**:
We have now updated our script to write 'Hello World!' to `output.txt`, but we also need to tell Nextflow to expect this file - otherwise, it will ignore it! Nextflow requires us to **declare** what outputs should be captured from each process. This is particularly useful for a number of reasons. First, many tools will generate intermediate files that we don't need, and capturing all of them would be messy and unnecessary. Second, Nextflow uses the outputs we declare to figure out how and when to run each process. And finally, by declaring our process outputs, Nextflow has a way to determine whether our process succeeded or not; if an output is declared but is missing at the end of the process, Nextflow will assume it has failed.

We declare our outputs using the `output` definition block. Typically this will require both an **output qualifier** and an **output name**:

```groovy
output:
<output qualifier> <output name>
```

@@ -66,15 +69,23 @@ output:
path 'output.txt'
```

The output name and the file generated by the script must match (or be picked up by a glob pattern).
The output name and the file generated by the script must exactly match (or be picked up by a glob pattern), or else Nextflow won't find it and will throw an error.

!!! note

It is important to understand that the `output` block does not *determine* the output of the process. Instead, it simply *declares* what output should be expected. It is up to the logic inside the `script` block to ensure that the file is actually being created.
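
The idea of *declaring* rather than *creating* an output can be sketched in plain Bash. This is an analogy under stated assumptions, not Nextflow's actual implementation: `check_declared_output` is a hypothetical helper that simply tests whether any file matches a declared name or glob pattern after a script has run.

```bash
# Hypothetical helper: succeeds only if some file matches the declared pattern
check_declared_output() {
    # $1 is left unquoted on purpose so the shell expands glob patterns
    for candidate in $1; do
        [ -e "$candidate" ] && return 0
    done
    return 1
}

# Stand-in for a task directory where the script block has already run
taskdir=$(mktemp -d)
cd "$taskdir"
echo 'Hello World!' > output.txt

# Compare three declarations against the file that was actually created
if check_declared_output 'output.txt'; then exact="found"; else exact="missing"; fi
if check_declared_output '*.txt'; then glob="found"; else glob="missing"; fi
if check_declared_output 'outputs.txt'; then typo="found"; else typo="missing"; fi

echo "output.txt: $exact, *.txt: $glob, outputs.txt: $typo"
```

The exact name and the glob pattern both match the file the script wrote, while the misspelled declaration matches nothing, which is the situation in which Nextflow would report an error.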

So far, we have been using the `stdout` output declaration, which tells Nextflow to capture all of the information sent to the standard output. This is a special output qualifier in that it doesn't require an output name to go along with it.

Now that we are redirecting our 'Hello World!' message to a file, we want to tell Nextflow to expect an output file called `output.txt`.

!!!question "Exercise"

Add `path 'output.txt'` in the `SAYHELLO` output block.

???Solution

```groovy title="hello-world.nf" hl_lines="4-6"
```groovy title="hello-world.nf" hl_lines="6"
// Use echo to print 'Hello World!' and redirect to output.txt
process SAYHELLO {
debug true
@@ -93,19 +104,17 @@ The output name and the file generated by the script must match (or be picked up

This example is brittle because the output filename is hardcoded in two separate places (the `script` and the `output` definition blocks). If you change one but not the other, the script will break.

## Publishing directory

Without a **publishing** strategy any files that are created by a process will only exist in the `work` directory.
## Publish outputs

Realistically, you may want to capture a set of outputs and save them in a specific directory.
By default, all files created by processes exist only inside the `work` directory. To make our outputs more accessible and neatly organised, we define a **publishing strategy**, which determines which outputs should be made available (as symbolic links by default) in a final **publishing directory**.

The [`publishDir` directive](https://www.nextflow.io/docs/latest/process.html#publishdir) can be used to specify where and how output files should be saved. For example:

```groovy
publishDir 'results'
```

By adding the above to a process, all output files would be saved in a new folder called `results` in the current working directory. The process directive is process specific.
By adding the above to a process, all output files would be saved in a new folder called `results` in the current working directory. The `publishDir` directive is process specific.
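
Conceptually, publishing makes a file from a task directory visible in the publishing directory. The Bash sketch below imitates this by hand, assuming Nextflow's default publishing mode, which creates symbolic links rather than copies. The directory layout under `demo` is illustrative, and the task directory name is shortened to the example hash `4e/6ba912`.

```bash
# Illustrative layout: one task directory and one publishing directory
demo=$(mktemp -d)
mkdir -p "$demo/work/4e/6ba912" "$demo/results"

# The task writes its output inside its own task directory
echo 'Hello World!' > "$demo/work/4e/6ba912/output.txt"

# Publishing by symbolic link: results/ points back at the file in work/
ln -s "$demo/work/4e/6ba912/output.txt" "$demo/results/output.txt"

# The published file reads exactly like the original
published=$(cat "$demo/results/output.txt")
echo "$published"
```

Because the published file is only a link in this sketch, deleting the `work` directory would leave `results/output.txt` pointing at nothing.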

!!!question "Exercise"
