Wip clarification usage #358

Merged
merged 3 commits into from
Apr 12, 2024
Changes from 2 commits
49 changes: 28 additions & 21 deletions docs/usage.md
@@ -6,57 +6,64 @@

## Samplesheet input

You will need to create a samplesheet with information about the samples you would like to analyse before running the pipeline. Use this parameter to specify its location. It has to be a comma-separated file with 4 columns, and a header row as shown in the examples below.
Before running the pipeline, create a samplesheet with information about the samples you would like to analyse. The samplesheet lists the files that will be passed to the pipeline as inputs. Use the `--input` parameter to specify its location. It has to be a comma-separated file with 4 columns and a header row, as shown in the examples below.

```bash
--input '[path to samplesheet file]'
```
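
As a rough illustration of the header requirement described above, the sketch below checks that a samplesheet starts with the four expected columns. It is not the pipeline's own validation code; `check_header` and `REQUIRED_COLUMNS` are hypothetical names introduced for this example.

```python
import csv
import io

# The four leading columns named in the samplesheet examples.
REQUIRED_COLUMNS = ["sample", "fastq_1", "fastq_2", "replicate"]


def check_header(handle):
    """Return the header row if its first four columns match REQUIRED_COLUMNS."""
    header = next(csv.reader(handle), None)
    if header is None:
        raise ValueError("samplesheet is empty")
    if header[:4] != REQUIRED_COLUMNS:
        raise ValueError(
            f"header must start with {REQUIRED_COLUMNS}, got {header}"
        )
    return header


# In-memory stand-in for a samplesheet file on disk.
example = io.StringIO(
    "sample,fastq_1,fastq_2,replicate\n"
    "SAMPLE_A,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz,1\n"
)
print(check_header(example))
```

Running a check like this before launching the pipeline can catch a misnamed or reordered column early.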

### Multiple replicates

The `sample` identifier is the same when you have multiple biological replicates from the same experimental group, just increment the `replicate` identifier appropriately. The first replicate value for any given experimental group must be 1. Below is an example for a single experimental group in triplicate:
The `sample` identifier is the same when you have multiple biological replicates from the same experimental group; just increment the `replicate` identifier appropriately. The first replicate value for any given experimental group must be 1. Below is an example for paired-end sequencing of an ATAC-seq experiment performed in triplicate for the cell line "A":

```console
sample,fastq_1,fastq_2,replicate
CONTROL,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz,1
CONTROL,AEG588A1_S1_L003_R1_001.fastq.gz,AEG588A1_S1_L003_R2_001.fastq.gz,2
CONTROL,AEG588A1_S1_L004_R1_001.fastq.gz,AEG588A1_S1_L004_R2_001.fastq.gz,3
SAMPLE_A,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz,1
SAMPLE_A,AEG588A1_S1_L003_R1_001.fastq.gz,AEG588A1_S1_L003_R2_001.fastq.gz,2
SAMPLE_A,AEG588A1_S1_L004_R1_001.fastq.gz,AEG588A1_S1_L004_R2_001.fastq.gz,3
```

The pipeline will automatically append the `&_REP<BIOLOGICAL_REPLICATE_NUMBER>` suffix to the sample name within the pipeline e.g. `CONTROL_REP1`, `CONTROL_REP2` and `CONTROL_REP3` using the example above. If you don't have replicates you can set the `replicate` value to 1 for all of your samples.
The pipeline will automatically append the `&_REP<BIOLOGICAL_REPLICATE_NUMBER>` suffix to the sample name within the pipeline e.g. `SAMPLE_A_REP1`, `SAMPLE_A_REP2` and `SAMPLE_A_REP3` using the example above. If you don't have replicates you can set the `replicate` value to 1 for all of your samples. Below is an example for paired-end sequencing of an ATAC-seq experiment performed without replicates for the cell line "A" and for the tissue "B":

```console
sample,fastq_1,fastq_2,replicate
SAMPLE_A,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz,1
SAMPLE_B,BEG599B2_S1_L003_R1_001.fastq.gz,BEG599B2_S1_L003_R2_001.fastq.gz,1
```
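
The replicate naming described above can be sketched as follows. This is only an illustration of the convention, not the pipeline's implementation; `replicate_names` is a hypothetical helper, and it also enforces the rule that the first replicate of each group must be 1.

```python
from collections import defaultdict


def replicate_names(rows):
    """Build <sample>_REP<n> names from (sample, replicate) pairs."""
    replicates = defaultdict(set)
    for sample, replicate in rows:
        replicates[sample].add(int(replicate))

    names = []
    for sample, reps in replicates.items():
        # The first replicate value for any experimental group must be 1.
        if min(reps) != 1:
            raise ValueError(f"{sample}: first replicate must be 1")
        names.extend(f"{sample}_REP{rep}" for rep in sorted(reps))
    return names


print(replicate_names([("SAMPLE_A", 1), ("SAMPLE_A", 2), ("SAMPLE_A", 3)]))
# ['SAMPLE_A_REP1', 'SAMPLE_A_REP2', 'SAMPLE_A_REP3']
```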

### Multiple runs of the same sample

The `sample` and `replicate` identifiers have to be the same when you have re-sequenced the same sample more than once e.g. to increase sequencing depth. The pipeline will perform the alignments in parallel, and subsequently merge them before further analysis. Below is an example a sample sequenced across 3 lanes:
The `sample` and `replicate` identifiers have to be the same when you have re-sequenced the same sample more than once e.g. to increase sequencing depth. The pipeline will perform the alignments in parallel, and subsequently merge them before further analysis. Below is an example of what the samplesheet for `SAMPLE_A` would look like if it were sequenced across 3 lanes:

```csv title="samplesheet.csv"
sample,fastq_1,fastq_2,replicate
CONTROL,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz,1
CONTROL,AEG588A1_S1_L003_R1_001.fastq.gz,AEG588A1_S1_L003_R2_001.fastq.gz,1
CONTROL,AEG588A1_S1_L004_R1_001.fastq.gz,AEG588A1_S1_L004_R2_001.fastq.gz,1
SAMPLE_A,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz,1
SAMPLE_A,AEG588A1_S1_L003_R1_001.fastq.gz,AEG588A1_S1_L003_R2_001.fastq.gz,1
SAMPLE_A,AEG588A1_S1_L004_R1_001.fastq.gz,AEG588A1_S1_L004_R2_001.fastq.gz,1
```

The pipeline will automatically append the `*_T<TECHNICAL_REPLICATE_NUMBER>` suffix to the sample name within the pipeline e.g. `CONTROL_REP1_T1`, `CONTROL_REP1_T2` and `CONTROL_REP1_T3` using the example above.
The pipeline will automatically append the `*_T<TECHNICAL_REPLICATE_NUMBER>` suffix to the sample name within the pipeline e.g. `SAMPLE_A_REP1_T1`, `SAMPLE_A_REP1_T2` and `SAMPLE_A_REP1_T3` using the example above.
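
A minimal sketch of the technical-replicate naming is shown below. It assumes that runs sharing the same `sample` and `replicate` are numbered in the order they appear in the samplesheet; the text above does not state the ordering, so treat that as an assumption. `run_names` is a hypothetical helper, not part of the pipeline.

```python
from collections import Counter


def run_names(rows):
    """Build <sample>_REP<n>_T<m> names from (sample, replicate) pairs.

    Assumption: technical replicates are numbered by order of appearance.
    """
    seen = Counter()
    names = []
    for sample, replicate in rows:
        key = (sample, int(replicate))
        seen[key] += 1
        names.append(f"{sample}_REP{key[1]}_T{seen[key]}")
    return names


# Three lanes of the same sample and replicate, as in the example above.
print(run_names([("SAMPLE_A", 1), ("SAMPLE_A", 1), ("SAMPLE_A", 1)]))
# ['SAMPLE_A_REP1_T1', 'SAMPLE_A_REP1_T2', 'SAMPLE_A_REP1_T3']
```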

### Control data
### INPUT control data

If controls are to be used for peak calling use the parameter `--with_control`. In this case, the samplesheet file needs the additional columns `control` and `control_replicate`. These should be the sample identifier and sample replicate for the controls.
An input control is a file that can be used during peak calling to estimate the background of the experiment. If sequencing data for input controls is available, it can be used for peak calling with the parameter `--with_control`. In this case, the samplesheet file needs the additional columns `control` and `control_replicate`. These should be the sample identifier and sample replicate for the input controls, as in the example below.

### Full samplesheet

The pipeline will auto-detect whether a sample is single- or paired-end using the information provided in the samplesheet. The samplesheet can have as many columns as you desire, however, there is a strict requirement for the first 4 columns to match those defined in the table below.

A final samplesheet file consisting of both single- and paired-end data may look something like the one below. This is for 7 samples, where we have biological triplicates for both the `CONTROL` and `TREATMENT` groups, and the third replicate in the `TREATMENT` group has been a technical replicate as a result of being sequenced twice.
A final samplesheet file consisting of both single- and paired-end data may look something like the one below. This is for 7 samples, where we have biological triplicates for both the control condition `UNTREATED_A` (for example, cell line "A" untreated) and the treatment condition `TREATED_A` (for example, cell line "A" treated with a compound), and the third replicate in the `TREATED_A` group has a technical replicate as a result of being sequenced twice. In this example, a single INPUT control, `INPUT_A`, is available with no replicates.

```csv title="samplesheet.csv"
sample,fastq_1,fastq_2,replicate,control,control_replicate
CONTROL,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz,1,,
CONTROL,AEG588A2_S2_L002_R1_001.fastq.gz,AEG588A2_S2_L002_R2_001.fastq.gz,2,,
CONTROL,AEG588A3_S3_L002_R1_001.fastq.gz,AEG588A3_S3_L002_R2_001.fastq.gz,3,,
TREATMENT,AEG588A4_S4_L003_R1_001.fastq.gz,,1,CONTROL,1
TREATMENT,AEG588A5_S5_L003_R1_001.fastq.gz,,2,CONTROL,2
TREATMENT,AEG588A6_S6_L003_R1_001.fastq.gz,,3,CONTROL,3
TREATMENT,AEG588A6_S6_L004_R1_001.fastq.gz,,3,CONTROL,3
INPUT_A,IEG577I1_S1_L001_R1_001.fastq.gz,IEG577I1_S1_L002_R2_001.fastq.gz,1,,
UNTREATED_A,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz,1,INPUT_A,1
UNTREATED_A,AEG588A2_S2_L002_R1_001.fastq.gz,AEG588A2_S2_L002_R2_001.fastq.gz,2,INPUT_A,1
UNTREATED_A,AEG588A3_S3_L002_R1_001.fastq.gz,AEG588A3_S3_L002_R2_001.fastq.gz,3,INPUT_A,1
TREATED_A,AEG588A4_S4_L003_R1_001.fastq.gz,,1,INPUT_A,1
TREATED_A,AEG588A5_S5_L003_R1_001.fastq.gz,,2,INPUT_A,1
TREATED_A,AEG588A6_S6_L003_R1_001.fastq.gz,,3,INPUT_A,1
TREATED_A,AEG588A6_S6_L004_R1_001.fastq.gz,,3,INPUT_A,1
```
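
The sketch below shows one way to read such a samplesheet and summarise each row. It assumes, based on the example above, that an empty `fastq_2` field marks single-end data, and that every `control`/`control_replicate` pair should match a `sample`/`replicate` row in the same file. `summarise` is a hypothetical helper and does not reproduce the pipeline's own checks.

```python
import csv
import io


def summarise(handle):
    """Return (sample, replicate, layout, control) for each samplesheet row."""
    rows = list(csv.DictReader(handle))
    known = {(row["sample"], row["replicate"]) for row in rows}

    summary = []
    for row in rows:
        # Assumption: an empty fastq_2 field means the run is single-end.
        layout = "paired-end" if row["fastq_2"] else "single-end"
        control = row.get("control") or ""
        control_rep = row.get("control_replicate") or ""
        # Assumption: a referenced control must exist as its own row.
        if control and (control, control_rep) not in known:
            raise ValueError(
                f"{row['sample']}: control {control} replicate "
                f"{control_rep} not found in samplesheet"
            )
        summary.append((row["sample"], row["replicate"], layout, control))
    return summary


# Two rows taken from the example samplesheet above.
sheet = io.StringIO(
    "sample,fastq_1,fastq_2,replicate,control,control_replicate\n"
    "INPUT_A,IEG577I1_S1_L001_R1_001.fastq.gz,IEG577I1_S1_L002_R2_001.fastq.gz,1,,\n"
    "TREATED_A,AEG588A4_S4_L003_R1_001.fastq.gz,,1,INPUT_A,1\n"
)
for entry in summarise(sheet):
    print(entry)
```

Keying the lookup on the `(sample, replicate)` pair means a control with a mistyped replicate number is reported, not silently ignored.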

| Column | Description |