Update reference_genomes_custom_genomes.md #4586

Merged
merged 9 commits into from
Dec 18, 2023
45 changes: 33 additions & 12 deletions faqs/galaxy/datasets_chromosome_identifiers.md
@@ -6,24 +6,45 @@ layout: faq
contributors: [jennaj, Melkeb]
---

Reference data mismatches are similar to bad reagents in a wet lab experiment: all sorts of odd problems can come up!

- The methods listed here help to identify and correct errors or unexpected results linked to inputs having non-identical chromosome identifiers and/or different chromosome sequence content.
Your inputs must all be based on an identical genome assembly build to achieve correct scientific results.

- **If using a Custom Reference genome**, the methods below also apply, but the first step is to make certain that the Custom Genome is formatted correctly. Improper formatting is the most common root cause of CG related errors.
There are two areas to review for data to be considered **identical**.
1. The data are based on the same exact genome **assembly** (or "assembly release").
* The "assembly" refers to the nucleotide sequence of the genome.
* If the base order and length of the chromosomes are not the same, then coordinates from one dataset will not refer to the same bases in another, and the results will be scientifically incorrect.
* Converting coordinates between assemblies may be possible. Search the tool panel for `CrossMap`.
2. The data are based on the same exact genome assembly **build**.
* The "build" refers to the labels used inside the file. In this context, pay attention to the chromosome identifiers.
* These may all mean the same thing to a person but not to a computer or tool: `chr1`, `Chr1`, `1`, `chr1.1`
* Converting identifiers between builds may be possible. Search the tool panel for `Replace`.
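
As a rough illustration of the identifier check described above, the following Python sketch compares the chromosome labels found in two inputs. The function names and the normalization rule (strip a leading `chr`, ignore case) are assumptions made for this example and are not part of Galaxy or any tool mentioned here. Matching labels do not prove that the underlying assembly is the same, so this only covers the "build" half of the comparison.

```python
def normalize_identifier(name):
    """Reduce a label to a comparable form (illustrative rule only)."""
    core = name.strip()
    if core.lower().startswith("chr"):
        core = core[3:]
    return core.upper()


def compare_identifiers(ids_a, ids_b):
    """Group labels into exact matches, label-only mismatches, and leftovers."""
    set_a, set_b = set(ids_a), set(ids_b)
    exact = set_a & set_b

    # Index the unmatched labels from B by their normalized form.
    normalized_b = {normalize_identifier(x): x for x in set_b - exact}

    # Labels in A that differ from a label in B only by prefix or case.
    near = {}
    for label in set_a - exact:
        key = normalize_identifier(label)
        if key in normalized_b:
            near[label] = normalized_b[key]

    return {
        "exact": exact,
        "label_only_mismatch": near,
        "only_in_a": set_a - exact - set(near),
        "only_in_b": set_b - exact - set(near.values()),
    }


# Hypothetical labels, e.g. from a BAM header and from an annotation file.
report = compare_identifiers(["chr1", "chr2", "chrM"], ["1", "2", "MT"])
for group, members in report.items():
    print(group, members)
```

Labels that land in `label_only_mismatch` are candidates for a find-and-replace fix. Labels that land in `only_in_a` or `only_in_b` (here `chrM` and `MT`) need a closer look, since they may point to a naming difference this simple rule misses or to a different assembly.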

The methods listed below help to identify and correct errors or unexpected results when the underlying genome assembly build is **not identical** across all inputs.

Method 1: [Finding BAM dataset identifiers]({% link faqs/galaxy/datasets_BAM_dataset_identifiers.md %})
**Method 1**: [Finding BAM dataset identifiers]({% link faqs/galaxy/datasets_BAM_dataset_identifiers.md %})

Method 2: [Directly obtaining UCSC sourced *genome* identifiers]({% link faqs/galaxy/datasets_UCSC_sourced_genome.md %})
**Method 2**: [Directly obtaining UCSC sourced *genome* identifiers]({% link faqs/galaxy/datasets_UCSC_sourced_genome.md %})

Method 3: [Adjusting identifiers for UCSC sourced data used with other sourced data](https://galaxyproject.org/support/chrom-identifiers/#adjusting-identifiers-or-input-source)
**Method 3**: [Adjusting identifiers for UCSC sourced data used with other sourced data](https://galaxyproject.org/support/chrom-identifiers/#adjusting-identifiers-or-input-source)

Method 4: [Adjusting identifiers or input source for any mixed sourced data](https://galaxyproject.org/support/chrom-identifiers/#any-mixed-sourced-data)
**Method 4**: [Adjusting identifiers or input source for any mixed sourced data](https://galaxyproject.org/support/chrom-identifiers/#any-mixed-sourced-data)

**A Note on Built-in Reference Genomes**

- The default variant for all genomes is "Full", defined as all primary chromosomes (or scaffolds/contigs) including mitochondrial plus associated unmapped, plasmid, and other segments.
- When only one version of a genome is available for a tool, it represents the default "Full" variant.
- Some genomes will have more than one variant available.
{% icon tip %} Reference data is self referential. [More help for your genome, transcriptome, and annotation]({% link faqs/galaxy/analysis_differential_expression_help.md %})

- The "Canonical Male" or sometimes simply "Canonical" variant contains the primary chromosomes for a genome. For example a human "Canonical" variant contains chr1-chr22, chrX, chrY, and chrM.
- The "Canonical Female" variant contains the primary chromosomes excluding chrY.
{% icon tip %} Genome not available as a native index? [Use a custom genome fasta]({% link faqs/galaxy/reference_genomes_custom_genomes.md %}) and [create a custom build database]({% link faqs/galaxy/analysis_add_custom_build.md %}) instead.

{% icon tip %} More notes on Native Reference Genomes

* Native **reference genomes** (FASTA) are built as pre-computed indexes on the Galaxy server where you are working.
* Different servers host both common *and* different reference genome data.
* Most **reference annotation** (tabular, GTF, GFF3) is supplied from the history by the user, even when the genome is indexed.
* Public Galaxy servers source reference genomes preferentially from [UCSC](https://hgdownload.soe.ucsc.edu/downloads.html).
* A **reference transcriptome** (FASTA) is supplied from the history by the user.
* Many experiments use a combination of all three types of reference data. Consider pre-preparing your files at the start!
* The default variant for a native genome index is "Full", defined as all primary chromosomes (or scaffolds/contigs) including mitochondrial plus associated unmapped, plasmid, and other segments.
* When only one version of a genome is available for a tool, it represents the default "Full" variant.
* Some genomes will have more than one variant available.
* The "Canonical Male" or sometimes simply "Canonical" variant contains the primary chromosomes for a genome. For example a human "Canonical" variant contains chr1-chr22, chrX, chrY, and chrM.
* The "Canonical Female" variant contains the primary chromosomes excluding chrY.
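
To make the variant definitions above concrete, here is a small Python sketch that builds the human "Canonical" identifier list (chr1-chr22, chrX, chrY, chrM) and reports which labels in a dataset fall outside it. Both function names are hypothetical helpers for this example, and the non-canonical label in the usage lines is only an illustrative UCSC-style scaffold name.

```python
def human_canonical_identifiers(include_y=True):
    """Return UCSC-style labels for the human primary chromosomes."""
    names = [f"chr{n}" for n in range(1, 23)] + ["chrX"]
    if include_y:
        names.append("chrY")  # omitted for the "Canonical Female" variant
    names.append("chrM")
    return names


def non_canonical(identifiers, canonical):
    """List the labels that are not part of the canonical set, in input order."""
    allowed = set(canonical)
    return [name for name in identifiers if name not in allowed]


canonical_male = human_canonical_identifiers()
canonical_female = human_canonical_identifiers(include_y=False)

# Hypothetical labels taken from a dataset mapped against a "Full" variant.
print(non_canonical(["chr1", "chrUn_gl000220", "chrM"], canonical_male))
```

Any label reported by `non_canonical` would be absent from a "Canonical" index, which is one way data mapped against a "Full" variant can fail to match a tool's reference.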
@@ -5,24 +5,37 @@ box_type: tip
layout: faq
contributors: [jennaj, AnomalyCodes]
---
1. Click on {% icon galaxy-gear %} in the history panel of the *sender* Galaxy server
2. Click on **Export to File**
3. Select either exporting history **to a link** or **to a remote file**
4. Click on the link text to generate a new archive for the history *if* exporting to a link
5. Wait for the link to generate
6. Copy the link address or click on the generated link to download the history archive
7. Click on **User** on the top menu of the *receiver* Galaxy server
8. Click on **Histories** to view saved histories
9. Click on **Import history** in the grey button on the top right
10. Select the appropriate importing method based on the choices made in steps 3 and 6
- Choose **Export URL from another galaxy instance** if link address was copied in step 6
- Select **Upload local file from your computer** if history archive was downloaded in step 6
- Choose **Select a remote file** if history was exported to a remote file in step 3
11. Click the link text to check out your histories if import is successful


If history being transferred is too large, you may:
1. Click on {% icon galaxy-gear %} in the history panel of the *sender* Galaxy server
2. Click **Copy Datasets** to move just the important datasets into a new history
3. Create the archive from that smaller history

**Transfer a Single Dataset**

At the **sender** Galaxy server, [set the history to a shared state]({% link faqs/galaxy/histories_sharing.md %}), then directly capture the {% icon link %} link for a dataset and paste the URL into the **Upload** tool at the **receiver** Galaxy server.

**Transfer an Entire History**

[Have an account]({% link faqs/galaxy/account_create.md %}) at two different Galaxy servers, and be logged into both.

At the **sender** Galaxy server

1. Navigate to the history you want to transfer, and [set the history to a shared state]({% link faqs/galaxy/histories_sharing.md %}).
2. Click into the **History Options** menu in the history panel.
3. Select from the menu {% icon fa-file-arch %} **Export History to File**.
4. Choose the option for **How do you want to export this History?** as **to direct download**.
5. Click on **Generate direct download**.
6. Allow the archive generation process to complete. \*
7. Copy the {% icon fa-link %} link for your new archive.

At the **receiver** Galaxy server

8. Confirm that you are logged into your account.
9. Click on **User** in the top menu, and choose **Histories** to reach your **Saved Histories**.
10. Click on **Import history** in the grey button on the top right.
11. Paste in your link's URL from step 7.
12. Click on **Import History**.
13. Allow the archive import process to complete. \*
14. The transferred history will be uncompressed and added to your **Saved Histories**.


\* For steps 6 and 13: It is OK to navigate away for other tasks during processing. If enabled, Galaxy will send you [status notifications]({% link faqs/galaxy/account_update_preference.md %}).


{% icon fa-info-circle %} If the history to transfer is large, you may [copy just your important datasets into a new history]({% link faqs/galaxy/histories_copy_dataset.md %}), and create the archive from that new smaller history. Clearing away deleted and purged datasets will make *all* histories smaller and faster to archive and transfer!
30 changes: 23 additions & 7 deletions faqs/galaxy/reference_genomes_custom_genomes.md
@@ -7,12 +7,28 @@ contributors: [jennaj, Nurzhamalyrys]
---


A reference genome contains the nucleotide sequence of the chromosomes, scaffolds, transcripts, or contigs for single species. It is representative of a specific genome build or release. There are two options to use reference genomes in Galaxy: _native_ (provided by the server administrators and used by most of the tools) and _custom_ (uploaded by users in FASTA format).
A **reference genome** contains the nucleotide sequence of the chromosomes, scaffolds, transcripts, or contigs for a single species. It is representative of a specific genome assembly build or release.

There are five basic steps to use a Custom Reference Genome:

There are two options for reference genomes in Galaxy.
* **Native**
* Index provided by the server administrators.
* Found on tool forms in a drop down menu.
* A database key is automatically assigned. See tip 1.
* The database key is what links your data to a FASTA index, for example when working with BAM data.
* **Custom**
* FASTA file uploaded by users.
* Input on tool forms then indexed at runtime by the tool.
* An optional custom database key can be created and [assigned by the user]({% link faqs/galaxy/datasets_change_dbkey.md %}).

1. Obtain a FASTA copy of the target genome.
2. Use FTP to upload the genome to Galaxy and load into a history as a dataset.
3. Clean up the format with the tool **NormalizeFasta** using the options to wrap sequence lines at 80 bases and to trim the title line at the first whitespace.
4. Make sure the chromosome identifiers are a match for other inputs.
5. Set a tool form's options to use a custom reference genome from the history and select the loaded genome.
There are five basic steps to use a **Custom Reference Genome**, plus one optional step.
1. Obtain a FASTA copy of the target genome. See tip 2.
2. Upload the genome to Galaxy and add it as a dataset in your history.
3. [Clean up the format]({% link faqs/galaxy/datasets_working_with_fasta.md %}) with the tool **NormalizeFasta** using the options to wrap sequence lines at 80 bases and to trim the title line at the first whitespace.
4. Make sure the [chromosome identifiers]({% link faqs/galaxy/datasets_chromosome_identifiers.md %}) are a match for other inputs.
5. Set a tool form's options to use a custom reference genome from the history and select the loaded genome FASTA.
6. (Optional) Create a [custom genome build's database]({% link faqs/galaxy/analysis_add_custom_build.md %}) that you can [assign to datasets]({% link faqs/galaxy/datasets_change_dbkey.md %}).
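
The two clean-up options named in step 3 (wrap sequence lines at 80 bases, trim the title line at the first whitespace) can be illustrated with a short Python sketch. This is not the **NormalizeFasta** tool's implementation, and the function name `normalize_fasta` is an assumption for this example. It only shows what the resulting format should look like.

```python
def normalize_fasta(lines, width=80):
    """Yield FASTA lines with trimmed titles and sequence wrapped at `width`."""
    sequence = []

    def wrapped():
        # Join the collected sequence pieces and cut them into fixed-width lines.
        joined = "".join(sequence)
        for start in range(0, len(joined), width):
            yield joined[start:start + width]

    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # drop blank lines
        if line.startswith(">"):
            # Emit the previous record's sequence before starting a new record.
            yield from wrapped()
            sequence.clear()
            parts = line[1:].split()
            # Keep only the identifier: everything before the first whitespace.
            yield ">" + (parts[0] if parts else "")
        else:
            sequence.append(line)

    yield from wrapped()


# Hypothetical record with a long description line and uneven sequence lines.
example = [">chr1 Homo sapiens chromosome 1\n", "ACGT" * 30 + "\n", "\n", "ACGT\n"]
for out_line in normalize_fasta(example):
    print(out_line)
```

After this kind of normalization the title line carries only the identifier (`>chr1`), which is the value that has to match the chromosome identifiers in your other inputs (step 4).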

{% icon tip %} TIP 1: Avoid [assigning a native database]({% link faqs/galaxy/datasets_change_dbkey.md %}) to uploaded data unless you confirmed the data are based on the [same exact genome assembly]({% link faqs/galaxy/datasets_chromosome_identifiers.md %}) or you [adjusted the data to be a match]({% link topics/introduction/tutorials/data-manipulation-olympics/tutorial.html %}) **first**!

{% icon tip %} TIP 2: When choosing your reference genome, consider [choosing your reference annotation]({% link faqs/galaxy/analysis_differential_expression_help.md %}) at the same time. Standardize the format of both as a preparation step. Put the files in a dedicated "reference data" history for easy reuse.