User Guide Appendix: Uploading Large Nucleic Acid Sequences in GSRS

How to Upload Large Nucleic Acid Sequences in GSRS

Updated 05/09/2022

GSRS supports storing nucleic acid and amino acid sequences for substance records. Sequence data is used within GSRS for a variety of purposes:

- To act as part of the substance definition, where it is a necessary element for validation
- To compute molecular formulas, weights, and other properties
- To expose the sequence for viewing
- To allow sequence alignment searching / similarity searching
- To allow validation checks for sufficiently similar sequences which may be duplicates
- To support FASTA and other sequence format exports

Typically, a sequence is added to the “Subunit” forms in protein and nucleic acid registration, as shown below:

[Figure: Subunit Form Example]

However, when a sequence is very large (more than 10,000 residues), this procedure is inefficient and performs poorly. There is no specific cutoff for how large a sequence can be with this mechanism, but the forms, JSON responses, and interactivity of the software degrade as sequences get larger. As a rule of thumb, we suggest that any sequence longer than 10,000 residues not be entered this way.

Instead, large sequences can be registered by uploading them as a “fasta” file.
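
The rule of thumb above can be expressed as a small check. This is only a sketch: the function name `needs_fasta_upload` and the constant `SUBUNIT_FORM_LIMIT` are illustrative and not part of GSRS, and the 10,000-residue value is this guide's suggestion rather than a hard limit enforced by the software.

```python
# Illustrative helper; not part of GSRS.
SUBUNIT_FORM_LIMIT = 10_000  # rule-of-thumb threshold from this guide, not a hard cutoff


def needs_fasta_upload(sequence: str, limit: int = SUBUNIT_FORM_LIMIT) -> bool:
    """Return True when the sequence is longer than the rule-of-thumb limit."""
    # Ignore whitespace and line breaks so that only residues are counted.
    residues = "".join(sequence.split())
    return len(residues) > limit
```

A sequence of exactly 10,000 residues would still go through the “Subunit” form under this check; anything longer would be routed to the FASTA procedure.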

Large Sequence Procedure

For sequences of significant size, the process is as follows:

1. Prepare a FASTA file for the large sequence(s).
   - A FASTA file is an ASCII text file whose first line starts with the “>” (greater than) character.
   - The second line should contain the full amino acid or nucleic acid sequence.
   - The file extension should typically be “.fas” or “.fasta”, but “.txt” will also work.
   - More information is available here.
2. Mark the record as “incomplete”.
3. Create a new definitional reference:
   - Add a new reference with source type “FDA_SRS” (optional), source citation “FASTA SEQUENCE DOCUMENT” (optional), and Tags set to “fasta” (required).
   - Click “Upload Document” and select the FASTA file prepared in step 1.
   - Click “save”.
4. Add any other information needed (names, codes, references), and save the record.
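
The file preparation step can also be scripted. The sketch below writes a file in the layout described above: a “>” header line followed by the full sequence on the second line, in ASCII. The function name `write_fasta` and the example file name and header text are hypothetical; GSRS does not provide this helper, and the header content is up to you.

```python
from pathlib import Path


def write_fasta(path: str, header: str, sequence: str) -> Path:
    """Write a two-line FASTA file: a '>' header line, then the full sequence."""
    # Remove whitespace and line breaks so the whole sequence sits on one line.
    residues = "".join(sequence.split())
    if not residues:
        raise ValueError("sequence is empty")
    if "\n" in header or "\r" in header:
        raise ValueError("header must be a single line")
    # The guide describes FASTA as ASCII text, so reject anything else early.
    if not (header.isascii() and residues.isascii()):
        raise ValueError("header and sequence must be ASCII")
    out = Path(path)
    with open(out, "w", encoding="ascii", newline="\n") as handle:
        handle.write(f">{header}\n{residues}\n")
    return out


if __name__ == "__main__":
    # Hypothetical example: a 12,000-residue nucleic acid sequence.
    write_fasta("example_subunit.fasta", "example subunit 1", "ACGT" * 3000)
```

The resulting file is the one selected with “Upload Document” in step 3.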

Remarks and Troubleshooting

As of version 3.0.1, this mechanism has some behavioral differences and peculiarities that should be noted.

What’s the same as using “subunits”:

- UI-based sequence search works if a user copies and pastes a sequence into the sequence search box.
- Validation rules which do sequence similarity searches work with this sequence mechanism.
- REST API sequence searches work with this mechanism.

What’s different when using this mechanism:

- Automatic property calculation for molecular weight and formula does NOT currently work when using this mechanism.
- Data exchange mechanisms do NOT currently embed the FASTA file as part of the JSON. The file must be sent separately and re-applied.
- Site selection for modifications, glycosylation, disulfide bridges, link areas, and sugars/linkages does NOT currently support this mechanism. However, conventions can be used for some of these cases.
- The UI does NOT currently render the sequence, and the file must be downloaded.
- Duplicate checks based on the definitional hash do not currently use sequences stored this way.
- Exports which use the sequence as part of a spreadsheet do not export these sequences.
- The UI buttons for searching from the browse view do not allow selecting subunits that are stored in FASTA file documents.
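
Because the UI does not render these sequences, a downloaded file has to be inspected outside GSRS. The following is a minimal sketch of one way to do that, assuming the file follows the layout described in this guide (a “>” header line followed by sequence lines). The function names `read_fasta` and `residue_counts` are illustrative and are not GSRS functions.

```python
from collections import Counter
from typing import List, Optional, Tuple


def read_fasta(text: str) -> List[Tuple[str, str]]:
    """Parse FASTA text into (header, sequence) pairs."""
    records: List[Tuple[str, str]] = []
    header: Optional[str] = None
    chunks: List[str] = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is None:
            raise ValueError("FASTA text must start with a '>' header line")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def residue_counts(sequence: str) -> Counter:
    """Count each residue letter, ignoring case."""
    return Counter(sequence.upper())
```

For a downloaded file, `read_fasta(open(path, encoding="ascii").read())` would return one pair per sequence; `len(sequence)` gives the residue count and `residue_counts(sequence)` gives the composition.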

Known bugs in this approach:

- When the “async” configuration for indexing is used (the default configuration), sequences are NOT indexed on edit or creation and MUST be reindexed through the backend reindexing process. To use this feature, the synchronous indexing configuration is currently recommended instead.
- Replacing an uploaded document in the UI using the “replace” mechanism does not work. You must delete the document and add it again.
- When you update a document, you must also save the record itself. Changing the document alone does not trigger an update unless a reindex is run.
