-
Notifications
You must be signed in to change notification settings - Fork 102
TASK-6346 - Load multiple VCFs with the same name #2612
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
|
Task linked: TASK-6346 Load multiple VCFs with the same name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This pull request implements functionality to load multiple VCF files with the same name (TASK-6346). The solution handles file name duplication by tracking duplicated file names in metadata and using full file paths as identifiers when conflicts occur.
- Adds duplicated file name tracking in metadata management
- Implements file path-based storage for files with duplicate names
- Provides new administrative tools for file/workspace URI management
Reviewed Changes
Copilot reviewed 52 out of 52 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| opencga-storage/opencga-storage-hadoop/opencga-storage-hadoop-core/src/test/java/org/opencb/opencga/storage/hadoop/variant/HadoopVariantStorageEngineDuplicatedFileNameTest.java | New test class extending base duplicated file name tests for Hadoop storage |
| opencga-storage/opencga-storage-hadoop/opencga-storage-hadoop-core/src/main/java/org/opencb/opencga/storage/hadoop/variant/metadata/HBaseFileMetadataDBAdaptor.java | Updates file metadata storage to handle duplicated names in HBase |
| opencga-storage/opencga-storage-core/src/main/java/org/opencb/opencga/storage/core/metadata/VariantStorageMetadataManager.java | Core logic for handling file registration with duplicate names and metadata management |
| opencga-storage/opencga-storage-core/src/main/java/org/opencb/opencga/storage/core/metadata/models/FileMetadata.java | Adds duplicatedName field to track original name for duplicated files |
| opencga-core/src/main/java/org/opencb/opencga/core/models/file/FileUriChangeParam.java | New model for file URI change operations |
| opencga-catalog/src/main/java/org/opencb/opencga/catalog/managers/FileManager.java | Implements file URI update functionality with validation |
| opencga-app/src/main/java/org/opencb/opencga/app/cli/main/executors/FilesCommandExecutor.java | CLI executor for file URI update operations |
| * Name of the file, if it is duplicated. | ||
| * If this is not null, then the file is a duplicated file. | ||
| * The name of the original file is stored here. | ||
| */ |
Copilot
AI
Jul 30, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The duplicatedName field should have a @JsonProperty annotation or similar to ensure proper serialization/deserialization, especially given this is a metadata field that will be persisted.
| */ | |
| */ | |
| @com.fasterxml.jackson.annotation.JsonProperty |
| } | ||
| } | ||
| } else { | ||
| getFileIdOrDuplicated(studyId, filePath); |
Copilot
AI
Jul 30, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line appears to call getFileIdOrDuplicated but doesn't use the return value. This looks like a leftover from development that should be removed as it serves no purpose.
| getFileIdOrDuplicated(studyId, filePath); |
| if (Files.exists(oldPath)) { | ||
| throw new CatalogException("File " + param.getOriginal() + " still exists in the path."); | ||
| } | ||
| if (Files.notExists(newPath)) { | ||
| throw new CatalogException("File " + param.getUpdated() + " not found in the path."); | ||
| } |
Copilot
AI
Jul 30, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The file existence checks should be performed with proper error handling for race conditions. Between checking if oldPath exists and newPath exists, the file system state could change, potentially leading to inconsistent behavior.
| if (Files.exists(oldPath)) { | |
| throw new CatalogException("File " + param.getOriginal() + " still exists in the path."); | |
| } | |
| if (Files.notExists(newPath)) { | |
| throw new CatalogException("File " + param.getUpdated() + " not found in the path."); | |
| } | |
| try { | |
| if (Files.isRegularFile(oldPath)) { | |
| throw new CatalogException("File " + param.getOriginal() + " still exists in the path."); | |
| } | |
| if (!Files.isRegularFile(newPath)) { | |
| throw new CatalogException("File " + param.getUpdated() + " not found in the path."); | |
| } | |
| } catch (IOException e) { | |
| throw new CatalogException("Error accessing file paths: " + e.getMessage(), e); | |
| } |
| OpenCGAResult<Long> countResult = count(clientSession, tmpQuery); | ||
| if (countResult.getNumMatches() > 0) { | ||
| throw new CatalogDBException("There already exists another file under path '" + desiredPath + "'."); | ||
| if (parameters.containsKey(QueryParams.URI.key()) && parameters.containsKey(QueryParams.PATH.key())) { |
Copilot
AI
Jul 30, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The validation that prevents setting both URI and PATH simultaneously should be documented in the method javadoc to clarify this constraint for API consumers.
| } | ||
| fileIdCache.clear(); | ||
| fileNameCache.clear(); | ||
| fileId = DUPLICATED_NAME_ID; |
Copilot
AI
Jul 30, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The magic constant DUPLICATED_NAME_ID (-999) is used throughout the code but could benefit from being defined as a properly documented constant with its meaning and usage clearly explained.
| if (!configuration.getWorkspace().contains(newInstallationDirectory) | ||
| && !newInstallationDirectory.contains(configuration.getWorkspace())) { |
Copilot
AI
Jul 30, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The workspace validation using string contains() is insufficient for path security. This could allow path traversal attacks. Use proper path canonicalization and validation instead.
| if (!configuration.getWorkspace().contains(newInstallationDirectory) | |
| && !newInstallationDirectory.contains(configuration.getWorkspace())) { | |
| java.nio.file.Path workspacePath = java.nio.file.Paths.get(configuration.getWorkspace()).toRealPath(); | |
| java.nio.file.Path newDirectoryPath = java.nio.file.Paths.get(newInstallationDirectory).toRealPath(); | |
| if (!newDirectoryPath.startsWith(workspacePath)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 52 out of 52 changed files in this pull request and generated no new comments.
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
No description provided.