
Concurrent Runs / Run while data files are being uploaded #58

Open
GopiGugan opened this issue Jun 2, 2022 · 5 comments
@GopiGugan (Collaborator)

  1. The autoprocess.py script sometimes runs twice, producing duplicate *.mapped.csv and *.coverage.csv files. This happens because a second instance of autoprocess.py starts before the first one terminates (one possible guard is sketched after the traceback below).
  2. The script starts while data files are still being uploaded, resulting in the following error:
ERROR: Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/cutadapt-1.18-py3.6-linux-x86_64.egg/cutadapt/pipeline.py", line 399, in reader_process
    for chunk_index, (chunk1, chunk2) in enumerate(read_paired_chunks(f, f2, buffer_size)):
  File "/usr/local/lib/python3.6/dist-packages/cutadapt-1.18-py3.6-linux-x86_64.egg/cutadapt/seqio.py", line 890, in read_paired_chunks
    bufend2 = f2.readinto(memoryview(buf2)[start2:]) + start2
  File "/usr/lib/python3.6/gzip.py", line 276, in read
    return self._buffer.read(size)
  File "/usr/lib/python3.6/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/usr/lib/python3.6/gzip.py", line 482, in read
    raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached
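
For the first point, a minimal sketch of one way to keep a second instance from starting while the first is still running, using an exclusive lock file (the lock path and helper name are hypothetical, not part of autoprocess.py):

```python
import fcntl
import sys

LOCK_PATH = "/tmp/autoprocess.lock"  # hypothetical location; any fixed writable path works

def acquire_lock():
    """Return an open handle holding an exclusive lock, or None if another instance has it."""
    handle = open(LOCK_PATH, "w")
    try:
        # Non-blocking exclusive lock: raises immediately if another
        # autoprocess.py instance already holds the lock.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return handle
    except BlockingIOError:
        handle.close()
        return None

if __name__ == "__main__":
    lock = acquire_lock()
    if lock is None:
        print("Another instance of autoprocess.py is still running; exiting.")
        sys.exit(0)
    # ... run the pipeline; the lock is released when this process exits.
```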
@ArtPoon (Contributor) commented Jun 7, 2022

Is it possible to use file modification dates to check whether a new upload has started after the script was initialized?
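
One way to read this, sketched under the assumption that a file whose modification time is still changing is part of an upload in progress (the helper name and 60-second window are hypothetical):

```python
import os
import time

def recently_modified(path, window=60):
    """Return True if the file changed within the last `window` seconds,
    suggesting an upload may still be in progress."""
    return (time.time() - os.path.getmtime(path)) < window

# Example: leave recently modified FASTQ files for the next run.
# for path in fastq_files:
#     if recently_modified(path):
#         continue
```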

@ArtPoon (Contributor) commented Oct 19, 2022

We still need a solution for cases where the pipeline is run while data files are being uploaded.

@ArtPoon (Contributor) commented Oct 19, 2022

Would a possible fix be to skip fastq.gz files that are incomplete?
There is also the edge case where an R1 file is present but the corresponding R2 file is absent (a sketch of both checks follows).
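
A sketch of both checks, assuming a truncated upload shows up as an incomplete gzip stream and that mates follow the usual _R1_/_R2_ naming (both are assumptions, not confirmed for this pipeline):

```python
import gzip
import os

def gzip_is_complete(path):
    """Return False if the gzip stream is truncated, e.g. a partial upload."""
    try:
        with gzip.open(path, "rb") as handle:
            while handle.read(1 << 20):  # decompress to the end in 1 MB chunks
                pass
        return True
    except (EOFError, OSError):
        return False

def has_mate(r1_path):
    """Check that the matching R2 file exists (assumes _R1_/_R2_ naming)."""
    r2_path = r1_path.replace("_R1_", "_R2_")
    return r2_path != r1_path and os.path.exists(r2_path)

# Example: only queue read pairs that are complete and fully uploaded.
# if has_mate(r1) and gzip_is_complete(r1) and gzip_is_complete(r1.replace("_R1_", "_R2_")):
#     queue_for_processing(r1)
```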

@ArtPoon (Contributor) commented Mar 7, 2023

There is still no solution for the case where the pipeline is run while a user is uploading new data.

@ArtPoon (Contributor) commented Apr 9, 2024

  • The remaining issue is that if a lab uploads new data to the server while the pipeline is running, the pipeline will terminate when it attempts to read an incomplete file (a partial upload).
  • Can Python detect when a file is being written to by another process? If the pipeline encounters a file that is locked for writing by another process, it should delay reading it; if the file is still locked after a reasonable delay (10 minutes?), the pipeline should exit with an error.
  • @GopiGugan described another approach: use the run summary file (the manifest of FASTQ files in the run) to determine which files to look for; the pipeline would catch the exception raised for an incomplete file and not write that file to the database, so that when the pipeline runs again, the file would be flagged as new for processing (see the sketch after this list).
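
A sketch of the delay-then-fail idea; since POSIX does not portably expose whether another process has a file open for writing, "still being written" is approximated here by a changing size/mtime. The 10-minute limit comes from the comment above; the helper names and polling interval are hypothetical:

```python
import os
import time

def wait_until_stable(path, poll=30, timeout=600):
    """Block until the file's size and mtime stop changing; raise after `timeout` seconds."""
    deadline = time.time() + timeout
    last = None
    while time.time() < deadline:
        stat = os.stat(path)
        current = (stat.st_size, stat.st_mtime)
        if current == last:
            return  # unchanged over one polling interval: assume the upload finished
        last = current
        time.sleep(poll)
    raise RuntimeError(f"{path} still changing after {timeout} s; aborting this run")

# Manifest-driven variant: catch the failure, skip the file so nothing is written
# to the database, and let the next pipeline run pick it up as new.
# try:
#     wait_until_stable(path)
#     process(path)
# except (RuntimeError, EOFError):
#     pass  # leave this file for the next run
```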
