gzip external process for insert archiving is broken #37

Open
2 tasks
sheridancbio opened this issue Aug 15, 2016 · 1 comment


@sheridancbio

command : java -jar -Xmx10g pdb-alignment-pipeline/target/pdb-alignment-pipeline-0.1.0.jar init

This was run with a batch increment size of 10000 and an Ensembl sequence database size of 50000.

  • First, the cleanup steps that gzip the insert scripts fail when a gzip file already exists in the current directory. This should be fixed so that the steps succeed (or are not run) when the user issues the "init" command on the java command line.
  • There is a loop for reading the error stream. This loop has several problems:
      (1) The standard error stream is a stream of characters, but the code reads these characters as integer values and outputs them with tab characters between them. This destroys the meaning of the character stream.
      (2) The loop condition is set by calling InputStream.available(), which only returns an estimate of how many bytes can currently be read without blocking, i.e. what is already sitting in the memory buffer. Why should we print only that very small buffer? We should print the entire contents of the error stream. read() returns -1 when the stream reaches EOF, so that is the condition to loop on.
      (3) The process's standard output and standard error streams should be read no matter what, not just when there is an error. We must drain both streams so that the buffers do not fill up and cause the program to hang because the next write operation to the standard stream cannot complete due to a full buffer. Use a thread which reads an input stream and collects the output into a String. I will send you example code.
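
Until that example code arrives, here is a minimal sketch of the thread-per-stream idea from point (3). The names StreamCollector and CommandRunner are hypothetical and are not taken from the pipeline's CommandProcessUtil; the sketch also assumes the process output is UTF-8 text.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: drains one stream of a child process on its own thread. */
class StreamCollector extends Thread {
    private final InputStream in;
    private final ByteArrayOutputStream collected = new ByteArrayOutputStream();

    StreamCollector(InputStream in) {
        this.in = in;
    }

    @Override
    public void run() {
        byte[] buffer = new byte[4096];
        int n;
        try {
            // Loop until read() reports EOF (-1), not until available() == 0.
            while ((n = in.read(buffer)) != -1) {
                collected.write(buffer, 0, n);
            }
        } catch (IOException e) {
            // The stream was closed early; keep whatever was collected so far.
        }
    }

    /** Call after the thread has finished (for example after join()). */
    String text() {
        // Assumption: the process writes UTF-8 text.
        return new String(collected.toByteArray(), StandardCharsets.UTF_8);
    }
}

/** Hypothetical runner: always drains stdout and stderr, whether or not the command fails. */
class CommandRunner {
    /** Returns { exit code, stdout text, stderr text }. */
    static String[] run(String... command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).start();
        StreamCollector out = new StreamCollector(process.getInputStream());
        StreamCollector err = new StreamCollector(process.getErrorStream());
        out.start();
        err.start();
        int exitCode = process.waitFor();
        out.join();
        err.join();
        return new String[] { String.valueOf(exitCode), out.text(), err.text() };
    }
}
```

Because both collectors start before waitFor(), the child process can never block on a full pipe buffer, and the complete stderr text is available for a single log line afterwards.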

Here are examples of the stderr output stream:

As the code is currently written:
2016-08-15 14:23:23 ERROR CommandProcessUtil:33 - [Process] Error: 103 122 105 112 58 32 47 85 115 101 114 115 47 115 104 101 114 105 100 97 114 47 114 101 112 111 115 47 115 104 101 114 105 100 97 110 99 98 105 111 47 112 100 98 45 97 110 110 111 116 97 116 105 111 110 47 103 115 111 99

After changing to output characters, not integers:
2016-08-15 14:53:59 ERROR CommandProcessUtil:33 - [Process] Error: g z i p : / U s e r s / s c

After recoding the loop, so that the reading continues until end of file, and getting rid of tab characters between each letter:
2016-08-15 15:17:19 ERROR CommandProcessUtil:36 - [Process] Error: gzip: /Users/sheridar/repos/sheridancbio/pdb-annotation/gsoc_3d_testing/insert.sql.0.gz already exists; not overwritten
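
The difference between the first and the last output above comes down to how each value returned by read() is appended. A small sketch follows; the class and method names are hypothetical, and the cast to char is only safe under the assumption that the message is plain ASCII, as it is here.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Hypothetical illustration of the two ways the error stream can be rendered. */
class ErrorStreamFormatting {

    /** Mirrors the reported behavior: every byte is appended as a number, separated by tabs. */
    static String asIntegers(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try {
            int b;
            while ((b = in.read()) != -1) {
                if (sb.length() > 0) {
                    sb.append('\t');
                }
                sb.append(b); // appends the decimal value, e.g. "103" for 'g'
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    /** Reads to EOF and appends each byte as a character (assumes ASCII content). */
    static String asText(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try {
            int b;
            while ((b = in.read()) != -1) {
                sb.append((char) b);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }
}
```

Feeding the bytes of "gzip:" to asIntegers produces the same leading values as the first log line (103, 122, 105, 112, 58), while asText returns the readable string.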

@sheridancbio

One approach to solving this would be to add the "-f" flag to the gzip command. This causes gzip to overwrite any file that already exists. Another approach would be to detect that the target file already exists, print a warning message to the user (like: "overwriting file insert.sql.0.gz"), and then either overwrite it with -f or first delete the file with File.delete().
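
Both approaches could look roughly like the sketch below. GzipArchiver and its methods are hypothetical names, and the sketch assumes the archive name is simply the script path with ".gz" appended, which matches gzip's default behavior.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper showing the two suggested fixes. */
class GzipArchiver {

    /** Approach 1: build the gzip command, optionally with -f so existing archives are overwritten. */
    static List<String> buildGzipCommand(File script, boolean force) {
        List<String> command = new ArrayList<>();
        command.add("gzip");
        if (force) {
            command.add("-f");
        }
        command.add(script.getPath());
        return command;
    }

    /**
     * Approach 2: warn about and delete a leftover archive before gzip runs.
     * Returns true if a stale archive was found and removed.
     */
    static boolean removeStaleArchive(File script) {
        // Assumption: gzip writes its output next to the input as <name>.gz.
        File archive = new File(script.getPath() + ".gz");
        if (!archive.exists()) {
            return false;
        }
        System.err.println("overwriting file " + archive.getName());
        if (!archive.delete()) {
            throw new IllegalStateException("could not delete " + archive.getPath());
        }
        return true;
    }
}
```

The list returned by buildGzipCommand can be passed directly to ProcessBuilder. Deleting first has the advantage that the user sees an explicit warning, whereas -f overwrites silently.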
