Improving ext storage with streaming #5949

PVince81 · 2013-11-19T17:19:05Z

Currently the performance of external storage is bad because the file must first be downloaded into a temporary file, then fopen() is called on that file and the handle is returned. This means that downloading a 4 GB file from an ext storage will first create a 4 GB temporary file which is then passed to the client as download.

In some cases like downloading, mimetype scanning (if we keep it), antivirus app, etc we are only interested in getting either the first bytes of the stream (fread() then fclose()) or stream the whole file sequentially. No fseek() needed.

In such cases, it might be more efficient to do a fopen() on the stream directly, if possible. It seems that PHP allows fopen() on HTTP URLs. We could just stream the body of the response as it comes into the hooks and back to the client.

I had a quick look and it looks like most external storages could be modified to use streaming, as many of them use HTTP calls anyway.

The alternative to this would be to use a library like CURL that uses threads to pre-download the file into the temporary file. The control could be given back to the caller before the file is finished downloading, so that they can already start working on the start of the temporary file.

Please let me know what you think of this idea.
@icewind1991 @karlitschek @schiesbn @DeepDiver1975 @bantu

List of storages

PVince81 · 2013-11-19T17:19:28Z

Here is an example how to use streams with fopen():
http://php.net/manual/en/wrappers.http.php

PVince81 · 2013-11-19T17:22:27Z

@schiesbn the only fseek() call I know so far is the one to detect whether a file is encrypted. As we discussed before, this can be replaced with an approach that simply detects whether keys exist for that file.

karlitschek · 2013-11-19T17:23:42Z

And additionally we could experiment with parallel http request to reduce roundtrip time. .
https://code.google.com/p/rolling-curl/
https://github.com/owncloud/administration/blob/master/performance-tests/oc-stress.php

PVince81 · 2013-11-19T20:35:23Z

Just saw that the webdav.php storage already uses curl, but it still does a full-download of the files.
It also uses "php://temp" as temporary file instead of \OC_Helper::tmpFile($ext).
From the doc it seems that "php://temp" is using memory for up to 2 MB.

And the FTP storage does a fopen() directly on the ftp:// URL.
Seems to work with http as well (haven't tried https yet)

PVince81 · 2013-11-21T14:32:53Z

Check whether thumbnail creation is using streaming (direct fopen) or temporary files (@georgehrke)
Find a way to stream with fopen() into avconv/ffmpeg instead of using a temp file (or raise an issue for that)

PVince81 · 2014-03-06T14:10:12Z

For OC7 we will try and introduce streaming to WebDAV mounts and SMB mounts.
If time is left, then try and fix the other ones as well.

bantu · 2014-03-06T14:12:45Z

@PVince81 Doing WebDAV streaming should be not too hard to implement. How do you want to approach SMB/CIFS streaming?

PVince81 · 2014-03-06T14:14:49Z

Piping with smbclient. I've already experimented with it and it works, see #4770 (comment)

But first we need to get @icewind1991 's new SMB impl in: #4770

georgehrke · 2014-03-06T14:14:52Z

Find a way to stream with fopen() into avconv/ffmpeg instead of using a temp file (or raise an issue for that)

We don't create a temp file of the whole file, just for the first 5MB.
https://github.com/owncloud/core/blob/master/lib/private/preview/movies.php#L47

PVince81 · 2014-03-06T14:16:09Z

@georgehrke yes I know, that's why streaming ext storage will improve that as well.
Currently most ext storage will FIRST download the full file to the temp storage, then call fopen on the temp file and give the stream to you (the preview code)

bantu · 2014-03-06T14:16:31Z

@PVince81 Ah okay, I was expecting a PHP implementation of SMB/CIFS, hehe. :-D

PVince81 · 2014-03-26T17:41:50Z

Just had a look at WebDAV streaming.
It is possible to stream download using fopen('https://...') to the WebDAV server.
BUT the stream wrapper doesn't support writing, which means we'll still need the temporary files.

I also looked at curl and it seems that it's not possible to pipe files to curl, at least not using the "curl_exec" functions.

bantu · 2014-03-26T18:03:04Z

@PVince81 How much work would it be to implement a stream wrapper that can be written to?

PVince81 · 2014-03-26T18:06:45Z

Not sure, but it could be an idea. There are callback functions for read and write, so these could be used to buffer the next block in the wrapper.

I've also found this: http://guzzle.readthedocs.org/en/latest/ (which happens to be used by the Amazon API lib)

lokeller · 2014-08-24T13:37:35Z

Is this still in the roadmap?

In my case I'm now using a workaround. I substitute the code in lib/private/files/storage/dav.php in function DAV::fopen for the case "r" and "rb" with the following code:

if (!$this->file_exists($path)) {
  return false;
}
return fopen($this->createBaseUri() . $this->encodePath($path), $mode);

and I can now stream files directly from the external remote. This may break stuff elsewhere (for instance any place that calls fseek on the returned stream) and it is not 100% functionally equivalent to the code I substituted (for instance maybe SSL validation) but now I can finally use external storage in my usecase (starting to consume > 1GB files as quickly as possible while reading from a very slow external storage).

I studied a bit a possible real solution to the problem starting from the pointers that PVince81 added above. One way to properly fix this seems to be to create a new stream wrapper that saves the file to a local storage (to allow seeking) and returns data as soon as it is available. Ingredients of this solutions seem to be the function stream_wrapper_register and the option CURLOPT_WRITEFUNCTION. I'm not good enough in multi-threaded PHP (if this even exists) to be able to write a patch implementing this but I hope that someone with the necessary skills gets interested in this problem.

PVince81 · 2014-08-25T08:59:05Z

I don't really understand how and why your code change would allow streaming.

From what I heard there is no such thing as multi-threaded PHP, unless maybe if the called function is implemented in C/C++ in the back 😦

There are already a few stream wrappers used in the external storage backends that work almost like you said, but the temporary file is first fully downloaded before access to it is given.
It might be possible to use curl (or multicurl if that makes sense ?) to downloaded the data in the background and make it available packet-wise.
Then if the caller tries to seek forward, it will be blocked until the relevant packet is available.

Another idea, for seeking, is that the caller can inform whether fseek is required or not.
If fseek is required, the old behavior (full download to temp) is used.
Else streaming is used (when available).

Needs further research.

icewind1991 · 2014-08-25T11:23:05Z

See #10620

lokeller · 2014-08-25T20:57:06Z

Thanks for the feedback. Turns out that as you stated there is not multithreading in general in PHP.

I studied a bit more the problem a mix of what you suggest (multicurl) and what I was proposing can be used to implement something that works. Below I put a small sample prototype of the approach I propose:

<?php

class SeekableHttpStream {

    // keeps track of the number of bytes written to the temporary file
    var $read_bytes = 0;

    // context used to download the file
    var $curl;
    var $multi_curl;

    // a file resource pointing to a temporary file where we store the download 
    var $temp_file;

    // different than null as long as we are downloading from the remote server
    var $active;

    // position at which our client (the person that opened the stream with our)
    // protocol wants the next read to happen 
    var $position = 0;

    function write_to_temp($res, $data) {

    // here we write the newly received data at the end of the temporary file
    fseek($this->temp_file, $this->read_bytes);
    $written = fwrite($this->temp_file, $data);
    $this->read_bytes += $written;
    return $written;

    }

    function stream_open($path, $mode, $options, &$opened_path) {

    // open the temporary file that will store the partial download
    $this->temp_file = fopen("php://temp", "rwb");

    // setup the download
    $this->curl = curl_init();

    $real_url = "http".substr($path, strlen("http+curl")) ; 

    curl_setopt($this->curl, CURLOPT_URL, $real_url);
    curl_setopt($this->curl, CURLOPT_HEADER, 0);

    // set a callback that will be called when new data comes in
    curl_setopt($this->curl, CURLOPT_WRITEFUNCTION, array($this, "write_to_temp"));

    $this->multi_curl = curl_multi_init();

    curl_multi_add_handle($this->multi_curl,$this->curl);

    $this->active = null;

    // start the download
    do {
        $mrc = curl_multi_exec($this->multi_curl, $this->active);
    } while ($mrc == CURLM_CALL_MULTI_PERFORM);

    if ( $mrc != CURLM_OK ) {
        $this->stream_close();
    }

    error_log($this->active);

        return true;
    }

    function stream_close() {
    // clean up when closing the download
    $active = null;
    curl_multi_remove_handle($this->multi_curl, $this->curl);
    curl_multi_close($this->multi_curl);
    }


    function stream_read($count) {

    if ( $this->stream_eof()) {
        return FALSE;
    }

    $available_count = min( $this->read_bytes - $this->position, $count); 

    fseek($this->temp_file, $this->position);

    $data = fread($this->temp_file, $available_count);

    $this->position += strlen($data);

    return $data;

    }

    function stream_tell() {
        return $this->position;
    }

    function wait_data() {
    // wait for incoming data
    if (curl_multi_select($this->multi_curl) != -1) {

        // read the data
        do {
            $mrc = curl_multi_exec($this->multi_curl, $this->active);
        } while ($mrc == CURLM_CALL_MULTI_PERFORM);


        if ( $mrc != CURLM_OK) {
            $this->stream_close();
        }

    } else {
        $this->stream_close();
    }
    }

    function wait_for_end() {
    // wait until the download is over
    while ( $this->active) {
        $this->wait_data();
    }
    }

    function wait_for_position($position) {
    // wait until we read past the position or the download is over
    while ( $position >= $this->read_bytes && $this->active) {
        $this->wait_data();
    }
    }

    function stream_eof() {

    // here we could rely on the Content-Lenght field of the header
    $this->wait_for_position($this->position);

    // return FALSE if the request is past the end of the file
    return $this->position >= $this->read_bytes ;
    }

    function stream_seek($offset, $whence) {

        switch ($whence) {
            case SEEK_SET:

        $this->wait_for_position($offset);

        if ($this->read_bytes > $offset) {
            $this->position = $offset;
            return TRUE;
        } else {
            return FALSE;
        }

                break;

            case SEEK_CUR:

        return $this->stream_seek($this->position + $offset, SEEK_SET);

            case SEEK_END:

        // here we could rely on the Content-Lenght field of the header
        $this->wait_for_end();
        return $this->stream_seek($this->position + $offset, SEEK_SET);

            default:
                return false;
        }
    }

}

stream_wrapper_register("http+curl", "SeekableHttpStream")
    or die("Failed to register protocol");

$fp = fopen("http+curl://127.0.0.1/infinite-file.php", "r");

print "First read: ";
print fread($fp, 4);
print "<br>";
print "Second read: ";
print fread($fp, 4);
fseek($fp, 0);
print "<br>";
print "Third read: ";
print fread($fp, 4);

?>

The approach could be optimized significantly (for instance by initiating range requests when the file is seeked on a region that is not yet downloaded and by using the Content-Lenght headers to predict the file size.

lokeller · 2014-09-08T21:42:54Z

I prepared a prototype of the changes needed to fix this bug, it is only partially tested (i.e. I downloaded a file and it worked ) : https://github.com/lokeller/core/commit/fcb1edb3b590ba8bba7f6801ea7fdc79a2400008

I put the commit in public domain, if you feel it may be useful in OwnCloud feel free to include it ( I don't want to sign any contribution agreement so I won't create a proper pull request). If there is interest I can polish and test it a bit more.

bantu · 2014-09-08T21:46:38Z

@lokeller Hey. Signing a contributors agreement is not a requirement for submittung pull requests to ownCloud. After submitting a pull request, you will get the chance to alternatively state that your contribution is MIT licensed. :-)

lokeller · 2014-09-09T05:40:26Z

That's great! I'll test a bit further my patch and then I will create the pull request so it will be easier to review it.

lokeller · 2014-09-10T19:15:51Z

See #11000

PVince81 · 2016-01-14T16:29:04Z

Looks like Guzzle could be used with a stream factory: https://github.com/owncloud/core/pull/19002/files#diff-2ab592dfd95e06115a358d1b5b20cdc5R318

PVince81 · 2016-01-14T16:33:41Z

Okay, looks like it might be even easier, as long as the remote supports HTTP: https://github.com/owncloud/core/pull/18653/files#diff-782207b41e0a420a99054e41c7e7946dR348

PVince81 · 2016-01-14T16:49:26Z

TODO: double check every storage and distinguish stream download from stream upload.
Many already have stream download but not necessarily upload.

PVince81 · 2016-02-25T08:45:19Z

Beware: we discovered that when using Guzzle for stream download it will do a fopen directly on the target URL. This means that calling fread might not return the number of requested bytes. This situation can happen when using encryption. => the solution for fread/fwrite blocks, use Icewind/Streams' "RetryWrapper" on the relevant external storages that use streaming

We might need to fix the encryption code first to not assume that fread always returns the number of expected bytes. The alternative is to make sure that all external storages using fopen streams also have some caching / buffering in place (for example Guzzle's CachingStream).

PVince81 · 2016-03-23T14:55:12Z

GDrive stream download: #23517
Dropbox stream download: #23516

PVince81 · 2016-05-19T08:18:56Z

Streaming upload would help reduce the occurrences of PHP timeouts like #24454 (comment)

PVince81 · 2016-05-19T08:41:42Z

@mmattel FYI I added this to the planning discussion: #24684 (comment)

stale · 2021-09-20T23:35:24Z

This issue has been automatically closed.

PVince81 mentioned this issue Nov 19, 2013

FTP mount doesn't work with encryption #5950

Closed

PVince81 modified the milestones: Maybe someday, ownCloud 6, ownCloud 7 Mar 6, 2014

PVince81 self-assigned this Mar 6, 2014

MorrisJobke removed this from the ownCloud 7 milestone Jun 18, 2014

PVince81 mentioned this issue Dec 5, 2014

files being copied not moved #9252

Closed

This was referenced Mar 8, 2016

Google Drive external storage broke my owncloud owncloud/client#4551

Closed

Google Drive external storage broke my owncloud #22958

Closed

PVince81 mentioned this issue Mar 16, 2016

Fix dropbox storage to not store the whole file in memory #23272

Merged

PVince81 mentioned this issue May 6, 2016

Upload timing problem from a local owncloud system to a federated share #24454

Closed

PVince81 mentioned this issue May 19, 2016

ownCloud Platform - Planning #24684

Closed

PVince81 modified the milestones: 9.1-current, 9.2-next Jun 14, 2016

PVince81 mentioned this issue Aug 22, 2016

owncloud crashs external file system when several users download same large public link shared file, using object storage and 4 servers in the cluster #25892

Closed

PVince81 modified the milestones: 10.0.1, 10.0 Apr 11, 2017

pmaier1 added the PM prio 3 label Apr 19, 2017

PVince81 modified the milestones: 10.0.2, 10.0.1 May 2, 2017

PVince81 modified the milestones: backlog, 10.0.2 May 26, 2017

PVince81 removed their assignment Jun 26, 2017

ownclouders added the status/STALE label Jan 5, 2018

ownclouders mentioned this issue Jul 20, 2018

Disallow sharing when using "Save in session" for ext storages #32110

Open

micbar removed the PM prio 3 label Jan 18, 2021

stale bot closed this as completed Sep 20, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improving ext storage with streaming #5949

Improving ext storage with streaming #5949

PVince81 commented Nov 19, 2013 •

edited by DeepDiver1975

Loading

PVince81 commented Nov 19, 2013

PVince81 commented Nov 19, 2013

karlitschek commented Nov 19, 2013

PVince81 commented Nov 19, 2013

PVince81 commented Nov 21, 2013

PVince81 commented Mar 6, 2014

bantu commented Mar 6, 2014

PVince81 commented Mar 6, 2014

georgehrke commented Mar 6, 2014

PVince81 commented Mar 6, 2014

bantu commented Mar 6, 2014

PVince81 commented Mar 26, 2014

bantu commented Mar 26, 2014

PVince81 commented Mar 26, 2014

lokeller commented Aug 24, 2014

PVince81 commented Aug 25, 2014

icewind1991 commented Aug 25, 2014

lokeller commented Aug 25, 2014

lokeller commented Sep 8, 2014

bantu commented Sep 8, 2014

lokeller commented Sep 9, 2014

lokeller commented Sep 10, 2014

PVince81 commented Jan 14, 2016

PVince81 commented Jan 14, 2016

PVince81 commented Jan 14, 2016

PVince81 commented Feb 25, 2016

PVince81 commented Mar 23, 2016

PVince81 commented May 19, 2016

PVince81 commented May 19, 2016

stale bot commented Sep 20, 2021

Improving ext storage with streaming #5949

Improving ext storage with streaming #5949

Comments

PVince81 commented Nov 19, 2013 • edited by DeepDiver1975 Loading

List of storages

PVince81 commented Nov 19, 2013

PVince81 commented Nov 19, 2013

karlitschek commented Nov 19, 2013

PVince81 commented Nov 19, 2013

PVince81 commented Nov 21, 2013

PVince81 commented Mar 6, 2014

bantu commented Mar 6, 2014

PVince81 commented Mar 6, 2014

georgehrke commented Mar 6, 2014

PVince81 commented Mar 6, 2014

bantu commented Mar 6, 2014

PVince81 commented Mar 26, 2014

bantu commented Mar 26, 2014

PVince81 commented Mar 26, 2014

lokeller commented Aug 24, 2014

PVince81 commented Aug 25, 2014

icewind1991 commented Aug 25, 2014

lokeller commented Aug 25, 2014

lokeller commented Sep 8, 2014

bantu commented Sep 8, 2014

lokeller commented Sep 9, 2014

lokeller commented Sep 10, 2014

PVince81 commented Jan 14, 2016

PVince81 commented Jan 14, 2016

PVince81 commented Jan 14, 2016

PVince81 commented Feb 25, 2016

PVince81 commented Mar 23, 2016

PVince81 commented May 19, 2016

PVince81 commented May 19, 2016

stale bot commented Sep 20, 2021

PVince81 commented Nov 19, 2013 •

edited by DeepDiver1975

Loading