
Chunked image encoding #793

Open
cklein05 opened this issue Jul 31, 2017 · 27 comments
@cklein05 commented Jul 31, 2017

As you may know, we are implementing a fully featured WMS server based on Node Mapnik. It turns out that most of Mapnik's incredible rendering speed, compared to many other WMS servers, has little effect over slow network connections. For example, many typical map requests are served about 1.5 to 2.5 times faster than by GeoServer on a local 1 GBit/s network (not to mention localhost requests). However, this factor tends to fall to 1.0 or even 0.8 on typical Internet links.

One reason may be the missing ability to implement chunked HTTP Transfer-Encoding as GeoServer does. Here, the final map image is encoded to, for example, PNG format in chunks. Each chunk is sent down the wire as soon as it is available. With Node Mapnik, the whole rendered image is encoded into one single chunk by one of the save_to_string methods. So, with GeoServer, there is a level of parallelism by overlapping image encoding and sending, which actually brings some extra speed compared to our implementation.

So, I'm asking for a new encodeChunked method for the Node Mapnik Image object. Likely the simplest implementation works with a callback function that gets called for each chunk available. Typically, one would specify the desired chunk size as an argument when the function is called. The callback function gets invoked once for each chunk, which is passed as a Node.js Buffer object. Additionally, a boolean flag last_chunk is required to signal which one is the last chunk.

Since I'm no C++ expert, I currently have no idea how to implement this on the C++ side. This seems to require a stream/buffer that triggers an event/callback after N bytes have been written to the stream. Maybe there is some Boost template for that? Also, I have no idea whether it's better to implement this as a sync or an async method (from a performance point of view), since each chunk seems to require some (expensive?) inter-thread access. Actually, this new method could just as well be encodeChunkedSync.

On the JavaScript side this could simply look like this (heavily shortened):

// encode and send image in 512 byte chunks
myImage.encodeChunked(512, function(err, buffer, last_chunk) {
    response.write(buffer);
    if (last_chunk) {
        response.end();
    }
});
@cklein05 (Author)

Any ideas on that?

@talaj (Member) commented Aug 21, 2017

This does not necessarily need to be a part of Node Mapnik. There might be some Node module that can encode PNG this way giving it raw data from img.data().

@flippmoke (Member)

@cklein05 I believe that you need not do anything on the node side to do this, but rather if you use apache or nginx they will be able to control this completely (but this is just my hunch). Overall, I do not feel that node-mapnik should have any responsibilities in doing this and so I do not think we should add this to the library.

With Node Mapnik, the whole rendered image is encoded into one single chunk by one of the save_to_string methods. So, with GeoServer, there is a level of parallelism by overlapping image encoding and sending, which actually brings some extra speed compared to our implementation.

In node-mapnik, the slowest part of the processing should simply be rendering the image. save_to_string just takes an uncompressed image and compresses it into a buffer for saving. We cannot split up processing of the image into partial sections; that would be very complex and frankly not worth it.

My best suggestion is this:

If speed is a requirement, I would suggest you investigate another solution besides WMS. WMS is not a fast format for many reasons, one of them being caching.

@cklein05 (Author)

@flippmoke OK, I have to accept this. However,

I believe that you need not do anything on the node side to do this, but rather if you use apache or nginx they will be able to control this completely (but this is just my hunch).

Of course, Node.js as well as Apache and NGINX support chunked encoding perfectly. However, this is not of much performance benefit if we must wait until Mapnik has encoded the whole image. The idea was to overlap/parallelize image encoding and streaming. Imagine replacing the destination string in save_to_string with Node's response stream. That is real streaming and is actually another way to implement this (but may be complicated, since it requires working with Node's response stream from C++).

So, an approximation of the above would be these buffers handed back through a callback function. The idea was not to split up image processing/encoding, but to split up or buffer the output stream.

@talaj (Member) commented Aug 21, 2017

On the other hand mapnik::save_to_stream() exists in the public interface so it should be simple to expose it in Node.

@flippmoke (Member)

mapnik::save_to_string() already utilizes this underneath. So it's not a stretch at all to do this, and perhaps we could use the NaN::Buffer() object here. However, this will probably be a good amount of wrapper code that we might need to add and maintain.

template <typename T>
MAPNIK_DECL std::string save_to_string(T const& image,
                                       std::string const& type)
{
    std::ostringstream ss(std::ios::out|std::ios::binary);
    save_to_stream(image, ss, type);
    return ss.str();
}

I am not sold, however, that this would actually lower the memory footprint much, or at all. If we preserve the lifetime of the std::ostringstream, we simply hold that memory a bit longer; with the current method it is dropped once the std::string is created and passed to NaN for ownership. It could possibly be a bit faster, but it might be far easier to simply provide a buffer around the str() output from this method in node. I am also somewhat concerned about how this Buffer would be used, because we would have to figure out how to maintain the std::ostringstream object for the lifetime of the buffer in node.

In short, I don't think it's impossible, but it could be hard to implement properly.

@talaj (Member) commented Aug 21, 2017

I am not sold however, that this would actually lower memory footprint much or at all

The goal is not to save memory, rather this:

level of parallelism by overlapping image encoding and sending, which actually brings some extra speed compared to our implementation.

For this purpose, I would create a new std::ostream based class which would call a javascript callback on every write or when some intermediate buffer is full.

@talaj (Member) commented Aug 21, 2017

Something like this:

#include <array>
#include <iostream>

template<typename Callback, std::size_t Size, class Char = char>
class callback_streambuf : public std::basic_streambuf<Char>
{
public:
    using base_type = std::basic_streambuf<Char>;
    using char_type = typename base_type::char_type;
    using int_type = typename base_type::int_type;

    callback_streambuf(Callback callback)
        : buffer_{},
          callback_(callback)
    {
        base_type::setp(buffer_.begin(), buffer_.end());
    }

protected:
    int sync()
    {
        bool ok = callback_(base_type::pbase(),
                            base_type::pptr() - base_type::pbase());
        base_type::setp(buffer_.begin(), buffer_.end());
        return ok ? 0 : -1;
    }

    int_type overflow(int_type ch)
    {
        int ret = sync();
        if (ch == base_type::traits_type::eof())
        {
            return ch;
        }
        base_type::sputc(ch);
        return ret == 0 ? 0 : base_type::traits_type::eof();
    }

private:
    std::array<char_type, Size> buffer_;
    Callback callback_;
};

struct cout_callback
{
    template<class Char, class Size>
    bool operator()(const Char* buffer, Size size) const
    {
        std::cout << "CALLBACK: '";
        std::cout.write(buffer, size);
        std::cout << "' (" << size << ")" << std::endl;
        return true;
    }
};

int main()
{
    cout_callback callback;
    callback_streambuf<cout_callback, 10> streambuf(callback);
    std::ostream stream(&streambuf);
    stream << "Lorem ipsum dolor sit amet, "
        "consectetur adipiscing elit, "
        "sed do eiusmod tempor incididunt ut "
        "labore et dolore magna aliqua.";
    stream.flush();
}
 $ g++ -std=c++14 callback_stream.cpp -o callback_stream &&  ./callback_stream
CALLBACK: 'Lorem ipsu' (10)
CALLBACK: 'm dolor si' (10)
CALLBACK: 't amet, co' (10)
CALLBACK: 'nsectetur ' (10)
CALLBACK: 'adipiscing' (10)
CALLBACK: ' elit, sed' (10)
CALLBACK: ' do eiusmo' (10)
CALLBACK: 'd tempor i' (10)
CALLBACK: 'ncididunt ' (10)
CALLBACK: 'ut labore ' (10)
CALLBACK: 'et dolore ' (10)
CALLBACK: 'magna aliq' (10)
CALLBACK: 'ua.' (3)

@cklein05 (Author)

@talaj That's it. That's what I was asking for... 👍 Thank you for investigating and creating that descriptive sample application as well as the std::ostream based class.

It remains to be checked whether this actually works with an asynchronous NaN method or whether it must be implemented synchronously.

@cklein05 (Author)

Some remarks:

Template parameter std::size_t Size should be an argument provided by the corresponding NaN method. I believe you can't use runtime values inside <...>, so you have to pass the desired buffer size to the constructor instead:

int chunk_size = 10;      /* from NaN method */
callback_streambuf<cout_callback> streambuf(callback, chunk_size);

A boolean parameter indicating the last chunk should be passed to the callback function, since the last chunk is not necessarily smaller than the desired regular chunk size (detecting this requires knowing the stream's eof state). If this new encoding method actually works asynchronously, this seems required in order to give the caller a chance to properly end the request.

@talaj (Member) commented Aug 22, 2017

I'm sorry for not responding, I'm on vacation. I will try to take a look at it later this week.

Template parameter std::size_t Size should be an argument provided by the corresponding NaN method.

It depends on whether you want to set the size of the intermediate buffer at runtime. Probably yes; it was just an example.

A boolean parameter indicating the last chunk should be passed to the callback function, since the last chunk is not necessarily smaller than the desired regular chunk size

I'm also thinking about a distinct callback function, or indicating the last chunk with a zero chunk_size.

@cklein05 (Author)

Never mind. That's all voluntary and there's no rush. Enjoy your vacation :)

@talaj (Member) commented Aug 30, 2017

@cklein05 Take a look at #799.

@cklein05 (Author)

@talaj Great work! I will have a look at it ASAP. However, some other things are still in my queue. Maybe you could tell me the basic steps required to test this on an Ubuntu 14.04 box (clone repo, apply pull request, build, etc.)? Sorry for not yet being so deep into Node.js, npm, gyp, travis & Co.

@cklein05 (Author)

@talaj Finally I managed to fetch your branch and compile it. The newly built Node Mapnik seems to work as expected. However, it will take some time to make any statements about performance gains when using this with our WMS server. Also, I want to test how different buffer sizes influence the performance. I'll keep you informed.

@flippmoke reopened this Aug 30, 2017
@talaj (Member) commented Aug 30, 2017

Also, I want to test how different buffer sizes influence the performance.

I will make buffer size a parameter of encodeChunked(). It makes sense at least for testing.

@talaj (Member) commented Aug 30, 2017

Chunk size parameter is there now, see unit tests.

@cklein05 (Author) commented Sep 2, 2017

Thanks for the new chunk size parameter. Sorry for not responding, I've been out of the office for some days.

First simple tests showed that the new method basically works as expected. Again, thanks a lot for your work. Although the image transfer over the network is now definitely spread out (in contrast to a transfer burst after the whole encoding process), we could not yet detect any absolute performance gain, even with slow network connections. However, the process now seems much more network friendly. I guess we need some more sophisticated tests... We'll continue with more tests in two weeks (it's still vacation time).

@cklein05 (Author) commented Sep 25, 2017

Finally, I can come back with some successful and quite promising tests. However, testing the benefits of this patch is quite challenging, as there are many variables that affect the results:

  • network speed
  • chunk size
  • image encoding time (image complexity)
  • encoded image size (transfer size)
  • rendering time (including map setup time etc.)

Let me describe our basic testing setup. We decided to eliminate rendering time and used two different pre-rendered images. One is a highly complex random (noise) image (800 x 600 px), which Mapnik compresses to a 1,652,373-byte (~1.6 MB) PNG in ~130 milliseconds:

random_map (just a small sample)

The other is a low-complexity cadastral map image (1201 x 801 px), which Mapnik compresses to a 231,261-byte (~230 KB) PNG in ~65 milliseconds:
cadastral_map (just a small sample)

For each of the two images, we measured the response time of a WMS GetMap request with these chunk sizes (buffer sizes):

  • not chunked
  • 256 bytes
  • 512 bytes
  • 1024 bytes
  • 2048 bytes
  • 4096 bytes

over physical networks with these bandwidths:

  • 2 Mbit/s
  • 6 Mbit/s
  • 10 Mbit/s
  • 50 Mbit/s
  • 100 Mbit/s
  • 1000 Mbit/s

which makes 72 distinct requests in total. Measurements were taken using Apache JMeter, recording the average response time of 100 sequential request samples for each of the 72 requests over a physical local network. Network bandwidth was limited on Linux using the wondershaper traffic shaping script.

In the attached Excel sheet, there are three blocks of values and line charts, one for each block. The first block notes the absolute response times in milliseconds. This should be quite self-explanatory.

The second block represents the relative response times in percent, with the not chunked encoding response time being 100 percent. Those values are calculated as t_rel = t_x / t_notchunked * 100. The lower the value, the more time was saved by chunking. In the chart, the not chunked requests would form a horizontal line at 100.00 percent. A point at 80 percent means that the request took only 80 percent of the time of the not chunked request. Values above 100 percent seem to be caused by the chunk-generating overhead in the C++ code, which is bigger (kicks in earlier) the smaller the encoded image, the faster the network and the smaller the chunk size (which is not a surprise).

The first and second block of values and charts always include the fact, that transferring data over a slow network takes more time. So, in the third block, we tried to express time savings independently from overall response time. In this chart, we compared the absolute savings to the encoding time of the image, which has been measured on the server and is provided as a single (average) constant, one for each of the two images. This may be arbitrary, but that time is the only one we know for the request, which does not include network transfer time. The results look quite right and give us some higher resolution on the 2Mbit/s end of the charts.

So, finally, the third block shows the time saving compared to encoding time in percent. The higher this value, the better; that is, the more of the encoding time was saved on the overall response time. Those values are calculated as t_saved = (t_notchunked - t_x) / t_encoding * 100.

Get the Excel sheet from here:
mapnik_image_chunked_encoding.xlsx

More to come soon... in the meantime, I'm curious about any comments.

@talaj (Member) commented Sep 26, 2017

Great elaboration!

One could expect the upper bound of the time saving to be the encoding time. Interestingly, some relative time savings of the random image show more than 100% of the encoding time saved. That would mean a chunked image can be transferred faster than a one-piece image. Or is it a measurement problem?

Relative time savings of the cadastral map image show up to 75% of the encoding time saved on slower lines. So the concept works. I think we can also reduce the chunking overhead a bit.

On the other hand, are tens of milliseconds, compared to hundreds of milliseconds of overall response time, worth the effort? In my experience, stripping tens or even hundreds of milliseconds from rendering time is much simpler.

@cklein05 (Author)

Yes, we also find those more-than-100%-saving requests interesting. I cannot guarantee that it's not a measurement problem, since the server can theoretically be used by several people from the Internet. But I'm quite sure it's not. We took 100 samples per request and worked with the average times. Obviously, the transfer is spread out into small chunks. On fast networks, and if encoding is slow, there may even be gaps on the wire between those chunks. Maybe such transfers are simply faster than one single burst transfer. I'm not a networking expert, but this may well be the case.

Yes, the concept really works and your buffered stream works great. If you see room for improving your code, you're welcome to do so... 👍

BTW, we've encountered that provided palettes are not handled by the new encodeChunked function. Have a look at your method void Image::EIO_EncodeChunked(uv_work_t* work). When calling mapnik::save_to_stream, you should pass the palette as the last parameter, if one exists. See file src/mapnik_image.cpp (line 3713) in method Image::EIO_Encode.

Yes, it's likely much simpler to gain more speed from rendering optimization. However, that usually involves adjusting the stylesheet and/or the database. As the creators of a WMS server, we have no knowledge of those parts of the setup, since they are up to the server's maintainer. And even if that setup is optimal, we can still get some improvement with this chunking technique.

Actually, the goal was to optimize network transfer. More precisely, to do it like Apache Tomcat / GeoServer does it. It really annoyed me to see that our Mapnik based WMS server was always faster from within the 1 GBit/s LAN and always slower (only 5 to 10 percent slower) over the Internet compared to GeoServer, although the server will likely often be used over the Internet. Now, with the chunked encoding in place, the server is almost always faster than GeoServer.

Additionally, a common use case for WMS (in times when tile servers are used everywhere) is high-DPI maps or printing. There you often have huge images, for which a lot of savings can be achieved, even with fast networks.

@cklein05 (Author)

Summing it up...

Chunk Sizes

It seems that chunk sizes smaller than 2048 bytes are not very useful, since there is too much chunk-generating overhead. With fast networks (~ >= 100 Mbit/s) there may even be a negative benefit, that is, chunked transfer is slower than not chunked transfer. With 2 KB and 4 KB chunks, chunked encoding was always faster than not chunked encoding for both test images. There may be a negative effect with smaller images (< 200 KB), but those were not part of this test.

Even more performance could be obtained with even bigger chunks (e.g. 8 KB, 16 KB). Maybe we will run some more tests on that. AFAIK, GeoServer on Apache Tomcat uses a default chunk size of 8 KB.

Network Bandwidth

Most savings can be achieved on network bandwidths between 10 Mbit/s and 100 Mbit/s. It seems like for smaller images the optimal bandwidth range is closer to 10 MBit/s (shifted to the left), whereas for bigger images the optimal bandwidth range is closer to 100 Mbit/s (shifted to the right).

For networks with a bandwidth smaller than 10 MBit/s, there is less benefit, since the overall transfer time eats up most of the savings. Maybe some better effect can be achieved with even smaller images, however, chunking makes less sense for smaller images in general.

For networks with 1000 MBit/s and above (true intranets), there is less benefit, since the network is too fast compared to the time needed to encode the image. There is no real overlapping of encoding and transfer, which is what this patch tries to benefit from. With really big images, the performance gain may be greater, but such images are unlikely to be requested from a WMS server. So, the goal for those networks is just not to be slower than with not chunked encoding due to chunk-generating overhead. This can easily be achieved with big enough chunk sizes (> 2 KB).

Between 10 MBit/s and 100 MBit/s (likely even up to 200 or 300 MBit/s), response times for bigger images drop to between 80 and 50 percent of the not chunked transfer time. These bandwidths represent typical broadband Internet connections and are the networks this patch was initially developed for. As you can see in the third block of values and the chart Relative Time Saving compared to Encoding Time in Percent, up to 90 percent of the encoding time of an image can be saved.

Rendering Time

Since we have run our tests with pre-rendered images, the savings discussed so far are, of course, only half the story. With a real WMS GetMap request, rendering time often plays the leading role in overall performance.

Rendering time adds a positive offset to the values in the first data block, which uniformly moves the lines in the chart Response Time [ms] up. The lines of the 2nd chart Relative Response Time compared to Not Chunked Response Time in Percent get flattened and squeezed towards the 100 percent line. The 3rd chart Relative Time Saving compared to Encoding Time in Percent is not changed by the added rendering time.

Obviously, a long rendering time wrecks the positive effects on performance caused by chunked image encoding. The longer the rendering time, the less the positive effect. However, it's not easy to describe this mathematically (at least for me :-p).

Conclusion

With this patch, the goal was not to generally improve performance (which is better done with a process called rendering optimization), but not to waste time when encoding and transferring the rendered image. In other words, we tried to make this final step as efficient as possible. I reckon the numbers and charts provided clearly show that we have been successful.

The real world test for this patch is a setup that compares our Mapnik based WMS server with GeoServer, running in an Apache Tomcat server. As already said in the first post, Mapnik was always faster from within the local 1 GBit/s network and almost always slower when queried from the Internet. Now, with this patch in place, the Mapnik based server is almost always faster when queried from both the LAN and the Internet.

Additionally, there are more benefits from transferring an image in chunks. In general, I believe that smaller chunks are more network friendly than a single burst transfer. Our tests suggest that those may even be transferred faster than a single big block of data. (For some requests the saving was greater than 100 percent of the encoding time; in theory, the encoding time should be the saving's upper limit.)

Finally, this patch lowers the server's memory footprint, since only a small buffer for a single chunk is required at a time for a single encoding process (in contrast to a buffer that holds the whole encoded image).

With all that in mind, I'd really love having this patch implemented in Node Mapnik.

@talaj (Member) commented Sep 30, 2017

BTW, we've encountered that provided palettes are not handled by the new encodeChunked function.

This is fixed and included into the unit tests now.

I've also fixed a nasty bug where resources were released at the wrong time, causing crashes.
Please use the latest version from #799.

@cklein05 (Author)

Thank you for fixing this... :-) I haven't tested yet, but will soon. Have you also been able to reduce the chunking overhead further?

@talaj (Member) commented Oct 10, 2017

I was doing some simple profiling with perf. From what I could see, time spent by chunking was insignificant next to other calls in Node. I will try to find some time to do more performance tests.

In any case, I think that performance could be improved in two ways:

@cklein05 (Author)

I completely agree with #531. However, I don’t see any benefit coming from your second suggestion. Since chunks are instantly sent down the wire when the callback is called, isn't this the same as increasing chunk size, which is already implemented?

Of course, from your (or the encodeChunked method's) point of view, the focus lies on generating chunks, no matter when those are delivered. From my (or the chunked-encoding HTTP server's) point of view, this is likely not worth it, since, for example, getting two chunks per callback is actually the same as doubling the chunk size.

@talaj (Member) commented Oct 10, 2017

isn't this the same as increasing chunk size

I must say that I'm not sure.

Since chunks are instantly sent down the wire when the callback is called

But the callback is not instantly called when a chunk is generated. A chunk is generated on a separate thread, while the callback must be called from the main thread. The main thread processes a queue of operations, so it takes some time for the main thread to get through the queue to the callback. In the meantime, new chunks could have been generated.

It now loops over all generated chunks in this for loop. I would maybe send it all at once.
