RFC: Visualization Output Plugin #1449

Draft: wants to merge 1 commit into master
src/output/plugins/visualization/VisualizationOutputPlugin.cxx (250 additions)
/*
* Copyright 2003-2022 The Music Player Daemon Project
* http://www.musicpd.org
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License along
* with this program; if not, write to the Free Software Foundation, Inc.,
* 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
*/

/**
* \page vis_out_protocol Visualization Network Protocol
*
 * See \ref vis_out "RFC: Visualization Output Plugin" for background.
*
*
* \section vis_out_protocol_timing Timing
*
* In order to deliver sound data to the client at the proper time, the protocol
* needs to take into account:
*
* - network latency: the delta between writing the sound data to the socket & its
* receipt on the client
*
* - player buffering: the player may buffer sound data (mplayer, for instance,
* buffers half a second's worth of audio before beginning playback)
*
* - render time: the client presumably wishes the current frame to appear on-screen
* at the moment the current sound information is ending
*
 * Throughout, let \e t be "song time" as measured on the server, and T(t) be
* sound information for song time \e t. Let FPS be the frames-per-second at
* which the client would like to render.
*
* Then, at an interval of 1/FPS seconds, the server needs to write
*
\verbatim
T(t - {buffer time} + {render time} + {one way latency})
\endverbatim
*
 * to the client socket. If we denote that offset (-{buffer time} + {render
 * time} + {one way latency}) by tau, then the server should wait max(0, -tau)
 * ms before writing the first frame.
*
* A few examples will illustrate.
*
* \subsection vis_out_protocol_timing_eg_1 Example 1
*
* Let the client render time be 4ms and round-trip network latency be
* 6ms. Assume no player buffering. In order to render a frame corresponding to
* song time \e t, the client would need, at time \e t - 4 ms, sound information
* corresponding to time \e t, or T(t). The server would need to \e send that
* information at time \e t - 7ms (half of one round-trip plus render time).
*
* In other words, on the server side at song time \e t, we would need to write
* T(t + 7ms) to the client socket. If the server writes T(t+7ms) immediately,
 * the client will receive it at \e t + 3ms, take 4ms to render the next frame,
 * and so finish rendering T(t+7) at \e t + 7ms.
*
* \subsection vis_out_protocol_timing_eg_2 Example 2
*
* Imagine we are running the same client on a laptop, connected to an MPD
* server over the internet, and using mplayer as the player. This gives 500ms
* of buffer time. Let us assume the same 4ms render time, but now a 20ms
* round-trip time.
*
* In order to render a frame corresponding to song time \e t, the client would
* need, at time \e t - 4ms, T(t). This would need to be sent from the server at
* time \e t - 14ms. We now need to incorporate the client-side buffering,
 * however: song time \e t will actually be played on the client at \e t + 500ms
 * as measured on the server.
*
* In other words, on the server side at song time \e t, we would need to write
* T(t-486ms) to the client socket.
*
 * Since the sound won't start on the client for 0.5 sec, it would make no sense
 * to begin writing sound information for the first 486ms. Let t(0) be the moment the
* client connects and the player begins buffering. If, at t(0) + 486ms, the
* server writes T(t(0)), the client will receive it at t(0) + 496ms & complete
* rendering it at t(0) + 500ms, which is when the client-side player will
* begin playing song time t(0).
*
* \section vis_out_protocol_proto The Protocol
*
* \subsection vis_out_protocol_proto_design Design
*
 * The author is unaware of any existing network protocols in this area, so he
 * designed his own after reviewing the Shoutcast & Ultravox
* protocols. Experience with the TLS & 802.11 protocols also informed this
* design.
*
* Design goals include:
*
* - client convenience
* - this in particular drove the choice to stream updates; everything
* needed to simply push the data out is knowable at handshake time,
* so why force the client to send a request?
* - efficiency on the wire
* - binary format
* - streaming preferred over request/response
* - future extensibility
* - protocol versioning built-in from the start
* - parsing convenience
* - streaming messages come with a few "magic bytes" at the start
* to assist clients in "locking on" to the stream & recovering from
* corrupted data
* - all messages conform to the "type-length-value" (TLV) format
* beloved of parser writers
*
*
* \subsection vis_out_protocol_proto_overview Overview
*
* The protocol is a combination of request/response as well as streaming. After
* an initial handshake (client goes first) the server will begin streaming
* messages to the client; i.e. at the interval the client specified during the
* initial handshake the server will send FRAME messages containing sound
* information useful for visualizers. Additionally, METADATA messages will be
* sent on receipt of notifications from MPD that the song has changed. The
 * client need not request these messages, nor does the client need to acknowledge
* them in any way.
*
* The client \e may, at any time after handshake completion, initiate two other
* exchanges:
*
 * - PING: the client may want to periodically adjust its estimate of the
 * round-trip latency; the server will respond with a PONG.
* The client can measure the delta between request & response & update
* its estimate.
*
* - ADJBUF: the client may periodically want to adjust the "buffer time";
* that is, the delta between song time as measured on the server and
 * the song time to which each FRAME & METADATA message corresponds; the server
 * will adjust its internal timers & respond with an ADJBUFACK message.
* The server \e may send one more frame at the old setting of tau.
*
* Schematically, a conversation looks like this:
*
\verbatim
Client Server

desired protocol version
tau (buffer offset)
desired sound params --------- CLIHLO --------->
damping parameter
window parameter
...

<-------- SRVHLO --------- offered protocol version
|
sequence number --------- PING ----------> |
<-------- PONG ---------- sequence number | tau ms
|
|
<------- METADATA-------- artist, title &c v
<-------- FRAME --------- samples, spectrum |
bass/mids/trebs |
... | tau ms
|
<------- METADATA-------- artist, title &c v
<-------- FRAME --------- samples, spectrum |
bass/mids/trebs |
... |
sequence number --------- PING ----------> |
<-------- PONG ---------- sequence number | tau ms
|
|
<------- METADATA-------- artist, title &c v
<-------- FRAME --------- samples, spectrum |
bass/mids/trebs |
... | tau ms
|
<------- METADATA-------- artist, title &c v
<-------- FRAME --------- samples, spectrum |
bass/mids/trebs |
... | tau ms
|
tau' (new buffer -------- ADJBUF --------> |
offset) <------ ADJBUFACK ------- |
|
<------- METADATA-------- artist, title &c v
<-------- FRAME --------- samples, spectrum |
bass/mids/trebs |
... | tau' ms
|
<------- METADATA-------- artist, title &c v
<-------- FRAME --------- samples, spectrum
bass/mids/trebs
...
....
(forever)
\endverbatim
*
* There is no formal "close" or "teardown" message; each side simply detects
* when the other has gone away & treats that as the end of the conversation.
*
*
* \subsection vis_out_protocol_proto_msgs Messages
*
* All messages:
*
* - integers use network byte order (i.e. big endian)
* - use TLV format (streaming messages prepend magic bytes)
*
\verbatim

+-----------------------+-----------------+-----------------------+--------+
| TYPE (16-bit unsigned)| LENGTH          | PAYLOAD               | CHECK  |
|-------+---------------+-----------------+-----------------------+--------+
| class | message type  | 16-bit unsigned | LENGTH bytes          | 1 byte |
| 4 bits| 12 bits       | (max len 65535) | format is msg-specific| 00     |
+-------+---------------+-----------------+-----------------------+--------+

\endverbatim
*
* Notes:
*
 * - the message type is composed of two values packed into a uint16_t:
 *
 * - class: (type & 0xf000) >> 12:
 * - 0: handshake
 * - 1: control (PING, e.g.)
 * - 2: streaming (FRAME, e.g.)
 *
 * - message type: (type & 0x0fff); see below for values
*
* - the "length" field is the length of the \e payload \e only
*
* - the "check" byte is intended as a sanity test & shall always be zero
* TODO(sp1ff): replace this with a proper checksum?
*
* TODO(sp1ff): define each message
*
*
*/


const struct AudioOutputPlugin visualization_output_plugin = {
"visualization",
nullptr, // cannot serve as the default output
nullptr, // TODO(sp1ff): Write me!
nullptr, // no particular mixer
};
src/output/plugins/visualization/VisualizationOutputPlugin.hxx (136 additions)
/*
* Copyright 2003-2022 The Music Player Daemon Project
* http://www.musicpd.org
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License along
* with this program; if not, write to the Free Software Foundation, Inc.,
* 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
*/

#ifndef MPD_VISUALIZATION_OUTPUT_PLUGIN_HXX
#define MPD_VISUALIZATION_OUTPUT_PLUGIN_HXX

/**
 * \page vis_out RFC: Visualization Output Plugin
*
* \section vis_out_intro Introduction
*
* This plugin started from a conversation on the #mpd IRC channel. I asked
* about the best way to implement a music visualizer as a remote <a
* href="https://musicpd.org">MPD</a> client. All the current MPD visualizers of
* which I'm aware use the <a
* href="https://mpd.readthedocs.io/en/latest/plugins.html#fifo">fifo</a> output
* plugin and hence must be run on the same host as the MPD daemon.
*
* The response I got was a suggestion that I write an output plugin that would
* \e just stream the data needed to implement a remote visualizer. I've begun
* work on such a plugin, but before I spend too much time implementing it I
* would like to lay out my proposal & solicit feedback.
*
 * The codebase uses <a href="https://www.doxygen.nl">Doxygen</a>-style comments,
* so I'm presenting this RFC as a few doxygen pages in the first files I'd be
* adding to the project.
*
*
* \section vis_out_prior Prior Art
*
* Music visualization sources which I consulted before settling on this
* proposal:
*
 * - This <a href="https://github.com/MusicPlayerDaemon/MPD/pull/488">PR</a>
* proposed solving this problem by implementing an output plugin that would
* stream the raw PCM data over TCP, the idea being that the remote visualizer
* would do the sound analysis client-side. The PR was discarded as being
* redundant with the <a
* href="https://mpd.readthedocs.io/en/latest/plugins.html#httpd">httpd</a>
* output plugin. I would also observe that such a solution sends far more
* data on the wire than is needed for visualization (on which more below).
*
* - <a href="https://github.com/ncmpcpp/ncmpcpp">ncmpcpp</a> uses the fifo
* output plugin, and as such can only provide the visualization feature when
* it's being run locally. The sound analysis is limited, as well (see below)
*
* - <a href="https://github.com/dpayne/cli-visualizer">cli-visualizer</a> will
* work with the MPD fifo (again assuming the MPD daemon is running
* locally). Limited sound analysis, as well.
*
* - <a href="http://www.geisswerks.com/about_milkdrop.html">MilkDrop</a>:
* reading the source code was very instructive in terms of sound analysis for
* music visualization; that aspect of this proposal is largely based on it.
*
*
* \section vis_out_proposal The Proposal
*
* A new output plugin "visualization" will be implemented. The plugin will
* cache recent PCM data. The plugin will also be a ServerSocket. When clients
* connect, they will provide the details of the sound analysis they would like
* performed, the frame rate at which they would like to receive updates and the
* offset between client-side render time & server-side song time (to account
* for network lag, client-side buffering & the time needed to render each
* frame). Once that initial handshake is complete, the server will stream
* updates containing sound analysis results at regular intervals to the
* client.
*
* \subsection vis_output_proposal_analysis Sound Analysis
*
* Given audio data in raw PCM format, a number of steps may be taken to
 * analyze that data & produce information useful to visualizers:
*
* - the PCM data may optionally be damped by taking a weighted average between
* the current values & prior values in the time domain; this will have the
* effect of reducing noise in the higher frequency ranges
*
* - the PCM data may have a "window function" applied to it in the time domain
* around the time of interest; such a function has the effect of "dialing
* down" audio samples further from the timestamp of interest and again will
* reduce higher-frequency noise; the size of the window may be configured to
* incorporate more or less data as desired.
*
* - the resulting PCM data will be shifted into the frequency domain by
* application of the Discrete Fourier Transform
*
 * - the human ear can only distinguish frequencies from (about) 20Hz to 20000Hz,
* and in practice musical sound information doesn't show much activity above
* 10000Hz; it is therefore convenient to throw out frequency data outside
* some (client-configurable) range
*
* - it is also convenient to divide the resulting spectrum into a few coarse
* bands, such as bass/mids/trebs. This is computationally non-trivial because
 * perceptually, frequency is not linear, it's logarithmic. A change of one
 * octave corresponds to a doubling in frequency. Intuitively, this means that
 * the difference between 200 & 300Hz is much greater than the difference
 * between 5000 & 5100Hz, e.g. The plugin will perform this service for
* each client.
*
* - it can also be useful to maintain a weighted time average of the activity
* in each frequency range for purposes of beat detection
*
*
* \subsection vis_output_protocol The Protocol
*
 * Note that each update need only be based on relatively few samples (Winamp,
 * e.g., uses 576). This will keep the data transferred on the wire small (at
* least by comparison to, say, the httpd output plugin which of course needs to
* send the entire song). Casting the protocol in terms of client-side FPS
* allows us to avoid a "request/response" protocol & simply stream until the
* client goes away.
*
* I've broken out the detailed protocol specification into its own
* \ref vis_out_protocol "page".
*
*
*/

extern const struct AudioOutputPlugin visualization_output_plugin;

#endif