pymongowatch
is an extension for the MongoDB’s Python driver pymongo
which enable watching and auditing the underlying operations.
MongoDB Community Server doesn’t support auditing and the database profiler is not very flexible. Basically your only option for filtering is to collect operations that take longer than a certain value. You can’t easily separate the insert log from the query log or you cannot extract the operations only for a certain collection or a certain set of queries.
Furthermore, operations at database level are not always the same as operations in an application. A very simple operation like fetching documents with find may be translated to several getMore operations at database level.
Therefore, if you want to analyze the user behavior by studying the database queries, a very simple auditing log for each user search containing the query itself (possibly with deleting value fields to protect the users’ privacy), the number of the documents read from the database and its time is all you need.
Doing this at database level is not always feasible and doing it manually at application level may not be very desirable for the developers. This is when an intermediate tool at library level comes handy.
pymongowatch
can detect and audit operations at application level
without any modifications to existing applications.
First you have to install pymongowatch
. It is available at Python
Package Index (PyPI) and can be installed with pip. At a virtual
environment or system-wide:
pip install pymongowatch
Other installation options such as Debian packages are available, too.
After basic installation, pymongo.watcher
sub-package will be
available and you have two options to go on:
This approach is recommended if you have no restrictions on modifying
the pymongo-based application and you don’t want any hacks on the
pymongo
itself.
from pymongo import watcher
watcher.patch_pymongo()
watcher.add_logging_handlers() # Adds a logging.StreamHandler by default
Of course, you can pass your own customized logging handlers to
add_logging_handlers
and add filters and formatters as you wish. See
the following sections for more information.
If you don’t own the application or you don’t want to even add a
single line of code to your existing application, pymongo_mask
which
is included in the pymongowatch
package comes handy.
pymongo_mask
is a directory with a module pymongo.py
which
emulates the pymongo
package by using the real installed pymongo
;
So if you add pymongo_mask
directory in your Python application
search path for modules–in somewhere with more priority than where
the real pymongo
is installed–the mask will be imported in your
application instead of the real pymongo
. Nevertheless the mask will
work as a proxy to the real installed pymongo
and your application
should work as usual.
The importation of the pymongo mask will also trigger
watcher.patch_pymongo()
automatically, so the watchers will be
enabled without any need to change your application source code or
even pymongo
library itself.
Furthermore, the pymongo mask will apply the configuration at
etc/pymongowatch.yaml
with logging.config.dictConfig to the loggers
of the watchers. A sample etc/pymongowatch.yaml is included in the
pymongowatch
package but you can modify it as needed. The yaml
configuration follows the logging dictionary schema.
Finally, pymongowatch
package includes a pymongowatch-install-mask
script which can help you enable the pymongo mask with ease:
pymongowatch-install-mask # follow the interactive instructions
pymongowatch
implements classes such as WatchCursor
(a subclass of
pymongo’s cursor
) which if used instead of pymongo’s cursor
can
emit logs as specified. To do so, you need some workaround to make
pymongo’s methods to use pymongowatch
classes instead of its
defaults. The patch_pymongo
function is the method that will call
all the patch methods in pymongowatch
classes to replace the default
pymongo underlying classes:
from pymongo import watcher
watcher.patch_pymongo()
All the pymongowatch
watcher classes will use the logger retrieved
with pymongo.watcher.logger.get_log_emitter()
to emit their
logs. Each module will emit the logs with their __name__. So for
example, all the cursor logs will be emitted under the
pymongo.watcher.cursor
. So you can apply specific logging
configuration for each module or as all the logs has the prefix
pymongo.watcher
as their logger name you can apply global
configuration with that name.
import logging
# The following code is not the recommended method for adding handlers
# to the pymongowatch loggers and is for illustrative purposes only
global_logger = logging.getLogger("pymongo.watcher")
# add a stream log handler globally to watch all the logs on the
# console
global_logger.addHandler(logging.StreamHandler())
cursor_logger = logging.getLogger("pymongo.watcher.cursor")
# add a file handler specifically for cursor logs to store them in
# file, too.
cursor_logger.addHandler(logging.FileHandler("/tmp/watch.log"))
Of course you can apply filters and formatters or add more handlers as
you wish but there are some details that you should take care of. The
easiest way to achieve this is by leveraging the
add_logging_handlers
method:
from pymongo import watcher
# Add two handlers to get all the watcher logs both in file and
# console
console_handler = logging.StreamHandler()
global_handler = logging.FileHanlder("/tmp/watcher-all.log")
watcher.add_logging_handlers(console_handler, global_handler)
# Add a more customized handler for cursor logs
cursor_simple_handler = logging.FileHanlder("/tmp/watcher-cursor-simple.log")
watcher.add_logging_handlers(
cursor_simple_handler,
logger_name="pymongo.watcher.cursor",
formatter="{name} - {watch}")
# Add a more customized handler for cursor logs
cursor_customized_handler = logging.FileHanlder(
"/tmp/watcher-cursor-customized.log")
watcher.add_logging_handlers(
cursor_customized_handler,
logger_name="pymongo.watcher.cursor",
formatter="{asctime} {name}.{watch.Collection} - {watch.Query} fetched "
"{watch.RetrievedCount} in {watch.RetrieveTime} seconds")
# Add another handler to log the full information for cursors in csv
cursor_csv_handler = logging.FileHanlder(
"/tmp/watcher-cursor-csv.log")
watcher.add_logging_handlers(
cursor_csv_handler,
logger_name="pymongo.watcher.cursor",
formatter="{asctime},{name},{watch.csv}")
Note that using add_logging_handlers
has not only the advantage of
simplicity for adding formatters, but also take care of automatically
adding an extra logging.handlers.QueueHandler and
logging.handlers.QueueListener for each handler to overcome some log
mutation issues we discuss later. It will also add a couple of
customized filters to the QueueHandler
that are necessary to fully
take advantage of pymongowatch
.
You can set the log format by using formatters either by passing a
string as the formatter
option to the add_logging_handlers
or by
creating a formatter
object and using the handler’s setFormatter
method directly. In either case the recommended way is to use {
style. Specially if you want to access inner values with dot notation
e.g. watch.Query
or watch.Collection
other styles such as %
and
$
(e.g. %(watch.Query)s
or ${watch.Query}
) will NOT work.
Another useful feature of Python logging
module is its filters. You
can use filter
objects not only to filter unwanted logs but also to
modify the ones that you want. add_logging_handlers
will
automatically add some basic filters (such as AddPymongoResults
which will extract information about pymongo
operations results and
add them to the related log), but you can also add your own filters.
pymongo.watcher.filters
module implements a couple of useful
configurable filters such as ExpressionFilter
, ExecuteFilter
and
RateFilter
. The etc/pymongowatch.yaml has some examples on how to
use them.
You can also develop your own logging filters by sub-classing
logging.Filter as usual and access the record.watch
in filter method
for investigating or modifying the watcher attributes such as DB
,
Collection
, Query
, etc.
import logging
class SlowQueriesOnNewsCollectionFilter(logging.Filter):
def filter(self, record):
watch = getattr(record, "watch", {})
return (watch.get("Collection") == "news" and
watch.get("RetrieveTime", 0) > 10)
Or you can add filters to modify the logs:
import logging
def remove_private_data(data):
if isinstance(data, dict):
return {k: remove_private_data(v) for k, v in data.items()}
elif isinstance(data, list):
return [remove_private_data(i) for i in data]
return None
class UserPrivacyFilter(logging.Filter):
def filter(self, record):
watch = getattr(record, "watch", {})
watch.update(remove_private_data(watch))
return True
Don’t forget to add the defined filters to you handlers:
my_handler.addFilter(SlowQueriesOnNewsCollectionFilter)
my_handler.addFilter(UserPrivacyFilter)
Lastly, you can use Python’s great logging.conig module and specially
the new flexible logging.config.dictConfig method to apply all the
handlers
, formatters
and filters
in a single configuration
file. Custom pymongowatch
configuration can also be applied in the
same dictionary.
pymongowatch
has even the required filters implemented in
pymongo.watcher.filters
module. To see the examples for the
dictConfig
configuration with watcher filters refer to the
etc/pymongowatch.yaml
file which will be installed via
pymongowatch
(if you are using a virtual environment, it would be
inside the venv directory).
There is a pymongo.watcher.dictConfig
method that accepts a
configuration dictionary with almost the same specification as
logging.config.dictConfig
. But it will apply custom configuration
required by pymongowatch
itself.
The pymongo.watcher.dictConfig
will not apply the intended logging
configurations such as handlers
and formatters
. Instead, it reads
the special key watchers
from the dictionary and apply the related
configuration for pymongowatch
. Although you can use the a single
dictionary with both logging configuration and watchers
key and
apply both pymongo.watcher.dictConfig
and
logging.config.dictConfig
.
For more information about the pymongowatch
configurations please
refer to the watchers
key inside the example etc/pymongowatch.yaml
file that will be installed alongside the package.
Mutable Logs? Is that a thing?
Usually the good thing about logs is that they are immutable. So if you see a log you can trust it. This is always true when some atomic operation happens and you have no concerns about the start and end time of the operation (and you don’t have access to a time machine to travel to the past and change what happened).
But what if you start an operation which we have no idea when will it end? Suppose we have queried a very large database for a very slow query that may take some time to get back the full results. Also, we may use a cursor in our application to fetch data and the application has some delays itself and we don’t want the slowness of the application to affect the database auditing.
These are the sort of challenges that pymongo.watcher.WatchQueue
tries to fix.
pymongowatch
uses pymongo.watcher.logger.WatchMessage
instead of
strings as log messages as described in using arbitrary objects as
messages in Python’s logging HOWTO. WatchMessage
is a sub-class of
Python’s dictionaries which are mutable objects.
WatchMessage
instances are the {watch}
templates in the format
strings that we saw earlier. They are a dict
so you can access log
attributes with []
access e.g. watch["Query"]
. For more
convenience while using log formatters WatchMessage
provides
attribute access with dot notation e.g. watch.Query
and one of the
reasons why {
style formatting (which let you use dot notation
access) is recommended for logging formatters.
A WatchMessage
is a mutable object and the watcher classes need to
modify the logs from time to time. For example if you fetch more items
from a cursor, they can update the attributes such as RetrieveTime
or RetrievedCount
. So we have flexibility and it is pymongowatch
users decision when to emit the final immutable log with the logger
handlers.
In the old versions of pymongowatch
, the watcher classes modified
the WatchMessage
even after the emitting the log, leaving the entire
burden for consistency on the WatchQueue
. This approach soon became
problematic in multiprocessing environments where the WatchMessage
in the emitter process differs from the one in the queue process.
In the newer versions, WatchMessage
has a unique identifier
WatchID
for the ongoing operation. The watcher classes will not only
mutate the previous WatchMessage
but also re-emit the log again and
again on each update (with an incremental meta-data named
Iteration
). In this way, even if the log has already stored
somewhere as an immutable object (for example in a file as a log
record or in a separate process as an object that cannot see the
modifications to the original WatchMessage
) can still receive the
updates.
The pitfall is now we may have thousands of updates and consequently
thousands of logs for each operation. But this problem can easily be
handled by WatchQueue
or an external tool such as
pymongowatch-csv
.
WatchQueue
alongside a QueueHandler is the right tool to make sure
we handle logs at right time. A WatchQueue
works like a priority
queue in which the earliest logs has the higher priority but it can
also aggregate the updates for each operation based on the WatchID
.
WatchQueue
will update the old iterations with the newer ones and
releases the logs when they have the final
iteration mark or when
they have expired. The final
mark is a special case of Iteration
meta-data in each log that indicates the database operation has
finished and the watcher class will not modify this WatchMessage
(with the specified WatchID
) anymore.
Also it is important to note that the highest priority item (for
retrieval) in the WatchQueue
is not always the earliest generated
log. First, the different operations may take different times, and
second, the watcher classes may assign different expiration times to
each operation.
Finally you can set a forced_delay_sec
to add delay to all the logs
in the queue (for example if you are using logging for analytics and
you do care more about accuracy than delay for the logs). This is an
optional keyword arguments that both WatchQueue
constructor and
add_logging_handlers
method that we saw earlier accepts.
The add_logging_handlers
has a with_queue
optional argument which
if is True
(the default), will use
pymongo.watcher.setup_queue_handler
to setup a QueueHandler
alongside a started QueueListener for each handler you specify with a
WatchQueue
so you usually don’t have to worry about log mutation if
you use add_logging_handlers
to add your handlers to watcher
loggers.
A pymongowatch-csv
script is also available after the pymongowatch
installation that can be used to aggregate all the iterations of a
single operation into a single row when you use the csv
format:
pymongowatch-csv aggr /tmp/watch.csv > /tmp/aggregated-watch.csv
Another useful application of QueueHandler is its use in logging to a
single file from multiple processes. For this application you have to
use a multiprocessing Queue
alongside the
QueueHandler
. Fortunately WatchQueue
also supports
multiprocessing. All you have to do is to pass True
as the
enable_multiprocessing
argument.
This is useful for example when you have a pymongo
based web
application with several web server processes and you need a process
safe method to store the logs in a single file. Its worth noting that
in such environments using the default value for
enable_multiprocessing
i.e. False
will result in total failure in
logging because the default Queue
is not multiprocessing and the
QueueHandler
and the QueueListener
will use different queues in
different processes.
Passing True
as enable_multiprocessing
in WatchQueue
constructor
will makes the constructor to build and return a proxy object from a
customized manager which has its own dedicated process. You can also
pass enable_multiprocessing
to setup_queue_handler
and
add_logging_handlers
:
# Extra keyword arguments of `add_logging_handlers` will be passed to
# the newly created WatchQueue for each handler.
watcher.add_logging_handlers(enable_multiprocessing=True)
pymongowatch
is not a standalone MongoDB library and it relies on
the the MongoDB’s Python driver pymongo. But does the pymongo’s
version matter?
pymongowatch
has been tested with the recent versions of pymongo
i.e. 3.10
and the newer 4
series but you can use it for other
versions at your own risk. If you have any problems you can open an
issue at the project’s issue tracker.
One known difference between pymongo
versions is that they handle
operation closing differently. For example, 4
series close the
cursors more intelligently and you can usually see the cursor
logs
very fast without any need to a explicit timeout whereas any 3
series usually an explicit timeout is required.
If you are a Debian-based GNU/Linux distribution user you are in luck! There is a Debian package maintained in the project’s debian branch that can make your installation even easier.
You can find the binary packages at mrazavi’s pymongowatch PPA and to install it on Ubuntu:
sudo add-apt-repository ppa:mrazavi/pymongowatch
sudo apt update
sudo apt install python3-pymongowatch
The project issue tracker is the main location to maintain the details for the future development. But here the cardinal points will be reviewed briefly.
- NOTE
- If you see an ugly TODO list below with oversize items, it’s
not even clear which items are DONE and which ones are still TODO,
maybe that is because this document is written in org-mode but you
are seeing a bad render e.g. in GitHub or a converted
reStructuredText format e.g. because the lack of org-mode support in
PyPI.
That doesn’t make org-mode less lovable or inferior. Anyway org-mode is one of the most reasonable markup languages to use for text. Why not to use it and brag about it?
Support for basic collection operations #3
Support for CommandCursor #4
Improve mutable logs #2: add a unique ID for each operation
The pymongowatch
has developed mainly by Mohammad Razavi.