gh-134004: Dbm vacuuming #134028

Open: wants to merge 12 commits into base: main
27 changes: 26 additions & 1 deletion Doc/library/dbm.rst
@@ -15,10 +15,16 @@
* :mod:`dbm.ndbm`

If none of these modules are installed, the
slow-but-simple implementation in module :mod:`dbm.dumb` will be used. There
is a `third party interface <https://www.jcea.es/programacion/pybsddb.htm>`_ to
the Oracle Berkeley DB.

.. note::
None of the underlying modules will automatically shrink the disk space used by
the database file. However, :mod:`dbm.sqlite3`, :mod:`dbm.gnu` and :mod:`dbm.dumb`
provide a :meth:`!reorganize` method that can be used for this purpose.
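
A minimal usage sketch of the note above, assuming a Python build in which the selected backend exposes `reorganize()`. The helper name `compact_if_supported` and the file name `example_db` are hypothetical; the `hasattr` check keeps the snippet harmless on backends or versions that lack the method.

```python
import dbm
import os
import tempfile


def compact_if_supported(path):
    """Open the database at *path* and call reorganize() if the backend has it.

    Returns True when reorganize() was called, False otherwise.
    """
    with dbm.open(path, "c") as db:
        # reorganize() is only provided by some backends, so probe for it.
        if not hasattr(db, "reorganize"):
            return False
        db.reorganize()
        return True


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example_db")  # hypothetical file name
    with dbm.open(path, "c") as db:
        for i in range(100):
            db[f"key{i}".encode()] = b"x" * 1000
        # Deleting entries leaves unused space in the file.
        for i in range(50):
            del db[f"key{i}".encode()]

    reorganized = compact_if_supported(path)

    # The remaining entries must be unaffected either way.
    with dbm.open(path, "r") as db:
        remaining = len(db.keys())
```

Because shrinking never happens automatically, a call like this would typically be made explicitly after a large batch of deletions.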


.. exception:: error

A tuple containing the exceptions that can be raised by each of the supported
@@ -186,6 +192,16 @@ or any other SQLite browser, including the SQLite CLI.
The Unix file access mode of the file (default: octal ``0o666``),
used only when the database has to be created.

.. method:: sqlite3.reorganize()

If you have carried out a lot of deletions and would like to shrink the space
used on disk, this method will reorganize the database; otherwise, deleted file
space will be kept and reused as new (key, value) pairs are added.

.. note::
While reorganizing, as much as twice the size of the original database is required
in free disk space.
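
In this change `reorganize()` for `dbm.sqlite3` issues SQLite's `VACUUM` statement. The sketch below uses the standard `sqlite3` module directly (not `dbm.sqlite3`) to show how the file size behaves around a `VACUUM`. The table name `kv`, the file name, and the helper `sizes_around_vacuum` are hypothetical and chosen only for illustration.

```python
import os
import sqlite3
import tempfile
from contextlib import closing


def sizes_around_vacuum(path, rows=200, value_size=10_000):
    """Return (size_full, size_after_delete, size_after_vacuum) in bytes."""
    with closing(sqlite3.connect(path)) as conn:
        # Hypothetical table; this is not the schema used by dbm.sqlite3.
        conn.execute("CREATE TABLE kv (key BLOB PRIMARY KEY, value BLOB)")
        conn.executemany(
            "INSERT INTO kv (key, value) VALUES (?, ?)",
            [(f"key{i}".encode(), b"x" * value_size) for i in range(rows)],
        )
        conn.commit()
        size_full = os.path.getsize(path)

        conn.execute("DELETE FROM kv")
        conn.commit()
        size_after_delete = os.path.getsize(path)

        # VACUUM cannot run inside a transaction, hence the commit() above.
        conn.execute("VACUUM")
        size_after_vacuum = os.path.getsize(path)
    return size_full, size_after_delete, size_after_vacuum


with tempfile.TemporaryDirectory() as tmp:
    full, after_delete, after_vacuum = sizes_around_vacuum(
        os.path.join(tmp, "example.sqlite")
    )
```

On a default SQLite build the file typically keeps its size after the `DELETE` and only shrinks once `VACUUM` has rebuilt it, which matches the behaviour described for `reorganize()`.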


:mod:`dbm.gnu` --- GNU database manager
---------------------------------------
@@ -438,6 +454,9 @@ The :mod:`!dbm.dumb` module defines the following:
with a sufficiently large/complex entry due to stack depth limitations in
Python's AST compiler.

.. warning::
:mod:`dbm.dumb` does not support concurrent writes; writing concurrently can corrupt the database.

.. versionchanged:: 3.5
:func:`~dbm.dumb.open` always creates a new database when *flag* is ``'n'``.

@@ -460,3 +479,9 @@ The :mod:`!dbm.dumb` module defines the following:
.. method:: dumbdbm.close()

Close the database.

.. method:: dumbdbm.reorganize()

If you have carried out a lot of deletions and would like to shrink the space
used on disk, this method will reorganize the database; otherwise, deleted file
space will not be reused.
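
The compaction idea behind this method can be sketched on an in-memory buffer: live records are moved toward the start of the file in ascending position order, each record stays aligned to a block boundary, and the tail is truncated. This is an illustrative sketch, not the module's code; the names `compact`, `BLOCKSIZE`, and the sample records are assumptions.

```python
import io

BLOCKSIZE = 512  # assumed block size for this sketch


def blocks_to_bytes(size):
    """Round *size* up to a whole number of blocks, in bytes."""
    return ((size + BLOCKSIZE - 1) // BLOCKSIZE) * BLOCKSIZE


def compact(buf, index):
    """Slide live records toward the start of *buf* and truncate the tail.

    *index* maps key -> (pos, size) and is updated in place.
    Returns the new size of the buffer.
    """
    write_pos = 0
    # Visit records by ascending position so that a record is never
    # written over data that has not been moved yet.
    for key in sorted(index, key=lambda k: index[k][0]):
        pos, size = index[key]
        buf.seek(pos)
        data = buf.read(size)
        buf.seek(write_pos)
        buf.write(data)
        index[key] = (write_pos, size)
        write_pos += blocks_to_bytes(size)
    buf.truncate(write_pos)
    return write_pos


# Build a small block-aligned "data file" with three records.
buf = io.BytesIO()
index = {}
pos = 0
for key, data in {"a": b"a" * 100, "b": b"b" * 600, "c": b"c" * 10}.items():
    buf.seek(pos)
    buf.write(data)
    index[key] = (pos, len(data))
    pos += blocks_to_bytes(len(data))

del index["a"]  # deleting only drops the index entry; the bytes remain
new_size = compact(buf, index)
```

After the call, record `b` starts at offset 0 and record `c` at the next free block boundary, so the space once held by `a` is gone from the buffer.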
14 changes: 12 additions & 2 deletions Doc/library/shelve.rst
@@ -75,8 +75,13 @@ Two additional methods are supported:

Write back all entries in the cache if the shelf was opened with *writeback*
set to :const:`True`. Also empty the cache and synchronize the persistent
dictionary on disk, if feasible. This is called automatically when the shelf
is closed with :meth:`close`.
dictionary on disk, if feasible. This is called automatically when
:meth:`reorganize` is called or the shelf is closed with :meth:`close`.

.. method:: Shelf.reorganize()

Calls :meth:`sync` and attempts to shrink space used on disk by removing empty
space resulting from deletions.

.. method:: Shelf.close()

@@ -116,6 +121,11 @@ Restrictions
* On macOS :mod:`dbm.ndbm` can silently corrupt the database file on updates,
which can cause hard crashes when trying to read from the database.

* :meth:`Shelf.reorganize` may not be available for all database packages and
may temporarily increase resource usage (especially disk space) when called.
Additionally, it will never run automatically and instead needs to be called
explicitly.
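
A short sketch of calling the method explicitly on a shelf, assuming a Python version that includes this change. The file name `example_shelf` is hypothetical, and the `hasattr` guard falls back to `sync()` where `reorganize()` is not available.

```python
import os
import shelve
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example_shelf")  # hypothetical file name
    with shelve.open(path) as shelf:
        for i in range(100):
            shelf[f"item{i}"] = list(range(100))
        for i in range(50):
            del shelf[f"item{i}"]

        # reorganize() never runs on its own; call it explicitly.
        if hasattr(shelf, "reorganize"):
            shelf.reorganize()
            reorganize_called = True
        else:
            shelf.sync()
            reorganize_called = False

        remaining = len(shelf)
```

Since the call may temporarily need extra disk space, it is probably best reserved for moments after bulk deletions rather than run after every change.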


.. class:: Shelf(dict, protocol=None, writeback=False, keyencoding='utf-8')

33 changes: 28 additions & 5 deletions Lib/dbm/dumb.py
@@ -8,17 +8,12 @@
- seems to contain a bug when updating...
- reclaim free space (currently, space once occupied by deleted or expanded
items is never reused)
- support concurrent access (currently, if two processes take turns making
updates, they can mess up the index)
- support efficient access to large databases (currently, the whole index
is read when the database is opened, and some updates rewrite the whole index)
- support opening for read-only (flag = 'm')
"""

import ast as _ast
@@ -289,6 +284,34 @@ def __enter__(self):
    def __exit__(self, *args):
        self.close()

    def reorganize(self):
        if self._readonly:
            raise error('The database is opened for reading only')
        self._verify_open()
        # Ensure all changes are committed before reorganizing.
        self._commit()
        # Open file in r+ to allow changing in-place.
        with _io.open(self._datfile, 'rb+') as f:
            reorganize_pos = 0

            # Iterate over existing keys, sorted by starting byte.
            for key in sorted(self._index.keys(), key = lambda k: self._index[k][0]):
                # Review comment (Contributor): The .keys() is unnecessary here
                pos, siz = self._index[key]
                f.seek(pos)
                val = f.read(siz)

                f.seek(reorganize_pos)
                f.write(val)
                self._index[key] = (reorganize_pos, siz)

                blocks_occupied = (siz + _BLOCKSIZE - 1) // _BLOCKSIZE
                reorganize_pos += blocks_occupied * _BLOCKSIZE

            f.truncate(reorganize_pos)
        # Commit changes to index, which were not in-place.
        self._commit()


def open(file, flag='c', mode=0o666):
    """Open the database file, filename, and return corresponding object.
4 changes: 4 additions & 0 deletions Lib/dbm/sqlite3.py
@@ -15,6 +15,7 @@
STORE_KV = "REPLACE INTO Dict (key, value) VALUES (CAST(? AS BLOB), CAST(? AS BLOB))"
DELETE_KEY = "DELETE FROM Dict WHERE key = CAST(? AS BLOB)"
ITER_KEYS = "SELECT key FROM Dict"
REORGANIZE = "VACUUM"


class error(OSError):
@@ -122,6 +123,9 @@ def __enter__(self):
    def __exit__(self, *args):
        self.close()

    def reorganize(self):
        self._execute(REORGANIZE)


def open(filename, /, flag="r", mode=0o666):
    """Open a dbm.sqlite3 database and return the dbm object.
5 changes: 5 additions & 0 deletions Lib/shelve.py
@@ -171,6 +171,11 @@ def sync(self):
        if hasattr(self.dict, 'sync'):
            self.dict.sync()

    def reorganize(self):
        self.sync()
        if hasattr(self.dict, 'reorganize'):
            self.dict.reorganize()


class BsdDbShelf(Shelf):
    """Shelf implementation using the "BSD" db interface.
61 changes: 61 additions & 0 deletions Lib/test/test_dbm.py
@@ -135,6 +135,67 @@ def test_anydbm_access(self):
        assert(f[key] == b"Python:")
        f.close()

    def test_anydbm_readonly_reorganize(self):
        self.init_db()
        with dbm.open(_fname, 'r') as d:
            # Early stopping.
            if not hasattr(d, 'reorganize'):
                return

            self.assertRaises(dbm.error, lambda: d.reorganize())

    def test_anydbm_reorganize_not_changed_content(self):
        self.init_db()
        with dbm.open(_fname, 'c') as d:
            # Early stopping.
            if not hasattr(d, 'reorganize'):
                return

            keys_before = sorted(d.keys())
            values_before = [d[k] for k in keys_before]
            d.reorganize()
            keys_after = sorted(d.keys())
            values_after = [d[k] for k in keys_before]
            self.assertEqual(keys_before, keys_after)
            self.assertEqual(values_before, values_after)

    def test_anydbm_reorganize_decreased_size(self):

        def _calculate_db_size(db_path):
            if os.path.isfile(db_path):
                return os.path.getsize(db_path)
            total_size = 0
            for root, _, filenames in os.walk(db_path):
                for filename in filenames:
                    file_path = os.path.join(root, filename)
                    total_size += os.path.getsize(file_path)
            return total_size

        # This test requires relatively large databases to reliably show
        # the difference in size before and after reorganizing.
        with dbm.open(_fname, 'n') as f:
            # Early stopping.
            if not hasattr(f, 'reorganize'):
                return

            for k in self._dict:
                f[k.encode('ascii')] = self._dict[k] * 100000
            db_keys = list(f.keys())

        # Calculate the size of the database only after the file is closed,
        # to ensure the file contents are flushed to disk.
        size_before = _calculate_db_size(os.path.dirname(_fname))

        # Delete some elements from the start of the database.
        keys_to_delete = db_keys[:len(db_keys) // 2]
        with dbm.open(_fname, 'c') as f:
            for k in keys_to_delete:
                del f[k]
            f.reorganize()

        # Calculate the size of the database only after the file is closed,
        # to ensure the file contents are flushed to disk.
        size_after = _calculate_db_size(os.path.dirname(_fname))

        self.assertLess(size_after, size_before)

    def test_open_with_bytes(self):
        dbm.open(os.fsencode(_fname), "c").close()

1 change: 1 addition & 0 deletions Misc/ACKS
@@ -1362,6 +1362,7 @@ Milan Oberkirch
Pascal Oberndoerfer
Géry Ogam
Seonkyo Ok
Andrea Oliveri
Jeffrey Ollie
Adam Olsen
Bryan Olson
@@ -0,0 +1,2 @@
:mod:`shelve` as well as underlying :mod:`!dbm.dumb` and :mod:`!dbm.sqlite3` now have :meth:`!reorganize` methods to
recover unused free space previously occupied by deleted entries.