ObjectID

Alice Zoë Bevan–McGregor edited this page May 8, 2019 · 1 revision

Coordination-Free Unique Identifier Generation

Marrow Mongo contains an ObjectID implementation independent from the bson package bundled with PyMongo, developed in "clean-room" isolation based on publicly available end-use documentation.

Implementations are provided for all known ObjectId generation methods and interpretations, primarily so that older IDs can be used or transitioned on modern systems, and so that the older behavior remains an option going forward if you prefer the guarantees and information it provides. Additionally, our variant permits explicit "hardware identification" through custom, fixed byte strings.

Being Python 3 specific, we are stricter about the type of string being passed. Where PyMongo's bson.ObjectId permits hex-encoded bytes values, our ObjectID does not: a bytes value is only ever interpreted as a raw binary ObjectID, and no transformation is applied. If you have IDs encoded in hexadecimal, pass them as text (str) values.
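As a small illustration of the distinction above, hex-encoded bytes can be decoded to text before being handed to a constructor that treats bytes as raw binary. This uses only the standard library; the sample value is made up for the example.

```python
# A 24-character hexadecimal ID that happens to be stored as bytes.
hex_bytes = b"5cd2f8a0e1b2c3d4e5f60718"

# Decode to text first, so the value is read as hexadecimal rather than raw binary.
hex_text = hex_bytes.decode("ascii")

# The equivalent raw binary form is 12 bytes long.
raw = bytes.fromhex(hex_text)
```

Either hex_text (a str) or raw (12 bytes) expresses the intent unambiguously. The original 24-byte hex_bytes value does not.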

ObjectId was originally[1] defined (prior to MongoDB 3.3) as a combination of:

  • 4-byte UNIX timestamp.
  • 3-byte machine identifier.
  • 2-byte process ID.
  • 3-byte counter with random IV ("initialization vector", or starting point) on process start.
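The original layout above can be sketched with the standard library alone. This is a minimal illustration, not the code Marrow Mongo uses: the function names pack_legacy and unpack_legacy are hypothetical, and big-endian encoding of the process ID is an assumption.

```python
import struct


def pack_legacy(timestamp: int, machine: bytes, pid: int, counter: int) -> bytes:
    """Assemble a 12-byte value in the original 4 + 3 + 2 + 3 layout (hypothetical helper)."""
    if len(machine) != 3:
        raise ValueError("machine identifier must be exactly 3 bytes")
    return (
        struct.pack(">I", timestamp & 0xFFFFFFFF)  # 4-byte UNIX timestamp
        + machine                                  # 3-byte machine identifier
        + struct.pack(">H", pid & 0xFFFF)          # 2-byte process ID (assumed big-endian)
        + (counter & 0xFFFFFF).to_bytes(3, "big")  # 3-byte counter
    )


def unpack_legacy(raw: bytes):
    """Split a 12-byte value back into (timestamp, machine, pid, counter)."""
    if len(raw) != 12:
        raise ValueError("expected exactly 12 bytes")
    (timestamp,) = struct.unpack(">I", raw[:4])
    (pid,) = struct.unpack(">H", raw[7:9])
    return timestamp, raw[4:7], pid, int.from_bytes(raw[9:], "big")
```

In a real generator the counter would start from a random value at process start, for example int.from_bytes(os.urandom(3), "big"), and the process ID would come from os.getpid().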

The server itself never applied a complex interpretation, treating the data after the timestamp as an "arbitrary hardware/node identifier" followed by a counter. The documentation and client drivers were later brought more in line with this intended lack of structure[2], replacing the hardware and process identifiers with literal random data initialized on process startup. As such, the modern structure is composed of:

  • 4-byte UNIX timestamp.
  • 5-byte random process identifier. ("Random value" in the docs.)
  • 3-byte counter with random IV on process start.
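A generator for the modern layout might look like the following sketch. The class name ModernGenerator and its next_id method are hypothetical and are not part of Marrow Mongo; the lock is one reasonable way to keep the counter consistent across threads, not a documented design.

```python
import os
import struct
import threading
import time


class ModernGenerator:
    """Hypothetical generator for the 4 + 5 + 3 byte layout."""

    def __init__(self):
        # Both values are chosen once, when the generator (process) starts.
        self._random = os.urandom(5)
        self._counter = int.from_bytes(os.urandom(3), "big")
        self._lock = threading.Lock()

    def next_id(self, now=None) -> bytes:
        timestamp = int(time.time()) if now is None else now
        with self._lock:
            counter = self._counter
            # Wrap around at 2**24 so the counter always fits in 3 bytes.
            self._counter = (self._counter + 1) & 0xFFFFFF
        return (
            struct.pack(">I", timestamp & 0xFFFFFFFF)
            + self._random
            + counter.to_bytes(3, "big")
        )
```

Two IDs produced by the same generator share the 5-byte random value and differ in the counter, which is what keeps them unique within a single second without any coordination between processes.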

Additionally, the mechanism used to determine the hardware identifier has changed over time. Initially it used a substring of the hex-encoded result of MD5 hashing the value returned by gethostname(). For Federal Information Processing Standard (FIPS)[3] compliance, use of MD5 was eliminated and a custom FNV implementation added. We avoid embedding yet another hashing implementation in our own code and instead use the fnv package, if installed. (It is installed automatically if your application depends upon marrow.mongo[fips].) Without the library installed, the fips choice is not available.
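To make the two host name hashing approaches concrete, here is a rough standard-library sketch. It is illustrative only: it does not use the fnv package's API, and the exact details are assumptions, namely that the machine identifier is the first three bytes of the MD5 digest and that the FNV variant is 32-bit FNV-1a XOR-folded down to 24 bits. The function names are hypothetical.

```python
import hashlib
import socket

FNV_OFFSET_32 = 0x811C9DC5  # standard 32-bit FNV offset basis
FNV_PRIME_32 = 0x01000193   # standard 32-bit FNV prime


def md5_machine_id(hostname: str) -> bytes:
    """First 3 digest bytes (the first 6 hex characters) of the MD5 of the host name."""
    # hashlib.md5 may raise ValueError on FIPS-enforcing builds.
    return hashlib.md5(hostname.encode("utf-8")).digest()[:3]


def fnv1a_24(data: bytes) -> bytes:
    """32-bit FNV-1a, XOR-folded to 24 bits (assumed variant)."""
    h = FNV_OFFSET_32
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME_32) & 0xFFFFFFFF
    folded = (h >> 24) ^ (h & 0xFFFFFF)
    return folded.to_bytes(3, "big")


hostname = socket.gethostname()
legacy_id = md5_machine_id(hostname)
fips_id = fnv1a_24(hostname.encode("utf-8"))
```

Both functions map the same host name to the same three bytes on every run, which is the property a hardware identifier needs. They differ only in which hash is permitted by the platform.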

To determine which approach is used for generation, specify the hwid keyword argument to the ObjectID() constructor. Possibilities include:

  • The string "legacy": use the host name MD5 substring value and process ID. Note that if FIPS compliance is enabled, the "md5" hash is unavailable, so this choice cannot be used.
  • The string "fips": use the FIPS-compliant FNV hash of the host name, in combination with the current process ID. Requires that the fnv package be installed.
  • The string "random": pure random bytes. This is the default, and is aliased as "modern".
  • Any 5-byte bytes value: use the given HWID explicitly.

You are permitted to add additional entries to this mapping within your own application, if desired.
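The selection logic described above could be modelled as a simple mapping from names to callables. Everything in this sketch is hypothetical: HWID_SOURCES, resolve_hwid, and the "rack-7" entry are names invented for illustration, and the real mapping in Marrow Mongo may be named, located, and evaluated differently.

```python
import os


def _random_hwid() -> bytes:
    """Five random bytes, matching the modern layout."""
    return os.urandom(5)


# Hypothetical registry of named identifier sources.
HWID_SOURCES = {
    "random": _random_hwid,
    "modern": _random_hwid,  # alias of "random"
}


def resolve_hwid(hwid="random") -> bytes:
    """Return a 5-byte identifier for a named source or an explicit bytes value."""
    if isinstance(hwid, bytes):
        if len(hwid) != 5:
            raise ValueError("an explicit HWID must be exactly 5 bytes")
        return hwid
    try:
        source = HWID_SOURCES[hwid]
    except KeyError:
        raise ValueError(f"unknown hwid choice: {hwid!r}") from None
    return source()


# An application-defined entry, added in the same way as the built-in ones.
HWID_SOURCES["rack-7"] = lambda: b"RACK7"
```

Keeping the choices in a plain mapping is what makes application-specific entries cheap to add: a new entry is one assignment, and no constructor code needs to change.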

Unlike the PyMongo-supplied ObjectId implementation, this one does not use a custom Exception subclass to represent invalid values. TypeError is raised if the value passed cannot be stringified, and ValueError if the resulting string is not 12 binary bytes or 24 hexadecimal characters. Warning: any 12-byte bytes value will be accepted as-is.
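The validation rules above can be approximated as follows. This is a sketch of the described behavior rather than the library's code, and the function name to_binary is hypothetical.

```python
import string


def to_binary(value) -> bytes:
    """Normalize a candidate ID to its 12-byte binary form (hypothetical helper)."""
    if isinstance(value, bytes):
        # Raw binary is accepted as-is when it is exactly 12 bytes; nothing else is inspected.
        if len(value) != 12:
            raise ValueError("binary IDs must be exactly 12 bytes")
        return value

    # A failing __str__ surfaces here, typically as TypeError.
    text = str(value)
    if len(text) != 24 or any(ch not in string.hexdigits for ch in text):
        raise ValueError("textual IDs must be exactly 24 hexadecimal characters")
    return bytes.fromhex(text)
```

Note that a 24-byte bytes value containing hexadecimal digits is rejected rather than decoded, while any 12-byte bytes value passes, including one that was never produced by an ObjectID generator.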

Additional points of reference:

Footnotes

  1. https://docs.mongodb.com/v3.2/reference/method/ObjectId/
  2. https://docs.mongodb.com/v3.4/reference/method/ObjectId/
  3. https://en.wikipedia.org/wiki/Federal_Information_Processing_Standards