# Unique ID Generation Algorithms

## Structure Overview

![](./_img/distributed-unique-id-algorithms.webp)
## 1. UUID

When it comes to generating unique IDs, UUIDs (Universally Unique Identifiers) are usually the first thing that comes to mind.

A UUID is written as 32 hexadecimal characters. Each hexadecimal character encodes 4 bits, so the whole value is 32 × 4 = 128 bits. Including the 4 hyphens, the text form is 36 characters long:

```
6e965784-98ef-4ebf-b477-8bd14164aaa4
5fd6c336-48c4-4510-bfe5-f7928a83a3e2
0333be18-5ecc-4d7e-98d4-80cc362e4ade
```
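The character and bit counts above can be checked with Python's standard `uuid` module. This is only a small sketch; the variable names are chosen for illustration.

```python
import uuid

# Generate a random (version 4) UUID with the standard library.
u = uuid.uuid4()

text = str(u)        # canonical text form: 8-4-4-4-12 hex digits
hex_digits = u.hex   # the same value without hyphens

print(text)
print(len(text), "characters, including", text.count("-"), "hyphens")
print(len(hex_digits), "hex digits =", len(hex_digits) * 4, "bits")
print(len(u.bytes), "bytes in binary form")
```

Storing the 16-byte binary form instead of the 36-character string is usually the cheaper option when the database supports it.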
There are 5 common versions of UUID:

* **Version 1 (time-based, MAC)**: Uses the MAC address of your computer and the current time.
* **Version 2 (DCE Security)**: Similar to Version 1, but with extra information such as a POSIX UID or GID.
* **Version 3 (name-based, MD5)**: Takes a namespace and a name string, then hashes them with MD5 to create the UUID.
* **Version 4 (random)**: Every bit is chosen randomly, apart from the few bits that record the version and variant.
* **Version 5 (name-based, SHA-1)**: Like Version 3, but it uses SHA-1 instead of MD5.
* Newer versions are also worth considering, such as Version 6 (reordered time) and Version 7 (Unix epoch time), which are among the latest proposals listed at ramsey/uuid.

I won't go into the details of each version here. If you're unsure which to choose, I've found **Version 4 (random)** to be a good starting point. It's straightforward and effective.
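For a quick feel of how the versions differ, Python's standard `uuid` module exposes versions 1, 3, 4, and 5 directly (version 2 is not provided there). The name `example.com` below is a hypothetical input picked only for this illustration.

```python
import uuid

# Hypothetical input, used only for this illustration.
name = "example.com"

v1 = uuid.uuid1()                          # current time + node ID (MAC address, or a random fallback)
v3 = uuid.uuid3(uuid.NAMESPACE_DNS, name)  # MD5 hash of namespace + name
v4 = uuid.uuid4()                          # random
v5 = uuid.uuid5(uuid.NAMESPACE_DNS, name)  # SHA-1 hash of namespace + name

for u in (v1, v3, v4, v5):
    print(f"version {u.version}: {u}")

# Name-based versions are deterministic: the same inputs give the same UUID.
same_again = uuid.uuid5(uuid.NAMESPACE_DNS, name)
print("v5 reproducible:", v5 == same_again)
```

Versions 3 and 5 are useful when the same input must always map to the same ID; version 4 is the usual choice when no such mapping is needed.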
> "Random and unique? How's that even possible?"

The magic lies in its extremely low chance of collision.

According to Wikipedia, you would have to generate 1 billion UUIDs every second for about 86 years before reaching a 50% chance of a single collision.
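That figure can be sanity-checked with the standard birthday approximation, p ≈ 1 − exp(−n² / 2N), where n is the number of IDs generated and N is the number of possible values. The sketch below assumes 122 random bits per UUID (version 4) and a 365.25-day year; both are assumptions of this example rather than details from the quoted source.

```python
import math

RANDOM_BITS = 122                      # random bits in a version 4 UUID (assumed here)
rate_per_second = 1_000_000_000        # 1 billion UUIDs per second
seconds = 86 * 365.25 * 24 * 3600      # 86 years, using 365.25-day years

n = rate_per_second * seconds          # total UUIDs generated
space = 2 ** RANDOM_BITS               # number of possible random values

# Birthday approximation for the probability of at least one collision.
p_collision = 1 - math.exp(-(n * n) / (2 * space))

print(f"{n:.3e} UUIDs generated")
print(f"collision probability ≈ {p_collision:.3f}")
```

The result comes out very close to 0.5, which matches the claim.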
> "Why do some say a UUID has only 122 bits when it's clearly 128 bits?"

When people talk about UUIDs, they usually mean the most common type: variant 1, version 4.

In this type, 6 of the 128 bits are fixed: 4 bits mark it as version 4 ("v4"), and 2 bits are reserved for the variant.

That leaves only 122 bits to be filled in randomly.
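The fixed bits can be read straight out of a generated UUID. The sketch below treats the UUID as a 128-bit integer, with bit 0 as the least significant bit; the shift amounts follow from the standard field layout, and the variable names are illustrative.

```python
import uuid

u = uuid.uuid4()
value = u.int  # the UUID as a 128-bit integer

version_bits = (value >> 76) & 0xF    # 4 bits (bits 76-79): the version
variant_bits = (value >> 62) & 0b11   # 2 bits (bits 62-63): the variant

print(f"version bits: {version_bits:04b} (version {version_bits})")
print(f"variant bits: {variant_bits:02b}")
print("random bits: ", 128 - 4 - 2)
```

For every version 4 UUID the version bits read `0100` and the variant bits read `10`; only the remaining 122 bits change from one ID to the next.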

Pros

* It's simple: there's no initial setup and no centralized system to manage the IDs.
* Every service in your distributed system can generate its own unique IDs without coordinating with the others.

Cons

* At 128 bits (36 characters as text), it's a long ID, not something you'd easily write down or remember.
* It doesn't reveal much information about itself.
* UUIDs generally aren't sortable by creation time. Versions 1 and 2 do embed a timestamp, but its low-order bits come first in the text form, so even they don't sort chronologically as strings.
## 2. NanoID

Building on the idea behind UUID, NanoID trims the ID down to 21 characters. Those characters are drawn from a 64-symbol alphabet that includes the hyphen and the underscore.

Doing the math: with 64 possible symbols, each NanoID character carries 6 bits, compared with 4 bits per UUID character. Multiply by 21 and a NanoID comes in at a neat 126 bits.

```
NUp3FRBx-27u1kf1rmOxn
XytMg-01fzdSaHoKXnPMJ
_4hP-0rh8pNbx6-Qw1pMl
```
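The idea can be sketched in a few lines of Python. This is not the official Nano ID library: `nano_id_sketch` and `ALPHABET` are hypothetical names for this example, and the alphabet's ordering here is an assumption. It only shows 21 symbols being drawn from a 64-symbol, URL-safe alphabet using a cryptographically secure random source.

```python
import math
import secrets
import string

# 64 URL-safe symbols: A-Z, a-z, 0-9, "_" and "-".
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "_-"


def nano_id_sketch(size: int = 21) -> str:
    """Return `size` symbols picked uniformly at random from ALPHABET."""
    return "".join(secrets.choice(ALPHABET) for _ in range(size))


bits_per_char = int(math.log2(len(ALPHABET)))   # 6 bits for a 64-symbol alphabet
total_bits = 21 * bits_per_char                 # 126 bits of randomness

new_id = nano_id_sketch()
print(new_id, f"({len(new_id)} characters, {total_bits} bits)")
```

Because the alphabet size is a power of two, each character maps to exactly 6 random bits with no bias.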
> "Does storing NanoID vs. UUID in a database make much of a difference?"

If you save them as strings, NanoID is a bit more efficient, being 15 characters shorter than UUID. In binary form the difference is a mere 2 bits, which is negligible in most storage scenarios.

Pros

* NanoID uses only the characters (A-Za-z0-9_-), which are URL-friendly.
* At 21 characters, it's more compact than UUID's 36, even though it carries 126 bits to UUID's 128.

Cons

* NanoID is newer and might not be as widely supported as UUID.
## 3. ObjectID (MongoDB, 96 bits)

ObjectID is MongoDB's answer to a unique document ID. This 12-byte identifier typically lives in the "_id" field of a document, and if you don't set it yourself, MongoDB generates it for you.

Here's what makes up an ObjectID:

* Timestamp (4 bytes): The time the object was created, in seconds since the Unix epoch (January 1, 1970).
* Random Value (5 bytes): Each machine or process gets its own random value.
* Counter (3 bytes): A simple incrementing counter for a given machine or process.

> "But how does each process ensure its random value is unique?"

With 5 bytes there are 2⁴⁰ possible values. Given the limited number of machines or processes, collisions are exceedingly rare.

ObjectID (source: blog.devtrovert.com)

When representing ObjectIDs, MongoDB uses hexadecimal, turning those 12 bytes (96 bits) into 24 characters. The examples below are split into the three fields for readability:

```
6502b4ab cf09f864b0 074858
6502b4ab cf09f864b0 074859
6502b4ab cf09f864b0 07485a
```
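To make the three fields concrete, here is a minimal pure-Python sketch that builds an ID with the same 4 + 5 + 3 byte layout and splits the first example above back into its parts. It is not MongoDB's driver code: `object_id_sketch` and `split_object_id` are hypothetical helpers, and starting the counter at a random value is an assumption of this example.

```python
import os
import threading
import time

_PROCESS_RANDOM = os.urandom(5)                  # 5-byte random value, fixed per process
_counter = int.from_bytes(os.urandom(3), "big")  # 3-byte counter, random start (assumed)
_lock = threading.Lock()


def object_id_sketch() -> bytes:
    """Build a 12-byte ID: 4-byte timestamp + 5-byte random value + 3-byte counter."""
    global _counter
    with _lock:
        _counter = (_counter + 1) % (1 << 24)    # wrap around at 3 bytes
        count = _counter
    timestamp = int(time.time())                 # seconds since the Unix epoch
    return timestamp.to_bytes(4, "big") + _PROCESS_RANDOM + count.to_bytes(3, "big")


def split_object_id(hex_id: str) -> tuple[int, str, int]:
    """Split a 24-character hex ID into (timestamp, random value, counter)."""
    raw = bytes.fromhex(hex_id)
    return int.from_bytes(raw[:4], "big"), raw[4:9].hex(), int.from_bytes(raw[9:], "big")


print(object_id_sketch().hex())

ts, rnd, counter = split_object_id("6502b4abcf09f864b0074858")
print(ts, rnd, counter)
```

IDs created by the same process within the same second differ only in the counter, which is exactly the pattern visible in the three sample IDs.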

Pros

* Provides a rough global ordering by creation time without a centralized authority to oversee uniqueness.
* In terms of byte size, it's more compact than both UUID and NanoID.
* Sorting by ID is straightforward, and you can easily see when each object was created.
* Shows which process or machine created an item.
* Scales gracefully: because every ID starts with a timestamp, IDs generated in the future cannot conflict with existing ones.

Cons

* Despite its relative compactness, 96 bits can still be considered long.
* Be careful when sharing ObjectIDs with clients: the embedded creation time and process value might reveal too much.
## 4. Snowflake (64 bits)

Commonly known as "Snowflake ID", this scheme was developed by Twitter to generate IDs efficiently at its massive scale.

A Snowflake ID boils down to a 64-bit integer, which is more compact than MongoDB's ObjectID. Its bits are laid out as follows:

* Sign Bit (1 bit): This bit is typically unused, though it can be reserved for specific functions.
* Timestamp (41 bits): Much like ObjectID, it records the creation time, but in milliseconds, covering ~70 years from its starting point.
* Datacenter ID (5 bits): Identifies the physical datacenter. With 5 bits, we can have up to 2⁵ = 32 datacenters.
* Machine/Process ID (5 bits): Tied to the individual machine, service, or process creating the data, again up to 2⁵ = 32 per datacenter.
* Sequence (12 bits): An incrementing counter that resets to 0 every millisecond, allowing up to 2¹² = 4096 IDs per millisecond per process.
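The bit layout above can be sketched as a small generator. This is an illustrative Python version, not Twitter's implementation: the class and function names are hypothetical, the epoch constant is the custom epoch of Nov 04 2010 01:42:54 UTC expressed in milliseconds, and raising an error when the clock moves backwards is just one possible policy.

```python
import threading
import time

EPOCH_MS = 1288834974657  # custom epoch: Nov 04 2010 01:42:54 UTC, in milliseconds

DATACENTER_BITS = 5
MACHINE_BITS = 5
SEQUENCE_BITS = 12

MACHINE_SHIFT = SEQUENCE_BITS                          # 12
DATACENTER_SHIFT = MACHINE_SHIFT + MACHINE_BITS        # 17
TIMESTAMP_SHIFT = DATACENTER_SHIFT + DATACENTER_BITS   # 22
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1                # 4095


def _now_ms() -> int:
    return time.time_ns() // 1_000_000


class SnowflakeSketch:
    def __init__(self, datacenter_id: int, machine_id: int) -> None:
        if not 0 <= datacenter_id < (1 << DATACENTER_BITS):
            raise ValueError("datacenter_id must fit in 5 bits")
        if not 0 <= machine_id < (1 << MACHINE_BITS):
            raise ValueError("machine_id must fit in 5 bits")
        self.datacenter_id = datacenter_id
        self.machine_id = machine_id
        self._last_ms = -1
        self._sequence = 0
        self._lock = threading.Lock()

    def next_id(self) -> int:
        with self._lock:
            now = _now_ms()
            if now < self._last_ms:
                # A rolled-back clock could repeat old timestamps and duplicate IDs.
                raise RuntimeError("clock moved backwards; refusing to generate IDs")
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & MAX_SEQUENCE
                if self._sequence == 0:
                    # All 4096 values used in this millisecond: wait for the next one.
                    while now <= self._last_ms:
                        now = _now_ms()
            else:
                self._sequence = 0  # new millisecond, counter resets
            self._last_ms = now
            return (
                ((now - EPOCH_MS) << TIMESTAMP_SHIFT)
                | (self.datacenter_id << DATACENTER_SHIFT)
                | (self.machine_id << MACHINE_SHIFT)
                | self._sequence
            )


def decode(snowflake: int) -> tuple[int, int, int, int]:
    """Return (timestamp in ms since the Unix epoch, datacenter, machine, sequence)."""
    sequence = snowflake & MAX_SEQUENCE
    machine_id = (snowflake >> MACHINE_SHIFT) & ((1 << MACHINE_BITS) - 1)
    datacenter_id = (snowflake >> DATACENTER_SHIFT) & ((1 << DATACENTER_BITS) - 1)
    timestamp_ms = (snowflake >> TIMESTAMP_SHIFT) + EPOCH_MS
    return timestamp_ms, datacenter_id, machine_id, sequence


generator = SnowflakeSketch(datacenter_id=3, machine_id=7)
first, second = generator.next_id(), generator.next_id()
print(first, decode(first))
print(second, decode(second))
```

Because the timestamp occupies the highest bits, IDs from one generator sort by creation time when compared as plain integers.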
> "Hold on. 70 years? So from 1970, it's done by 2040?"

Exactly, if you count from 1970. That's why many Snowflake implementations use a custom epoch that begins more recently, for instance Nov 04 2010 01:42:54 UTC, which pushes the limit out to around 2080.

As for its advantages, they're pretty evident given the design.
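The lifespan arithmetic is easy to reproduce. The sketch below computes how long a 41-bit millisecond counter lasts and where it runs out for the two epochs discussed; the variable names are illustrative, and a year is taken as 365.25 days.

```python
from datetime import datetime, timedelta, timezone

TIMESTAMP_BITS = 41

# Total time a 41-bit millisecond counter can represent.
span = timedelta(milliseconds=2 ** TIMESTAMP_BITS)
years = span.days / 365.25

unix_epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
# Custom epoch mentioned above, as milliseconds since the Unix epoch.
custom_epoch = unix_epoch + timedelta(milliseconds=1288834974657)

print(f"41 bits of milliseconds ≈ {years:.1f} years")
print("Unix epoch   runs out on", (unix_epoch + span).date())
print("Custom epoch", custom_epoch.isoformat(), "runs out on", (custom_epoch + span).date())
```

Choosing an epoch close to the system's launch date gives nearly the full span of usable IDs.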
Cons

* Might be over-engineered for medium-sized businesses, since it brings in complexity such as multiple datacenters, millisecond-level timestamps, and sequence resets.
* Limited lifespan: the 41-bit timestamp lasts ~70 years from the chosen epoch.

It packs features some may find excessive, but for giants like Twitter, it's right on the money.
## Performance Benchmarks

https://github.com/knifecake/python-id-benchmarks

Results (times in ns):

```
----------------- benchmark 'test_generate': 7 tests -----------------
Name (time in ns)                  Mean                   StdDev
----------------------------------------------------------------------
generate[snowflake]            492.1435 (1.0)            46.7127 (1.0)
generate[cyksuid]              695.9376 (1.41)          119.0593 (2.55)
generate[python-ulid]        1,719.8319 (3.49)          240.3840 (5.15)
generate[uuid4]              1,961.3799 (3.99)          119.5277 (2.56)
generate[timeflake]          2,637.3506 (5.36)          451.9482 (9.68)
generate[svix]               3,746.8204 (7.61)          685.0287 (14.66)
generate[cuid2]            317,832.6200 (645.81)      4,876.3859 (104.39)
----------------------------------------------------------------------
```

Conclusions:

1. snowflake is the fastest of these algorithms, but its weakness is that it depends on the clock: if the clock is set back, it can generate duplicate IDs.
2. uuid4 from the Python standard library takes about 400% of snowflake's time, i.e. it is about 300% slower. That is surprising for a standard-library function, given that the snowflake implementation is pure Python. Note that in CPython `uuid4` is itself Python code that fetches its random bytes from `os.urandom`, which likely accounts for much of the cost.
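To get a rough feel for one of these numbers on your own machine, a single generator can be timed with the standard `timeit` module. This is only a quick sketch and not the benchmark harness used by the repository above; the run count is an arbitrary choice, and absolute numbers will vary by hardware and Python version.

```python
import timeit
import uuid

runs = 100_000  # arbitrary number of calls for this sketch

# Total wall-clock time for `runs` calls to uuid.uuid4().
total_seconds = timeit.timeit(uuid.uuid4, number=runs)
mean_ns = total_seconds / runs * 1e9

print(f"uuid4: about {mean_ns:,.0f} ns per call on this machine")
```

For comparisons across several generators, a dedicated tool that reports the standard deviation alongside the mean, as in the table above, gives a more reliable picture.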