feat: uid generator

ryan4yin · Dec 7, 2023 · 2ff5879 · 2ff5879
1 parent b186e7d
commit 2ff5879
Show file tree

Hide file tree

Showing 2 changed files with 173 additions and 0 deletions.
diff --git a/algorithm/_img/distributed-unique-id-algorithms.webp b/algorithm/_img/distributed-unique-id-algorithms.webp
diff --git a/algorithm/唯一 ID 生成算法.md b/algorithm/唯一 ID 生成算法.md
@@ -0,0 +1,173 @@
+# 唯一 ID 生成算法
+
+
+## 结构说明
+
+![](./_img/distributed-unique-id-algorithms.webp)
+
+## 1. UUID
+
+When talking about generating unique IDs, UUIDs or Universal Unique Identifiers come to mind. 
+
+A UUID is made of 32 hexadecimal characters. Remember, each character is 4 bits. So, all in all, it’s 128 bits. And when you include the 4 hyphens, you’ll see 36 characters:
+
+```
+6e965784–98ef-4ebf-b477–8bd14164aaa4
+
+5fd6c336-48c4-4510-bfe5-f7928a83a3e2
+
+0333be18-5ecc-4d7e-98d4-80cc362e4ade
+```
+
+There are 5 common types of UUID:
+
+* **Version 1 — Time-based MAC**: This UUID uses the MAC Address of your computer and the current time.
+* **Version 2 — DCE Security**: Similar to Version 1 but with extra info like POSIX UID or GID.
+* **Version 3 — Name-based MD5**: This one takes a namespace and a string, then uses MD5 to create the UUID.
+* **Version 4 — Randomness**: Every character is chosen randomly.
+* **Version 5 — Name-based SHA1**: Think of Version 3, but instead of MD5, it uses SHA-1.
+* …You may want to consider other drafts like Version 6 - Reordered Time and Version 7 - Unix Epoch Time, etc, among the latest proposals at ramsey/uuid.
+
+I won’t go into the details of each version right now. But if you’re unsure about which to choose, I’ve found **Version 4 — Randomness** to be a good starting point. It’s straightforward and effective.
+
+> “Random and unique? How’s that even possible?”
+
+The magic lies in its super low chance of collision. 
+
+Pulling from Wikipedia, imagine generating 1 billion UUIDs every second for 86 whole years and only then would you have a 50% chance of getting a single match.
+
+    “Why do some say UUID has only 122 bits when it’s clearly 128 bits?”
+
+When people talk about UUIDs, they often refer to the most common type, which is variant 1, version 4. 
+
+In this type, 6 out of the 128 bits are already set for specific purposes: 4 bits tell us it’s version 4 (or “v4”), and 2 bits are reserved for variant information. 
+
+So, only 122 bits are left to be filled in randomly.
+Pros
+
+* It’s simple, there’s no need for initial setups or a centralized system to manage the ID.
+* Every service in your distributed system can roll out its own unique ID, no chit-chat needed.
+
+Cons
+
+* With 128 bits, it’s a long ID and it’s not something you’d easily write down or remember.
+* It doesn’t reveal much information about itself.
+* UUIDs aren’t sortable (except for versions 1 and 2).
+
+## 2. NanoID
+
+Drawing from the concept of UUID, NanoID streamlines things a bit with just 21 characters but these characters are sourced from an alphabet of 64 characters, inclusive of hyphens and underscores.
+
+Doing the math, each NanoID character takes up 6 bits, as opposed to the 4 bits of UUID and a quick multiplication, and we see NanoID coming in at a neat 126 bits.
+
+```
+NUp3FRBx-27u1kf1rmOxn
+XytMg-01fzdSaHoKXnPMJ
+_4hP-0rh8pNbx6-Qw1pMl
+
+```
+
+> “Does storing NanoID vs. UUID in a database make much of a difference?”
+
+Well, if you’re saving them as strings, NanoID might be a bit more efficient, being 15 characters shorter than UUID, but in their binary forms, the difference is a mere 2 bits, often a minor detail in most storage scenarios.
+
+Pros
+
+* NanoID uses characters (A-Za-z0–9_-) which is friendly with URLs.
+* At just 21 characters, it’s more compact than UUID, shaving off 15 characters to be precise (though it’s 126 bits versus UUID’s 128)
+
+Cons
+
+* NanoID is newer and might not be as widely supported as UUID.
+
+## 3. ObjectID(MongoDB 96Bits)
+
+ObjectID is MongoDB’s answer to a unique document ID, this 12-byte identifier typically resides in the “_id” field of a document, and if you’re not setting it yourself, MongoDB steps in to do it for you.
+
+Here’s what makes up an ObjectID:
+
+* Timestamp (4 bytes): This represents the time the object was created, measured from the Unix epoch (a timestamp from 1970, for those who might need a refresher).
+* Random Value (5 bytes): Each machine or process gets its own random value.
+* Counter (3 bytes): A simple incrementing counter for a given machine
+
+> “But how does each process ensure its random value is unique?”
+
+With 5 bytes, we’re talking about 2⁴⁰ potential values, given the limited number of machines or processes, collisions are exceedingly rare
+ObjectID (source: blog.devtrovert.com)
+
+When representing ObjectIDs, MongoDB goes with hexadecimal, turning those 12 bytes (or 96 bits) into 24 characters
+
+```
+6502b4ab cf09f864b0 074858
+6502b4ab cf09f864b0 074859
+6502b4ab cf09f864b0 07485a
+```
+
+Pros
+
+* Ensures a global order without needing a centralized authority to oversee uniqueness
+* In terms of byte size, it’s more compact than both UUID and NanoID.
+* Using IDs for sorting is straightforward, and you can easily see when each object was made.
+* Reveals the specific process or machine that created an item.
+* Scales gracefully, thanks to its time-based structure ensuring no future conflicts.
+
+Cons
+
+* Despite its relative compactness, 96 bits can still be considered long.
+* Be careful when sharing ObjectIDs with clients, they might reveal too much.
+
+
+## 4. Snowflake(64 Bits)
+
+Twitter Snowflake (64 bits)
+
+Commonly known as “Snowflake ID”, this system was developed by Twitter to efficiently generate IDs for their massive user base.
+
+Also, a Snowflake ID boils down to a 64-bit integer, which is more compact than MongoDB’s ObjectID
+
+* Sign Bit (1 bit): This bit is typically unused, though it can be reserved for specific functions.
+* Timestamp (41 bits): Much like ObjectID, it represents data creation time in milliseconds, spanning ~70 years from its starting point.
+* Datacenter ID (5 bits): Identifies the physical datacenter location. With 5 bits, we can have up to 2⁵ = 32 datacenter.
+* Machine/ Process ID (5 bits): Tied to individual machines, services, or processes creating data.
+* Sequence (12 bits): An incrementing counter that resets to 0 every millisecond.
+
+
+> “Hold on. 70 years? So from 1970, it’s done by 2040?”
+
+Exactly.
+
+Many Snowflake implementations use a custom epoch that begins more recently, like Nov 04 2010 01:42:54 UTC, for instance. As for its advantages, they’re pretty evident given the design.
+
+Cons
+
+* Might be over-engineered for medium-sized businesses, especially with complex setups like multiple data centers, millisecond-level timestamps, sequence resets...
+* Limited lifespan, it has a lifespan of ~70 years.
+
+It packs features some may find excessive, but for giants like Twitter, it’s right on the money.
+
+
+## 性能测试
+
+https://github.com/knifecake/python-id-benchmarks
+
+测试结果（时间单位为 ns）：
+
+```
+----------------- benchmark 'test_generate': 7 tests -----------------
+Name (time in ns)                 Mean                StdDev          
+----------------------------------------------------------------------
+generate[snowflake]           492.1435 (1.0)         46.7127 (1.0)    
+generate[cyksuid]             695.9376 (1.41)       119.0593 (2.55)   
+generate[python-ulid]       1,719.8319 (3.49)       240.3840 (5.15)   
+generate[uuid4]             1,961.3799 (3.99)       119.5277 (2.56)   
+generate[timeflake]         2,637.3506 (5.36)       451.9482 (9.68)   
+generate[svix]              3,746.8204 (7.61)       685.0287 (14.66)  
+generate[cuid2]           317,832.6200 (645.81)   4,876.3859 (104.39) 
+----------------------------------------------------------------------
+```
+
+结论：
+
+1. snowflake 是其中最快的算法，但是它的缺点是依赖于时钟，如果时钟回拨，会导致 ID 重复。
+2. python 标准库中的 uuid4 用时是 snowflake 的 400%，也即它比 snowflake 慢了 300%，它甚至还是 C 语言实现的！而 snowflake 是纯 python.
+