Skip to content

Latest commit

 

History

History
40 lines (24 loc) · 3.5 KB

distributed_id.md

File metadata and controls

40 lines (24 loc) · 3.5 KB

Generating a unique ID for a transaction of a database entry has never been a problem in normal computing world even in primary/secondary database setup. But in distributed computing it is a challenging task to maintain the uniqueness since the processing or storage is distributed among various nodes in a cluster.

One easy solution would be to maintain a single node (centralized system) to generate and maintain the unique ids. Here this goes against the founding principles of distributed computing. If the centralized node fails, it will bring the entire system down. Thus the fault tolerance is very low.

Some of the possible solutions are:

UUID

NoSQL databases like MongoDB uses UUID. UUIDs are globally unique 128 bit hexadecimal numbers. The probability of a repeated UUID getting generated is very low.

The annual risk of a given person being hit by a meteorite is estimated to be one chance in 17 billion, which means the probability is about 0.00000000006 (6 × 10−11), equivalent to the odds of creating a few tens of trillions of UUIDs in a year and having one duplicate. In other words, only after generating 1 billion UUIDs every second for the next 100 years, the probability of creating just one duplicate would be about 50%.

However, these probabilities only hold when the UUIDs are generated using sufficient entropy. Otherwise, the probability of duplicates could be significantly higher, since the statistical dispersion might be lower. Where unique identifiers are required for distributed applications, so that UUIDs do not clash even when data from many devices is merged, the randomness of the seeds and generators used on every device must be reliable for the life of the application. Where this is not feasible, RFC4122 recommends using a namespace variant instead.

There are 4 versions of UUIDs available.

  • Version 1 (date-time and MAC address)
  • Version 2 (date-time and MAC address, DCE security version)
  • Versions 3 and 5 (namespace name-based)
  • Version 4 (random)

A sample UUID would like this 123e4567-e89b-12d3-a456-426655440000

A major advantage with this is the nodes do not need to coordinate with each other to produce unique UUIDs.

Since UUIDs are large compared to conventional IDs, they pose problems in indexing and query performances.

Twitter Snowflake

Twitter snowflake is a dedicated network service for generating 64-bit unique IDs at high scale with some simple guarantees. Snowflake powers the Twitter IDs. Snowflake is now archived and not in active development.

  • Epoch timestamp in millisecond precision— 41 bits (gives us 69 years with a custom epoch)
  • Machine id— 10 bits (gives us up to 1024 machines)
  • Sequence number— 12 bits (A local counter per machine that rolls over every 4096)
  • The extra 1 bit is reserved for future purposes.

So the id which is generated by this is 64bit which solves the problems of size and latency issues but also introduces one problem for maintaining extra servers.

Ref: https://medium.com/@sauravomar01/generating-unique-id-in-distributed-environment-in-high-scale-88f83240db57

Snowflake Algorithm in Java: https://developpaper.com/distributed-unique-id-generation-series-5-is-snowflake-twitters-snow-algorithm-suitable-for-distributed-id/