How to generate unique ids in a distributed system.
Why are unique ids important in distributed systems?
Let's try to understand this with a real example.
In twitter, millions of tweets are generated everyday and to handle these tweets, there are thousands of servers allocated in a distributed fashion.
Let's say, we need to assign an unique id to each tweet.
If Twitter had only one server, this would be an easy task. We could easily identify the tweet with the primary key generated by the database.
But, what to do when there are thousands of database servers working in tandem?
Now, the system should make sure that the ids do not collide and they should be unique globally.
Twitter uses Snow Flake algorithm to generate unique ids with minimal chances of collision.
Ways of generating unique ids in distributed system.
- UUIDs:
UUIDs are 128 bit hexadecimal systems. They look something like this: c3273cb9-a9c2-4cdc-ad38-5ada7cc78de9
The uuid above is an example of version 4 uuid which is generated using random numbers.
The more random the numbers used, the less chances of two uuids colliding. On closer observation, you can see that the number of characters separated by hyphen follows a pattern of: 8-4-4-4-12.
The pros for this approach is that each server can generate the uuids uniquely and they do not need to coordination with each other. Here, we are basically assuming that the uuids will be unique across the servers and the chance of collision is practically zero( although not impossible).
The cons to this approach is that using 128 bit hexadecimals as primary key makes indexing slower and that means the database write will be slower.
- Range Handler:
Range Handler is a full proof solution. It is guaranteed that the two identifiers will never collide with each other.
Each server will be provided a range, and they will generate ids only within that range.
For example, if a server is given a range of 100 to 1000, it will progressively generate ids such as 101, 102, 103 and so on. Once it generates 1000, it has run out of ids. In that case, it again requests for a new range.
The pros of this approach is that there will never be a duplication of ids.
The cons is, if a server is down, it will request for a new range when it is up. So, the old range it was working on in lost and we lost some ids.
To minimize the loss, we need to ensure that the range is not very large.