Extract and manage hashtags from all the uploaded photos, assuming 5 million photos are uploaded every hour.
More Details: Designing Hashtag Service
- Naive (count++) for every event
- Naive batching (batch on server and then write to database)
- Efficient batching that minimizes stop-the-world using a deep copy
- Efficient batching that minimizes stop-the-world using two maps
- Kafka adapter pattern to re-ingest the post hashtags partitioned by hashtag
- Create a `hashtags` table in the database with the following schema:
+------------+-------------+------+-----+
| Field | Type | Null | Key |
+------------+-------------+------+-----+
| hashtag_id | varchar(50) | NO | PRI |
| count | int | YES | |
+------------+-------------+------+-----+
- Run the `insertDBData()` function from `insert_db_data` to create a `setup_db.sql` file (a sketch of such a generator follows this list).
- Run `mysql -u root -p < setup_db.sql` to insert the data into the table.
- Run `main.go`.
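The repo's `insertDBData()` presumably emits the CREATE TABLE and seed INSERT statements for the schema above; here is a minimal sketch of what such a generator could look like. The database name `hashtag_service` and the way hashtags are passed in are assumptions, not the repo's code.

```go
package hashtagservice

import (
	"fmt"
	"os"
)

// insertDBData writes a setup_db.sql file that creates the hashtags table
// from the schema above and seeds it with zero counts. The database name
// and the hashtag list are illustrative assumptions.
func insertDBData(hashtags []string) error {
	f, err := os.Create("setup_db.sql")
	if err != nil {
		return err
	}
	defer f.Close()

	fmt.Fprintln(f, "CREATE DATABASE IF NOT EXISTS hashtag_service;")
	fmt.Fprintln(f, "USE hashtag_service;")
	fmt.Fprintln(f, "CREATE TABLE IF NOT EXISTS hashtags (")
	fmt.Fprintln(f, "  hashtag_id VARCHAR(50) NOT NULL PRIMARY KEY,")
	fmt.Fprintln(f, "  count INT")
	fmt.Fprintln(f, ");")

	// Seed every hashtag with a count of zero.
	for _, h := range hashtags {
		fmt.Fprintf(f, "INSERT INTO hashtags (hashtag_id, count) VALUES ('%s', 0);\n", h)
	}
	return nil
}
```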
For 6000 unique hashtags and 10000 photo uploads, we have:
Naive Counting, Time Taken: 10.6695487s
Naive Batching, Time Taken: 9.0829599s
Efficient Batching (Deep-Copy), Time Taken: 547.9µs
Efficient Batching (Two-Maps), Time Taken: 521.4µs
Kafka Adapter, Time Taken: 4.4738473s
- Naive Counting: the count for each hashtag is updated directly in the database every time a new post is processed, so each post triggers an immediate database write. (Sketches of each approach follow this list.)
- Stop-the-World: this leads to a high frequency of database writes and significant latency, since each write is blocking and the system stops processing new posts until the write completes.
- Naive Batching: posts are batched in memory, and once a batch is complete the entire batch is written to the database at once. This reduces the number of database writes, but the system still waits for the batch write to complete before processing new posts.
- Stop-the-World: the system pauses while the batch is written to the database; new posts cannot be processed until the update is finished.
- Efficient Batching (Deep-Copy): batches are accumulated in memory, and a separate goroutine handles the database write in parallel. Before the write, a deep copy of the batch is made, so the main batch can keep accepting new posts.
- Stop-the-World: minimized, because the batch write happens in parallel and the system keeps processing new posts without waiting for the database update. Some delay remains due to concurrent access to the shared batch.
- Efficient Batching (Two-Maps): two maps handle the batch updates. One map accumulates counts while the other is written to the database in parallel; the system switches between them to keep processing new posts without interruption.
- Stop-the-World: minimized, because while one map is being written to the database the other continues accepting new posts, reducing latency and improving concurrency.
- Kafka Adapter: instead of writing to the database during post processing, a Kafka producer publishes hashtag updates as events to a Kafka topic. A separate Kafka consumer processes these events and updates the database asynchronously.
- Stop-the-World: eliminated, because posts are processed immediately and pushed to Kafka, with database updates happening asynchronously in the background. Post processing is decoupled from the database writes, so post handling is never delayed.
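The sketches below are minimal Go illustrations of each approach, not the repo's actual code. First, naive counting: assuming an illustrative `Post` type carrying its hashtags and an open `*sql.DB` handle to the `hashtags` table above, every hashtag of every post costs one blocking round-trip to MySQL.

```go
package hashtagservice

import "database/sql"

// Post is an illustrative stand-in for an uploaded photo's metadata.
type Post struct {
	Hashtags []string
}

// processPostNaive performs a count++ in the database for every hashtag
// of every post. Each call blocks on a round-trip to MySQL, which is why
// this approach is the slowest in the timings above.
func processPostNaive(db *sql.DB, p Post) error {
	for _, tag := range p.Hashtags {
		_, err := db.Exec(
			`INSERT INTO hashtags (hashtag_id, count) VALUES (?, 1)
			 ON DUPLICATE KEY UPDATE count = count + 1`, tag)
		if err != nil {
			return err
		}
	}
	return nil
}
```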
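Naive batching, under the same assumptions and reusing the illustrative `Post` type above: counts accumulate in an in-memory map, and when the batch reaches a threshold the caller blocks while the whole map is flushed to the database. The struct and threshold are assumptions.

```go
package hashtagservice

import "database/sql"

// NaiveBatcher accumulates hashtag counts in memory and flushes them to
// the database once the batch reaches batchSize. Processing of new posts
// stops while flush runs, since both happen on the same goroutine.
type NaiveBatcher struct {
	db        *sql.DB
	counts    map[string]int
	batchSize int
}

func NewNaiveBatcher(db *sql.DB, batchSize int) *NaiveBatcher {
	return &NaiveBatcher{db: db, counts: make(map[string]int), batchSize: batchSize}
}

func (b *NaiveBatcher) Process(p Post) error {
	for _, tag := range p.Hashtags {
		b.counts[tag]++
	}
	if len(b.counts) >= b.batchSize {
		return b.flush()
	}
	return nil
}

// flush writes every accumulated count and blocks until the writes finish.
func (b *NaiveBatcher) flush() error {
	for tag, n := range b.counts {
		if _, err := b.db.Exec(
			`INSERT INTO hashtags (hashtag_id, count) VALUES (?, ?)
			 ON DUPLICATE KEY UPDATE count = count + ?`, tag, n, n); err != nil {
			return err
		}
	}
	b.counts = make(map[string]int)
	return nil
}
```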
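The deep-copy variant: the live map is guarded by a mutex, and at flush time it is deep-copied under the lock and handed to a background goroutine, so new posts only wait for the copy itself. The struct and the `writeCounts` helper are assumptions.

```go
package hashtagservice

import (
	"database/sql"
	"log"
	"sync"
)

// DeepCopyBatcher keeps a live map of counts. On flush, the map is
// deep-copied under the lock and the copy is written to the database on
// a separate goroutine, so new posts only wait for the copy itself.
type DeepCopyBatcher struct {
	mu        sync.Mutex
	db        *sql.DB
	counts    map[string]int
	batchSize int
}

func NewDeepCopyBatcher(db *sql.DB, batchSize int) *DeepCopyBatcher {
	return &DeepCopyBatcher{db: db, counts: make(map[string]int), batchSize: batchSize}
}

func (b *DeepCopyBatcher) Process(p Post) {
	b.mu.Lock()
	for _, tag := range p.Hashtags {
		b.counts[tag]++
	}
	var snapshot map[string]int
	if len(b.counts) >= b.batchSize {
		// Deep-copy the batch under the lock so the live map can keep
		// accepting new posts while the copy is flushed.
		snapshot = make(map[string]int, len(b.counts))
		for tag, n := range b.counts {
			snapshot[tag] = n
		}
		b.counts = make(map[string]int)
	}
	b.mu.Unlock()

	if snapshot != nil {
		// Real code would also track in-flight flushes (e.g. with a WaitGroup).
		go writeCounts(b.db, snapshot)
	}
}

// writeCounts flushes a snapshot of counts to the hashtags table.
func writeCounts(db *sql.DB, counts map[string]int) {
	for tag, n := range counts {
		if _, err := db.Exec(
			`INSERT INTO hashtags (hashtag_id, count) VALUES (?, ?)
			 ON DUPLICATE KEY UPDATE count = count + ?`, tag, n, n); err != nil {
			log.Printf("flush %s: %v", tag, err)
		}
	}
}
```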
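The two-maps variant: instead of copying, the full map is swapped out under the lock and a fresh map takes its place, so the pause is only a pointer swap; the swapped-out map is flushed in the background by the `writeCounts` helper above. This sketch realizes the two-map idea by allocating a fresh map at each flush rather than alternating between two fixed maps; the names are illustrative.

```go
package hashtagservice

import (
	"database/sql"
	"sync"
)

// TwoMapBatcher alternates maps: one accepts new counts while the other
// is being flushed to the database in the background.
type TwoMapBatcher struct {
	mu        sync.Mutex
	db        *sql.DB
	active    map[string]int
	batchSize int
}

func NewTwoMapBatcher(db *sql.DB, batchSize int) *TwoMapBatcher {
	return &TwoMapBatcher{db: db, active: make(map[string]int), batchSize: batchSize}
}

func (b *TwoMapBatcher) Process(p Post) {
	b.mu.Lock()
	for _, tag := range p.Hashtags {
		b.active[tag]++
	}
	var toFlush map[string]int
	if len(b.active) >= b.batchSize {
		// Swap in a fresh map; the full one is flushed in the background,
		// so there is no copy and the pause is just the swap.
		toFlush = b.active
		b.active = make(map[string]int)
	}
	b.mu.Unlock()

	if toFlush != nil {
		go writeCounts(b.db, toFlush)
	}
}
```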
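The Kafka adapter, sketched here with the segmentio/kafka-go client (the broker address, topic name, and consumer group are placeholders): post processing only publishes events keyed by hashtag, so Kafka partitions them by hashtag, and a separate consumer applies the count updates to the database asynchronously.

```go
package hashtagservice

import (
	"context"
	"database/sql"
	"log"

	"github.com/segmentio/kafka-go"
)

// PublishHashtags pushes one event per hashtag to Kafka instead of
// touching the database. Keying by hashtag means all events for the same
// tag land on the same partition, so a consumer can aggregate them.
func PublishHashtags(ctx context.Context, w *kafka.Writer, p Post) error {
	msgs := make([]kafka.Message, 0, len(p.Hashtags))
	for _, tag := range p.Hashtags {
		msgs = append(msgs, kafka.Message{Key: []byte(tag), Value: []byte("1")})
	}
	return w.WriteMessages(ctx, msgs...)
}

// ConsumeHashtags reads hashtag events and applies them to the database
// asynchronously; post processing never waits on these writes.
func ConsumeHashtags(ctx context.Context, db *sql.DB, brokers []string, topic string) {
	r := kafka.NewReader(kafka.ReaderConfig{
		Brokers: brokers,
		Topic:   topic,
		GroupID: "hashtag-db-writer", // assumed consumer group name
	})
	defer r.Close()

	for {
		m, err := r.ReadMessage(ctx)
		if err != nil {
			return // context cancelled or fatal reader error
		}
		if _, err := db.Exec(
			`INSERT INTO hashtags (hashtag_id, count) VALUES (?, 1)
			 ON DUPLICATE KEY UPDATE count = count + 1`, string(m.Key)); err != nil {
			log.Printf("update %s: %v", m.Key, err)
		}
	}
}
```

A writer keyed this way could be constructed with something like `&kafka.Writer{Addr: kafka.TCP("localhost:9092"), Topic: "hashtag-counts", Balancer: &kafka.Hash{}}`, where the broker address and topic name are placeholders; the hash balancer is what makes the partitioning follow the hashtag key.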