SYSTEM DESIGN - EX1
1. Pastebin
1.1. Requirement & scope
// use case
User: enter text & content -> gets short url
User: enters short url -> view text & content
User: anonymous
Service: page analytics
Service: deletes expired text
Service: high availability
// assumption
Not even traffic
No realtime analytics
10M paste writes per month
100M paste reads per month
10:1 read to write ratio
1.2. Calculation
Size per paste ~ 1.27KB
1KB content
7B short_link
4B expiration_in_hr
5B created_at
255B path
new paste per month = 1.27KB * 10M = 12.7GB
new paste per 3yrs = 1.27KB * 10M * 36 = 450GB
new short_links per 3yrs = 10M * 36 = 360M
new paste per second = 10M / 2.5M (seconds per month) = 4
read paste per second = 100M / 2.5M (seconds per month) = 40
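A quick sanity check of the numbers above (a sketch, assuming 30-day months, i.e. ~2.5M seconds per month):

```python
# Back-of-envelope check for the Pastebin estimates above.
SECONDS_PER_MONTH = 30 * 24 * 3600             # ~2.6M, rounded to 2.5M in these notes
PASTE_SIZE_KB = 1 + (7 + 4 + 5 + 255) / 1024   # content + metadata ~ 1.27KB

writes_per_month = 10_000_000
reads_per_month = 100_000_000

print(round(PASTE_SIZE_KB, 2))                              # ~1.26 KB per paste
print(PASTE_SIZE_KB * writes_per_month / 1024 / 1024)       # ~12 GB new pastes per month
print(PASTE_SIZE_KB * writes_per_month * 36 / 1024 / 1024)  # ~430-450 GB over 3 years
print(writes_per_month / SECONDS_PER_MONTH)                 # ~4 writes per second
print(reads_per_month / SECONDS_PER_MONTH)                  # ~40 reads per second
```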
1.3. Design


1.4. Core
1.4.1 Overview
RDBMS = [1] used as a hash table, [2] maps short_link to the file path in the object store, [3] could be replaced by a NoSQL key-value store.
File server / Object store = [1] Amazon S3, [2] NoSQL document store
web server = reverse proxy
write api = (1) generate unique URL, (2) save to DB, (3) save to object store, (4) return url
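A minimal sketch of the write path under these assumptions: db is a DB-API-style connection to the pastes table from 1.2, object_store exposes a put(key, value) call (S3-style), and generate_short_link is the encoder from 1.4.2; all names are illustrative placeholders, not a fixed API.

```python
# Write API sketch: generate unique URL -> save row to DB -> save content to the
# object store -> return the short URL. db / object_store / generate_short_link
# are illustrative placeholders.
import time

def write_paste(db, object_store, content: str, expiration_in_hr: int) -> str:
    short_link = generate_short_link(content)   # see 1.4.2 Encoding
    path = f"pastes/{short_link}"               # object-store key
    object_store.put(path, content)             # e.g. an S3-style PutObject
    db.execute(
        "INSERT INTO pastes (short_link, expiration_in_hr, created_at, path) "
        "VALUES (?, ?, ?, ?)",
        (short_link, expiration_in_hr, int(time.time()), path),
    )
    return f"https://paste.example.com/{short_link}"
```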
1.4.2 Encoding
MD5 = [1] hash ip+timestamp, [2] uniformly distributed
Base62 = [1] encode the MD5 hash to [a-zA-Z0-9], [2] deterministic, one-to-one, [3] O(k), k=7
Base64 = includes + and /, which require escaping in URLs
62^7 ≈ 3.5 * 10^12 >> 360M = 3.6 * 10^8 short_links needed, enough headroom
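A minimal sketch of the encoding above (MD5 of ip + timestamp + content, Base62, keep 7 characters); collision handling, e.g. retrying on a unique-constraint violation, is left out:

```python
# short_link generation: MD5(ip + timestamp + content) -> Base62 -> first 7 chars.
import hashlib
import time

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def base62(n: int) -> str:
    digits = []
    while n:
        n, r = divmod(n, 62)
        digits.append(ALPHABET[r])
    return "".join(reversed(digits)) or "0"

def generate_short_link(content: str, ip: str = "0.0.0.0") -> str:
    seed = f"{ip}{time.time()}{content}".encode()
    digest = int(hashlib.md5(seed).hexdigest(), 16)  # uniformly distributed 128-bit value
    return base62(digest)[:7]                        # k = 7 characters
```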
1.4.3 Request
1.4.4 Analytics
no realtime requirement = run MapReduce over the web server logs -> hit counts per short_link
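A sketch of the map and reduce steps, assuming an illustrative access-log format of "<timestamp> GET /<short_link> <status>":

```python
# MapReduce-style hit counting over web server access logs (batch, not realtime).
from collections import defaultdict

def map_phase(log_lines):
    for line in log_lines:
        parts = line.split()
        if len(parts) >= 3 and parts[1] == "GET":
            yield parts[2].lstrip("/"), 1     # emit (short_link, 1)

def reduce_phase(pairs):
    counts = defaultdict(int)
    for short_link, n in pairs:
        counts[short_link] += n
    return counts                             # short_link -> hit count
```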
1.4.4 Expire delete
scan the SQL database for pastes where created_at + expiration has passed the current time, then delete them
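A sketch of the cleanup job against the schema from 1.2 (pastes table, created_at assumed to be in epoch seconds); the db and object_store handles are illustrative:

```python
# Periodic cleanup: find expired pastes, delete the object-store file and the DB row.
import time

def delete_expired(db, object_store):
    now = int(time.time())
    rows = db.execute(
        "SELECT short_link, path FROM pastes "
        "WHERE created_at + expiration_in_hr * 3600 < ?", (now,)
    ).fetchall()
    for short_link, path in rows:
        object_store.delete(path)
        db.execute("DELETE FROM pastes WHERE short_link = ?", (short_link,))
```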
1.5. Scale
1) Benchmark, 2) bottlenecks 3) trade-offs
Analytics Database = Amazon Redshift or Google BigQuery.
READ scaling = [1] serve popular content from a memory cache, [2] the cache also absorbs the uneven traffic, [3] cache miss -> read SQL (cache-aside sketch below)
WRITE scaling = Federation, Sharding, Denormalization, SQL Tuning
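Cache-aside read sketch referenced above; the cache interface (get/set with a TTL) is illustrative, e.g. Memcached- or Redis-like:

```python
# Cache-aside read path: popular pastes served from cache; on a miss, read SQL +
# object store and repopulate the cache.
def read_paste(cache, db, object_store, short_link: str):
    content = cache.get(short_link)
    if content is not None:
        return content                         # cache hit
    row = db.execute(
        "SELECT path FROM pastes WHERE short_link = ?", (short_link,)
    ).fetchone()
    if row is None:
        return None                            # unknown or expired link
    content = object_store.get(row[0])
    cache.set(short_link, content, ttl=3600)   # TTL and interface are illustrative
    return content
```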
2. Twitter timeline and search
2.1. Requirement & scope
2.2. Calculate usage
Size per tweet ~ 10KB
8B tweet_id
32B user_id
140B text
10KB media
new tweet per month = 15b * 10KB = 150TB
new tweet per 3yrs = 150TB * 3 * 12 = 5.4PB
read requests per second = 250b / 2.5M = 100k
write new tweet per second = 15b / 2.5M = 6k
fanout new tweet per second = 6k * 10 deliveries per tweet = 60k
search requests per second = 4k
2.3. Design


2.4 Core
2.4.1 User posts a tweet & fanout
2.4.2 View home timeline
2.4.3 View user timeline
2.4.4 searches
2.5 Scale
2.5.1 Fanout service bottleneck
A user with millions of followers -> the tweet takes minutes to propagate to all home timelines -> an @reply can arrive before the original tweet (race) -> mitigate by re-ordering tweets at serve time.
Keep only 100s of tweets for each home timeline in Cache (capped-list sketch after this list)
Keep only active users' home timeline in Cache
Store only a month of tweets in the Tweet Info Service
Store only active users in the User Info Service
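A sketch of the capped home-timeline cache referenced in the list above, assuming Redis lists (LPUSH + LTRIM); key names and the 800-entry cap are illustrative:

```python
# Capped home-timeline cache: one Redis list per active user, trimmed on every
# fanout push, re-sorted by tweet_id at serve time to hide fanout races.
import redis

HOME_TIMELINE_MAX = 800               # "keep only 100s of tweets" per timeline
r = redis.Redis()

def fanout_push(follower_id: int, tweet_id: int) -> None:
    key = f"home_timeline:{follower_id}"
    r.lpush(key, tweet_id)
    r.ltrim(key, 0, HOME_TIMELINE_MAX - 1)    # drop everything past the cap

def read_home_timeline(user_id: int, limit: int = 50):
    ids = r.lrange(f"home_timeline:{user_id}", 0, limit - 1)
    return sorted((int(i) for i in ids), reverse=True)  # re-order at serve time
```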
2.5.2 SQL DB bottleneck
Although the Cache reduces DB load, SQL Read Replicas alone are unlikely to absorb the cache misses -> SQL scaling.
High-volume writes would overwhelm a single SQL write Master-Slave -> SQL scaling (sharding sketch below).
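A sketch of hash-based sharding for tweet writes; the shard count, table name, and connection objects are illustrative:

```python
# Hash-based shard routing: pick a shard by user_id so a single write master
# is not the bottleneck. `shards` is an illustrative list of DB connections.
import hashlib

NUM_SHARDS = 16

def shard_for_user(user_id: int) -> int:
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def insert_tweet(shards, user_id: int, tweet_id: int, text: str) -> None:
    db = shards[shard_for_user(user_id)]      # one write master per shard
    db.execute(
        "INSERT INTO tweets (tweet_id, user_id, text) VALUES (?, ?, ?)",
        (tweet_id, user_id, text),
    )
```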
3. Web crawler
3.1. Requirement & scope
3.2. Calculate usage
Size per page ~ 500KB
2 PB of stored pages per month = 500KB * 4b pages crawled per month
72 PB of stored pages in 3 years
1,600 write requests per second
40,000 search requests per second
3.3. Design


3.4 Core
3.4.1 Service crawls a list of urls
handle duplicate
re-crawl
3.4.2 User inputs a search term and sees a list of relevant pages with titles and snippets
3.4.3 Crawl service detail

URL Frontier = priority queue + politeness router
[1] priority queue = sort on popularity / last modified, etc
[2] politeness router = each back queue holds only one domain, and a per-domain rate limit controls how fast that domain is fetched.
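A single-machine sketch of the two frontier ideas (priority ordering + per-domain politeness delay); the real design spreads these queues across machines, and the crawl delay here is illustrative:

```python
# URL Frontier sketch: priority queue of URLs plus a per-domain cooldown so no
# domain is fetched faster than its crawl delay.
import heapq
import time
from urllib.parse import urlparse

class UrlFrontier:
    def __init__(self, crawl_delay_sec: float = 1.0):
        self.heap = []                 # (priority, url); lower priority = more important
        self.next_allowed = {}         # domain -> earliest allowed fetch time
        self.crawl_delay = crawl_delay_sec

    def push(self, url: str, priority: int) -> None:
        heapq.heappush(self.heap, (priority, url))

    def pop(self):
        """Return the highest-priority URL whose domain is out of its cooldown."""
        now = time.time()
        deferred = []
        url = None
        while self.heap:
            priority, candidate = heapq.heappop(self.heap)
            domain = urlparse(candidate).netloc
            if self.next_allowed.get(domain, 0.0) <= now:
                self.next_allowed[domain] = now + self.crawl_delay
                url = candidate
                break
            deferred.append((priority, candidate))   # domain still cooling down
        for item in deferred:
            heapq.heappush(self.heap, item)
        return url                                   # None if nothing is ready yet
```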
Detection = Update + duplicate
[1] update = HEAD request (e.g. curl -I) to fetch headers without the body (fast) and compare against the last crawl.
[2] duplicate = use SimHash to detect duplicated or partially copied content.
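A sketch of both checks: a HEAD request (headers only) to decide whether to re-crawl, and a simplified 64-bit SimHash with a Hamming-distance threshold for near-duplicates; the header choice and threshold are illustrative:

```python
# Detection sketch: (1) freshness via HEAD headers, (2) near-duplicates via SimHash.
import hashlib
import requests

def page_changed(url, last_etag, last_modified) -> bool:
    resp = requests.head(url, allow_redirects=True, timeout=5)
    etag = resp.headers.get("ETag")
    modified = resp.headers.get("Last-Modified")
    return etag != last_etag or modified != last_modified

def simhash(text: str, bits: int = 64) -> int:
    # Simplified, unweighted SimHash: each token votes on every bit position.
    weights = [0] * bits
    for token in text.split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16) & ((1 << bits) - 1)
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if weights[i] > 0)

def is_near_duplicate(a: int, b: int, max_hamming: int = 3) -> bool:
    return bin(a ^ b).count("1") <= max_hamming
```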
Storage
Bigtable
S3, MinIO
HDFS