SYSTEM DESIGN - EX1

1. Pastebin

1.1. Requirement & scope

// use case
User: enters a block of text -> gets a short URL
User: enters the short URL -> views the text
User: anonymous
Service: page analytics
Service: deletes expired text
Service: high availability

// assumption
Traffic is not evenly distributed

No real-time analytics
10M paste writes per month
100M paste reads per month
10:1 read to write ratio

1.2. Calculation

Size per paste ~ 1.27KB

1KB content

7B short_link

4B expiration_in_hr

5B created_at

255B path

new paste per month = 1.27KB * 10M = 12.7GB

new paste per 3yrs = 1.27KB * 10M * 36 = 450GB

new short_links per 3yrs = 10M * 36 = 360M

new pastes per second = 10M / 2.5M seconds per month = 4

read pastes per second = 100M / 2.5M seconds per month = 40
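
A quick sanity check of the arithmetic above, as a minimal Python sketch (constants mirror the assumptions in 1.1 and 1.2; ~2.5M seconds per month is the usual rounding):

    # Back-of-envelope numbers for the Pastebin estimates above.
    SECONDS_PER_MONTH = 2_500_000                    # ~30 * 24 * 3600, rounded
    PASTE_SIZE_KB = 1 + (7 + 4 + 5 + 255) / 1024     # content + metadata ~ 1.27KB
    WRITES_PER_MONTH = 10_000_000
    READS_PER_MONTH = 100_000_000

    storage_per_month_gb = PASTE_SIZE_KB * WRITES_PER_MONTH / 1e6   # ~12.7GB
    storage_3yrs_gb = storage_per_month_gb * 36                     # ~450GB
    writes_per_second = WRITES_PER_MONTH / SECONDS_PER_MONTH        # 4
    reads_per_second = READS_PER_MONTH / SECONDS_PER_MONTH          # 40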

1.3. Design

1.4. Core

1.4.1 Overview

RDBMS = [1] acts as a hash table, [2] maps short_link to the content's path in the object store, [3] could be replaced by a NoSQL key-value store.

File server / Object store = e.g. Amazon S3 or a NoSQL document store

web server = reverse proxy

write api = [1] generate a unique short URL, [2] save metadata to the DB, [3] save the content to the object store, [4] return the short URL
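
A minimal sketch of that write flow, with in-memory dicts standing in for the RDBMS and object store (all names are illustrative; generate_short_link is sketched in 1.4.2):

    # Write path: [1] generate short link, [2] save metadata to the DB,
    # [3] save the content to the object store, [4] return the short URL.
    db = {}             # short_link -> metadata row (stands in for the RDBMS)
    object_store = {}   # path -> content (stands in for S3)

    def create_paste(content, client_ip, created_at):
        short_link = generate_short_link(client_ip, created_at)    # [1], see 1.4.2
        path = f"pastes/{short_link}"
        db[short_link] = {"path": path, "created_at": created_at}  # [2]
        object_store[path] = content                               # [3]
        return f"https://pastebin.example.com/{short_link}"        # [4]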

1.4.2 Encoding

MD5 = [1] hash of ip + timestamp, [2] uniformly distributed

Base62 = [1] encode the MD5 hash to [a-zA-Z0-9], [2] deterministic, one-to-one, [3] O(k), k=7

Base64 = also includes + and /, which require escaping in URLs

62^7 ≈ 3.5 * 10^12 >> 360M = 3.6 * 10^8, so 7 characters are more than enough
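
A minimal sketch of the encoding, assuming we keep the first k = 7 Base62 characters of the MD5 hash:

    import hashlib

    ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

    def base62(num):
        # Deterministic, one-to-one encoding of an integer into [a-zA-Z0-9].
        out = []
        while num:
            num, rem = divmod(num, 62)
            out.append(ALPHABET[rem])
        return "".join(reversed(out)) or "0"

    def generate_short_link(ip, timestamp, k=7):
        digest = hashlib.md5(f"{ip}{timestamp}".encode()).hexdigest()
        return base62(int(digest, 16))[:k]   # O(k) with k = 7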

1.4.3 Request

1.4.4 Analytics

no real-time requirement = run MapReduce over the web server logs to produce hit counts
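
Since analytics need not be real-time, a periodic batch job over raw logs is enough. A sketch using mrjob (the log layout is an assumption; here the requested URL is whitespace field 6):

    from mrjob.job import MRJob

    class HitCounts(MRJob):
        """Count hits per requested URL from raw web server logs."""

        def mapper(self, _, line):
            # Assumes combined-log-format lines; the request path is
            # field 6 (0-indexed) -- an assumption about the log format.
            fields = line.split()
            if len(fields) > 6:
                yield fields[6], 1

        def reducer(self, url, counts):
            yield url, sum(counts)

    if __name__ == "__main__":
        HitCounts.run()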

1.4.5 Expire delete

scan the SQL database periodically = delete rows where current time > created_at + expiration_in_hr
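
A sketch of that cleanup task, using sqlite3 as a stand-in SQL database and the created_at / expiration_in_hr columns from 1.2:

    import sqlite3

    def delete_expired(conn: sqlite3.Connection) -> int:
        # Delete rows whose created_at + TTL has already passed.
        cur = conn.execute(
            "DELETE FROM pastes "
            "WHERE datetime(created_at, '+' || expiration_in_hr || ' hours') "
            "      < datetime('now')"
        )
        conn.commit()
        return cur.rowcount   # pastes removed this run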

1.5. Scale

[1] Benchmark/load test, [2] profile for bottlenecks, [3] evaluate trade-offs of the alternatives

Analytics Database = Amazon Redshift or Google BigQuery.

READ = [1] serve popular content from the Memory Cache, [2] the cache also absorbs uneven/spiky traffic, [3] on a cache miss, read from SQL
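
The read path is cache-aside. A minimal sketch with redis-py for the Memory Cache (key scheme and TTL are illustrative):

    import redis

    cache = redis.Redis()

    def read_paste(short_link, db_conn):
        cached = cache.get(short_link)
        if cached is not None:               # cache hit: popular / spiky traffic
            return cached.decode()
        row = db_conn.execute(               # cache miss -> SQL
            "SELECT path FROM pastes WHERE short_link = ?", (short_link,)
        ).fetchone()
        if row is None:
            return None                      # unknown or expired paste
        cache.set(short_link, row[0], ex=3600)   # populate with a 1h TTL
        return row[0]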

WRITE scaling = Federation, Sharding, Denormalization, SQL Tuning

2. Twitter timeline & search

2.1. Requirement & scope

2.2. Calculate usage

Size per tweet ~ 10KB

8B tweet_id

32B user_id

140B text

10KB media

new tweets per month = 15B * 10KB = 150TB

new tweets per 3yrs = 150TB * 36 = 5.4PB

read requests per second = 250B / 2.5M seconds = 100k

new tweets written per second = 15B / 2.5M seconds = 6k

tweets fanned out per second = 6k writes * ~10 deliveries each = 60k

search requests per second = 4k

2.3. Design

2.4. Core

2.4.1 User posts a tweet & fanout

2.4.2 View home timeline

2.4.3 View user timeline

2.4.4 Searches

2.5. Scale

2.5.1 Fanout service bottleneck

  1. A user with millions of followers -> the tweet takes minutes to propagate -> an @reply can arrive before the original tweet (race) -> mitigate by re-ordering tweets at serve time.

  2. Keep only 100s of tweets per home timeline in the Cache (see the sketch after this list)

  3. Keep only active users' home timelines in the Cache

  4. Store only a month of tweets in the Tweet Info Service

  5. Store only active users in the User Info Service
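
A minimal sketch of mitigations 1-3, assuming one Redis list per active user's home timeline (key scheme and length cap are illustrative):

    import redis

    r = redis.Redis()
    TIMELINE_LEN = 800   # keep only 100s of tweets per home timeline (item 2)

    def fanout(tweet_id, created_at, follower_ids):
        for uid in follower_ids:
            key = f"home:{uid}"
            if not r.exists(key):            # only active users are cached (item 3)
                continue
            r.lpush(key, f"{created_at}:{tweet_id}")
            r.ltrim(key, 0, TIMELINE_LEN - 1)

    def home_timeline(user_id):
        entries = [e.decode() for e in r.lrange(f"home:{user_id}", 0, -1)]
        # Re-order by timestamp at serve time to hide fanout races (item 1).
        return sorted(entries, key=lambda e: float(e.split(":", 1)[0]), reverse=True)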

2.5.2 SQL DB bottleneck

  1. Although the Cache reduces DB load, SQL Read Replicas alone are unlikely to absorb all cache misses -> apply SQL scaling patterns.

  2. High write volume would overwhelm a single SQL write Master-Slave -> apply SQL scaling patterns (see the sharding sketch below).
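
One standard SQL scaling pattern for the write bottleneck is sharding by user_id. A minimal sketch of the routing (shard count and handles are illustrative):

    import zlib

    NUM_SHARDS = 16
    shards = [f"mysql-shard-{i}" for i in range(NUM_SHARDS)]   # stand-in handles

    def shard_for(user_id):
        # zlib.crc32 is stable across processes, unlike Python's salted hash().
        return shards[zlib.crc32(str(user_id).encode()) % NUM_SHARDS]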

3. Web crawler

3.1. Requirement & scope

3.2. Calculate usage

Size per page ~ 500KB

stored pages per month = 500KB * 4B links crawled = 2PB

stored pages in 3yrs = 2PB * 36 = 72PB

write requests per second = 4B / 2.5M seconds = 1,600

search requests per second = 40,000

3.3. Design

3.4. Core

3.4.1 Service crawls a list of URLs

handle duplicates

re-crawl

3.4.2 User inputs a search term and sees a list of relevant pages with titles and snippets

3.4.3 Crawl service detail

URL Frontier = priority queue + politeness router

[1] priority queue = sorted by popularity, last-modified time, etc.

[2] politeness router = each queue holds URLs from a single domain, and a worker enforces a per-domain fetch rate.
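
A minimal sketch of the URL Frontier, assuming a global priority heap feeding per-domain FIFO queues with a minimum fetch interval per domain (all constants illustrative):

    import heapq
    import time
    from collections import defaultdict, deque
    from urllib.parse import urlparse

    MIN_INTERVAL = 1.0   # min seconds between fetches to one domain (assumed)

    class URLFrontier:
        def __init__(self):
            self.heap = []                           # (priority, url); lowest first
            self.domain_queues = defaultdict(deque)  # one FIFO queue per domain
            self.next_fetch_at = defaultdict(float)  # earliest allowed fetch time

        def add(self, url, priority):
            # Priority can be derived from popularity, last-modified age, etc.
            heapq.heappush(self.heap, (priority, url))

        def get(self):
            # Drain the priority heap into per-domain queues, then serve the
            # first domain whose politeness interval has elapsed.
            while self.heap:
                _, url = heapq.heappop(self.heap)
                self.domain_queues[urlparse(url).netloc].append(url)
            now = time.monotonic()
            for domain, queue in self.domain_queues.items():
                if queue and self.next_fetch_at[domain] <= now:
                    self.next_fetch_at[domain] = now + MIN_INTERVAL
                    return queue.popleft()
            return None   # nothing is fetchable right now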

Detection = updates + duplicates

[1] update = issue an HTTP HEAD request to fetch headers without the body (fast), then compare freshness headers such as ETag / Last-Modified.

[2] duplicate = use simhash to detect duplicated or partially copied content.
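
Minimal sketches of both checks: an HTTP HEAD probe comparing the ETag, and a 64-bit simhash whose Hamming distance flags near-duplicates (the distance threshold is a tuning assumption):

    import hashlib
    import requests

    def maybe_updated(url, last_etag):
        # HEAD fetches response headers without the body (fast freshness probe).
        resp = requests.head(url, timeout=5, allow_redirects=True)
        return resp.headers.get("ETag") != last_etag

    def simhash(text, bits=64):
        # Weighted bit-vote over per-token hashes; near-duplicate documents
        # end up within a small Hamming distance of each other.
        votes = [0] * bits
        for token in text.split():
            h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
            for i in range(bits):
                votes[i] += 1 if (h >> i) & 1 else -1
        return sum(1 << i for i, v in enumerate(votes) if v > 0)

    def near_duplicate(a, b, threshold=3):
        return bin(a ^ b).count("1") <= threshold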

Storage

Bigtable

Amazon S3, MinIO

HDFS
