SYSTEM DESIGN - EX1
1. Pastebin
1.1. Requirement & scope
// use case
User: enters text content -> gets a short url
User: enters short url -> views the text content
User: anonymous
Service: page analytics
Service: deletes expired text
Service: high availability
// assumption
Traffic is not evenly distributed
No realtime analytics
10M paste writes per month
100M paste reads per month
10:1 read to write ratio
1.2. Calculation
Size per paste ~ 1.27KB
1KB content
7B short_link
4B expiration_in_hr
5B created_at
255B path
new paste per month = 1.27KB * 10M = 12.7GB
new paste per 3yrs = 1.27KB * 10M * 36 = 450GB
new short_links per 3yrs = 10M * 36 = 360M
new paste per second = 10M / 2.5M = 4
read paste per second = 100M / 2.5M = 40
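A quick back-of-envelope check of the numbers above (a month is roughly 2.5M seconds); a minimal sketch, assuming the sizes and traffic figures listed:

SECONDS_PER_MONTH = 60 * 60 * 24 * 30   # ~2.5M seconds

paste_size_kb = 1 + 0.007 + 0.004 + 0.005 + 0.255    # content + metadata ~ 1.27KB
writes_per_month = 10_000_000
reads_per_month = 100_000_000

print(paste_size_kb * writes_per_month / 1e6)         # ~12.7 GB of new pastes per month
print(paste_size_kb * writes_per_month * 36 / 1e6)    # ~457 GB (~450 GB) over 3 years
print(writes_per_month / SECONDS_PER_MONTH)           # ~4 new pastes per second
print(reads_per_month / SECONDS_PER_MONTH)            # ~39 (~40) reads per second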
1.3. Design


1.4. Core
1.4.1 Overview
RDBMS = [1] used as a hash table, [2] maps short_link to a file path, [3] could be replaced by a NoSQL key-value store.
File server / Object store = [1] Amazon S3, [2] NoSQL document store
web server = reverse proxy
write api = (1) generate unique short url, (2) save metadata to DB, (3) save content to object store, (4) return url (see the sketch after the schema)
shortlink char(7) NOT NULL
expiration_length_in_minutes int NOT NULL
created_at datetime NOT NULL
paste_path varchar(255) NOT NULL
PRIMARY KEY(shortlink)
index @ shortlink + created_at
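A minimal sketch of the write path (1)-(4) above, assuming injected db and object_store handles and a hypothetical "pastes" table matching the schema; generate_short_link follows from the encoding in 1.4.2:

def create_paste(db, object_store, content, short_link, expiration_length_in_minutes=60):
    # short_link comes from the Base62(MD5(...)) encoding in 1.4.2
    paste_path = "pastes/" + short_link                 # (1) unique URL / object key
    object_store.put(paste_path, content)               # (3) save full content to the object store
    db.execute(
        "INSERT INTO pastes (shortlink, expiration_length_in_minutes, created_at, paste_path)"
        " VALUES (%s, %s, NOW(), %s)",
        (short_link, expiration_length_in_minutes, paste_path),
    )                                                    # (2) save metadata to the SQL DB
    return short_link                                    # (4) return the short url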
1.4.2 Encoding
MD5 = [1] hash ip+timestamp, [2] uniformly distributed
Base62 = [1] encode the MD5 hash to [a-zA-Z0-9], [2] deterministic, one-to-one, [3] O(k), k=7
Base64 = includes + and /, which require escaping in URLs
62^7 ≈ 3.5 * 10^12 >> 360M = 3.6 * 10^8, enough to handle 3 years of short_links
short_link = Base62(MD5(ip+timestamp))
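A minimal sketch of the encoding above, truncating to k=7 characters as in the notes:

import hashlib
import time

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"  # Base62

def base62_encode(num):
    # repeatedly divide by 62 and map remainders to [a-zA-Z0-9]
    digits = []
    while num > 0:
        num, rem = divmod(num, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits)) or "0"

def generate_short_link(ip):
    # MD5(ip + timestamp) is uniformly distributed; Base62-encode and keep 7 chars
    digest = hashlib.md5(("%s%f" % (ip, time.time())).encode()).hexdigest()
    return base62_encode(int(digest, 16))[:7]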
1.4.3 Request
$ curl --header "Content-Type: application/json" \
  --request POST \
  --data '{ "expiration_length_in_minutes": "60", "paste_contents": "Hello World!" }' \
  https://pastebin.com/api/v1/paste
1.4.4 Analytics
no real time = MapReduce logs -> hit counts
// map from log files
    (2016-01, url0), 1
    (2016-01, url0), 1
    (2016-01, url1), 1
// reduce
    (2016-01, url0), 2
    (2016-01, url1), 1
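A minimal in-memory sketch of the map/reduce steps above, assuming each log line has the form "<timestamp> <url>":

from collections import defaultdict

# map: each log line -> ((month, url), 1)
def map_hits(log_lines):
    for line in log_lines:
        timestamp, url = line.split()       # assumed log format: "<timestamp> <url>"
        month = timestamp[:7]               # e.g. "2016-01"
        yield (month, url), 1

# reduce: sum the 1s per (month, url) key -> hit counts
def reduce_hits(mapped):
    counts = defaultdict(int)
    for key, one in mapped:
        counts[key] += one
    return counts

# reduce_hits(map_hits(["2016-01-01T00:00 url0", "2016-01-02T09:00 url0"]))
# -> {("2016-01", "url0"): 2}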
1.4.5 Expire delete
scan the SQL Database for pastes where created_at + expiration has passed, then delete the row and the stored paste
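A minimal sketch of the expiration sweep, assuming MySQL-style interval arithmetic and the same db/object_store handles as above:

# Periodic job: delete pastes whose created_at + expiration has passed.
def delete_expired(db, object_store):
    expired = db.query(
        "SELECT shortlink, paste_path FROM pastes"
        " WHERE created_at + INTERVAL expiration_length_in_minutes MINUTE < NOW()"
    )
    for shortlink, paste_path in expired:
        object_store.delete(paste_path)                                      # drop the content
        db.execute("DELETE FROM pastes WHERE shortlink = %s", (shortlink,))  # drop the row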
1.5. Scale
1) Benchmark/load test, 2) profile for bottlenecks, 3) evaluate trade-offs and iterate
Analytics Database = Amazon Redshift or Google BigQuery.
READ = [1] cache popular content, [2] cache absorbs uneven traffic, [3] cache miss -> read SQL (see the cache-aside sketch below)
WRITE scaling = Federation, Sharding, Denormalization, SQL Tuning
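A minimal cache-aside sketch for the READ path above, assuming a Redis-style cache client and the db/object_store handles used earlier:

# Cache-aside read: serve popular pastes from the cache, fall back to SQL on a miss.
def read_paste(cache, db, object_store, shortlink):
    cached = cache.get(shortlink)
    if cached is not None:                        # cache hit: popular / recently read paste
        return cached
    row = db.query_one("SELECT paste_path FROM pastes WHERE shortlink = %s", (shortlink,))
    if row is None:
        return None                               # unknown or expired shortlink
    contents = object_store.get(row.paste_path)   # cache miss -> read SQL + object store
    cache.set(shortlink, contents, ex=3600)       # keep hot content for an hour (arbitrary TTL)
    return contents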
2. Twitter timeline and search
2.1. Requirement & scope
// Use cases
User posts a tweet -> Service fans out to followers -> sends notifications & emails
User: view my timeline
User: view home timeline
User: searches keywords
Service: high availability
// Out of scope
Visibility settings
Analytics
// assumption
Traffic is not evenly distributed
100M users
500M tweets per day; 15b tweets per month.
5b fanout per day; 150b fanout per month.
250b read per month.
10b search per month.
Fast on (1) Posting a tweet, (2) Fanning out tweets to followers (3) Viewing timeline (4) search
write tweets = write heavy
timeline = Read heavy
search = Read heavy
2.2. Calculate usage
Size per tweet ~ 10KB
8B tweet_id
32B user_id
140B text
10KB media
new tweet per month = 15b * 10KB = 150TB
new tweet per 3yrs = 150TB * 3 * 12 = 5.4PB
read requests per second = 250b / 2.5M = 100k
write new tweet per second = 6k
fanout new tweet per second = 60k
search requests per second = 4k
2.3. Design


2.4 Core
2.4.1 User posts a tweet & fanout
my tweets & my timeline in RDBMS
Fanout tweet & home timeline to others
60K/s fanout -> overload RDBMS -> NoSQL/Cache for fast write
store media = Object Store
Client POSTs a tweet -> Web server -> Write API
(1) save my tweet & timeline in SQL DB
(2) -> FANOUT service
    query USER GRAPH from Cache -> get followers
    save tweet to home timeline of followers in Cache; [O(n) 10 followers = 10 inserts] (tweet_id, user_id)
(3) -> SEARCH service 
    save tweet to enable fast searching
(4) -> Object store
    save media
(5) -> NOTIFICATION service 
    send via Queue asynchronously
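A minimal sketch of the fanout step, assuming a Redis-style cache (lpush/ltrim) and a user_graph lookup; the key naming and the 800-entry cap are illustrative:

# Fanout: push (tweet_id, user_id) onto each follower's home timeline list in the cache.
HOME_TIMELINE_MAX = 800   # keep only the most recent few hundred entries per timeline

def fanout(cache, user_graph, tweet_id, author_id):
    followers = user_graph.get_followers(author_id)       # query the User Graph from cache
    for follower_id in followers:                          # O(n) inserts, n = follower count
        key = "home_timeline:%d" % follower_id
        cache.lpush(key, "%d:%d" % (tweet_id, author_id))  # store (tweet_id, user_id) pair
        cache.ltrim(key, 0, HOME_TIMELINE_MAX - 1)         # bound the timeline length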
2.4.2 View home timeline
Client requests home timeline -> Web server -> Read API -> TIMELINE service
(1) Get timeline (tweet_id, user_id)^x from Cache; O(1)
(2) query Tweet Info Service by MGET; O(n)
(3) query User Info Service by MGET; O(n)
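A minimal sketch of steps (1)-(3) above, assuming the same Redis-style cache and key naming as the fanout sketch:

# Home timeline read: O(1) fetch of (tweet_id, user_id) pairs, then batched lookups.
def home_timeline(cache, user_id, limit=40):
    entries = cache.lrange("home_timeline:%d" % user_id, 0, limit - 1)  # (1) timeline from cache
    tweet_ids = [e.split(":")[0] for e in entries]
    author_ids = [e.split(":")[1] for e in entries]
    tweets = cache.mget(["tweet_info:%s" % t for t in tweet_ids])       # (2) Tweet Info via MGET
    users = cache.mget(["user_info:%s" % u for u in author_ids])        # (3) User Info via MGET
    return list(zip(tweets, users))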
2.4.3 View user timeline
Client requests user timeline -> Web server -> Read API
(1) user timeline from SQL Database
2.4.4 Searches
Client search -> Web server -> Search API -> SEARCH service
(1) Parses/tokenizes the input query (remove markup; normalize capitalization; convert to a boolean query)
(2) Queries the Search Cluster (merges, ranks, sorts, and returns the results)
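A minimal sketch of the two steps above, assuming a search_cluster handle and simple regex tokenization:

import re

def parse_query(raw_query):
    # remove markup, normalize capitalization, split into terms, build a boolean AND query
    text = re.sub(r"<[^>]+>", " ", raw_query).lower()
    terms = re.findall(r"[a-z0-9]+", text)
    return {"operator": "AND", "terms": terms}

def search(search_cluster, raw_query):
    query = parse_query(raw_query)
    results = search_cluster.query(query)                 # each node returns its matches
    results.sort(key=lambda r: r["rank"], reverse=True)   # merge, rank, sort
    return results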
2.5 Scale
2.5.1 Fanout service bottleneck
- A user with millions of followers -> tweet takes minutes to propagate -> an @reply can arrive before the original tweet -> mitigate by re-ordering tweets at serve time. 
- Keep only a few hundred tweets per home timeline in Cache 
- Keep only active users' home timeline in Cache 
- Store only a month of tweets in the Tweet Info Service 
- Store only active users in the User Info Service 
2.5.2 SQL DB bottleneck
- Although the Cache reduces DB load, SQL Read Replicas alone are unlikely to handle the cache misses -> SQL scaling. 
- High-volume writes would overwhelm a single SQL Write Master-Slave -> SQL scaling. 
3. web crawler
3.1. Requirement & scope
// use cases
User: inputs search term & sees relevant pages with titles and snippets.
Service Crawls a list of urls
    -> Generates reverse index of words to pages containing the search terms
    -> Generates titles and snippets for pages (static)
    -> High availability
// Out of scope
Search analytics & adaptive crawl frequency (popular sites could be re-crawled at shorter intervals)
Personalized search results
Page rank
// Assumptions
Traffic is not evenly distributed
Some searches are popular, some not.
Anonymous users
FAST: (1) Generating search results
Crawler must not get stuck in an infinite loop when the link graph contains a cycle
1b links to crawl (crawled regularly to ensure freshness) (refresh rate of about once per week)
4b crawled each month
100b searches per month
3.2. Calculate usage
Size per page ~ 500KB
2 PB of stored page per month = 500KB * 4b
72 PB of stored page in 3 years
1,600 write requests per second
40,000 search requests per second
3.3. Design


3.4 Core
3.4.1 Service crawls a list of urls
links_to_crawl(url) + ranked(popularity) -> crawled_links (url, signature)
(1) save links_to_crawl & crawled_links in NoSQL
(2) save ranked identifiers of links_to_crawl in Redis
Reverse Index Service -> generate a reverse index
Document Service -> generate a static title and snippet
// what is reverse index
T0 = "it is what it is"
T1 = "what is it"
T2 = "it is a banana"
Inverted index:
 "banana": {2}
 "it":     {0, 1, 2}
 "what":   {0, 1}class Crawler:
class Crawler:
    def crawl(self):
        while True:
            # take the highest priority link from links_to_crawl
            page = self.memory.extract_max_priority_page()
            if page is None:
                break
            # check crawled_links; if similar to a previously crawled page,
            # reduce its priority to avoid getting stuck in a cycle
            if self.memory.crawled_similar(page.signature):
                self.memory.reduce_priority_link_to_crawl(page.url)
            else:
                # no similar crawled pages, so crawl it
                self.__crawl_page(page)

    def __crawl_page(self, page):
        # fetch the page
        # add it to ReverseIndexQueue & DocumentQueue
        # insert child_urls into links_to_crawl
        # calculate the page signature
        # remove the url from links_to_crawl
        # insert (url, signature) into crawled_links
        pass
handle duplicate
// url duplication
MapReduce -> output only entries with frequency = 1
// page duplicate
IOU (Intersection over Union)
word vector & calculate cosine similarity
simhash
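A minimal sketch of the word-vector / cosine-similarity check above; the 0.9 threshold is an arbitrary assumption, and simhash would instead compare hashed fingerprints:

import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    # word-frequency vectors, then cos(theta) = dot(a, b) / (|a| * |b|)
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def is_duplicate(page_text, crawled_text, threshold=0.9):
    return cosine_similarity(page_text, crawled_text) >= threshold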
re-crawl
1. timestamp
2. one week for regular pages, shorter intervals for frequently updated pages
3. check page update frequency with a HEAD request for the Last-Modified header
4. data mining -> mean time for a page to update
3.4.2 User inputs a search term and sees a list of relevant pages with titles and snippets
Client -> Web Server -> Query API server
(1) Parses query / Removes markup / Breaks up the text into terms / Fixes typos / Normalizes capitalization
(2) -> Reverse Index Service 
    find documents matching the query / rank & return top ones
(3) -> Document Service
    return titles and snippets
$ curl https://search.com/api/v1/search?query=hello+world
// Response:
{
    "title": "foo's title",
    "snippet": "foo's snippet",
    "link": "https://foo.com",
},
...
3.4.3 Crawl service detail

URL Frontier = priority queue + politeness router
[1] priority queue = sort by popularity / last modified, etc.
[2] politeness router = each queue holds only one domain, and one machine controls the fetch rate per domain (sketched below)
// put all urls under https://www.wikipedia.org/ into one queue (one machine)
1. saves time rebuilding TCP connections
2. controls request speed to avoid breaking rate limits
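A minimal sketch of the politeness router, assuming one in-memory queue per domain and a fixed crawl_delay between fetches to the same domain:

import time
from collections import defaultdict, deque
from urllib.parse import urlparse

class PolitenessRouter:
    def __init__(self, crawl_delay=1.0):
        self.queues = defaultdict(deque)   # one queue per domain, e.g. www.wikipedia.org
        self.last_fetch = {}               # domain -> timestamp of the last fetch
        self.crawl_delay = crawl_delay     # minimum seconds between fetches to one domain

    def add(self, url):
        self.queues[urlparse(url).netloc].append(url)

    def next_url(self):
        # pick a domain whose rate limit allows another fetch; keeps connections warm
        now = time.time()
        for domain, queue in self.queues.items():
            if queue and now - self.last_fetch.get(domain, 0) >= self.crawl_delay:
                self.last_fetch[domain] = now
                return queue.popleft()
        return None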
Detection = Update + duplicate
[1] update = curl HEAD request to get the headers without the body (fast)
[2] duplicate = use simhash to detect duplicated or partially copied content
Storage
Bigtable
S3, MinIO
HDFS