SYSTEM DESIGN - EX1

1. Pastebin

1.1. Requirement & scope

// use cases
User: enters text content -> gets a short url
User: enters a short url -> views the text content
User: anonymous
Service: tracks page analytics
Service: deletes expired pastes
Service: high availability

// assumptions
Traffic is not evenly distributed
No realtime analytics needed
10M paste writes per month
100M paste reads per month
10:1 read-to-write ratio

1.2. Calculation

Size per paste ~ 1.27KB
    1KB content
    7B short_link
    4B expiration_length_in_minutes
    5B created_at
    255B path

new pastes per month = 1.27KB * 10M = 12.7GB
new pastes per 3 yrs = 1.27KB * 10M * 36 = ~450GB
new short_links per 3 yrs = 10M * 36 = 360M
new pastes per second = 10M / 2.5M seconds per month = ~4
read pastes per second = 100M / 2.5M seconds per month = ~40
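
A quick back-of-envelope check of these estimates (decimal units; ~2.5M seconds per month is the usual rounding):

# Back-of-envelope check of the Pastebin estimates above.
KB, GB = 1000, 10**9
SECONDS_PER_MONTH = 2.5 * 10**6

paste_size = 1 * KB + 7 + 4 + 5 + 255        # content + short_link + expiration + created_at + path
writes_per_month = 10 * 10**6                # 10M paste writes
reads_per_month = 100 * 10**6                # 100M paste reads

print(paste_size)                                 # 1271 bytes ~ 1.27KB
print(paste_size * writes_per_month / GB)         # ~12.7 GB of new pastes per month
print(paste_size * writes_per_month * 36 / GB)    # ~457 GB over 3 years
print(writes_per_month / SECONDS_PER_MONTH)       # 4 writes per second
print(reads_per_month / SECONDS_PER_MONTH)        # 40 reads per second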

1.3. Design

1.4. Core

1.4.1 Overview

RDBMS = [1] used as a big hash table, [2] maps short_link to the paste's file path in the object store, [3] could be replaced by a NoSQL key-value store.

File server / Object store = [1] Amazon S3, [2] or a NoSQL document store

web server = reverse proxy

write api = (1) generate unique URL, (2) save to DB, (3) save to object store, (4) return url

CREATE TABLE pastes (    -- table name assumed
    shortlink char(7) NOT NULL,
    expiration_length_in_minutes int NOT NULL,
    created_at datetime NOT NULL,
    paste_path varchar(255) NOT NULL,
    PRIMARY KEY(shortlink)
);

indexes on shortlink and created_at to speed up lookups

1.4.2 Encoding

MD5 = [1] hash ip+timestamp, [2] uniformly distributed

Base62 = [1] encode the MD5 hash to [a-zA-Z0-9], [2] deterministic, one-to-one, [3] O(k), k=7

Base64 = includes + and /, which require escaping in URLs

62^7 ≈ 3.5 * 10^12 >> 360M short_links over 3 yrs -> plenty of headroom

short_link = Base62(MD5(ip+timestamp))
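
A minimal sketch of this encoding step in Python (standard hashlib; the 7-character truncation and alphabet ordering are implementation choices):

import hashlib
import time

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"  # Base62

def base62(num):
    # Repeatedly divide by 62 and map remainders onto the alphabet.
    digits = []
    while num > 0:
        num, rem = divmod(num, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits)) or ALPHABET[0]

def short_link(ip, timestamp=None):
    # short_link = Base62(MD5(ip + timestamp)), truncated to 7 chars.
    timestamp = timestamp if timestamp is not None else time.time()
    digest = hashlib.md5(f"{ip}{timestamp}".encode()).hexdigest()
    return base62(int(digest, 16))[:7]

print(short_link("203.0.113.7"))  # a 7-character link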

1.4.3 Request

$ curl --header "Content-Type: application/json" \
  --request POST \
  --data '{ "expiration_length_in_minutes": "60", "paste_contents": "Hello World!" }' \
  https://pastebin.com/api/v1/paste

1.4.4 Analytics

no realtime requirement = run MapReduce over the web server logs -> generate hit counts (sketch after the example below)

// map from log files
    (2016-01, url0), 1
    (2016-01, url0), 1
    (2016-01, url1), 1

// reduce
    (2016-01, url0), 2
    (2016-01, url1), 1
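
A rough sketch of that hit-count job in plain Python (a real deployment would run it on a MapReduce framework; the log line format here is assumed):

from collections import Counter

def map_phase(log_lines):
    # Emit ((month, url), 1) for every paste view in the access log.
    for line in log_lines:
        timestamp, url = line.split()      # assumed format: "<ISO timestamp> <url>"
        yield (timestamp[:7], url), 1      # e.g. ("2016-01", "url0"), 1

def reduce_phase(pairs):
    # Sum the counts per (month, url) key.
    counts = Counter()
    for key, one in pairs:
        counts[key] += one
    return counts

logs = ["2016-01-01T00:00:00 url0", "2016-01-02T00:00:00 url0", "2016-01-03T00:00:00 url1"]
print(reduce_phase(map_phase(logs)))
# Counter({('2016-01', 'url0'): 2, ('2016-01', 'url1'): 1})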

1.4.5 Expire delete

periodically scan the SQL Database and delete rows whose created_at + expiration_length_in_minutes is in the past (sketch below)
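
For example, a cron-style cleanup job could run a query like the following (a sketch assuming SQLite and the pastes table defined above):

import sqlite3

def delete_expired(db_path="pastebin.db"):
    # Remove pastes whose expiration window has passed; run periodically (e.g. from cron).
    conn = sqlite3.connect(db_path)
    conn.execute(
        "DELETE FROM pastes "
        "WHERE datetime(created_at, '+' || expiration_length_in_minutes || ' minutes') < datetime('now')"
    )
    conn.commit()
    conn.close()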

1.5. Scale

(1) Benchmark/load test, (2) profile for bottlenecks, (3) address them while weighing trade-offs

Analytics Database = Amazon Redshift or Google BigQuery

READ scaling = [1] keep popular content in the Cache, [2] the Cache also absorbs uneven read traffic, [3] on a cache miss -> read from SQL (cache-aside sketch below)
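
A sketch of that cache-aside read path (a Redis-style cache is assumed; db_lookup is a hypothetical helper standing in for the SQL read):

def read_paste(shortlink, cache, db_lookup, ttl_seconds=3600):
    # Cache-aside: serve hot pastes from the cache, fall back to SQL on a miss.
    cached = cache.get(shortlink)
    if cached is not None:
        return cached
    paste = db_lookup(shortlink)                        # SELECT ... WHERE shortlink = ?
    if paste is not None:
        cache.set(shortlink, paste, ex=ttl_seconds)     # Redis-style SET with an expiry
    return paste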

WRITE scaling = Federation, Sharding, Denormalization, SQL Tuning

2. Twitter timeline and search

2.1. Requirement & scope

// Use cases
User posts a tweet -> Service fans it out to followers -> sends notifications & emails
User: views own timeline
User: views home timeline
User: searches keywords
Service: high availability

// Out of scope
Visibility settings
Analytics

// assumptions
Traffic is not evenly distributed
100M users
500M tweets per day; 15b tweets per month
5b fanouts per day; 150b fanouts per month
250b reads per month
10b searches per month

Fast on: (1) posting a tweet, (2) fanning out tweets to followers, (3) viewing a timeline, (4) searching
writing tweets = write heavy
timelines = read heavy
search = read heavy

2.2. Calculate usage

Size per tweet ~ 10KB
    8B tweet_id
    32B user_id
    140B text
    10KB media

new tweets per month = 15b * 10KB = 150TB
new tweets per 3 yrs = 150TB * 36 = 5.4PB
read requests per second = 250b / 2.5M = 100K
write new tweets per second = 15b / 2.5M = 6K
fanout new tweets per second = 150b / 2.5M = 60K
search requests per second = 10b / 2.5M = 4K

2.3. Design

2.4. Core

2.4.1 User posts a tweet & fanout

own tweets & user timeline in RDBMS

Fanout tweets & home timelines to others
60K/s fanout would overload the RDBMS -> use NoSQL/Cache for fast writes

store media = Object Store
Client POSTs a tweet -> Web server -> Write API
(1) save own tweet & user timeline in the SQL DB
(2) -> FANOUT service (see the sketch after this list)
    query the USER GRAPH service in the Cache -> get followers
    append (tweet_id, user_id) to each follower's home timeline in the Cache; O(n) [10 followers = 10 inserts]
(3) -> SEARCH service
    index the tweet to enable fast searching
(4) -> Object store
    save media
(5) -> NOTIFICATION service
    send via a Queue asynchronously
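
A rough sketch of the fanout step under these assumptions (redis-py client as the home-timeline cache; get_followers is a hypothetical User Graph lookup):

import json
import redis

cache = redis.Redis()                      # home timelines live in a memory cache

def fan_out_tweet(tweet_id, author_id, get_followers, max_timeline_len=800):
    # Push the new tweet onto each follower's cached home timeline: O(n) in the number of followers.
    for follower_id in get_followers(author_id):
        key = f"home_timeline:{follower_id}"
        entry = json.dumps({"tweet_id": tweet_id, "user_id": author_id})
        cache.lpush(key, entry)                      # newest first
        cache.ltrim(key, 0, max_timeline_len - 1)    # keep only the most recent few hundred tweets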

2.4.2 View home timeline

Client GETs the home timeline -> Web server -> Read API -> TIMELINE service (see the sketch below)
(1) get the cached timeline entries (tweet_id, user_id) from the Cache; O(1) per entry
(2) query the Tweet Info Service with MGET to hydrate the tweets; O(n)
(3) query the User Info Service with MGET to hydrate the users; O(n)
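
Continuing the Redis-based sketch, the read path could look like this (the tweet_info / user_info key layouts are assumptions):

import json

def home_timeline(user_id, cache, page_size=20):
    # (1) Pull the cached (tweet_id, user_id) entries for this user.
    entries = [json.loads(e) for e in cache.lrange(f"home_timeline:{user_id}", 0, page_size - 1)]
    # (2) + (3) Batch-fetch tweet and user details with MGET instead of n round trips.
    tweets = cache.mget([f"tweet_info:{e['tweet_id']}" for e in entries])
    users = cache.mget([f"user_info:{e['user_id']}" for e in entries])
    return list(zip(tweets, users))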

2.4.3 View user timeline

Client GETs the user timeline -> Web server -> Read API
(1) read the user timeline from the SQL Database

2.4.4 searches

Client searches -> Web server -> Search API -> SEARCH service
(1) parses/tokenizes the input query (remove markup, normalize capitalization, convert to a boolean query)
(2) queries the Search Cluster (merges, ranks, sorts, and returns the results)

2.5. Scale

2.5.1 Fanout service bottleneck

  1. A user with millions of followers -> tweet takes minutes to propagate -> race conditions with @replies -> mitigate by re-ordering tweets at serve time

  2. Keep only a few hundred tweets per home timeline in the Cache

  3. Keep only active users' home timelines in the Cache

  4. Store only a month of tweets in the Tweet Info Service

  5. Store only active users in the User Info Service

2.5.2 SQL DB bottleneck

  1. Although the Cache reduces DB load, SQL Read Replicas alone are unlikely to handle the cache misses -> further SQL scaling needed.

  2. High write volume would overwhelm a single SQL Write Master-Slave setup -> further SQL scaling needed (Federation, Sharding, Denormalization, SQL Tuning).

3. web crawler

3.1. Requirement & scope

// use cases
User: inputs a search term & sees relevant pages with titles and snippets

Service: crawls a list of urls
    -> generates a reverse index of words to the pages containing them
    -> generates titles and snippets for pages (static)
    -> high availability

// Out of scope
Search analytics / adaptive crawl frequency (popular sites could be re-crawled at shorter intervals)
Personalized search results
Page rank

// assumptions
Traffic is not evenly distributed
Some searches are very popular, others are not
Anonymous users
FAST: (1) generating search results
Crawler should not get stuck in an infinite loop when the link graph contains cycles
1b links to crawl (crawled regularly for freshness, roughly once per week)
4b pages crawled each month
100b searches per month

3.2. Calculate usage

Size per page ~ 500KB

stored pages per month = 500KB * 4b = 2PB
stored pages in 3 yrs = 2PB * 36 = 72PB
write requests per second = 4b / 2.5M = 1,600
search requests per second = 100b / 2.5M = 40,000

3.3. Design

3.4. Core

3.4.1 Service crawls a list of urls

links_to_crawl(url) + ranked(popularity) -> crawled_links (url, signature)
(1) save links_to_crawl & crawled_links in NoSQL
(2) save ranked identifiers of links_to_crawl in Redis
Reverse Index Service -> generate a reverse index
Document Service -> generate a static title and snippet

// what is a reverse index (inverted index)
T0 = "it is what it is"
T1 = "what is it"
T2 = "it is a banana"
reverse index:
 "banana": {2}
 "it":     {0, 1, 2}
 "what":   {0, 1}
class Crawler:
    def crawl(self):
        while True:
            page = self.memory.extract_max_priority_page()  # take the highest-priority link
            if page is None:
                break
            if self.memory.crawled_similar(page.signature):
                # already crawled a similar page -> lower its priority to avoid cycles
                self.memory.reduce_priority_link_to_crawl(page.url)
            else:
                # no similar crawled page -> crawl it
                self.__crawl_page(page)

    def __crawl_page(self, page):
        # fetch the page contents
        # add the page to the ReverseIndexQueue and DocumentQueue
        # insert child_urls into links_to_crawl
        # calculate the page signature
        # remove the url from links_to_crawl
        # insert (url, signature) into crawled_links
        ...

handle duplicates

// url duplication
MapReduce over the url list -> output only entries with frequency = 1

// page duplication (similar content)
Jaccard similarity (intersection over union of word/shingle sets; sketch below)
word vectors & cosine similarity
simhash
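
A simple Jaccard check over word sets might look like this (real crawlers typically use shingles plus simhash/MinHash at scale):

def jaccard_similarity(text_a, text_b):
    # |A intersect B| / |A union B| over the sets of words in each page.
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 1.0

def is_near_duplicate(text_a, text_b, threshold=0.9):
    # Treat pages as near-duplicates when their word sets overlap heavily.
    return jaccard_similarity(text_a, text_b) >= threshold

print(jaccard_similarity("it is what it is", "what is it"))  # 1.0 (identical word sets)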

re-crawl

1. timestamp each crawl
2. one week for regular pages, shorter intervals for frequently-updated pages
3. check page update frequency with a HEAD request and the Last-Modified header
4. data mining -> estimate the mean time between updates per page

3.4.2 User inputs a search term and sees a list of relevant pages with titles and snippets

Client -> Web Server -> Query API server
(1) Parses query / Removes markup / Breaks up the text into terms / Fixes typos / Normalizes capitalization
(2) -> Reverse Index Service 
    find documents matching the query / rank & return top ones
(3) -> Document Service
    return titles and snippets
$ curl https://search.com/api/v1/search?query=hello+world
// Response: a list of results
[
    {
        "title": "foo's title",
        "snippet": "foo's snippet",
        "link": "https://foo.com"
    },
    ...
]

3.4.3 Crawl service detail

URL Frontier = priority queue + politeness router

[1] priority queue = sorted by popularity, last-modified time, etc.

[2] politeness router = each queue holds urls from a single domain, and one machine controls the fetch rate per domain (toy sketch below)

// put all urls from https://www.wikipedia.org/ into one queue (handled by one machine)
1. saves time rebuilding TCP connections
2. controls the request rate so we don't break the site's rate limits
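
A toy sketch of that politeness idea, assuming one in-memory queue per domain and a fixed minimum delay between fetches to the same domain:

import time
from collections import defaultdict, deque
from urllib.parse import urlparse

class PolitenessRouter:
    def __init__(self, min_delay_seconds=1.0):
        self.queues = defaultdict(deque)   # one queue of urls per domain
        self.last_fetch = {}               # domain -> time of last request
        self.min_delay = min_delay_seconds

    def add(self, url):
        self.queues[urlparse(url).netloc].append(url)

    def next_url(self):
        # Return a url whose domain has not been hit within min_delay; None if all domains must wait.
        now = time.monotonic()
        for domain, queue in self.queues.items():
            if queue and now - self.last_fetch.get(domain, 0.0) >= self.min_delay:
                self.last_fetch[domain] = now
                return queue.popleft()
        return None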

Detection = update + duplicate

[1] update = HEAD request fetches the headers without the body (fast) -> compare Last-Modified (sketch below)

[2] duplicate = use simhash to detect fully or partially copied content
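
A sketch of the HEAD-based freshness check using only the standard library (error handling omitted):

import urllib.request

def last_modified(url):
    # HEAD request: fetch only the response headers, not the body.
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.headers.get("Last-Modified")   # None if the server doesn't send it

def needs_recrawl(url, previously_seen):
    # Re-crawl when the Last-Modified header has changed (or is missing).
    current = last_modified(url)
    return current is None or current != previously_seen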

Storage

Bigtable

S3, MinIO

HDFS
