// use cases
User: enters text & content -> gets a short url
User: enters a short url -> views the text & content
User: anonymous
Service: page analytics
Service: deletes expired text
Service: high availability
// assumptions
Traffic is not evenly distributed
No realtime analytics
10M paste writes per month
100M paste reads per month
10:1 read to write ratio
1.2. Calculation
Size per paste ~ 1.27KB
1KB content
7B short_link
4B expiration_length_in_minutes
5B created_at
255B path
new paste content per month = 1.27KB * 10M = ~12.7GB
new paste content per 3 yrs = 1.27KB * 10M * 36 = ~450GB
new short_links per 3 yrs = 10M * 36 = 360M
new pastes per second = 10M / 2.5M seconds per month = ~4
read pastes per second = 100M / 2.5M seconds per month = ~40
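A quick sanity check of these numbers, assuming ~2.5M seconds per month (30 days * 86,400 s, rounded for back-of-envelope math):

SECONDS_PER_MONTH = 2.5e6      # ~30 days * 86,400 s, rounded for mental math
PASTE_SIZE_KB = 1.27           # 1KB content + ~0.27KB metadata
writes_per_month = 10e6
reads_per_month = 100e6

print(PASTE_SIZE_KB * writes_per_month / 1e6)       # ~12.7 (GB of new paste content per month)
print(PASTE_SIZE_KB * writes_per_month * 36 / 1e6)  # ~457  (GB over 3 years)
print(writes_per_month / SECONDS_PER_MONTH)         # ~4 writes per second
print(reads_per_month / SECONDS_PER_MONTH)          # ~40 reads per second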
1.3. Design
1.4. Core
1.4.1 Overview
RDBMS = [1] used as a hash table, [2] maps short_link to the paste file's path in the Object Store, [3] could be replaced by a NoSQL key-value store.
File server / Object store = [1] Amazon S3, [2] NoSQL document store
web server = reverse proxy
write api = (1) generate a unique short url, (2) save metadata to the DB, (3) save the paste content to the Object Store, (4) return the url (see the sketch after the schema below)
shortlink char(7) NOT NULL
expiration_length_in_minutes int NOT NULL
created_at datetime NOT NULL
paste_path varchar(255) NOT NULL
PRIMARY KEY(shortlink)
index @ shortlink + created_at
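A minimal sketch of the write path, assuming base62 over an MD5 digest for the 7-char shortlink; db, object_store, the pastes table name, and the domain are illustrative, not a fixed API:

import hashlib
import time

BASE62 = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def to_base62(num):
    # encode a non-negative integer in base62
    digits = []
    while num > 0:
        num, rem = divmod(num, 62)
        digits.append(BASE62[rem])
    return "".join(reversed(digits)) or "0"

def generate_shortlink(ip_address, content):
    # (1) hash (ip + timestamp + content) and keep the first 7 base62 chars
    raw = f"{ip_address}{time.time()}{content}".encode()
    return to_base62(int(hashlib.md5(raw).hexdigest(), 16))[:7]

def write_paste(db, object_store, ip_address, content, expiration_length_in_minutes):
    shortlink = generate_shortlink(ip_address, content)
    paste_path = object_store.put(shortlink, content)            # (3) save content to the Object Store
    db.execute(
        "INSERT INTO pastes (shortlink, expiration_length_in_minutes, created_at, paste_path) "
        "VALUES (%s, %s, NOW(), %s)",
        (shortlink, expiration_length_in_minutes, paste_path))   # (2) save metadata to the DB
    return f"https://paste.example.com/{shortlink}"              # (4) return the short url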
2. Twitter timeline & search
2.1. Requirement & scope
// Use cases
User: posts a tweet -> Service fans it out to followers -> sends notifications & emails
User: view my timeline
User: view home timeline
User: searches keywords
Service: high availability
// Out of scope
Visibility settings
Analytics
// assumptions
Traffic is not evenly distributed
100M users
500M tweets per day; 15b tweets per month.
5b fanout per day; 150b fanout per month.
250b read per month.
10b search per month.
Fast on (1) posting a tweet, (2) fanning out tweets to followers, (3) viewing a timeline, (4) searching
write tweets = write heavy
timeline = Read heavy
search = Read heavy
2.2. Calculate usage
Size per tweet ~ 10KB
8B tweet_id
32B user_id
140B text
10KB media
new tweet content per month = 15b * 10KB = 150TB
new tweet content per 3 yrs = 150TB * 36 = 5.4PB
read requests per second = 250b / 2.5M = 100k
write new tweets per second = 15b / 2.5M = 6k
fanout writes per second = 150b / 2.5M = 60k
search requests per second = 10b / 2.5M = 4k
2.3. Design
2.4 Core
2.4.1 User posts a tweet & fanout
own tweets & the user timeline stored in the RDBMS
fan out each tweet to followers' home timelines
60k/s fanout writes would overload an RDBMS -> use NoSQL / Memory Cache for fast writes
store media = Object Store
Client POSTs a tweet -> Web server -> Write API
(1) save the tweet to the user's own timeline in the SQL DB
(2) -> FANOUT service
query USER GRAPH from Cache -> get followers
append (tweet_id, user_id) to each follower's home timeline in the Cache; O(n), e.g. 10 followers = 10 inserts (see the Redis sketch below)
(3) -> SEARCH service
save tweet enable fast searching
(4) -> Object store
save media
(5) -> NOTIFICATION service
push to a Queue and deliver asynchronously
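A sketch of step (2), assuming Redis as the Memory Cache and a hypothetical user_graph_service exposing get_followers:

import redis

cache = redis.Redis()

def fan_out(tweet_id, author_id, user_graph_service):
    # push the new tweet onto each follower's home timeline list
    for follower_id in user_graph_service.get_followers(author_id):   # O(n) inserts
        key = f"home_timeline:{follower_id}"
        cache.lpush(key, f"{tweet_id}:{author_id}")    # store (tweet_id, user_id)
        cache.ltrim(key, 0, 799)                       # keep only the newest few hundred entries (see 2.5.1)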
2.4.2 View home timeline
Client POSTs a home timeline request -> Web server -> Read API -> TIMELINE service
(1) get the timeline entries [(tweet_id, user_id), ...] from the Cache; O(1)
(2) query Tweet Info Service by MGET; O(n)
(3) query User Info Service by MGET; O(n)
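A sketch of the read path, reusing the home_timeline:{user_id} key from the fanout sketch; multiget stands in for the MGET-style batch lookups of the Tweet Info and User Info Services:

def get_home_timeline(cache, tweet_info_service, user_info_service, user_id):
    # (1) O(1) lookup of the raw timeline entries
    entries = [e.decode() for e in cache.lrange(f"home_timeline:{user_id}", 0, -1)]
    tweet_ids = [e.split(":")[0] for e in entries]
    author_ids = [e.split(":")[1] for e in entries]
    # (2)(3) hydrate tweets and users with batched lookups; O(n)
    tweets = tweet_info_service.multiget(tweet_ids)
    users = user_info_service.multiget(author_ids)
    return list(zip(tweets, users))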
2.4.3 View user timeline
Client POSTs a user timeline request -> Web server -> Read API
(1) fetch the user timeline from the SQL Database
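A sketch of the user timeline read, assuming a tweets table keyed by user_id and a DB-API-style db handle (both illustrative):

def get_user_timeline(db, user_id, limit=50):
    # newest tweets first, served directly from the relational store
    return db.execute(
        "SELECT tweet_id, text, created_at FROM tweets "
        "WHERE user_id = %s ORDER BY created_at DESC LIMIT %s",
        (user_id, limit))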
2.4.4 searches
Client search -> Web server -> Search API -> SEARCH service
(1) parses/tokenizes the input query (removes markup; normalizes capitalization; converts to a boolean query)
(2) queries the Search Cluster (merges, ranks, sorts, and returns the results)
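A toy version of step (1), assuming the boolean form is a simple AND over normalized terms (a real parser would also fix typos and handle operators):

import re

def parse_query(raw_query):
    text = re.sub(r"<[^>]+>", " ", raw_query)        # remove markup
    terms = re.findall(r"[a-z0-9]+", text.lower())   # normalize capitalization, tokenize
    return {"operator": "AND", "terms": terms}

parse_query("<b>Distributed</b> Systems")            # {'operator': 'AND', 'terms': ['distributed', 'systems']}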
2.5 Scale
2.5.1 Fanout service bottleneck
A user with millions of followers -> the tweet takes minutes to propagate -> an @reply can arrive before the original tweet (race) -> mitigate by re-ordering tweets at serve time.
Keep only a few hundred tweets per home timeline in the Cache (cf. the LTRIM in the fanout sketch)
Keep only active users' home timeline in Cache
Store only a month of tweets in the Tweet Info Service
Store only active users in the User Info Service
2.5.2 SQL DB bottleneck
Although the Cache reduces DB load, SQL Read Replicas alone are unlikely to be enough to handle the cache misses -> needs further SQL scaling.
The high write volume would overwhelm a single SQL Write Master-Slave -> needs further SQL scaling.
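The notes above only say "SQL scaling"; one common option (an assumption here, not prescribed by the notes) is to shard tweets across several SQL write masters by user_id:

import hashlib

NUM_SHARDS = 16   # illustrative shard count

def shard_for(user_id):
    # stable hash so a given user's tweets always land on the same shard
    return int(hashlib.md5(str(user_id).encode()).hexdigest(), 16) % NUM_SHARDS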
3. web crawler
3.1. Requirement & scope
// use cases
User: inputs search term & sees relevant pages with titles and snippets.
Service: crawls a list of urls
-> Generates reverse index of words to pages containing the search terms
-> Generates titles and snippets for pages (static)
-> High availability
// Out of scope
Search analytics to tune crawl frequency (popular sites could be re-crawled at shorter intervals)
Personalized search results
Page rank
// Assumptions
Traffic is not evenly distributed
Some searches are popular, some not.
Anonymous users
FAST: (1) generating search results
The crawler must not get stuck in an infinite loop when the link graph contains a cycle
1b links to crawl, re-crawled regularly to ensure freshness (refresh rate of about once per week)
4b crawled each month
100b searches per month
3.2. Calculate usage
Size per page ~ 500KB
stored pages per month = 500KB * 4b = 2 PB
stored pages in 3 yrs = 2 PB * 36 = 72 PB
write requests per second = 4b / 2.5M = 1,600
search requests per second = 100b / 2.5M = 40,000
3.3. Design
3.4 Core
3.4.1 Service crawls a list of urls
links_to_crawl (urls, ranked by popularity) -> crawl -> crawled_links (url, signature)
(1) save links_to_crawl & crawled_links in NoSQL
(2) save ranked identifiers of links_to_crawl in Redis
Reverse Index Service -> generate a reverse index
Document Service -> generate a static title and snippet
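The Reverse Index Service maps each word to the documents that contain it (worked example just below); a minimal build sketch, assuming whitespace tokenization and no stop-word filtering:

from collections import defaultdict

def build_inverted_index(documents):
    # map each word to the set of document ids that contain it
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

index = build_inverted_index(["it is what it is", "what is it", "it is a banana"])
# index["it"] == {0, 1, 2}; index["what"] == {0, 1}; index["banana"] == {2}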
// what is a reverse index
T0 = "it is what it is"
T1 = "what is it"
T2 = "it is a banana"
Inverted index:
"banana": {2}
"it": {0, 1, 2}
"what": {0, 1}
class Crawler:
    def crawl(self):
        while True:
            page = self.memory.extract_max_priority_page()  # take the highest-priority link
            if page is None:
                break
            if self.memory.crawled_similar(page.signature):  # a similar page is already in crawled_links -> likely a cycle
                self.memory.reduce_priority_link_to_crawl(page.url)  # reduce priority to avoid looping
            else:
                self.__crawl_page(page)  # no similar crawled page, so crawl it

    def __crawl_page(self, page):
        page.content = self.fetch(page.url)                         # fetch the page (helper assumed)
        self.reverse_index_queue.put(page)                          # add to ReverseIndexQueue
        self.document_queue.put(page)                               # add to DocumentQueue
        for url in page.child_urls:
            self.memory.insert_link_to_crawl(url)                   # insert child_urls into links_to_crawl
        page.signature = self.compute_signature(page.content)       # calculate the page signature (e.g. simhash)
        self.memory.remove_link_to_crawl(page.url)                  # remove from links_to_crawl
        self.memory.insert_crawled_link(page.url, page.signature)   # insert (url, signature) into crawled_links
// handle duplicates
// url duplication
MapReduce over the crawled urls -> output only entries with frequency = 1 (removes duplicate urls)
// page duplication
IOU (Intersection over Union / Jaccard similarity) of the pages' word sets
word vectors & cosine similarity
simhash (see the sketch below)
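A minimal simhash sketch, assuming unweighted tokens and 64-bit MD5-derived hashes; pages whose fingerprints differ in only a few bits are treated as near-duplicates:

import hashlib

def simhash(text, bits=64):
    # similar pages yield fingerprints with a small Hamming distance
    weights = [0] * bits
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16) & ((1 << bits) - 1)
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if weights[i] > 0)

def hamming_distance(a, b):
    return bin(a ^ b).count("1")   # e.g. treat distance <= 3 as "similar"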
// re-crawl strategy
1. keep a last-crawled timestamp per url
2. one week for regular pages, a shorter interval for frequently-updated pages
3. check page update frequency with a HEAD request for the Last-Modified header (see the sketch below)
4. data mining -> estimate the mean time between updates for each page
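A sketch of option 3 with the requests library, assuming last_crawled_at is a timezone-aware datetime; many servers omit Last-Modified, so the check falls back to the regular schedule:

from email.utils import parsedate_to_datetime
import requests

def needs_recrawl(url, last_crawled_at):
    # HEAD returns the headers without the body, so this check is cheap
    response = requests.head(url, timeout=5, allow_redirects=True)
    last_modified = response.headers.get("Last-Modified")
    if last_modified is None:
        return True                 # no hint -> recrawl on the default schedule
    return parsedate_to_datetime(last_modified) > last_crawled_at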
3.4.2 User inputs a search term and sees a list of relevant pages with titles and snippets
Client -> Web Server -> Query API server
(1) Parses query / Removes markup / Breaks up the text into terms / Fixes typos / Normalizes capitalization
(2) -> Reverse Index Service
find documents matching the query / rank & return top ones
(3) -> Document Service
return titles and snippets
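A sketch of steps (2) and (3) over the inverted index from 3.4.1; term_frequencies is an assumed doc_id -> {term: count} map standing in for a real ranking signal:

def search(index, term_frequencies, query_terms, top_k=10):
    # boolean AND: intersect the posting sets of all query terms
    postings = [index.get(term, set()) for term in query_terms]
    matching = set.intersection(*postings) if postings else set()

    def score(doc_id):
        # rank by summed term frequency (a stand-in for a real scorer)
        return sum(term_frequencies[doc_id].get(term, 0) for term in query_terms)

    # the Document Service would then attach titles and snippets to these doc ids
    return sorted(matching, key=score, reverse=True)[:top_k]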
[1] priority queue = sort on popularity / last modified, etc
[2] politeness router = each queue holds urls from only one domain, and one machine per queue controls the fetch rate.
// e.g. put all urls under https://www.wikipedia.org/ into one queue (handled by one machine)
1. saves the time of rebuilding TCP connections (connection reuse)
2. controls the request rate so the crawler does not break the site's rate limit (see the routing sketch below)
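A sketch of the routing rule, assuming a fixed pool of per-domain queues (the queue count is illustrative):

import hashlib
from urllib.parse import urlparse

NUM_QUEUES = 64   # illustrative number of queues / fetcher machines

def route(url):
    # every url of a given domain maps to the same queue, so one machine paces that domain
    domain = urlparse(url).netloc
    return int(hashlib.md5(domain.encode()).hexdigest(), 16) % NUM_QUEUES

route("https://www.wikipedia.org/")   # same queue as any other www.wikipedia.org url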
// Detection = update detection + duplicate detection
[1] update = curl HEAD request to get the headers without the body (fast)
[2] duplicate = use simhash to detect duplicated or partially copied content (see the simhash sketch in 3.4.1)