Complat software training 101

SYSTEM DESIGN - EX3

1. Amazon's sales rank by category feature

1.1. Requirement & scope

// Use cases
User: views lwmppbc (last week's most popular products by category)
Service: calculates lwmppbc
Service: high availability

// Out of scope
other e-commerce components
// Assumptions
Traffic is not evenly distributed
Items can be in multiple categories
Items cannot change categories

no subcategories
Rankings are updated hourly; more popular products may be updated more frequently

1.2. Calculation

10M products
1K categories
1b transactions / month
100b read / month
100:1 read to write ratio

40GB transaction / month = 1b * 40B
1.44TB transaction / 3yrs
400 transactions / second
40,000 read / second

5B created_at
8B product_id
4B category_id
8B seller_id
8B buyer_id
4B quantity
5B total_price

total size ~ 40B
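A quick sanity check of the estimates above (assuming the usual back-of-envelope value of ~2.5 million seconds per month):

```python
# Back-of-envelope check for the numbers above; ~40 B per transaction
# record comes from the field-size list.
SECONDS_PER_MONTH = 2.5e6

transactions_per_month = 1e9
reads_per_month = 100e9
record_size_bytes = 40

monthly_bytes = transactions_per_month * record_size_bytes
print(monthly_bytes / 1e9)                          # 40.0  -> ~40 GB / month
print(monthly_bytes * 36 / 1e12)                    # 1.44  -> ~1.44 TB / 3 yrs
print(transactions_per_month / SECONDS_PER_MONTH)   # 400.0 writes / second
print(reads_per_month / SECONDS_PER_MONTH)          # 40000.0 reads / second
```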

1.3. Design

1.4. Core

1.4.1 Use case: Service calculates lwmppbc

Save Sales API server log files to S3

// sample log
timestamp   product_id  category_id    qty     total_price   seller_id    buyer_id
t1          product1    category1      2       20.00         1            1
t2          product1    category2      2       20.00         2            2
t2          product1    category2      1       10.00         2            3
Sales Rank Service -> MapReduce over sale log files -> write aggregated sale_rank table to SQL DB
    -> MapReduce step 1 (category, product_id), sum(quantity)
    -> MapReduce step 2 distributed sort

// sorted list
(category1, 1), product4
(category1, 2), product1
(category1, 3), product2
(category2, 3), product1
(category2, 7), product3
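The two MapReduce steps can be sketched in-process with plain Python (toy log rows mirroring the sample log above; a real run would shard the map and sort across workers):

```python
from collections import defaultdict

# Log rows: (timestamp, product_id, category_id, qty)
log = [
    ("t1", "product1", "category1", 2),
    ("t2", "product1", "category2", 2),
    ("t2", "product1", "category2", 1),
]

# Step 1: map each sale to (category, product) and sum quantities.
totals = defaultdict(int)
for _ts, product, category, qty in log:
    totals[(category, product)] += qty

# Step 2: "distributed sort" -- within each category, order products
# by total sold, descending.
ranked = sorted(totals.items(), key=lambda kv: (kv[0][0], -kv[1]))
for (category, product), total in ranked:
    print(category, product, total)
```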
// sale_rank table
CREATE TABLE sale_rank (
    id int NOT NULL AUTO_INCREMENT,
    category_id int NOT NULL,
    product_id int NOT NULL,
    total_sold int NOT NULL,
    PRIMARY KEY(id),
    FOREIGN KEY(category_id) REFERENCES Categories(id),
    FOREIGN KEY(product_id) REFERENCES Products(id)
);
// index on (category_id, total_sold): full scan O(N) -> index lookup O(log N)

1.4.2 Use case: User views lwmppbc

client -> Web Server -> Read API server -> reads from the SQL Database sales_rank table
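A runnable sketch of this read path, using in-memory sqlite3 as a stand-in for the MySQL sale_rank table (index name and sample rows are made up for illustration):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE sale_rank (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        category_id INTEGER NOT NULL,
        product_id INTEGER NOT NULL,
        total_sold INTEGER NOT NULL
    )
""")
# Composite index so "top products in a category" is an index scan,
# not a full table scan (the O(N) -> O(log N) note above).
db.execute("CREATE INDEX idx_cat_sold ON sale_rank (category_id, total_sold)")

db.executemany(
    "INSERT INTO sale_rank (category_id, product_id, total_sold) VALUES (?, ?, ?)",
    [(1, 4, 50), (1, 1, 30), (1, 2, 10), (2, 1, 30)],
)

# Read API: most popular products for category 1.
rows = db.execute(
    "SELECT product_id, total_sold FROM sale_rank "
    "WHERE category_id = ? ORDER BY total_sold DESC LIMIT 10",
    (1,),
).fetchall()
print(rows)  # [(4, 50), (1, 30), (2, 10)]
```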

1.5. Scale

Analytics DB = Amazon Redshift or Google BigQuery
    -> limited time period in DB -> long term storage in S3

40K read / second
    -> serve popular content from cache
    -> SQL Read Replicas + DB for cache miss
    -> SQL scaling
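The cache-plus-replica read path above follows the cache-aside pattern; a minimal sketch, with a dict and a stub function standing in for the cache and the SQL Read Replicas (names and data are hypothetical):

```python
cache = {}

def query_read_replica(category_id):
    # Stub for a SELECT against the sale_rank table on a read replica.
    return [("product4", 50), ("product1", 30)]

def get_rankings(category_id, hour_bucket):
    # Keying by hour matches the "updated hourly" assumption: old
    # entries simply stop being requested when the hour rolls over.
    key = (category_id, hour_bucket)
    if key not in cache:                      # cache miss -> hit replica
        cache[key] = query_read_replica(category_id)
    return cache[key]                         # cache hit -> no DB load

print(get_rankings("category1", hour_bucket=0))
```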

400 write / second
    -> SQL scaling

2. Scales to millions of users on AWS

2.1. Requirement & scope

// Use cases
User: read or write request
Service: processes the request, stores user data, then returns results
Service: scale & high availability
// Assumptions
Not even traffic
Gradual user growth: Users+ -> Users++ -> Users+++

2.2. Calculation

10M users
1b writes / month
100b reads / month
100:1 read to write ratio
1KB content per write

1 TB content / month = 1KB * 1b write
36 TB content / 3 years
400 writes / second
40,000 reads / second
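The same style of estimate, checked in code (again assuming ~2.5 million seconds per month):

```python
SECONDS_PER_MONTH = 2.5e6
writes, reads, write_size = 1e9, 100e9, 1e3   # 1 KB content per write

print(writes * write_size / 1e12)        # 1.0   -> ~1 TB content / month
print(writes * write_size * 36 / 1e12)   # 36.0  -> ~36 TB / 3 years
print(writes / SECONDS_PER_MONTH)        # 400.0 writes / second
print(reads / SECONDS_PER_MONTH)         # 40000.0 reads / second
```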

2.3. Design

2.4. Core

2.4.1 User makes a read or write request

(1) Step 1

// Simple starting point -> Vertical scaling when needed -> Monitor bottleneck
Web server on EC2 + MySQL DB (need for relational)
Monitor = CPU, memory, IO, network, etc // CloudWatch, top, nagios, statsd, graphite, etc

// Assign a public static IP
Elastic IPs => a public endpoint whose IP doesn't change on reboot
Failover => point domain to a new IP

// Use a DNS
Route 53 => map domain to the instance's public IP.

// Secure the web server
Open only: HTTP:80 / HTTPS:443 / SSH:22 + whitelisted ips
Prevent the web server from initiating outbound connections
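A minimal host-level check in the spirit of the "Monitor" step; the real setup would use CloudWatch, top, nagios, etc. This is Unix-only (`os.getloadavg` is not available on Windows), and the thresholds are illustrative:

```python
import os
import shutil

load1, _load5, _load15 = os.getloadavg()   # CPU load averages
disk = shutil.disk_usage("/")              # disk headroom
disk_used_pct = 100 * disk.used / disk.total

print(f"load(1m)={load1:.2f} disk_used={disk_used_pct:.1f}%")
if load1 > os.cpu_count() or disk_used_pct > 90:
    print("bottleneck suspected -> consider vertical scaling")
```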

(2) Step 2

The MySQL DB consumes more and more memory and CPU, and runs out of disk space.

// spread the load // trade-offs: cost, complexity, security
1. use S3 for static files (Server side encryption) (JS/CSS/images/videos/user files)
2. move MySQL to other machine (Amazon Relational Database Service (RDS)) (Multiple availability zones)

// Security
Virtual Private Cloud
    -> public subnet for single Web Server -> send & receive from internet
    -> private subnet for everything else -> preventing outside access
    -> Only open ports from whitelisted IPs for each component

(3) Step 3

Slow web server

// Horizontal Scaling = handle increasing load + remove single points of failure
Load Balancer = Amazon's ELB or HAProxy
1. Terminate SSL at the load balancer = (1) reduces backend server load (2) simplifies certificate administration
2. Multiple Web Servers across multiple zones
3. Multiple MySQL instances in Master-Slave Failover mode across multiple availability zones to improve redundancy

// Separate Web Servers from App Servers(Read / Write API)

// CDN: static (and some dynamic) content (CloudFront)
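Round-robin is the simplest of the balancing policies an ELB/HAProxy layer might use across the multi-zone web servers; a toy sketch (instance names are hypothetical):

```python
import itertools

web_servers = ["web-1a", "web-1b", "web-2a"]   # multi-zone instances
rotation = itertools.cycle(web_servers)

def route(request):
    # Each request goes to the next server in the rotation, so load
    # spreads evenly and no single server is a point of failure.
    return next(rotation)

print([route(r) for r in range(5)])
# ['web-1a', 'web-1b', 'web-2a', 'web-1a', 'web-1b']
```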

(4) Step 4

read-heavy (100:1 with writes) & poor DB performance

// spread read load
1. Cache = (1) configure the MySQL cache if sufficient (2) move frequently accessed data to a cache (ElastiCache)
2. MySQL read replica (Load Balancers in front of MySQL Read Replicas)

(5) Step 5

heavy load during office hours => autoscaling

// autoscaling DevOps: Chef, Puppet, Ansible
// monitoring
Host level = Review a single EC2 instance
Aggregate level = Review load balancer stats
Log analysis = CloudWatch, CloudTrail
External site performance = Pingdom or New Relic
Handle notifications and incidents = PagerDuty
Error Reporting = Sentry

//  AWS Autoscaling in multiple zones
one group for each Web Server 
one group for each Application Server type
Set a min and max number of instances
Trigger to scale up and down through CloudWatch = CPU load / Latency / Network traffic
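A threshold trigger in the spirit of the CloudWatch rules above: scale out on sustained high CPU, scale in when idle, clamped to the group's min/max instance counts. The thresholds here are made up for illustration:

```python
MIN_INSTANCES, MAX_INSTANCES = 2, 10

def desired_count(current, avg_cpu_pct):
    if avg_cpu_pct > 70:          # scale-up trigger
        current += 1
    elif avg_cpu_pct < 20:        # scale-down trigger
        current -= 1
    # Never drop below the floor or exceed the ceiling.
    return max(MIN_INSTANCES, min(MAX_INSTANCES, current))

print(desired_count(4, 85))   # 5 -> office-hours load, add a server
print(desired_count(4, 10))   # 3 -> quiet period, remove one
print(desired_count(2, 10))   # 2 -> already at the minimum
```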

(6) Step 6

Load continues to grow.

// MySQL
1. storing a limited time period of data in the database
2. storing the rest in a data warehouse such as Redshift (handles 1 TB/month of new content easily) (async)
3. Cache = 40K read/sec + uneven traffic
4. SQL scaling = 400 write / sec

// move high read/write data to NoSQL
// run heavy computation asynchronously
// example = photo service
Client upload -> App Server -> Queue (SQS) -> Worker Service on EC2
    -> Creates thumbnail / Updates DB / Stores in S3
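The async pipeline above can be sketched with stdlib pieces: `queue.Queue` standing in for SQS and a thread standing in for the Worker Service on EC2 (the thumbnail step is stubbed out):

```python
import queue
import threading

tasks = queue.Queue()   # stand-in for SQS
results = []

def worker():
    while True:
        photo = tasks.get()
        if photo is None:          # sentinel: shut the worker down
            break
        # Creates thumbnail / updates DB / stores in S3 would happen here.
        results.append(f"thumb-{photo}")
        tasks.task_done()

t = threading.Thread(target=worker)
t.start()

for photo in ["cat.jpg", "dog.jpg"]:   # App Server enqueues uploads
    tasks.put(photo)
tasks.put(None)
t.join()
print(results)  # ['thumb-cat.jpg', 'thumb-dog.jpg']
```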

Last updated 5 years ago


// Diagram note: autoscaling groups not shown to reduce clutter