SYSTEM DESIGN - Question
System Design
Harvard Youtube course
https://www.youtube.com/watch?v=-W9F__D3oY4
1. Trade off
Performance vs scalability
Latency vs throughput
Availability vs consistency
1.1. Performance vs scalability
scalable = added resources ~ increased performance.
performance problem = slow for a single user.
scalability problem = fast for a single user / slow under heavy load.
1.2. Latency vs throughput
Latency = time to perform some actions
Throughput = requests / time
Aim = maximal throughput with acceptable latency.
1.3. Availability vs consistency
Consistency = read receives [1] the lastet write or [2] an error
Availability = [1] request always receives a response, [2] no guarantee of consistency
Partition Tolerance = [1] The system continues to operate [2] despite partitioning due to network failures
CP = [1] atomic read & write [2] response from the partitioned node -> timeout error.
AP = [1] eventual consistency [2] response the most recent data, but not the latest
1.3.1 Consistency patterns
Weak consistency = [1] data may lose [2] video chat
Eventual consistency = [1] asynchronous data [2] email
Strong consistency = [1] synchronous data [2] RDBMSes
1.3.2 Availability patterns
1. server fail-over (master-slave & master-master)
2. database replication (master-slave & master-master)
2 CONS: [1] added hardware & complexity, [2] potential for loss of data because not in time
99.9% availability - three 9s (downtime per year = 8hr 45min)
99.99% availability - four 9s (downtime per year = 52min)
1.3.3 Availability in parallel vs in sequence
multiple components prone to failure
# in sequence
99.8% = 99.9% * 99.9%
# in parallel
99.9999% = 1 - [(1-99.9%) * (1-99.9%)]
2. DNS
name space to ip = www.google.com to 140.123.456.789
hierarchical & caching
EX = CloudFlare and Route 53
CONS = [1] delay (mitigated by caching), [2] complex (managed by ISP or government), [3] DDoS attack
3. CDN
globally distributed proxys, serving content closer to user
static files = HTML/CSS/JS, photos
improve performance = [1] Users receive content at closer data centers, [2] It is other people's server
Push CDNs = Owner uploads content and rewrite CDN URLs. (small traffic & infrequent upload)
Pull CDNs = Provider grabs new content from your server
CONS: [1] cost, [2] content stale
4. Load Balancer
[1] receive request from client [2] pick a worker to forward request [3] waiting for response [4] forward response to client
distribute incoming client requests to computing resources
EX = HAProxy
PROS
1. Horizontal scaling
2. Avoid unhealthy servers
3. Avoid overloading resources
4. Avoid SPOF
5. SSL termination
6. Session persistence
Distribute methods:
1. Random
2. Round Robin
3. Least loaded
4. Session/cookies/request params
5. Layer 4 = ip (faster)
6. Layer 7 = content (direct video traffic to video servers)
CONS horizontal scaling:
1. complexity (stateless servers, centralized sessions servers)
2. downstream databases suffer more simultaneous connections.
CONS load balancer:
1. performance bottleneck
2. complexity
3. SPOF
5. Reverse proxy
a web server = [1] centralizes internal services [2] unified public interfaces
PROS:
1. Increased security (Hide backend information & IPs)
2. Increased scalability (Clients only see the reverse proxy's IP)
3. SSL termination
4. Caching
5. Serve Static content
CONS:
1. complexity
2. SPOF
5.1 Load balancer vs reverse proxy
load balancer = multiple servers
Reverse proxies = single server
6. Application layer
Microservices = independently deployable, small services
Service Discovery = [1] Consul, Zookeeper [2] find services by keeping track of registered names
CONS: complexity
7. Database
7.1 RDBMS
ACID RDBMS transactions.
1. Atomicity = transaction is all or nothing
2. Consistency = transaction brings DB from one state to another
3. Isolation = exec transactions concurrently or serially is the same
4. Durability = Once committed, it will remain so
Scale RDBMS
1. master-slave (replicate writes to more read slaves) (CONS: [1] more logics [2] CONS Replication)
2. master-master (CONS: [1] more logics, [2] CONS Replication, [3] LBs, [4] Sync & conflict resolution )
3. federation (split by functions) (CONS: complex join & logic)
4. sharding (split by word order) (CONS: complex join & logic)
5. denormalization (Redundant copies in multi-tables to avoid expensive joins) (CONS: [1] sync duplication, [2] add write loading)
6. SQL tuning (benchmark and profile)
PROS of federation & sharding = [1] less read and write traffic [2] more cache hits [3] reduce index size
CONS replication
1. loss data before ready
2. more slaves more replication lag
3. replaying writing stuck in slaves
4. complexity
7.2 NoSQL
BASE
1. Basically available = guarantees availability.
2. Soft state = DB state may change over time, even without input.
3. Eventual consistency = Async
Types
1. Key-value store = O(1) by word order
2. Document store = JSON, MongoDB (collections of key-value)
3. Wide column store = ColumnFamily (Bigtable, HBase, Cassandra)
4. Graph database = complex relationships, such as a social network
Reasons for SQL:
1. Structured data
2. Strict schema
3. complex joins
4. Transactions
Reasons for NoSQL:
1. Semi-structured data
2. flexible schema
3. big data
4. Very data intensive / high throughput workload
Well-suited for NoSQL:
1. log data
2. Temporary data, such as a shopping cart
3. Frequently accessed ('hot') tables
8. Cache
[1] lookup cache [2] found & return [3] goto DB, save in cache, and return
CONS cache: [1] consistency between cache & DB [2] expiration [3] add logics
cache update strategy
Cache-aside = cache<-clien->DB
STEP: [1] look cache [2] miss & read DB [3] save to cache and return
CONS: [1] three trips in cache miss [2] stale data [3] replace new node without memory
Write-through = client->cache->db
STEP: [1] read & write to cache [2] sync to DB [3] return
CONS: [1] slow write fast read [2] replace new node without memory
Write-behind = client->cache~~>db
STEP: [1] read & write to cache [2] async to DB & return
CONS: [1] loss data before ready [2] complex
Refresh-ahead = automatically refresh
CONS: [1] Difficult to predict which to refresh
8.1 MongoDB vs Redis vs Memcached
http://stackoverflow.com/questions/7888880/what-is-redis-and-what-do-i-use-it-for
When to use Redis: If you can map a case to Redis and discover you aren't at risk of running out of RAM.
MongoDB =
key-value
+disk-based store
.Redis =
key-value
+built-in persistence
.Memcached = Redis -
built-in persistence
.
Persistence to disk = you can use Redis as a real DB instead of a volatile cache. The data won't disappear when you restart as with Memcached.
9. Asynchronism
Reduce request times for expensive operations
The user is not blocked, and the job is processed in the background.
Message queues = RabbitMQ
Task queues = Celery
Back pressure: limiting the queue size
10. Communication
10.1 HTTP
encoding and transporting data between a client and a server
request/response protocol
verb + resource
Level 7 on top of TCP/IP
10.2 REST & RPC
RPC
REST
request-response
request-response
API
private
public
syntax
customized
general
implementation
tight coupling
minimizes coupling
stateless and cacheable
PROS
Better performance, customized syntx
general purpose, less couple
CONS
difficult to debug, tightly coupled
few verbs only, difficult to name for nested hierarchies
- - - - - - - REST - - - - - - -
GET /someresources/anId
PUT /someresources/anId
{
"anotherdata": "another value"
}
- - - - - - - RPC - - - - - - -
GET /someoperation?data=anId
POST /anotheroperation
{
"data":"anId";
"anotherdata": "another value"
}
10.3 TCP & UDP
TCP/IP
UDP
connection-oriented
connectionless
order
receive in order
order not guarantee
IP, hankshake = ack + checksum
broadcast to subnet (in DHCP since no IP yet)
no correct response
[1] resend the packets [2] timeouts
X
data loss vs latency
intact data, less time critical
data may lose, real-time
example
web servers, database, and SSH.
video chat
11. Security
1. XSS and SQL injection = Sanitize all user inputs
2. SQL injection = parameterized queries
3. least privilege.
Appendix
A1. power of 2
Power Exact Value Approx Value Bytes
---------------------------------------------------------------
8 256
10 1024 1 thousand 1 KB
16 65,536 64 KB
20 1,048,576 1 million 1 MB
30 1,073,741,824 1 billion 1 GB
40 1,099,511,627,776 1 trillion 1 TB
A2. metrics
Read sequentially from disk at 40 MB/s (100)
Read sequentially from 1 Gbps Ethernet at 100 MB/s (40)
Read sequentially from SSD at 1 GB/s (4)
Read sequentially from main memory at 4 GB/s (1)
Reading 1 MB sequentially from memory takes about 250 us
SSD 4x
Disk 80x
6-7 world-wide round trips per second
2,000 round trips per second within a data center
L1 cache reference 0.5 ns
L2 cache reference 7 ns 14x L1 cache
Compress 1K bytes with Zippy 10,000 ns 10 us
Round trip within same datacenter 500,000 ns 500 us
Disk seek 10,000,000 ns 10,000 us 10 ms 20x datacenter roundtrip
Send packet CA->Netherlands->CA 150,000,000 ns 150,000 us 150 ms
A3. per seconds
2.5M seconds per month
400 requests per second = 1 billion requests per month
A4. varchar & char
char 格式, 資料不足長度者後面補空白,儲存空間固定
varchar 格式資料實際之內容,儲存空間不固定
單就儲存空間上來看
char(5)
寫入 'ab' 資料庫為 'ab ' 5 byptes # 會多存3個空白
寫入 'abcd' 資料庫為 'abcd ' 5 byptes # 會多存1個空白
寫入 'abcde' 資料庫為 'abcde' 5 byptes
寫入 'abcdef' 資料庫為 'abcde' 5 byptes
varchar(5)
寫入 'ab' 資料庫為 'ab' 3 byptes # 多 1 byte 是存長度
寫入 'abcd' 資料庫為 'abcd' 5 byptes
寫入 'abcde' 資料庫為 'abcde' 6 byptes
寫入 'abcdef' 資料庫為 'abcde' 6 byptes
Last updated
Was this helpful?