SYSTEM DESIGN - Question

System Design

Harvard Youtube course

https://www.youtube.com/watch?v=-W9F__D3oY4

1. Trade off

Performance vs scalability
Latency vs throughput
Availability vs consistency

1.1. Performance vs scalability

scalable = added resources ~ increased performance.

performance problem = slow for a single user.
scalability problem = fast for a single user / slow under heavy load.

1.2. Latency vs throughput

Latency = time to perform some actions
Throughput = requests / time

Aim = maximal throughput with acceptable latency.

1.3. Availability vs consistency

Consistency = read receives [1] the lastet write or [2] an error
Availability = [1] request always receives a response, [2] no guarantee of consistency
Partition Tolerance = [1] The system continues to operate [2] despite partitioning due to network failures

CP = [1] atomic read & write [2] response from the partitioned node -> timeout error.
AP = [1] eventual consistency [2] response the most recent data, but not the latest

1.3.1 Consistency patterns

Weak consistency = [1] data may lose [2] video chat
Eventual consistency = [1] asynchronous data [2] email
Strong consistency = [1] synchronous data [2] RDBMSes

1.3.2 Availability patterns

1. server fail-over (master-slave & master-master)
2. database replication (master-slave & master-master)

2 CONS: [1] added hardware & complexity, [2] potential for loss of data because not in time

99.9% availability - three 9s (downtime per year = 8hr 45min)
99.99% availability - four 9s (downtime per year = 52min)

1.3.3 Availability in parallel vs in sequence

multiple components prone to failure

# in sequence
99.8% = 99.9% * 99.9%

# in parallel
99.9999% = 1 - [(1-99.9%) * (1-99.9%)]

2. DNS

name space to ip = www.google.com to 140.123.456.789

hierarchical & caching

EX = CloudFlare and Route 53

CONS = [1] delay (mitigated by caching), [2] complex (managed by ISP or government), [3] DDoS attack

3. CDN

globally distributed proxys, serving content closer to user

static files = HTML/CSS/JS, photos
improve performance = [1] Users receive content at closer data centers, [2] It is other people's server

Push CDNs = Owner uploads content and rewrite CDN URLs. (small traffic & infrequent upload)
Pull CDNs = Provider grabs new content from your server

CONS: [1] cost, [2] content stale

4. Load Balancer

[1] receive request from client [2] pick a worker to forward request [3] waiting for response [4] forward response to client

distribute incoming client requests to computing resources

EX = HAProxy

PROS
1. Horizontal scaling
2. Avoid unhealthy servers
3. Avoid overloading resources
4. Avoid SPOF
5. SSL termination
6. Session persistence

Distribute methods:
1. Random
2. Round Robin
3. Least loaded
4. Session/cookies/request params
5. Layer 4 = ip (faster)
6. Layer 7 = content (direct video traffic to video servers)

CONS horizontal scaling:
1. complexity (stateless servers, centralized sessions servers)
2. downstream databases suffer more simultaneous connections.

CONS load balancer: 
1. performance bottleneck
2. complexity
3. SPOF

5. Reverse proxy

a web server = [1] centralizes internal services [2] unified public interfaces

PROS:
1. Increased security (Hide backend information & IPs)
2. Increased scalability (Clients only see the reverse proxy's IP)
3. SSL termination
4. Caching
5. Serve Static content

CONS:
1. complexity
2. SPOF

5.1 Load balancer vs reverse proxy

load balancer = multiple servers

Reverse proxies = single server

6. Application layer

Microservices = independently deployable, small services

Service Discovery = [1] Consul, Zookeeper [2] find services by keeping track of registered names

CONS: complexity

7. Database

7.1 RDBMS

ACID RDBMS transactions.
1. Atomicity = transaction is all or nothing
2. Consistency = transaction brings DB from one state to another
3. Isolation = exec transactions concurrently or serially is the same
4. Durability = Once committed, it will remain so

Scale RDBMS
1. master-slave (replicate writes to more read slaves) (CONS: [1] more logics [2] CONS Replication)
2. master-master (CONS: [1] more logics, [2] CONS Replication, [3] LBs, [4] Sync & conflict resolution )
3. federation (split by functions) (CONS: complex join & logic)
4. sharding (split by word order) (CONS: complex join & logic)
5. denormalization (Redundant copies in multi-tables to avoid expensive joins) (CONS: [1] sync duplication, [2] add write loading)
6. SQL tuning (benchmark and profile)

PROS of federation & sharding = [1] less read and write traffic [2] more cache hits [3] reduce index size

CONS replication
1. loss data before ready
2. more slaves more replication lag
3. replaying writing stuck in slaves
4. complexity

7.2 NoSQL

BASE
1. Basically available = guarantees availability.
2. Soft state = DB state may change over time, even without input.
3. Eventual consistency = Async

Types
1. Key-value store = O(1) by word order
2. Document store = JSON, MongoDB (collections of key-value)
3. Wide column store = ColumnFamily (Bigtable, HBase, Cassandra)
4. Graph database = complex relationships, such as a social network

Reasons for SQL:
1. Structured data
2. Strict schema
3. complex joins
4. Transactions


Reasons for NoSQL:
1. Semi-structured data
2. flexible schema
3. big data
4. Very data intensive / high throughput workload

Well-suited for NoSQL:
1. log data
2. Temporary data, such as a shopping cart
3. Frequently accessed ('hot') tables

8. Cache

[1] lookup cache [2] found & return [3] goto DB, save in cache, and return

CONS cache: [1] consistency between cache & DB [2] expiration [3] add logics

cache update strategy

Cache-aside = cache<-clien->DB
STEP: [1] look cache [2] miss & read DB [3] save to cache and return
CONS: [1] three trips in cache miss [2] stale data [3] replace new node without memory

Write-through = client->cache->db
STEP: [1] read & write to cache [2] sync to DB [3] return
CONS: [1] slow write fast read [2] replace new node without memory

Write-behind = client->cache~~>db
STEP: [1] read & write to cache [2] async to DB & return
CONS: [1] loss data before ready [2] complex 

Refresh-ahead = automatically refresh
CONS: [1] Difficult to predict which to refresh

8.1 MongoDB vs Redis vs Memcached

http://stackoverflow.com/questions/7888880/what-is-redis-and-what-do-i-use-it-for

When to use Redis: If you can map a case to Redis and discover you aren't at risk of running out of RAM.

MongoDB = key-value + disk-based store.
Redis = key-value + built-in persistence.
Memcached = Redis - built-in persistence.

Persistence to disk = you can use Redis as a real DB instead of a volatile cache. The data won't disappear when you restart as with Memcached.

9. Asynchronism

Reduce request times for expensive operations
The user is not blocked, and the job is processed in the background.

Message queues = RabbitMQ
Task queues = Celery

Back pressure: limiting the queue size

10. Communication

10.1 HTTP

encoding and transporting data between a client and a server
request/response protocol
verb + resource
Level 7 on top of TCP/IP

10.2 REST & RPC

RPC

REST

request-response

API

private

public

syntax

customized

general

implementation

tight coupling

minimizes coupling

stateless and cacheable

PROS

Better performance, customized syntx

general purpose, less couple

CONS

difficult to debug, tightly coupled

few verbs only, difficult to name for nested hierarchies

- - - - - - - REST - - - - - - - 
GET /someresources/anId

PUT /someresources/anId
{
  "anotherdata": "another value"
}

- - - - - - - RPC - - - - - - - 
GET /someoperation?data=anId

POST /anotheroperation
{
  "data":"anId";
  "anotherdata": "another value"
}

10.3 TCP & UDP

TCP/IP

UDP

connection-oriented

connectionless

order

receive in order

order not guarantee

IP, hankshake = ack + checksum

broadcast to subnet (in DHCP since no IP yet)

no correct response

[1] resend the packets [2] timeouts

data loss vs latency

intact data, less time critical

data may lose, real-time

example

web servers, database, and SSH.

video chat

11. Security

1. XSS and SQL injection = Sanitize all user inputs
2. SQL injection = parameterized queries
3. least privilege.

Appendix

A1. power of 2

Power           Exact Value         Approx Value        Bytes
---------------------------------------------------------------
8                             256
10                           1024   1 thousand           1 KB
16                         65,536                       64 KB
20                      1,048,576   1 million            1 MB
30                  1,073,741,824   1 billion            1 GB
40              1,099,511,627,776   1 trillion           1 TB

A2. metrics

Read sequentially from disk at 40 MB/s (100)
Read sequentially from 1 Gbps Ethernet at 100 MB/s (40)
Read sequentially from SSD at 1 GB/s (4)
Read sequentially from main memory at 4 GB/s (1)

Reading 1 MB sequentially from memory takes about 250 us
SSD 4x 
Disk 80x

6-7 world-wide round trips per second
2,000 round trips per second within a data center

L1 cache reference                           0.5 ns
L2 cache reference                           7   ns                      14x L1 cache
Compress 1K bytes with Zippy            10,000   ns       10 us
Round trip within same datacenter      500,000   ns      500 us
Disk seek                           10,000,000   ns   10,000 us   10 ms  20x datacenter roundtrip
Send packet CA->Netherlands->CA    150,000,000   ns  150,000 us  150 ms

A3. per seconds

2.5M seconds per month

400 requests per second = 1 billion requests per month

A4. varchar & char

char 格式, 資料不足長度者後面補空白,儲存空間固定
varchar 格式資料實際之內容,儲存空間不固定

單就儲存空間上來看
char(5)
寫入 'ab' 資料庫為 'ab ' 5 byptes # 會多存3個空白
寫入 'abcd' 資料庫為 'abcd ' 5 byptes # 會多存1個空白
寫入 'abcde' 資料庫為 'abcde' 5 byptes
寫入 'abcdef' 資料庫為 'abcde' 5 byptes

varchar(5)
寫入 'ab' 資料庫為 'ab' 3 byptes # 多 1 byte 是存長度
寫入 'abcd' 資料庫為 'abcd' 5 byptes
寫入 'abcde' 資料庫為 'abcde' 6 byptes
寫入 'abcdef' 資料庫為 'abcde' 6 byptes

PreviousRDKit - rxn NextSYSTEM DESIGN - EX1

Last updated 5 years ago

Was this helpful?