Scalability problem = the system is fast for a single user but slow under heavy load.
1.2. Latency vs throughput
Latency = time to perform some action (e.g., serve one request)
Throughput = number of such actions per unit of time (e.g., requests per second)
Aim = maximal throughput with acceptable latency.
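A minimal sketch of the difference, assuming a hypothetical handle_request() doing ~10 ms of work:

```python
import time

def handle_request():
    time.sleep(0.01)  # hypothetical unit of work (~10 ms per request)

# Latency: time to perform one action.
start = time.perf_counter()
handle_request()
latency_s = time.perf_counter() - start

# Throughput: actions completed per unit of time.
n = 100
start = time.perf_counter()
for _ in range(n):
    handle_request()
throughput = n / (time.perf_counter() - start)

print(f"latency ~{latency_s * 1000:.1f} ms, throughput ~{throughput:.0f} req/s")
```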
1.3. Availability vs consistency
Consistency = every read receives [1] the latest write or [2] an error
Availability = [1] every request receives a response [2] without a guarantee that it contains the latest write
Partition Tolerance = [1] The system continues to operate [2] despite partitioning due to network failures
CP = [1] atomic reads & writes [2] waiting for a response from the partitioned node may result in a timeout error
AP = [1] eventual consistency [2] responses return the most readily available version of the data, which may not be the latest
1.3.1 Consistency patterns
Weak consistency = [1] after a write, reads may or may not see it; data can be lost [2] e.g., video chat
Eventual consistency = [1] after a write, reads will eventually see it (data is replicated asynchronously) [2] e.g., email
Strong consistency = [1] after a write, reads will see it (data is replicated synchronously) [2] e.g., RDBMSes
Load balancer distribution methods (see the sketch after this list):
1. Random
2. Round Robin
3. Least loaded
4. Session/cookies/request params
5. Layer 4 = routes on network-level info (source/destination IPs and ports); faster
6. Layer 7 = routes on request content (e.g., direct video traffic to video servers)
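A minimal sketch of a few of these methods (random, round robin, least loaded), assuming a hypothetical in-memory list of backends:

```python
import itertools
import random

# Hypothetical pool of backend servers (IPs are placeholders).
backends = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
active_connections = {b: 0 for b in backends}

_rr = itertools.cycle(backends)

def pick_random():
    return random.choice(backends)

def pick_round_robin():
    return next(_rr)

def pick_least_loaded():
    # backend with the fewest currently active connections
    return min(active_connections, key=active_connections.get)
```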
5. Reverse proxy
Reverse proxy = a web server that [1] centralizes internal services and [2] provides unified interfaces to the public
PROS:
1. Increased security (Hide backend information & IPs)
2. Increased scalability (Clients only see the reverse proxy's IP)
3. SSL termination
4. Caching
5. Serve Static content
CONS:
1. complexity
2. SPOF
5.1 Load balancer vs reverse proxy
Load balancer = useful when you have multiple servers to route traffic to
Reverse proxy = useful even with a single web server
6. Application layer
Microservices = independently deployable, small services
Service Discovery = [1] e.g., Consul, Zookeeper [2] services find each other via a registry of names, addresses, and ports
CONS: complexity
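A minimal in-memory sketch of the registry idea behind Consul/Zookeeper; the service name, host, and port are hypothetical:

```python
# service name -> list of (host, port) instances
registry = {}

def register(name, host, port):
    registry.setdefault(name, []).append((host, port))

def discover(name):
    instances = registry.get(name)
    if not instances:
        raise LookupError(f"no registered instance of {name}")
    return instances[0]  # real systems also health-check and load balance

register("user-service", "10.0.0.5", 8080)   # a service announces itself
host, port = discover("user-service")        # a client looks it up by name
```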
7. Database
7.1 RDBMS
ACID = properties of RDBMS transactions (atomicity sketch below):
1. Atomicity = transaction is all or nothing
2. Consistency = transaction brings DB from one state to another
3. Isolation = executing transactions concurrently has the same result as executing them serially
4. Durability = once a transaction is committed, it remains committed
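A minimal atomicity sketch using Python's built-in sqlite3: both updates of a hypothetical transfer commit together or roll back together.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 0)")
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on any exception
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 50 WHERE id = 2")
except sqlite3.Error:
    pass  # if anything failed, neither update was applied (atomicity)
```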
Ways to scale an RDBMS:
1. master-slave replication (the master handles writes and replicates them to read-only slaves) (CONS: [1] extra logic to promote a slave if the master fails [2] see CONS of replication)
2. master-master replication (CONS: [1] extra application logic or a load balancer to decide where to write [2] see CONS of replication [3] either loosely consistent or higher write latency from synchronization [4] conflict resolution)
3. federation (split databases by function) (CONS: complex cross-DB joins & more application logic)
4. sharding (split one data set across DBs, e.g., by key range or by hash of the key; sketch below) (CONS: complex cross-shard joins & more application logic)
5. denormalization (keep redundant copies of data in multiple tables to avoid expensive joins) (CONS: [1] keeping the duplicates in sync [2] heavier write load)
6. SQL tuning (benchmark and profile)
PROS of federation & sharding = [1] less read and write traffic to each DB [2] more cache hits [3] smaller indexes
CONS of replication:
1. potential loss of data written to the master before it has been replicated
2. the more read slaves, the greater the replication lag
3. replaying writes on the slaves can slow down their reads
4. complexity
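A minimal sketch of key-based sharding (item 4 above), with hypothetical shard names: a stable hash of the key picks one of N shards.

```python
import hashlib

# Hypothetical shard names; in practice these would be separate database connections.
SHARDS = ["users_0", "users_1", "users_2", "users_3"]

def shard_for(user_id: str) -> str:
    # Stable hash so the same key always maps to the same shard.
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

shard_for("alice")  # always returns the same shard for "alice"
```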
7.2 NoSQL
BASE
1. Basically available = the system guarantees availability
2. Soft state = the state of the system may change over time, even without input
3. Eventual consistency = the system becomes consistent over time, asynchronously
Types
1. Key-value store = O(1) reads & writes by key; keys may be kept in lexicographic order (e.g., Redis, Memcached; sketch below)
2. Document store = stores documents (e.g., JSON); a key-value store with documents as values (e.g., MongoDB)
3. Wide column store = ColumnFamily (Bigtable, HBase, Cassandra)
4. Graph database = complex relationships, such as a social network
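A minimal key-value sketch using the redis-py client, assuming a Redis instance on localhost; the keys and TTL are illustrative only.

```python
import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379)

# O(1) reads and writes by key; ex sets a TTL in seconds.
r.set("user:42:name", "alice", ex=3600)
name = r.get("user:42:name")   # b"alice", or None if missing/expired
```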
Reasons for SQL:
1. Structured data
2. Strict schema
3. complex joins
4. Transactions
Reasons for NoSQL:
1. Semi-structured data
2. flexible schema
3. big data
4. Very data intensive / high throughput workload
Well-suited for NoSQL:
1. log data
2. Temporary data, such as a shopping cart
3. Frequently accessed ('hot') tables
8. Cache
General flow: [1] look up the key in the cache [2] if found, return the cached value [3] otherwise go to the DB, save the result in the cache, and return it
CONS of caching: [1] keeping the cache consistent with the DB [2] deciding when to expire/invalidate entries [3] extra application logic
Cache update strategies
Cache-aside = cache <- client -> DB (the application reads from and writes to both; sketch below)
STEPS: [1] look in the cache [2] on a miss, read from the DB [3] save the result to the cache and return it
CONS: [1] three trips on a cache miss [2] stale data if the DB is updated behind the cache [3] a new or replaced cache node starts empty (cold)
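A minimal cache-aside sketch following the steps above, using a plain dict as the cache and a hypothetical db_get_user() query:

```python
cache = {}  # stand-in for Memcached/Redis

def db_get_user(user_id):
    """Hypothetical (slow) database query."""
    return {"id": user_id, "name": "alice"}

def get_user(user_id):
    key = f"user:{user_id}"
    value = cache.get(key)        # [1] look in the cache
    if value is not None:
        return value              #     hit: return the cached entry
    value = db_get_user(user_id)  # [2] miss: read from the DB
    cache[key] = value            # [3] save to the cache and return
    return value
```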
Write-through = client -> cache -> DB (the application treats the cache as its main data store; sketch below)
STEPS: [1] the application reads and writes through the cache [2] the cache synchronously writes to the DB [3] return
CONS: [1] slower writes (reads stay fast) [2] a new or replaced cache node starts empty until entries are written again
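A minimal write-through sketch; db_set is a hypothetical function that persists to the DB:

```python
class WriteThroughCache:
    def __init__(self, db_set):
        self.store = {}
        self.db_set = db_set        # hypothetical function that persists (key, value) to the DB

    def set(self, key, value):
        self.db_set(key, value)     # synchronous DB write (the slow part)
        self.store[key] = value     # cache is updated only after the DB write succeeds

    def get(self, key):
        return self.store.get(key)  # reads are fast and never stale
```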
Write-behind = client -> cache ~~> DB (asynchronous)
STEPS: [1] the application reads and writes to the cache [2] the cache asynchronously writes to the DB and returns immediately
CONS: [1] data loss if the cache dies before its contents reach the DB [2] more complex
Refresh-ahead = the cache automatically refreshes recently accessed entries before they expire
CONS: [1] hard to predict which entries will be needed; wrong predictions hurt performance
8.1 MongoDB vs Redis vs Memcached
When to use Redis: if you can map your use case onto Redis and you aren't at risk of running out of RAM.
MongoDB = key-value + disk-based store.
Redis = key-value + built-in persistence.
Memcached = Redis - built-in persistence.
Persistence to disk = you can use Redis as a real DB instead of a volatile cache; the data won't disappear on restart, as it does with Memcached.
9. Asynchronism
Asynchronous workflows reduce request times for expensive operations that would otherwise be performed in-line.
The user is not blocked, and the job is processed in the background.
Message queues = RabbitMQ
Task queues = Celery
Back pressure = limit the queue size; once it is full, respond with a busy status (e.g., HTTP 503) so clients retry later
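A minimal Celery sketch of pushing an expensive job to a background worker, assuming a RabbitMQ broker on localhost; the task and its arguments are hypothetical:

```python
from celery import Celery  # pip install celery

app = Celery("tasks", broker="amqp://localhost")  # RabbitMQ as the message queue

@app.task
def render_report(report_id):
    ...  # expensive work runs in a background worker process

# In the request handler: enqueue and return immediately; the user is not blocked.
render_report.delay(42)
```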
10. Communication
10.1 HTTP
encoding and transporting data between a client and a server
request/response protocol: a request = verb (method) + resource (endpoint)
application-layer protocol (layer 7), relies on lower-level protocols such as TCP/IP
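A minimal "verb + resource" sketch using the requests library; the URLs are hypothetical:

```python
import requests  # pip install requests

# verb (GET) + resource (/users/42)
resp = requests.get("https://api.example.com/users/42", timeout=5)
resp.status_code    # e.g., 200
data = resp.json()  # response body decoded from JSON

# Other verbs express other intents: POST (create), PUT/PATCH (update), DELETE (remove).
requests.post("https://api.example.com/users", json={"name": "alice"}, timeout=5)
```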
10.2 REST & RPC
                RPC                                     REST
--------------------------------------------------------------------------------------------
Style           request-response                        request-response
API             private                                 public
Syntax          customized                              general
Implementation  tight coupling                          minimizes coupling; stateless and cacheable
PROS            better performance, customized syntax   general purpose, less coupling
CONS            difficult to debug, tightly coupled     few verbs only; hard to name nested hierarchies
11. Security
1. XSS and SQL injection = sanitize all user inputs
2. SQL injection = use parameterized queries (see the sketch below)
3. Use the principle of least privilege.
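A minimal sketch of item 2 using sqlite3: the user-supplied value is bound as a parameter instead of being concatenated into the SQL string.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

user_input = "alice'; DROP TABLE users; --"  # hostile input

# Unsafe: building the SQL by string concatenation lets the input rewrite the query.
# conn.execute("SELECT * FROM users WHERE name = '" + user_input + "'")

# Safe: the driver treats the bound value purely as data, not as SQL.
rows = conn.execute("SELECT * FROM users WHERE name = ?", (user_input,)).fetchall()
```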
Appendix
A1. power of 2
Power   Exact Value             Approx Value    Bytes
------------------------------------------------------
 8      256
10      1,024                   1 thousand      1 KB
16      65,536                                  64 KB
20      1,048,576               1 million       1 MB
30      1,073,741,824           1 billion       1 GB
40      1,099,511,627,776       1 trillion      1 TB
A2. metrics
Read sequentially from disk at ~40 MB/s (~100x slower than memory)
Read sequentially from 1 Gbps Ethernet at ~100 MB/s (~40x slower than memory)
Read sequentially from SSD at ~1 GB/s (~4x slower than memory)
Read sequentially from main memory at ~4 GB/s (baseline, 1x)
Reading 1 MB sequentially from memory takes about 250 us
Reading 1 MB sequentially from SSD: ~4x that (~1 ms)
Reading 1 MB sequentially from disk: ~80x that (~20 ms)
6-7 world-wide round trips per second
2,000 round trips per second within a data center
L1 cache reference 0.5 ns
L2 cache reference 7 ns 14x L1 cache
Compress 1K bytes with Zippy 10,000 ns 10 us
Round trip within same datacenter 500,000 ns 500 us
Disk seek 10,000,000 ns 10,000 us 10 ms 20x datacenter roundtrip
Send packet CA->Netherlands->CA 150,000,000 ns 150,000 us 150 ms
A3. Per-second estimates
2.5M seconds per month
400 requests per second = 1 billion requests per month
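A quick sanity check of these conversions, assuming a 30-day month:

```python
seconds_per_month = 30 * 24 * 3600        # 2,592,000 ~= 2.5 million

requests_per_month = 1_000_000_000
rps = requests_per_month / seconds_per_month
print(round(rps))                         # ~386, i.e. roughly 400 requests per second
```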