15) System Design & Distributed Systems:

  • HiredInTech:
    • https://www.hiredintech.com/classrooms/system-design/lesson/59#
    • A strong process is crucial to successfully solving system design questions. We broke it down into four steps:
      • Scope the problem: Don't make assumptions; Ask questions; Understand the constraints and use cases.
      • Sketch up an abstract design that illustrates the basic components of the system and the relationships between them.
      • Think about the bottlenecks these components face when the system scales.
      • Address these bottlenecks by using the fundamental principles of scalable system design.
    • Make sure to start by asking questions: What constraints are we under? What are the typical access patterns? How much data? How many requests? Etc.
  • (Database) Sharding:
    • Just partition data by hashing the primary key.
    • https://www.hiredintech.com/classrooms/system-design/lesson/60
    • http://highscalability.com/blog/2009/8/6/an-unorthodox-approach-to-database-design-the-coming-of-the.html
    • http://www.addsimplicity.com/adding_simplicity_an_engi/2008/08/shard-lessons.html
    • https://www.percona.com/blog/2009/08/06/why-you-dont-want-to-shard/
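    • The naive approach above ("partition by hashing the primary key") can be sketched in a few lines (Python; the key format is illustrative):

```python
# Naive hash sharding: route each key to a shard by hashing the primary key.
# Simple and evenly distributed with a good hash, but changing num_shards
# remaps almost every key -- the motivation for consistent hashing.
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Map a primary key to a shard index via a stable hash."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_shards

# With 4 shards, 10,000 keys spread roughly evenly:
counts = [0] * 4
for i in range(10_000):
    counts[shard_for(f"user:{i}", 4)] += 1
```

    • Note the deliberate use of a real hash (md5) rather than Python's built-in `hash()`, which is salted per process and so isn't stable across nodes.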
    • To fix the problem of hotspots (incoming data not evenly distributed), use a good hash function. To fix the problem of node failures or of scaling up (with naive hashing, you'd be forced to migrate nearly all data to the new set of nodes), use consistent (ring) hashing:
      • https://www.paperplanes.de/2011/12/9/the-magic-of-consistent-hashing.html
      • http://michaelnielsen.org/blog/consistent-hashing/
        • "But unlike naive hashing, consistent hashing requires only a relatively small amount of data to be moved: if you add a machine to the cluster, only the data that needs to live on that machine is moved there; all the other data stays where it is."
      • https://akshatm.svbtle.com/consistent-hash-rings-theory-and-implementation
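      • A minimal consistent-hash ring, as described in the links above (a sketch, not production code; vnode count and md5 are arbitrary choices):

```python
# Consistent hashing: hash nodes onto a ring; a key belongs to the first
# node clockwise from its own hash. Virtual nodes ("vnodes") give each
# physical node many ring positions, smoothing the distribution.
import bisect
import hashlib

def _h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes, vnodes=100):
        self._ring = []                      # sorted list of (hash, node)
        for node in nodes:
            self.add(node, vnodes)

    def add(self, node, vnodes=100):
        for i in range(vnodes):
            bisect.insort(self._ring, (_h(f"{node}#{i}"), node))

    def get(self, key):
        h = _h(key)
        idx = bisect.bisect(self._ring, (h, ""))
        if idx == len(self._ring):           # wrap around the ring
            idx = 0
        return self._ring[idx][1]
```

      • The payoff is exactly the quote above: adding a node only claims the keys that fall between its new ring positions and their predecessors; everything else stays put.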
    • Could also use "dynamic" sharding where there's a specialized metadata locator service that points an incoming query to the right partition (basically just key:value store of key range:partition). Think, e.g., HDFS' NameNode. Easy to modify mappings, but single point of failure.
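    • The locator is essentially a sorted range map; a sketch (class and partition names are hypothetical):

```python
# "Dynamic sharding" metadata locator: a sorted map from the start key of
# each range to its partition, queried by bisection -- conceptually like
# HDFS' NameNode. Easy to re-split ranges by editing the map, but it is
# a single point of failure unless replicated.
import bisect

class RangeLocator:
    def __init__(self, ranges):
        # ranges: {range_start_key: partition_id}, e.g. {"a": "p0", "n": "p1"}
        # Assumes every lookup key >= the smallest range start.
        self._starts = sorted(ranges)
        self._parts = ranges

    def locate(self, key: str) -> str:
        idx = bisect.bisect_right(self._starts, key) - 1
        return self._parts[self._starts[idx]]
```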
    • Partitioning is vertical; sharding is horizontal on some attribute of the data, e.g. users sharded into a–m and n–z datasets. Federation is splitting the data completely on some obvious data attribute, like putting all user data on one node and all purchases data on another.
  • CAP:
    • A dismal guide to concurrency:
      https://www.facebook.com/notes/facebook-engineering/a-dismal-guide-to-concurrency/379717628919/
      • "A Consistent/Available system means that reading and writing always works the way you expect, but requires a majority or quorum of nodes to be running in order to function. Think of a parliament that must have more than half of members present in order to hold a vote. If too many can't make it, say because a flood washes out the bridge, a quorum can't be formed and business can't proceed. But when enough members are in communication the decision-making process is fast and unambiguous.
        (Aside: effectively impossible in larger-scale distributed systems, since network partitions can't be ruled out.)
      • Consistent/Partitionable means that the system can recover from failures, but requires so much extra coordination that it collapses under heavy use. Imagine having to send and receive a status report for every decision made at your company. You'll always be current, and when you come back from vacation you will never miss a thing, but making actual progress would be very slow.
      • Available/Partitionable means that you can always read and write values, but the values you read might be out of date. A classic example is gossip: at any point you might not know the latest on what Judy said to Bill but eventually word gets around. When you have new gossip to share you only have to tell one or two people and trust that in time it will reach everyone who cares. Spreading gossip among computers is a bit more reliable because they are endlessly patient and (usually) don't garble messages."
    • Eventually Consistent — Amazon.com paper
      http://queue.acm.org/detail.cfm?id=1466448
      • "Eventual consistency. This is a specific form of weak consistency; the storage system guarantees that if no new updates are made to the object, eventually all accesses will return the last updated value. If no failures occur, the maximum size of the inconsistency window can be determined based on factors such as communication delays, the load on the system, and the number of replicas involved in the replication scheme. The most popular system that implements eventual consistency is DNS (Domain Name System). Updates to a name are distributed according to a configured pattern and in combination with time-controlled caches; eventually, all clients will see the update.
      • The eventual consistency model has a number of variations that are important to consider:
        • Causal consistency. If process A has communicated to process B that it has updated a data item, a subsequent access by process B will return the updated value, and a write is guaranteed to supersede the earlier write. Access by process C that has no causal relationship to process A is subject to the normal eventual consistency rules.
        • Read-your-writes consistency. This is an important model where process A, after it has updated a data item, always accesses the updated value and will never see an older value. This is a special case of the causal consistency model.
        • Session consistency. This is a practical version of the previous model, where a process accesses the storage system in the context of a session. As long as the session exists, the system guarantees read-your-writes consistency. If the session terminates because of a certain failure scenario, a new session needs to be created and the guarantees do not overlap the sessions.
        • Monotonic read consistency. If a process has seen a particular value for the object, any subsequent accesses will never return any previous values.
        • Monotonic write consistency. In this case the system guarantees to serialize the writes by the same process. Systems that do not guarantee this level of consistency are notoriously hard to program."
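    • The core idea can be made concrete with a toy last-write-wins replica pair (a sketch only; real systems use vector clocks, quorums, read repair, etc.):

```python
# Toy eventual consistency: two replicas accept writes independently and
# converge after an anti-entropy ("gossip") round, using last-write-wins
# on timestamps to resolve conflicts.
class Replica:
    def __init__(self):
        self.data = {}                 # key -> (timestamp, value)

    def write(self, key, value, ts):
        cur = self.data.get(key)
        if cur is None or ts > cur[0]:
            self.data[key] = (ts, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None

def sync(a, b):
    """One anti-entropy round: exchange and merge all entries."""
    for key, (ts, val) in list(a.data.items()):
        b.write(key, val, ts)
    for key, (ts, val) in list(b.data.items()):
        a.write(key, val, ts)

r1, r2 = Replica(), Replica()
r1.write("x", "old", ts=1)
r2.write("x", "new", ts=2)
# Before syncing, reads disagree -- this is the "inconsistency window".
sync(r1, r2)
# After the round, both replicas return the last write, "new".
```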
  • ACID vs BASE:
    • ACID = atomicity, consistency (not the same as CAP consistency), isolation, durability
    • BASE = basically available, soft state, eventually consistent
      • Basically available = the system does guarantee availability, in terms of the CAP theorem.
      • Soft state = the state of the system may change over time, even without input (because of eventual consistency)
      • Eventual consistency = the system will become consistent over time, given that the system doesn't receive input during that time
    • Two ends of the CAP continuum: ACID focuses on consistency and BASE focuses on availability and partition tolerance—often relaxes consistency guarantees to e.g. eventual consistency. Maps pretty well to SQL vs NoSQL and to why the latter usually scales better.
  • Asynchronous vs synchronous:
    • sync means that processes must wait for each other to complete before beginning execution
    • async means that other stuff can happen while a process is executing
    • the meaning of synchronous/asynchronous is in the context of a global clock: operations are synchronous when they use the same global clock/lock, and asynchronous when they operate on their own clocks (i.e., so they don't have to wait for each other)
    • http://stackoverflow.com/questions/748175/asynchronous-vs-synchronous-execution-what-does-it-really-mean
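    • The difference in miniature, using Python's asyncio (delays are arbitrary; `asyncio.sleep` stands in for any I/O wait):

```python
# Synchronous vs asynchronous execution: the sync version waits for each
# "request" to finish before starting the next, so waits add up; the async
# version overlaps the waits, so total time ~ the longest single wait.
import asyncio
import time

def fetch_sync(delay):
    time.sleep(delay)           # blocks: nothing else runs meanwhile
    return delay

async def fetch_async(delay):
    await asyncio.sleep(delay)  # yields: other tasks run during the wait
    return delay

def run_sync(delays):
    return [fetch_sync(d) for d in delays]

async def run_async(delays):
    return await asyncio.gather(*(fetch_async(d) for d in delays))

delays = [0.05, 0.05, 0.05]
t0 = time.perf_counter()
run_sync(delays)
sync_time = time.perf_counter() - t0    # ~0.15 s: waits add up

t0 = time.perf_counter()
asyncio.run(run_async(delays))
async_time = time.perf_counter() - t0   # ~0.05 s: waits overlap
```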
  • Case studies:
    • Pastebin or bit.ly:
      • https://github.com/donnemartin/system-design-primer/blob/master/solutions/system_design/pastebin/README.md
      • http://blog.gainlo.co/index.php/2016/03/08/system-design-interview-question-create-tinyurl-system/
  • Paxos and other consensus methods:
    • https://www.quora.com/Distributed-Systems-What-is-a-simple-explanation-of-the-Paxos-algorithm
  • Numbers every programmer should know about latency, memory, etc.:
    • Numbers:
      • L1 cache reference 0.5 ns
      • Branch mispredict 5 ns
      • L2 cache reference 7 ns
      • Mutex lock/unlock 100 ns
      • Main memory reference 100 ns
      • Compress 1K bytes with Zippy 10,000 ns
      • Send 2K bytes over 1 Gbps network 20,000 ns
      • Read 1 MB sequentially from memory 250,000 ns
      • Round trip within same datacenter 500,000 ns
      • Disk seek 10,000,000 ns
      • Read 1 MB sequentially from network 10,000,000 ns
      • Read 1 MB sequentially from disk 30,000,000 ns
      • Send packet CA->Netherlands->CA 150,000,000 ns
    • Note that:
      • 1 ns = 10^-9 seconds
        Nanosecond.
      • 1 us = 10^-6 seconds = 1,000 ns
        Microsecond.
      • 1 ms = 10^-3 seconds = 1,000 us = 1,000,000 ns
        Millisecond.
    • Things to notice:
      • Notice the magnitude differences in the performance of different options.
      • Datacenters are far away so it takes a long time to send anything between them.
      • Memory is fast and disks are slow.
      • By using a cheap compression algorithm, a lot of network bandwidth (a factor of 2 or more) can be saved.
      • Writes are 40 times more expensive than reads.
      • Global shared data is expensive. This is a fundamental limitation of distributed systems. The lock contention in shared heavily written objects kills performance as transactions become serialized and slow.
      • Architect for scaling writes.
      • Optimize for low write contention.
      • Optimize wide. Make writes as parallel as you can.
    • https://gist.github.com/jboner/2841832
    • http://highscalability.com/blog/2011/1/26/google-pro-tip-use-back-of-the-envelope-calculations-to-choo.html
    • https://people.eecs.berkeley.edu/~rcs/research/interactive_latency.html
    • http://highscalability.com/blog/2013/1/15/more-numbers-every-awesome-programmer-must-know.html
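    • A quick back-of-the-envelope calculation using the table above (the 1 GB ≈ 1,000 MB rounding is deliberate; these are order-of-magnitude numbers):

```python
# How long does it take to read 1 GB sequentially, from memory vs disk?
NS = 1e-9                        # nanoseconds -> seconds
READ_1MB_MEMORY_NS = 250_000     # from the latency table
READ_1MB_DISK_NS = 30_000_000

mb = 1_000                       # ~1 GB
memory_secs = mb * READ_1MB_MEMORY_NS * NS   # 0.25 s
disk_secs = mb * READ_1MB_DISK_NS * NS       # 30 s
# Sequential disk is ~120x slower than memory for bulk reads, and a
# single 10 ms seek costs as much as reading ~0.33 MB sequentially --
# which is why sequential access patterns (and logs) matter so much.
```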
  • Designing Data-Intensive Applications notes:
    • logs (not narrow concept of application logs, but any immutable append-only key:value sequence structure) are a crucial abstraction for distributed systems:
      • https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
    • LSM-trees (log-structured merge trees) and SSTables (sorted string tables) are an important under-the-hood concept:
      • write to memtable (in-memory [ordered] balanced tree structure like a red-black or AVL tree)
      • when full, write to disk as an SSTable file/segment, becomes most recent segment of DB; while being written to disk, writes continue on new memtable instance
        • an SSTable is just sorted key:value pairs, but with some offsets stored in an index file. so when searching, can check against index to narrow down block that I have to look through
      • to serve reads, first try to find key in memtable, then in most recent on-disk SSTable segment, then next most recent. can use a bloom filter to avoid having to check all segments for nonexistent key
        • to avoid losing recent writes to memtable in event of DB crash, separate log on disk where every write is immediately appended (append-only immutable log etc.); not in order bc only purpose is to restore memtable. each time successfully written to disk, discard it
      • in background, from time to time, run merging and compaction process to combine segments and discard overwritten/deleted values (think tombstone for deletions)
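      • The write/flush/read path above, boiled down to a toy (a sketch only: a real memtable is a balanced tree, segments live on disk with index blocks and bloom filters, and compaction merges segments in the background):

```python
# Minimal LSM-tree sketch: writes land in an in-memory memtable; when it
# fills, it is flushed as a sorted, immutable "SSTable" segment; reads
# check the memtable first, then segments newest-first.
class TinyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}       # stand-in for a red-black/AVL tree
        self.segments = []       # newest segment last; each is sorted pairs
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            self._flush()

    def _flush(self):
        # Write the memtable out as a sorted, immutable segment.
        self.segments.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):  # newest segment first
            for k, v in segment:   # real SSTables binary-search via the index
                if k == key:
                    return v
        return None
```

      • Checking newest-first is what makes overwrites work without ever mutating a segment; deletions fit the same model via tombstone records, as noted above.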
  • Links:
    • General:
      • https://www.palantir.com/2011/10/how-to-ace-a-systems-design-interview/
      • https://www.youtube.com/watch?v=-W9F__D3oY4
        Focusing on scalability.
      • https://github.com/donnemartin/system-design-primer
    • Concurrency:
      • https://www.facebook.com/notes/facebook-engineering/a-dismal-guide-to-concurrency/379717628919/
      • http://cs.lmu.edu/~ray/notes/introconcurrency/