An overview of MIT 6.824
April 08, 2021
What do I understand about distributed systems?
Distributed systems tackle the problem of performance scalability, in particular horizontal scalability: how can we improve throughput by roughly N times, given that we have roughly N machines?
- Usually this is not straightforward, because adding machines can introduce bottlenecks and more opportunities for failure.
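To get a feel for why a bottleneck stops throughput from growing N times, here is a minimal sketch in Go based on Amdahl's law. The law itself is not named in the course overview; it is my own illustration, and the 5% serial fraction and the function name `amdahlSpeedup` are assumptions chosen for the example.

```go
package main

import "fmt"

// amdahlSpeedup returns the ideal speedup on n machines when a fraction
// `serial` of the work cannot be spread across machines (Amdahl's law).
// The serial fraction stands in for any shared bottleneck.
func amdahlSpeedup(serial float64, n int) float64 {
	return 1 / (serial + (1-serial)/float64(n))
}

func main() {
	// Assumed value for illustration: 5% of the work is a bottleneck.
	const serial = 0.05
	for _, n := range []int{1, 10, 100, 1000} {
		fmt.Printf("machines=%4d speedup=%.2f\n", n, amdahlSpeedup(serial, n))
	}
}
```

With a 5% serial fraction the speedup can never exceed 1/0.05 = 20, no matter how many machines are added, which is the sense in which N machines rarely give N times the throughput.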
Why did I choose to take this course?
- interesting — hard problems, non-obvious solutions
- active research area — lots of progress + big unsolved problems
- used by real systems — unlike 10 years ago — driven by the rise of big Web sites
- hands-on — you’ll build a real system in the labs
Main topics covered by the course
- 3 main abstractions for distributed infrastructure
  - Storage systems (e.g. Memcache at Facebook)
  - Computation systems (e.g. MapReduce)
  - Communication models (e.g. Network, Reliability)
- Implementation of these abstractions including
  - RPC
  - Threads
- Fault tolerance in the form of:
  - Availability
  - Recoverability
  - Common strategies include Replication, Non-volatile Storage
- Consistency
  - Intuition: we usually have multiple replicas of the same data, so how can we talk about consistency between these replicas?
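As a rough illustration of the consistency question, the sketch below shows two toy replicas that receive the same two writes in different orders and end up disagreeing. The `write` and `replica` types and the last-write-wins rule are hypothetical, made up for this example; they are not taken from the course material.

```go
package main

import "fmt"

// write is a hypothetical update to a single key.
type write struct {
	key, value string
}

// replica is a toy in-memory copy of the data.
type replica struct {
	data map[string]string
}

func newReplica() *replica {
	return &replica{data: map[string]string{}}
}

// apply applies writes in arrival order; a later write overwrites an earlier one.
func (r *replica) apply(ws ...write) {
	for _, w := range ws {
		r.data[w.key] = w.value
	}
}

func main() {
	a := write{"x", "1"}
	b := write{"x", "2"}

	r1, r2 := newReplica(), newReplica()
	r1.apply(a, b) // replica 1 sees a, then b
	r2.apply(b, a) // replica 2 sees b, then a

	fmt.Println("replica 1: x =", r1.data["x"]) // x = 2
	fmt.Println("replica 2: x =", r2.data["x"]) // x = 1
}
```

Both replicas received the same writes, yet a reader gets a different answer depending on which replica it asks. A consistency model is a way of stating which of these outcomes a system is allowed to expose.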
Labs
The labs are written in Go; previous iterations used C++. Go is garbage-collected, so it avoids some of the problems that come with C++, such as managing memory by hand and manually keeping track of when the last thread has finished with a piece of memory before freeing it.
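To make the point about threads and memory concrete, here is a minimal sketch, assuming nothing beyond the Go standard library. The function `sumSquares` is a hypothetical example, not lab code. Several goroutines share one slice, and nothing in the program ever frees it explicitly.

```go
package main

import (
	"fmt"
	"sync"
)

// sumSquares squares each number in its own goroutine and returns the total.
func sumSquares(nums []int) int {
	results := make([]int, len(nums)) // shared by every goroutine below

	var wg sync.WaitGroup
	for i, n := range nums {
		wg.Add(1)
		go func(i, n int) {
			defer wg.Done()
			results[i] = n * n // each goroutine writes only its own slot
		}(i, n)
	}
	wg.Wait() // wait for the results, not for permission to free memory

	total := 0
	for _, r := range results {
		total += r
	}
	return total
}

func main() {
	fmt.Println(sumSquares([]int{1, 2, 3, 4})) // prints 30
}
```

The `WaitGroup` is there only so the caller knows when the results are ready. The slice stays alive for as long as any goroutine still refers to it, and the garbage collector reclaims it afterwards, so there is no bookkeeping about which thread was the last to use it.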
(Still considering whether to work on the labs - might try a few first)
Useful links for the course
- Official 2020 Course Schedule https://pdos.csail.mit.edu/6.824/schedule.html
- 2015 Notes for the course https://wizardforcel.gitbooks.io/distributed-systems-engineering-lecture-notes/content/l01-intro.html
Written by Melodies Sim, who lives and works in the Bay Area, California, turning novel ideas into reality. Check out her projects on GitHub.