This course covers techniques for analyzing very large data sets, with an emphasis on approaches that scale out effectively as more compute nodes are added. It introduces principles of distributed data management and strategies for problem-driven data partitioning through a selection of design patterns from various application domains, including graph analysis, databases, text processing, and data mining.
In this edition of the course, we will focus predominantly on Apache Spark and Apache Kafka.
This course is part traditional lecture, part studio course. That is, the first half of most course sessions will be traditional lectures. The second half will vary from live coding, to hands-on exercises (based on the lecture material), to student codewalks.
Communication with the instructor and teaching assistants is exclusively through Piazza. If you wish to contact the instructor privately, send a private note.
Grades will be managed and assignments will be collected through the course’s Blackboard page.
There is no required textbook for this course because this is a quickly evolving area. There are, however, several specialized developer books that you will find useful for reference and self-study.
Some of these books are available online (and for free) for Northeastern University students from Safari Books Online.
Week | Date | Topic |
---|---|---|
1 | Jan 11 | No class (POPL) |
2 | Jan 18 | Intro, Data Parallelism, and Scala |
3 | Jan 25 | Intro to Spark |
4 | Feb 1 | Key-Value Pairs and Joins |
5 | Feb 8 | Shuffling, Partitioning |
6 | Feb 15 | DataFrames & Datasets |
7 | Feb 22 | Midterm |
8 | Mar 1 | Other Big Data Tools, Introduction to Streaming |
9 | Mar 8 | No class (Spring Break) |
10 | Mar 15 | Spark Streaming |
11 | Mar 22 | Stateful & Structured Streaming |
12 | Mar 29 | Apache Kafka |
13 | Apr 5 | TensorFlow & Hadoop |
14 | Apr 12 | Final project presentations |
Throughout the semester, there will be short quizzes at the end of most lectures. These quizzes will be very brief, and are not meant to be challenging or cause students grief. They will contain very straightforward questions right out of the current lecture. The purpose of these quizzes is merely to encourage students to follow along with the lecture material.
The lowest quiz score will be dropped; this should make up for a bad day or unexcused absence from class.
As a consequence, attendance is expected.
There will be a handful of projects distributed approximately every two weeks. Projects are due by noon on Thursdays.
Assignments can be submitted on the course’s Blackboard page.
Assignment | Distributed | Due |
---|---|---|
Assignment 1: Anagrams | January 19, noon | January 25, noon |
Assignment 2: Wikipedia | January 25, noon | February 1, noon |
Assignment 3: StackOverflow | February 1, noon | February 15, noon |
Assignment 4: Time Usage | February 8, noon | February 22, noon |
Students will pair up and propose a significant data processing application as a final project. A one-page project proposal describing the project plan will be due midway through the semester.
More details about the final project.
March 1st: Names of partners for final project due in class.
March 15-19th: Brainstorming meeting about final project topic with Heather. Claim an appointment slot here.
March 22nd: Project proposal due.
April 12th: Project demos/presentations in class.
April 22nd: Project reports due in Blackboard before midnight.
If the Disability Resource Center has formally approved you for an academic accommodation in this class, please present the instructor with your “Professor Notification Letter” during the class session, so that we can address your specific needs as early as possible.
Slides and other materials will be posted here.
Slides:
Intro to Scala