Pre-requisites: Algorithms, Probability.

(CS2030 Data Structures and Algorithms OR CS3070/CS2040 Design and Analysis of Algorithms OR CS5009 Algorithms OR CS4999) AND (MA2041 - Probability and Statistics OR MA5007 - Probability and Statistics OR MA4999 OR CS4999 OR PH4999)

XX4999 stands for having completed a bachelor’s degree in XX program.

Course Overview

Classical Algorithm Design assumes that all the required data is stored on a fast memory and any piece of it can be accessed in unit time. This assumption is no longer true when one has to deal with massive amounts of data - much more than what could be stored in the main memory of any computer. This course describes models and algorithmic techniques developed for handling such massive amounts of data that is typically made available with access limitations. Topics covered include data stream algorithms, sampling techniques, sketching techniques, sparsification and lower bounds. The emphasis will be in the design of these techniques, proving performance and efficiency guarantees and understanding the fundamental limitations of these approaches.

Course Content

Introduction. Introduction to massive data problems. Introduction to techniques like randomisation, streaming, sampling, and sketching. [2 lectures ]

Streaming Algorithms. The Data Stream Model. Finding Frequent Items, Frequency Estimation. Estimating the Number of Distinct Elements: The Tidemark Algorithm, The Median Trick, The BJKST Algorithm, Hyperloglog Algorithm, CVM algorithm. Approximate Counting. [12 lectures]

Sketches and Linear Sketches. Finding Frequent Items via Linear Sketching. CountSketch and Count-Min Sketch Algorithms. Comparison of Frequency Estimation Methods. Estimating Frequency Moments: The AMS Estimator, The Tug-of-War Sketch. [10 lectures]

Sampling, Weight-Based Sampling: The $\ell_0$-Sampling, $\ell_2$- Sampling. [2 lectures]

Graph Streams. Basic Algorithms: The Connectedness Problem, The Bipartiteness Problem, Shortest Paths and Distance Estimation via Spanner. Finding Maximum Matchings: Maximum Cardinality Matching, Maximum Weight Matching. [6 lectures]

Graph Sketching. Testing Connectivity Using Boundary Edges, Testing Bipartiteness, The AGM Sketch. Counting Triangles: A Sampling-Based Algorithm, A Sketch-Based Algorithm. [4 lectures]

Lower Bounds. Communication Complexity and Lower Bounds. Communication Games, Protocols, Complexity. Two-Player Communication Games. Data Streaming Lower Bounds: for Majority, for Frequency Estimation. [6 lectures]

Text Books

  1. Amit Chakrabarti. Data Stream Algorithms Lecture Notes. (2024). (Available online)

References

  1. Avrim Blum, John Hopcroft, and Ravindran Kannan. Foundations of data science. Cambridge University Press, 2020. (ISBN: 978-1108485067. Also available online)
  2. Jelani Nelson, Sketching Algorithms, Lecture Notes (2020) (Available online)

Past Offerings

Course Metadata

Item Details
Course Title Algorithms for Big Data
Course Code CS5654
Course Credits 3-0-0-3
Course Category PME
Proposing Faculty Deepak Rajendraprasad and Krithika Ramaswamy
Approved on Senate of IIT Palakkad
Course status New