Prerequisites: Familiarity with algorithms, probability, linear algebra, and programming

Course Content

  1. Data Collection: Various sources and types of data: text, video, audio, biological data, etc. (3 hours)

  2. Data Preprocessing: Cleaning data, missing data imputation, noise elimination, feature selection and dimensionality reduction, normalization (6 hours)

  3. Data Storage: Databases, schemas, ER diagrams, SQL, functions, stored procedures, indexing (B+ tree), MongoDB, client-server architecture (9 hours)

  4. Information Retrieval: Index construction, scoring models, complete search engine mechanism, evaluation methods (6 hours)

  5. Data Processing: Data structures: stack, queue, linked list, associative memory, graphs. Algorithms: searching, sorting, graph traversal, complexity (9 hours)

  6. Data Analysis: Regression, principal component analysis, canonical correlation analysis, analysis of variance (6 hours)

  7. Data Visualization: Tables, graphs, histograms, pie charts, area plots, box plots, scatter plots, bubble plots, waffle charts, word clouds (3 hours)

Learning Outcomes

To be able to state and analyse:

  • Preprocessing techniques for various datasets
  • Standard database system concepts such as tables, relations, and queries
  • Information retrieval techniques such as indexing, scoring, ranking, evaluation
  • Data processing algorithms and data structures
  • Visualization techniques

Learning Objectives: To learn about the entire pipeline of a typical data-driven system: collection, preprocessing, storage, retrieval, processing, analysis, and visualization.

Text Books

  1. Introduction to Algorithms. Cormen, Leiserson, Rivest, Stein. MIT Press, 3rd edition. ISBN-13: 978-0262533058
  2. Database System Concepts. Silberschatz, Korth, Sudarshan. McGraw Hill Education, 6th edition. ISBN-13: 978-9332901384
  3. Introducing Data Science: Big Data, Machine Learning, and More, Using Python Tools. Cielen, Meysman, Ali. Dreamtech Press. ISBN-13: 978-9351199373

References

  1. Data Engineering: A Novel Approach to Data Design. Brian Shive. Technics Publications. ISBN-13: 978-1935504603
  2. Python Data Science Handbook: Essential Tools for Working with Data. Jake VanderPlas. O’Reilly. ISBN-13: 978-9352134915

Past Offerings

(Note: Past offerings could be under a different course number.)
  • Offered in Jul-Dec, 2020 by Mrinal

Course Metadata

Item                       Details
Course Title               Data Engineering
Course Code                CS5015
Course Credits             3-0-0-3
Course Category            PMT
Approved on                Senate of IIT Palakkad
Course pre-revision code   DS5003