Prerequisites: Probability and Statistics, Programming

Course Content

  1. Introduction to Information Retrieval (3 hours)
    • Basic Text Processing: Tokenization, Stopwords, Stemming, Lemmatization, Zipf’s and Heaps’ laws
    • Spelling Correction and Edit Distances: Hamming distance, Longest Common Subsequence, Levenshtein edit distance
    • Boolean Retrieval Model
  2. Basic Ranking and Evaluation Measures (4 hours)
    • Vector Space Model
    • TF*IDF
    • IR Evaluation: Precision, Recall, F-measures, Mean Reciprocal Rank (MRR), Mean Average Precision (MAP), Normalized Discounted Cumulative Gain (NDCG)
    • Designing test collections, relevance judgments
  3. Probabilistic Retrieval Model
    • Introduction: Generative Model
    • Probabilistic Ranking Principle
    • Binary Independence Model
    • Okapi BM25
    • Bayesian Networks for IR
  4. Statistical Language Model
    • Basics of Language Model
    • Query-likelihood Approach and different Smoothing Methods
    • Advanced Query Types: Query expansion, Relevance feedback, Novelty & Diversity
  5. Topic Model
    • Introduction to topic model
    • Latent Semantic Indexing
    • Probabilistic Latent Semantic Indexing
    • Latent Dirichlet Allocation
    • Topic model for IR
  6. Link Analysis
    • Introduction: World Wide Web as Graph
    • PageRank
    • HITS
    • Topic-specific and Personalized PageRank
  7. Indexing and Searching
    • Different Compression Methods: Ziv-Lempel, Variable-Byte, Gamma, Golomb, Gap encoding
    • Query Processing: TAAT, DAAT, WAND, Fagin’s algorithm
    • Near-Duplicate Detection: Shingling, Min-wise independent permutations, Locality-sensitive hashing
  8. Retrieval using Unsupervised Techniques
    • Retrieval using word-embeddings and clustering
  9. Retrieval using Supervised ML (4 hours)
    • Introduction to Learning to Rank for retrieval
    • Retrieval using classification
  10. Advanced Topics: One or two contemporary topics, which can change from semester to semester. For example, fairness in ranking (

Learning Outcomes

This course is designed to provide an in-depth understanding of how unstructured texts are processed, indexed, and queried to meet users’ information needs. It also discusses different methods for clustering and classifying documents to enhance the efficiency of the retrieval system.

Text Books

  1. Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze. Introduction to Information Retrieval, Cambridge University Press, 2008. ISBN-13: 978-0521865715.
  2. Stefan Büttcher, Charles L. A. Clarke, Gordon V. Cormack. Information Retrieval: Implementing and Evaluating Search Engines, MIT Press, ISBN-13: 978-0262026512.

References

  1. Jure Leskovec, Anand Rajaraman, Jeffrey D. Ullman. Mining of Massive Datasets, Cambridge University Press, 2011. ISBN: 978-1107077232.
  2. Larry Wasserman. All of Statistics, Springer, 2004. ISBN-13: 978-0387402727.

Past Offerings

(Note: Past offerings could be under a different course number.)
  • Offered in Jul-Dec, 2021 by Mrinal, Koninika