605.744: Information Retrieval, Spring 2017
- Paul McNamee
Note: the overview below is for convenience. Consult the official syllabus for section-specific details.
Course Times and Location
- Lecture: Mondays, 7:20pm - 10:00pm
- Location: Room K7.
- Email should be the primary means of out-of-class communication; however,
I can meet with students in person by appointment.
Course Overview
- This course covers the storage and retrieval of unstructured digital
information. Topics include automatic index construction,
retrieval models, textual representations, efficiency issues,
search engines, text classification, and multilingual retrieval.
Grading Policy
- Work for the class includes homework assignments, an independent research project, exams, and scholarly engagement.
Refer to the course outline for details. I assign grades with plus/minus modifiers (e.g., A+, B-).
- 1/23/17 Chapters 1 and 2 in Manning, Raghavan, and Schütze.
- 1/23/17 Michael Lesk, The Seven Ages of Information Retrieval (1995)
- 1/30/17 Chapters 3 and 4 in Manning, Raghavan, and Schütze.
- Entirely optional: If you want more discussion of inverted files, their compression, and construction algorithms, Zobel and Moffat have written a nice survey article on inverted files in ACM Computing Surveys.
- 2/6/17 Chapter 5 in Manning, Raghavan, and Schütze.
- 2/13/17 Chapters 6 and 7 in Manning, Raghavan, and Schütze.
- 2/13/17 G. Salton and C. Buckley, Term-Weighting Approaches in Automatic Text Retrieval, IPM 24(5), pp. 513-523, 1988.
- 2/20/17 Chapters 8 and 9 in Manning, Raghavan, and Schütze.
- 2/20/17 Economic Impact of TREC (2010). Read ONLY the Executive Summary and Sections 1-3.
- 2/27/17 Chapters 11 and 12 in Manning, Raghavan, and Schütze.
- 3/13/17 Chapters 13, 14, and 15 in Manning, Raghavan, and Schütze.
- Entirely optional: If you want more discussion of SVMs for text classification, you can read this paper by Joachims: Text categorization with support vector machines: learning with many relevant features.
- Entirely optional: If you want to read about email spam detection: Goodman et al., Spam and the on-going battle for the inbox, CACM 50(2), pp. 24-33, 2007.
- 3/27/17 Chapters 19, 20, and 21 in Manning, Raghavan, and Schütze.
Course-related web links
- Sources for on-line papers:
ACM Digital Library
- IR Textbooks:
Information Retrieval: Implementing and Evaluating Search Engines
Information Retrieval: Algorithms and Heuristics
Readings in Information Retrieval (Amazon)
Foundations of Statistical Natural Language Processing
- IR Evaluations: TREC,
ROMIP (a Russian-language evaluation)
- Organizations that distribute corpora:
- IR Journals:
ACM Transactions on Speech and Language Processing
- IR-related conferences:
CEAS 2010 (email spam)
AIRWeb (web spam)
ISMIR (music IR)
- On-line magazines:
The Noisy Channel
Search Engine Watch
- Peter Norvig's tutorial on spelling correction.
- Berkeley Primer: Finding Information on the Internet
- HLT Central Repository
- John Sowa's Discrete Mathematics Primer
- Web Protocols:
Z39.50 (Information Retrieval)
- Lucene, a popular open-source search engine (see also Solr)
- Wumpus system (Univ. Waterloo)
- Lemur / Indri: a language modelling IR toolkit.
- Cornell's SMART system (predates the birth of Sergey Brin or Larry Page)
- Martin Porter's Snowball stemming tool (includes the Porter Stemmer)
- Jacques Savoy's stoplists in various languages (and some stemmers too)
- Managing Gigabytes mg system
- Very nice list of NLP, IR, and CL resources (e.g., parsers, taggers) at
- University of Michigan tool suite: Clairlib
- TnT toolkit, a visible Markov model tagger written by Thorsten Brants (now of Google).
a probabilistic POS-tagger.
- On-line translators: Systran,
FreeTranslation.com, Google Translate, Bing's Translator
- WordNet, a lexical database for English
- Andrew McCallum's MALLET toolkit, a Java-based API for machine learning applications using Conditional Random Fields
- Perl LWP library (at CPAN).
- Machine Learning / Data Mining tool: WEKA
- Joachims' Support Vector Machine toolkit: SVMlight
- SVM-Multiclass, a multi-class version of SVMlight.
- Python-based set of tools for NLP tasks (parsing, POS tagging, etc.): NLTK
- Machine learning in Python: scikit-learn
- Parsing HTML (robustly) in Python: Beautiful Soup
- A 'meta' search engine: Dogpile
- A question-answering system: START
- An online joke recommendation system that demonstrates
- A faux computer science paper generator, SCIgen, from MIT
- No IR system with 3 billion queries a day is going to be perfect. Best of Google Bloopers ;-).
IR Test collections
Several Web Search Engines