605.744: Information Retrieval
Prospective students: 605.744 is next being offered as an online
course in Fall 2019. The next offering after that will be Fall 2020,
which will be an in-person course at the APL campus. Registration for
Fall '19 opens July 11th. I will also be co-teaching Natural Language
Processing (605.646) with James Mayfield in Fall 2019.
- Paul McNamee
Note: the overview below is for convenience. Consult the official syllabus for section-specific details. The course catalog page also gives an overview.
Course Times and Location
- Location: online
- Email should be the primary means of out-of-class communication; however,
I can meet with students in person by appointment.
- Course Overview
- This course covers the storage and retrieval of unstructured digital
information. Topics include automatic index construction,
retrieval models, textual representations, efficiency issues,
search engines, text classification, and multilingual retrieval.
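To give a flavor of the first two topics above (automatic index construction and retrieval models), here is a minimal sketch of an inverted index with TF-IDF ranking. It is an illustration only, not course material: the function names (`tokenize`, `build_index`, `search`), the sample documents, and the simple `tf * log(N / df)` weighting are all assumptions chosen for brevity.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to a {doc_id: term_frequency} postings dictionary."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def search(index, num_docs, query):
    """Rank documents by the summed TF-IDF weight of the query terms."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue  # term does not occur in the collection
        idf = math.log(num_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Highest score first; ties broken by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical three-document collection.
docs = {
    "d1": "Search engines build an inverted index over text.",
    "d2": "Text classification assigns labels to documents.",
    "d3": "Multilingual retrieval searches text in many languages.",
}
index = build_index(docs)
results = search(index, len(docs), "inverted index")
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

Under this weighting, a term that occurs in every document (here, "text") gets an IDF of zero and so contributes nothing to the ranking. Production systems such as Lucene use more refined scoring functions.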
- Grading Policy
- Work for the class includes homework assignments, an independent research project, exams, and scholarly engagement.
Refer to the course outline and syllabus for details. I assign grades with plus/minus modifiers (e.g., A+, B-).
Course-related web links
- Sources for on-line papers:
ACM Digital Library
- IR Textbooks:
Information Retrieval: Implementing and Evaluating Search Engines
Information Retrieval: Algorithms and Heuristics
Readings in Information Retrieval (Amazon)
Foundations of Statistical Natural Language Processing
- IR Evaluations: TREC,
ROMIP (a Russian language evaluation)
- Organizations that distribute corpora:
- IR Journals:
ACM Transactions on Speech and Language Processing
- IR-related conferences:
CEAS 2010 (email spam)
AIRWeb (web spam)
ISMIR (music IR)
- On-line magazines:
The Noisy Channel,
Search Engine Watch,
- Peter Norvig's tutorial on spelling correction.
- Berkeley Primer: Finding Information on the Internet
- HLT Central Repository
- John Sowa's Discrete Mathematics Primer
- Web Protocols:
Z39.50 (Information Retrieval)
- Lucene, popular open-source search engine software (see also Solr)
- Wumpus system (Univ. Waterloo)
- Lemur / Indri: a language modelling IR toolkit.
- Cornell's SMART system (predates the birth of Sergey Brin or Larry Page)
- Martin Porter's Snowball stemming tool (includes Porter Stemmer):
- Jacques Savoy's stoplists in various languages (and some stemmers too)
- Managing Gigabytes mg system
- Very nice list of NLP, IR, and CL resources (e.g., parsers, taggers)
- University of Michigan tool suite: Clairlib
- TnT toolkit, a visible Markov model tagger written by Thorsten Brants (now of Google).
a probabilistic POS-tagger.
- On-line translators: Systran,
FreeTranslation.com, Google Translate, Bing's Translator
- WordNet, a lexical database for English
- Andrew McCallum's MALLET toolkit, a Java-based API for machine learning applications using Conditional Random Fields
- Perl LWP library (at CPAN).
- Machine Learning / Data Mining tool: WEKA
- Joachims' Support Vector Machine toolkit: SVMlight
- SVM-Multiclass, a multi-class version of SVMlight.
- Python-based set of tools for NLP tasks (parsing, POS tagging, etc.): NLTK
- Machine learning in Python: scikit-learn
- Parsing HTML (robustly) in Python: Beautiful Soup
- A 'meta' search engine: Dogpile
- A question-answering system: START
- An online joke recommendation system that demonstrates
- A faux computer science paper generator, SCIgen, from MIT
- No IR system with 3 billion queries a day is going to be perfect. Best of Google Bloopers ;-).
IR Test collections
Several Web Search Engines