CMPSCI 585 - Intro to Natural Language Processing

Introduction to Natural Language Processing

CMPSCI 585
Spring 2004

Homework

Each homework assignment consists of writing a short program (which can be done in the programming language of your choice), performing a few experiments on text data we will provide, and writing brief descriptions of your findings. We will provide detailed directions and hints that should allow you to focus on the NLP aspects of the assignment, rather than software engineering.

Below are some brief descriptions. Full details of each homework will be made available when it is assigned.

Homework #1: Implement a naive Bayes document classifier, apply it to junk email filtering (or some other document collection of your choice), and perform a few simple experiments. Additional helpful material: a paper describing the multinomial event model, and Tom Mitchell's textbook, which has an excellent introduction to naive Bayes document classification.
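
To make the idea concrete, here is a minimal sketch of a multinomial naive Bayes classifier with add-alpha smoothing. It is not the course's reference solution: the function names (train_nb, classify_nb), the smoothing constant, and the toy data are assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, alpha=1.0):
    """docs: list of (label, tokens). Returns (log_prior, log_like, vocab)."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, tokens in docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    n_docs = sum(class_counts.values())
    # Class priors estimated from document counts
    log_prior = {c: math.log(n / n_docs) for c, n in class_counts.items()}
    # Per-class word probabilities with add-alpha (Laplace) smoothing
    log_like = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_like[c] = {w: math.log((word_counts[c][w] + alpha) / total)
                       for w in vocab}
    return log_prior, log_like, vocab

def classify_nb(tokens, log_prior, log_like, vocab):
    """Return the class with the highest log posterior; unseen words are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(log_like[c][w] for w in tokens if w in vocab)
    return max(scores, key=scores.get)

# Hypothetical toy data, for illustration only
toy = [
    ("spam", "buy cheap pills now".split()),
    ("spam", "cheap pills cheap".split()),
    ("ham", "meeting at noon".split()),
    ("ham", "lunch meeting tomorrow".split()),
]
model = train_nb(toy)
print(classify_nb("cheap pills".split(), *model))
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together for a long document.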

Homework #2: We will provide data relevant to word-sense disambiguation. Building on your naive Bayes code, implement Expectation Maximization, apply it to this data, and describe your results.
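
As a rough illustration, EM can be layered on a naive Bayes model by treating the class (here, the word sense) of each context as unobserved. The sketch below is one possible formulation, not the assigned method: the function name em_naive_bayes, the random soft initialization, the smoothing constant, and the toy contexts are all assumptions.

```python
import math
import random

def em_naive_bayes(docs, n_classes=2, n_iters=20, alpha=0.1, seed=0):
    """docs: list of token lists. Returns (posteriors, log-likelihood history)."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    # Random soft initial assignment of each document to the classes
    post = []
    for _ in docs:
        r = [rng.random() + 1e-3 for _ in range(n_classes)]
        s = sum(r)
        post.append([x / s for x in r])
    history = []
    for _ in range(n_iters):
        # M-step: re-estimate parameters from expected (fractional) counts
        prior = [(sum(p[c] for p in post) + alpha) / (len(docs) + alpha * n_classes)
                 for c in range(n_classes)]
        word_p = []
        for c in range(n_classes):
            counts = {w: alpha for w in vocab}
            for d, p in zip(docs, post):
                for w in d:
                    counts[w] += p[c]
            total = sum(counts.values())
            word_p.append({w: counts[w] / total for w in vocab})
        # E-step: posterior over classes for each document (log-sum-exp for stability)
        ll = 0.0
        post = []
        for d in docs:
            logs = [math.log(prior[c]) + sum(math.log(word_p[c][w]) for w in d)
                    for c in range(n_classes)]
            m = max(logs)
            z = m + math.log(sum(math.exp(l - m) for l in logs))
            ll += z
            post.append([math.exp(l - z) for l in logs])
        history.append(ll)
    return post, history

# Hypothetical contexts of the ambiguous word "bank", for illustration only
contexts = [
    "river water shore".split(),
    "money loan deposit".split(),
    "river water shore".split(),
    "money loan interest".split(),
]
posteriors, history = em_naive_bayes(contexts)
for ctx, p in zip(contexts, posteriors):
    print(ctx, [round(x, 3) for x in p])
```

Because the classes are unlabeled, which cluster index ends up corresponding to which sense is arbitrary and depends on the initialization.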

Homework #3: Implement a simple hidden Markov model, train it in a non-hidden fashion (that is, from data in which the tags are observed), and run the Viterbi algorithm on some test part-of-speech tagging data. Try some variations, and describe your results.
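
The following minimal sketch shows one way to estimate HMM parameters by counting over tagged sentences and then decode with Viterbi. It is an illustration under assumptions, not the official assignment code: the names train_hmm, log_p, and viterbi, the "<s>" start symbol, the add-alpha smoothing, and the toy corpus are all hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_hmm(tagged_sentences, alpha=1.0):
    """Count tag transitions and word emissions from fully tagged sentences."""
    trans = defaultdict(Counter)
    emit = defaultdict(Counter)
    vocab = set()
    for sent in tagged_sentences:
        prev = "<s>"  # assumed start-of-sentence symbol
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
    return trans, emit, sorted(emit), vocab, alpha

def log_p(counter, key, n_outcomes, alpha):
    """Add-alpha smoothed log probability of key under the given counts."""
    return math.log((counter[key] + alpha) /
                    (sum(counter.values()) + alpha * n_outcomes))

def viterbi(words, model):
    """Return the most probable tag sequence for words."""
    trans, emit, tags, vocab, alpha = model
    if not words:
        return []
    n_words = len(vocab) + 1  # one extra outcome reserved for unknown words
    # best[i][t]: best log probability of any tagging of words[:i+1] ending in t
    best = [{t: log_p(trans["<s>"], t, len(tags), alpha)
                + log_p(emit[t], words[0], n_words, alpha) for t in tags}]
    back = [{t: None for t in tags}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev = max(tags, key=lambda p: best[i - 1][p]
                       + log_p(trans[p], t, len(tags), alpha))
            best[i][t] = (best[i - 1][prev]
                          + log_p(trans[prev], t, len(tags), alpha)
                          + log_p(emit[t], words[i], n_words, alpha))
            back[i][t] = prev
    # Follow back-pointers from the best final tag
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]

# Hypothetical toy corpus, for illustration only
train = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]
hmm = train_hmm(train)
print(viterbi(["the", "cat", "barks"], hmm))
```

Changing the smoothing constant or the treatment of unknown words are examples of the kind of variations one could experiment with.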

Homework #4: Written assignment. Due April 27th.

Final Project: Implement and explore an NLP task of your choosing. Examples might include (1) extracting from the Web the names and job titles of business people who used to go to UMass, (2) clustering text from different languages to discover a family tree of languages, (3) implementing an improved parser, (4) building a simple machine translation system, (5) clustering your email to create folders, (6) building a lexical acquisition system for a particular technical domain, (7) part-of-speech tagging, trained from labeled and unlabeled data, (8) Chinese word segmentation with HMMs. Check out papers at recent NLP conferences to get more ideas: EMNLP 2003, HLT 2003, ACL 2003.