Overview

The overall goal of this project is to build a search engine that takes a query and returns jokes. The twist is that we want it to recognize when a group of jokes is essentially "the same joke," even if it's told in different ways (using different words).

For instance, a joke about a lawyer, a doctor, and an engineer stuck on a desert island might appear in a different form as a joke about a priest, a rabbi, and Bugs Bunny on an airplane.

We also want to construct queries that would return (versions of) a joke even if the person querying didn't remember the joke exactly or had heard a different version.

So, there are two labeling tasks.

Tasks

1. Given one copy of a joke, find all the others in the corpus that are basically the same joke. Being "the same joke" is a fuzzy concept, but here are some guidelines:

You may have to be creative with the search terms. If it's an airplane joke about a priest and a rabbi, searching on the punch line term "parachute" may be the best way to find the variations.

2. Given a joke, construct queries that someone might use to search for it. Maybe they can't remember the punch line . . . or maybe they can only remember the punch line. Or maybe they know it, but want to see exactly how it's told.

Queries should be a few words long. For each cluster of "same" jokes, construct 3-5 queries of each of these types:

a. Easy = designed to bring up one of the jokes in the cluster.
b. Hard = same as Easy, except the query contains words that aren't actually in the joke, as if the user had heard a different version or just didn't remember it exactly.
c. Medium = the query terms don't all appear in any single joke, but each term appears in some joke or another in the cluster (as if the user just got lucky).
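
One way to check which type a query falls into is to compare its terms against each joke in the cluster. The sketch below is only an illustration under stated assumptions: the names `tokens` and `query_difficulty` are hypothetical, matching is on exact lowercase words with no stemming (so "blind" and "blinds" count as different terms, which may not be how the search engine behaves), and "hard" is read as "at least one query term appears in no joke in the cluster." The two sample texts are shortened openings of jokes 4722 and 955 from the example.

```python
import re


def tokens(text):
    """Return the set of lowercase word tokens in text (assumption: no stemming)."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def query_difficulty(query, cluster):
    """Classify a query against a cluster, given as a list of joke texts.

    easy   : every query term occurs in at least one single joke
    medium : no single joke has every term, but the cluster as a whole does
    hard   : some query term occurs in no joke of the cluster
    """
    q = tokens(query)
    docs = [tokens(joke) for joke in cluster]
    if any(q <= d for d in docs):
        return "easy"
    if q <= set().union(*docs):
        return "medium"
    return "hard"


# Shortened openings of two jokes from the example cluster.
cluster = [
    "A Nun was taking a shower one day and she heard the door bell ring",
    "A blonde girl just stepped into the bathtub when the doorbell rang.",
]

print(query_difficulty("nun shower", cluster))       # easy
print(query_difficulty("shower doorbell", cluster))  # medium
print(query_difficulty("nun parachute", cluster))    # hard
```

A check like this is only a sanity test for your own queries; the labels you turn in should still reflect your judgment of what a real user might type.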

Example

Here's an example for both tasks.

Original joke (will be given to you):
(id = 5) A nun in the convent walked into the bathroom where mother superior was taking a shower. "There is a blind man to see you," she says. "Well, if he is a blind man, than it does not matter if I'm in the shower. Send him in." The blind man walks into the bathroom, and mother superior starts to tell him how much she appreciates him working at the convent for them. She goes on and on and 10 minutes later the man interrupts: "That's nice and all, ma'am, but you can put your clothes on now. Where do you want me to put these blinds?

Searching using these keywords--blinds man--brings up two more copies:
(id = 4722) A Nun was taking a shower one day and she heard the door bell ring, she yelled "Who is it?" And the person ringing the door bell yelled, "I'm the blind man." So the Nun got out of the shower and wrapped her hair in a towel, she didn't bother putting a towel around herself because the person behind the door was blind. She opened the door and said, "What do you want?", and the man said, "I'm here to check your blinds."

(id = 955) A blonde girl just stepped into the bathtub when the doorbell rang. "Who is it?" "Blind man," came the response. Feeling charitable, the blonde dashed from the tub without bothering to put on any clothes, grabbed her purse, and opened the door. The man's jaw dropped and he stammered, "Wh-where do you want me to put these blinds, lady?"

And with these words--blind nun--one more:
(id = 8068) A nun is undressing for a bath and while she's standing naked, there's a knock at the door. The nun calls, "Who is it?" A voice answers, "A blind salesman." The nun decides to get a thrill by having the blind man in the room while she's naked so she lets him in. The man walks in, looks straight at the nun and says, "Uhhhh, well hello there, can I sell you a blind, dearie...?"

Here are queries constructed for it (admittedly, the hard queries aren't very realistic; try to do a little better):

Other details

The corpus is available to be searched at http://fitzroy.cs.umass.edu:8024/jokes. Click on the "Cached" links to actually see the jokes.

In the search engine above, you can put quotes around phrases, and you can also use "+" in front of a term to require it. For example: "blind man" +blinds

If the corpus contains no other versions of a joke, that's fine; some jokes simply won't have any.

If all the versions are identical (apart from a couple of words), note this by writing "identical" next to the id. (We can only use them if there are truly different versions.) If you can't decide whether a joke is similar enough, write "(?)" next to it.

So, if you're given (e.g.) joke 68 to start with, turn in a list of matches that might look like:
68 1027 25(?) 11213-identical 7720
Plus, a list of queries.
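
For anyone collecting these lists, the match-list format above is simple enough to read mechanically. The following is a minimal sketch, not part of the project's tooling: the function name `parse_matches` and the status labels "match", "unsure", and "identical" are hypothetical, and it assumes entries are separated by spaces, with "(?)" or "-identical" attached directly to the id as in the sample line.

```python
import re

# An entry is an id, optionally followed by "(?)" or "-identical".
ENTRY = re.compile(r"(\d+)(\(\?\)|-identical)?")

STATUS = {None: "match", "(?)": "unsure", "-identical": "identical"}


def parse_matches(line):
    """Parse a match list into (joke_id, status) pairs, in the order given."""
    entries = []
    for token in line.split():
        m = ENTRY.fullmatch(token)
        if m is None:
            raise ValueError(f"unrecognized entry: {token!r}")
        entries.append((int(m.group(1)), STATUS[m.group(2)]))
    return entries


for joke_id, status in parse_matches("68 1027 25(?) 11213-identical 7720"):
    print(joke_id, status)
```

In this reading, the first entry is the seed joke you were given and the rest are the matches you found.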

One last thing. There are documents in the corpus that aren't jokes at all; they're more like "20 things to do to annoy telemarketers" or 1-line quotes. Write down the ids of any of these you come across, and I'll remove them from the corpus. Similarly, write down the ids of documents that look like truncated jokes or that contain multiple jokes.

This link gives the seed jokes to start from.

Ten more here (for Haotian).

Eleven more here.

Nine more here.

Nine more here.