Requirement
Write a program that uses hashing for the following problem. Given a natural language text, generate a table of distinct words with the number of occurrences of each word in the text. A word is defined as a series of alphanumeric characters. Note that punctuation and spaces can both (separately or together) delineate a word. There are some hyphenated words in the text; for these words there will be no space before or after the hyphen and the hyphen should be considered part of the word. Capitalized versions of a word are considered the same as the lowercase version, so feel free to change all words to either upper- or lowercase. Plural possessives and contractions should remain and be counted as a unique word (such as can’t or widows’).
You must use the provided input file, which contains the first chapter of Tolkien’s “The Hobbit” found on herePreview the documentView in a new window.
Your resulting program will be an interactive one allowing the user to type in a word (that may or may not appear in the table) and receive the number of occurrences of that word in the text until the user wishes to quit.
The challenge
Find the combination of hash function and collision resolution that will minimize the collisions. A collision is defined as an attempt to store a new word in a location already occupied by another word. Consequently, it is possible for a new word to collide more than once before it is finally stored. To qualify for full credit on the request, you must have fewer than 625 collisions and your table size must be appropriate for the number of words (see below) and the collision resolution used. Your instructor was able to reach 514 collisions simply using a version of one of the hash functions discussed in class and linear probe. The student(s) with the least collisions and proper table size will receive 5 bonus points.
Hints
There are 663 unique words in the file (using the above definition of a unique word). There are 1976 words in total.
Program description
Your program should hash the values from the file and then print (neatly formatted) the number of collisions, number of unique words and total number of words to the standard output (screen) before allowing the user to input a word to see the number of times it appeared. The user should be able to input as many words as desired before choosing to quit.
Suggestions & Notes
Reading in one word at a time from the file, create a function that pre-processes a word, changing capital letters to lowercase and removing any quotation marks or other unnecessary punctuation that may have been read from the file. Then send the pre-processed string into your hash function.
When counting collisions, do not count additional instances of a word. So you will want to keep a “temporary” collision count for a word until you know if the word is already in the table.
Note that the table size must be “reasonable” and depends upon the number of entries expected and the collision resolution used. Using a larger table than required by these factors (simply to reduce collisions) will mean a deduction. The final load factor must be between 0.5 and 0.7.
Get your program up and running by using one of the hash functions discussed in class and a simple collision resolution (such as linear probe). Make sure you are pre-processing your words correctly, etc. before attempting to minimize the collisions.