Goals
- Translate written descriptions of behavior into code.
- Practice representing state in a class.
- Practice interacting with the Map, Set, and List abstractions.
- Test code using unit tests.
Downloading and importing the starter code
As in previous requests, download and save (but do not decompress) the provided archive file containing the starter code. Then import it into Eclipse in the same way; you should end up with a information-retrieval- project in the “Project Explorer”.
Search engine behavior
For the purposes of this request, a search engine is a stateful object that “knows about” a set of documents and supports various queries on those documents and their contents.
Documents are identified by a unique ID and consist of a sequence of terms. Terms are (always) the lowercase version of words; operations on the search engine and documents are should therefore be case-insensitive. Documents are added one-by-one to the search engine.
The search engine function as an index. That is, given a term, the search engine can return the set of documents (that it knows about) that contain that term.
The search engine can also find a list of documents (again, from among the set it knows about) relevant to a given term, ordered from most-relevant to least-relevant. It does so using a specific version of the tf-idf statistic, which sounds intimidating but is actually fairly straightforward to calculate - so long as you have the data structures to support doing so.
What to do
The SearchEngine needs to keep track of the documents for two things: to do index lookups of terms, returning a set of documents (in indexLookup), and to compute the two components of the tf-idf statistics (in termFrequency and inverseD). You can hold this state with whatever data structures you like, but my suggestions follow.
I suggest you get addDocument and indexLooku working first. To support the index, a straightforward mapping of terms to DocumentIDs will work. (To be clear: a Map). It turns out you don’t need to create this structure; you can use the one you’ll make to support tf-idf instead, but creating this Map might be a good warmup. In any case, declare the structure(s) as instance variables, create the empty structure(s) you’ll use in the constructor, fill it/them in addDocument, and examine it/them in indexLookup. When turning the document itself into terms, use the same approach as in request 05: String.split using “\W+“, and remember toLowercase the result.
termFrequency requires that you compute the number of times a given term appears in a given document. This suggests you should have a data structure that keeps track of the count of terms per document: a Map. But this frequency-counting structure is per-document; you need to keep track of each document’s counts. So overall, I suggest a Map. The outer map goes from DocumentIds to the inner frequency-counting structure. You’ll have to update addDocument to populate and update these structures. Be sure to get clear in your head the different times you’ll use get, put, containsKey, and getOrDefault.
Once you have the structure described above, inverseDocumentFrequenc is fairly straightforward. Be sure to read the javadoc comment above the method for the exact equation the tests are expecting. Use Math.log to compute the logarithm (not Math.log10 or Math.log2).
Use these two methods to compute a given document-term pair’s tfIdf.
Finally, implement relevanceLookup, which returns a list of all documents containing a given term, sorted from largest tf-idf to smallest. You’ll probably need to implement TfIdfComparator., but note that no tests test the comparator directly, so if you have another method in mind to sort the list, go ahead. If you do implement it, make sure it returns a value that will result in the list being sorted largest-to-smallest, and mind the tie-breaker requirement.