COMP10002-Indexed-Document-Retrieval

Learning Outcomes

In this project you will demonstrate your understanding of dynamic memory and linked data structures, and extend your skills in terms of program design, testing, and debugging. You will also learn about inverted indexes, and the basic principles of web search algorithms.

Indexed Document Retrieval

The idea of an inverted index was mentioned briefly in class. To build an index for some input text, the words are isolated, together with their document numbers (in our case, line numbers in the input file), and arranged so that, for every word, a list of the documents that contain that word is constructed. In terms of notation, if t is an indexed term, then f(t) is the number of documents in the collection that contain that term at least once; and for any given document d, the value f(d,t) is the number of times that t appears in d. For example, consider the text:

line one has one word twice
line two has words once only
line three follows lines one and two, but not four
line four is like the other lines, not like line five
line five has word one and word two and word three
six is the littlest one

If each line of that input file is taken to be a “document”, then the first few lines of a simple document-level inverted index for it would be

and 2 3 1 5 2
but 1 3 1
five 2 4 1 5 1
follows 1 3 1
four 2 3 1 4 1
has 3 1 1 2 1 5 1

where the first integer following each word is the f(t) value for that term t, with exactly that many (d, f(d,t)) pairs after it, all on the same input line. For example, term t = “and” appears in f(t) = 2 documents: in document d = 3 (with f(d,t) = 1, that is, one occurrence), and in document d = 5 (with f(d,t) = 2 occurrences). The full index file for the original six lines is available on the LMS, together with larger examples. Make sure that you understand the structure, and what the values represent.
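By way of illustration, the index might be stored in memory using types along the following lines; the names pair_t, entry_t, and index_t are invented for this sketch, and are certainly not the only reasonable choice:

```c
/* One possible in-memory layout for the index; illustrative only,
   and not mandated by anything in this specification. */
typedef struct {
    int d;              /* document (line) number */
    int fdt;            /* f(d,t): occurrences of the term in document d */
} pair_t;

typedef struct {
    char   *term;       /* dynamically allocated, exact-length string */
    int     ft;         /* f(t): number of documents containing the term */
    pair_t *pairs;      /* array of exactly ft (d, f(d,t)) pairs */
} entry_t;

typedef struct {
    entry_t *entries;   /* dynamic array of terms, in dictionary order */
    int      nterms;    /* number of terms currently stored */
} index_t;
```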

Stage 1 – Reading the Index

Write a program that reads an index file with this format, specified as the first (and only) argument on the command-line, and builds (using realloc() and malloc()) a data structure to store that index information. The only assumption you may make, purely for the purposes of reading the input strings, is that each term in the index will be at most 999 characters long. Apart from a single buffer of that size, all stored strings and lists of (d, f(d,t)) pairs should be held in dynamic arrays of the correct length for the data they contain (or within a factor of two of that minimum length). As evidence of the operation of this stage of your program, it should report the number of terms in the index that was read, the total number of (d, f(d,t)) pairs in the index, and up to ten of the pairs associated with the first two and the last two terms in the index, using exactly the output format that is shown here and in the LMS examples. Note that the terms are labeled from 1:

mac: ./ass2-soln test0-ind.txt
Stage 1 Output
index has 23 terms and 43 (d,fdt) pairs
term 1 is "and":
3,1; 5,2
term 2 is "but":
3,1
term 22 is "word":
1,1; 5,3
term 23 is "words":
2,1

You may assume throughout that all input files you will be provided with will be “correct”, according to the description given above – there won’t be any tricksy-wicksy deviations. That is, all strings will be lower-case alpha only; all elements will be separated by a single space; all lines will have the correct number of values; all lines (including the last one) will be ended by a single (DOS-style) CR-LF combination (see the discussion of this in the Assignment 1 FAQ page); etc. Note that the terms are provided in the index file in strict dictionary order, and that the d values in the set of pairs associated with each term are also strictly increasing; you will need to make use of these two facts in Stage 2 and Stage 3. Your Stage 1 program must open and read from a named file (and not from stdin yet, see Stage 2 for that input), so you will need to be familiar with Chapter 11 of the textbook. Output should be written to stdout (but you may also write error messages to stderr if you wish).
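As a rough sketch of how the reading loop and the doubling strategy might fit together, the following fragment uses the invented types shown earlier; the function name read_index is likewise an assumption, and assert() stands in for more graceful error handling:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>

#define MAXTERM 999     /* spec: index terms are at most 999 characters */

/* Read the whole index file into idx, growing the entries array by
   doubling so that the allocation stays within a factor of two of the
   space actually required; strings and pair lists get exact sizes. */
void
read_index(FILE *fp, index_t *idx) {
    char buffer[MAXTERM+1];
    int avail = 1;
    idx->nterms = 0;
    idx->entries = malloc(avail*sizeof(*idx->entries));
    assert(idx->entries);
    while (fscanf(fp, "%999s", buffer) == 1) {
        if (idx->nterms == avail) {
            avail *= 2;                     /* double on demand */
            idx->entries = realloc(idx->entries,
                avail*sizeof(*idx->entries));
            assert(idx->entries);
        }
        entry_t *e = idx->entries + idx->nterms++;
        e->term = malloc(strlen(buffer)+1); /* exact-length copy */
        assert(e->term);
        strcpy(e->term, buffer);
        if (fscanf(fp, "%d", &e->ft) != 1) {    /* f(t) for this term */
            fprintf(stderr, "malformed index line\n");
            exit(EXIT_FAILURE);
        }
        e->pairs = malloc(e->ft*sizeof(*e->pairs));
        assert(e->pairs);
        for (int i = 0; i < e->ft; i++) {       /* the (d, f(d,t)) pairs */
            if (fscanf(fp, "%d %d",
                    &e->pairs[i].d, &e->pairs[i].fdt) != 2) {
                fprintf(stderr, "malformed index line\n");
                exit(EXIT_FAILURE);
            }
        }
    }
}
```

Because fscanf() skips all whitespace, including the CR-LF line endings, token-at-a-time reading of this kind copes with the guaranteed-correct input without any special line handling.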

Stage 2 – Queries and Term Lookup

Extend your program so that it reads “queries” from stdin and looks up their corresponding term numbers in the index. Terms that do not appear in the index should be noted accordingly. For example (not showing the Stage 1 output again):

mac: ./ass2-soln test0-ind.txt
four and five
Stage 2 Output
"four" is term 5
"and" is term 1
"five" is term 3
line seven
Stage 2 Output
"line" is term 9
"seven" is not indexed

The eventual expectation in this stage is that you will use binary search to do this lookup; but it might be prudent to make use of linear search in the first instance, while you are debugging this stage and working on Stage 3, and then switch to binary search when you are sure of yourself and ready to extend again. You may assume that each input line will contain at most 999 characters, and at most 20 different query terms. Lines should be processed interactively, so that the output for each line is generated before the next line is typed. Query terms are limited to alphabetic characters, and should be case-folded to lowercase before the lookup is undertaken, so that they match the index. You do not need to retain each query line once it has been processed by Stage 2 and then Stage 3.
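A minimal sketch of that lookup, assuming the index_t type from the earlier sketch; fold_case() and lookup_term() are invented names:

```c
#include <ctype.h>
#include <string.h>

/* Case-fold a query word in place, before the lookup is attempted. */
void
fold_case(char *word) {
    for (char *c = word; *c; c++) {
        *c = tolower((unsigned char)*c);
    }
}

/* Binary search over the dictionary-ordered entries; returns the
   1-based term number, or 0 if the word is not indexed. */
int
lookup_term(index_t *idx, char *word) {
    int lo = 0, hi = idx->nterms - 1;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        int cmp = strcmp(word, idx->entries[mid].term);
        if (cmp == 0) {
            return mid + 1;         /* terms are labeled from 1 */
        } else if (cmp < 0) {
            hi = mid - 1;
        } else {
            lo = mid + 1;
        }
    }
    return 0;
}
```

Note that lookup_term() relies on the terms arriving in strict dictionary order, which is exactly what makes binary search valid here.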

Stage 3 – Ranked Querying

Ok, now for the fun part. Suppose that the collection has N documents, and suppose that (as before) term t appears in f(t) of them, and appears in document d a total of f(d,t) times. To determine the documents d (here, lines) that are the “strongest” matches for each query q, the following calculation is carried out on a per-document basis:

    score(d, q) = sum, over the terms t in q, of
        log_2((N + 0.5) / f(t)) × ((k + 1) × f(d,t)) / (k × ((1 − b) + b × |d| / L) + f(d,t))

where |d| is the length of document d (the number of words it contains), L is the average document length over the N documents (computed as the arithmetic mean), and k = 1.2 and b = 0.75 are constants. Broadly speaking, there are three key relationships being balanced in this equation: rare words with low f(t) values are regarded as being more important than common words; words that appear many times in a given document carry more weight than words that appear only once or twice; and long documents are discounted relative to shorter ones. This computation is known as the “BM25” similarity mechanism. A wide range of other such scoring mechanisms have been proposed and used in web search systems. The ones actually used by Google and similar are, of course, closely-guarded commercial secrets, but still embed these same three principles. For web documents, commercial search providers also make use of features such as page title text; relative font sizes; the anchor text that points to each page; language cues; the link relationships between pages; query click-through rates from other users; and so on.
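Returning to the formula itself, here is a minimal sketch of one term's contribution to one document's score; the function name bm25_contribution is invented for this sketch, and the body is a direct transcription of the equation above:

```c
#include <math.h>

#define K 1.2           /* BM25 constants, as given above */
#define B 0.75

/* One term's contribution to score(d, q), where ft = f(t),
   fdt = f(d,t), doclen = |d|, avglen = L, and N is the number of
   documents in the collection. */
double
bm25_contribution(int N, int ft, int fdt, double doclen, double avglen) {
    double idf = log2((N + 0.5) / ft);
    double tf  = ((K + 1.0) * fdt) /
                 (K * ((1.0 - B) + B * doclen / avglen) + fdt);
    return idf * tf;
}
```

As a sanity check against the six-line example: with N = 6, f("line") = 5, f(d,t) = 1, |d| = 6, and L = 49/6, this yields 0.425, which is document 1's score for the query "line seven" shown below.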

Add further functionality to your program so that as each query is read from stdin, it is processed against the index, and the top-3 matching documents (by score, with ties on score resolved in favor of increasing document number) are identified and reported. For example (with the interleaved Stage 2 outputs suppressed, see the LMS for full examples of the output required):

mac: ./ass2-soln test0-ind.txt
four and five
Stage 3 Output
document 5: score 3.619
document 3: score 3.115
document 4: score 2.978
line seven
Stage 3 Output
document 4: score 0.474
document 1: score 0.425
document 2: score 0.425

For this process to be efficient, you need to combine a number of different sources of information, bringing components in as they are required. The document lengths |d| should be stored as part of your index data structure, and can be computed once the full index has been read. A value for N can similarly be determined from the input data and stored as a component of the index structure you build. But that still leaves the task of processing the term lists to generate document scores, and determining the top few (where “few” is three in this project, but is 10 for Google and Bing, for example) in an appropriate manner. Be sure to start simple, and check your computations at every step as you build towards the required output.
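One way to obtain both of those values is a single pass over the pairs once the index has been read: N is the largest document number mentioned anywhere, and, on the reading that |d| counts the indexed words on line d, each document length is the sum of the f(d,t) values that mention d. A sketch, again using the invented types from the Stage 1 sketch:

```c
#include <stdlib.h>
#include <assert.h>

/* Derive N and the document lengths from the index itself; returns a
   1-indexed array of lengths, and sets *N as a side effect. */
double *
compute_doclens(index_t *idx, int *N) {
    *N = 0;
    for (int t = 0; t < idx->nterms; t++) {
        for (int i = 0; i < idx->entries[t].ft; i++) {
            if (idx->entries[t].pairs[i].d > *N) {
                *N = idx->entries[t].pairs[i].d;    /* largest d seen */
            }
        }
    }
    double *doclen = malloc((*N + 1) * sizeof(*doclen));
    assert(doclen);
    for (int d = 0; d <= *N; d++) {
        doclen[d] = 0.0;
    }
    for (int t = 0; t < idx->nterms; t++) {
        for (int i = 0; i < idx->entries[t].ft; i++) {
            /* |d| accumulates the f(d,t) values that mention d */
            doclen[idx->entries[t].pairs[i].d] += idx->entries[t].pairs[i].fdt;
        }
    }
    return doclen;
}
```

On the six-line example this gives N = 6 and a total of 49 words, so L = 49/6 ≈ 8.17.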

As an overall outline of the process to be implemented, to compute the required score(d, q) values you should: (a) create and set to zero an array of N document score accumulators; (b) iterate over the index entries for the (indexed) terms that appear in the query, and for each such term, iterate over its (d, f(d,t)) pairs, computing contributions to document scores one term at a time and adding them to the corresponding accumulator; and then (c) partially sort the array of accumulators (including their corresponding document numbers) to bring the biggest ones to the front. Note you don’t have to fully sort the score array in order to generate the required output (that is, there is an algorithmic efficiency question here for you to consider).
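Pulling the pieces together, steps (a) to (c) might look like the following sketch; it leans on the invented lookup_term(), bm25_contribution(), and compute_doclens() helpers from the earlier sketches, and uses a partial selection sort so that only the top three positions are ever put in order:

```c
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>

#define NEEDED 3        /* number of top documents to be reported */

/* Score one query of nq already-case-folded words against the index. */
void
score_query(index_t *idx, int N, double *doclen, double avglen,
        char query[][MAXTERM+1], int nq) {
    double *score = malloc((N+1)*sizeof(*score));
    int *docnum = malloc((N+1)*sizeof(*docnum));
    assert(score && docnum);
    for (int d = 0; d <= N; d++) {          /* (a) zero the accumulators */
        score[d] = 0.0;
        docnum[d] = d;
    }
    for (int q = 0; q < nq; q++) {          /* (b) one term at a time */
        int t = lookup_term(idx, query[q]);
        if (t == 0) {
            continue;                       /* unindexed terms add nothing */
        }
        entry_t *e = idx->entries + (t-1);
        for (int i = 0; i < e->ft; i++) {
            int d = e->pairs[i].d;
            score[d] += bm25_contribution(N, e->ft, e->pairs[i].fdt,
                doclen[d], avglen);
        }
    }
    for (int i = 1; i <= NEEDED && i <= N; i++) {   /* (c) partial sort */
        int best = i;
        for (int d = i+1; d <= N; d++) {
            if (score[d] > score[best] ||
                    (score[d] == score[best] && docnum[d] < docnum[best])) {
                best = d;                   /* ties go to the lower doc number */
            }
        }
        double ts = score[i]; score[i] = score[best]; score[best] = ts;
        int td = docnum[i]; docnum[i] = docnum[best]; docnum[best] = td;
        printf("document %d: score %.3f\n", docnum[i], score[i]);
    }
    free(score);
    free(docnum);
}
```

Only NEEDED passes are made over the accumulator array, rather than the O(N log N) cost of sorting it completely; that is the algorithmic efficiency question referred to above.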

My program to implement all three stages is a little under 500 lines long, including detailed comments and debugging output, about 100 lines longer than the solution for Assignment 1. Start early! If you want more of a challenge, try writing (but please don’t submit) the indexing program that generated the index files. And, of course, a real system would generate a query-biased caption, or snippet, for each answer document (rather than just the document number); you might like to think about how that process gets done.