Learning Outcomes
In this project you will demonstrate your understanding of arrays, strings, and functions. You may also use typedefs and structs if you wish - and will find the program easier to assemble if you do - but you are not required to use them in order to obtain full marks. Nor do you need to make any use of malloc() in this project.
Text Search
The Unix command-line tool grep provides the ability to identify the lines in a file that exactly match a pattern supplied as a command-line argument. There are also useful options in grep for the search to be case-insensitive (“-i”) and to match on whole words only (“-w”).
However there are also times when we want to perform a less precisely defined search, looking for partial matches to (possibly multiple) strings, rather than exact matches relative to one string. For example, we might be unsure of how to spell “latitude” and “longitude”, and want to be able to use “lat long” as a query to identify - in some kind of decreasing-score order - the lines in an input document that contain one or more words that start with those two strings. Similarly, we might be unsure of how to spell someone’s name, and not be bothered if a search for “alis mof” finds all of “Alistair Moffat”, “Alison Mofet”, “Alistair, plus Bob Moffat”, and so on.
In this project you will write a program that reads text from stdin, generates a “match score” for each line relative to a query supplied on the command-line, and then prints out the lines that have the highest scores - a bit like documents are scored and ranked in web search engines. But, unlike Google and Bing, you will not use an index, and instead you are encouraged (in this project) to make use of obvious approaches. Over a query of up to a five or eight words, and an input text of up to a few megabytes (which is actually quite big), your program should (and had better!) still operate in a second or so. That is, you do not need to implement the pattern search algorithms that are being discussed in class; and may use straightforward matching techniques, including any suitable functions in string.h.
Input Data
Input to your program will come in two parts: a query, specified on the command-line (see Section 7.11 of the textbook) as a sequence of lowercase alphanumeric strings; and a stream of text, to be (always) read from stdin. If you vary away from these interface requirements the automated testing process will fail your program!
A range of text input data will be used during the post-submission testing. As you develop your program according to the stages listed below, the output will evolve. Output examples for both the alice-eg.txt and full Alice’s Adventures in Wonderland pg11.txt file are linked from the FAQ page. You should also check your program against other queries and inputs, of course. Testing and debugging is your responsibility.
Stage 1 - Checking the Command-Line
In this stage you are to demonstrate that you can access the first of the two required inputs, the query from the command-line. The query itself will be provided to you via argc and argv. If argc is zero when your program is called you should print the required error message and exit; and if any character in any of the strings making up the query is not a lowercase alphabetic or numeric character, you should print that particular string and the required error message. For example:1
2
3
4
5mac: ./ass1 < alice-eg.txt
S1: No query specified, must provide at least one word
mac: ./ass1 lat 66 loNg 32 words < alice-eg.txt
S1: query = lat 66 loNg 32 words
S1: loNg: invalid character(s) in query
mac:
Note how each output line is prefixed by the stage number that generated it.
Stage 2 - Reading the Input
In this next stage, you are to demonstrate that you can correctly access the text from stdin, by printing out each input line, its length in characters, and the number of words it contains. For the example query and text, the first few lines of your output should be (exactly):1
2
3
4
5
6mac: ./ass1 lat long < alice-eg.txt
S1: query = lat long
Down, down, down. Would the fall NEVER come to an end! 'I wonder how
S2: line = 1, bytes = 68, words = 14
---
and so on, see the FAQ page for the full required output. A word is defined to be a maximal length sequence of alphanumeric characters.
You may assume that no input line contains more than 1,000 characters. Note the item in the FAQ page about newline differences between PC and Unix systems. You should copy the mygetchar() function into your program and use it (and only it) when you are reading input lines.
Stage 3 - Scoring Lines
Each input line next needs to be given a score relative to the query. If there are q query words specified on the command-line (that is when argc = q + 1), if wi is the i th of the query words (that is, when argv[i] = wi), and if fi is the number of times that word wi is a case-insensitive prefix match against a word that appears in that input line, A prefix match occurs if every character of the query term matches at the beginning of a words in the input line. For example, “ali” is a prefix match of all of “Ali”, “Alistair”, “Alison”, and “alimentary”; and is not a prefix match of any of “alloy”, “al”, “ai”, or “malice”. Lines that have no matches against any query terms will automatically get a score of zero according to this formula. Scores are to be calculated and represented as doubles; be aware that rounding in double arithmetic might lead to your program giving slightly different values to mine in some cases. When printed to three decimal places, the values are probably going to agree, but small implementation-dependent (based primarily on the exact order the operations are carried out by the compiled program) differences in computed values are always possible.
The required output for this stage is a score per line, interleaved with the previous Stage 2 output. The FAQ shows example executions, so that you can confirm that you understand what it is that you are to compute and how it is to be output - look for the lines that commence with “S3:”. Note that log2 x can be computed via log(x)/log(2.0), with log (natural logs) available in math.h.
Stage 4 - Ranked Summary Output
Once you have the Stage 3 scoring regime working correctly, it is time to move on to the main goal - presenting lines in decreasing score order. Add data structures to your program that retain the (up to) five highest-scoring lines, and their scores. Then, once all of the input lines have been read, print those five lines (or up to five lines, if there are not five lines with non-zero scores) and their line numbers and scores, in decreasing score order. If the scores are tied (when doing simple comparisons on double values using == and <=, don’t try and be clever), then lines should be presented in line number order.
For the test file alice-eg.txt and the three-term query “ali lat long”, the required output from this stage is:1
2
3
4
5
6
7
8
9
10
11
12
13-----------------------------------------------
S4: line 9, score = 0.668
or Longitude I've got to?' (Alice had no idea what Latitude was, or
--
S4: line 10, score = 0.233
Longitude either, but thought they were nice grand words to say.)
--
S4: line 4, score = 0.229
thousand miles down, I think--' (for, you see, Alice had learnt several
--
S4: line 8, score = 0.226
'--yes, that's about the right distance--but then I wonder what Latitude
The FAQ page shows some other example interactions with the desired program.
Note that you may not retain every input line in an array of strings, and your program may not assume some maximum number of lines in the input. You can only retain five (as a #defined value, of course) lines and their scores at any given time, plus the current line that is being processed; plus access the supplied query via argv.
General tips..
You will probably find it helpful to include a DEBUG mode in your program that prints out intermediate data and variable values. Use #if DEBUG and #endif around such blocks of code, and then #define DEBUG 1 or #define DEBUG 0 at the top. Disable the debug mode when making your final submission, but leave the debug code in place. The FAQ page has more information about this.
Finally, note that the sequence of stages described in this handout is deliberate - it represents a sensible path though to the required program. You can, of course, ignore the advice and try and write final program in a single effort, without developing it incrementally and testing it in phases. You might even get away with it, this time and at this somewhat limited scale, and develop a program that works. But in general, one of the key things that makes some people better at programming than others is the ability to see a design path through simple programs, to more comprehensive programs, to final programs, that keeps the complexity under control at all times.