INFO1105-Prefix-Map

Overview

In this request each student (individually) will write a class that could form part of a collection library. The intended domain of application is in bioinformatics, where parts of someone’s DNA can be represented as strings where each character is one of A, C, G and T; for example “GATTACA”. The collection consists of keys, each of which is a string that represents a DNA sequence, and each key has an associated value (which is a string that gives some textual information about the sequence, such as its discoverer).

The code you write must implement a particular interface that we have defined, called PrefixMap. The code that you write must be built according to a particular data structure, called a Trie, that we describe below in more detail.

The PrefixMap interface

The PrefixMap interface has some methods inspired by the usual Map ADT, with some additional methods used to group and select keys based on their prefixes. The interface restricts the set of keys so that each is a string built from the alphabet of four characters A, C, G and T.

public interface PrefixMap {
  public boolean isEmpty();

  /*
   * How many keys are stored in the map
   */
  public int size();

  /*
   * Get the value corresponding to the key (or null if the key is not found)
   * if the key contains any character other than A, C, G, T, throw MalformedKeyException
   * if the key is null, throw IllegalArgumentException
   */
  public String get(String key);

  /*
   * Insert the value into the data structure, using the given key. If the key
   * already existed, replace and return the old value (otherwise return null)
   * if the key contains any character other than A, C, G, T, throw MalformedKeyException
   * if the key or value is null, throw IllegalArgumentException
   */
  public String put(String key, String value);

  /*
   * Remove the value corresponding to the given key from the data structure,
   * if it exists. Return the old value, or null if no value was found.
   * if the key contains any character other than A, C, G, T, throw MalformedKeyException
   * if the key is null, throw IllegalArgumentException
   */
  public String remove(String key);

  /*
   * return the number of keys which start with the given prefix if the prefix
   * contains any character other than A, C, G, T, throw MalformedKeyException
   * if the prefix is null, throw IllegalArgumentException
   */
  public int countKeysMatchingPrefix(String prefix);

  /*
   * return the collection of keys which start with the given prefix if the
   * prefix contains any character other than A, C, G, T, throw MalformedKeyException
   * if the prefix is null, throw IllegalArgumentException
   */
  public List <String> getKeysMatchingPrefix(String prefix);

  /*
   * Return the number of unique prefixes
   * e.g. if the tree stores keys GAT, GATTC, GATTACA, this method will return 8
   * because the prefixes are G, GA, GAT, GATT, GATTC, GATTA, GATTAC, GATTACA
   * In an uncompressed trie, this is the number of trie nodes, excluding the root
   */

  public int countPrefixes();
  /*
   * Return the sum of the lengths of all keys
   * e.g. if the tree stores keys GAT, GATTC, GATTACA, this method will return 15
   */

  public int sumKeyLengths();
}

Trie Data Structure

Description

A Trie (also sometimes called a Prefix Tree), is a type of search tree where instead of storing the keys in each node, instead the path to reach that node defines the key. In data sets where keys often share a common prefix this can be an efficient way to store them, as those common prefixes are only represented once instead of many times. The Trie structure was described in lecture in week 10. In this request you will implement a variation of the Trie structure for the case where the keys are made up of only 4 possible characters (A, C, G and T) and the values are arbitrary strings. Thus each Node can have only 4 possible children (one where the next character is A, one where the next character is C, etc), and so we can define a Node class where there is an array of length 4 to hold the references to children Nodes. When we do this, we do not store the character in the Node, instead the character is found by looking at which child of the parent this Node is. Each Node can also hold a value, if the sequence of characters used to reach that position is one of the keys.

The diagram below shows the data structure storing the following key-value pairs: (G, Suzy), (GAC, Bill), (GAT, Kate), (CG, Fred), (CT, Jane). Notice how the path to the node containing Kate goes from the root to its the third child (corresponding to the character G), then from that node to its first child (corresponding to A), and from that node to its fourth child (T).

In this request you will write the code for a class request which implements the PrefixMap interface, using a Prefix Tree. The full skeleton code for the request is
available for download on the resources section of Ed.

Deliverables

Code

You must produce a class called request that is suitable to be in a collection library. It must implement the PrefixMap interface exactly as we have defined that. Your class should contain an appropriate private nested class Node that represents the Node objects in the Trie.

You are advised to use recursion when writing the methods, but this is not a requirement.

Report

You must write a short report:

For each of the interface methods, describe the algorithm used, state the running time of this algorithm in big-Oh notation, and give a brief argument justifying that this cost is correct. You should express the costs in terms of n (the number of key-value pairs in the collection), m (the length in characters of the argument key or prefix), and k (the number of keys that start with the given prefix).
Describe how you tested your code. List the test cases you wrote, stating briefly the purpose of each test.

Analysis of runtime

This is based on your report. Note that you can gain these marks even if you don’t write any code, as long as you analyse algorithms that you describe which operate on the Trie.

Pass level: You state the correct big-Oh for majority of public methods, when each is implemented from the algorithm as you described it.
Distinction level: As for Pass, and also you provide convincing and valid arguments in most cases.