CS180-Repetition

Purpose

The purpose of this project is to help you become more familiar with scanners, String manipulation, conditionals, and advanced usage of the for loop.

These skills will help you in job interviews and future projects.

Introduction

One of the great unsolved problems of the 20th century was sequencing the entire human genome. Until our genome was sequenced we had an incomplete view of our evolution, diseases, migrations as a species and genetic differences.

One of the most challenging parts of sequencing the genome was putting it all together. This was left to computer scientists. Gene sequencing machines work by breaking DNA into small fragments, replicating them thousands of times with radioactive nucleotides, and finally reading them back.

The problem is that a human genome has three billion (3,000,000,000) base pairs, and the sequencer reads only about 500 base pairs at a time from a random location. Sequencing software pieces all of these fragments together by matching their overlaps to reconstruct the whole genome.

You are going to build sequencing software that:

  • Reads in and re-constructs overlapping sequences
  • Checks to make sure the DNA is valid
  • Looks for genes in the DNA
  • Prints out an analysis of the gene
Note:

  • Please complete the parts of the project in order, as you cannot test a section without first completing its predecessor.

Part 1 - Reconstruction

Problem
The first part of your DNA analysis program is going to be finding the longest terminating overlap between the DNA you have so far and the new DNA the sequencer gives you. Look at the following example:

We start with no DNA:
Your DNA: ""
Sequencer DNA: "ATATATATA"
New Sequence: "ATATATATA"

Your DNA: "ATATATATA"
Sequencer DNA: "ATACATGA"
New Sequence: "ATATATATACATGA"

Preparation

Know how to use the Scanner
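
As a warm-up, here is a minimal sketch of reading lines with a Scanner until a blank line appears. The class name FragmentReader and the method readFragments are hypothetical names chosen for illustration, not part of the assignment. The method takes the Scanner as a parameter so it can be tried on a String as well as on System.in.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class FragmentReader {
    // Hypothetical helper: collects lines until a blank line or the end of input.
    static List<String> readFragments(Scanner in) {
        List<String> fragments = new ArrayList<>();
        while (in.hasNextLine()) {
            String line = in.nextLine().trim();
            if (line.isEmpty()) {
                break; // a blank line ends the input
            }
            fragments.add(line);
        }
        return fragments;
    }

    public static void main(String[] args) {
        Scanner in = new Scanner(System.in);
        System.out.println(readFragments(in));
    }
}
```

Checking hasNextLine() before each read is one way to avoid an exception if the input ends without a blank line.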

Know how to use the substring, startsWith and endsWith methods in String
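
The short sketch below shows how these three String methods behave on DNA-like strings. The class name StringMethodsDemo and the helper lastChars are hypothetical and exist only for this illustration.

```java
class StringMethodsDemo {
    // Hypothetical helper: the last n characters of s, taken with substring.
    static String lastChars(String s, int n) {
        return s.substring(s.length() - n);
    }

    public static void main(String[] args) {
        String current = "ccatgctaa";
        String fragment = "taatttag";

        // substring(begin, end) includes begin and excludes end.
        String head = fragment.substring(0, 3);
        System.out.println(head);

        // substring(begin) runs from begin to the end of the string.
        System.out.println(lastChars(current, 3));

        // endsWith and startsWith compare against a whole suffix or prefix.
        System.out.println(current.endsWith(head));
        System.out.println(fragment.startsWith(lastChars(current, 3)));
    }
}
```

Run as written, this prints taa, taa, true and true on separate lines.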

Get a sheet of paper and write down instructions telling your roommate how to find the longest terminating overlap.
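
Once the instructions are on paper, they might translate into code along these lines. This is one possible sketch, not the required solution: it tries the longest candidate overlap first and shrinks it until the end of the current DNA matches the start of the new fragment. The class name OverlapJoin and the methods longestOverlap and join are hypothetical names.

```java
class OverlapJoin {
    // Length of the longest suffix of current that is also a prefix of fragment.
    static int longestOverlap(String current, String fragment) {
        int max = Math.min(current.length(), fragment.length());
        for (int len = max; len > 0; len--) {
            if (current.endsWith(fragment.substring(0, len))) {
                return len; // first match found is the longest one
            }
        }
        return 0; // no overlap at all (for example, when current is empty)
    }

    // Appends only the part of fragment that is not already at the end of current.
    static String join(String current, String fragment) {
        return current + fragment.substring(longestOverlap(current, fragment));
    }

    public static void main(String[] args) {
        String dna = "";
        dna = join(dna, "ATATATATA");
        dna = join(dna, "ATACATGA");
        System.out.println(dna); // ATATATATACATGA
    }
}
```

Counting down from the longest possible overlap means the loop can stop at the first match. Counting up would also work, but it would have to remember the last match it saw.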

Things To Do

  1. Create a class Sequencer
  2. Create a public static void main(String[] args) method in Sequencer. This is where you will put all your code.
  3. Ask the user for input by printing “Input lowercase DNA fragments one line at a time. End with a blank line.” Then read lines of DNA from the input.
  4. Convert the line to lower case
  5. Join the line with the current DNA on their longest terminating overlap like the example above.
  6. Stop scanning when the user enters a blank line.
  7. If the input contains any characters that are not a, t, c or g, print “DNA is invalid” and return; otherwise print “Input DNA: ” followed by the joined DNA.
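
The validity check in the last step could be sketched as follows. The class name DnaValidator and the methods isValidDna and report are hypothetical. The sketch treats an empty string as valid, which is an assumption the assignment does not address.

```java
class DnaValidator {
    // Returns true only if every character is one of a, t, c or g.
    static boolean isValidDna(String dna) {
        for (int i = 0; i < dna.length(); i++) {
            char c = dna.charAt(i);
            if (c != 'a' && c != 't' && c != 'c' && c != 'g') {
                return false;
            }
        }
        return true; // assumption: an empty string counts as valid
    }

    // Builds the line to print for the joined DNA.
    static String report(String dna) {
        return isValidDna(dna) ? "Input DNA: " + dna : "DNA is invalid";
    }

    public static void main(String[] args) {
        System.out.println(report("ccatgctaatttag")); // Input DNA: ccatgctaatttag
        System.out.println(report("ccatxg"));         // DNA is invalid
    }
}
```

In main, the invalid case would be followed by a return so that no later part of the program runs on bad input.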

Notes

You may assume the input DNA is lower-case

You may assume that each line of input will have at least one overlap

Example

Here is an example up to this point with the DNA ccatgctaatttag:

Input lowercase DNA fragments one line at a time. End with a blank line.
ccatgctaa
taatttag

Input DNA: ccatgctaatttag

Input lowercase DNA fragments one line at a time. End with a blank line.
atgaccggcagtctatatgactctgatgccgcaggctgcctctga

Input DNA: atgaccggcagtctatatgactctgatgccgcaggctgcctctga
Start codon position: 0
End codon position: 42
Gene: atgaccggcagtctatatgactctgatgccgcaggctgcctc

Analysis Results

Eye color: brown
Hair color: red
Can roll tongue? no