DNA Overview
DNA carries the genetic information of all living things. A DNA sample can tell us useful things, such as who it belongs to and who that person's relatives are. DNA profiling is the process of examining DNA to identify its owner.
DNA is a sequence of molecules called nucleotides, arranged into a particular shape (a double helix). Each nucleotide contains one of four bases: adenine (A), cytosine (C), guanine (G), or thymine (T). Every human cell holds billions of these nucleotides arranged in sequence (the genome).
How can this identification be accomplished? DNA is composed of genes (which all humans share) and “junk” DNA (sections of non-coding DNA in between the genes that are highly variable between individuals). By looking at these sections of “junk DNA” we can identify a person through a DNA sample or find relatives.
What are we looking for in this junk DNA? Inside these sections of DNA, the nucleotide patterns vary widely from person to person. Because of this high variance, we can identify individuals by comparing the STRs (Short Tandem Repeats) within their junk DNA.
What are STRs and how can I use them to tell me what I want to know? STRs are Short Tandem Repeats, which are short sequences of DNA that repeat back-to-back in DNA sequences numerous times. AGAT and ACGT are examples of short DNA sequences which we can use as STRs. Similarities in STRs would indicate genetic relationships.
Example DNA Sequence: AGATAGATAGATACGTACGT
Here, the STR “AGAT” is repeated three times, followed by another STR, “ACGT”, repeated twice (this is just one of many possibilities). The more STRs we use to analyze a DNA sequence, the more accurate our findings are. For each STR, we record the maximum number of times it repeats consecutively within one sequence, because that count is what varies between individuals.
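To make the idea of a "maximum consecutive repeat count" concrete, here is a minimal sketch that counts runs in the example sequence with a regular expression. The class name StrRunExample and the method longestRun are illustrative names, not part of the provided assignment files, and this is only one of several ways to do the counting.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StrRunExample {
    // Returns the largest number of back-to-back copies of str found in sequence.
    static int longestRun(String sequence, String str) {
        if (str.isEmpty()) {
            return 0; // an empty STR has no meaningful repeat count
        }
        // "(?:AGAT)+" matches one or more consecutive copies of the STR.
        Matcher m = Pattern.compile("(?:" + Pattern.quote(str) + ")+").matcher(sequence);
        int best = 0;
        int from = 0;
        // Restart one character after each match so that runs which
        // overlap an earlier, shorter match are still considered.
        while (m.find(from)) {
            best = Math.max(best, m.group().length() / str.length());
            from = m.start() + 1;
        }
        return best;
    }

    public static void main(String[] args) {
        String sequence = "AGATAGATAGATACGTACGT";
        System.out.println(longestRun(sequence, "AGAT")); // prints 3
        System.out.println(longestRun(sequence, "ACGT")); // prints 2
    }
}
```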
The FBI uses 20 different STRs when processing DNA for their Combined DNA Index System (CODIS) database.
For simplicity, we will be using 2 to 4 STRs that are a few nucleotides long.
The goal of this assignment is to carry out a DNA profiling method called STR analysis, where we look at STRs to:
Find out who the unknown person is
Find out who their possible parents could be
Implementation
Overview of Files
DO NOT edit ANY class besides DNA.java
DNA.java: contains some provided methods in addition to annotated method signatures for all the methods you are expected to fill in. You will write your solutions in this file. This is the only file which will be submitted for grading.
Profile.java: This represents the profile of a person. More specifically it holds the information about the 23rd pair of chromosomes that is used for identifying parent/offspring relationships.
It contains the person’s name and two DNA sequences (one from each parent). These sequences are read from the input file in createDatabaseOfProfiles().
STR.java: This class represents the Short Tandem Repeats. An STR is a short sequence of DNA nucleotides that repeat back-to-back numerous times.
StdIn: is used by the driver.
StdOut: is used by the driver.
Driver: you can use this to test any of your methods interactively. Feel free to edit this class, as it is provided only to help you test. It is not submitted and it is not used to grade your code. Scroll down for instructions on how to compile and test your assignment.
Multiple txt files: files that can be read by the driver as test cases.
profiles.txt – this file contains the actual DNA sequences of the people. These will be long Strings of A, C, G, and T, which we will scan through for the existence of STRs.
allSTRs0.txt and allSTRs1.txt – will be used to process the profiles.txt file to make the actual database. allSTRs0.txt has 1 STR, and allSTRs1.txt has 3 STRs. The more STRs there are, the more precise the results will be.
matching0.txt – will be used to test if an unknown DNA sequence exists in the database
offspring0.txt – will be used to test if potential parents of an unknown DNA sequence exist in the database.
You are allowed (and encouraged) to make your own test cases to test different edge cases! You may use the data.txt file to make new test cases, or come up with your own version of the profiles.txt file!
DNA.java
This class has two instance variables:
What are instance variables? Variables declared outside the scope of methods but inside the class. Each time an instance of the class is created a new instance variable is created.
database holds the profiles of all people in the database.
STRsOfInterest holds the STRs (Short Tandem Repeats) we are interested in looking for while DNA profiling.
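As a small illustration of instance variables, the sketch below declares two of them in a class body and shows that each object gets its own copy. NameList and its members are hypothetical names chosen for this example; they do not appear in the assignment files.

```java
class NameList {
    // Instance variables: declared inside the class but outside every method.
    // Each NameList object created with "new" gets its own names array and size.
    private String[] names;
    private int size;

    NameList(int capacity) {
        names = new String[capacity];
        size = 0;
    }

    // Stores a name in the next free slot of this object's array.
    void add(String name) {
        names[size] = name;
        size++;
    }

    int getSize() {
        return size;
    }

    String get(int index) {
        return names[index];
    }
}
```

Adding a name to one NameList leaves every other NameList unchanged, because the two objects do not share their instance variables.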
Methods implemented by you:
For all of these, look at the comments above the methods for more information.
DNA (String databaseFile, String STRsFile)
This method is already written for you. DO NOT EDIT THIS METHOD. This method creates a DNA object and calls createDatabaseOfProfiles and readAllSTRsOfInterest.
createDatabaseOfProfiles (String filename)
Input parameter: profile files
Input file format:
1 line containing an integer, p, with the number of profiles/people in the file
for each of the p profiles in the file
1 line containing the person’s name
1 line containing the first sequence of STRs that comes from one parent
1 line containing the second sequence of STRs that comes from the other parent
This method creates and populates the database array with p number of profiles from the input file.
Approach:
Create the database array to hold the profiles (don’t forget about the instance variables)
Read the number of profiles from the input file
Read the profile information from the input file
For each person in the file
create a Profile object with the information from the file (see the input file format above)
insert the newly created profile into the next position in the database array
💡*Hint: You can use StdIn.readString() to read 1 (one) string from the file.
What if we don’t have an attribute at the time of making the object? We set the attribute to null.
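The file format and approach above can be sketched as follows. This is not the assignment solution: java.util.Scanner stands in for the course's StdIn so the example runs on its own, PersonRecord is a hypothetical stand-in for the provided Profile class, and the sketch assumes that names contain no spaces.

```java
import java.util.Scanner;

class ProfileFileSketch {
    // Hypothetical stand-in for the assignment's Profile class.
    static class PersonRecord {
        final String name;
        final String sequence1;
        final String sequence2;

        PersonRecord(String name, String sequence1, String sequence2) {
            this.name = name;
            this.sequence1 = sequence1;
            this.sequence2 = sequence2;
        }
    }

    // Parses text laid out as: p, then (name, sequence, sequence) p times.
    static PersonRecord[] parse(String fileContents) {
        Scanner in = new Scanner(fileContents);
        int p = in.nextInt();                        // number of profiles
        PersonRecord[] database = new PersonRecord[p];
        for (int i = 0; i < p; i++) {
            String name = in.next();                 // one whitespace-delimited token
            String first = in.next();                // sequence from one parent
            String second = in.next();               // sequence from the other parent
            database[i] = new PersonRecord(name, first, second);
        }
        return database;
    }
}
```

Reading token by token mirrors what the hint describes: one string is consumed per call, so line breaks in the file need no special handling.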
readAllSTRsOfInterest (String filename)
Input parameter: allSTRs files
Input file format:
1 line containing an integer, s, with the number of STRs in the file
for each of the s STRs in the file
1 line containing the STR
This method creates and populates the STRsOfInterest array with s number of STRs from the input file.
The approach is similar to the one used in createDatabaseOfProfiles.
💡*Hint: You can use StdIn.readString() to read 1 (one) string from the file.
How do we get/set each attribute of an object? objectname.getAttribute() or objectname.setAttribute(value) or objectname.attribute = value
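The three access styles mentioned above look like this in practice. SampleItem, label, and repeats are hypothetical names used only for illustration; check Profile.java and STR.java for the getters and setters those classes actually provide.

```java
class SampleItem {
    // Public field: code outside the class may read and assign it directly.
    public String label;

    // Private field: code outside the class must go through the methods below.
    private int repeats;

    int getRepeats() {
        return repeats;
    }

    void setRepeats(int repeats) {
        this.repeats = repeats;
    }
}

class AttributeAccessDemo {
    public static void main(String[] args) {
        SampleItem item = new SampleItem();
        item.label = "AGAT";                       // objectname.attribute = value
        item.setRepeats(3);                        // objectname.setAttribute(value)
        System.out.println(item.getRepeats());     // objectname.getAttribute()
    }
}
```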
After createDatabaseOfProfiles and readAllSTRsOfInterest are completed, you can run the following in the driver.
Option 1
profiles.txt
allSTRs0.txt
Option 4
Notice how the STRs are still null since we initialized them to null in the createDatabaseOfProfiles method.
In addition if the constructor is called using DNA(profiles.txt, allSTRs0.txt) the DNA object would look like the figure below where:
database instance variable array contains four profiles, one for Avery, Blake, Casey and Drew
STRsOfInterest instance variable array contains one string for STR AGAT
createUnknownProfile (String filename)
Input parameter: matching files
Input file format:
first line containing a DNA sequence
second line containing a DNA sequence
This method creates and returns the profile for the unknown DNA sequence from the input file.
Approach:
Set profile name to “Unknown” because they are currently unknown
Set the S1_STRs and S2_STRs to null
Set sequence1 to the first line of the file
Set sequence2 to the second line of the file
Return the Profile object
💡*Hint: You can use StdIn.readString() to read 1 (one) string from the file.
Option 1
profiles.txt
allSTRs0.txt
Option 1
matching0.txt
Option 2
Option 5
Notice how the STRs are still null since we initialized them to null.
Notice how their name is “Unknown” since createUnknownProfile sets the name manually.
findSTRInSequence(String sequence, String STR)
Input parameter: String sequence, String STR (called from later methods)
Given a DNA sequence and a single STR, this method finds the longest run of consecutive repeats of that STR in the sequence and returns an STR object with:
the STR name
longest number of consecutive repeats
Approach:
Set new STR with parameters STR and 0 as numberOfRepeats
Check if the STR is longer than the sequence
Traverse through the sequence checking for repeating STRs
Return the STR object
*Note: If STR has more characters than the sequence then the return object will have 0 (zero) numberOfRepeats.
💡*Hint: How do I search for the longest number of consecutive STR repeats in the sequence? Look at the Java String class and the methods it includes.
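One way the traversal could go is sketched below, using String.startsWith(prefix, offset) to test for a copy of the STR at a given position. This is an illustration under assumptions rather than the expected solution: RepeatCount is a hypothetical stand-in for the provided STR class, whose actual constructor and field names may differ.

```java
class LongestRepeatSketch {
    // Hypothetical stand-in for the assignment's STR class.
    static class RepeatCount {
        final String name;
        final int repeats;

        RepeatCount(String name, int repeats) {
            this.name = name;
            this.repeats = repeats;
        }
    }

    static RepeatCount find(String sequence, String str) {
        int best = 0;
        // An STR longer than the sequence (or an empty STR) cannot repeat.
        if (!str.isEmpty() && str.length() <= sequence.length()) {
            for (int start = 0; start + str.length() <= sequence.length(); start++) {
                int run = 0;
                int pos = start;
                // Count back-to-back copies of str beginning at this start index.
                while (sequence.startsWith(str, pos)) {
                    run++;
                    pos += str.length();
                }
                best = Math.max(best, run);
            }
        }
        return new RepeatCount(str, best);
    }
}
```

Checking every start index is simple rather than fast; for the short STRs and test files described here that trade-off is likely acceptable.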
createProfileSTRs(Profile profile, String[] allSTR)
Input parameter: Profile profile, String[] allSTR
This method takes a profile and a String[] of STRs, computes the STRs for both of the profile’s sequences, and stores them in the profile. (Remember that we set them to null before.)
Approach:
Populate some new S1_STRs and S2_STRs arrays by traversing the allSTR[]
Use the setter methods to store S1_STRs and S2_STRs in the profile
*Note: You will need to use the findSTRInSequence method you just wrote
💡*Hint: Think about the attributes of STRs
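The array-building step can be sketched as below: one result array per sequence, filled by walking the list of STR names. All names here are hypothetical. RepeatCount stands in for the provided STR class, and count is a compact placeholder for the findSTRInSequence method you wrote, included only so the example runs on its own.

```java
class ProfileStrSketch {
    // Hypothetical stand-in for the assignment's STR class.
    static class RepeatCount {
        final String name;
        final int repeats;

        RepeatCount(String name, int repeats) {
            this.name = name;
            this.repeats = repeats;
        }
    }

    // Placeholder for findSTRInSequence: grows a run of copies of str
    // until the sequence no longer contains it.
    static RepeatCount count(String sequence, String str) {
        int k = 0;
        String run = str;
        while (!str.isEmpty() && sequence.contains(run)) {
            k++;
            run += str;
        }
        return new RepeatCount(str, k);
    }

    // Returns { counts for sequence1, counts for sequence2 }, in allSTR order.
    static RepeatCount[][] build(String sequence1, String sequence2, String[] allSTR) {
        RepeatCount[] first = new RepeatCount[allSTR.length];
        RepeatCount[] second = new RepeatCount[allSTR.length];
        for (int i = 0; i < allSTR.length; i++) {
            first[i] = count(sequence1, allSTR[i]);
            second[i] = count(sequence2, allSTR[i]);
        }
        return new RepeatCount[][] { first, second };
    }
}
```

In the assignment itself the two arrays would be handed to the profile through its setters rather than returned.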
createDatabaseSTRs()
Input parameter: NA
This method creates and updates the STRs for each profile in the database.
Approach:
Similar strategy used in createDatabaseOfProfiles and readAllSTRsOfInterest methods
*Note: You will need to use the createProfileSTRs that you just wrote, for each profile in the database
Now that the database has been created and the STRs for each Profile have been computed according to STRsOfInterest, we are ready to do some analysis! 🎉
identicalSTRs (STR[] s1, STR[] s2)