Core Description
For the core, you will implement a program that creates a model of a music artist’s lyrics. This model receives lyric data as input and ultimately generates new lyrics in the style of that artist. To do this, you will leverage an NLP concept called an n-gram and use an NLP technique called language modeling.
Your understanding of the linked concepts and definitions is crucial to your success, so make sure to understand n-grams, language modeling, Python dictionaries as taught in the warmup, and classes and inheritance in Python before attempting to implement the core.
The core does not require you to include any external libraries beyond what has already been included for you. Use of any other external libraries is prohibited on this part of the project.
Core Structure
In the language-models/ folder, you will find four files which contain class definitions: nGramModel.py, unigramModel.py, bigramModel.py, and trigramModel.py. You must complete the prepData, weightedChoice, and getNextToken functions in nGramModel.py. You must also complete the trainModel, trainingDataHasNGram, and getCandidateDictionary functions in each of the other three files.
In the root CreativeAI repository, there is a file called generate.py, which will be the driver for generating both lyrics and music. For the core, you will implement the trainLyricsModels, selectNGramModel, generateSentence, and runLyricsGenerator functions; these functions will be called, directly or indirectly, by main, which is written for you.
We recommend that you implement the functions in the order they are listed in the spec; start with prepData and work your way down to runLyricsGenerator.
Getting New Lyrics (Optional)
If your group chooses to use lyrics from an artist other than the Beatles, you can use the web scraper we have written to get the lyrics of the new artist and save them in the data/lyrics directory for you. A web scraper is a program that gets information from web pages; ours lives in the data/scrapers directory.
If you navigate to the data/scrapers folder and run the lyricsWikiaScraper.py file, you will be prompted to input the name of an artist. If that artist is found on lyrics.wikia.com, the program will make a folder in the data/lyrics directory for that artist, and save each of the artist’s songs as a .txt file in that folder.
Explanation of Functions to Implement
prepData
The purpose of this function is to take input data in the form of a list of lists, and return a copy of that list with symbols added to both ends of each inner list.
For the core, these inner lists will be sentences, which are represented as lists of strings. The symbols added to the beginning of each sentence will be ^::^ followed by ^:::^, and the symbol added to the end of each sentence will be $:::$. These are arbitrary symbols, but make sure to use them exactly and in the correct order.
For example, if the function is passed this list of lists:
[ ['hey', 'jude'], ['yellow', 'submarine'] ]
Then it would return a new list that looks like this:
[ ['^::^', '^:::^', 'hey', 'jude', '$:::$'], ['^::^', '^:::^', 'yellow', 'submarine', '$:::$'] ]
The purpose of adding two symbols at the beginning of each sentence is so that you can look at a trigram containing only the first English word of that sentence. This captures information about which words are most likely to begin a sentence; without these symbols, you would not be able to use the trigram model at the beginning of sentences because there would be no trigrams to look at until the third word.
The purpose of adding a symbol to the end of each sentence is to be able to generate sentence endings. If you ever see $:::$ while generating a sentence in the generateSentence function, you know the sentence is complete.
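As a starting point, here is a minimal sketch of prepData; the method name and the symbols come from the spec, but the body is only an illustration of one way to build the copy:

    def prepData(self, text):
        """
        text is a list of sentences, where each sentence is a list of
        strings. Returns a copy of text with '^::^' and '^:::^' added
        to the front of each sentence and '$:::$' added to the end.
        """
        textCopy = []
        for sentence in text:
            # Concatenation builds a new inner list, so text is not modified
            textCopy.append(['^::^', '^:::^'] + sentence + ['$:::$'])
        return textCopy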
trainModel
This function trains the NGramModel child classes on the input data by building each model's dictionary of n-grams and their counts, self.nGramCounts. Note that the special starting and ending symbols also count as words for all NGramModels, which is why you should use the return value of prepData before you create the self.nGramCounts dictionary for each language model.
- For the unigram model, self.nGramCounts will be a one-dimensional dictionary of {unigram: unigramCount} pairs, where each key is a unique unigram that appears somewhere in the input data, and unigramCount is the number of times the model saw that particular unigram appear in the data. The unigram model should not consider the special symbols ^::^ and ^:::^ as words, but it should consider the ending symbol $:::$ as a word. The bigram and trigram models should consider all special symbols as words.
- For the bigram model, the dictionary will be two-dimensional. It will be structured as {unigramOne: {unigramTwo: bigramCount}}, where bigramCount is the count of how many times this model has seen unigramOne + unigramTwo appear as a bigram in the input data. For example, if the only song in the training data were Strawberry Fields Forever, and the bigram 'strawberry fields' occurred in it five times, the BigramModel's self.nGramCounts dictionary would contain the entry {'strawberry': {'fields': 5}}. (A sketch of the bigram version of trainModel appears after this list.)
- For the trigram model, the dictionary will be three-dimensional. It will be structured as {unigramOne: {unigramTwo: {unigramThree: trigramCount}}}, where trigramCount is the count of how many times this model has seen unigramOne + unigramTwo + unigramThree appear as a trigram in the input data.
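To make the structure concrete, here is a minimal sketch of the bigram version of trainModel, assuming self.nGramCounts was initialized to an empty dictionary in the constructor; the unigram and trigram versions follow the same pattern with one fewer or one more level of nesting:

    def trainModel(self, text):
        """
        BigramModel version: builds the two-dimensional dictionary of
        {unigramOne: {unigramTwo: bigramCount}} pairs from text.
        """
        text = self.prepData(text)  # add the special symbols first
        for sentence in text:
            # Slide over the sentence one bigram at a time
            for i in range(len(sentence) - 1):
                first, second = sentence[i], sentence[i + 1]
                if first not in self.nGramCounts:
                    self.nGramCounts[first] = {}
                if second not in self.nGramCounts[first]:
                    self.nGramCounts[first][second] = 0
                self.nGramCounts[first][second] += 1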
getCandidateDictionary
This function returns a dictionary of candidate next words to be added to the current sentence. More specifically, it returns the set of words that are legal to follow the sentence passed in, given the particular language model's training data. So it looks at the sentence, figures out which words the model thinks can follow the last words in the sentence, and returns that set of words and their counts. Note: when you write this function, you may assume that the trainingDataHasNGram function for this specific language model instance has returned True.
For each n-gram model, this function will look at the last n - 1 words in the current sentence, index into self.nGramCounts using those words, and return a dictionary of possible n-th words and their counts. For example, the unigram model is an n-gram model for which n = 1, so the unigram model looks at the previous 0 words in the sentence. Therefore, the unigram model sees every word in its training data as a candidate; in other words, the unigram model version of getCandidateDictionary should return its entire self.nGramCounts dictionary. Based on this knowledge, what dictionaries should the bigram and trigram models return?
Hint: the indexing method you use here will be syntactically very similar to what you did in trainingDataHasNGram.
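For instance, a minimal sketch of the trigram version might look like the following; the unigram and bigram versions use the same idea with fewer levels of indexing:

    def getCandidateDictionary(self, sentence):
        """
        TrigramModel version: indexes into self.nGramCounts using the
        last two words of sentence and returns the inner dictionary of
        possible third words and their counts. Assumes that
        trainingDataHasNGram has already returned True for sentence.
        """
        return self.nGramCounts[sentence[-2]][sentence[-1]]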
printSongLyrics
This function takes three parameters which are lists of lists of strings: verseOne, verseTwo, and chorus. It then prints out the song in this order: verse one, chorus, verse two, chorus.
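The exact formatting is up to your group; a minimal sketch might look like this:

    def printSongLyrics(verseOne, verseTwo, chorus):
        """
        Prints the song sections in order: verse one, chorus, verse two,
        chorus. Each section is a list of sentences (lists of strings).
        """
        for section in [verseOne, chorus, verseTwo, chorus]:
            for line in section:
                print(' '.join(line))  # join each line's words with spaces
            print('')  # blank line between sections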
getUserInput
This function takes three parameters: teamName, which should be the name of your group; lyricsSource, which should be the name of the artist that you’re generating lyrics for; and musicSource, which should be the name of the source from which you got your music data for the reach.
The function returns a user’s choice between 1 and 3, looping while the user does not input a valid choice. Choice 1 is for generating lyrics; choice 2 is for generating music; and choice 3 is to quit the program.
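Here is a minimal sketch of the input loop, assuming Python 3's input() and placeholder menu text (the exact wording is up to your group):

    def getUserInput(teamName, lyricsSource, musicSource):
        """
        Prints a menu and loops until the user enters 1, 2, or 3.
        Returns the user's choice as an int.
        """
        print('Welcome to the', teamName, 'generator!')
        prompt = ('Type 1 to generate lyrics in the style of ' + lyricsSource +
                  ', 2 to generate music in the style of ' + musicSource +
                  ', or 3 to quit: ')
        choice = 0
        while choice not in [1, 2, 3]:
            try:
                choice = int(input(prompt))
            except ValueError:  # non-numeric input; prompt again
                choice = 0
        return choice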
main
This function first trains instances of language models on the lyrics and music data by calling the trainLyricsModels and trainMusicModels functions. Then, it calls getUserInput and uses the return value of that function to either generate new lyrics by calling runLyricsGenerator, or generate a song by calling runMusicGenerator. Note that the trainMusicModels and runMusicGenerator functions don’t need to be touched for the core.
At the beginning of main there are several string variables to hold your group’s name, the name of the artist you’re using, etc. Make sure to update these values with your team’s name and your choices of data.
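main is provided for you, but a simplified sketch of the control flow it implements may help you see how your functions fit together. Names like LYRICSDIRS and MUSICDIRS below are placeholders for the constants in the starter code, and the provided version differs in its details:

    def main():
        teamName = 'YOUR TEAM NAME HERE'       # update these values
        lyricsSource = 'the Beatles'
        musicSource = 'YOUR MUSIC SOURCE HERE'

        lyricsModels = trainLyricsModels(LYRICSDIRS)  # placeholder constant
        musicModels = trainMusicModels(MUSICDIRS)     # placeholder constant

        userInput = getUserInput(teamName, lyricsSource, musicSource)
        while userInput != 3:
            if userInput == 1:
                runLyricsGenerator(lyricsModels)
            elif userInput == 2:
                runMusicGenerator(musicModels)
            userInput = getUserInput(teamName, lyricsSource, musicSource)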
Tips for Speeding Up Your Program
If your program is taking a long time to load the data and train the models, the likely culprit is an inefficiency in your code. The most common cause of inefficiency is too many nested loops in your trainModel functions. For example, if you have 10 words, and you run through the words once for each word in the list (i.e. 10 times), that will be 100 steps total, which is not too bad. But if you have 10,000 words in the dataset, and you look at each one 10,000 times, then that will be 100,000,000 steps, which is bad.
Each version of the trainModel function can be written correctly with at most two levels of nested for loops, and a typical program should not take more than around 30 seconds to load. Try experimenting with different loop structures if your program is taking too long to load.
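As a hypothetical illustration of the difference (not project code): counting words with list.count rescans the entire dataset once per word, while a single pass with a dictionary touches each word only once.

    words = ['hey', 'jude', 'hey', 'jude', 'hey']  # stand-in for a real dataset

    # Quadratic: .count() walks the whole list once per word
    slowCounts = {}
    for word in words:
        slowCounts[word] = words.count(word)

    # Linear: one pass with constant-time dictionary updates
    fastCounts = {}
    for word in words:
        fastCounts[word] = fastCounts.get(word, 0) + 1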
How to Run Your Program to Generate Lyrics
If you are using PyCharm, open generate.py and click “Run…” in the top navigation bar. If you are working from the command line, navigate to the root directory where your CreativeAI project is stored and type:
python generate.py
Even if you have not implemented any of the functions in the project, the starter code should work out of the box. Therefore, you can play around with it and get a feel for how the driver in main works.