We’ve all been there. You counted down the seconds until tickets went on sale, scrambled to checkout in time, and waited for months until the big day. You tailgated all day, made it through the gates and through the opener, and found the perfect spot on the lawn. Now you’re seeing your favorite band live, they’re 15 songs in, and you haven’t heard your favorite one yet! What if there was a way to predict what was going to be played next?
Thanks to the extensive data found at Setlist.fm, paired with a little bit of applied statistics, there is. In this article I’ll walk through how to collect and clean the data, train a machine learning model to make predictions, and validate the accuracy of those predictions. And at the end: the world’s first AI-generated Jimmy Buffett show.
Data Collection and Cleaning
But why Jimmy Buffett? Two reasons:
- I’m a parrothead.
- He and his band have played a huge number of shows and are famous for playing a reliable number of classics at every show, which should help with model development.
The data for this analysis comes from Setlist.fm, which has a publicly available API for setlists played at live shows by nearly any artist you can think of. We’ll start by writing a quick bit of code to pull that data back. To replicate this analysis, you need to obtain a free API key from Setlist.fm, stored here in the variable api_key. You will also need to poke around their API documentation to find the “mbid” for the artist of interest.
The function “get_data()” returns the 20 concerts on the given page number as JSON. Since the API only returns one page at a time, a small for-loop appends 20 pages of setlist data together to cover the last 400 shows.
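For readers who want to try this themselves, here is a minimal sketch of what such a fetch loop could look like. It is not the original listing. The endpoint URL, the "p" page parameter, and the "x-api-key" header are my assumptions based on the public Setlist.fm REST documentation, and the helper "get_all_pages()" is a hypothetical name, so check everything against the current docs before relying on it.

```python
import json
import urllib.request

# Assumed endpoint and header names (see the Setlist.fm API docs to confirm).
BASE_URL = "https://api.setlist.fm/rest/1.0/artist/{mbid}/setlists?p={page}"


def get_data(page, mbid, api_key):
    """Fetch one page of setlists for an artist and return it as a dict."""
    request = urllib.request.Request(
        BASE_URL.format(mbid=mbid, page=page),
        headers={"x-api-key": api_key, "Accept": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def get_all_pages(mbid, api_key, n_pages=20, fetch=get_data):
    """Call `fetch` once per page (pages assumed to start at 1), in order."""
    return [fetch(page, mbid, api_key) for page in range(1, n_pages + 1)]
```

Passing the fetch function in as an argument is a design choice of this sketch: it lets you swap in a stub while developing so you don't burn through your API quota.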
The next step is the most ‘fun’ part of any data science project: cleaning your data and formatting it for use in a model. This project needs an ordered list of the songs played at each concert in the larger data set, and getting there means digging through a deeply nested set of dictionaries. The following code snippet does that. Code like this, which hides an API’s raw responses behind simpler functions, is often called an API wrapper.
Now we have a list-of-lists called “master_setlist_list”, which contains 400 ordered lists of songs played at the last 400 Jimmy Buffett shows. This data includes some one-off appearances (TV shows, private events, etc.), so the last step in the cleaning is to remove any shows with fewer than 10 songs played. A list comprehension does that. After adjusting for those performances, 356 shows remain.
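As a rough sketch of that extraction and filtering (again, not the original listing), the code below assumes each page looks like {"setlist": [{"sets": {"set": [{"song": [{"name": ...}]}]}}]}. That layout is my reading of the Setlist.fm JSON, and the function names are hypothetical.

```python
def extract_songs(show):
    """Flatten one show's nested sets into an ordered list of song names."""
    songs = []
    for one_set in show.get("sets", {}).get("set", []):
        for song in one_set.get("song", []):
            name = song.get("name")
            if name:  # skip entries with a missing or empty title
                songs.append(name)
    return songs


def build_master_setlist_list(pages, min_songs=10):
    """Collect every show's song list, then drop shows that are too short."""
    all_shows = [
        extract_songs(show)
        for page in pages
        for show in page.get("setlist", [])
    ]
    # Remove one-off appearances (TV spots, private events, and the like).
    return [songs for songs in all_shows if len(songs) >= min_songs]
```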
This is an example of what a show looks like:
Model Development and Parameter Tuning
The model we’ll use here is a classification model. In the context of machine learning, classification models determine what category something is in based on input data. Specifically here we’ll use a Naive Bayes classifier, a probabilistic model that is computationally inexpensive and that predicts the class of a new data point based on the features and classes of pre-existing data points.
In short, the Bayes classifier produces a list of the probabilities that a new data point belongs to each of the classes in a set of possible classes, given what we know about the new point and what we know about previous points. In practice, the class with the highest probability is chosen (the probability cutoff for selection is a parameter that can be tuned to the business or use case). For this model, the prediction classes will be the pool of songs played in the 356 shows in question. The data will be converted to a binary representation where a feature is 1 if that song has already been played and 0 if it has not.
In order to bootstrap the dataset and allow for predictions of the next song given any input length, the data needs to be modified a little further. The for-loop below goes through each song list and splits the list into two elements:
- The songs that have already been played, which go into X_songs
- The song that is going to be played next, which goes into Y
Doing this split creates two lists of 8,351 records each, one record for every transition between consecutive songs at a show in the data set. The first three items look like this:
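To make the split concrete, here is a small sketch on a made-up four-song show. The song names are placeholders, and starting the loop at the second song (so that every record has at least one song already played) is my reading of the description, not something confirmed by the original code.

```python
def split_into_examples(setlists):
    """Turn each show into (songs played so far, next song) pairs."""
    X_songs, Y = [], []
    for songs in setlists:
        # Start at index 1 so the "already played" list is never empty.
        for i in range(1, len(songs)):
            X_songs.append(songs[:i])
            Y.append(songs[i])
    return X_songs, Y


toy_show = ["Song A", "Song B", "Song C", "Song D"]  # placeholder titles
X_songs, Y = split_into_examples([toy_show])
for played, next_song in zip(X_songs, Y):
    print(played, "->", next_song)
# ['Song A'] -> Song B
# ['Song A', 'Song B'] -> Song C
# ['Song A', 'Song B', 'Song C'] -> Song D
```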
And so on it goes until all the data has been processed. We’ll use X_songs and Y as input and output to train the model.
The next step is the most important concept in machine learning: splitting the data into test and train samples. We’ll use an extremely helpful function from scikit-learn to do that, obtaining X_train, X_test, Y_train, and Y_test.
Finally, we need to binarize the data for use in the Naive Bayes classifier. This lets the model handle inputs of any length seen in the data. After using CountVectorizer to binarize the X data, we have a 284-column matrix of 1’s and 0’s; as a reminder, each column represents a song, and a value of 1 is present in that column if the song has already been played.
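The two steps above, splitting and binarizing, might look something like the sketch below. The toy data is made up, the 0.33 test size is an arbitrary choice, and passing a pass-through analyzer with binary=True is one way I know to make CountVectorizer treat whole song titles as tokens; the original code may have done it differently.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Toy data (made up): songs already played, and the song that came next.
X_songs = [["A"], ["A", "B"], ["A", "B", "C"], ["B"], ["B", "A"], ["B", "A", "C"]]
Y = ["B", "C", "D", "A", "C", "D"]

X_train, X_test, Y_train, Y_test = train_test_split(
    X_songs, Y, test_size=0.33, random_state=0
)

# Each "document" is already a list of song titles, so the analyzer returns it
# unchanged; binary=True records presence (1) or absence (0), not counts.
vectorizer = CountVectorizer(analyzer=lambda songs: songs, binary=True)
X_train_bin = vectorizer.fit_transform(X_train)  # learn the song columns
X_test_bin = vectorizer.transform(X_test)        # reuse the same columns

print(X_train_bin.shape, X_test_bin.shape)
```

Fitting the vectorizer on the training rows only keeps the test set truly unseen; songs that appear only in the test rows are simply ignored at transform time.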
Now we can move on to the actual fun part, namely training the model and testing its accuracy. Because this data is binary, we’ll need to import a special version of the Naive Bayes classifier, implemented in scikit-learn as BernoulliNB. Its main tuning parameter is called alpha, and it accepts as input an X matrix of binary data and a Y array of class labels.
Training the model is quite simple to do, and the complete implementation can be found in the function “predict_next_song()”. Given five random songs of my choice, it returns the most likely next song. In this case, expect to hear Come Monday next.
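For a feel of how little code this takes, here is a hedged sketch of a “predict_next_song()” style function on made-up data. It is not the original implementation: the toy shows, the alpha value of 1.0, and the pass-through analyzer are all assumptions of this example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Toy shows (made up), expanded into (songs so far, next song) pairs.
shows = [["A", "B", "C", "D"], ["A", "C", "B", "D"], ["B", "A", "C", "D"]]
X_songs = [show[:i] for show in shows for i in range(1, len(show))]
Y = [show[i] for show in shows for i in range(1, len(show))]

# One binary column per song: 1 if it has already been played.
vectorizer = CountVectorizer(analyzer=lambda songs: songs, binary=True)
model = BernoulliNB(alpha=1.0)
model.fit(vectorizer.fit_transform(X_songs), Y)


def predict_next_song(songs_played):
    """Return the single most likely next song given the songs so far."""
    features = vectorizer.transform([songs_played])
    return model.predict(features)[0]


print(predict_next_song(["A", "B"]))
```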
Remember my previous statement about the most important step in a data science project being the split of data into Test/Train? Now we’ll come to terms with that by testing the model on the subset of data that was held out for testing. The blurb under #Test Accuracy does that.
Remember, this is a data subset that has never been seen by the model. Here are the results:
16.4% correct! While that may not seem like a high number at first glance, consider that the likelihood of guessing correctly at random is ~1/284 ≈ 0.35%, so the model does roughly 47 times better than chance. An edge like that in the stock market or at the casino would make you a millionaire.
Model Tuning and Hyperparameter Selection
While 16.4% accuracy is very good, we can probably do better. The ‘alpha’ parameter of a Bernoulli Naive Bayes classifier is a smoothing parameter, which functionally does some adjustment to account for infrequent classes of data points (which in this project would be songs that are rarely played).
To find the best value of alpha for this model, we’ll train 10 models at 10 different alpha values and return the one that gives the highest accuracy in prediction.
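A tuning loop along those lines could be sketched as follows, using held-out accuracy as the score. The data here is a made-up toy set, so the numbers it prints say nothing about the real model. I also start the grid at 0.1 and leave out alpha=0, which switches smoothing off entirely; that is a choice of this sketch, not necessarily of the original.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

# Toy shows (made up), repeated so there is enough data to split.
shows = [["A", "B", "C", "D"], ["A", "C", "B", "D"],
         ["B", "A", "C", "D"], ["A", "B", "D", "C"]] * 5
X_songs = [show[:i] for show in shows for i in range(1, len(show))]
Y = [show[i] for show in shows for i in range(1, len(show))]

X_train, X_test, Y_train, Y_test = train_test_split(
    X_songs, Y, test_size=0.25, random_state=0
)
vectorizer = CountVectorizer(analyzer=lambda songs: songs, binary=True)
X_train_bin = vectorizer.fit_transform(X_train)
X_test_bin = vectorizer.transform(X_test)

# Ten candidate values: 0.1, 0.2, ..., 1.0.
alphas = [round(0.1 * k, 1) for k in range(1, 11)]
scores = {}
for alpha in alphas:
    model = BernoulliNB(alpha=alpha).fit(X_train_bin, Y_train)
    scores[alpha] = model.score(X_test_bin, Y_test)  # held-out accuracy

best_alpha = max(scores, key=scores.get)
print(best_alpha, scores[best_alpha])
```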
After running this loop, which is the most computationally intensive part of this entire exercise, we obtain the accuracy for models with value of alpha between 0 and 1, at intervals of .1:
After visualizing the results, we can see that alpha=0.2 is the optimal parameter value. The accuracy of the model with the new alpha value is 17.2%, a quick boost of 0.8 percentage points just by spending the time to tune the parameter.
Full Setlist Generation and Model Use
Having trained a model that predicts the next song at a reasonable level of accuracy, we can extrapolate and use the results to generate an entire setlist. A setlist consists of three key pieces of information:
- The length of the setlist
- The first song
- The ordered list of the rest of the songs in the setlist
We’ll start by fitting a distribution to the data on the length of the setlists, and building a function to draw from that distribution and determine the “typical” length of a setlist.
The distribution of the data plots like this:
Lucky for us, this appears to be pretty close to normal (with a slight left tail). To simplify things, we’ll call it normal and move on.
This code gist uses scipy.stats.norm to fit parameters onto a normal distribution using the data we have on the length of setlists. Finally, the “estimate_setlist_length()” function samples from that distribution and returns an integer prediction of the length of a Jimmy Buffett show.
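In case the gist isn't available where you're reading, the idea can be sketched like this. The lengths below are made up for illustration, and rounding to the nearest integer with a floor of one song is my own choice, not necessarily what the original function does.

```python
from scipy.stats import norm

# Made-up setlist lengths; in the real analysis these come from the shows.
setlist_lengths = [22, 24, 25, 23, 26, 24, 27, 25]

# norm.fit returns the maximum-likelihood mean (loc) and std deviation (scale).
mu, sigma = norm.fit(setlist_lengths)


def estimate_setlist_length(random_state=None):
    """Draw one length from the fitted normal and round to a whole song count."""
    draw = norm.rvs(loc=mu, scale=sigma, random_state=random_state)
    return max(1, int(round(float(draw))))


print(mu, sigma, estimate_setlist_length())
```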
Now that we’ve defined a function to predict the setlist length, we need to find a way to initialize our setlist. A simple way to do this is to draw at random from the first songs played at the shows in the data set.
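A minimal sketch of that draw is below; the function name is hypothetical. Because the list of openers keeps duplicates, a song that opens many shows is proportionally more likely to be picked.

```python
import random


def pick_first_song(setlists, rng=random):
    """Pick an opener at random, weighted by how often each song opened a show."""
    first_songs = [songs[0] for songs in setlists if songs]
    return rng.choice(first_songs)


# Placeholder shows: "Song A" opens two of the three.
toy_setlists = [["Song A", "Song B"], ["Song A", "Song C"], ["Song D", "Song A"]]
print(pick_first_song(toy_setlists))
```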
And at long last, we can do what was promised in the first paragraph. The following code block modifies the next-song prediction function to work down the probability list and avoid repeating songs. Then, in a for-loop whose length comes from the fitted normal distribution, we can use the model to build an AI-generated setlist.
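The generation loop can be sketched independently of the model. Here "rank_next_songs" is a hypothetical callable that returns candidate songs from most to least likely; with scikit-learn it could be built by sorting the model's classes_ by the output of predict_proba. The stub ranking at the bottom is made up purely to show the no-repeat behavior.

```python
def generate_setlist(first_song, length, rank_next_songs):
    """Build a setlist by repeatedly taking the most likely unplayed song."""
    setlist = [first_song]
    while len(setlist) < length:
        # Work down the ranked list, skipping anything already played.
        candidates = [s for s in rank_next_songs(setlist) if s not in setlist]
        if not candidates:
            break  # every known song has been played
        setlist.append(candidates[0])
    return setlist


def stub_ranking(songs_played):
    """Stand-in for a real model: always ranks the songs in the same order."""
    return ["A", "B", "C", "D"]


print(generate_setlist("B", 3, stub_ranking))
# ['B', 'A', 'C']
```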
And here’s what it looks like:
A mix of classics, covers, and some of the new stuff. If you have a spare couple of hours, you can listen to it here.
Notes and Acknowledgements