Last year, I helped give an introductory python course to some MSc maths students here at the University of Southampton. As part of this course, we introduced them to some basic data analysis with pandas and machine learning with scikit-learn. For this, we needed some data to analyse. Many learning materials use the Iris flower dataset. However, I didn’t think that comparing the lengths and widths of flower petals would be particularly inspiring, so I decided to see if I could find something else.

After scouring the internet for some more interesting datasets, I came across this blog post. In the post, the author uses R and the Spotify and Genius Lyrics APIs to find the most depressing Radiohead song. Being a Radiohead fan, this certainly looked like an interesting dataset, and not having any prior experience with R or web APIs, working out how to do something similar in python looked like a nice challenge.

I began by first replicating the analysis carried out in the original post, then extended it to investigate the properties of music from different genres. In the process, I learnt a lot about web APIs, pandas, scikit-learn and just how great plotly is. Consequently, I decided to write up the steps I took in case anyone else finds music-related data as interesting as I do!

In this post, I will begin by getting the music data using the Spotify web API, then grab the lyrics using the Genius lyrics API. I’ll then do some analysis of Radiohead data and compare musical genres (using a pretty plotly figure), finishing with some basic machine learning.

This post was originally written as a Jupyter Notebook, then converted to markdown using nbconvert. If you wish to download the original notebook, it can be found here.

Getting the music data

Spotify provide a web API which can be used to download data about its music. This data includes the audio features of a track, a set of measures including ‘acousticness’, ‘danceability’, ‘speechiness’ and ‘valence’:

A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

Fortunately, there already exists a python library for interacting with the Spotify API: spotipy. At this point, I am also going to import pandas for data analysis, BeautifulSoup for some web scraping, requests for making HTTP requests to pull the source code from websites, and lxml, which Beautiful Soup will use as its parser.

import sys

import pandas as pd
import requests
import spotipy
from bs4 import BeautifulSoup
from spotipy.oauth2 import SpotifyClientCredentials
import lxml

We’ll begin by defining two functions: one which reads the spotify credentials required to use the API from a file (for this you’ll need a spotify dev account, for which you can sign up for free here), and another which then gets the audio features data for a given artist’s tracks. The get_spotify_credentials function takes the name of a text file containing the client ID and secret in the format:

credentials.txt

client_id ########
client_secret #######

def get_spotify_credentials(filename):
    if filename is None:
        raise IOError('Credentials file is none.')

    with open(filename) as f:
        txt = f.readlines()
    client_id = None
    client_secret = None
    for l in txt:
        l = l.replace('\n', '').split(' ')
        if l[0] == 'client_id':
            client_id = l[1]
        elif l[0] == 'client_secret':
            client_secret = l[1]

    client_credentials_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
    sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
    sp.trace = True

    return sp

def get_spotify_data(artist_name, credentials_file):

    # get authorisation stuff
    sp = get_spotify_credentials(credentials_file)

    # first get spotify artist uri
    results = sp.search(q='artist:' + artist_name, type='artist')
    items = results['artists']['items']
    if len(items) == 0:
        raise ValueError('Could not find artist: ' + artist_name)
    artist = items[0]

    uri = artist['uri']

    # now get album uris
    results = sp.artist_albums(uri, album_type='album')
    albums = results['items']
    while results['next']:
        results = sp.next(results)
        albums.extend(results['items'])

    uris = []
    track_names = []
    album_names = []

    # get track data
    for album in albums:
        for t in sp.album(album['uri'])['tracks']['items']:
            uris.append(t['uri'])
            track_names.append(t['name'])
            album_names.append(album['name'])

    # the audio_features endpoint accepts at most 100 track uris per call
    features = []
    for i in range(0, len(uris), 100):
        fs = sp.audio_features(uris[i:i + 100])
        if fs[0] is not None:
            features.extend(fs)

    # make dataframe
    dat = pd.DataFrame(features)
    dat['track_name'] = track_names
    dat['album'] = album_names
    dat['artists'] = artist_name

    # ignore live, remix and deluxe album versions
    mask = [('live' not in s.lower() and 'deluxe' not in s.lower()
             and 'remix' not in s.lower() and 'rmx' not in s.lower()
            and 'remastered' not in s.lower()) for s in dat.album.values]
    dat = dat[mask]
    mask2 = [(('remix' not in s.lower()) and
              'remastered' not in s.lower() and 'live' not in s.lower()
             and 'version' not in s.lower()) for s in dat.track_name.values]
    dat = dat[mask2]

    dat.set_index('track_name', inplace=True)
    dat.drop_duplicates(inplace=True)
    dat = dat[~dat.index.duplicated(keep='first')]

    return dat

Let’s try running that on an artist.

white_stripes = get_spotify_data('The White Stripes', 'credentials.txt')

Let’s look at the data spotify has given us:

white_stripes.columns
Index(['acousticness', 'analysis_url', 'danceability', 'duration_ms', 'energy',
       'id', 'instrumentalness', 'key', 'liveness', 'loudness', 'mode',
       'speechiness', 'tempo', 'time_signature', 'track_href', 'type', 'uri',
       'valence', 'album', 'artists'],
      dtype='object')
white_stripes.head()
acousticness analysis_url danceability duration_ms energy id instrumentalness key liveness loudness mode speechiness tempo time_signature track_href type uri valence album artists
track_name
Icky Thump 0.0210 https://api.spotify.com/v1/audio-analysis/09QZ... 0.424 254533 0.632 09QZAEmdbq28OaNyqTOEvY 0.011700 9 0.0545 -7.243 1 0.0922 97.694 4 https://api.spotify.com/v1/tracks/09QZAEmdbq28... audio_features spotify:track:09QZAEmdbq28OaNyqTOEvY 0.356 Icky Thump The White Stripes
You Don't Know What Love Is (You Just Do As You're Told) 0.0233 https://api.spotify.com/v1/audio-analysis/6atU... 0.427 234400 0.745 6atUkoZ6Cj4w8QzOUjQucI 0.000062 2 0.1830 -5.516 1 0.0490 83.960 4 https://api.spotify.com/v1/tracks/6atUkoZ6Cj4w... audio_features spotify:track:6atUkoZ6Cj4w8QzOUjQucI 0.562 Icky Thump The White Stripes
300 M.P.H. Torrential Outpour Blues 0.5480 https://api.spotify.com/v1/audio-analysis/2Zbn... 0.537 328560 0.435 2Zbnh37ISbOOaTu4If4lhu 0.027000 9 0.1070 -10.762 1 0.0868 85.589 4 https://api.spotify.com/v1/tracks/2Zbnh37ISbOO... audio_features spotify:track:2Zbnh37ISbOOaTu4If4lhu 0.227 Icky Thump The White Stripes
Conquest 0.0573 https://api.spotify.com/v1/audio-analysis/1ITd... 0.469 168307 0.761 1ITd91pncftIqQ2tCJsSFT 0.004390 7 0.1140 -4.791 1 0.0731 136.831 4 https://api.spotify.com/v1/tracks/1ITd91pncftI... audio_features spotify:track:1ITd91pncftIqQ2tCJsSFT 0.466 Icky Thump The White Stripes
Bone Broke 0.0850 https://api.spotify.com/v1/audio-analysis/3eMw... 0.329 194360 0.811 3eMwOh7qeMJfBZohwQeSJQ 0.822000 2 0.0783 -6.109 1 0.1180 84.205 4 https://api.spotify.com/v1/tracks/3eMwOh7qeMJf... audio_features spotify:track:3eMwOh7qeMJfBZohwQeSJQ 0.401 Icky Thump The White Stripes

As hoped, we can see that for each track we have obtained the desired acoustic properties (along with some info about its location in the Spotify database). We can also define a function that gets the data for a user’s playlist. This will be useful later on when we want to look at music from different musical genres.

def get_spotify_playlist_data(username='spotify', playlist=None, credentials_file=None):

    # set a limit to total number of tracks to analyse
    track_number_limit = 500

    # get authorisation stuff
    sp = get_spotify_credentials(credentials_file)

    # get user playlists
    p = None
    results = sp.user_playlists(username)
    playlists = results['items']

    if playlist is None: # use first of the user's playlists
        playlist = playlists[0]['name']

    for pl in playlists:
        if pl['name'] is not None and pl['name'].lower() == playlist.lower():
            p = pl
            break
    while results['next'] and p is None:
        results = sp.next(results)
        playlists = results['items']
        for pl in playlists:
            if pl['name'] is not None and pl['name'].lower() == playlist.lower():
                p = pl
                break

    if p is None:
        print('Could not find playlist')
        return

    results = sp.user_playlist(p['owner']['id'], p['id'], fields="tracks,next")['tracks']
    tracks = results['items']
    while results['next'] and len(tracks) < track_number_limit:
        results = sp.next(results)
        if results['items'][0] is not None:
            tracks.extend(results['items'])

    ts = []
    track_names = []

    for t in tracks:
        track = t['track']
        track['album'] = track['album']['name']
        track_names.append(t['track']['name'])
        artists = []
        for a in track['artists']:
            artists.append(a['name'])
        track['artists'] = ', '.join(artists)
        ts.append(track)

    dat = pd.DataFrame(ts)

    dat.drop(['available_markets', 'disc_number', 'external_ids', 'external_urls'], axis=1, inplace=True)

    features = []
    # loop to take advantage of spotify being able to get data for 100 tracks at once
    for i in range(0, len(dat), 100):
        fs = sp.audio_features(dat.uri.iloc[i:i + 100])
        if fs[0] is not None:
            features.extend(fs)

    fs = pd.DataFrame(features)

    dat = pd.concat([dat, fs], axis=1)
    dat['track_name'] = track_names

    # ignore live, remix and deluxe album versions
    mask = [(('live' not in s.lower()) and ('deluxe' not in s.lower())
             and ('remix' not in s.lower())) for s in dat.album.values]
    dat = dat[mask]
    mask2 = [(('remix' not in s.lower()) and
              'remastered' not in s.lower()
             and 'version' not in s.lower()) for s in dat.track_name.values]
    dat = dat[mask2]

    dat.set_index('track_name', inplace=True)
    dat = dat[~dat.index.duplicated(keep='first')]
    dat = dat.T[~dat.T.index.duplicated(keep='first')].T

    return dat

acoustic_grit = get_spotify_playlist_data(playlist="acoustic grit", credentials_file='credentials.txt')
acoustic_grit.head()
album artists duration_ms episode explicit href id is_local name popularity ... instrumentalness key liveness loudness mode speechiness tempo time_signature track_href valence
track_name
Dry Dirt (Stripped) Spirit's Furnace The Bones of J.R. Jones 214036 False False https://api.spotify.com/v1/tracks/7g4fX37Y3lzi... 7g4fX37Y3lziLMoxrTTGI3 False Dry Dirt (Stripped) 42 ... 0.358 9 0.112 -12.715 1 0.0588 126.195 4 https://api.spotify.com/v1/tracks/7g4fX37Y3lzi... 0.326
Thousand Mile Night Thousand Mile Night - Single Jonah Tolchin 235173 False False https://api.spotify.com/v1/tracks/5yeM63cXgGvS... 5yeM63cXgGvSN2VrcHbv6x False Thousand Mile Night 51 ... 0.0719 2 0.0944 -13.942 1 0.0471 140.652 4 https://api.spotify.com/v1/tracks/5yeM63cXgGvS... 0.498
Lead Me Home - The Walking Dead Soundtrack The Walking Dead (AMC’s Original Soundtrack – ... Jamie N Commons 117426 False False https://api.spotify.com/v1/tracks/2DBFAJgsqhYk... 2DBFAJgsqhYk5Z1AF7tAMH False Lead Me Home - The Walking Dead Soundtrack 46 ... 0.00015 6 0.118 -11.069 0 0.0323 136.278 5 https://api.spotify.com/v1/tracks/2DBFAJgsqhYk... 0.175
Whispered Words (Pretty Lies) Keep It Hid Dan Auerbach 246131 False False https://api.spotify.com/v1/tracks/4pPWNeApSiOR... 4pPWNeApSiORRQucFZt85y False Whispered Words (Pretty Lies) 3 ... 0.000453 9 0.093 -6.9 1 0.0734 86.697 4 https://api.spotify.com/v1/tracks/4pPWNeApSiOR... 0.481
Set My Soul on Fire Down to the River The War and Treaty 299680 False False https://api.spotify.com/v1/tracks/5yuqWMCOtMY0... 5yuqWMCOtMY0IBaQCBzqT5 False Set My Soul on Fire 48 ... 0 11 0.128 -10.767 0 0.0382 122.416 4 https://api.spotify.com/v1/tracks/5yuqWMCOtMY0... 0.299

5 rows × 29 columns
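Both get_spotify_data and get_spotify_playlist_data request audio features in batches because the audio_features endpoint accepts at most 100 track uris per call. Stripped of the spotipy calls, the batching pattern boils down to the following sketch (chunks is a made-up helper name, not part of spotipy):

```python
def chunks(items, size=100):
    """Yield successive slices of `items` containing at most `size` elements."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# 250 fake track uris split into batches of 100, 100 and 50
uris = ['spotify:track:{}'.format(n) for n in range(250)]
batches = list(chunks(uris))
print([len(b) for b in batches])  # [100, 100, 50]
```

Each batch would then be passed to sp.audio_features in turn, with the results accumulated into a single list.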

Getting the lyrical data

To get the lyrical data, we shall be using the Genius lyrics API. We do this by first submitting a query to find the artist id, then submitting another query to download the lyrics. To my knowledge, there is no existing python library for interacting with this API (as there was for the Spotify API), so we need to do a bit more work here.

To submit the API requests, we shall be using the python requests library. This returns a JSON object containing the url of the page containing the song lyrics. To get the lyrics themselves, we use requests to fetch that page’s source code, then use Beautiful Soup with the lxml parser to find the div container holding the lyrics.

Note that the search_genius function takes a credentials file as an argument. This credentials file contains the token required to interact with the API.
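Mirroring the spotify credentials file, the parsing code below expects the token to be stored on a single space-separated line (the exact token value is of course your own):

credentials.txt

genius_token ########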

def search_genius(query, credentials_file, return_artist_id=True):
    with open(credentials_file) as f:
        txt = f.readlines()
    genius_token = None
    for l in txt:
        l = l.replace('\n', '').split(' ')
        if l[0] == 'genius_token':
            genius_token = l[1]

    API = 'https://api.genius.com'
    HEADERS = {'Authorization': 'Bearer ' + genius_token}

    search_endpoint = API + '/search?'
    payload = {'q': query}
    search_request_object = requests.get(search_endpoint, params=payload, headers=HEADERS)

    if search_request_object.status_code == 200:
        s_json_response = search_request_object.json()

        if len(s_json_response['response']['hits']) == 0:
            return None

        hit = s_json_response['response']['hits'][0]

        if return_artist_id:
            return hit['result']['primary_artist']['id']
        else:
            return hit['result']['url']

    elif 400 <= search_request_object.status_code < 500:
        print('[!] Uh-oh, something seems wrong...')
        print('[!] Please submit an issue at https://github.com/donniebishop/genius_lyrics/issues')
        sys.exit(1)

    elif search_request_object.status_code >= 500:
        print('[*] Hmm... Genius.com seems to be having some issues right now.')
        print('[*] Please try your search again in a little bit!')
        sys.exit(1)

    return

def get_songs(artist_id, credentials_file):
    with open(credentials_file) as f:
        txt = f.readlines()
    genius_token = None
    for l in txt:
        l = l.replace('\n', '').split(' ')
        if l[0] == 'genius_token':
            genius_token = l[1]

    API = 'https://api.genius.com'
    HEADERS = {'Authorization': 'Bearer ' + genius_token}
    songs = []
    page = 1
    search_endpoint = API + '/artists/' + str(artist_id) + '/songs'

    while True:
        payload = {'per_page': 50, 'page': page}
        search_request_object = requests.get(search_endpoint, params=payload, headers=HEADERS)

        if search_request_object.status_code != 200:
            break
        else:
            s_json_response = search_request_object.json()
            if len(s_json_response['response']['songs']) == 0:
                break
            for song in s_json_response['response']['songs']:
                songs.append([song['title'], song['id'], song['url']])
            page += 1
    return songs

def get_lyrics(url):
    get_url = requests.get(url)
    song_soup = BeautifulSoup(get_url.text, 'lxml')

    # find the div container holding the lyrics
    strings = []
    for d in song_soup.find_all('div'):
        if d.get('class') and d['class'][0] == 'lyrics':
            strings = d.stripped_strings
            break

    # drop section headers such as [Verse 1] or [Chorus]
    lyrics = [s for s in strings if s[0] != '[']

    return ' '.join(lyrics)

Let’s try and run that on our White Stripes data:

artist_id = search_genius("The White Stripes", 'credentials.txt')
genius_songs = get_songs(artist_id, 'credentials.txt')
name = genius_songs[0][0]
url = genius_songs[0][2]

lyrics = get_lyrics(url)

print(name, '\n')
print(lyrics)
300 M.P.H. Torrential Outpour Blues

I'm bringing back ghosts That are no longer there I'm getting hard on myself Sitting in my easy chair
Well, there's three people in the mirror And I'm wondering which one of them i should choose
Well, I can't keep from laughing Spitting out these 300 mile per hour out-pour blues I'm breaking my teeth off
Trying to bite my lip There's all kinds of red-headed women That I ain't supposed to kiss
And it's that color that never fails To turn me blue So I just swallow it and hold on to it
And use it to scare the hell out of you I have a woman 'Says come and watch me bleed
And I'm wondering just how I can do that And still give her everything that she needs
Well, there's three people in my head that have the answer And one of them's got to be you
But you're holding tight to it -- the answer Singing these three hundred mile per hour out-pour blues
Put on gloves, a tied scarf and wrap up warm On this winter night Every-time you get defensive
You're just looking for a fight It's safe to sing somebody out there's got a problem
With almost anything you'll do Well, next time they stab you don't fight back just play the victim
Instead of playing the fool And the roads are covered with a million Little molecules
Of cigarette ashes and the school floors are covered With pieces of pencil eraser too
Well sooner or later the ground's gonna be holding all Of my ashes too But I can't help but wonder if after
I'm gone will i still have these three hundred mile per Hour, finger breaking, no answers making,
battered dirty hands, bee stung and busted up, empty Cup torrential out pour blues
One thing's for sure: in that graveyard I'm gonna have the shiniest pair of shoes

We’re going to be interested in getting the lyrics for a load of different tracks in a playlist. Let’s create a function that finds the urls of the lyrics pages for all the tracks in a playlist.

def get_playlist_urls(df, credentials_file):
    # get the urls for the lyrics of all the songs in a dataframe
    if 'genius_url' in df.columns and df.iloc[0].genius_url is not None:
        return
    df['genius_url'] = None

    for i, r in df.iterrows():
        try:
            url = search_genius(r['artists'] + ', ' + i, credentials_file, return_artist_id=False)
            df.at[i, 'genius_url'] = url
        except IndexError:
            pass

get_playlist_urls(acoustic_grit, 'credentials.txt')
acoustic_grit.genius_url.head()
track_name
Dry Dirt (Stripped)                           https://genius.com/Big-d-the-impossible-time-o...
Thousand Mile Night                           https://genius.com/Jonah-tolchin-thousand-mile...
Lead Me Home - The Walking Dead Soundtrack    https://genius.com/Jamie-n-commons-lead-me-hom...
Whispered Words (Pretty Lies)                 https://genius.com/Dan-auerbach-whispered-word...
Set My Soul on Fire                           https://genius.com/President-james-k-polk-pres...
Name: genius_url, dtype: object

Now let’s wrap this inside a function which grabs the lyrics for all the songs in our playlist.

def get_playlist_lyrics(df, credentials_file):
    # get the lyrics for all the songs in a dataframe
    get_playlist_urls(df, credentials_file)

    if 'lyrics' in df.columns and df.iloc[0].lyrics is not None:
        return
    df['lyrics'] = None

    for i, r in df.iterrows():
        if r['genius_url'] is not None:
            lyrics = get_lyrics(r['genius_url'])
            df.at[i, 'lyrics'] = lyrics

get_playlist_lyrics(acoustic_grit, 'credentials.txt')
acoustic_grit.lyrics.head()
track_name
Dry Dirt (Stripped)                           Time out lay out Body build re-circle lyric Jo...
Thousand Mile Night                           Thousand mile night, Mobile to Michigan Old ra...
Lead Me Home - The Walking Dead Soundtrack    Oh lord live inside me Lead me on my way Oh lo...
Whispered Words (Pretty Lies)                 I hear words, pretty lies Like the ones they t...
Set My Soul on Fire                           James K. Polk XI President of the United State...
Name: lyrics, dtype: object

Analysing the data

Now that we have created all the functions needed to grab the data, let’s do some analysis. We’ll begin by trying to recreate the analysis done in the original blog post. First, let’s download the Radiohead spotify data.

radiohead = get_spotify_data('Radiohead', 'credentials.txt')

Let’s now sort the dataset by valence to find the most depressing songs:

radiohead[['album','valence']].sort_values(by='valence', ascending=True).head(10)
album valence
track_name
We Suck Young Blood Hail To the Thief 0.0378
True Love Waits A Moon Shaped Pool 0.0379
MK 1 In Rainbows Disk 2 0.0389
MK 2 In Rainbows Disk 2 0.0390
The Tourist OK Computer 0.0398
Motion Picture Soundtrack Kid A 0.0435
Go Slowly In Rainbows Disk 2 0.0439
Videotape In Rainbows 0.0466
Life In a Glasshouse Amnesiac 0.0497
Tinker Tailor Soldier Sailor Rich Man Poor Man Beggar Man Thief A Moon Shaped Pool 0.0507

And the least depressing songs:

radiohead[['album','valence']].sort_values(by='valence', ascending=False).head(10)
album valence
track_name
15 Step In Rainbows 0.847
Jigsaw Falling Into Place In Rainbows 0.808
Where Bluebirds Fly Com Lag: 2+2=5 0.746
Fitter Happier OK Computer 0.744
Backdrifts Hail To the Thief 0.732
Feral The King Of Limbs 0.729
Bodysnatchers In Rainbows 0.727
There, There Hail To the Thief 0.717
I Am a Wicked Child Com Lag: 2+2=5 0.692
Paperbag Writer Com Lag: 2+2=5 0.682

In the original post, the author tries to improve on spotify’s valence measure by instead calculating a ‘gloom index’, based on this post by Myles Harrison. This takes into account the lyrics of the song, calculating what percentage of the lyrics are ‘sad’.

I thought it would be interesting to similarly calculate a ‘happiness index’, which performs a similar calculation but instead uses the percentage of happy words in the song lyrics.

In order to calculate the number of happy and sad words in the songs, I used the NRC Emotion Lexicon.
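Concretely, the gloom index computed below combines the inverted valence with the fraction of sad words, weighted by the lyrical density (words per second); the happiness index is the mirror image using valence and happy words. As a minimal sketch with made-up numbers (gloom_index is a hypothetical helper, not a function from the original post):

```python
def gloom_index(valence, lyrics, sad_words, duration_ms):
    words = lyrics.lower().split()
    pct_sad = sum(w in sad_words for w in words) / len(words)
    density = len(words) / duration_ms * 1000.  # words per second
    return 0.5 * ((1. - valence) + pct_sad * (1. + density))

# toy example: 8 words, 3 of them 'sad', in a 20-second track
g = gloom_index(0.2, 'sad tears fall on a cold dark night',
                {'sad', 'tears', 'dark'}, 20000)
print(round(g, 4))  # 0.6625
```

The pandas versions below do the same thing row by row, after stripping out a list of common stop words.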

lex = pd.read_table('NRC-emotion-lexicon-wordlevel-v0.92.txt', names=['TargetWord','AffectCategory','AssociationFlag'])

sad_words = lex[(lex.AssociationFlag==1) & (lex.AffectCategory == 'sadness')]['TargetWord'].values

happy_words = lex[(lex.AssociationFlag==1) & (lex.AffectCategory == 'joy')]['TargetWord'].values

ignore = ['a', 'i', 'it', 'the', 'and', 'in', 'he', 'she',
          'to', 'at', 'of', 'that', 'as', 'is', 'his', 'my',
          'for', 'was', 'me', 'we', 'be', 'on', 'so', 'by', 'you',
          "it's", "i'm", 'oh']

def gloom(df, ignore=ignore, sad_words=sad_words):
    if 'gloom' in df.columns and df.iloc[0].gloom != -1:
        return
    df['gloom'] = -1.
    for i, r in df.iterrows():
        v = r.valence
        try:
            filtered = r.lyrics.lower()
            for j in ignore:
                filtered = filtered.replace(' ' + j + ' ', ' ')
            num_sad = 0.
            filtered = filtered.split(' ')
            for w in filtered:
                if w in sad_words:
                    num_sad += 1.

            percentage_sad = num_sad / len(filtered)
            density = len(filtered) / r.duration_ms * 1000.

            gloom = 0.5 * ((1. - v) + percentage_sad * (1. + density))
        except AttributeError: # song has no lyrics
            gloom = 0.5 * (1. - v)

        df.at[i, 'gloom'] = gloom

def joy(df, ignore=ignore, happy_words=happy_words):
    if 'happiness' in df.columns and df.iloc[0].happiness != -1:
        return
    df['happiness'] = -1.
    for i, r in df.iterrows():
        v = r.valence
        try:
            filtered = r.lyrics.lower()
            for j in ignore:
                filtered = filtered.replace(' ' + j + ' ', ' ')
            num_happy = 0.
            filtered = filtered.split(' ')
            for w in filtered:
                if w in happy_words:
                    num_happy += 1.

            percentage_happy = num_happy / len(filtered)
            density = len(filtered) / r.duration_ms * 1000.

            happiness = 0.5 * (v + percentage_happy * (1. + density))
        except AttributeError: # song has no lyrics
            happiness = 0.5 * v
        df.at[i, 'happiness'] = happiness

get_playlist_lyrics(radiohead, 'credentials.txt')
gloom(radiohead)
joy(radiohead)
radiohead[['album','valence', 'gloom']].sort_values(by='gloom', ascending=False).head(10)
album valence gloom
track_name
True Love Waits A Moon Shaped Pool 0.0379 0.591282
Give Up The Ghost The King Of Limbs 0.1590 0.507669
We Suck Young Blood Hail To the Thief 0.0378 0.502327
Tinker Tailor Soldier Sailor Rich Man Poor Man Beggar Man Thief A Moon Shaped Pool 0.0507 0.502317
Dollars & Cents Amnesiac 0.0881 0.499827
Pyramid Song Amnesiac 0.0686 0.496156
Let Down OK Computer 0.1450 0.494404
Life In a Glasshouse Amnesiac 0.0497 0.494085
The Tourist OK Computer 0.0398 0.489094
Bullet Proof ... I Wish I Was The Bends 0.0773 0.488363

We now see that some songs with higher valence have been deemed sadder based on their lyrical content. For example, Give Up The Ghost contains the sad words ‘hurt’, ‘lost’ and ‘impossible’, which are repeated a lot and so end up making up a significant percentage of the song’s lyrics.

radiohead.loc['Give Up The Ghost'].lyrics
"Don't hurt me, don't haunt me Don't hurt me, don't haunt me Don't hurt me Gather up the lost and their souls
(Don't haunt me) Into your arms (Don't hurt me) Into your arms (Don't haunt me) Gather up the pitiful
(Don't hurt me, don't haunt me) Into your arms (Don't hurt me) Into your arms (Don't haunt me)
Into your arms (Don't hurt me) Into your arms (Don't haunt me) (Into your arms) What seems impossible
(Don't hurt me, don't haunt me) Into your arms (Don't hurt me) Into your arms (Don't haunt me) Into your arms
I think I have had my fill (Don't hurt me, don't haunt me) Into your arms (Don't hurt me) Into your arms
(Don't haunt me) Into your arms (Don't hurt me) Into your arms (Don't haunt me) I think I should give up the ghost
(Don't hurt me, don't haunt me) Into your arms (Don't hurt me, don't haunt me) Into your arms (Don't hurt me)
Into your arms (Don't haunt me) Into your arms (Don't hurt me) Into your arms (Don't haunt me) Into your arms
(Don't hurt me) Into your arms (Don't haunt me) Into your arms (Don't hurt me) Into your arms (Don't haunt me)
Into your arms (Don't hurt me) Into your arms (Don't haunt me)"

Let’s similarly look for the ‘happiest’ songs:

radiohead[['album','valence', 'happiness', 'gloom']].sort_values(by='happiness', ascending=False).head(10)
album valence happiness gloom
track_name
Fitter Happier OK Computer 0.744 0.435214 0.191214
I Am a Wicked Child Com Lag: 2+2=5 0.692 0.425438 0.183789
15 Step In Rainbows 0.847 0.423500 0.076500
Jigsaw Falling Into Place In Rainbows 0.808 0.414006 0.131020
Sulk The Bends 0.671 0.405095 0.171460
Where Bluebirds Fly Com Lag: 2+2=5 0.746 0.392560 0.136780
I Promise OK Computer OKNOTOK 1997 2017 0.487 0.369156 0.256500
Backdrifts Hail To the Thief 0.732 0.366000 0.199295
Separator The King Of Limbs 0.659 0.365018 0.197138
There, There Hail To the Thief 0.717 0.364853 0.166911

It is noticeable that the songs with the ‘happiest’ lyrics don’t necessarily have the highest valence. I Promise scores low on the valence scale, but is judged to have the 7th happiest set of lyrics overall. Looking at the lyrics, we see this is likely to be due to the repetition of the word ‘promise’, which is classed as a happy word. It also has a fairly high gloom index as the other words in the song are fairly negative.

radiohead.loc['I Promise'].lyrics
"I won't run away no more, I promise Even when I get bored, I promise Even when you lock me out,
I promise I say my prayers every night, I promise I know which side I'm spread, I promise
The tantrums and the chitty chats, I promise Even when the ship is wrecked, I promise
Tie me to the rotting deck, I promise I won't run away no more, I promise Even when I get bored,
I promise Even when the ship is wrecked, I promise Tie me to the rotting deck, I promise
I won't run away no more, I promise"

Looking at musical genres

After replicating the Radiohead analysis, I thought it might be interesting to look at tracks from different musical genres and compare their characteristics. To do this, I downloaded the data for Spotify playlists containing songs from a variety of genres. From this, I then made use of plotly’s interactivity to produce a plot that allows us to investigate the different measures.

# RapCaviar, Pop Rising, Ultimate Indie, Top picks country, truly deeply house, metal essentials

dfs = {'indie': pd.read_csv('indie.csv'), 'pop': pd.read_csv('pop.csv'), 'country': pd.read_csv('country.csv'),
       'metal': pd.read_csv('metal.csv'), 'house': pd.read_csv('house.csv'), 'rap': pd.read_csv('rap.csv')}
import plotly
import plotly.graph_objs as go

plotly.offline.init_notebook_mode(connected=True)
# function to make list of traces given dictionary of dataframes and the dataframe keys to be plotted
def make_traces(x, y, dfs):
    ts = []
    for name, df in dfs.items():
        ts.append(go.Scatter(x=df[x], y=df[y], mode='markers',
                       name=name, text=df.name + ' - ' + df.artists))
    return ts

data = dict()

# define which categories we want to include
categories = ['duration_ms', 'popularity', 'acousticness', 'danceability',
       'energy', 'instrumentalness', 'key', 'liveness', 'loudness',
       'mode', 'speechiness', 'tempo', 'time_signature', 'valence']

# for each category make a list of the data from each dataframe
for cat in categories:
    data[cat] = [df[cat] for df in dfs.values()]

# define behaviour of dropdown menus
# we've defined the buttons using list comprehensions - on selection, the x/y data and axis label are updated
updatemenus = list([
    dict(x=-0.05, y=0.8,
         buttons=list([   
            dict(label = cat, method = 'update',
                 args = [{'x': data[cat]}, {'xaxis': dict(title = cat)}]) for cat in categories
        ])
    ),
    dict(x=-0.05,  y=1,
         buttons=list([   
            dict(label = cat, method = 'update',
                 args = [{'y': data[cat]}, {'yaxis': dict(title = cat)}]) for cat in categories
        ])
    )
])

# set the initial data
initial_dat = go.Data(make_traces('duration_ms', 'duration_ms', dfs))

# make the layout
layout = dict(title='Compare genres', showlegend=True,
              updatemenus=updatemenus)

fig = dict(data=initial_dat, layout=layout)
fig['layout'].update(hovermode='closest')
plotly.offline.iplot(fig)

This is pretty interesting: looking at energy vs danceability, if you select just the metal, country and house datasets (click on the name of a dataset in the legend to hide it), you can see that the data form three pretty distinct clusters. Surprisingly, the rap and house datasets occupy a similar region of the plot. The metal dataset is the most tightly clustered, with all tracks being high energy but not very danceable. We can also see that house tracks tend to have a very consistent tempo (120 bpm) and be the longest, that rap music tends to be the most popular, and that almost all songs have a 4/4 time signature.

Machine learning

From the above plot, it can be seen that the data form fairly distinct clusters. This suggests it may be possible to build some kind of machine learning genre classifier using scikit-learn. Knowing very little about machine learning, I pretty much stuck to the examples in the documentation to create this, so the resulting classifier is almost certainly much less successful than it could be with some tuning.

from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.metrics import accuracy_score
for genre, df in dfs.items():
    df['genre'] = genre

dat = pd.concat(dfs.values())

data = dat[['duration_ms', 'explicit', 'popularity', 'acousticness', 'danceability',
       'energy', 'instrumentalness', 'key', 'liveness', 'loudness',
       'mode', 'speechiness', 'tempo', 'time_signature', 'valence']].values
labels = dat['genre'].values

data_train, data_test, labels_train, labels_test = train_test_split(data, labels, test_size = 0.3)

classifier = tree.DecisionTreeClassifier()
classifier.fit(data_train, labels_train)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
accuracy = accuracy_score(labels_test, classifier.predict(data_test))
print("Decision tree accuracy with a 70/30 train/test split: {}".format(accuracy))
Decision tree accuracy with a 70/30 train/test split: 0.6298701298701299

This isn’t a great accuracy; however, if we drop one of the genres (especially one of the less tightly clustered ones, such as pop, indie or rap), the accuracy increases significantly. Not knowing much about machine learning classifiers, it’s also quite possible I’ve chosen one that isn’t well suited to this dataset, and the accuracy would likely improve with a more suitable choice.
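As an aside, a single train/test split gives a fairly noisy accuracy estimate; k-fold cross-validation averages over several splits and is usually a more reliable way to score a classifier. A minimal sketch is below; since it needs to be self-contained, synthetic data from `make_classification` stands in for the real Spotify feature matrix (14 features, 6 "genres" are just illustrative choices).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for the audio-feature matrix: 600 "tracks",
# 14 numeric features, 6 "genre" labels
X, y = make_classification(n_samples=600, n_features=14,
                           n_informative=8, n_classes=6,
                           random_state=0)

# score the same decision tree on 5 different train/test splits
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print("mean accuracy: {:.3f} (std {:.3f})".format(scores.mean(), scores.std()))
```

The spread of the five scores gives a feel for how much the accuracy figure depends on which tracks happen to land in the test set.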

nopop_dat = dat[dat.genre != 'pop']
data = nopop_dat[['duration_ms', 'explicit', 'popularity', 'acousticness', 'danceability',
       'energy', 'instrumentalness', 'key', 'liveness', 'loudness',
       'mode', 'speechiness', 'tempo', 'time_signature', 'valence']].values
labels = nopop_dat['genre'].values

data_train, data_test, labels_train, labels_test = train_test_split(data, labels, test_size = 0.3)

classifier = tree.DecisionTreeClassifier()
classifier.fit(data_train, labels_train)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
accuracy = accuracy_score(labels_test, classifier.predict(data_test))
print("Decision tree accuracy with a 70/30 train/test split: {}".format(accuracy))
Decision tree accuracy with a 70/30 train/test split: 0.8173076923076923

That’s much better!
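If you wanted to try a different classifier, an ensemble such as a random forest often generalises better than a single decision tree on tabular data like these audio features. Below is a minimal sketch comparing the two; again, synthetic data from `make_classification` stands in for the real feature matrix, so the exact numbers are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in: 1000 "tracks", 14 features, 6 "genres"
X, y = make_classification(n_samples=1000, n_features=14,
                           n_informative=8, n_classes=6,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# single decision tree vs an ensemble of 100 trees
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

tree_acc = accuracy_score(y_test, tree.predict(X_test))
forest_acc = accuracy_score(y_test, forest.predict(X_test))
print("tree:   {:.3f}".format(tree_acc))
print("forest: {:.3f}".format(forest_acc))
```

Swapping the classifier is a one-line change thanks to scikit-learn's uniform `fit`/`predict` interface, which makes this kind of comparison cheap to try.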

Summary

In this post, I have described how I was able to download a load of music data from Spotify using spotipy, analyse it using pandas, produce an interactive plot using plotly, and do some (v. basic) machine learning using scikit-learn. In the process, I learnt a great deal about pandas and how web APIs work, such that I’m pretty keen to explore some more interesting datasets in the future. I also learnt that Radiohead songs really are quite depressing, and that metal music may be very energetic but is not judged by Spotify as being very danceable.