Last year, I helped give an introductory python course to some MSc maths students here at the University of Southampton. As part of this course, we introduced them to some basic data analysis with pandas and machine learning with scikit-learn. For this, we needed some data to analyse. Many learning materials use the Iris flower dataset. However, I didn’t think that comparing the lengths and widths of flower petals would be particularly inspiring, so I decided to see if I could find something else.

After scouring the internet for some more interesting datasets, I came across this blog post. In the post, the author uses R and the Spotify and Genius Lyrics APIs to find the most depressing Radiohead song. Being a Radiohead fan, this certainly looked like an interesting dataset, and not having any prior experience with R or web APIs, working out how to do something similar in python looked like a nice challenge.

I began by first replicating the analysis carried out in the original post, then extended it to investigate the properties of music from different genres. In the process, I learnt a lot about web APIs, pandas, scikit-learn and just how great plotly is. Consequently, I decided to write up the steps I took in case anyone else finds music-related data as interesting as I do!

In this post, I will begin by getting the music data using the Spotify web API, then grab the lyrics using the Genius lyrics API. I’ll then do some analysis of Radiohead data and compare musical genres (using a pretty plotly figure), finishing with some basic machine learning.

This post was originally written as a Jupyter Notebook, then converted to markdown using nbconvert. If you wish to download the original notebook, it can be found here.

Getting the music data

Spotify provide a web API which can be used to download data about its music. This data includes the audio features of a track, a set of measures including ‘acousticness’, ‘danceability’, ‘speechiness’ and ‘valence’:

A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

Fortunately, there already exists a python library for interacting with the Spotify API: spotipy. At this point, I am also going to import pandas for data analysis, BeautifulSoup for some web scraping, requests for making HTTP requests to pull the source code from websites, and lxml, which Beautiful Soup will use as its parser.

import sys

import pandas as pd
import requests
import spotipy
from bs4 import BeautifulSoup
from spotipy.oauth2 import SpotifyClientCredentials
import lxml

We’ll begin by defining two functions: one which reads the spotify credentials required to use the API from a file (for this you’ll need a spotify dev account, for which you can sign up for free here), and another which then gets the audio features data for a given artist’s tracks. The get_spotify_credentials function takes the name of a text file containing the client ID and secret in the format:

credentials.txt

client_id ########
client_secret #######

def get_spotify_credentials(filename):
    if filename is None:
        raise IOError('Credentials file is none.')

    with open(filename) as f:
        txt = f.readlines()
    client_id = None
    client_secret = None
    for l in txt:
        l = l.replace('\n', '').split(' ')
        if l[0] == 'client_id':
            client_id = l[1]
        elif l[0] == 'client_secret':
            client_secret = l[1]

    client_credentials_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
    sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
    sp.trace = True

    return sp

def get_spotify_data(artist_name, credentials_file):

    # get authorisation stuff
    sp = get_spotify_credentials(credentials_file)

    # first get spotify artist uri
    results = sp.search(q='artist:' + artist_name, type='artist')
    items = results['artists']['items']
    if len(items) == 0:
        raise ValueError('Could not find artist: ' + artist_name)
    artist = items[0]

    uri = artist['uri']

    # now get album uris
    results = sp.artist_albums(uri, album_type='album')
    albums = results['items']
    while results['next']:
        results = sp.next(results)
        albums.extend(results['items'])

    uris = []
    track_names = []
    album_names = []

    # get track data
    for album in albums:
        for t in sp.album(album['uri'])['tracks']['items']:
            uris.append(t['uri'])
            track_names.append(t['name'])
            album_names.append(album['name'])

    # the audio_features endpoint accepts at most 100 track uris per call
    features = []
    for i in range(0, len(uris), 100):
        fs = sp.audio_features(uris[i:i + 100])
        if fs[0] is not None:
            features.extend(fs)

    # make dataframe
    dat = pd.DataFrame(features)
    dat['track_name'] = track_names
    dat['album'] = album_names
    dat['artists'] = artist_name

    # ignore live, remix and deluxe album versions
    mask = [('live' not in s.lower() and 'deluxe' not in s.lower()
             and 'remix' not in s.lower() and 'rmx' not in s.lower()
            and 'remastered' not in s.lower()) for s in dat.album.values]
    dat = dat[mask]
    mask2 = [(('remix' not in s.lower()) and
              'remastered' not in s.lower() and 'live' not in s.lower()
             and 'version' not in s.lower()) for s in dat.track_name.values]
    dat = dat[mask2]

    dat.set_index('track_name', inplace=True)
    dat.drop_duplicates(inplace=True)
    dat = dat[~dat.index.duplicated(keep='first')]

    return dat

Let’s try running that on an artist.

white_stripes = get_spotify_data('The White Stripes', 'credentials.txt')

Let’s look at the data spotify has given us:

white_stripes.columns
Index(['acousticness', 'analysis_url', 'danceability', 'duration_ms', 'energy',
       'id', 'instrumentalness', 'key', 'liveness', 'loudness', 'mode',
       'speechiness', 'tempo', 'time_signature', 'track_href', 'type', 'uri',
       'valence', 'album', 'artists'],
      dtype='object')
white_stripes.head()
acousticness analysis_url danceability duration_ms energy id instrumentalness key liveness loudness mode speechiness tempo time_signature track_href type uri valence album artists
track_name
Icky Thump 0.0210 https://api.spotify.com/v1/audio-analysis/09QZ... 0.424 254533 0.632 09QZAEmdbq28OaNyqTOEvY 0.011700 9 0.0545 -7.243 1 0.0922 97.694 4 https://api.spotify.com/v1/tracks/09QZAEmdbq28... audio_features spotify:track:09QZAEmdbq28OaNyqTOEvY 0.356 Icky Thump The White Stripes
You Don't Know What Love Is (You Just Do As You're Told) 0.0233 https://api.spotify.com/v1/audio-analysis/6atU... 0.427 234400 0.745 6atUkoZ6Cj4w8QzOUjQucI 0.000062 2 0.1830 -5.516 1 0.0490 83.960 4 https://api.spotify.com/v1/tracks/6atUkoZ6Cj4w... audio_features spotify:track:6atUkoZ6Cj4w8QzOUjQucI 0.562 Icky Thump The White Stripes
300 M.P.H. Torrential Outpour Blues 0.5480 https://api.spotify.com/v1/audio-analysis/2Zbn... 0.537 328560 0.435 2Zbnh37ISbOOaTu4If4lhu 0.027000 9 0.1070 -10.762 1 0.0868 85.589 4 https://api.spotify.com/v1/tracks/2Zbnh37ISbOO... audio_features spotify:track:2Zbnh37ISbOOaTu4If4lhu 0.227 Icky Thump The White Stripes
Conquest 0.0573 https://api.spotify.com/v1/audio-analysis/1ITd... 0.469 168307 0.761 1ITd91pncftIqQ2tCJsSFT 0.004390 7 0.1140 -4.791 1 0.0731 136.831 4 https://api.spotify.com/v1/tracks/1ITd91pncftI... audio_features spotify:track:1ITd91pncftIqQ2tCJsSFT 0.466 Icky Thump The White Stripes
Bone Broke 0.0850 https://api.spotify.com/v1/audio-analysis/3eMw... 0.329 194360 0.811 3eMwOh7qeMJfBZohwQeSJQ 0.822000 2 0.0783 -6.109 1 0.1180 84.205 4 https://api.spotify.com/v1/tracks/3eMwOh7qeMJf... audio_features spotify:track:3eMwOh7qeMJfBZohwQeSJQ 0.401 Icky Thump The White Stripes

As hoped, we can see that for each track we have obtained the desired acoustic properties (along with some info about its location in the Spotify database). We can also define a function that gets the data for a user’s playlist. This will be useful later on when we want to look at music from different musical genres.

def get_spotify_playlist_data(username='spotify', playlist=None, credentials_file=None):

    # set a limit to total number of tracks to analyse
    track_number_limit = 500

    # get authorisation stuff
    sp = get_spotify_credentials(credentials_file)

    # get user playlists
    p = None
    results = sp.user_playlists(username)
    playlists = results['items']

    if playlist is None: # use first of the user's playlists
        playlist = playlists[0]['name']

    for pl in playlists:
        if pl['name'] is not None and pl['name'].lower() == playlist.lower():
            p = pl
            break
    while results['next'] and p is None:
        results = sp.next(results)
        playlists = results['items']
        for pl in playlists:
            if pl['name'] is not None and pl['name'].lower() == playlist.lower():
                p = pl
                break

    if p is None:
        print('Could not find playlist')
        return

    results = sp.user_playlist(p['owner']['id'], p['id'], fields="tracks,next")['tracks']
    tracks = results['items']
    while results['next'] and len(tracks) < track_number_limit:
        results = sp.next(results)
        if results['items'][0] is not None:
            tracks.extend(results['items'])

    ts = []
    track_names = []

    for t in tracks:
        track = t['track']
        track['album'] = track['album']['name']
        track_names.append(t['track']['name'])
        artists = []
        for a in track['artists']:
            artists.append(a['name'])
        track['artists'] = ', '.join(artists)
        ts.append(track)

    dat = pd.DataFrame(ts)

    dat.drop(['available_markets', 'disc_number', 'external_ids', 'external_urls'], axis=1, inplace=True)

    features = []
    # loop to take advantage of spotify being able to get data for 100 tracks at once
    for i in range(0, len(dat), 100):
        fs = sp.audio_features(dat.uri.iloc[i:i + 100])
        if fs[0] is not None:
            features.extend(fs)

    fs = pd.DataFrame(features)

    dat = pd.concat([dat, fs], axis=1)
    dat['track_name'] = track_names

    # ignore live, remix and deluxe album versions
    mask = [(('live' not in s.lower()) and ('deluxe' not in s.lower())
             and ('remix' not in s.lower())) for s in dat.album.values]
    dat = dat[mask]
    mask2 = [(('remix' not in s.lower()) and
              'remastered' not in s.lower()
             and 'version' not in s.lower()) for s in dat.track_name.values]
    dat = dat[mask2]

    dat.set_index('track_name', inplace=True)
    dat = dat[~dat.index.duplicated(keep='first')]
    dat = dat.T[~dat.T.index.duplicated(keep='first')].T

    return dat

acoustic_grit = get_spotify_playlist_data(playlist="acoustic grit", credentials_file='credentials.txt')
acoustic_grit.head()
album artists duration_ms episode explicit href id is_local name popularity ... instrumentalness key liveness loudness mode speechiness tempo time_signature track_href valence
track_name
Dry Dirt (Stripped) Spirit's Furnace The Bones of J.R. Jones 214036 False False https://api.spotify.com/v1/tracks/7g4fX37Y3lzi... 7g4fX37Y3lziLMoxrTTGI3 False Dry Dirt (Stripped) 42 ... 0.358 9 0.112 -12.715 1 0.0588 126.195 4 https://api.spotify.com/v1/tracks/7g4fX37Y3lzi... 0.326
Thousand Mile Night Thousand Mile Night - Single Jonah Tolchin 235173 False False https://api.spotify.com/v1/tracks/5yeM63cXgGvS... 5yeM63cXgGvSN2VrcHbv6x False Thousand Mile Night 51 ... 0.0719 2 0.0944 -13.942 1 0.0471 140.652 4 https://api.spotify.com/v1/tracks/5yeM63cXgGvS... 0.498
Lead Me Home - The Walking Dead Soundtrack The Walking Dead (AMC’s Original Soundtrack – ... Jamie N Commons 117426 False False https://api.spotify.com/v1/tracks/2DBFAJgsqhYk... 2DBFAJgsqhYk5Z1AF7tAMH False Lead Me Home - The Walking Dead Soundtrack 46 ... 0.00015 6 0.118 -11.069 0 0.0323 136.278 5 https://api.spotify.com/v1/tracks/2DBFAJgsqhYk... 0.175
Whispered Words (Pretty Lies) Keep It Hid Dan Auerbach 246131 False False https://api.spotify.com/v1/tracks/4pPWNeApSiOR... 4pPWNeApSiORRQucFZt85y False Whispered Words (Pretty Lies) 3 ... 0.000453 9 0.093 -6.9 1 0.0734 86.697 4 https://api.spotify.com/v1/tracks/4pPWNeApSiOR... 0.481
Set My Soul on Fire Down to the River The War and Treaty 299680 False False https://api.spotify.com/v1/tracks/5yuqWMCOtMY0... 5yuqWMCOtMY0IBaQCBzqT5 False Set My Soul on Fire 48 ... 0 11 0.128 -10.767 0 0.0382 122.416 4 https://api.spotify.com/v1/tracks/5yuqWMCOtMY0... 0.299

5 rows × 29 columns
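Both get_spotify_data and get_spotify_playlist_data request audio features in batches because the audio_features endpoint accepts at most 100 track uris per call. Stripped of the spotipy calls, the batching pattern boils down to the following sketch (chunks is a made-up helper name, not part of spotipy):

```python
def chunks(items, size=100):
    """Yield successive slices of `items` containing at most `size` elements."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# 250 fake track uris split into batches of 100, 100 and 50
uris = ['spotify:track:{}'.format(n) for n in range(250)]
batches = list(chunks(uris))
print([len(b) for b in batches])  # [100, 100, 50]
```

Each batch would then be passed to sp.audio_features in turn, with the results accumulated into a single list.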

Getting the lyrical data

To get the lyrical data, we shall be using the Genius lyrics API. We do this by first submitting a query to find the artist id, then submitting another query to download the lyrics. To my knowledge, there is no existing python library for interacting with this API (as there was for the Spotify API), so we need to do a bit more work here.

To submit the API requests, we shall be using the python requests library. This returns a JSON object containing the url of the page containing the song lyrics. To get the lyrics themselves, we use requests to fetch that page’s source code, then use Beautiful Soup with the lxml parser to find the div container holding the lyrics.

Note that the search_genius function takes a credentials file as an argument. This credentials file contains the token required to interact with the API.
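Mirroring the spotify credentials file, the parsing code below expects the token to be stored on a single space-separated line (the exact token value is of course your own):

credentials.txt

genius_token ########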

def search_genius(query, credentials_file, return_artist_id=True):
    with open(credentials_file) as f:
        txt = f.readlines()
    genius_token = None
    for l in txt:
        l = l.replace('\n', '').split(' ')
        if l[0] == 'genius_token':
            genius_token = l[1]

    API = 'https://api.genius.com'
    HEADERS = {'Authorization': 'Bearer ' + genius_token}

    search_endpoint = API + '/search?'
    payload = {'q': query}
    search_request_object = requests.get(search_endpoint, params=payload, headers=HEADERS)

    if search_request_object.status_code == 200:
        s_json_response = search_request_object.json()

        if len(s_json_response['response']['hits']) == 0:
            return None

        hit = s_json_response['response']['hits'][0]

        if return_artist_id:
            return hit['result']['primary_artist']['id']
        else:
            return hit['result']['url']

    elif 400 <= search_request_object.status_code < 500:
        print('[!] Uh-oh, something seems wrong...')
        print('[!] Please submit an issue at https://github.com/donniebishop/genius_lyrics/issues')
        sys.exit(1)

    elif search_request_object.status_code >= 500:
        print('[*] Hmm... Genius.com seems to be having some issues right now.')
        print('[*] Please try your search again in a little bit!')
        sys.exit(1)

    return

def get_songs(artist_id, credentials_file):
    with open(credentials_file) as f:
        txt = f.readlines()
    genius_token = None
    for l in txt:
        l = l.replace('\n', '').split(' ')
        if l[0] == 'genius_token':
            genius_token = l[1]

    API = 'https://api.genius.com'
    HEADERS = {'Authorization': 'Bearer ' + genius_token}
    songs = []
    page = 1
    search_endpoint = API + '/artists/' + str(artist_id) + '/songs'

    while True:
        payload = {'per_page': 50, 'page': page}
        search_request_object = requests.get(search_endpoint, params=payload, headers=HEADERS)

        if search_request_object.status_code != 200:
            break
        else:
            s_json_response = search_request_object.json()
            if len(s_json_response['response']['songs']) == 0:
                break
            for song in s_json_response['response']['songs']:
                songs.append([song['title'], song['id'], song['url']])
            page += 1
    return songs

def get_lyrics(url):
    get_url = requests.get(url)
    song_soup = BeautifulSoup(get_url.text, 'lxml')

    # find the div container holding the lyrics
    strings = []
    for d in song_soup.find_all('div'):
        if d.get('class') and d['class'][0] == 'lyrics':
            strings = d.stripped_strings
            break

    # drop section headers such as [Verse 1] or [Chorus]
    lyrics = [s for s in strings if s[0] != '[']

    return ' '.join(lyrics)

Let’s try and run that on our White Stripes data:

artist_id = search_genius("The White Stripes", 'credentials.txt')
genius_songs = get_songs(artist_id, 'credentials.txt')
name = genius_songs[0][0]
url = genius_songs[0][2]

lyrics = get_lyrics(url)

print(name, '\n')
print(lyrics)
300 M.P.H. Torrential Outpour Blues

I'm bringing back ghosts That are no longer there I'm getting hard on myself Sitting in my easy chair
Well, there's three people in the mirror And I'm wondering which one of them i should choose
Well, I can't keep from laughing Spitting out these 300 mile per hour out-pour blues I'm breaking my teeth off
Trying to bite my lip There's all kinds of red-headed women That I ain't supposed to kiss
And it's that color that never fails To turn me blue So I just swallow it and hold on to it
And use it to scare the hell out of you I have a woman 'Says come and watch me bleed
And I'm wondering just how I can do that And still give her everything that she needs
Well, there's three people in my head that have the answer And one of them's got to be you
But you're holding tight to it -- the answer Singing these three hundred mile per hour out-pour blues
Put on gloves, a tied scarf and wrap up warm On this winter night Every-time you get defensive
You're just looking for a fight It's safe to sing somebody out there's got a problem
With almost anything you'll do Well, next time they stab you don't fight back just play the victim
Instead of playing the fool And the roads are covered with a million Little molecules
Of cigarette ashes and the school floors are covered With pieces of pencil eraser too
Well sooner or later the ground's gonna be holding all Of my ashes too But I can't help but wonder if after
I'm gone will i still have these three hundred mile per Hour, finger breaking, no answers making,
battered dirty hands, bee stung and busted up, empty Cup torrential out pour blues
One thing's for sure: in that graveyard I'm gonna have the shiniest pair of shoes

We’re going to be interested in getting the lyrics for a load of different tracks in a playlist. Let’s create a function that finds the urls of the lyrics pages for all the tracks in a playlist.

def get_playlist_urls(df, credentials_file):
    # get the urls for the lyrics of all the songs in a dataframe
    if 'genius_url' in df.columns and df.iloc[0].genius_url is not None:
        return
    df['genius_url'] = None

    for i, r in df.iterrows():
        try:
            url = search_genius(r['artists'] + ', ' + i, credentials_file, return_artist_id=False)
            df.at[i, 'genius_url'] = url
        except IndexError:
            pass

get_playlist_urls(acoustic_grit, 'credentials.txt')
acoustic_grit.genius_url.head()
track_name
Dry Dirt (Stripped)                           https://genius.com/Big-d-the-impossible-time-o...
Thousand Mile Night                           https://genius.com/Jonah-tolchin-thousand-mile...
Lead Me Home - The Walking Dead Soundtrack    https://genius.com/Jamie-n-commons-lead-me-hom...
Whispered Words (Pretty Lies)                 https://genius.com/Dan-auerbach-whispered-word...
Set My Soul on Fire                           https://genius.com/President-james-k-polk-pres...
Name: genius_url, dtype: object

Now let’s wrap this inside a function which grabs the lyrics for all the songs in our playlist.

def get_playlist_lyrics(df, credentials_file):
    # get the lyrics for all the songs in a dataframe
    get_playlist_urls(df, credentials_file)

    if 'lyrics' in df.columns and df.iloc[0].lyrics is not None:
        return
    df['lyrics'] = None

    for i, r in df.iterrows():
        if r['genius_url'] is not None:
            lyrics = get_lyrics(r['genius_url'])
            df.at[i, 'lyrics'] = lyrics

get_playlist_lyrics(acoustic_grit, 'credentials.txt')
acoustic_grit.lyrics.head()
track_name
Dry Dirt (Stripped)                           Time out lay out Body build re-circle lyric Jo...
Thousand Mile Night                           Thousand mile night, Mobile to Michigan Old ra...
Lead Me Home - The Walking Dead Soundtrack    Oh lord live inside me Lead me on my way Oh lo...
Whispered Words (Pretty Lies)                 I hear words, pretty lies Like the ones they t...
Set My Soul on Fire                           James K. Polk XI President of the United State...
Name: lyrics, dtype: object

Analysing the data

Now that we have created all the functions needed to grab the data, let’s do some analysis. We’ll begin by trying to recreate the analysis done in the original blog post. First, let’s download the Radiohead spotify data.

radiohead = get_spotify_data('Radiohead', 'credentials.txt')

Let’s now sort the dataset by valence to find the most depressing songs:

radiohead[['album','valence']].sort_values(by='valence', ascending=True).head(10)
album valence
track_name
We Suck Young Blood Hail To the Thief 0.0378
True Love Waits A Moon Shaped Pool 0.0379
MK 1 In Rainbows Disk 2 0.0389
MK 2 In Rainbows Disk 2 0.0390
The Tourist OK Computer 0.0398
Motion Picture Soundtrack Kid A 0.0435
Go Slowly In Rainbows Disk 2 0.0439
Videotape In Rainbows 0.0466
Life In a Glasshouse Amnesiac 0.0497
Tinker Tailor Soldier Sailor Rich Man Poor Man Beggar Man Thief A Moon Shaped Pool 0.0507

And the least depressing songs:

radiohead[['album','valence']].sort_values(by='valence', ascending=False).head(10)
album valence
track_name
15 Step In Rainbows 0.847
Jigsaw Falling Into Place In Rainbows 0.808
Where Bluebirds Fly Com Lag: 2+2=5 0.746
Fitter Happier OK Computer 0.744
Backdrifts Hail To the Thief 0.732
Feral The King Of Limbs 0.729
Bodysnatchers In Rainbows 0.727
There, There Hail To the Thief 0.717
I Am a Wicked Child Com Lag: 2+2=5 0.692
Paperbag Writer Com Lag: 2+2=5 0.682

In the original post, the author tries to improve on spotify’s valence measure by instead calculating a ‘gloom index’, based on this post by Myles Harrison. This takes into account the lyrics of the song, calculating what percentage of the lyrics are ‘sad’.

I thought it would be interesting to similarly calculate a ‘happiness index’, which performs a similar calculation but instead uses the percentage of happy words in the song lyrics.

In order to calculate the number of happy and sad words in the songs, I used the NRC Emotion Lexicon.
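Concretely, the gloom index computed below combines the inverted valence with the fraction of sad words, weighted by the lyrical density (words per second); the happiness index is the mirror image using valence and happy words. As a minimal sketch with made-up numbers (gloom_index is a hypothetical helper, not a function from the original post):

```python
def gloom_index(valence, lyrics, sad_words, duration_ms):
    words = lyrics.lower().split()
    pct_sad = sum(w in sad_words for w in words) / len(words)
    density = len(words) / duration_ms * 1000.  # words per second
    return 0.5 * ((1. - valence) + pct_sad * (1. + density))

# toy example: 8 words, 3 of them 'sad', in a 20-second track
g = gloom_index(0.2, 'sad tears fall on a cold dark night',
                {'sad', 'tears', 'dark'}, 20000)
print(round(g, 4))  # 0.6625
```

The pandas versions below do the same thing row by row, after stripping out a list of common stop words.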

lex = pd.read_table('NRC-emotion-lexicon-wordlevel-v0.92.txt', names=['TargetWord','AffectCategory','AssociationFlag'])

sad_words = lex[(lex.AssociationFlag==1) & (lex.AffectCategory == 'sadness')]['TargetWord'].values

happy_words = lex[(lex.AssociationFlag==1) & (lex.AffectCategory == 'joy')]['TargetWord'].values

ignore = ['a', 'i', 'it', 'the', 'and', 'in', 'he', 'she',
          'to', 'at', 'of', 'that', 'as', 'is', 'his', 'my',
          'for', 'was', 'me', 'we', 'be', 'on', 'so', 'by', 'you',
          "it's", "i'm", 'oh']

def gloom(df, ignore=ignore, sad_words=sad_words):
    if 'gloom' in df.columns and df.iloc[0].gloom != -1:
        return
    df['gloom'] = -1.
    for i, r in df.iterrows():
        v = r.valence
        try:
            filtered = r.lyrics.lower()
            for j in ignore:
                filtered = filtered.replace(' ' + j + ' ', ' ')
            num_sad = 0.
            filtered = filtered.split(' ')
            for w in filtered:
                if w in sad_words:
                    num_sad += 1.

            percentage_sad = num_sad / len(filtered)
            density = len(filtered) / r.duration_ms * 1000.

            gloom = 0.5 * ((1. - v) + percentage_sad * (1. + density))
        except AttributeError: # song has no lyrics
            gloom = 0.5 * (1. - v)

        df.at[i, 'gloom'] = gloom

def joy(df, ignore=ignore, happy_words=happy_words):
    if 'happiness' in df.columns and df.iloc[0].happiness != -1:
        return
    df['happiness'] = -1.
    for i, r in df.iterrows():
        v = r.valence
        try:
            filtered = r.lyrics.lower()
            for j in ignore:
                filtered = filtered.replace(' ' + j + ' ', ' ')
            num_happy = 0.
            filtered = filtered.split(' ')
            for w in filtered:
                if w in happy_words:
                    num_happy += 1.

            percentage_happy = num_happy / len(filtered)
            density = len(filtered) / r.duration_ms * 1000.

            happiness = 0.5 * (v + percentage_happy * (1. + density))
        except AttributeError: # song has no lyrics
            happiness = 0.5 * v
        df.at[i, 'happiness'] = happiness

get_playlist_lyrics(radiohead, 'credentials.txt')
gloom(radiohead)
joy(radiohead)
radiohead[['album','valence', 'gloom']].sort_values(by='gloom', ascending=False).head(10)
album valence gloom
track_name
True Love Waits A Moon Shaped Pool 0.0379 0.591282
Give Up The Ghost The King Of Limbs 0.1590 0.507669
We Suck Young Blood Hail To the Thief 0.0378 0.502327
Tinker Tailor Soldier Sailor Rich Man Poor Man Beggar Man Thief A Moon Shaped Pool 0.0507 0.502317
Dollars & Cents Amnesiac 0.0881 0.499827
Pyramid Song Amnesiac 0.0686 0.496156
Let Down OK Computer 0.1450 0.494404
Life In a Glasshouse Amnesiac 0.0497 0.494085
The Tourist OK Computer 0.0398 0.489094
Bullet Proof ... I Wish I Was The Bends 0.0773 0.488363

We now see that some songs with higher valence have been deemed sadder based on their lyrical content. For example, Give Up The Ghost contains the sad words ‘hurt’, ‘lost’ and ‘impossible’, which are repeated a lot and so end up making up a significant percentage of the song’s lyrics.

radiohead.loc['Give Up The Ghost'].lyrics
"Don't hurt me, don't haunt me Don't hurt me, don't haunt me Don't hurt me Gather up the lost and their souls
(Don't haunt me) Into your arms (Don't hurt me) Into your arms (Don't haunt me) Gather up the pitiful
(Don't hurt me, don't haunt me) Into your arms (Don't hurt me) Into your arms (Don't haunt me)
Into your arms (Don't hurt me) Into your arms (Don't haunt me) (Into your arms) What seems impossible
(Don't hurt me, don't haunt me) Into your arms (Don't hurt me) Into your arms (Don't haunt me) Into your arms
I think I have had my fill (Don't hurt me, don't haunt me) Into your arms (Don't hurt me) Into your arms
(Don't haunt me) Into your arms (Don't hurt me) Into your arms (Don't haunt me) I think I should give up the ghost
(Don't hurt me, don't haunt me) Into your arms (Don't hurt me, don't haunt me) Into your arms (Don't hurt me)
Into your arms (Don't haunt me) Into your arms (Don't hurt me) Into your arms (Don't haunt me) Into your arms
(Don't hurt me) Into your arms (Don't haunt me) Into your arms (Don't hurt me) Into your arms (Don't haunt me)
Into your arms (Don't hurt me) Into your arms (Don't haunt me)"

Let’s similarly look for the ‘happiest’ songs:

radiohead[['album','valence', 'happiness', 'gloom']].sort_values(by='happiness', ascending=False).head(10)
album valence happiness gloom
track_name
Fitter Happier OK Computer 0.744 0.435214 0.191214
I Am a Wicked Child Com Lag: 2+2=5 0.692 0.425438 0.183789
15 Step In Rainbows 0.847 0.423500 0.076500
Jigsaw Falling Into Place In Rainbows 0.808 0.414006 0.131020
Sulk The Bends 0.671 0.405095 0.171460
Where Bluebirds Fly Com Lag: 2+2=5 0.746 0.392560 0.136780
I Promise OK Computer OKNOTOK 1997 2017 0.487 0.369156 0.256500
Backdrifts Hail To the Thief 0.732 0.366000 0.199295
Separator The King Of Limbs 0.659 0.365018 0.197138
There, There Hail To the Thief 0.717 0.364853 0.166911

It is noticeable that the songs with the ‘happiest’ lyrics don’t necessarily have the highest valence. I Promise scores low on the valence scale, but is judged to have the 7th happiest set of lyrics overall. Looking at the lyrics, we see this is likely to be due to the repetition of the word ‘promise’, which is classed as a happy word. It also has a fairly high gloom index as the other words in the song are fairly negative.

radiohead.loc['I Promise'].lyrics
"I won't run away no more, I promise Even when I get bored, I promise Even when you lock me out,
I promise I say my prayers every night, I promise I know which side I'm spread, I promise
The tantrums and the chitty chats, I promise Even when the ship is wrecked, I promise
Tie me to the rotting deck, I promise I won't run away no more, I promise Even when I get bored,
I promise Even when the ship is wrecked, I promise Tie me to the rotting deck, I promise
I won't run away no more, I promise"

Looking at musical genres

After replicating the Radiohead analysis, I thought it might be interesting to look at tracks from different musical genres and compare their characteristics. To do this, I downloaded the data for Spotify playlists containing songs from a variety of genres. From this, I then made use of plotly’s interactivity to produce a plot that allows us to investigate the different measures.

# RapCaviar, Pop Rising, Ultimate Indie, Top picks country, truly deeply house, metal essentials

dfs = {'indie': pd.read_csv('indie.csv'), 'pop': pd.read_csv('pop.csv'), 'country': pd.read_csv('country.csv'),
       'metal': pd.read_csv('metal.csv'), 'house': pd.read_csv('house.csv'), 'rap': pd.read_csv('rap.csv')}
import plotly
import plotly.graph_objs as go

plotly.offline.init_notebook_mode(connected=True)
# function to make list of traces given dictionary of dataframes and the dataframe keys to be plotted
def make_traces(x, y, dfs):
    ts = []
    for name, df in dfs.items():
        ts.append(go.Scatter(x=df[x], y=df[y], mode='markers',
                       name=name, text=df.name + ' - ' + df.artists))
    return ts

data = dict()

# define which categories we want to include
categories = ['duration_ms', 'popularity', 'acousticness', 'danceability',
       'energy', 'instrumentalness', 'key', 'liveness', 'loudness',
       'mode', 'speechiness', 'tempo', 'time_signature', 'valence']

# for each category make a list of the data from each dataframe
for cat in categories:
    data[cat] = [df[cat] for df in dfs.values()]

# define behaviour of dropdown menus
# we've defined the buttons using list comprehensions - on selection, the x/y data and axis label are updated
updatemenus = list([
    dict(x=-0.05, y=0.8,
         buttons=list([   
            dict(label = cat, method = 'update',
                 args = [{'x': data[cat]}, {'xaxis': dict(title = cat)}]) for cat in categories
        ])
    ),
    dict(x=-0.05,  y=1,
         buttons=list([   
            dict(label = cat, method = 'update',
                 args = [{'y': data[cat]}, {'yaxis': dict(title = cat)}]) for cat in categories
        ])
    )
])

# set the initial data
initial_dat = go.Data(make_traces('duration_ms', 'duration_ms', dfs))

# make the layout
layout = dict(title='Compare genres', showlegend=True,
              updatemenus=updatemenus)

fig = dict(data=initial_dat, layout=layout)
fig['layout'].update(hovermode='closest')
plotly.offline.iplot(fig)

This is pretty interesting: looking at energy vs danceability, if you select just the metal, country and house datasets (click on the name of a dataset in the legend to hide it), you can see that the data form three pretty distinct clusters. Surprisingly, the rap and house datasets occupy a similar region of the plot. The metal dataset is the most tightly clustered, with all tracks being high energy but not very danceable. We can also see that house tracks tend to have a very consistent tempo (120 bpm) and be the longest, that rap music tends to be the most popular, and that almost all songs have a 4/4 time signature.

Machine learning

From the above plot, it can be seen that the data form fairly distinct clusters. This suggests it may be possible to build some kind of machine learning genre classifier using scikit-learn. Knowing very little about machine learning, I pretty much stuck to the examples in the documentation to create this, so the resulting classifier is almost certainly much less successful than it could be with some tuning.

from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.metrics import accuracy_score
for genre, df in dfs.items():
    df['genre'] = genre

dat = pd.concat(dfs.values())

data = dat[['duration_ms', 'explicit', 'popularity', 'acousticness', 'danceability',
       'energy', 'instrumentalness', 'key', 'liveness', 'loudness',
       'mode', 'speechiness', 'tempo', 'time_signature', 'valence']].values
labels = dat['genre'].values

data_train, data_test, labels_train, labels_test = train_test_split(data, labels, test_size = 0.3)

classifier = tree.DecisionTreeClassifier()
classifier.fit(data_train, labels_train)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
accuracy = accuracy_score(labels_test, classifier.predict(data_test))
print("Decision tree accuracy with a 70/30 train/test split: {}".format(accuracy))
Decision tree accuracy with a 70/30 train/test split: 0.6298701298701299

This isn’t a great accuracy; however, if we drop one of the genres (especially one of the less tightly clustered ones, such as pop, indie or rap), the accuracy increases significantly. Not knowing much about machine learning classifiers, it’s also quite possible I’ve chosen one that isn’t well suited to this dataset, and the accuracy would likely improve with a more suitable choice.
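As an aside, a single train/test split gives a fairly noisy accuracy estimate; k-fold cross-validation averages over several splits and is usually a more reliable way to score a classifier. A minimal sketch is below; since it needs to be self-contained, synthetic data from `make_classification` stands in for the real Spotify feature matrix (14 features, 6 "genres" are just illustrative choices).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for the audio-feature matrix: 600 "tracks",
# 14 numeric features, 6 "genre" labels
X, y = make_classification(n_samples=600, n_features=14,
                           n_informative=8, n_classes=6,
                           random_state=0)

# score the same decision tree on 5 different train/test splits
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print("mean accuracy: {:.3f} (std {:.3f})".format(scores.mean(), scores.std()))
```

The spread of the five scores gives a feel for how much the accuracy figure depends on which tracks happen to land in the test set.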

nopop_dat = dat[dat.genre != 'pop']
data = nopop_dat[['duration_ms', 'explicit', 'popularity', 'acousticness', 'danceability',
       'energy', 'instrumentalness', 'key', 'liveness', 'loudness',
       'mode', 'speechiness', 'tempo', 'time_signature', 'valence']].values
labels = nopop_dat['genre'].values

data_train, data_test, labels_train, labels_test = train_test_split(data, labels, test_size = 0.3)

classifier = tree.DecisionTreeClassifier()
classifier.fit(data_train, labels_train)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
accuracy = accuracy_score(labels_test, classifier.predict(data_test))
print("Decision tree accuracy with a 70/30 train/test split: {}".format(accuracy))
Decision tree accuracy with a 70/30 train/test split: 0.8173076923076923

That’s much better!
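If you wanted to try a different classifier, an ensemble such as a random forest often generalises better than a single decision tree on tabular data like these audio features. Below is a minimal sketch comparing the two; again, synthetic data from `make_classification` stands in for the real feature matrix, so the exact numbers are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in: 1000 "tracks", 14 features, 6 "genres"
X, y = make_classification(n_samples=1000, n_features=14,
                           n_informative=8, n_classes=6,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# single decision tree vs an ensemble of 100 trees
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

tree_acc = accuracy_score(y_test, tree.predict(X_test))
forest_acc = accuracy_score(y_test, forest.predict(X_test))
print("tree:   {:.3f}".format(tree_acc))
print("forest: {:.3f}".format(forest_acc))
```

Swapping the classifier is a one-line change thanks to scikit-learn's uniform `fit`/`predict` interface, which makes this kind of comparison cheap to try.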

Summary

In this post, I have described how I was able to download a load of music data from Spotify using spotipy, analyse it using pandas, produce an interactive plot using plotly, and do some (v. basic) machine learning using scikit-learn. In the process, I learnt a great deal about pandas and how web APIs work, such that I’m pretty keen to explore some more interesting datasets in the future. I also learnt that Radiohead songs really are quite depressing, and that metal music may be very energetic but is not judged by Spotify as being very danceable.