Visualising music data
Last year, I helped give an introductory Python course to some MSc maths students here at the University of Southampton. As part of this course, we introduced them to some basic data analysis with pandas and machine learning with scikit-learn. For this, we needed some data to analyse. Many learning materials use the Iris flower dataset; however, I didn’t think that comparing the lengths and widths of flower petals would be particularly inspiring, so I decided to see if I could find something else.
After scouring the internet for more interesting datasets, I came across this blog post, in which the author uses R and the Spotify and Genius Lyrics APIs to find the most depressing Radiohead song. As a Radiohead fan, I found this dataset immediately appealing, and, having no prior experience with R or web APIs, working out how to do something similar in Python looked like a nice challenge.
I began by first replicating the analysis carried out in the original post, then extended it to investigate the properties of music from different genres. In the process, I learnt a lot about web APIs, pandas, scikit-learn and just how great plotly is. Consequently, I decided to write up the steps I took in case anyone else finds music-related data as interesting as I do!
In this post, I will begin by getting the music data using the Spotify web API, then grab the lyrics using the Genius lyrics API. I’ll then do some analysis of Radiohead data and compare musical genres (using a pretty plotly figure), finishing with some basic machine learning.
This post was originally written as a Jupyter Notebook, then converted to markdown using nbconvert. If you wish to download the original notebook, it can be found here.
Getting the music data
Spotify provides a web API which can be used to download data about its music. This includes each track’s audio features, a set of measures including ‘acousticness’, ‘danceability’, ‘speechiness’ and ‘valence’, the last of which Spotify describes as:
A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
Fortunately, there already exists a Python library for interacting with the Spotify API: spotipy. At this point, I am also going to import pandas for data analysis, BeautifulSoup for some web scraping, requests for making HTTP requests to pull the source code from websites, and lxml for processing XML.
import sys
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import pandas as pd
from bs4 import BeautifulSoup
import requests
import lxml
We’ll begin by defining two functions: one which reads the Spotify credentials required to use the API from a file (for this you’ll need a Spotify dev account, which you can create for free here), and another which then gets the audio features data for a given artist’s tracks. The get_spotify_credentials function takes the name of a text file containing the client ID and secret in the format:
credentials.txt
client_id ########
client_secret #######
def get_spotify_credentials(filename):
    if filename is None:
        raise IOError('Credentials file is none.')
    client_id = None
    client_secret = None
    with open(filename) as f:
        for l in f.readlines():
            l = l.replace('\n', '').split(' ')
            if l[0] == 'client_id':
                client_id = l[1]
            elif l[0] == 'client_secret':
                client_secret = l[1]
    client_credentials_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
    sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
    sp.trace = True
    return sp
def get_spotify_data(artist_name, credentials_file):
    # get an authorised client
    sp = get_spotify_credentials(credentials_file)
    # first get the spotify artist uri
    results = sp.search(q='artist:' + artist_name, type='artist')
    items = results['artists']['items']
    if len(items) == 0:
        raise ValueError('Could not find artist: ' + artist_name)
    uri = items[0]['uri']
    # now get the album uris, following pagination
    results = sp.artist_albums(uri, album_type='album')
    albums = results['items']
    while results['next']:
        results = sp.next(results)
        albums.extend(results['items'])
    # get track data
    uris = []
    track_names = []
    album_names = []
    for album in albums:
        for t in sp.album(album['uri'])['tracks']['items']:
            uris.append(t['uri'])
            track_names.append(t['name'])
            album_names.append(album['name'])
    # audio_features accepts at most 100 uris at a time, so batch the calls
    features = []
    for i in range(len(uris) // 100 + 1):
        fs = sp.audio_features(uris[i*100:min((i+1)*100, len(uris))])
        if fs[0] is not None:
            features.extend(fs)
    # make dataframe
    dat = pd.DataFrame(features)
    dat['track_name'] = track_names
    dat['album'] = album_names
    dat['artists'] = artist_name
    # ignore live, remix, remastered and deluxe album versions
    mask = [('live' not in s.lower() and 'deluxe' not in s.lower()
             and 'remix' not in s.lower() and 'rmx' not in s.lower()
             and 'remastered' not in s.lower()) for s in dat.album.values]
    dat = dat[mask]
    mask2 = [('remix' not in s.lower() and 'remastered' not in s.lower()
              and 'live' not in s.lower() and 'version' not in s.lower())
             for s in dat.track_name.values]
    dat = dat[mask2]
    dat.set_index('track_name', inplace=True)
    dat.drop_duplicates(inplace=True)
    dat = dat[~dat.index.duplicated(keep='first')]
    return dat
Let’s try running that on an artist.
white_stripes = get_spotify_data('The White Stripes', 'credentials.txt')
Let’s look at the data spotify has given us:
white_stripes.columns
Index(['acousticness', 'analysis_url', 'danceability', 'duration_ms', 'energy',
'id', 'instrumentalness', 'key', 'liveness', 'loudness', 'mode',
'speechiness', 'tempo', 'time_signature', 'track_href', 'type', 'uri',
'valence', 'album', 'artists'],
dtype='object')
white_stripes.head()
acousticness | analysis_url | danceability | duration_ms | energy | id | instrumentalness | key | liveness | loudness | mode | speechiness | tempo | time_signature | track_href | type | uri | valence | album | artists | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
track_name | ||||||||||||||||||||
Icky Thump | 0.0210 | https://api.spotify.com/v1/audio-analysis/09QZ... | 0.424 | 254533 | 0.632 | 09QZAEmdbq28OaNyqTOEvY | 0.011700 | 9 | 0.0545 | -7.243 | 1 | 0.0922 | 97.694 | 4 | https://api.spotify.com/v1/tracks/09QZAEmdbq28... | audio_features | spotify:track:09QZAEmdbq28OaNyqTOEvY | 0.356 | Icky Thump | The White Stripes |
You Don't Know What Love Is (You Just Do As You're Told) | 0.0233 | https://api.spotify.com/v1/audio-analysis/6atU... | 0.427 | 234400 | 0.745 | 6atUkoZ6Cj4w8QzOUjQucI | 0.000062 | 2 | 0.1830 | -5.516 | 1 | 0.0490 | 83.960 | 4 | https://api.spotify.com/v1/tracks/6atUkoZ6Cj4w... | audio_features | spotify:track:6atUkoZ6Cj4w8QzOUjQucI | 0.562 | Icky Thump | The White Stripes |
300 M.P.H. Torrential Outpour Blues | 0.5480 | https://api.spotify.com/v1/audio-analysis/2Zbn... | 0.537 | 328560 | 0.435 | 2Zbnh37ISbOOaTu4If4lhu | 0.027000 | 9 | 0.1070 | -10.762 | 1 | 0.0868 | 85.589 | 4 | https://api.spotify.com/v1/tracks/2Zbnh37ISbOO... | audio_features | spotify:track:2Zbnh37ISbOOaTu4If4lhu | 0.227 | Icky Thump | The White Stripes |
Conquest | 0.0573 | https://api.spotify.com/v1/audio-analysis/1ITd... | 0.469 | 168307 | 0.761 | 1ITd91pncftIqQ2tCJsSFT | 0.004390 | 7 | 0.1140 | -4.791 | 1 | 0.0731 | 136.831 | 4 | https://api.spotify.com/v1/tracks/1ITd91pncftI... | audio_features | spotify:track:1ITd91pncftIqQ2tCJsSFT | 0.466 | Icky Thump | The White Stripes |
Bone Broke | 0.0850 | https://api.spotify.com/v1/audio-analysis/3eMw... | 0.329 | 194360 | 0.811 | 3eMwOh7qeMJfBZohwQeSJQ | 0.822000 | 2 | 0.0783 | -6.109 | 1 | 0.1180 | 84.205 | 4 | https://api.spotify.com/v1/tracks/3eMwOh7qeMJf... | audio_features | spotify:track:3eMwOh7qeMJfBZohwQeSJQ | 0.401 | Icky Thump | The White Stripes |
As hoped, we can see that for each track we have obtained the desired acoustic properties (along with some info about its location in the Spotify database). We can also define a function that gets the data for a user’s playlist. This will be useful later on when we want to look at music from different musical genres.
def get_spotify_playlist_data(username='spotify', playlist=None, credentials_file=None):
    # set a limit on the total number of tracks to analyse
    track_number_limit = 500
    # get an authorised client
    sp = get_spotify_credentials(credentials_file)
    # search the user's playlists (following pagination) for the named playlist
    p = None
    results = sp.user_playlists(username)
    playlists = results['items']
    if playlist is None:  # default to the first of the user's playlists
        playlist = playlists[0]['name']
    for pl in playlists:
        if pl['name'] is not None and pl['name'].lower() == playlist.lower():
            p = pl
            break
    while results['next'] and p is None:
        results = sp.next(results)
        playlists = results['items']
        for pl in playlists:
            if pl['name'] is not None and pl['name'].lower() == playlist.lower():
                p = pl
                break
    if p is None:
        print('Could not find playlist')
        return
    # get the playlist's tracks
    results = sp.user_playlist(p['owner']['id'], p['id'], fields="tracks,next")['tracks']
    tracks = results['items']
    while results['next'] and len(tracks) < track_number_limit:
        results = sp.next(results)
        if results['items'][0] is not None:
            tracks.extend(results['items'])
    # flatten the album and artist fields for each track
    ts = []
    track_names = []
    for t in tracks:
        track = t['track']
        track['album'] = track['album']['name']
        track_names.append(track['name'])
        track['artists'] = ', '.join(a['name'] for a in track['artists'])
        ts.append(track)
    dat = pd.DataFrame(ts)
    dat.drop(['available_markets', 'disc_number', 'external_ids', 'external_urls'], axis=1, inplace=True)
    # audio_features accepts at most 100 uris at a time, so batch the calls
    features = []
    for i in range(len(dat) // 100 + 1):
        fs = sp.audio_features(dat.uri.iloc[i*100:min((i+1)*100, len(dat))])
        if fs[0] is not None:
            features.extend(fs)
    fs = pd.DataFrame(features)
    dat = pd.concat([dat, fs], axis=1)
    dat['track_name'] = track_names
    # ignore live, remix and deluxe album versions
    mask = [('live' not in s.lower() and 'deluxe' not in s.lower()
             and 'remix' not in s.lower()) for s in dat.album.values]
    dat = dat[mask]
    mask2 = [('remix' not in s.lower() and 'remastered' not in s.lower()
              and 'version' not in s.lower()) for s in dat.track_name.values]
    dat = dat[mask2]
    dat.set_index('track_name', inplace=True)
    # drop duplicate rows and columns
    dat = dat[~dat.index.duplicated(keep='first')]
    dat = dat.T[~dat.T.index.duplicated(keep='first')].T
    return dat
acoustic_grit = get_spotify_playlist_data(playlist="acoustic grit", credentials_file='credentials.txt')
acoustic_grit.head()
album | artists | duration_ms | episode | explicit | href | id | is_local | name | popularity | ... | instrumentalness | key | liveness | loudness | mode | speechiness | tempo | time_signature | track_href | valence | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
track_name | |||||||||||||||||||||
Dry Dirt (Stripped) | Spirit's Furnace | The Bones of J.R. Jones | 214036 | False | False | https://api.spotify.com/v1/tracks/7g4fX37Y3lzi... | 7g4fX37Y3lziLMoxrTTGI3 | False | Dry Dirt (Stripped) | 42 | ... | 0.358 | 9 | 0.112 | -12.715 | 1 | 0.0588 | 126.195 | 4 | https://api.spotify.com/v1/tracks/7g4fX37Y3lzi... | 0.326 |
Thousand Mile Night | Thousand Mile Night - Single | Jonah Tolchin | 235173 | False | False | https://api.spotify.com/v1/tracks/5yeM63cXgGvS... | 5yeM63cXgGvSN2VrcHbv6x | False | Thousand Mile Night | 51 | ... | 0.0719 | 2 | 0.0944 | -13.942 | 1 | 0.0471 | 140.652 | 4 | https://api.spotify.com/v1/tracks/5yeM63cXgGvS... | 0.498 |
Lead Me Home - The Walking Dead Soundtrack | The Walking Dead (AMC’s Original Soundtrack – ... | Jamie N Commons | 117426 | False | False | https://api.spotify.com/v1/tracks/2DBFAJgsqhYk... | 2DBFAJgsqhYk5Z1AF7tAMH | False | Lead Me Home - The Walking Dead Soundtrack | 46 | ... | 0.00015 | 6 | 0.118 | -11.069 | 0 | 0.0323 | 136.278 | 5 | https://api.spotify.com/v1/tracks/2DBFAJgsqhYk... | 0.175 |
Whispered Words (Pretty Lies) | Keep It Hid | Dan Auerbach | 246131 | False | False | https://api.spotify.com/v1/tracks/4pPWNeApSiOR... | 4pPWNeApSiORRQucFZt85y | False | Whispered Words (Pretty Lies) | 3 | ... | 0.000453 | 9 | 0.093 | -6.9 | 1 | 0.0734 | 86.697 | 4 | https://api.spotify.com/v1/tracks/4pPWNeApSiOR... | 0.481 |
Set My Soul on Fire | Down to the River | The War and Treaty | 299680 | False | False | https://api.spotify.com/v1/tracks/5yuqWMCOtMY0... | 5yuqWMCOtMY0IBaQCBzqT5 | False | Set My Soul on Fire | 48 | ... | 0 | 11 | 0.128 | -10.767 | 0 | 0.0382 | 122.416 | 4 | https://api.spotify.com/v1/tracks/5yuqWMCOtMY0... | 0.299 |
5 rows × 29 columns
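As an aside, both functions above batch the audio_features calls because that endpoint accepts at most 100 track ids per request. The slicing can equivalently be written as a small generic helper (an illustrative sketch, not code from the post):

```python
def chunks(seq, size=100):
    """Yield successive size-length slices of seq."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

# e.g. 250 track uris split into batches of at most 100
uris = ['uri{}'.format(n) for n in range(250)]
print([len(c) for c in chunks(uris)])  # [100, 100, 50]
```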
Getting the lyrical data
To get the lyrical data, we shall use the Genius lyrics API: we first submit a query to find the artist id, then submit another query to download the lyrics. To my knowledge, no Python library exists for interacting with this API (as there was for the Spotify API), so we need to do a bit more work here.
To submit the API requests, we shall use the Python requests library. This returns a JSON object containing the URL of the page holding the song lyrics. To then get the lyrics, we use requests to fetch that page’s source code, then use Beautiful Soup with the lxml parser to find the div container which holds the lyrics.
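The scraping step boils down to finding the right div and joining its text. A minimal illustration of that pattern (the HTML here is made up, and html.parser stands in for lxml so the snippet is self-contained; real Genius pages are more complex):

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a Genius lyrics page
html = ('<div class="meta">credits</div>'
        '<div class="lyrics"><p> Hello </p><p> world </p></div>')
soup = BeautifulSoup(html, 'html.parser')  # the post uses the lxml parser instead
div = soup.find('div', class_='lyrics')    # only the div classed 'lyrics'
print(' '.join(div.stripped_strings))      # Hello world
```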
Note that the search_genius function takes a credentials file as an argument; this file contains the token required to interact with the API.
def search_genius(query, credentials_file, return_artist_id=True):
    # read the genius API token from the credentials file
    genius_token = None
    with open(credentials_file) as f:
        for l in f.readlines():
            l = l.replace('\n', '').split(' ')
            if l[0] == 'genius_token':
                genius_token = l[1]
    API = 'https://api.genius.com'
    HEADERS = {'Authorization': 'Bearer ' + genius_token}
    search_endpoint = API + '/search?'
    payload = {'q': query}
    search_request_object = requests.get(search_endpoint, params=payload, headers=HEADERS)
    if search_request_object.status_code == 200:
        s_json_response = search_request_object.json()
        if len(s_json_response['response']['hits']) == 0:
            return None
        hit = s_json_response['response']['hits'][0]
        if return_artist_id:
            return hit['result']['primary_artist']['id']
        else:
            return hit['result']['url']
    elif 400 <= search_request_object.status_code < 500:
        print('[!] Uh-oh, something seems wrong...')
        print('[!] Please submit an issue at https://github.com/donniebishop/genius_lyrics/issues')
        sys.exit(1)
    elif search_request_object.status_code >= 500:
        print('[*] Hmm... Genius.com seems to be having some issues right now.')
        print('[*] Please try your search again in a little bit!')
        sys.exit(1)
def get_songs(artist_id, credentials_file):
    # read the genius API token from the credentials file
    genius_token = None
    with open(credentials_file) as f:
        for l in f.readlines():
            l = l.replace('\n', '').split(' ')
            if l[0] == 'genius_token':
                genius_token = l[1]
    API = 'https://api.genius.com'
    HEADERS = {'Authorization': 'Bearer ' + genius_token}
    search_endpoint = API + '/artists/' + str(artist_id) + '/songs'
    # page through the artist's songs, 50 at a time
    songs = []
    page = 1
    while True:
        payload = {'per_page': 50, 'page': page}
        search_request_object = requests.get(search_endpoint, params=payload, headers=HEADERS)
        if search_request_object.status_code != 200:
            break
        s_json_response = search_request_object.json()
        if len(s_json_response['response']['songs']) == 0:
            break
        for song in s_json_response['response']['songs']:
            songs.append([song['title'], song['id'], song['url']])
        page += 1
    return songs
def get_lyrics(url):
    # fetch the lyrics page and parse its source
    get_url = requests.get(url)
    song_soup = BeautifulSoup(get_url.text, 'lxml')
    lyrics = []
    for d in song_soup.find_all('div'):
        # only look inside the div classed 'lyrics'
        if 'lyrics' not in d.get('class', []):
            continue
        for s in d.stripped_strings:
            if s[0] != '[':  # skip annotations like [Verse 1]
                lyrics.append(s)
    return ' '.join(lyrics)
Let’s try and run that on our White Stripes data:
artist_id = search_genius("The White Stripes", 'credentials.txt')
genius_songs = get_songs(artist_id, 'credentials.txt')
name = genius_songs[0][0]
url = genius_songs[0][2]
lyrics = get_lyrics(url)
print(name, '\n')
print(lyrics)
300 M.P.H. Torrential Outpour Blues
I'm bringing back ghosts That are no longer there I'm getting hard on myself Sitting in my easy chair
Well, there's three people in the mirror And I'm wondering which one of them i should choose
Well, I can't keep from laughing Spitting out these 300 mile per hour out-pour blues I'm breaking my teeth off
Trying to bite my lip There's all kinds of red-headed women That I ain't supposed to kiss
And it's that color that never fails To turn me blue So I just swallow it and hold on to it
And use it to scare the hell out of you I have a woman 'Says come and watch me bleed
And I'm wondering just how I can do that And still give her everything that she needs
Well, there's three people in my head that have the answer And one of them's got to be you
But you're holding tight to it -- the answer Singing these three hundred mile per hour out-pour blues
Put on gloves, a tied scarf and wrap up warm On this winter night Every-time you get defensive
You're just looking for a fight It's safe to sing somebody out there's got a problem
With almost anything you'll do Well, next time they stab you don't fight back just play the victim
Instead of playing the fool And the roads are covered with a million Little molecules
Of cigarette ashes and the school floors are covered With pieces of pencil eraser too
Well sooner or later the ground's gonna be holding all Of my ashes too But I can't help but wonder if after
I'm gone will i still have these three hundred mile per Hour, finger breaking, no answers making,
battered dirty hands, bee stung and busted up, empty Cup torrential out pour blues
One thing's for sure: in that graveyard I'm gonna have the shiniest pair of shoes
We’re going to be interested in getting the lyrics for a load of different tracks in a playlist. Let’s create a function that finds the urls of the lyrics pages for all the tracks in a playlist.
def get_playlist_urls(df, credentials_file):
    # get the urls of the lyrics pages for all the songs in a dataframe
    if 'genius_url' in df.columns and df.iloc[0].genius_url is not None:
        return
    df['genius_url'] = None
    for i, r in df.iterrows():
        try:
            url = search_genius(r['artists'] + ', ' + i, credentials_file, return_artist_id=False)
            df.at[i, 'genius_url'] = url
        except IndexError:
            pass
get_playlist_urls(acoustic_grit, 'credentials.txt')
acoustic_grit.genius_url.head()
track_name
Dry Dirt (Stripped) https://genius.com/Big-d-the-impossible-time-o...
Thousand Mile Night https://genius.com/Jonah-tolchin-thousand-mile...
Lead Me Home - The Walking Dead Soundtrack https://genius.com/Jamie-n-commons-lead-me-hom...
Whispered Words (Pretty Lies) https://genius.com/Dan-auerbach-whispered-word...
Set My Soul on Fire https://genius.com/President-james-k-polk-pres...
Name: genius_url, dtype: object
Now let’s wrap this inside a function which grabs the lyrics for all the songs in our playlist.
def get_playlist_lyrics(df, credentials_file):
    # get the lyrics for all the songs in a dataframe
    get_playlist_urls(df, credentials_file)
    if 'lyrics' in df.columns and df.iloc[0].lyrics is not None:
        return
    df['lyrics'] = None
    for i, r in df.iterrows():
        if r['genius_url'] is not None:
            df.at[i, 'lyrics'] = get_lyrics(r['genius_url'])
get_playlist_lyrics(acoustic_grit, 'credentials.txt')
acoustic_grit.lyrics.head()
track_name
Dry Dirt (Stripped) Time out lay out Body build re-circle lyric Jo...
Thousand Mile Night Thousand mile night, Mobile to Michigan Old ra...
Lead Me Home - The Walking Dead Soundtrack Oh lord live inside me Lead me on my way Oh lo...
Whispered Words (Pretty Lies) I hear words, pretty lies Like the ones they t...
Set My Soul on Fire James K. Polk XI President of the United State...
Name: lyrics, dtype: object
Analysing the data
Now that we have all the functions needed to grab the data, let’s do some analysis. We’ll begin by recreating the analysis from the original blog post. First, let’s download the Radiohead Spotify data.
radiohead = get_spotify_data('Radiohead', 'credentials.txt')
Let’s now sort the dataset by valence to find the most depressing songs:
radiohead[['album','valence']].sort_values(by='valence', ascending=True).head(10)
album | valence | |
---|---|---|
track_name | ||
We Suck Young Blood | Hail To the Thief | 0.0378 |
True Love Waits | A Moon Shaped Pool | 0.0379 |
MK 1 | In Rainbows Disk 2 | 0.0389 |
MK 2 | In Rainbows Disk 2 | 0.0390 |
The Tourist | OK Computer | 0.0398 |
Motion Picture Soundtrack | Kid A | 0.0435 |
Go Slowly | In Rainbows Disk 2 | 0.0439 |
Videotape | In Rainbows | 0.0466 |
Life In a Glasshouse | Amnesiac | 0.0497 |
Tinker Tailor Soldier Sailor Rich Man Poor Man Beggar Man Thief | A Moon Shaped Pool | 0.0507 |
And the least depressing songs:
radiohead[['album','valence']].sort_values(by='valence', ascending=False).head(10)
album | valence | |
---|---|---|
track_name | ||
15 Step | In Rainbows | 0.847 |
Jigsaw Falling Into Place | In Rainbows | 0.808 |
Where Bluebirds Fly | Com Lag: 2+2=5 | 0.746 |
Fitter Happier | OK Computer | 0.744 |
Backdrifts | Hail To the Thief | 0.732 |
Feral | The King Of Limbs | 0.729 |
Bodysnatchers | In Rainbows | 0.727 |
There, There | Hail To the Thief | 0.717 |
I Am a Wicked Child | Com Lag: 2+2=5 | 0.692 |
Paperbag Writer | Com Lag: 2+2=5 | 0.682 |
In the original post, the author tries to improve on Spotify’s valence measure by instead calculating a ‘gloom index’, based on this post by Myles Harrison. This takes the lyrics of the song into account, calculating what percentage of them are ‘sad’. I thought it would be interesting to similarly calculate a ‘happiness index’, which performs the same calculation but instead uses the percentage of happy words in the lyrics.
In order to count the happy and sad words in each song, I used the NRC Emotion Lexicon.
lex = pd.read_table('NRC-emotion-lexicon-wordlevel-v0.92.txt', names=['TargetWord','AffectCategory','AssociationFlag'])
sad_words = lex[(lex.AssociationFlag==1) & (lex.AffectCategory == 'sadness')]['TargetWord'].values
happy_words = lex[(lex.AssociationFlag==1) & (lex.AffectCategory == 'joy')]['TargetWord'].values
ignore = ['a', 'i', 'it', 'the', 'and', 'in', 'he', 'she',
'to', 'at', 'of', 'that', 'as', 'is', 'his', 'my',
'for', 'was', 'me', 'we', 'be', 'on', 'so', 'by' ,'you',
"it's", "i'm", 'oh']
def gloom(df, ignore=ignore, sad_words=sad_words):
    if 'gloom' in df.columns and df.iloc[0].gloom != -1:
        return
    df['gloom'] = -1.
    for i, r in df.iterrows():
        v = r.valence
        try:
            # strip common stopwords, then count the sad words
            filtered = r.lyrics.lower()
            for j in ignore:
                filtered = filtered.replace(' ' + j + ' ', ' ')
            filtered = filtered.split(' ')
            num_sad = 0.
            for w in filtered:
                if w in sad_words:
                    num_sad += 1.
            percentage_sad = num_sad / len(filtered)
            # lyrical density: words per second
            density = len(filtered) / r.duration_ms * 1000.
            g = 0.5 * ((1. - v) + percentage_sad * (1. + density))
        except AttributeError:  # song has no lyrics
            g = 0.5 * (1. - v)
        df.at[i, 'gloom'] = g
def joy(df, ignore=ignore, happy_words=happy_words):
    if 'happiness' in df.columns and df.iloc[0].happiness != -1:
        return
    df['happiness'] = -1.
    for i, r in df.iterrows():
        v = r.valence
        try:
            # strip common stopwords, then count the happy words
            filtered = r.lyrics.lower()
            for j in ignore:
                filtered = filtered.replace(' ' + j + ' ', ' ')
            filtered = filtered.split(' ')
            num_happy = 0.
            for w in filtered:
                if w in happy_words:
                    num_happy += 1.
            percentage_happy = num_happy / len(filtered)
            # lyrical density: words per second
            density = len(filtered) / r.duration_ms * 1000.
            h = 0.5 * (v + percentage_happy * (1. + density))
        except AttributeError:  # song has no lyrics
            h = 0.5 * v
        df.at[i, 'happiness'] = h
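As a quick sanity check of the gloom formula, here is a toy computation with invented numbers (a track with valence 0.2 and 100 filtered words, 10 of which are sad, lasting 200 seconds):

```python
# Toy numbers (invented) for the gloom formula used above:
#   gloom = 0.5 * ((1 - valence) + pct_sad * (1 + lyrical_density))
valence = 0.2          # spotify's valence for the track
n_words = 100          # words remaining after stopword filtering
n_sad = 10             # of those, words in the NRC 'sadness' list
duration_ms = 200000   # track length in milliseconds

pct_sad = n_sad / n_words                 # 0.1
density = n_words / duration_ms * 1000.   # words per second = 0.5
gloom = 0.5 * ((1. - valence) + pct_sad * (1. + density))
print(round(gloom, 3))  # 0.475
```

So a repeated sad word pushes gloom up both directly (via pct_sad) and through the density weighting, which is why lyrically repetitive songs like Give Up The Ghost score highly.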
get_playlist_lyrics(radiohead, 'credentials.txt')
gloom(radiohead)
joy(radiohead)
radiohead[['album','valence', 'gloom']].sort_values(by='gloom', ascending=False).head(10)
album | valence | gloom | |
---|---|---|---|
track_name | |||
True Love Waits | A Moon Shaped Pool | 0.0379 | 0.591282 |
Give Up The Ghost | The King Of Limbs | 0.1590 | 0.507669 |
We Suck Young Blood | Hail To the Thief | 0.0378 | 0.502327 |
Tinker Tailor Soldier Sailor Rich Man Poor Man Beggar Man Thief | A Moon Shaped Pool | 0.0507 | 0.502317 |
Dollars & Cents | Amnesiac | 0.0881 | 0.499827 |
Pyramid Song | Amnesiac | 0.0686 | 0.496156 |
Let Down | OK Computer | 0.1450 | 0.494404 |
Life In a Glasshouse | Amnesiac | 0.0497 | 0.494085 |
The Tourist | OK Computer | 0.0398 | 0.489094 |
Bullet Proof ... I Wish I Was | The Bends | 0.0773 | 0.488363 |
We now see that some songs with higher valence have been deemed sadder based on their lyrical content. For example, Give Up The Ghost contains the sad words hurt, lost and impossible, which are repeated so often that they make up a significant percentage of the song’s lyrics.
radiohead.loc['Give Up The Ghost'].lyrics
"Don't hurt me, don't haunt me Don't hurt me, don't haunt me Don't hurt me Gather up the lost and their souls
(Don't haunt me) Into your arms (Don't hurt me) Into your arms (Don't haunt me) Gather up the pitiful
(Don't hurt me, don't haunt me) Into your arms (Don't hurt me) Into your arms (Don't haunt me)
Into your arms (Don't hurt me) Into your arms (Don't haunt me) (Into your arms) What seems impossible
(Don't hurt me, don't haunt me) Into your arms (Don't hurt me) Into your arms (Don't haunt me) Into your arms
I think I have had my fill (Don't hurt me, don't haunt me) Into your arms (Don't hurt me) Into your arms
(Don't haunt me) Into your arms (Don't hurt me) Into your arms (Don't haunt me) I think I should give up the ghost
(Don't hurt me, don't haunt me) Into your arms (Don't hurt me, don't haunt me) Into your arms (Don't hurt me)
Into your arms (Don't haunt me) Into your arms (Don't hurt me) Into your arms (Don't haunt me) Into your arms
(Don't hurt me) Into your arms (Don't haunt me) Into your arms (Don't hurt me) Into your arms (Don't haunt me)
Into your arms (Don't hurt me) Into your arms (Don't haunt me)"
Let’s similarly look for the ‘happiest’ songs:
radiohead[['album','valence', 'happiness', 'gloom']].sort_values(by='happiness', ascending=False).head(10)
album | valence | happiness | gloom | |
---|---|---|---|---|
track_name | ||||
Fitter Happier | OK Computer | 0.744 | 0.435214 | 0.191214 |
I Am a Wicked Child | Com Lag: 2+2=5 | 0.692 | 0.425438 | 0.183789 |
15 Step | In Rainbows | 0.847 | 0.423500 | 0.076500 |
Jigsaw Falling Into Place | In Rainbows | 0.808 | 0.414006 | 0.131020 |
Sulk | The Bends | 0.671 | 0.405095 | 0.171460 |
Where Bluebirds Fly | Com Lag: 2+2=5 | 0.746 | 0.392560 | 0.136780 |
I Promise | OK Computer OKNOTOK 1997 2017 | 0.487 | 0.369156 | 0.256500 |
Backdrifts | Hail To the Thief | 0.732 | 0.366000 | 0.199295 |
Separator | The King Of Limbs | 0.659 | 0.365018 | 0.197138 |
There, There | Hail To the Thief | 0.717 | 0.364853 | 0.166911 |
It is noticeable that the songs with the ‘happiest’ lyrics don’t necessarily have the highest valence. I Promise scores low on the valence scale, but is judged to have the 7th happiest set of lyrics overall. Looking at the lyrics, we see this is likely to be due to the repetition of the word ‘promise’, which is classed as a happy word. It also has a fairly high gloom index as the other words in the song are fairly negative.
radiohead.loc['I Promise'].lyrics
"I won't run away no more, I promise Even when I get bored, I promise Even when you lock me out,
I promise I say my prayers every night, I promise I know which side I'm spread, I promise
The tantrums and the chitty chats, I promise Even when the ship is wrecked, I promise
Tie me to the rotting deck, I promise I won't run away no more, I promise Even when I get bored,
I promise Even when the ship is wrecked, I promise Tie me to the rotting deck, I promise
I won't run away no more, I promise"
Looking at musical genres
After replicating the Radiohead analysis, I thought it might be interesting to look at tracks from different musical genres and compare their characteristics. To do this, I downloaded the data for Spotify playlists containing songs from a variety of genres, then made use of plotly’s interactivity to produce a plot that lets us investigate the different measures.
# RapCaviar, Pop Rising, Ultimate Indie, Top picks country, truly deeply house, metal essentials
dfs = {'indie': pd.read_csv('indie.csv'), 'pop': pd.read_csv('pop.csv'), 'country': pd.read_csv('country.csv'),
'metal': pd.read_csv('metal.csv'), 'house': pd.read_csv('house.csv'), 'rap': pd.read_csv('rap.csv')}
import plotly
plotly.offline.init_notebook_mode(connected=True)
import plotly.plotly as py
import plotly.graph_objs as go
import plotly.tools as tls
# function to make list of traces given dictionary of dataframes and the dataframe keys to be plotted
def make_traces(x, y, dfs):
    ts = []
    for name, df in dfs.items():
        ts.append(go.Scatter(x=df[x], y=df[y], mode='markers',
                             name=name, text=df.name + ' - ' + df.artists))
    return ts
data = dict()
# define which categories we want to include
categories = ['duration_ms', 'popularity', 'acousticness', 'danceability',
'energy', 'instrumentalness', 'key', 'liveness', 'loudness',
'mode', 'speechiness', 'tempo', 'time_signature', 'valence']
# for each category make a list of the data from each dataframe
for cat in categories:
    data[cat] = [df[cat] for df in dfs.values()]
# define behaviour of dropdown menus
# we've defined the buttons using list comprehensions - on selection, the x/y data and axis label are updated
updatemenus = [
    # x-axis dropdown
    dict(x=-0.05, y=0.8,
         buttons=[dict(label=cat, method='update',
                       args=[{'x': data[cat]}, {'xaxis': dict(title=cat)}])
                  for cat in categories]),
    # y-axis dropdown
    dict(x=-0.05, y=1,
         buttons=[dict(label=cat, method='update',
                       args=[{'y': data[cat]}, {'yaxis': dict(title=cat)}])
                  for cat in categories]),
]
# set the initial data
initial_dat = go.Data(make_traces('duration_ms', 'duration_ms', dfs))
# make the layout
layout = dict(title='Compare genres', showlegend=True,
updatemenus=updatemenus)
fig = dict(data=initial_dat, layout=layout)
fig['layout'].update(hovermode='closest')
plotly.offline.iplot(fig)
This is pretty interesting. Looking at energy vs danceability, if you select just the metal, country and house datasets (click a dataset’s name in the legend to hide it), you can see that the data form three fairly distinct clusters. Surprisingly, the rap and house datasets occupy a similar region of the plot. The metal dataset is the most tightly clustered: all of its tracks are high energy but not very danceable. We can also see that house tracks tend to have a very consistent tempo (120 bpm) and to be the longest, rap music tends to be the most popular, and almost all songs have a 4/4 time signature.
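These eyeballed clusters could also be checked numerically with a groupby over the playlist dataframes. A sketch on invented toy values (the real dataframes would slot in directly):

```python
# Invented toy values illustrating how the clustering claim could be checked
# numerically: compare per-genre means of the audio features.
import pandas as pd

toy = pd.DataFrame({
    'genre':        ['metal', 'metal', 'house', 'house', 'country', 'country'],
    'energy':       [0.95, 0.90, 0.70, 0.75, 0.50, 0.55],
    'danceability': [0.35, 0.30, 0.80, 0.85, 0.60, 0.55],
})
print(toy.groupby('genre')[['energy', 'danceability']].mean())
```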
Machine learning
From the above plot, it can be seen that the data is quite clustered, which suggests it may be possible to build some kind of machine learning genre classifier using scikit-learn. Knowing very little about machine learning, I pretty much stuck to the examples in the documentation, so the resulting classifier is almost certainly much less successful than it could be with some tuning.
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.metrics import accuracy_score
for genre, df in dfs.items():
    df['genre'] = genre
dat = pd.concat(dfs.values())
feature_cols = ['duration_ms', 'explicit', 'popularity', 'acousticness', 'danceability',
                'energy', 'instrumentalness', 'key', 'liveness', 'loudness',
                'mode', 'speechiness', 'tempo', 'time_signature', 'valence']
data = dat[feature_cols].values
labels = dat['genre'].values
data_train, data_test, labels_train, labels_test = train_test_split(data, labels, test_size = 0.3)
classifier = tree.DecisionTreeClassifier()
classifier.fit(data_train, labels_train)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=None,
splitter='best')
accuracy = accuracy_score(labels_test, classifier.predict(data_test))
print("Decision Tree Accuracy with a 70/30 split: {}".format(accuracy))
Decision Tree Accuracy with a 70/30 split: 0.6298701298701299
This doesn’t have the best accuracy; however, if we drop one of the genres (especially one of the less tightly clustered ones, such as pop, indie or rap), the accuracy increases significantly. Not knowing much about machine learning classifiers, it’s also very likely I’ve chosen one that isn’t great for this dataset, and the accuracy would improve with a more suitable choice.
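To sketch what swapping in a different classifier might look like, here is a random forest, which often copes better with noisy, overlapping classes than a single tree. Since the playlist CSVs aren’t reproduced here, make_classification stands in for the real data, so the numbers are illustrative only:

```python
# Hypothetical sketch: random forest on synthetic data standing in for
# the genre features (same shape: 14 numeric features, 6 genre classes).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=600, n_features=14, n_informative=8,
                           n_classes=6, n_clusters_per_class=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print(accuracy_score(y_test, forest.predict(X_test)))
```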
nopop_dat = dat[dat.genre != 'pop']
feature_cols = ['duration_ms', 'explicit', 'popularity', 'acousticness', 'danceability',
                'energy', 'instrumentalness', 'key', 'liveness', 'loudness',
                'mode', 'speechiness', 'tempo', 'time_signature', 'valence']
data = nopop_dat[feature_cols].values
labels = nopop_dat['genre'].values
data_train, data_test, labels_train, labels_test = train_test_split(data, labels, test_size = 0.3)
classifier = tree.DecisionTreeClassifier()
classifier.fit(data_train, labels_train)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False, random_state=None,
splitter='best')
accuracy = accuracy_score(labels_test, classifier.predict(data_test))
print("Decision Tree Accuracy with a 70/30 split: {}".format(accuracy))
Decision Tree Accuracy with a 70/30 split: 0.8173076923076923
That’s much better!
Summary
In this post, I have described how I was able to download a load of music data from Spotify using spotipy, analyse it using pandas, produce an interactive plot using plotly, and do some (v. basic) machine learning using scikit-learn. In the process, I learnt a great deal about pandas and how web APIs work, such that I’m pretty keen to explore some more interesting datasets in the future. I also learnt that Radiohead songs really are quite depressing, and that metal music may be very energetic but is not judged by Spotify as being very danceable.