Scrape Spotify’s API within 20 mins

Alpar Gür
5 min read · Jul 2

Photo by Jonas Leupe on Unsplash

Recently, I worked on a project to develop a machine learning model that predicts the genre of a track from its audio features. To do that, we needed tracks and their audio features. Fortunately, Spotify offers several API endpoints for retrieving tracks and their pre-analysed audio features.

In this post, I will be sharing with you how we collected the data to train our models. Without further ado, let’s get started!

Prerequisites

  • Calling Spotify’s APIs requires authentication, so you will need a client ID and a client secret to obtain an access token. You will find these in the settings tab of Spotify for Developers. If you haven’t used it before, you can follow the link and create a Spotify app.
  • For scraping the APIs we’ll use Python, pandas and Jupyter, so make sure your development environment is set up and up to date.

Authentication

First of all, create new environment variables using the CLI and reload your environment:

$ export SPOTIFY_CLIENT_ID=<your-client-id>
$ export SPOTIFY_CLIENT_SECRET=<your-client-secret>

The code snippet below imports the necessary libraries and defines a function that calls Spotify’s accounts API to obtain an access token using the client credentials:

# import packages
import os
import json
import requests
import pandas as pd


def get_access_token(client_id: str, client_secret: str, grant_type: str = 'client_credentials'):
    # send the credentials form-encoded in the request body rather than in the URL
    url = 'https://accounts.spotify.com/api/token'
    payload = {'grant_type': grant_type, 'client_id': client_id, 'client_secret': client_secret}
    response = requests.post(url, data=payload)
    response.raise_for_status()  # fail early on invalid credentials
    access_token = 'Bearer ' + response.json()['access_token']

    return access_token


# get access token
grant_type = 'client_credentials'
client_id = os.getenv('SPOTIFY_CLIENT_ID')
client_secret = os.getenv('SPOTIFY_CLIENT_SECRET')

access_token = get_access_token(client_id, client_secret, grant_type)
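
Note that os.getenv returns None when a variable is not set, which would only surface later as a failed token request. Below is a minimal sketch of a guard for that case; require_env is a hypothetical helper name and not part of the original project:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message.

    Hypothetical helper for illustration; not part of the original code.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f'Environment variable {name} is not set')
    return value
```

It could stand in for the os.getenv calls above, for example client_id = require_env('SPOTIFY_CLIENT_ID').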

The following is a generic function that makes API calls given only the target URL and an access token:

def get_data(url: str, access_token: str, verbose: bool = False):
    response = requests.get(url, headers={'Authorization': access_token})
    result = json.loads(response.text)

    if verbose:
        print('Response body:\n', result)

    return result
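
Requesting many pages in a row can run into Spotify’s rate limit, in which case the API answers with status 429 and a Retry-After header giving the number of seconds to wait. The following is a sketch of a retrying variant of get_data, not the version used in the project. The names get_data_with_retry, http_get, max_retries and sleep are assumptions made for illustration:

```python
import time


def get_data_with_retry(url, access_token, http_get, max_retries=3, sleep=time.sleep):
    """Call the API and wait when it answers 429 (rate limited).

    Hypothetical helper: http_get is any callable with the signature of
    requests.get, passed in so the function can be tried without a network.
    """
    for attempt in range(max_retries + 1):
        response = http_get(url, headers={'Authorization': access_token})

        if response.status_code == 429 and attempt < max_retries:
            # wait for the number of seconds announced by the API, then try again
            wait_seconds = int(response.headers.get('Retry-After', 1))
            sleep(wait_seconds)
            continue

        # any other error status (or a final 429) is raised to the caller
        response.raise_for_status()
        return response.json()
```

In the notebook it would be called as get_data_with_retry(url, access_token, http_get=requests.get).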

Data Collection

Now we have laid the foundation. With this setup you can communicate with Spotify’s APIs however you like. The next step illustrates collecting track information:

def get_tracks(genres_list: list, steps: int, limit: int, offset: int, access_token: str):
    rows = []  # DataFrame.append was removed in pandas 2.0, so collect rows in a list
    _initial_offset = offset

    for genre in genres_list:

        for step in range(steps):
            url = 'https://api.spotify.com/v1/search?q=genre:{}&type=track&limit={}&offset={}'.format(genre, limit, offset)
            search_item = get_data(url, access_token)

            # a page may hold fewer than `limit` items, so iterate over what came back
            for item in search_item['tracks']['items']:
                track_id = item['id']
                track_name = item['name']
                artist_name = item['artists'][0]['name']
                popularity = item['popularity']

                rows.append({
                    'track_id': track_id,
                    'track_name': track_name,
                    'artist_name': artist_name,
                    'popularity': popularity,
                    'genre': genre
                })

            offset += limit
        offset = _initial_offset  # start from the initial offset again for the next genre

    tracks_df = pd.DataFrame(rows)
    return tracks_df

Basically, the get_tracks function calls the /search endpoint to collect track information for each genre in genres_list and puts the results into a pandas DataFrame. That might be a lot to take in at once, so let’s break it down:

In the URL, the part before the question mark specifies the /search endpoint and the part after it is the search query. In this query, we select a genre and define a limit. The limit tells the API how many items to return per request (max 50). The offset declares the index of the first item to return.
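
The same kind of URL can also be assembled with urllib.parse.urlencode from the standard library, which takes care of escaping special characters in the query. A small sketch, where build_search_url and SEARCH_ENDPOINT are hypothetical names that do not appear in the project:

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = 'https://api.spotify.com/v1/search'


def build_search_url(genre: str, limit: int = 50, offset: int = 0) -> str:
    """Build a /search URL for tracks of one genre (hypothetical helper)."""
    if not 1 <= limit <= 50:
        raise ValueError('limit must be between 1 and 50')

    query = {
        'q': 'genre:{}'.format(genre),  # the search query itself
        'type': 'track',                # only return tracks
        'limit': limit,                 # items per request
        'offset': offset,               # index of the first item
    }
    return SEARCH_ENDPOINT + '?' + urlencode(query)


print(build_search_url('rock', limit=50, offset=100))
# prints: https://api.spotify.com/v1/search?q=genre%3Arock&type=track&limit=50&offset=100
```

The colon shows up as %3A because urlencode percent-encodes it; it is still the same query.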

After collecting the data, we extract the needed information from each item. In this case the fields were track_id, track_name, artist_name and the popularity of a track (a value between 0 and 100).
To find out more about the query possibilities, check out the documentation.

Here we select the first artist’s name for each track. Tracks with featured guests have multiple artists, and if this information matters for your use case, you should adjust the line starting with artist_name.
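
One possible adjustment is to join the names of all artists into a single string. This is only a sketch: join_artist_names is a hypothetical helper, and the example item is made up in the shape that get_tracks reads from.

```python
def join_artist_names(item: dict, separator: str = ', ') -> str:
    """Return all artist names of a track item as one string (hypothetical helper)."""
    return separator.join(artist['name'] for artist in item['artists'])


# trimmed-down, made-up item with two artists
example_item = {
    'name': 'Example Track',
    'artists': [{'name': 'Artist A'}, {'name': 'Artist B'}],
}

print(join_artist_names(example_item))  # prints: Artist A, Artist B
```

Inside get_tracks, the artist_name line would then become artist_name = join_artist_names(item), with item being the current search result.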

For example, using the following parameters we can collect 1000 entries for each of the genres rock, rap, metal, blues and techno.
The initial value of offset is 0. At the end of each iteration it is incremented by the limit value to continue from where it left off:

steps = 20
limit = 50
offset = 0
genres_list = ['rock', 'rap', 'metal', 'blues', 'techno']

tracks_df = get_tracks(genres_list, steps, limit, offset, access_token)
# show dataframe
tracks_df
tracks_df dataframe
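
With these parameters, each genre is requested in 20 pages of 50 items. Here is a quick sketch of the offsets the loop walks through for one genre; page_offsets is a hypothetical helper that only mirrors the offset arithmetic in get_tracks:

```python
def page_offsets(steps: int, limit: int, initial_offset: int = 0) -> list:
    """List the offset used in each step for one genre (hypothetical helper)."""
    return [initial_offset + step * limit for step in range(steps)]


offsets = page_offsets(steps=20, limit=50)

print(offsets[:3], offsets[-1])  # prints: [0, 50, 100] 950
print(len(offsets) * 50)         # prints: 1000
```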

We have collected a fair number of tracks; now it is time to get the audio features of each track:

def get_track_features(tracks_df: pd.DataFrame, access_token: str):
    ids_to_request = tracks_df['track_id'].tolist()
    track_features_rows = []

    # the endpoint is queried in batches of 100 track ids
    for i in range(0, len(ids_to_request), 100):
        _list = ids_to_request[i:i + 100]

        request_text = ",".join(_list)
        url = 'https://api.spotify.com/v1/audio-features?ids=' + request_text
        result = get_data(url, access_token)
        track_features_list = result["audio_features"]

        # skip entries that come back empty (no features available for that id)
        track_features_rows += [track_features for track_features in track_features_list if track_features]

    # DataFrame.append was removed in pandas 2.0, so build the dataframe in one go
    track_features_df = pd.DataFrame(track_features_rows)

    # drop negligible features
    track_features_df.drop(columns=['type', 'uri', 'track_href', 'analysis_url'], inplace=True)
    track_features_df.rename(columns={'id':'track_id'}, inplace=True)

    return track_features_df

This function takes the track ids from a dataframe of tracks, retrieves their audio features in batches of 100, and writes them into a new dataframe.
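
The batching step can be looked at in isolation. Below is a minimal sketch, assuming a hypothetical helper named chunk_ids and made-up ids:

```python
def chunk_ids(ids: list, batch_size: int = 100) -> list:
    """Split a list of ids into consecutive batches (hypothetical helper)."""
    return [ids[start:start + batch_size] for start in range(0, len(ids), batch_size)]


# 250 made-up ids end up in three batches
fake_ids = ['id{}'.format(n) for n in range(250)]
batches = chunk_ids(fake_ids)

print([len(batch) for batch in batches])  # prints: [100, 100, 50]
print(','.join(batches[2][:3]))           # prints: id200,id201,id202
```

Each batch is joined with commas to form the ids parameter of one request.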

track_features_df = get_track_features(tracks_df, access_token)
track_features_df
track_features_df dataframe

Aaand that’s it! Now we can merge both dataframes and export the result to start training the models:

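from pathlib import Path

# assumption: the step that built this variable is not shown in the post;
# dropping repeated track ids keeps the merge from multiplying rows
track_features_df_no_duplicates = track_features_df.drop_duplicates(subset='track_id')

df = tracks_df.merge(track_features_df_no_duplicates, on='track_id')
df

# export data
file_path = Path('data/tracks-with-audio-features.csv')
file_path.parent.mkdir(parents=True, exist_ok=True)
df.to_csv(file_path)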
Merged dataframe df

Disclaimer: It is not uncommon for a track to be registered under multiple genres in Spotify’s catalogue or to exist in multiple versions (radio edit or extended version). Both phenomena result in duplicate items in the dataset, so you might want to do some data cleaning depending on what you want to achieve.
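
As a rough sketch of one such cleaning step, the snippet below keeps a single row per track id using pandas. The rows are made up, and keeping the first genre is only one possible choice. Different versions of a track will typically carry different ids, so they would need a separate rule, for example comparing track and artist names.

```python
import pandas as pd

# made-up rows: the same track id shows up under two genres
tracks = pd.DataFrame({
    'track_id': ['a1', 'a1', 'b2'],
    'track_name': ['Song A', 'Song A', 'Song B'],
    'genre': ['rock', 'metal', 'blues'],
})

# keep only the first genre seen for each track id
deduplicated = tracks.drop_duplicates(subset='track_id', keep='first').reset_index(drop=True)

print(len(tracks), len(deduplicated))  # prints: 3 2
print(deduplicated['genre'].tolist())  # prints: ['rock', 'blues']
```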

If you are eager to find out more about the project, you can check out the GitHub repository :)

Until then,
Alpar 🏄🏻‍♂️
