Music Recommender System 1

Building a Music Recommender System from Scratch

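Overview: using the Million Song Dataset, we build a music recommender that uses collaborative filtering for recall and GBDT+LR for ranking, with a ratio of play counts serving as the user's rating. UserCF and ItemCF are implemented with KNN, and matrix-factorization recall is implemented with SVD.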

This project is adapted from the blog post linked below, with FM added for recall and ranking.

https://blog.csdn.net/qq_30841655/article/details/107989560


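Part 1. Dataset Overview

Our dataset

Dataset preprocessing

Our dataset comes from a public project completed jointly by The Echo Nest and LabROSA. It mainly consists of quantified features of non-Chinese music collected over many years, and includes play records from about a million users over several hundred thousand songs (train_triplets.txt, 2.9 GB) together with detailed information about those songs (triplets_metadata.db, 700 MB).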

millionsongdataset.com

http://millionsongdataset.com/sites/default/files/challenge/train_triplets.txt.zip

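Each line of the play-record file train_triplets.txt has the form: user, song, play count. Both the user and the song IDs are anonymized.

The song-details file triplets_metadata.db includes each song's release year, artist, artist hotness, and so on.

Because the dataset is large, we take only a subset of the .txt file as our working dataset; the run shown below uses 3 million rows.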
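Loading a subset of the play records could look roughly like the sketch below. It assumes the file is tab-separated with no header row; the function name load_triplets and the column names are illustrative choices, and a small in-memory sample stands in for the real file.

```python
import io

import pandas as pd


def load_triplets(source, n_rows):
    """Read the first n_rows play records; source is a path or a file-like object."""
    return pd.read_csv(
        source,
        sep="\t",          # assumption: user, song and play count are tab-separated
        header=None,       # the file has no header row
        names=["user", "song", "play_count"],
        nrows=n_rows,      # only read a subset to keep memory manageable
    )


# Tiny in-memory stand-in for train_triplets.txt
sample = io.StringIO(
    "u1\tSOAAA\t1\n"
    "u1\tSOBBB\t2\n"
    "u2\tSOAAA\t5\n"
)
data = load_triplets(sample, n_rows=2)
print(data)
```

With the real file, the call would be something like load_triplets("train_triplets.txt", n_rows=...) with whatever subset size fits in memory.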

Step 1. Processing the .txt file

Reduce memory usage by encoding IDs and converting data types

Filter out users with too few plays

In [ ]:

Out[ ]:

ㅤ	user	song	play_count
0	b80344d063b5ccb3212f76538f3d9e43d87dca9e	SOAKIMP12A8C130995	1
1	b80344d063b5ccb3212f76538f3d9e43d87dca9e	SOAPDEY12A81C210A9	1
2	b80344d063b5ccb3212f76538f3d9e43d87dca9e	SOBBMDR12A8C13253B	2
3	b80344d063b5ccb3212f76538f3d9e43d87dca9e	SOBFNSP12AF72A0E22	1
4	b80344d063b5ccb3212f76538f3d9e43d87dca9e	SOBFOVM12A58A7D494	1

In [ ]:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000000 entries, 0 to 2999999
Data columns (total 3 columns):
 #   Column      Dtype 
---  ------      ----- 
 0   user        object
 1   song        object
 2   play_count  int64 
dtypes: int64(1), object(2)
memory usage: 68.7+ MB

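As we can see, the user and song IDs have already been anonymized, but that does not stop us from making recommendations.

Looking at the memory information, we want to shrink the dataset so that later computations run faster. Specifically,

We apply a LabelEncoder to user and song

We convert the encoded columns to int32 (in the output below, play_count is still int64; it is downcast in Step 2)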
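The two steps above can be sketched as follows. This is a minimal illustration using scikit-learn's LabelEncoder; the helper name encode_ids and the toy IDs are made up for the example, and the encoders are returned so that the same mapping can be reused later.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def encode_ids(df, columns=("user", "song")):
    """Replace string IDs with integer codes stored as int32.

    Returns the encoded copy and the fitted encoders (one per column).
    """
    df = df.copy()
    encoders = {}
    for col in columns:
        enc = LabelEncoder()
        # fit_transform maps each distinct string to an integer code
        df[col] = enc.fit_transform(df[col]).astype("int32")
        encoders[col] = enc
    return df, encoders


raw = pd.DataFrame(
    {
        "user": ["b80344", "b80344", "a11111"],
        "song": ["SOB", "SOA", "SOB"],
        "play_count": [1, 2, 4],
    }
)
encoded, encoders = encode_ids(raw)
print(encoded.dtypes)
print(raw.memory_usage(deep=True).sum(), "bytes before,",
      encoded.memory_usage(deep=True).sum(), "bytes after")
```

Passing deep=True to memory_usage counts the actual string storage, which is why the saving it reports is larger than the one shown by a plain info() call.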

In [ ]:

Out[ ]:

ㅤ	user	song	play_count
0	44970	3684	1
1	44970	5409	1
2	44970	9724	2
3	44970	11147	1
4	44970	11158	1
...	...	...	...
2999995	32232	125731	2
2999996	32232	125854	1
2999997	32232	126016	1
2999998	32232	126253	1
2999999	32232	127219	4

3000000 rows × 3 columns

In [ ]:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000000 entries, 0 to 2999999
Data columns (total 3 columns):
 #   Column      Dtype
---  ------      -----
 0   user        int32
 1   song        int32
 2   play_count  int64
dtypes: int32(2), int64(1)
memory usage: 45.8 MB

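Here the memory reported by info() drops from 68.7+ MB to 45.8 MB, so the conversion is effective. (The "+" means the object columns were not fully measured, so the real saving is larger.)

Next, we do some basic filtering. First, let us look at the distribution of each user's total play count.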

In [ ]:

[Figure: histogram of total play count per user, drawn with sns.distplot(list(user_playcounts.values()), bins=5000, kde=False); seaborn warns that distplot is deprecated in favor of displot or histplot]

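The figure above shows that a large share of users have fewer than 100 total plays. Over a span of several years, that is unusually low. A likely reason is that these users do not listen to music much and only open a song occasionally. Let us see how much of the overall data they account for.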

In [ ]:

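Users with more than 100 plays as a share of all users: 39.51%
Plays from users with more than 100 plays as a share of all plays: 80.1985%
Rows from users with more than 100 plays as a share of all rows: 71.191%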

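From the results above, users with more than 100 plays make up about 40% of all users, yet they generate about 80% of all plays and about 70% of all rows. We can therefore filter out the users below this threshold while keeping most of the data.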
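A minimal sketch of this filter and of the three shares reported above, assuming a table with user, song and play_count columns. The function name filter_active_users and the toy numbers are illustrative only.

```python
import pandas as pd


def filter_active_users(df, min_plays=100):
    """Keep the rows of users whose total play count exceeds min_plays.

    Also returns the share of users, plays and rows that survive the filter.
    """
    total_per_user = df.groupby("user")["play_count"].sum()
    active = total_per_user[total_per_user > min_plays].index
    kept = df[df["user"].isin(active)]
    shares = {
        "users": len(active) / len(total_per_user),
        "plays": kept["play_count"].sum() / df["play_count"].sum(),
        "rows": len(kept) / len(df),
    }
    return kept, shares


toy = pd.DataFrame(
    {
        "user": [0, 0, 1, 2, 2, 2],
        "song": [10, 11, 10, 10, 11, 12],
        "play_count": [90, 30, 5, 1, 2, 3],
    }
)
kept, shares = filter_active_users(toy, min_plays=100)
print(shares)
```

The same pattern works for songs by grouping on the song column instead of the user column.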

In [ ]:

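Similarly, we keep only songs with a reasonable number of plays, because songs with very few plays both increase the computational cost and reduce the accuracy of collaborative filtering. We first look at the distribution of play counts across songs.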

In [ ]:

[Figure: histogram of total play count per song, drawn with sns.distplot(list(song_playcounts.values()), bins=10000, kde=False); seaborn warns that distplot is deprecated in favor of displot or histplot]

We observe that most songs have very few plays, often fewer than 50! These songs attract almost no listeners and can be filtered out.

In [ ]:

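Songs with more than 20 plays as a share of all songs: 12.509999999999998%
Plays from songs with more than 20 plays as a share of all plays: 76.9802%
Rows from songs with more than 20 plays as a share of all rows: 68.7174%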

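As shown, songs with more than 20 plays make up about 12.5% of all songs, yet they account for about 77% of all plays and about 69% of all rows. Filtering out the songs below this threshold therefore removes most of the long tail while keeping the bulk of the listening data.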

In [ ]:

Step 2. Processing the .db file

Read the data

Apply a LabelEncoder to song_id

Merge the newly read data with the existing data on song_id
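The steps above can be sketched as follows. The table name songs and the helper names are assumptions made for illustration, and song_to_code is a hypothetical dictionary from the original song_id strings to the integer codes already used in the play table (for example built from the fitted encoder's classes_). An in-memory SQLite database stands in for the real .db file.

```python
import sqlite3

import pandas as pd


def load_song_metadata(conn, table="songs"):
    """Read the whole metadata table (table name is an assumption) into a DataFrame."""
    return pd.read_sql_query(f"SELECT * FROM {table}", conn)


def attach_metadata(plays, metadata, song_to_code):
    """Map song_id strings to the integer codes used in plays, then merge on that code."""
    meta = metadata.copy()
    meta["song"] = meta["song_id"].map(song_to_code)
    # Songs that never appear in the play table have no code; drop them
    meta = meta.dropna(subset=["song"]).drop_duplicates("song")
    meta["song"] = meta["song"].astype("int32")
    return plays.merge(meta.drop(columns=["song_id"]), on="song", how="inner")


# In-memory stand-in for the metadata database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE songs (song_id TEXT, title TEXT, artist_name TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO songs VALUES (?, ?, ?, ?)",
    [("SOA", "The Cove", "Jack Johnson", 0), ("SOZ", "Unplayed", "Nobody", 1999)],
)
conn.commit()

plays = pd.DataFrame({"user": [1, 2], "song": [0, 0], "play_count": [1, 3]}).astype("int32")
merged = attach_metadata(plays, load_song_metadata(conn), song_to_code={"SOA": 0})
print(merged)
```

An inner merge keeps only the play records whose song has metadata, which matches the smaller row count seen after the merge in the output below.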

In [ ]:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1520667 entries, 0 to 1520666
Data columns (total 16 columns):
 #   Column              Non-Null Count    Dtype  
---  ------              --------------    -----  
 0   user                1520667 non-null  int32  
 1   song                1520667 non-null  int32  
 2   play_count          1520667 non-null  int64  
 3   track_id            1520667 non-null  object 
 4   title               1520667 non-null  object 
 5   release             1520667 non-null  object 
 6   artist_id           1520667 non-null  object 
 7   artist_mbid         1520667 non-null  object 
 8   artist_name         1520667 non-null  object 
 9   duration            1520667 non-null  float64
 10  artist_familiarity  1520667 non-null  float64
 11  artist_hotttnesss   1520667 non-null  float64
 12  year                1520667 non-null  int64  
 13  track_7digitalid    1520667 non-null  int64  
 14  shs_perf            1520667 non-null  int64  
 15  shs_work            1520667 non-null  int64  
dtypes: float64(3), int32(2), int64(5), object(6)
memory usage: 185.6+ MB

In [ ]:

Out[ ]:

Index(['user', 'song', 'play_count', 'track_id', 'title', 'release',
       'artist_id', 'artist_mbid', 'artist_name', 'duration',
       'artist_familiarity', 'artist_hotttnesss', 'year', 'track_7digitalid',
       'shs_perf', 'shs_work'],
      dtype='object')

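To reduce memory, we convert types again (shs_perf and shs_work remain int64 in the output below and are dropped in Step 3):

Convert int64 to int32

Convert float64 to float32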

In [ ]:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1520667 entries, 0 to 1520666
Data columns (total 16 columns):
 #   Column              Non-Null Count    Dtype  
---  ------              --------------    -----  
 0   user                1520667 non-null  int32  
 1   song                1520667 non-null  int32  
 2   play_count          1520667 non-null  int32  
 3   track_id            1520667 non-null  object 
 4   title               1520667 non-null  object 
 5   release             1520667 non-null  object 
 6   artist_id           1520667 non-null  object 
 7   artist_mbid         1520667 non-null  object 
 8   artist_name         1520667 non-null  object 
 9   duration            1520667 non-null  float32
 10  artist_familiarity  1520667 non-null  float32
 11  artist_hotttnesss   1520667 non-null  float32
 12  year                1520667 non-null  int32  
 13  track_7digitalid    1520667 non-null  int32  
 14  shs_perf            1520667 non-null  int64  
 15  shs_work            1520667 non-null  int64  
dtypes: float32(3), int32(5), int64(2), object(6)
memory usage: 150.8+ MB

Step 3. Data cleaning

Remove duplicates

Drop unused information

Some columns are almost certainly of no use to us, for example

track_id

artist_id

artist_mbid

duration

track_7digitalid

shs_perf

shs_work

We mainly rely on the rating matrix for recall and ranking, so we should not need the columns above.
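The cleaning step can be sketched in a couple of lines of pandas. The column list is the one given above; the function name clean and the toy rows are illustrative.

```python
import pandas as pd

UNUSED_COLUMNS = [
    "track_id", "artist_id", "artist_mbid", "duration",
    "track_7digitalid", "shs_perf", "shs_work",
]


def clean(df):
    """Drop exact duplicate rows, then drop the unused columns that are present."""
    return df.drop_duplicates().drop(columns=UNUSED_COLUMNS, errors="ignore")


raw = pd.DataFrame(
    {
        "user": [1, 1, 2],
        "song": [7, 7, 7],
        "play_count": [3, 3, 1],
        "track_id": ["T1", "T1", "T1"],
        "title": ["The Cove"] * 3,
        "duration": [112.0] * 3,
    }
)
cleaned = clean(raw)
print(cleaned)
```

Using errors="ignore" lets the same function run on tables that contain only some of the listed columns.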

In [ ]:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1520667 entries, 0 to 1520666
Data columns (total 9 columns):
 #   Column              Non-Null Count    Dtype  
---  ------              --------------    -----  
 0   user                1520667 non-null  int32  
 1   song                1520667 non-null  int32  
 2   play_count          1520667 non-null  int32  
 3   title               1520667 non-null  object 
 4   release             1520667 non-null  object 
 5   artist_name         1520667 non-null  object 
 6   artist_familiarity  1520667 non-null  float32
 7   artist_hotttnesss   1520667 non-null  float32
 8   year                1520667 non-null  int32  
dtypes: float32(2), int32(4), object(3)
memory usage: 81.2+ MB

Step 4. Visualization

Here we use word clouds to get an intuitive view of the most popular artists, albums, and songs.

In [ ]:

Out[ ]:

ㅤ	user	song	play_count	title	release	artist_name	artist_familiarity	artist_hotttnesss	year
0	44970	3684	1	The Cove	Thicker Than Water	Jack Johnson	0.832012	0.677482	0
1	30316	3684	1	The Cove	Thicker Than Water	Jack Johnson	0.832012	0.677482	0
2	28697	3684	3	The Cove	Thicker Than Water	Jack Johnson	0.832012	0.677482	0
3	8903	3684	1	The Cove	Thicker Than Water	Jack Johnson	0.832012	0.677482	0
4	15439	3684	6	The Cove	Thicker Than Water	Jack Johnson	0.832012	0.677482	0

In [ ]:

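Part 2. Recommendation Engines

For the recall stage, we present three recommendation approaches:

Leaderboard-based recommendation

Collaborative-filtering-based recommendation

Matrix-factorization-based recommendation

Step 1. Leaderboard-based recommendation

We use the number of distinct listeners of each song as its score. We do not use the raw play count, because one person may play a song many times even though other people do not like it.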
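A minimal sketch of such a leaderboard, assuming the merged table has user and title columns. Grouping by title is a simplification (two different songs can share a title), and the function name is illustrative.

```python
import pandas as pd


def top_by_listeners(df, n=5):
    """Rank songs by the number of distinct users who played them."""
    listeners = df.groupby("title")["user"].nunique()
    return listeners.sort_values(ascending=False).head(n).index.tolist()


toy = pd.DataFrame(
    {
        "user": [1, 2, 3, 1, 2, 9],
        "title": ["Yellow", "Yellow", "Yellow", "Undo", "Undo", "Loop"],
        # "Loop" has by far the most plays but only one listener
        "play_count": [1, 2, 1, 3, 1, 500],
    }
)
print(top_by_listeners(toy, n=2))
```

In the toy data the heavily replayed song does not reach the top, which is exactly the effect that counting listeners instead of plays is meant to achieve.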

In [ ]:

Out[ ]:

['Use Somebody',
 'Sehr kosmisch',
 'Dog Days Are Over (Radio Edit)',
 'Yellow',
 'Undo']

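Step 2. Collaborative-filtering-based recommendation

Collaborative filtering needs a user-item rating matrix. Here, a user's rating for a song is computed from the following quantities:

The user's maximum per-song play count

The play count of the current song / the user's average per-song play count

The rating is log(2 + the ratio above)

Once we have the user-item rating matrix, we use the KNNBasic algorithm from the surprise library for collaborative filtering.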
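The rating formula can be sketched as below. It assumes the natural logarithm and the user's mean per-song play count as the denominator; both are assumptions, chosen because some sample ratings shown later exceed log(3), which would be impossible if the ratio were capped at 1. The function name add_rating is illustrative.

```python
import numpy as np
import pandas as pd


def add_rating(df):
    """Add rating = ln(2 + play_count / mean play count of the same user)."""
    df = df.copy()
    # transform("mean") broadcasts each user's mean back onto that user's rows
    user_mean = df.groupby("user")["play_count"].transform("mean")
    df["rating"] = np.log(2 + df["play_count"] / user_mean)
    return df


toy = pd.DataFrame(
    {"user": [0, 0, 1], "song": [10, 11, 10], "play_count": [1, 3, 4]}
)
rated = add_rating(toy)
print(rated)
```

The constant 2 inside the logarithm keeps every rating above ln(2), so even a song played once gets a positive score.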

In [ ]:

Out[ ]:

(1, 2213)

In [ ]:

[Figure: distribution of the rating column, drawn with sns.distplot(data['rating'].values, bins=100); seaborn warns that distplot is deprecated in favor of displot or histplot]

In [ ]:

First, we make recommendations with ItemCF.
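To make the idea concrete, here is a small dense NumPy sketch of item-based KNN with a mean-squared-difference (MSD) similarity, the similarity named in the training log below. It is an illustration written for this note, not the surprise implementation: it uses a dense matrix with NaN for missing ratings, which would not scale to the real data, and all names are made up.

```python
import numpy as np


def msd_item_similarity(R):
    """R is a users x items array with np.nan for missing ratings.

    sim[i, j] = 1 / (msd + 1), where msd is the mean squared rating
    difference over users who rated both items (0 if there are none).
    """
    n_items = R.shape[1]
    sim = np.zeros((n_items, n_items))
    for i in range(n_items):
        for j in range(n_items):
            common = ~np.isnan(R[:, i]) & ~np.isnan(R[:, j])
            if common.any():
                msd = np.mean((R[common, i] - R[common, j]) ** 2)
                sim[i, j] = 1.0 / (msd + 1.0)
    return sim


def predict_item_knn(R, sim, user, item, k=2):
    """Similarity-weighted mean of the user's ratings on the k most similar items."""
    rated = np.flatnonzero(~np.isnan(R[user]))
    rated = rated[rated != item]
    neighbours = rated[np.argsort(-sim[item, rated])][:k]
    weights = sim[item, neighbours]
    if weights.sum() == 0:
        return float(np.nanmean(R))  # fall back to the global mean rating
    return float(weights @ R[user, neighbours] / weights.sum())


R = np.array(
    [
        [1.0, 1.0, np.nan],
        [1.2, 1.2, 0.8],
        [np.nan, 0.9, 0.9],
        [1.0, np.nan, 0.7],
    ]
)
sim = msd_item_similarity(R)
print(predict_item_knn(R, sim, user=0, item=2))
```

UserCF follows the same recipe with the roles of users and items swapped, for example by passing the transposed matrix.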

In [ ]:

Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.2750
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.2758
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.2757
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.2753
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.2749

In [ ]:

Out[ ]:

ㅤ	user	item	rating
0	44970	3684	0.982117
1	30316	3684	0.940118
2	28697	3684	1.025853
3	8903	3684	0.882501
4	15439	3684	1.212341

In [ ]:

Out[ ]:

{213116: 2.217696608652135,
 136892: 2.217696608652135,
 50275: 2.2176966086521346,
 35162: 2.2176966086521346,
 68624: 2.2176966086521346}

Next, we make recommendations with UserCF.

In [ ]:

Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.2691
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.2698
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.2691
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.2705
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.2704

In [ ]:

Out[ ]:

['The Ghost Of Cain',
 'Cream (Paul van Dyk Remix)',
 'Sitting on Top of the World',
 'Money Honey',
 'Born To Serve The Lord']

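Step 3. Matrix-factorization-based recommendation

Matrix factorization also needs the user-item rating matrix, so we reuse the one built above. Likewise, we use the SVD algorithm from the surprise library to do the factorization.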
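For intuition, the sketch below trains a biased matrix factorization with stochastic gradient descent in plain NumPy, which is the general family of model that "SVD" refers to in recommender libraries. It is a toy written for this note rather than the surprise code, and the hyperparameters and names are arbitrary choices for the example.

```python
import numpy as np


def train_mf(triples, n_users, n_items, n_factors=8, n_epochs=100,
             lr=0.01, reg=0.02, seed=0):
    """Fit r_ui ~ mu + b_u + b_i + p_u . q_i by SGD; returns a predict(u, i) function."""
    rng = np.random.default_rng(seed)
    mu = float(np.mean([r for _, _, r in triples]))
    bu = np.zeros(n_users)
    bi = np.zeros(n_items)
    P = rng.normal(0, 0.1, (n_users, n_factors))
    Q = rng.normal(0, 0.1, (n_items, n_factors))
    for _ in range(n_epochs):
        for u, i, r in triples:
            err = r - (mu + bu[u] + bi[i] + P[u] @ Q[i])
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            # Both right-hand sides use the old factors (simultaneous update)
            P[u], Q[i] = (P[u] + lr * (err * Q[i] - reg * P[u]),
                          Q[i] + lr * (err * P[u] - reg * Q[i]))

    def predict(u, i):
        return float(mu + bu[u] + bi[i] + P[u] @ Q[i])

    return predict


def rmse(predict, triples):
    """Root mean squared error of predict over (user, item, rating) triples."""
    return float(np.sqrt(np.mean([(r - predict(u, i)) ** 2 for u, i, r in triples])))


triples = [(0, 0, 1.0), (0, 1, 1.2), (1, 0, 0.8), (1, 2, 1.1),
           (2, 1, 1.3), (2, 2, 0.9), (3, 0, 0.7), (3, 2, 1.0)]
model = train_mf(triples, n_users=4, n_items=3)
mean_rating = float(np.mean([r for _, _, r in triples]))
print("baseline RMSE:", rmse(lambda u, i: mean_rating, triples))
print("trained RMSE: ", rmse(model, triples))
```

The comparison here is on the training data only, so it shows that the model fits, not that it generalizes; the RMSE values in the output below come from held-out folds.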

In [ ]:

RMSE: 0.2738
RMSE: 0.2728
RMSE: 0.2729
RMSE: 0.2725
RMSE: 0.2728

In [ ]:

Out[ ]:

{61409: 1.5882515013638112,
 713: 1.5695735776735484,
 93688: 1.5248378042643582,
 149970: 1.5044646167453368,
 155927: 1.4800919020292347}

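Part 3. Ranking

The ranking stage usually works as follows:

It takes the output of the recall stage as input

It uses CTR prediction as a further ranking criterion

Here we can recall 50 songs, use GBDT+LR to predict the CTR of each, rank them by score, and select 5.

User-item ratings alone are no longer enough at this point, because we need to consider feature combinations. For this we go back to the earlier data table.

The data processing goes as follows:

Copy the data into a new table named new_data

Drop the title column, because it does not take part in feature combination

Label-encode the remaining object columns

To balance positive and negative samples, binarize the rating column: ratings below 0.7 become 0 (dislike) and ratings above 0.7 become 1 (like). Note that in the sample output below a rating of 0.882501 is labelled 0, so the run shown appears to use a different cutoff

Split new_data in half (ratio 0.5): one half is the training set for GBDT, the other is the training set for LR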

In [ ]:

Out[ ]:

ㅤ	user	song	play_count	release	artist_name	artist_familiarity	artist_hotttnesss	year	rating
0	44970	3684	1	11441	3245	0.832012	0.677482	0	1
1	30316	3684	1	11441	3245	0.832012	0.677482	0	1
2	28697	3684	3	11441	3245	0.832012	0.677482	0	1
3	8903	3684	1	11441	3245	0.832012	0.677482	0	0
4	15439	3684	6	11441	3245	0.832012	0.677482	0	1

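Step 1. GBDT+LR prediction

Here we do CTR prediction and use the predicted click probability as a weight that is combined with the rating to give the final score. To do this, we need to

Split the dataset, one part as the training set for GBDT and one part as the training set for LR

Train the GBDT first, then feed its output into LR to produce the final result

Finally, check the AUC metric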
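The steps above can be sketched with scikit-learn as follows. This is one common way to wire GBDT+LR together (leaf indices from each tree, one-hot encoded, fed to logistic regression); the synthetic data, the helper name fit_gbdt_lr and the hyperparameters are assumptions for illustration, not the settings used in the run below.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder


def fit_gbdt_lr(X_gbdt, y_gbdt, X_lr, y_lr, n_estimators=50, seed=0):
    """Train GBDT on one half, then LR on one-hot leaf indices of the other half."""
    gbdt = GradientBoostingClassifier(n_estimators=n_estimators, random_state=seed)
    gbdt.fit(X_gbdt, y_gbdt)
    # apply() returns the leaf index per sample and tree: (n_samples, n_trees, 1)
    encoder = OneHotEncoder(handle_unknown="ignore")
    encoder.fit(gbdt.apply(X_gbdt)[:, :, 0])
    lr = LogisticRegression(max_iter=1000)
    lr.fit(encoder.transform(gbdt.apply(X_lr)[:, :, 0]), y_lr)

    def predict_proba(X):
        leaves = encoder.transform(gbdt.apply(X)[:, :, 0])
        return lr.predict_proba(leaves)[:, 1]

    return predict_proba


# Synthetic stand-in for new_data: 4 features, binary like/dislike label
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 4))
y = (X[:, 0] + 0.5 * X[:, 1] * X[:, 2] + 0.3 * rng.normal(size=600) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
X_gbdt, X_lr, y_gbdt, y_lr = train_test_split(X_train, y_train, test_size=0.5, random_state=0)

predict_ctr = fit_gbdt_lr(X_gbdt, y_gbdt, X_lr, y_lr)
probs = predict_ctr(X_test)
auc = roc_auc_score(y_test, probs)
print(f"AUC on held-out toy data: {auc:.3f}")
```

Training the two models on disjoint halves keeps LR from fitting leaf features that the GBDT has already memorized. Raising max_iter, as done here, is also one way to address the lbfgs convergence warning that appears in the log below.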

In [ ]:

Current n_estimators = 200
GBDT training finished!

c:\Users\19853\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:458: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

LR training finished!
Current n_estimators and AUC: 200 0.7332819214330095
########################################

With the number of GBDT iterations set to 300, the AUC is 0.7374325241505955

Step 2. Ranking

Here we recall 50 songs with ItemCF, weight them with the GBDT+LR output to rank them, and pick 5 of them as the final recommendations.
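A minimal sketch of the re-ranking step. It assumes the final score is the recalled rating multiplied by the predicted click probability, which is one plausible reading of "using the probability as a weight"; the function name and the toy numbers are illustrative.

```python
def rerank(candidates, click_prob, top_n=5):
    """Order recalled songs by predicted rating times predicted click probability.

    candidates: {song: predicted rating from the recall stage}
    click_prob: {song: probability from the CTR model}
    """
    scored = {
        song: rating * click_prob.get(song, 0.0)  # songs without a CTR score sink to the bottom
        for song, rating in candidates.items()
    }
    return sorted(scored, key=scored.get, reverse=True)[:top_n]


candidates = {"a": 2.0, "b": 1.5, "c": 1.0}
click_prob = {"a": 0.2, "b": 0.8, "c": 0.9}
print(rerank(candidates, click_prob, top_n=2))
```

In the toy data the song with the highest recalled rating drops out because its click probability is low, which is the kind of reordering the ranking stage is meant to produce.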

In [ ]:

Recall finished!
Ranking weights computed!
Final recommendation list:

Out[ ]:

['Do You Wanna Dance',
'Behind The Sea [Live In Chicago]',
"Apuesta Por El Rock 'N' Roll",
"I'll Be Missing You (Featuring Faith Evans & 112)(Album Version)",
"I?'m A Steady Rollin? Man"]
