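Statement: This article is original work, and the author reserves all rights of interpretation. Reposting is permitted; if you quote it, please include a link to this article.
Abstract: With the growth of big data, medium and large merchants increasingly rely on it to understand their customers in more depth. To maximize their returns, such merchants often sample user behavior dynamically, collect an appropriate amount of personal information, and match users with the products they actually want. Knowing a product's repurchase rate therefore becomes particularly important.
This article uses Python to automate the analysis of how well the products bought by different users match and resemble one another, as well as the heat relationships between products (repurchase rate), in order to understand and roughly simulate the logic behind the personalized recommendations that medium and large merchants make.
Note: Before starting, download the Excel file that accompanies this article (a cleaned dataset of purchases at a supermarket from 2011 to 2014) and save it in the project's root directory.
(Copyright note on the Excel file: the file was batch-generated by AI and does not reflect any real data; the author takes no responsibility for it.)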
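This part analyzes the similarity between the products that users buy.
① Underlying logic
The idea of similarity between users' purchases is easy to grasp. Still, you may sometimes wonder why the major e-commerce platforms push products that you might actually want to buy. Some would answer that big data is constantly "stealing" your private information. That may not be the case. As China tightens its controls and restrictions on apps' access to personal data, covert data harvesting has become less and less common. So is it the database that is behind this?
These questions can be answered with a single figure:
Figure 1: Item filtering under different criteria
(Source: 卢明东's blog, lumingdong.cn)
Let us look at Figure 1. On the left is "User-Based Filtering", which links users to one another. In the left panel (call the users A, B and C from top to bottom), user A bought grapes, apples, pineapples and pears, user B bought only apples, and user C bought apples and pineapples. Comparing just these three purchase histories shows that A and C are the most similar, sharing two items (apples and pineapples). It is then reasonable to guess that C might also like grapes and pears. The system can link the two users and, the next time C visits, recommend those two products to C. This is the logic of User-Based Filtering.
On the right is "Item-Based Filtering", which links items to one another. In the right panel, user A bought grapes, apples, pineapples and pears, user B bought grapes and pineapples, and user C bought pineapples. Pineapples are clearly popular. A and B also have something in common: both bought grapes along with pineapples. The system can therefore recommend grapes to C, because grapes and pineapples tend to be chosen by the same buyers. This is the logic of Item-Based Filtering.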
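The two filtering ideas can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions, not how any real platform works: the purchase sets mirror the two panels of Figure 1, similarity is just the number of shared items, and the function names `recommend_user_based` and `recommend_item_based` are hypothetical.

```python
from collections import Counter

# Toy data mirroring the left panel of Figure 1 (user-based example).
user_panel = {
    "A": {"grape", "apple", "pineapple", "pear"},
    "B": {"apple"},
    "C": {"apple", "pineapple"},
}

# Toy data mirroring the right panel of Figure 1 (item-based example).
item_panel = {
    "A": {"grape", "apple", "pineapple", "pear"},
    "B": {"grape", "pineapple"},
    "C": {"pineapple"},
}

def recommend_user_based(target, purchases):
    """Find the user sharing the most items with `target` and suggest their other items."""
    others = [u for u in purchases if u != target]
    # Similarity here is simply the number of items both users bought.
    nearest = max(others, key=lambda u: len(purchases[u] & purchases[target]))
    return nearest, sorted(purchases[nearest] - purchases[target])

def recommend_item_based(target, purchases):
    """Suggest the item most often bought together with the target's items."""
    owned = purchases[target]
    co_counts = Counter()
    for user, items in purchases.items():
        # Only baskets that overlap with the target's items contribute.
        if user != target and items & owned:
            co_counts.update(items - owned)
    return co_counts.most_common(1)[0][0]

nearest, suggestions = recommend_user_based("C", user_panel)
item_suggestion = recommend_item_based("C", item_panel)
print(nearest, suggestions)
print(item_suggestion)
```

In both cases the recommendation for user C comes purely from overlap counts; a real system would normalize for popularity and basket size.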
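② Repurchase-rate analysis
Let us first analyze the repurchase rate of each product.
Basic environment setup
Before starting, make sure the matplotlib, pandas and numpy libraries are installed in the project environment. If they are not, install them with pip or conda using the following commands:
$ > pip install matplotlib
$ > pip install pandas
$ > pip install numpy
------or------
$ > conda install matplotlib
$ > conda install pandas
$ > conda install numpy
If needed, pip also accepts additional options such as -i or --pre.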
Once installed, import the libraries and the Excel file at the top of the program and filter out malformed rows:
# Import libraries
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# Basic display settings
pd.set_option('display.width', 60)
# Load the full dataset
df = pd.read_excel('supermarket_data_clean.xlsx', index_col=0)
# Drop rows whose values have the wrong data type
def is_number(value):
    return isinstance(value, (int, float))
def is_string(value):
    return isinstance(value, str)
df = df[df["Quantity"].map(is_number)]
df = df[df["Sub-Category"].map(is_string)]
print(df)
The output is as follows:
Order Date Order Date Year \
Order ID
AG-2011-2040 1/1/2011 2011
IN-2011-47883 1/1/2011 2011
HU-2011-1220 1/1/2011 2011
IT-2011-3647632 1/1/2011 2011
IN-2011-47883 1/1/2011 2011
... ... ...
CA-2014-115427 31-12-2014 2014
MO-2014-2560 31-12-2014 2014
MX-2014-110527 31-12-2014 2014
MX-2014-114783 31-12-2014 2014
CA-2014-156720 31-12-2014 2014
Order Date Month Order Date Day \
Order ID
AG-2011-2040 1 1
IN-2011-47883 1 1
HU-2011-1220 1 1
IT-2011-3647632 1 1
IN-2011-47883 1 1
... ... ...
CA-2014-115427 12 31
MO-2014-2560 12 31
MX-2014-110527 12 31
MX-2014-114783 12 31
CA-2014-156720 12 31
Ship Date Ship Date Year \
Order ID
AG-2011-2040 6/1/2011 2011
IN-2011-47883 8/1/2011 2011
HU-2011-1220 5/1/2011 2011
IT-2011-3647632 5/1/2011 2011
IN-2011-47883 8/1/2011 2011
... ... ...
CA-2014-115427 4/1/2015 2015
MO-2014-2560 5/1/2015 2015
MX-2014-110527 2/1/2015 2015
MX-2014-114783 6/1/2015 2015
CA-2014-156720 4/1/2015 2015
Ship Date Month Ship Date Day \
Order ID
AG-2011-2040 6 1
IN-2011-47883 8 1
HU-2011-1220 5 1
IT-2011-3647632 5 1
IN-2011-47883 8 1
... ... ...
CA-2014-115427 4 1
MO-2014-2560 5 1
MX-2014-110527 2 1
MX-2014-114783 6 1
CA-2014-156720 4 1
Ship Mode Customer ID ... \
Order ID ...
AG-2011-2040 Standard Class TB-11280 ...
IN-2011-47883 Standard Class JH-15985 ...
HU-2011-1220 Second Class AT-735 ...
IT-2011-3647632 Second Class EM-14140 ...
IN-2011-47883 Standard Class JH-15985 ...
... ... ... ...
CA-2014-115427 Standard Class EB-13975 ...
MO-2014-2560 Standard Class LP-7095 ...
MX-2014-110527 Second Class CM-12190 ...
MX-2014-114783 Standard Class TD-20995 ...
CA-2014-156720 Standard Class JM-15580 ...
Sub-Category \
Order ID
AG-2011-2040 Storage
IN-2011-47883 Supplies
HU-2011-1220 Storage
IT-2011-3647632 Paper
IN-2011-47883 Furnishings
... ...
CA-2014-115427 Binders
MO-2014-2560 Wilson Jones Hole Reinforcements, Clear
MX-2014-110527 Labels
MX-2014-114783 Labels
CA-2014-156720 Fasteners
Product Name \
Order ID
AG-2011-2040 Tenex Lockers, Blue
IN-2011-47883 Acme Trimmer, High Speed
HU-2011-1220 Tenex Box, Single Width
IT-2011-3647632 Enermax Note Cards, Premium
IN-2011-47883 Eldon Light Bulb, Duo Pack
... ...
CA-2014-115427 Cardinal Slant-D Ring Binder, Heavy Gauge Vinyl
MO-2014-2560 3.99
MX-2014-110527 Hon Color Coded Labels, 5000 Label Set
MX-2014-114783 Hon Legal Exhibit Labels, Alphabetical
CA-2014-156720 Bagged Rubber Bands
Sales Quantity Discount Profit \
Order ID
AG-2011-2040 408.3 2 0.00 106.14
IN-2011-47883 120.366 3 0.10 36.036
HU-2011-1220 66.12 4 0.00 29.64
IT-2011-3647632 44.865 3 0.50 -26.055
IN-2011-47883 113.67 5 0.10 37.77
... ... ... ... ...
CA-2014-115427 13.904 2 0.20 4.5188
MO-2014-2560 1 0 0.42 0.49
MX-2014-110527 26.4 3 0.00 12.36
MX-2014-114783 7.12 1 0.00 0.56
CA-2014-156720 3.024 3 0.20 -0.6048
Shipping Cost Order Priority Unnamed: 24 \
Order ID
AG-2011-2040 35.46 Medium NaN
IN-2011-47883 9.72 Medium NaN
HU-2011-1220 8.17 High NaN
IT-2011-3647632 4.82 High NaN
IN-2011-47883 4.7 Medium NaN
... ... ... ...
CA-2014-115427 0.89 Medium NaN
MO-2014-2560 Medium NaN NaN
MX-2014-110527 0.35 Medium NaN
MX-2014-114783 0.2 Medium NaN
CA-2014-156720 0.17 Medium NaN
Unnamed: 25
Order ID
AG-2011-2040 NaN
IN-2011-47883 NaN
HU-2011-1220 NaN
IT-2011-3647632 NaN
IN-2011-47883 NaN
... ...
CA-2014-115427 NaN
MO-2014-2560 NaN
MX-2014-110527 NaN
MX-2014-114783 NaN
CA-2014-156720 NaN
[51152 rows x 30 columns]
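A type check like the one above is deliberately loose. The sketch below, in plain Python, shows which values it lets through and compares it with a stricter variant. The helper `is_valid_quantity` and the sample values are hypothetical and are not part of the original program; note also that the printed output contains a row (MO-2014-2560) whose columns appear shifted, which a type check on a single column cannot detect.

```python
import math

def is_number(value):
    """Type check in the style used above: accepts any int or float."""
    return isinstance(value, (int, float))

def is_valid_quantity(value):
    """Stricter check (hypothetical helper): also rejects booleans and NaN."""
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return False
    return not math.isnan(value)

# Sample cell values of the kind a spreadsheet import can produce.
samples = [2, 3.0, "3.99", None, float("nan"), True]

loose = [is_number(v) for v in samples]
strict = [is_valid_quantity(v) for v in samples]
print(loose)
print(strict)
```

The loose check accepts NaN (which is a float, and is what pandas uses for empty cells) and True (bool is a subclass of int), whereas the stricter helper rejects both.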
Next, we can count the number of purchase records for each user:
cc = df.groupby("Customer Name").agg(["count"])
print(cc)
Here is an excerpt of the result:
Order Date Order Date Year \
count count
Customer Name
Aaron Bergman 89 89
Aaron Hawkins 56 56
Aaron Smayling 60 60
Adam Bellavance 68 68
Adam Hart 82 82
... ... ...
Xylona Preis 61 61
Yana Sorensen 62 62
Yoseph Carroll 56 56
Zuschuss Carroll 85 85
Zuschuss Donatelli 54 54
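For example, Aaron Bergman has 89 purchase records in the dataset, which covers 2011 to 2014. Note that this is the number of rows (order line items) rather than the number of visits, since a single order can span several rows. We will use this user as a simple example of how to filter a user's relevant information.
Filtering a user's relevant information
When analyzing Aaron Bergman, we care first about how often the user buys, how much each purchase costs, which items were bought and in what quantity. All of this is already listed in the Excel sheet: for example, Sub-Category is the kind of item bought and Quantity is the number of units. We can therefore extract these fields directly by indexing df: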
u1 = df[df['Customer Name'] == 'Aaron Bergman'][['Sub-Category', 'Sales', 'Quantity', 'Order Date']]
print(u1)
The output is as follows:
Sub-Category Sales Quantity Order Date
Order ID
MX-2011-127215 Phones 82.26 1 3/11/2011
ES-2011-4146320 Art 50.7 2 4/4/2011
ES-2011-4146320 Labels 32.4 3 4/4/2011
CA-2011-156587 Chairs 48.712 1 7/3/2011
CA-2011-156587 Art 17.94 3 7/3/2011
... ... ... ... ...
ES-2011-4184901 Furnishings 75.96 4 30-08-2011
US-2013-123806 Chairs 59.328 3 31-05-2013
US-2013-123806 Art 21.816 1 31-05-2013
US-2013-103450 Chairs 86.416 1 31-10-2013
US-2013-103450 Fasteners 27.18 3 31-10-2013
[89 rows x 4 columns]
Item statistics
After this step each user's purchases are easy to read, and we can compute the total quantity the user bought in each category:
u1sc = u1[["Quantity", 'Sub-Category']].groupby('Sub-Category').agg(['sum'])
print(u1sc)
The output is as follows:
Quantity
sum
Sub-Category
Accessories 26
Art 14
Binders 23
Bookcases 9
Chairs 16
Copiers 36
Envelopes 1
Fasteners 16
Furnishings 35
Labels 18
Machines 20
Phones 27
Storage 29
Supplies 31
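This gives the total quantity Aaron Bergman has bought in each category he has ever purchased from.
Extending to all categories
We can now total the purchased quantity of every category across all users:
sc = df[['Sub-Category', 'Quantity']].groupby('Sub-Category').agg(['sum'])
sc.columns = ['All']
print(sc)
The output is as follows: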
All
Sub-Category
Accessories 10806
Acco 3-Hole Punch, Recycled 0
Acco Binder, Economy 0
Acco Binding Machine, Recycled 0
Acco Hole Reinforcements, Durable 0
... ...
Wilson Jones Index Tab, Economy 0
Xerox Cards & Envelopes, Multicolor 0
Xerox Cards & Envelopes, Recycled 0
Xerox Computer Printout Paper, Multicolor 0
Xerox Parchment Paper, Multicolor 0
[480 rows x 1 columns]
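We can then analyze several users and concatenate their data column by column:
users = ["Aaron Bergman", "Aaron Hawkins", "Aaron Smayling", "Adam Bellavance"]
for user in users:
    # Per-category quantity bought by this user
    u1 = df[df['Customer Name'] == user][['Sub-Category', 'Sales', 'Quantity', 'Order Date']]
    u1sc = u1[["Quantity", 'Sub-Category']].groupby('Sub-Category').agg(['sum'])
    u1sc.columns = [user]
    # Append the user's column; categories the user never bought become NaN
    sc = pd.concat([sc, u1sc], axis=1)
# Treat categories a user never bought as quantity 0
sc = sc.fillna(0)
print(sc)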
All \
Sub-Category
Accessories 10806.0
Acco 3-Hole Punch, Recycled 0.0
Acco Binder, Economy 0.0
Acco Binding Machine, Recycled 0.0
Acco Hole Reinforcements, Durable 0.0
... ...
Wilson Jones Index Tab, Economy 0.0
Xerox Cards & Envelopes, Multicolor 0.0
Xerox Cards & Envelopes, Recycled 0.0
Xerox Computer Printout Paper, Multicolor 0.0
Xerox Parchment Paper, Multicolor 0.0
Aaron Bergman \
Sub-Category
Accessories 26
Acco 3-Hole Punch, Recycled 0
Acco Binder, Economy 0
Acco Binding Machine, Recycled 0
Acco Hole Reinforcements, Durable 0
... ...
Wilson Jones Index Tab, Economy 0
Xerox Cards & Envelopes, Multicolor 0
Xerox Cards & Envelopes, Recycled 0
Xerox Computer Printout Paper, Multicolor 0
Xerox Parchment Paper, Multicolor 0
Aaron Hawkins \
Sub-Category
Accessories 10
Acco 3-Hole Punch, Recycled 0
Acco Binder, Economy 0
Acco Binding Machine, Recycled 0
Acco Hole Reinforcements, Durable 0
... ...
Wilson Jones Index Tab, Economy 0
Xerox Cards & Envelopes, Multicolor 0
Xerox Cards & Envelopes, Recycled 0
Xerox Computer Printout Paper, Multicolor 0
Xerox Parchment Paper, Multicolor 0
Aaron Smayling \
Sub-Category
Accessories 20
Acco 3-Hole Punch, Recycled 0
Acco Binder, Economy 0
Acco Binding Machine, Recycled 0
Acco Hole Reinforcements, Durable 0
... ...
Wilson Jones Index Tab, Economy 0
Xerox Cards & Envelopes, Multicolor 0
Xerox Cards & Envelopes, Recycled 0
Xerox Computer Printout Paper, Multicolor 0
Xerox Parchment Paper, Multicolor 0
Adam Bellavance
Sub-Category
Accessories 13.0
Acco 3-Hole Punch, Recycled 0.0
Acco Binder, Economy 0.0
Acco Binding Machine, Recycled 0.0
Acco Hole Reinforcements, Durable 0.0
... ...
Wilson Jones Index Tab, Economy 0.0
Xerox Cards & Envelopes, Multicolor 0.0
Xerox Cards & Envelopes, Recycled 0.0
Xerox Computer Printout Paper, Multicolor 0.0
Xerox Parchment Paper, Multicolor 0.0
[480 rows x 5 columns]
Similarity between columns
With the steps above completed, we can analyze how similar the columns are to one another:
similar = sc.corr(method = 'pearson', min_periods=1)
print(similar)
All Aaron Bergman Aaron Hawkins \
All 1.000000 0.816475 0.809099
Aaron Bergman 0.816475 1.000000 0.722000
Aaron Hawkins 0.809099 0.722000 1.000000
Aaron Smayling 0.898259 0.660771 0.594060
Adam Bellavance 0.875031 0.588859 0.676802
Aaron Smayling Adam Bellavance
All 0.898259 0.875031
Aaron Bergman 0.660771 0.588859
Aaron Hawkins 0.594060 0.676802
Aaron Smayling 1.000000 0.752868
Adam Bellavance 0.752868 1.000000
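To make the numbers in this matrix less opaque, here is a minimal pure-Python sketch of the Pearson correlation coefficient that corr(method='pearson') computes for each pair of columns. The function name `pearson` and the three quantity vectors are hypothetical and chosen only so that the results are easy to reason about; they are not taken from the dataset.

```python
import math

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-category quantities for three users.
user_a = [26, 14, 23, 9]
user_b = [13, 7, 11.5, 4.5]   # exactly half of user_a in every category
user_c = [9, 21, 12, 26]      # 35 minus user_a in every category

r_ab = pearson(user_a, user_b)
r_ac = pearson(user_a, user_c)
print(r_ab, r_ac)
```

Because user_b buys in the same proportions as user_a, their coefficient is (up to floating-point error) 1; user_c buys in exactly the opposite pattern, so the coefficient is close to -1. The coefficient measures the shape of the purchase profile, not its absolute volume.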
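We can also pick out the user most similar to a given user. The 'All' row and the user's own row are dropped first, because a user's correlation with themselves is always 1:
name = "Aaron Bergman"
us = similar[[name]]
us = us.drop('All')
us = us.drop(name)
# idxmax returns the row label (a user name) with the highest correlation
similar_user = us[name].idxmax()
print(similar_user)
We can then look up that user's favorite item, i.e. the category with the largest purchased quantity:
suf = sc[[similar_user]]
line = suf.iloc[suf[similar_user].argmax()]
print(line)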
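Computing the repurchase rate
With all this done, we can compute the repurchase rate. The code below takes the orders from 2012 and, for each month, computes the share of that month's customers who have two or more purchase records in the month:
df_2012 = df[df['Order Date Year'] == 2012]
# SimHei is only needed to render Chinese characters in the figure
plt.rcParams['font.family'] = 'SimHei'
df_pivot_counts = df_2012.pivot_table(index='Customer ID', columns='Order Date Month', values='Quantity', aggfunc='count')
#print(df_pivot_counts)
# Map counts of 2 or more to 1 and counts of 1 to 0; keep NaN where the customer bought nothing that month
df_pivot_counts_repurchase = df_pivot_counts.applymap(lambda x: 1 if x >= 2 else 0 if pd.notnull(x) else np.nan)
(df_pivot_counts_repurchase.sum()/df_pivot_counts_repurchase.count()).plot(marker='o',figsize=(12, 6))
plt.title('Monthly repurchase rate')
plt.grid(linestyle='-.')
plt.show()
Running the code produces the line chart shown in Figure 2:
Figure 2: Line chart of the repurchase rate produced by the analysis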
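The same per-month calculation can be traced by hand on a tiny example. The sketch below uses only the standard library and a handful of hypothetical (customer, month) records that are not taken from the dataset; the function name `monthly_repurchase_rate` is likewise an assumption made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical purchase records: one (customer_id, month) pair per purchased line item.
records = [
    ("C1", 1), ("C1", 1), ("C2", 1),
    ("C1", 2), ("C2", 2), ("C2", 2), ("C3", 2), ("C3", 2),
]

def monthly_repurchase_rate(records):
    """Share of each month's buyers who have two or more records in that month."""
    counts = Counter(records)            # (customer, month) -> number of records
    buyers = defaultdict(int)            # month -> customers with at least 1 record
    repeat_buyers = defaultdict(int)     # month -> customers with at least 2 records
    for (customer, month), n in counts.items():
        buyers[month] += 1
        if n >= 2:
            repeat_buyers[month] += 1
    return {month: repeat_buyers[month] / buyers[month] for month in sorted(buyers)}

rates = monthly_repurchase_rate(records)
print(rates)
```

In month 1 one of two buyers repeats (rate 0.5), and in month 2 two of three buyers repeat. Customers with no record in a month are left out of that month's denominator, which matches the role of NaN in the pivot table.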
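II. Lift association and confidence analysis
This part uses the apyori library to run a lift-based association analysis on the data table, computing the support and confidence of product associations and visualizing the results.
① Introduction to apriori and lift
apriori is the core function of the apyori library, which implements the Apriori algorithm for mining association rules. Association rules are evaluated with three key metrics: Support, Confidence and Lift, defined as follows:
Support: the proportion of all transactions that contain both A and B.
Confidence: among the transactions that contain A, the proportion that also contain B, i.e. the number of transactions containing both A and B divided by the number containing A.
Lift: the ratio of "the proportion of transactions containing A that also contain B" to "the proportion of transactions containing B". As a formula: Lift = ( P(A & B) / P(A) ) / P(B) = P(A & B) / ( P(A) * P(B) ).
Lift reflects how A and B are associated in a rule: a lift above 1 indicates a positive association that grows stronger as the value increases, a lift below 1 indicates a negative association that grows stronger as the value decreases, and a lift of exactly 1 means A and B are independent.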
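Here P(A) is the proportion of transactions that contain A, and P(A & B) is the proportion that contain both A and B.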
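The three definitions can be checked on a handful of transactions. The following is a minimal sketch in plain Python that does not use apyori; the five toy transactions and the function names `support`, `confidence` and `lift` are assumptions made for illustration.

```python
# Hypothetical transactions (each one is a set of purchased items).
transactions = [
    {"binder", "staples"},
    {"binder", "staples", "paper"},
    {"binder"},
    {"paper"},
    {"staples", "paper"},
]

def support(itemset, transactions):
    """Proportion of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(a, b, transactions):
    """P(A & B) / P(A): how often B appears among transactions containing A."""
    return support(set(a) | set(b), transactions) / support(a, transactions)

def lift(a, b, transactions):
    """Confidence of A -> B divided by the overall share of transactions containing B."""
    return confidence(a, b, transactions) / support(b, transactions)

s = support({"binder", "staples"}, transactions)
c = confidence({"binder"}, {"staples"}, transactions)
l = lift({"binder"}, {"staples"}, transactions)
print(round(s, 3), round(c, 3), round(l, 3))
```

Here 2 of 5 transactions contain both items, 3 of 5 contain a binder and 3 of 5 contain staples, so the lift is (2/3) / (3/5), slightly above 1: a weak positive association.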
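② apriori in practice
Now that we understand apriori, we can analyze the support, confidence and lift of the dataset.
First, make sure the apyori library is installed in the project environment. If it is not, install it with pip or conda: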
$ > pip install apyori
------or------
$ > conda install apyori
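Once installed, import pandas and apyori at the top of the program and load the dataset file:
import pandas as pd
from apyori import apriori
# Load the data and inspect its format
file_path = "supermarket_data_clean.xlsx"
dfs = pd.read_excel(file_path, index_col=0)
To prepare the data for apriori, we first filter out rows whose fields have the wrong type: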
df = dfs[dfs["Profit"].map(lambda x: isinstance(x, (int, float)))]
df = df[df["Customer ID"].map(lambda x: isinstance(x, str))]
df = df[df["Product Name"].map(lambda x: isinstance(x, str))]
print(df)
The output is as follows:
Order Date Order Date Year Order Date Month \
Order ID
AG-2011-2040 1/1/2011 2011 1
IN-2011-47883 1/1/2011 2011 1
HU-2011-1220 1/1/2011 2011 1
IT-2011-3647632 1/1/2011 2011 1
IN-2011-47883 1/1/2011 2011 1
... ... ... ...
MX-2014-108574 31-12-2014 2014 12
CA-2014-115427 31-12-2014 2014 12
MX-2014-110527 31-12-2014 2014 12
MX-2014-114783 31-12-2014 2014 12
CA-2014-156720 31-12-2014 2014 12
Order Date Day Ship Date Ship Date Year Ship Date Month \
Order ID
AG-2011-2040 1 6/1/2011 2011 6
IN-2011-47883 1 8/1/2011 2011 8
HU-2011-1220 1 5/1/2011 2011 5
IT-2011-3647632 1 5/1/2011 2011 5
IN-2011-47883 1 8/1/2011 2011 8
... ... ... ... ...
MX-2014-108574 31 4/1/2015 2015 4
CA-2014-115427 31 4/1/2015 2015 4
MX-2014-110527 31 2/1/2015 2015 2
MX-2014-114783 31 6/1/2015 2015 6
CA-2014-156720 31 4/1/2015 2015 4
Ship Date Day Ship Mode Customer ID ... Sub-Category \
Order ID ...
AG-2011-2040 1 Standard Class TB-11280 ... Storage
IN-2011-47883 1 Standard Class JH-15985 ... Supplies
HU-2011-1220 1 Second Class AT-735 ... Storage
IT-2011-3647632 1 Second Class EM-14140 ... Paper
IN-2011-47883 1 Standard Class JH-15985 ... Furnishings
... ... ... ... ... ...
MX-2014-108574 1 Standard Class JB-16045 ... Labels
CA-2014-115427 1 Standard Class EB-13975 ... Binders
MX-2014-110527 1 Second Class CM-12190 ... Labels
MX-2014-114783 1 Standard Class TD-20995 ... Labels
CA-2014-156720 1 Standard Class JM-15580 ... Fasteners
Product Name Sales \
Order ID
AG-2011-2040 Tenex Lockers, Blue 408.3
IN-2011-47883 Acme Trimmer, High Speed 120.366
HU-2011-1220 Tenex Box, Single Width 66.12
IT-2011-3647632 Enermax Note Cards, Premium 44.865
IN-2011-47883 Eldon Light Bulb, Duo Pack 113.67
... ... ...
MX-2014-108574 Novimex Legal Exhibit Labels, Adjustable 16.74
CA-2014-115427 Cardinal Slant-D Ring Binder, Heavy Gauge Vinyl 13.904
MX-2014-110527 Hon Color Coded Labels, 5000 Label Set 26.4
MX-2014-114783 Hon Legal Exhibit Labels, Alphabetical 7.12
CA-2014-156720 Bagged Rubber Bands 3.024
Quantity Discount Profit Shipping Cost Order Priority \
Order ID
AG-2011-2040 2 0.0 106.14 35.46 Medium
IN-2011-47883 3 0.1 36.036 9.72 Medium
HU-2011-1220 4 0.0 29.64 8.17 High
IT-2011-3647632 3 0.5 -26.055 4.82 High
IN-2011-47883 5 0.1 37.77 4.7 Medium
... ... ... ... ... ...
MX-2014-108574 3 0.0 0.66 1.32 Medium
CA-2014-115427 2 0.2 4.5188 0.89 Medium
MX-2014-110527 3 0.0 12.36 0.35 Medium
MX-2014-114783 1 0.0 0.56 0.2 Medium
CA-2014-156720 3 0.2 -0.6048 0.17 Medium
Unnamed: 24 Unnamed: 25
Order ID
AG-2011-2040 NaN NaN
IN-2011-47883 NaN NaN
HU-2011-1220 NaN NaN
IT-2011-3647632 NaN NaN
IN-2011-47883 NaN NaN
... ... ...
MX-2014-108574 NaN NaN
CA-2014-115427 NaN NaN
MX-2014-110527 NaN NaN
MX-2014-114783 NaN NaN
CA-2014-156720 NaN NaN
[50629 rows x 30 columns]
Then group the products by customer, so that each customer's purchases form one list (one "basket" per customer):
# Collect every product name bought by each customer into a list
products_by_customer = df.groupby('Customer ID')['Product Name'].apply(list)
new_data_dict = []
order_item_list = []
for customer_id, products in products_by_customer.items():
    sample = {"Customer ID": customer_id, "Product Name": products}
    new_data_dict.append(sample)
    order_item_list.append(products)
print(new_data_dict[0])
print(order_item_list[0])
{'Customer ID': 'AA-10315', 'Product Name': ['Fiskars Trimmer, Serrated', 'Avery Shipping Labels, Alphabetical', 'SanDisk Numeric Keypad, USB', 'Elite Shears, High Speed', 'Tenex Personal Project File with Scoop Front Design, Black', 'High Speed Automatic Electric Letter Opener', 'Polycom VVX 310 VoIP phone', 'Verbatim 25 GB 6x Blu-ray Single Layer Recordable Disc, 1/Pack', 'Acco Banker\'s Clasps, 5 3/4"-Long', 'SanDisk Memo Slips, Multicolor', 'BIC Highlighters, Blue', 'Samsung Audio Dock, Full Size', 'Nokia Speaker Phone, Full Size', 'Bush Floating Shelf Set, Metal', 'Apple Speaker Phone, with Caller ID', 'Smead Lockers, Wire Frame', 'Cardinal Binder Covers, Durable', 'Binney & Smith Pencil Sharpener, Water Color', 'Tenex Folders, Blue', 'Konica Calculator, Wireless', 'Hon File Folder Labels, Laser Printer Compatible', 'Master Caster Door Stop, Large Neon Orange', 'Staples', 'Sauder Library with Doors, Traditional', 'Motorola Signal Booster, Full Size', 'Motorola Headset, with Caller ID', 'Avery Index Tab, Economy', 'Eldon Box, Industrial', 'Cardinal Binder, Durable', 'Advantus Stacking Tray, Erganomic', 'Samsung Audio Dock, Cordless', 'Ibico Binding Machine, Recycled', 'Ibico Binder Covers, Durable', 'Wilson Jones Hole Reinforcements, Economy', 'Hamilton Beach Toaster, Black', 'Deflect-O Frame, Duo Pack', 'Boston Canvas, Fluorescent', "Belkin 325VA UPS Surge Protector, 6'", 'Avery Binding System Hidden Tab Executive Style Index Sets', 'GBC DocuBind 200 Manual Binding Machine', 'Fellowes Advanced Computer Series Surge Protectors', 'Bush Stackable Bookrack, Pine']}
['Fiskars Trimmer, Serrated', 'Avery Shipping Labels, Alphabetical', 'SanDisk Numeric Keypad, USB', 'Elite Shears, High Speed', 'Tenex Personal Project File with Scoop Front Design, Black', 'High Speed Automatic Electric Letter Opener', 'Polycom VVX 310 VoIP phone', 'Verbatim 25 GB 6x Blu-ray Single Layer Recordable Disc, 1/Pack', 'Acco Banker\'s Clasps, 5 3/4"-Long', 'SanDisk Memo Slips, Multicolor', 'BIC Highlighters, Blue', 'Samsung Audio Dock, Full Size', 'Nokia Speaker Phone, Full Size', 'Bush Floating Shelf Set, Metal', 'Apple Speaker Phone, with Caller ID', 'Smead Lockers, Wire Frame', 'Cardinal Binder Covers, Durable', 'Binney & Smith Pencil Sharpener, Water Color', 'Tenex Folders, Blue', 'Konica Calculator, Wireless', 'Hon File Folder Labels, Laser Printer Compatible', 'Master Caster Door Stop, Large Neon Orange', 'Staples', 'Sauder Library with Doors, Traditional', 'Motorola Signal Booster, Full Size', 'Motorola Headset, with Caller ID', 'Avery Index Tab, Economy', 'Eldon Box, Industrial', 'Cardinal Binder, Durable', 'Advantus Stacking Tray, Erganomic', 'Samsung Audio Dock, Cordless', 'Ibico Binding Machine, Recycled', 'Ibico Binder Covers, Durable', 'Wilson Jones Hole Reinforcements, Economy', 'Hamilton Beach Toaster, Black', 'Deflect-O Frame, Duo Pack', 'Boston Canvas, Fluorescent', "Belkin 325VA UPS Surge Protector, 6'", 'Avery Binding System Hidden Tab Executive Style Index Sets', 'GBC DocuBind 200 Manual Binding Machine', 'Fellowes Advanced Computer Series Surge Protectors', 'Bush Stackable Bookrack, Pine']
Next, we can set up the association analysis with minimum support and confidence thresholds:
results = apriori(order_item_list, min_support=0.005, min_confidence=0.25)
With that defined, we can iterate over the results:
list1, list2, list3, list4 = [], [], [], []
for result in results:
    # Support, rounded to 3 decimal places
    support = round(result.support, 3)
    # Iterate over the ordered_statistics entries
    for rule in result.ordered_statistics:
        # Antecedent and consequent, converted to lists
        head_set = list(rule.items_base)
        tail_set = list(rule.items_add)
        # Skip rules with an empty antecedent
        if not head_set:
            continue
        # Join antecedent and consequent into the form of an association rule
        related_category = str(head_set) + '→' + str(tail_set)
        # Confidence, rounded to 3 decimal places
        confidence = round(rule.confidence, 3)
        # Lift, rounded to 3 decimal places
        lift = round(rule.lift, 3)
        # Show the strong rule with its support, confidence and lift
        print(related_category, support, confidence, lift)
        list1.append(related_category)
        list2.append(support)
        list3.append(confidence)
        list4.append(lift)
The output is as follows:
['Acco Binder Covers, Durable']→['Staples'] 0.005 0.258 2.171
['Acco Binder, Durable']→['Staples'] 0.007 0.268 2.257
['Acco Index Tab, Durable']→['Staples'] 0.008 0.351 2.956
['Acme Trimmer, Easy Grip']→['Staples'] 0.005 0.4 3.365
['Advantus Clamps, Assorted Sizes']→['Staples'] 0.005 0.333 2.804
['Advantus Door Stop, Erganomic']→['Staples'] 0.008 0.429 3.605
['Advantus Light Bulb, Black']→['Staples'] 0.005 0.348 2.926
['Advantus Stacking Tray, Erganomic']→['Staples'] 0.006 0.25 2.103
['Ames Peel and Seal, Set of 50']→['Staples'] 0.005 0.32 2.692
['Apple Signal Booster, Cordless']→['Staples'] 0.005 0.308 2.589
['Avery Binder Covers, Durable']→['Staples'] 0.006 0.281 2.366
['Avery Binder Covers, Economy']→['Staples'] 0.006 0.257 2.163
['BIC Canvas, Blue']→['Staples'] 0.006 0.29 2.442
['BIC Markers, Water Color']→['Staples'] 0.006 0.25 2.103
['BIC Pens, Fluorescent']→['Staples'] 0.005 0.286 2.404
['Binney & Smith Highlighters, Easy-Erase']→['Staples'] 0.005 0.267 2.243
['Binney & Smith Pencil Sharpener, Water Color']→['Staples'] 0.009 0.269 2.265
['Boston Sketch Pad, Easy-Erase']→['Staples'] 0.006 0.281 2.366
['Cardinal Binding Machine, Durable']→['Staples'] 0.005 0.25 2.103
['Cardinal Hole Reinforcements, Economy']→['Staples'] 0.006 0.294 2.474
['Cisco Office Telephone, VoIP']→['Staples'] 0.005 0.333 2.804
['Cisco Smart Phone, Cordless']→['Staples'] 0.005 0.296 2.493
['Dania Classic Bookcase, Pine']→['Staples'] 0.005 0.381 3.205
['Deflect-O Clock, Black']→['Staples'] 0.005 0.348 2.926
['Deflect-O Stacking Tray, Black']→['Staples'] 0.005 0.333 2.804
['Elite Letter Opener, Easy Grip']→['Staples'] 0.005 0.296 2.493
['Elite Trimmer, Serrated']→['Staples'] 0.005 0.364 3.059
['Enermax Flash Drive, Erganomic']→['Staples'] 0.007 0.55 4.627
['Fellowes Shelving, Single Width']→['Staples'] 0.006 0.257 2.163
['Fellowes Trays, Single Width']→['Staples'] 0.005 0.267 2.243
['Fiskars Letter Opener, Easy Grip']→['Staples'] 0.005 0.258 2.171
['Fiskars Trimmer, Easy Grip']→['Staples'] 0.005 0.296 2.493
['GBC Standard Therm-A-Bind Covers']→['Staples'] 0.005 0.727 6.118
['Harbour Creations Round Labels, Alphabetical']→['Staples'] 0.005 0.364 3.059
['Hon Executive Leather Armchair, Black']→['Staples'] 0.005 0.308 2.589
['Hon Shipping Labels, Laser Printer Compatible']→['Staples'] 0.005 0.296 2.493
['Hon Steel Folding Chair, Adjustable']→['Staples'] 0.005 0.296 2.493
['Hon Swivel Stool, Black']→['Staples'] 0.006 0.25 2.103
['Ibico 3-Hole Punch, Recycled']→['Staples'] 0.006 0.27 2.274
['Ibico Binding Machine, Recycled']→['Staples'] 0.008 0.302 2.543
['Kleencut Box Cutter, High Speed']→['Staples'] 0.005 0.276 2.321
['Kraft Business Envelopes, Recycled']→['Staples'] 0.005 0.267 2.243
['Memorex Flash Drive, Programmable']→['Staples'] 0.005 0.4 3.365
['Novimex Bag Chairs, Black']→['Staples'] 0.005 0.276 2.321
['Novimex Chairmat, Set of Two']→['Staples'] 0.005 0.296 2.493
['Novimex Executive Leather Armchair, Red']→['Staples'] 0.006 0.321 2.704
['OIC Push Pins, Bulk Pack']→['Staples'] 0.005 0.444 3.739
['Office Star Rocking Chair, Black']→['Staples'] 0.005 0.267 2.243
['Office Star Rocking Chair, Set of Two']→['Staples'] 0.007 0.393 3.305
['Rogers Folders, Wire Frame']→['Staples'] 0.006 0.281 2.366
['Rogers Lockers, Industrial']→['Staples'] 0.005 0.258 2.171
['SAFCO Executive Leather Armchair, Adjustable']→['Staples'] 0.006 0.346 2.912
['SAFCO Executive Leather Armchair, Black']→['Staples'] 0.005 0.286 2.404
['SAFCO Rocking Chair, Red']→['Staples'] 0.005 0.4 3.365
['SAFCO Steel Folding Chair, Red']→['Staples'] 0.006 0.37 3.116
['Smead Lockers, Wire Frame']→['Staples'] 0.006 0.265 2.227
['Smead Trays, Single Width']→['Staples'] 0.009 0.359 3.02
['Stiletto Shears, Serrated']→['Staples'] 0.005 0.471 3.959
['Stockwell Paper Clips, Assorted Sizes']→['Staples'] 0.011 0.295 2.482
['Stockwell Thumb Tacks, Bulk Pack']→['Staples'] 0.007 0.289 2.435
['Tenex Lockers, Single Width']→['Staples'] 0.005 0.267 2.243
['Tenex Trays, Single Width']→['Staples'] 0.006 0.263 2.214
['Wilson Jones Hole Reinforcements, Economy']→['Staples'] 0.008 0.324 2.728
['Xerox 1881']→['Staples'] 0.005 0.727 6.118
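A quick sanity check on this output: since lift is confidence divided by the share of baskets containing the consequent, dividing confidence by lift should give roughly the same number for every rule that ends in Staples. The sketch below copies four (confidence, lift) pairs from the output above; the variable names are illustrative, and small differences are expected because the printed values are rounded to three decimals.

```python
# (confidence, lift) pairs copied from four of the rules printed above.
rules = {
    "Acco Binder Covers, Durable": (0.258, 2.171),
    "Acme Trimmer, Easy Grip": (0.4, 3.365),
    "Enermax Flash Drive, Erganomic": (0.55, 4.627),
    "GBC Standard Therm-A-Bind Covers": (0.727, 6.118),
}

# confidence / lift estimates P(Staples), the share of baskets containing Staples.
implied_support = {name: conf / lift for name, (conf, lift) in rules.items()}

for name, value in implied_support.items():
    print(f"{name}: {value:.4f}")
```

All four ratios agree to about three decimal places, which suggests that Staples appears in a little under 12% of the customer baskets and that the rules differ mainly in how strongly the antecedent concentrates those baskets.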
Each matched rule now shows its support, confidence and lift in turn. We can also save the rules to a CSV file:
df = pd.DataFrame()
df['related_category'] = list1
df['support'] = list2
df['confidence'] = list3
df['lift'] = list4
df = df.sort_values('lift', ascending=False)
df.to_csv("./shopping_basket_result.csv")
print('Save successfully.')
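After running this, a CSV file named shopping_basket_result is created in the current directory. It contains the same rules as the console output above, sorted by lift in descending order.
③ Plotting a heatmap with seaborn
seaborn is a widely used open-source visualization library. Because it is built on top of matplotlib, the two can be used together to visualize support, confidence and lift.
First make sure the seaborn library is installed in the project environment. If it is not, install it with pip or conda: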
$ > pip install seaborn
------or------
$ > conda install seaborn
Once installed, import the two libraries at the top of the program:
import matplotlib.pyplot as plt
import seaborn as sns
After importing, we can plot a bar chart of the rules with the highest lift and a heatmap of support and confidence. The complete code is as follows:
# Font settings, only needed to render Chinese characters
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
# Load the results from the saved file
df = pd.read_csv("./shopping_basket_result.csv", index_col=0)
# The 10 rules with the highest lift
top_10_lift = df.sort_values('lift', ascending=False).head(10)
# Draw the bar chart
plt.figure(figsize=(12, 8))
sns.barplot(data=top_10_lift, y='related_category', x='lift', palette='viridis')
plt.title('Top 10 association rules by lift')
plt.xlabel('Lift')
plt.ylabel('Association rule')
plt.tight_layout()
plt.savefig('top_10_rules_by_lift.png')
plt.show()
# Heatmap of support and confidence
plt.figure(figsize=(12, 8))
pivot_table = df.pivot(index='related_category', columns='support', values='confidence')
sns.heatmap(pivot_table, cmap="YlGnBu", cbar_kws={'label': 'Confidence'})
plt.title('Heatmap of support and confidence')
plt.xlabel('Support')
plt.ylabel('Association rule')
plt.tight_layout()
plt.savefig('support_confidence_heatmap.png')
plt.show()
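Running the code displays two figures, a bar chart and a heatmap, and also saves them to the current directory, as shown in Figures 3 and 4:
Figure 3: The 10 association rules with the highest lift
Figure 4: Heatmap of support and confidence
④ Practical significance
We have now visualized the association rules. From a merchant's point of view, the results show, for example, that customers who bought GBC Standard Therm-A-Bind Covers are among the most likely to have also bought Staples (confidence 0.727, lift 6.118, tied with Xerox 1881). The merchant can therefore recommend Staples to customers who have bought only GBC Standard Therm-A-Bind Covers, and a personalized recommendation is complete.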
The end.
Copyright belongs to the original author, Mike Qin. If there is any infringement, please contact us for removal.