专业的编程技术博客社区

网站首页 > 博客文章 正文

手把手教你阿里天池数据竞赛:O2O优惠券使用预测

baijin 2024-08-14 12:22:57 博客文章 223 ℃ 0 评论

对于想从事数据分析行业的小伙伴来说,参加数据竞赛可以有效提高职场竞争力,下面我将讲解一下我前几天参加的阿里天池的数据竞赛:O2O优惠券使用预测,该比赛作为长期赛,可以随时提交结果,查看自己的排名。

本赛题的比赛背景:随着移动设备的完善和普及,移动互联网+各行各业进入了高速发展阶段,这其中以O2O消费最为吸引眼球。O2O行业天然关联数亿消费者,各类APP每天记录了超过百亿条用户行为和位置记录,因而成为大数据科研和商业化运营的最佳结合点之一。 以优惠券盘活老用户或吸引新客户进店消费是O2O的一种重要营销方式。然而随机投放的优惠券对多数用户造成无意义的干扰。对商家而言,滥发的优惠券可能降低品牌声誉,同时难以估算营销成本。 个性化投放是提高优惠券核销率的重要技术,它可以让具有一定偏好的消费者得到真正的实惠,同时赋予商家更强的营销能力。本次大赛为参赛选手提供了O2O场景相关的丰富数据,希望参赛选手通过分析建模,精准预测用户是否会在规定时间内使用相应优惠券。

本赛题提供用户在2016年1月1日至2016年6月30日之间真实线上线下消费行为,预测用户在2016年7月领取优惠券后15天以内的使用情况。数据集有三个,分别是ccf_offline_stage1_train.csv(用户线下消费和优惠券领取行为)、ccf_online_stage1_train.csv(用户线上点击/消费和优惠券领取行为)、ccf_offline_stage1_test_revised.csv(用户O2O线下优惠券使用预测样本)

首先导入各种库:

import os,sys,pickle 

import numpy as np
import pandas as pd

from datetime import date 

from sklearn.model_selection import KFold,train_test_split,StratifiedKFold,cross_val_score,GridSearchCV
from sklearn.pipeline import Pipeline 
from sklearn.linear_model import SGDClassifier,LogisticRegression 
from sklearn.preprocessing import StandardScaler 
from sklearn.metrics import log_loss,roc_auc_score,auc,roc_curve 
from sklearn.preprocessing import MinMaxScaler

读取文件:

# 导入数据
dfoff = pd.read_csv('ccf_offline_stage1_train.csv')
dfon = pd.read_csv('ccf_online_stage1_train.csv')
dftest = pd.read_csv('ccf_offline_stage1_test_revised.csv')

pd.set_option('display.max_columns',None)

观察一下用户线下消费和优惠券使用情况:

dfoff.head()

可以看到User_id表示用户的ID,Merchant_id表示商户的ID,Coupon_id表示优惠券ID,Discount_rate表示折扣率,Distance表示用户经常活动的地点离该商户的最近门店距离,Date_received表示用户领取优惠券的时间,Date表示用户消费的日期。


简单统计一下用户使用优惠券的情况:

dfoff = dfoff.fillna('null')
dftest = dftest.fillna('null')
print('有优惠券,购买商品:%d' % dfoff[(dfoff['Date_received']!='null')&(dfoff['Date']!='null')].shape[0])
print('有优惠券,未购商品:%d' % dfoff[(dfoff['Date_received']!='null')&(dfoff['Date']=='null')].shape[0])
print('无优惠券,购买商品:%d' % dfoff[(dfoff['Date_received']=='null')&(dfoff['Date']!='null')].shape[0])
print('无优惠券,未购商品:%d' % dfoff[(dfoff['Date_received']=='null')&(dfoff['Date']=='null')].shape[0])

可以发现在数据集中,通过使用优惠券来购买商品的只有75382条数据(感觉优惠券没什么吸引力...):

查看一下折扣率的类型:

# 打折率Discount_rate
print('Discount_rate类型:\n',dfoff['Discount_rate'].unique())

发现打折率分3种情况:

'null'表示没有打折

[0,1]表示折扣率

x:y表示满x减y

构建函数对折扣率提取特征:

'''
处理方式:
打折类型:getDiscountType()
折扣率:convertRate()
满多少:getDiscountMan()
减多少:getDiscountJian()
'''
def getDiscountType(row):
    if row == 'null':
        return 'null'
    elif ':' in row:
        return 1
    else:
        return 0

def convertRate(row):
    if row == 'null':
        return 1.0
    elif ':' in row:
        rows = row.split(':')
        return 1.0 - float(rows[1])/float(rows[0])
    else:
        return float(row)
    
def getDiscountMan(row):
    if ':' in row:
        rows = row.split(':')
        return int(rows[0])
    else:
        return 0
    
def getDiscountJian(row):
    if ':' in row:
        rows = row.split(':')
        return int(rows[1])
    else:
        return 0

def processData(df):
    # convert discount_rate
    df['discount_type'] = df['Discount_rate'].apply(getDiscountType)
    df['discount_rate'] = df['Discount_rate'].apply(convertRate)
    df['discount_man'] = df['Discount_rate'].apply(getDiscountMan)
    df['discount_jian'] = df['Discount_rate'].apply(getDiscountJian)

    print(df['discount_rate'].unique())
    
    return df

执行函数,效果如下:

dfoff =processData(dfoff)
dftest = processData(dftest)

dfoff.head()

观察Distance数据:

# 距离Distance
print('Distance类型:',dfoff['Distance'].unique())

将空值'null'用-1代替,并将距离由字符串转换为整形:

# 将距离str转为int
dfoff['distance'] = dfoff['Distance'].replace('null',-1).astype(int)
print(dfoff['distance'].unique())
dftest['distance'] = dftest['Distance'].replace('null',-1).astype(int)
print(dftest['distance'].unique())

筛选出领券日期,并排序:

[20160101.0, 20160102.0, 20160103.0, 20160104.0, 20160105.0, 20160106.0, 20160107.0, 20160108.0, 20160109.0, 20160110.0, 20160111.0, 20160112.0, 20160113.0, 20160114.0, 20160115.0, 20160116.0, 20160117.0, 20160118.0, 20160119.0, 20160120.0, 20160121.0, 20160122.0, 20160123.0, 20160124.0, 20160125.0, 20160126.0, 20160127.0, 20160128.0, 20160129.0, 20160130.0, 20160131.0, 20160201.0, 20160202.0, 20160203.0, 20160204.0, 20160205.0, 20160206.0, 20160207.0, 20160208.0, 20160209.0, 20160210.0, 20160211.0, 20160212.0, 20160213.0, 20160214.0, 20160215.0, 20160216.0, 20160217.0, 20160218.0, 20160219.0, 20160220.0, 20160221.0, 20160222.0, 20160223.0, 20160224.0, 20160225.0, 20160226.0, 20160227.0, 20160228.0, 20160229.0, 20160301.0, 20160302.0, 20160303.0, 20160304.0, 20160305.0, 20160306.0, 20160307.0, 20160308.0, 20160309.0, 20160310.0, 20160311.0, 20160312.0, 20160313.0, 20160314.0, 20160315.0, 20160316.0, 20160317.0, 20160318.0, 20160319.0, 20160320.0, 20160321.0, 20160322.0, 20160323.0, 20160324.0, 20160325.0, 20160326.0, 20160327.0, 20160328.0, 20160329.0, 20160330.0, 20160331.0, 20160401.0, 20160402.0, 20160403.0, 20160404.0, 20160405.0, 20160406.0, 20160407.0, 20160408.0, 20160409.0, 20160410.0, 20160411.0, 20160412.0, 20160413.0, 20160414.0, 20160415.0, 20160416.0, 20160417.0, 20160418.0, 20160419.0, 20160420.0, 20160421.0, 20160422.0, 20160423.0, 20160424.0, 20160425.0, 20160426.0, 20160427.0, 20160428.0, 20160429.0, 20160430.0, 20160501.0, 20160502.0, 20160503.0, 20160504.0, 20160505.0, 20160506.0, 20160507.0, 20160508.0, 20160509.0, 20160510.0, 20160511.0, 20160512.0, 20160513.0, 20160514.0, 20160515.0, 20160516.0, 20160517.0, 20160518.0, 20160519.0, 20160520.0, 20160521.0, 20160522.0, 20160523.0, 20160524.0, 20160525.0, 20160526.0, 20160527.0, 20160528.0, 20160529.0, 20160530.0, 20160531.0, 20160601.0, 20160602.0, 20160603.0, 20160604.0, 20160605.0, 20160606.0, 20160607.0, 20160608.0, 20160609.0, 20160610.0, 20160611.0, 20160612.0, 20160613.0, 20160614.0, 20160615.0]

筛选出消费日期,并排序:

date_buy = dfoff['Date'].unique()
date_buy = sorted(date_buy[date_buy != 'null'])

领取优惠券与消费的时间范围:

print('优惠券收到日期从',date_received[0],'到',date_received[-1])
print('消费日期从',date_buy[0],'到',date_buy[-1])

领券日期可以分解出不同的特征,比如周末的时候领取优惠券,可能会印象深刻,有消费欲望,这里用weekday_type来区分,同时用one-hot编码来对日期提取特征(one hot编码是将类别变量转换为机器学习算法易于利用的一种形式的过程):

'''
关于领券日期的特征:
weekday : {null,1,2,3,4,5,6,7}
weekday_type : {1,0}{周六和周日为1,其他为0}
Weekday_1 : {1,0,0,0,0,0,0}
Weekday_2 : {0,1,0,0,0,0,0}
Weekday_3 : {0,0,1,0,0,0,0}
Weekday_4 : {0,0,0,1,0,0,0}
Weekday_5 : {0,0,0,0,1,0,0}
Weekday_6 : {0,0,0,0,0,1,0}
Weekday_7 : {0,0,0,0,0,0,1}
'''

构造可以识别某日是星期几的函数,并执行:

def getWeekday(row):
    if row == 'null':
        return row
    else:
        # row[0:4]表示年份,row[4:6]表示月份,row[6:8]表示日期
        return date(int(row[0:4]),int(row[4:6]),int(row[6:8])).weekday()+1 

dfoff['weekday'] = dfoff['Date_received'].astype(str).apply(getWeekday)
dftest['weekday'] = dftest['Date_received'].astype(str).apply(getWeekday)

构造新的列weekday_type,区分是不是周末:

dfoff['weekday'] = dfoff['Date_received'].astype(str).apply(getWeekday)
dftest['weekday'] = dftest['Date_received'].astype(str).apply(getWeekday)

对dfoff和dftest进行one-hot编码:

# change weekday to one-hot encoding
weekdaycols = ['weekday_'+str(i) for i in range(1,8)]
# print(weekdaycols)

tmpdf = pd.get_dummies(dfoff['weekday'].replace('null',np.nan))
tmpdf.columns = weekdaycols
dfoff[weekdaycols] = tmpdf

tmpdf = pd.get_dummies(dftest['weekday'].replace('null',np.nan))
tmpdf.columns = weekdaycols
dftest[weekdaycols] = tmpdf

到此为止,所有特征已经构造完毕,如下所示:

'''
所有特征:
discount_rate
discount_type
discount_man
discount_jian
distance
weekday
weekday_type
weekday_1
weekday_2
weekday_3
weekday_4
weekday_5
weekday_6
weekday_7
'''

因为训练集并没有告诉我们用户到底有没有使用优惠券进行消费,所以需要我们根据领券时间与消费时间去判断,并打上标签:

#标签标注:
#Date_received=='null':表示没有领到优惠券,无需考虑,y=-1
#(Date_received!='null')&(Date-Date_received<=15):表示领取优惠券且在15天内使用,即正样本,y=1
#(Date_received!='null')&(Date=='null')|(Date-Date_received>15):表示领取优惠券未在15天内使用,即负样本,y=-1
def label(row):
    if row['Date_received']=='null':
        return -1
    if row['Date']!='null':
        td = pd.to_datetime(row['Date'],format='%Y%m%d')-pd.to_datetime(row['Date_received'],format='%Y%m%d')
        if td<=pd.Timedelta(15,'D'):
            return 1
    return 0
    
dfoff['label'] = dfoff.apply(label,axis=1)

print(dfoff['label'].value_counts())

dfoff.head()

下面进入建模阶段,首先划分训练集与测试集,我们将训练集时间范围设为20160101-20160515 ,验证集时间范围设为20160516-20160615:

# 划分训练集/验证集
df = dfoff[dfoff['label']!=-1].copy()
df['Date_received'] = pd.to_datetime(df['Date_received'],format='%Y%m%d')

train = df[(df['Date_received']<'20160516')].copy()
valid = df[(df['Date_received']>='20160516')&(df['Date_received']<='20160615')].copy()
print('Train Set:\n',train['label'].value_counts())
print('Valid Set:\n',valid['label'].value_counts())

# 特征数量
original_feature = ['discount_rate','discount_type','discount_man','discount_jian',\
                    'distance','weekday','weekday_type']+weekdaycols 
print('共有特征:',len(original_feature),'个')
print(original_feature)

接下来建立模型,我们将使用线性模型SGDClassifier,它是sklearn自带的一个库,具体参数可以在sklearn的官方文档中查询,之后使用管道机制来加快模型的运算,然后使用网格搜索来寻找最优参数:

# 建立模型
def check_model(data,predictors):
    classifier = lambda:SGDClassifier(
        loss='log' ,# loss function:logistic regression
        penalty='elasticnet', # L1&L2
        fit_intercept=True, #是否存在截距,默认存在
        max_iter=100,
        shuffle=True, # Whether or not the training data should be shuffled after each epoch
        n_jobs=1, # The number of processors to use
        class_weight=None # Weights associated with classes.If not given ,all classes are supposed to have weight one.
        )
    # 管道机制使得参数集在新数据集(比如测试集)上的重复使用,管道机制实现了对全部步骤的流式化封装和管理
    model = Pipeline(steps=[
        ('ss',StandardScaler()), # transformer
        ('en',classifier()) # estimator
        ])

    parameters = {
        'en__alpha':[0.001,0.01,0.1],  #注意en后面是两个下划线__
        'en__l1_ratio':[0.001,0.01,0.1] #注意en后面是两个下划线__,l1后面是一个下划线_
        }

    # StratifiedKFold用法类似Kfold,但是他是分层采样,确保训练集,测试集中各类别样本的比例与原始数据集中相同
    folder = StratifiedKFold(n_splits=3,shuffle=True)
    
    # Exhaustive search over specified parameter values for an estimator
    grid_search = GridSearchCV(
        model,
        parameters,
        cv=folder,
        n_jobs=-1, # -1 means using all processors
        verbose=1
        )
    grid_search = grid_search.fit(data[predictors],data['label'])
    
    return grid_search

进行模型训练:

# 训练
predictors = original_feature
model = check_model(train,predictors)

进行模型验证,并计算AUC值(所有优惠券的AUC的平均值):

# 对验证集中每个优惠券预测的结果计算AUC,再对所有优惠券的AUC求平均。计算AUC的时候,如果label只有一类,就直接跳过,因为AUC无法计算
y_valid_pred = model.predict_proba(valid[predictors])
valid1 = valid.copy()
valid1['pred_prob'] = y_valid_pred[:,1]
valid1.head()

# 计算AUC
vg = valid1.groupby(['Coupon_id'])
aucs = []
for i in vg:
    tmpdf = i[1]
    if len(tmpdf['label'].unique())!=2:
        continue
    fpr,tpr,thresholds = roc_curve(tmpdf['label'],tmpdf['pred_prob'],pos_label=1)
    aucs.append(auc(fpr,tpr))
print(np.average(aucs))

最终验证集的AUC值是0.5309425926481126,之后对测试集进行预测,将结果上传到天池官网就可以查看成绩啦!

想要进一步提高成绩和排名,可以去数据集里提取更多的特征,多去论坛看看大神们的思路。就分享到这了,我会继续更新数据分析领域的文章,欢迎关注。

本文暂时没有评论,来添加一个吧(●'◡'●)

欢迎 发表评论:

最近发表
标签列表