1.实验简介:本实验案例涉及数据预处理,数据存储,数据查询分析及可视化 展示等大数据处理的全部操作流程。首先需配置部署在线拍卖数据分 析系统所需要的环境,然后把数据集上传到HDFS 分布式文件系统, 利用 Hive 或 Spark 对在线拍卖数据进行分析处理,并利用 Python 对分析结果进行可视化展示。
2.数据集:本实验使用 FTP 服务器上的 raw.tar.gz 压缩文件中的数据。 raw.tar.gz 中包含 TestSet.csv、TestSubset.csv、TrainingSet.csv、 277 TrainingSubset.csv,解压后选择 TrainingSet.csv 和 TestSet.csv 作 为数据集,TrainingSet 中含有 40 万条的拍卖数据,TestSet 中有 3 万条数据。raw.tar.gz 解压后包含以下 4 个 CSV 文件,我们只使用 其中的 TrainingSet.csv 和 TestSet.csv 文件。
3.实验步骤
(1)数据分析部分
1.购买弹性公网IP,使用华为云EIP进行购买,这里购买三个弹性公网IP以备后续使用
2.配置MRS,根据华为云实验操作手册购买集群,并进行相关配置
3.开通MRS集群,等待十分钟左右运行成功
4.绑定弹性公网IP,选择master节点然后绑定前述中所购买的弹性公网IP,这里选择的是124.70.50.214
5.根据华为云实验操作手册进行安全配置,添加入方向规则,优先级设置为1,协议端口选择全部协议,其余保持默认
正在上传…重新上传取消
6.进入华为云CloudShell,使用如下命令创建相关文件夹
- mkdir /home/zkpk
- mkdir /home/zkpk/raw
正在上传…重新上传取消
7.安装MySQL,使用如下代码安装MySQL并解压
- #安装
- wget https://mirrors.tuna.tsinghua.edu.cn/mysql/downloads/MySQL-8.0/mysql-8.0.27-1.el8.aarch64.rpm-bundle.tar
- #解压
- tar -xvf mysql-8.0.27-1.el8.aarch64.rpm-bundle.tar
正在上传…重新上传取消
解压后显示complete,完成
正在上传…重新上传取消
8.查看mysql运行状态并更改密码
- systemctl start mysqld
- systemctl status mysqld
正在上传…重新上传取消
使用如下命令查看随机默认密码
- grep 'temporary password' /var/log/mysqld.log
正在上传…重新上传取消
登录mysql后运行如下命令,修改密码为MyNewPass4!
创建远程连接用户
- create user 'zkpk'@'%' identified by 'MyNewPass4!';
授予用户权限
- grant all on . to 'zkpk'@'%';
更改加密方式
- ALTER USER 'zkpk'@'%' IDENTIFIED BY 'MyNewPass4!' PASSWORD EXPIRE NEVER;
刷新权限
- flush privileges;
正在上传…重新上传取消
9.使用WinSCP将相关文件上传到前述所创建的raw文件夹中
正在上传…重新上传取消
10.通过如下shell命令去除文件的首行字段
- cd /home/zkpk/raw/
- sed -i '1d' TrainingSet.csv
- sed -i '1d' TestSet.csv
正在上传…重新上传取消
11.使用如下命令将CSV文件上传到hadoop
- hadoop fs -put TrainingSet.csv /zkpk/raw/
- hadoop fs -put TestSet.csv /zkpk/raw/
- hadoop fs -ls /zkpk/raw/
正在上传…重新上传取消
12.进入hive,创建zkpk数据库
- SHOW DATABASES;
- CREATE DATABASE zkpk;
- SHOW DATABASES;
正在上传…重新上传取消
13.对训练集、测试集使用如下代码创建外部表
- CREATE EXTERNAL TABLE trainingset_log(
- EbayID STRING,
- QuantitySold INT,
- Price FLOAT,
- PricePercent FLOAT,
- StartingBidPercent FLOAT,
- SellerName STRING,
- SellerClosePercent DOUBLE,
- Category INT,
- PersonID STRING,
- StartingBid FLOAT,
- AvgPrice FLOAT,
- EndDay STRING,
- HitCount INT,
- AuctionAvgHitCount INT,
- ItemAuctionSellPercent INT,
- SellerSaleAvgPriceRatio DOUBLE,
- SellerAvg DOUBLE,
- SellerItemAvg INT,
- AuctionHitCountAvgRatio INT,
- BestOffer DOUBLE,
- IsHOF INT,
- ItemListedCount INT,
- AuctionCount INT,
- AuctionSaleCount INT,
- SellerAuctionCount INT,
- SellerAuctionSaleCount INT,
- AuctionMedianPrice FLOAT)
- ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
- STORED AS TEXTFILE;
- CREATE EXTERNAL TABLE testset_log(
- EbayID STRING,
- QuantitySold INT,
- Price FLOAT,
- PricePercent FLOAT,
- StartingBidPercent FLOAT,
- SellerName STRING,
- SellerClosePercent DOUBLE,
- Category INT,
- PersonID STRING,
- StartingBid FLOAT,
- AvgPrice FLOAT,
- EndDay STRING,
- HitCount INT,
- AuctionAvgHitCount INT,
- ItemAuctionSellPercent INT,
- SellerSaleAvgPriceRatio DOUBLE,
- SellerAvg DOUBLE,
- SellerItemAvg INT,
- AuctionHitCountAvgRatio INT,
- BestOffer DOUBLE,
- IsHOF INT,
- ItemListedCount INT,
- AuctionCount INT,
- AuctionSaleCount INT,
- SellerAuctionCount INT,
- SellerAuctionSaleCount INT,
- AuctionMedianPrice FLOAT)
- ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
- STORED AS TEXTFILE;
正在上传…重新上传取消
14.导入数据到相应的表
- load data inpath '/zkpk/raw/TrainingSet.csv' into table trainingset_log;
- load data inpath '/zkpk/raw/TestSet.csv' into table testset_log;
正在上传…重新上传取消
正在上传…重新上传取消
15.统计 TrainingSet 中拍买成功交易的平均成交价并保存
- CREATE TABLE avg_price(avg_price FLOAT)
- ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
- Insert OVERWRITE TABLE avg_price
- SELECT avg(Price) FROM trainingset_log WHERE QuantitySold=1;
正在上传…重新上传取消
最终结果为33.03668
16.统计 TrainingSet 中金牌卖家的拍卖成功率,降序排列并保存
- CREATE TABLE success_rate_temp(SellerName STRING,Rate DOUBLE)
- ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
- INSERT OVERWRITE TABLE success_rate_temp
- SELECT SellerName,sum(QuantitySold)/count(QuantitySold)
- FROM trainingset_log WHERE IsHOF=1 GROUP BY SellerName;
- CREATE TABLE success_rate_desc(SellerName STRING,Rate DOUBLE)
- ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
- INSERT OVERWRITE TABLE success_rate_desc
- SELECT * FROM success_rate_temp ORDER BY Rate DESC;
- drop table success_rate_temp;
正在上传…重新上传取消
正在上传…重新上传取消
正在上传…重新上传取消
17.统计TrainingSet和TestSet中周一到周日,每天拍卖成功的数量及拍卖成功率并保存
- CREATE TABLE train_day_rate(EndDay STRING,Success INT,Rate DOUBLE)
- ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
- INSERT OVERWRITE TABLE train_day_rate
- SELECT EndDay,sum(QuantitySold),sum(QuantitySold)/count(QuantitySold)
- FROM trainingset_log GROUP BY EndDay;
- CREATE TABLE test_day_rate(EndDay STRING,Success INT,Rate DOUBLE)
- ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
- INSERT OVERWRITE TABLE test_day_rate
- SELECT EndDay,sum(QuantitySold),sum(QuantitySold)/count(QuantitySold)
- FROM testset_log GROUP BY EndDay;
18.筛选出TrainingSet和TestSet数据中的EbayID,Quantitiysold字段,保存为train_label文件和test_label文件
- CREATE TABLE train_label(EbayID STRING, QuantitySold Int)
- ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
- STORED AS TEXTFILE;
- INSERT OVERWRITE TABLE train_label SELECT
- EbayID, QuantitySold FROM trainingset_log;
- CREATE TABLE test_label(EbayID STRING, QuantitySold Int)
- ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
- STORED AS TEXTFILE;
- INSERT OVERWRITE TABLE test_label SELECT
- EbayID, QuantitySold FROM testset_log;
19.从TrainingSet和TestSet数据中删除的SellerName,QuantiySold,EndDay字段,并将数据导出保存为train_data文件和test_data文件
- CREATE TABLE train_data (
- EbayID STRING,
- Price FLOAT,
- PricePercent FLOAT,
- StartingBidPercent FLOAT,
- SellerClosePercent DOUBLE,
- Category INT,
- PersonID STRING,
- StartingBid FLOAT,
- AvgPrice FLOAT,
- HitCount INT,
- AuctionAvgHitCount INT,
- ItemAuctionSellPercent INT,
- SellerSaleAvgPriceRatio DOUBLE,
- SellerAvg DOUBLE,
- SellerItemAvg INT,
- AuctionHitCountAvgRatio INT,
- BestOffer DOUBLE,
- IsHOF INT,
- ItemListedCount INT,
- AuctionCount INT,
- AuctionSaleCount INT,
- SellerAuctionCount INT,
- SellerAuctionSaleCount INT,
- AuctionMedianPrice FLOAT)
- ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
- STORED AS TEXTFILE;
- INSERT OVERWRITE TABLE train_data SELECT
- EbayID,Price,PricePercent,StartingBidPercent,SellerClosePercent,Category,PersonID,StartingBid,
- AvgPrice,HitCount,AuctionAvgHitCount,ItemAuctionSellPercent,SellerSaleAvgPriceRatio,SellerAvg,
- SellerItemAvg,AuctionHitCountAvgRatio,BestOffer,IsHOF,ItemListedCount,AuctionCount,
- AuctionSaleCount,SellerAuctionCount,SellerAuctionSaleCount,AuctionMedianPrice
- FROM trainingset_log;
- CREATE TABLE test_data (
- EbayID STRING,
- Price FLOAT,
- PricePercent FLOAT,
- StartingBidPercent FLOAT,
- SellerClosePercent DOUBLE,
- Category INT,
- PersonID STRING,
- StartingBid FLOAT,
- AvgPrice FLOAT,
- HitCount INT,
- AuctionAvgHitCount INT,
- ItemAuctionSellPercent INT,
- SellerSaleAvgPriceRatio DOUBLE,
- SellerAvg DOUBLE,
- SellerItemAvg INT,
- AuctionHitCountAvgRatio INT,
- BestOffer DOUBLE,
- IsHOF INT,
- ItemListedCount INT,
- AuctionCount INT,
- AuctionSaleCount INT,
- SellerAuctionCount INT,
- SellerAuctionSaleCount INT,
- AuctionMedianPrice FLOAT)
- ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
- STORED AS TEXTFILE;
- INSERT OVERWRITE TABLE test_data SELECT
- EbayID,Price,PricePercent,StartingBidPercent,SellerClosePercent,Category,PersonID,StartingBid,
- AvgPrice,HitCount,AuctionAvgHitCount,ItemAuctionSellPercent,SellerSaleAvgPriceRatio,SellerAvg,
- SellerItemAvg,AuctionHitCountAvgRatio,BestOffer,IsHOF,ItemListedCount,AuctionCount,
- AuctionSaleCount,SellerAuctionCount,SellerAuctionSaleCount,AuctionMedianPrice
- FROM testset_log;
20.查看所有结果表
21.从HDFS导出数据到本地文件系统
首先查看表格在HDFS上的存储位置,为hdfs://hacluster/user/hive/warehouse/zkpk.db/
然后用hadoop fs -put命令导出到本地文件系统,即创建的/home/zkpk/result目录下
22.导入数据到MySQL数据库中
首先在MySQL中创建对应的表,如下分别导出 TrainingSet 和 TestSet 中周一到周日,每天拍卖成功的数量及拍卖成功率的统计结果,需要创建表train_day_rate,test_day_rate
- CREATE TABLE train_day_rate(EndDay varchar(60),Success int,Rate double);
- CREATE TABLE test_day_rate(EndDay varchar(60),Success int,Rate double);
正在上传…重新上传取消
然后使用如下命令,导入到mysql数据库成功
- LOAD DATA LOCAL INFILE '/home/zkpk/result/train_day_rate/000000_0' INTO TABLE zkpk.train_day_rate FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';
- LOAD DATA LOCAL INFILE '/home/zkpk/result/test_day_rate/000000_0' INTO TABLE zkpk.test_day_rate FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';
23.在mysql中查看结果
24.对其余表格进行同样操作,导入到MySQL数据库中,最终结果如下
(2)数据可视化部分
首先配置ModelArts环境,选择tensorflow2.1-cuda10.1-cudnn7-ubuntu18.04,CPU选择8核32GB
然后进入notebook开发环境,导入文件,使用pip命令安装相关包
最后编写代码,利用python 中 pandas 库的 dataframe 加载数据,再利用 matplotlib 绘制图形,代码如下
- import numpy as np
- import matplotlib
- import pandas as pd
- import matplotlib.pyplot as plt
- matplotlib.rcParams['font.sans-serif'] = ['SimHei']
- matplotlib.rcParams['axes.unicode_minus'] = False
- df = pd.read_csv('../train_day_rate.csv', sep=',')
- print(df)
- plt.ylabel("拍卖成功数量")
- plt.xlabel("时间")
- p1 = plt.bar(df['endday'], df['success'], color='red')
- plt.bar_label(p1, labels=df['success'])
- plt.title('train_data-success', fontproperties='SimHei', fontsize=15)
- plt.savefig("train_data-success.png")
- plt.show()
- plt.ylabel("拍卖成功率")
- plt.xlabel("时间")
- p1 = plt.bar(df['endday'], df['rate'], color='blue')
- plt.title('train_data-rate', fontproperties='SimHei', fontsize=15)
- plt.savefig("train_data-rate.png")
- plt.show()
- df = pd.read_csv('../test_day_rate.csv', sep=',')
- print(df)
综上所述,trainingset中Thursday拍卖成功率最高,Testset中Tuesday拍卖成功率最高
(3)数据预测部分
版权归原作者 夕远3602 所有, 如有侵权,请联系我们删除。