0


华为云MapReduce、ModelArts实现大数据综合案例-在线拍卖数据分析

1.实验简介:本实验案例涉及数据预处理,数据存储,数据查询分析及可视化 展示等大数据处理的全部操作流程。首先需配置部署在线拍卖数据分 析系统所需要的环境,然后把数据集上传到HDFS 分布式文件系统, 利用 Hive 或 Spark 对在线拍卖数据进行分析处理,并利用 Python 对分析结果进行可视化展示。

2.数据集:本实验使用 FTP 服务器上的 raw.tar.gz 压缩文件中的数据。 raw.tar.gz 中包含 TestSet.csv、TestSubset.csv、TrainingSet.csv、 277 TrainingSubset.csv,解压后选择 TrainingSet.csv 和 TestSet.csv 作 为数据集,TrainingSet 中含有 40 万条的拍卖数据,TestSet 中有 3 万条数据。raw.tar.gz 解压后包含以下 4 个 CSV 文件,我们只使用 其中的 TrainingSet.csv 和 TestSet.csv 文件。

3.实验步骤

(1)数据分析部分

1.购买弹性公网IP,使用华为云EIP进行购买,这里购买三个弹性公网IP以备后续使用

2.配置MRS,根据华为云实验操作手册购买集群,并进行相关配置

3.开通MRS集群,等待十分钟左右运行成功

4.绑定弹性公网IP,选择master节点然后绑定前述中所购买的弹性公网IP,这里选择的是124.70.50.214

5.根据华为云实验操作手册进行安全配置,添加入方向规则,优先级设置为1,协议端口选择全部协议,其余保持默认

正在上传…重新上传取消

6.进入华为云CloudShell,使用如下命令创建相关文件夹

  1. mkdir /home/zkpk
  2. mkdir /home/zkpk/raw

正在上传…重新上传取消

7.安装MySQL,使用如下代码安装MySQL并解压

  1. #安装
  2. wget https://mirrors.tuna.tsinghua.edu.cn/mysql/downloads/MySQL-8.0/mysql-8.0.27-1.el8.aarch64.rpm-bundle.tar
  3. #解压
  4. tar -xvf mysql-8.0.27-1.el8.aarch64.rpm-bundle.tar

正在上传…重新上传取消

解压后显示complete,完成

正在上传…重新上传取消

8.查看mysql运行状态并更改密码

  1. systemctl start mysqld
  2. systemctl status mysqld

正在上传…重新上传取消

使用如下命令查看随机默认密码

  1. grep 'temporary password' /var/log/mysqld.log

正在上传…重新上传取消

登录mysql后运行如下命令,修改密码为MyNewPass4!

  1. 创建远程连接用户

  2. create user 'zkpk'@'%' identified by 'MyNewPass4!';
  3. 授予用户权限

  4. grant all on . to 'zkpk'@'%';
  5. 更改加密方式

  6. ALTER USER 'zkpk'@'%' IDENTIFIED BY 'MyNewPass4!' PASSWORD EXPIRE NEVER;
  7. 刷新权限

  8. flush privileges;

正在上传…重新上传取消

9.使用WinSCP将相关文件上传到前述所创建的raw文件夹中

正在上传…重新上传取消

10.通过如下shell命令去除文件的首行字段

  1. cd /home/zkpk/raw/
  2. sed -i '1d' TrainingSet.csv
  3. sed -i '1d' TestSet.csv

正在上传…重新上传取消

11.使用如下命令将CSV文件上传到hadoop

  1. hadoop fs -put TrainingSet.csv /zkpk/raw/
  2. hadoop fs -put TestSet.csv /zkpk/raw/
  3. hadoop fs -ls /zkpk/raw/

正在上传…重新上传取消

12.进入hive,创建zkpk数据库

  1. SHOW DATABASES;
  2. CREATE DATABASE zkpk;
  3. SHOW DATABASES;

正在上传…重新上传取消

13.对训练集、测试集使用如下代码创建外部表

  1. CREATE EXTERNAL TABLE trainingset_log(
  2. EbayID STRING,
  3. QuantitySold INT,
  4. Price FLOAT,
  5. PricePercent FLOAT,
  6. StartingBidPercent FLOAT,
  7. SellerName STRING,
  8. SellerClosePercent DOUBLE,
  9. Category INT,
  10. PersonID STRING,
  11. StartingBid FLOAT,
  12. AvgPrice FLOAT,
  13. EndDay STRING,
  14. HitCount INT,
  15. AuctionAvgHitCount INT,
  16. ItemAuctionSellPercent INT,
  17. SellerSaleAvgPriceRatio DOUBLE,
  18. SellerAvg DOUBLE,
  19. SellerItemAvg INT,
  20. AuctionHitCountAvgRatio INT,
  21. BestOffer DOUBLE,
  22. IsHOF INT,
  23. ItemListedCount INT,
  24. AuctionCount INT,
  25. AuctionSaleCount INT,
  26. SellerAuctionCount INT,
  27. SellerAuctionSaleCount INT,
  28. AuctionMedianPrice FLOAT)
  29. ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  30. STORED AS TEXTFILE;
  31. CREATE EXTERNAL TABLE testset_log(
  32. EbayID STRING,
  33. QuantitySold INT,
  34. Price FLOAT,
  35. PricePercent FLOAT,
  36. StartingBidPercent FLOAT,
  37. SellerName STRING,
  38. SellerClosePercent DOUBLE,
  39. Category INT,
  40. PersonID STRING,
  41. StartingBid FLOAT,
  42. AvgPrice FLOAT,
  43. EndDay STRING,
  44. HitCount INT,
  45. AuctionAvgHitCount INT,
  46. ItemAuctionSellPercent INT,
  47. SellerSaleAvgPriceRatio DOUBLE,
  48. SellerAvg DOUBLE,
  49. SellerItemAvg INT,
  50. AuctionHitCountAvgRatio INT,
  51. BestOffer DOUBLE,
  52. IsHOF INT,
  53. ItemListedCount INT,
  54. AuctionCount INT,
  55. AuctionSaleCount INT,
  56. SellerAuctionCount INT,
  57. SellerAuctionSaleCount INT,
  58. AuctionMedianPrice FLOAT)
  59. ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  60. STORED AS TEXTFILE;

正在上传…重新上传取消

14.导入数据到相应的表

  1. load data inpath '/zkpk/raw/TrainingSet.csv' into table trainingset_log;
  2. load data inpath '/zkpk/raw/TestSet.csv' into table testset_log;

正在上传…重新上传取消

正在上传…重新上传取消

15.统计 TrainingSet 中拍买成功交易的平均成交价并保存

  1. CREATE TABLE avg_price(avg_price FLOAT)
  2. ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
  3. Insert OVERWRITE TABLE avg_price
  4. SELECT avg(Price) FROM trainingset_log WHERE QuantitySold=1;

正在上传…重新上传取消

最终结果为33.03668

16.统计 TrainingSet 中金牌卖家的拍卖成功率,降序排列并保存

  1. CREATE TABLE success_rate_temp(SellerName STRING,Rate DOUBLE)
  2. ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
  3. INSERT OVERWRITE TABLE success_rate_temp
  4. SELECT SellerName,sum(QuantitySold)/count(QuantitySold)
  5. FROM trainingset_log WHERE IsHOF=1 GROUP BY SellerName;
  6. CREATE TABLE success_rate_desc(SellerName STRING,Rate DOUBLE)
  7. ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
  8. INSERT OVERWRITE TABLE success_rate_desc
  9. SELECT * FROM success_rate_temp ORDER BY Rate DESC;
  10. drop table success_rate_temp;

正在上传…重新上传取消

正在上传…重新上传取消

正在上传…重新上传取消

17.统计TrainingSet和TestSet中周一到周日,每天拍卖成功的数量及拍卖成功率并保存

  1. CREATE TABLE train_day_rate(EndDay STRING,Success INT,Rate DOUBLE)
  2. ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
  3. INSERT OVERWRITE TABLE train_day_rate
  4. SELECT EndDay,sum(QuantitySold),sum(QuantitySold)/count(QuantitySold)
  5. FROM trainingset_log GROUP BY EndDay;
  6. CREATE TABLE test_day_rate(EndDay STRING,Success INT,Rate DOUBLE)
  7. ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
  8. INSERT OVERWRITE TABLE test_day_rate
  9. SELECT EndDay,sum(QuantitySold),sum(QuantitySold)/count(QuantitySold)
  10. FROM testset_log GROUP BY EndDay;

18.筛选出TrainingSet和TestSet数据中的EbayID,Quantitiysold字段,保存为train_label文件和test_label文件

  1. CREATE TABLE train_label(EbayID STRING, QuantitySold Int)
  2. ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  3. STORED AS TEXTFILE;
  4. INSERT OVERWRITE TABLE train_label SELECT
  5. EbayID, QuantitySold FROM trainingset_log;
  6. CREATE TABLE test_label(EbayID STRING, QuantitySold Int)
  7. ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  8. STORED AS TEXTFILE;
  9. INSERT OVERWRITE TABLE test_label SELECT
  10. EbayID, QuantitySold FROM testset_log;

19.从TrainingSet和TestSet数据中删除的SellerName,QuantiySold,EndDay字段,并将数据导出保存为train_data文件和test_data文件

  1. CREATE TABLE train_data (
  2. EbayID STRING,
  3. Price FLOAT,
  4. PricePercent FLOAT,
  5. StartingBidPercent FLOAT,
  6. SellerClosePercent DOUBLE,
  7. Category INT,
  8. PersonID STRING,
  9. StartingBid FLOAT,
  10. AvgPrice FLOAT,
  11. HitCount INT,
  12. AuctionAvgHitCount INT,
  13. ItemAuctionSellPercent INT,
  14. SellerSaleAvgPriceRatio DOUBLE,
  15. SellerAvg DOUBLE,
  16. SellerItemAvg INT,
  17. AuctionHitCountAvgRatio INT,
  18. BestOffer DOUBLE,
  19. IsHOF INT,
  20. ItemListedCount INT,
  21. AuctionCount INT,
  22. AuctionSaleCount INT,
  23. SellerAuctionCount INT,
  24. SellerAuctionSaleCount INT,
  25. AuctionMedianPrice FLOAT)
  26. ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  27. STORED AS TEXTFILE;
  28. INSERT OVERWRITE TABLE train_data SELECT
  29. EbayID,Price,PricePercent,StartingBidPercent,SellerClosePercent,Category,PersonID,StartingBid,
  30. AvgPrice,HitCount,AuctionAvgHitCount,ItemAuctionSellPercent,SellerSaleAvgPriceRatio,SellerAvg,
  31. SellerItemAvg,AuctionHitCountAvgRatio,BestOffer,IsHOF,ItemListedCount,AuctionCount,
  32. AuctionSaleCount,SellerAuctionCount,SellerAuctionSaleCount,AuctionMedianPrice
  33. FROM trainingset_log;
  34. CREATE TABLE test_data (
  35. EbayID STRING,
  36. Price FLOAT,
  37. PricePercent FLOAT,
  38. StartingBidPercent FLOAT,
  39. SellerClosePercent DOUBLE,
  40. Category INT,
  41. PersonID STRING,
  42. StartingBid FLOAT,
  43. AvgPrice FLOAT,
  44. HitCount INT,
  45. AuctionAvgHitCount INT,
  46. ItemAuctionSellPercent INT,
  47. SellerSaleAvgPriceRatio DOUBLE,
  48. SellerAvg DOUBLE,
  49. SellerItemAvg INT,
  50. AuctionHitCountAvgRatio INT,
  51. BestOffer DOUBLE,
  52. IsHOF INT,
  53. ItemListedCount INT,
  54. AuctionCount INT,
  55. AuctionSaleCount INT,
  56. SellerAuctionCount INT,
  57. SellerAuctionSaleCount INT,
  58. AuctionMedianPrice FLOAT)
  59. ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  60. STORED AS TEXTFILE;
  61. INSERT OVERWRITE TABLE test_data SELECT
  62. EbayID,Price,PricePercent,StartingBidPercent,SellerClosePercent,Category,PersonID,StartingBid,
  63. AvgPrice,HitCount,AuctionAvgHitCount,ItemAuctionSellPercent,SellerSaleAvgPriceRatio,SellerAvg,
  64. SellerItemAvg,AuctionHitCountAvgRatio,BestOffer,IsHOF,ItemListedCount,AuctionCount,
  65. AuctionSaleCount,SellerAuctionCount,SellerAuctionSaleCount,AuctionMedianPrice
  66. FROM testset_log;

20.查看所有结果表

21.从HDFS导出数据到本地文件系统

首先查看表格在HDFS上的存储位置,为hdfs://hacluster/user/hive/warehouse/zkpk.db/

然后用hadoop fs -put命令导出到本地文件系统,即创建的/home/zkpk/result目录下

22.导入数据到MySQL数据库中

首先在MySQL中创建对应的表,如下分别导出 TrainingSet 和 TestSet 中周一到周日,每天拍卖成功的数量及拍卖成功率的统计结果,需要创建表train_day_rate,test_day_rate

  1. CREATE TABLE train_day_rate(EndDay varchar(60),Success int,Rate double);
  2. CREATE TABLE test_day_rate(EndDay varchar(60),Success int,Rate double);

正在上传…重新上传取消

然后使用如下命令,导入到mysql数据库成功

  1. LOAD DATA LOCAL INFILE '/home/zkpk/result/train_day_rate/000000_0' INTO TABLE zkpk.train_day_rate FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';
  2. LOAD DATA LOCAL INFILE '/home/zkpk/result/test_day_rate/000000_0' INTO TABLE zkpk.test_day_rate FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';

23.在mysql中查看结果

24.对其余表格进行同样操作,导入到MySQL数据库中,最终结果如下

(2)数据可视化部分

首先配置ModelArts环境,选择tensorflow2.1-cuda10.1-cudnn7-ubuntu18.04,CPU选择8核32GB

然后进入notebook开发环境,导入文件,使用pip命令安装相关包

最后编写代码,利用python 中 pandas 库的 dataframe 加载数据,再利用 matplotlib 绘制图形,代码如下

  1. import numpy as np
  2. import matplotlib
  3. import pandas as pd
  4. import matplotlib.pyplot as plt
  5. matplotlib.rcParams['font.sans-serif'] = ['SimHei']
  6. matplotlib.rcParams['axes.unicode_minus'] = False
  7. df = pd.read_csv('../train_day_rate.csv', sep=',')
  8. print(df)
  9. plt.ylabel("拍卖成功数量")
  10. plt.xlabel("时间")
  11. p1 = plt.bar(df['endday'], df['success'], color='red')
  12. plt.bar_label(p1, labels=df['success'])
  13. plt.title('train_data-success', fontproperties='SimHei', fontsize=15)
  14. plt.savefig("train_data-success.png")
  15. plt.show()
  16. plt.ylabel("拍卖成功率")
  17. plt.xlabel("时间")
  18. p1 = plt.bar(df['endday'], df['rate'], color='blue')
  19. plt.title('train_data-rate', fontproperties='SimHei', fontsize=15)
  20. plt.savefig("train_data-rate.png")
  21. plt.show()
  22. df = pd.read_csv('../test_day_rate.csv', sep=',')
  23. print(df)

综上所述,trainingset中Thursday拍卖成功率最高,Testset中Tuesday拍卖成功率最高

(3)数据预测部分

项目代码:完整版:https://bbs.huaweicloud.com/blogs/380728


本文转载自: https://blog.csdn.net/m0_56272552/article/details/127433578
版权归原作者 夕远3602 所有, 如有侵权,请联系我们删除。

“华为云MapReduce、ModelArts实现大数据综合案例-在线拍卖数据分析”的评论:

还没有评论