0


大数据课设-2020年美国新冠肺炎疫情数据分析

一.大数据环境的安装与配置

1.安装Hadoop集群

1).配置ssh及免密登录

2).三台主机互相通信

3).安装java

4).安装Hadoop集群

2.在Hadoop集群中并安装非关系型数据库系统集群Hbase

1). 下载并解压

2). 修改环境变量

3). 启动并测试

3.在Hadoop安装Zookeepeer,要求3.1以上版本,每个数据节点都要配置Zookeepper

1). 下载并解压

2). 修改配置文件zoo.cfg

3). 创建data、logs文件夹

4). 配置zookeeper分布式

5). 启动并测试

4.在集群中安装配置Hive

**1). **下载并解压

2). 配置环境变量

3). 修改/usr/local/hive/conf下的hive-site.xml

5.在集群中配置sqoop

1). 下载并解压sqoop1.4.6

2). 修改配置文件sqoop-env.sh

3). 配置环境变量

4). 将mysql驱动包拷贝到$SQOOP_HOME/lib

5). 测试与MySQL的连接

6.安装MySQL

1). Ubuntu下mysql的安装

2). 下载mysql jdbc包

3). 启动并登陆mysql shell

二.数据库的创建及数据导入

1.启动Hadoop和Hive

2.根据要求在MySQL、Hbase、Hive中创建所需要的数据表

1). Mysql中创建表

CREATE TABLE us_counties(

date DATE,

county VARCHAR(100),

state VARCHAR(100),

cases INT,

deaths INT

);

2). Hbase创建表

创建表create 'us_counties', {NAME => 'f1', VERSIONS => 5}

3). Hive创建表

CREATE EXTERNAL TABLE dblab.us_counties (dates STRING,county STRING,state STRING,cases INT,deaths INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE LOCATION '/bigdatacase/dataset/';

3.实现数据的导入

1). 上传到HDFS

2). 数据导入到mysql

LOAD DATA INFILE '/var/lib/mysql-files/us-counties.txt' INTO TABLE us_counties;

3). 数据导入到HIVE

CREATE EXTERNAL TABLE dblab.us_counties (dates STRING,county STRING,state STRING,cases INT,deaths INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE LOCATION '/bigdatacase/dataset/';

4). 使用sqoop将数据从Mysql导入到Hbase

cd /usr/local/sqoop

./bin/sqoop import --connect jdbc:mysql://localhost:3306/dblab --username username --password pwd--table us_counties --hbase-table us_counties --column-family f1 --hbase-row-key date --hbase-create-table -m 1

4.在Hive中对数据进行分析

1). 美国每日的累计确诊病例数和死亡数

SELECT dates, SUM(cases) AS cases_sum, SUM(deaths) AS deaths_sum FROM us_counties GROUP BY dates ORDER BY cases_sum ASC;

2). 美国每日的新增确诊病例数

3). 美国每日的新增死亡病例数

4). 截止5.19,美国各州累计确诊、死亡人数和病死率

5). 截止5.19,美国累计确诊人数前10的州

6). 截止5.19,美国累计死亡人数前10的州

7). 截止5.19,美国累计确诊人数最少的10个州

8). 截止5.19,美国累计死亡人数最少的10个州

9). 截止5.19,美国的病死率

三.数据可视化

1. 美国每日的累计确诊病例数和死亡数——>双柱状图

2. 美国每日的新增确诊病例数——>折线图

3. 美国每日的新增死亡病例数——>折线图

4. 截止5.19,美国各州累计确诊、死亡人数和病死率--->表格

5. 截止5.19,美国累计确诊人数前10的州--->词云图

6. 截止5.19,美国累计死亡人数前10的州--->象柱状图

7. 截止5.19,美国累计确诊人数最少的10个州--->词云图

8. 截止5.19,美国累计死亡人数最少的10个州--->漏斗图

9. 截止5.19,美国的病死率--->饼状图

四.附录

Pyecharts 可视化部分代码

import ...
from pyecharts import options as opts
from pyecharts.charts import Bar
from pyecharts.charts import Line
from pyecharts.components import Table
from pyecharts.charts import WordCloud
from pyecharts.charts import Pie
from pyecharts.charts import Funnel
from pyecharts.charts import Scatter
from pyecharts.charts import PictorialBar
from pyecharts.options import ComponentTitleOpts
from pyecharts.globals import SymbolType
from jinja2.utils import escape
import json

db = pymysql.connect(host='localhost',
                    user='user',
                    password='password',
                    db='dbname',
                    charset='utf8mb4')
try:
    #获取会话指针
    with db.cursor() as cursor:
        #查询语句
        sql = "select `date`,`county`,`state`,`cases`,`deaths` from `us_counties`  "

        cursor.execute(sql)
        result = cursor.fetchall()

        df = pd.DataFrame(list(result),columns=['date','county','state','cases','deaths'])
        # print(df['date'])
        # print(df)
        date =list(df['date'])

        dates = [datetime.strptime(date_str, '%Y/%m/%d').date() for date_str in date]

        county =list(df['county'])

        state =list(df['state'])
        cases =list(df['cases'])
        totalCases = df['cases'].count()
        deaths =list(df['deaths'])
        sql2 = "select date,state,sum(cases) as totalCases,sum(deaths) as totalDeaths,round(sum(deaths)/sum(cases),4) as deathRate from us_counties  where date = '2020/5/19' group by date,state limit 5"
        cursor.execute(sql2)
        result2 = cursor.fetchall()
        print(result2)
        df = pd.DataFrame(list(result2), columns=['date', 'state', 'totalCases', 'totalDeaths', 'deathRate'])
        state = df['state']
        # print(state)
        # print(list(state))
        # print(state)
        totalCases = df['totalCases']
        totalDeaths = df['totalDeaths']
        deathRate = df['deathRate']
        # print(deathRate)

        
finally:
    db.close()
def test1():
    #美国每日累计确诊和死亡人数
    d = (
        Bar()
            .add_xaxis(formatted_dates)
            .add_yaxis("累计确诊人数", cases, stack="stack1")
            .add_yaxis("累计死亡人数", deaths, stack="stack1")
            .set_series_opts(label_opts=opts.LabelOpts(is_show=False))
            .set_global_opts(title_opts=opts.TitleOpts(title="美国每日累计确诊和死亡人数"))
    )
    d.render('test1.html')

def test2():
    #美国每日确诊死亡折线图
    L1 = (
        Line(init_opts=opts.InitOpts(width="1600px", height="800px"))
            .add_xaxis(xaxis_data=formatted_dates)
            .add_yaxis(
            series_name="新增确诊",
            y_axis=cases,
            markpoint_opts=opts.MarkPointOpts(
                data=[
                    opts.MarkPointItem(type_="max", name="最大值")

                ]
            ),
            markline_opts=opts.MarkLineOpts(
                data=[opts.MarkLineItem(type_="average", name="平均值")]
            ),
        )
            .set_global_opts(
            title_opts=opts.TitleOpts(title="美国每日新增确诊折线图", subtitle=""),
            tooltip_opts=opts.TooltipOpts(trigger="axis"),
            toolbox_opts=opts.ToolboxOpts(is_show=True),
            xaxis_opts=opts.AxisOpts(type_="category", boundary_gap=False),
        )
    )
    L1.render('test2.html')

def test3():
    #美国每日新增死亡折线图
    L1 = (
        Line(init_opts=opts.InitOpts(width="1600px", height="800px"))
            .add_xaxis(xaxis_data=formatted_dates)
            .add_yaxis(
            series_name="新增死亡",
            y_axis=cases,
            markpoint_opts=opts.MarkPointOpts(
                data=[
                    opts.MarkPointItem(type_="max", name="最大值")

                ]
            ),
            markline_opts=opts.MarkLineOpts(
                data=[opts.MarkLineItem(type_="average", name="平均值")]
            ),
        )
            .set_global_opts(
            title_opts=opts.TitleOpts(title="美国每日新增死亡折线图", subtitle=""),
            tooltip_opts=opts.TooltipOpts(trigger="axis"),
            toolbox_opts=opts.ToolboxOpts(is_show=True),
            xaxis_opts=opts.AxisOpts(type_="category", boundary_gap=False),
        )
    )
    L1.render('test3.html')

def test4():

    #截止5.19,美国各州累计确诊、死亡人数和病死率--->表格
    allState=[]
    row = []
    row.append(str(state))
    row.append(totalCases)
    row.append(totalDeaths)
    row.append(deathRate)
    allState.append(row)

    table = Table()

    headers = ["State name", "Total cases", "Total deaths", "Death rate"]
    rows = allState
    table.add(headers, rows)
    table.set_global_opts(
        title_opts=ComponentTitleOpts(title="美国各州疫情一览", subtitle="")
    )

    # table.render('test4.html')

def test5():
    d = (
        Bar()
            .add_xaxis(formatted_dates)
            .add_yaxis("累计确诊人数", cases, stack="stack1")
            .add_yaxis("累计死亡人数", deaths, stack="stack1")
            .set_series_opts(label_opts=opts.LabelOpts(is_show=False))
            .set_global_opts(title_opts=opts.TitleOpts(title="美国每日累计确诊和死亡人数"))
    )
    d.render('test1.html')

def test6():
    d = (
        Bar()
            .add_xaxis(formatted_dates)
            .add_yaxis("累计确诊人数", cases, stack="stack1")
            .add_yaxis("累计死亡人数", deaths, stack="stack1")
            .set_series_opts(label_opts=opts.LabelOpts(is_show=False))
            .set_global_opts(title_opts=opts.TitleOpts(title="美国每日累计确诊和死亡人数"))
    )
    d.render('test1.html')

def test7():
    d = (
        Bar()
            .add_xaxis(formatted_dates)
            .add_yaxis("累计确诊人数", cases, stack="stack1")
            .add_yaxis("累计死亡人数", deaths, stack="stack1")
            .set_series_opts(label_opts=opts.LabelOpts(is_show=False))
            .set_global_opts(title_opts=opts.TitleOpts(title="美国每日累计确诊和死亡人数"))
    )
    d.render('test1.html')

def test8():
    d = (
        Bar()
            .add_xaxis(formatted_dates)
            .add_yaxis("累计确诊人数", cases, stack="stack1")
            .add_yaxis("累计死亡人数", deaths, stack="stack1")
            .set_series_opts(label_opts=opts.LabelOpts(is_show=False))
            .set_global_opts(title_opts=opts.TitleOpts(title="美国每日累计确诊和死亡人数"))
    )
    d.render('test1.html')

def test9():
    d = (
        Bar()
            .add_xaxis(formatted_dates)
            .add_yaxis("累计确诊人数", cases, stack="stack1")
            .add_yaxis("累计死亡人数", deaths, stack="stack1")
            .set_series_opts(label_opts=opts.LabelOpts(is_show=False))
            .set_global_opts(title_opts=opts.TitleOpts(title="美国每日累计确诊和死亡人数"))
    )
    d.render('test1.html')

def line_chart():
    L2 = (
        Line(init_opts=opts.InitOpts(width="1600px", height="800px"))
            .add_xaxis(xaxis_data=date)
            .add_yaxis(
            series_name="新增死亡",
            y_axis=deaths,
            markpoint_opts=opts.MarkPointOpts(
                data=[opts.MarkPointItem(type_="max", name="最大值")]
            ),
            markline_opts=opts.MarkLineOpts(
                data=[
                    opts.MarkLineItem(type_="average", name="平均值"),
                    opts.MarkLineItem(symbol="none", x="90%", y="max"),
                    opts.MarkLineItem(symbol="circle", type_="max", name="最高点"),
                ]
            ),
        )
            .set_global_opts(
            title_opts=opts.TitleOpts(title="美国每日新增死亡折线图", subtitle=""),
            tooltip_opts=opts.TooltipOpts(trigger="axis"),
            toolbox_opts=opts.ToolboxOpts(is_show=True),
            xaxis_opts=opts.AxisOpts(type_="category", boundary_gap=False),
        )
    )
    L2.render('line_chart.html')

def line_chart1():
    L2 = (
        Line(init_opts=opts.InitOpts(width="1600px", height="800px"))
            .add_xaxis(xaxis_data=date)
            .add_yaxis(
            series_name="新增确诊",
            y_axis=cases,
            markpoint_opts=opts.MarkPointOpts(
                data=[opts.MarkPointItem(type_="max", name="最大值")]
            ),
            markline_opts=opts.MarkLineOpts(
                data=[
                    opts.MarkLineItem(type_="average", name="平均值"),
                    opts.MarkLineItem(symbol="none", x="90%", y="max"),
                    opts.MarkLineItem(symbol="circle", type_="max", name="最高点"),
                ]
            ),
        )
            .set_global_opts(
            title_opts=opts.TitleOpts(title="美国每日新增确诊折线图", subtitle=""),
            tooltip_opts=opts.TooltipOpts(trigger="axis"),
            toolbox_opts=opts.ToolboxOpts(is_show=True),
            xaxis_opts=opts.AxisOpts(type_="category", boundary_gap=False),
        )
    )
    L2.render('line_chart1.html')

def wcy():
    d = (
        Bar()
            .add_xaxis(formatted_dates)
            .add_yaxis("累计确诊人数", cases, stack="stack1")
            .add_yaxis("累计死亡人数", deaths, stack="stack1")
            .set_series_opts(label_opts=opts.LabelOpts(is_show=False,formatter="{b}"))
            .set_global_opts(title_opts=opts.TitleOpts(title="美国每日累计确诊和死亡人数"))
            .render("result1.html")
    )

def wcy2():
    L1 = (
        Line(init_opts=opts.InitOpts(width="1400px", height="800px"))
            .add_xaxis(xaxis_data=formatted_dates)
            .add_yaxis(
            series_name="新增确诊",
            y_axis=cases,
            markpoint_opts=opts.MarkPointOpts(
                data=[
                    opts.MarkPointItem(type_="max", name="最大值")

                ]
            ),
            markline_opts=opts.MarkLineOpts(
                data=[opts.MarkLineItem(type_="average", name="平均值")]
            ),
        )
            .set_global_opts(
            title_opts=opts.TitleOpts(title="美国每日新增确诊折线图", subtitle=""),
            tooltip_opts=opts.TooltipOpts(trigger="axis"),
            toolbox_opts=opts.ToolboxOpts(is_show=True),
            xaxis_opts=opts.AxisOpts(type_="category", boundary_gap=False),
        )
    )
    L1.render("result2.html")

def main():

    test1()
    test2()
    test3()
    test4()

if __name__ == '__main__':
    main()

Where there is a will,there is a way.


本文转载自: https://blog.csdn.net/LvHLong/article/details/134992723
版权归原作者 blalon 所有, 如有侵权,请联系我们删除。

“大数据课设-2020年美国新冠肺炎疫情数据分析”的评论:

还没有评论