0


国科大数据挖掘课程HW1

HW1

Submission requirements:

Please submit your solutions to our class website.


Q1.Suppose that a data warehouse consists of four dimensions, date, spectator, location, and game, and two measures, count and charge, where charge is the fare that a spectator pays when watching a game on a given date. Spectators may be students, adults, or seniors, with each category having its own charge rate.

(a) Draw a star schema diagram for the data warehouse.

在这里插入图片描述

(b) Starting with the base cuboid [**date,**spectator, location, game],what specific OLAP operations should one perform in order to list the total charge paid by student spectators in Los Angeles?

step 1. Roll-up on date from date_key to all
step 2. Roll-up on spectator from spectator_key to status
step 3. Roll-up on location from location_key to location_name
step 4. Roll-up on game from game_key to all

step 5. Dice with"status=student"and"location_name=Los Angeles"

© Bitmap indexing is a very useful optimization technique. Please present the pros and cons of using bitmap indexing in this given data warehouse.

优点

位图索引是一种高效的索引结构,在查询、过滤等方面上,由于进行的是位运算,所以比常规的查询方式快很多。例如在本仓库中,假设对于

spectator

表的子列

status

,我们有:
spectator_keystatusgender0学生男1成人女2学生男3学生女4老人女

status

就可以建立以下位图索引:

status="学生":10110
status="成人":01000
status="老人":00001
gender

可以建立以下位图索引:

gender="男":10100
gender="女":01011

例如,我们想要查询学生,只需要用

10110

去过滤原始数据就行。

我们想混合查询,比如同时查询

status="学生"

gender="男"

的数据,只需要进行并操作就行了:

10110&10100=10100

可以大大提高计算速度。

此外,位图索引可以在一定程度上

绕开

原始数据,进一步提高处理速度。例如,我们想统计满足上面条件的人数,只需要:

ans=0
x=(10110&10100)while x:
    x&=(x-1)
    ans+=1

缺点

位图索引比较适合枚举类型,也就是离散型变量,对于连续变量,位图索引并不适用,往往需要先做离散化。比如本仓库中,

phone number

字段可能就不太适合(也许这个字段没有存在的必要?)

而当属性列非常多时,我们做位图索引的开销也比较大。


Q2.某电子邮件数据库中存储了大量的电子邮件。请设计数据仓库的结构,以便用户从多个维度进行查询和挖掘。

在这里插入图片描述


Q3. Suppose a hospital tested the age and body fat data for 18 random selected adults with the following result:
age232327273941474950525454565758586061**%fat**9.526.57.817.831.425.927.427.231.234.642.528.833.430.234.132.941.235.7
(a) Calculate the mean, median, and standard deviation of age and %fat.

             age       %fat
mean   46.44444428.783333
std    13.2186249.254395
median      51.030.7

(b) Draw the boxplots for age and %fat.

在这里插入图片描述

© Draw a scatter plot based on these two variables.

在这里插入图片描述

(d) Normalize age based on min-max normalization.

x=data["age"]
y=data['%fat']
X=(x-x.min())/(x.max()-x.min())
Y=(y-y.min())/(y.max()-y.min())print(X,Y)

Result is:

00.00000010.00000020.10526330.10526340.42105350.47368460.63157970.68421180.71052690.763158100.815789110.815789120.868421130.894737140.921053150.921053160.973684171.000000

(e) Calculate the correlation coefficient (Pearson’s product moment coefficient). Are these two variables positively or negatively correlated?

print(np.corrcoef(x,y))print("相关系数",stats.pearsonr(x,y)[0])

Result is

[[1.0.8176188][0.81761881.]]
相关系数 0.8176187964565874

I think they are positively correlated.

(f) Smooth the fat data by bin means, using a bin depth of 6.

defmean(x):returnround(sum(x)/len(x),2)

N_y=sorted(y)
bins=[[]]for j in N_y:
    bins[-1].append(j)iflen((v:=bins[-1]))==6:
        v[:]=[mean(v)]*len(v)
        bins.append([])for i,j  inenumerate(bins[:-1]):print("bin %d is :"%(i+1),j)
bin1is:[19.12,19.12,19.12,19.12,19.12,19.12]bin2is:[30.32,30.32,30.32,30.32,30.32,30.32]bin3is:[36.92,36.92,36.92,36.92,36.92,36.92]

(g) Smooth the fat data by bin boundaries, using a bin depth of 6.

这里因为我们是对排好序的数据做处理,所以可以通过二分法进行优化,获取中间分界。

defclose(x,a,b):# 是否靠近下界return(x-a)<=(b-x)defboundary(x):
    Min=x[0]
    Max=x[-1]

    l,r=0,len(x)-1while l<=r:
        mid=(r-l)//2+l
        if close(x[mid],Min,Max):ifnot close(x[mid+1],Min,Max):
                l=mid
                break
            l=mid+1else:if close(x[mid-1],Min,Max):
                l=mid
                break
            r=mid-1return[[Min]*l+[Max]*(len(x)-l)]

N_y=sorted(y)
bins=[[]]for j in N_y:
    bins[-1].append(j)iflen((v:=bins[-1]))==6:
        v[:]=boundary(v)
        bins.append([])for i,j  inenumerate(bins[:-1]):print("bin %d is :"%(i+1),j)
bin1is:[[7.8,7.8,27.2,27.2,27.2,27.2]]bin2is:[[27.4,27.4,32.9,32.9,32.9,32.9]]bin3is:[[33.4,33.4,33.4,33.4,42.5,42.5]]

本文转载自: https://blog.csdn.net/qq_45957458/article/details/127236288
版权归原作者 Torture_L 所有, 如有侵权,请联系我们删除。

“国科大数据挖掘课程HW1”的评论:

还没有评论