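Scraping Movie Data with a Web Crawler: Shen Teng's Films as an Example

Data Visualization & Analysis in Practice

1.1 Collecting Data on Shen Teng's Films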

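Preface

Hi everyone ✨, this is bio🦖. Today I'd like to introduce one way of collecting data: the web crawler. A web crawler is an automated program that retrieves information from the internet, fetches web page content, and collects data. It visits pages by following links, extracts information from them, and saves the data or passes it on for further processing and analysis.
A web crawler's workflow typically consists of the following steps: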

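Send the request: the crawler first sends an HTTP request to the target URL to ask for the page content.
Receive the response: the web server receives the request and returns the page content as an HTTP response, which the crawler reads.
Parse the page: the crawler parses the page content to locate the data it needs, usually with an HTML parser, XPath, or a similar technique that works on the page's structure and elements.
Extract the data: from the parsed page, the crawler pulls out the data of interest, such as text, images, and links.
Store the data: the crawler saves the extracted data to a database, a file, or another storage medium for later analysis and use.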
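
The five steps above can be sketched end to end in a few lines. This is only a minimal illustration, not the code used later in this article. To keep it runnable offline, steps 1 and 2 are replaced by a hard-coded HTML string (SAMPLE_HTML, an invented stand-in for a server response), parsing uses the standard-library html.parser instead of BeautifulSoup, and the "storage" is an in-memory CSV. The class name LinkCollector is hypothetical.

```python
import csv
import io
from html.parser import HTMLParser

# Steps 1-2 (send request, receive response) are stubbed out:
# SAMPLE_HTML is an invented stand-in for the body of an HTTP response.
SAMPLE_HTML = (
    '<ul><li><a href="/film/1">Film A</a></li>'
    '<li><a href="/film/2">Film B</a></li></ul>'
)


class LinkCollector(HTMLParser):
    """Steps 3-4: parse the page and extract (text, href) for every link."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None

    def handle_starttag(self, tag, attrs):
        # Remember the href of the <a> tag that was just opened.
        if tag == 'a':
            self._href = dict(attrs).get('href')

    def handle_data(self, data):
        # The first text seen after an <a> tag is treated as its label.
        if self._href is not None:
            self.links.append((data.strip(), self._href))
            self._href = None


parser = LinkCollector()
parser.feed(SAMPLE_HTML)

# Step 5: store the extracted rows (here in an in-memory CSV).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['name', 'url'])
writer.writerows(parser.links)
print(buffer.getvalue())
```

In a real crawler the stubbed string would come from an HTTP library and the CSV would be written to disk, which is what the functions in the following sections do.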

This article collects the movie data that the later data visualization articles will build on~

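1. Page Analysis

The data comes from Douban Movie. Search the site for the actor 沈腾 (Shen Teng) and open the list of all works he has appeared in (the 沈腾参演作品 page). The page shows 134 works in total. None of the works on the first page have been released yet, so the first page can be skipped when fetching data. Next, look at the page links and compare how they differ from page to page, so that the data can be fetched in bulk.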

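Below are the links to the individual pages. The only difference between them is the number after start=: the first page is start=0, the second is start=10, the third is start=20, and so on up to the last page, start=130. As noted above, none of the films on the first page have been released, and unreleased films have no data usable for the later visualization, so that page is not fetched. A loop from 1 to 13, with the loop value multiplied by 10 to give start, covers all of the remaining works.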

https://movie.douban.com/celebrity/1325700/movies?start=0&format=pic&sortby=time&
https://movie.douban.com/celebrity/1325700/movies?start=10&format=pic&sortby=time&
https://movie.douban.com/celebrity/1325700/movies?start=20&format=pic&sortby=time&
…
https://movie.douban.com/celebrity/1325700/movies?start=130&format=pic&sortby=time&
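
The pattern in these links can be turned into a small URL generator. The sketch below assumes, based only on the links listed above, that start advances by 10 per page. The names BASE_URL and page_url are hypothetical helpers and are not used elsewhere in this article.

```python
BASE_URL = 'https://movie.douban.com/celebrity/1325700/movies'


def page_url(page):
    """Return the URL of a 1-based page number (start advances by 10 per page)."""
    start = (page - 1) * 10
    return f'{BASE_URL}?start={start}&format=pic&sortby=time&'


# Pages 2 to 14; page 1 is skipped because none of its films are released.
urls = [page_url(page) for page in range(2, 15)]
print(len(urls))
print(urls[0])
print(urls[-1])
```

The code in section 3 gets the same URLs a slightly different way: it loops i from 1 to 13 and writes start={i}0 in an f-string.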

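2. Building the Data-Fetching Functions

2.1 Page-Fetching Function

A crawler requests pages very quickly and adds load to the web server, so websites deploy anti-scraping measures.
To avoid being detected, the function passes a headers argument whose User-Agent mimics a browser.
It then fetches the page with the requests package, encodes the returned text as gbk while ignoring any characters that gbk cannot encode, and decodes it back.
Finally, BeautifulSoup parses the text into an HTML tree that can be searched.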

# time: 2023.07.26
# author: bio大恐龙
# define a function to get a web page and parse it as html
import requests
from bs4 import BeautifulSoup


def get_url_info(url):
    # pretend to be a browser so the request is less likely to be blocked
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36 QIHU 360SE'}
    try:
        # drop characters that gbk cannot encode
        info = requests.get(url, headers=headers).text.encode('gbk', 'ignore').decode('gbk')
        soup = BeautifulSoup(info, 'html.parser')
        return soup
    except requests.RequestException:
        print('Sorry! The film information could not be fetched')
        return None
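
To see what the encode('gbk', 'ignore').decode('gbk') round trip does on its own, here is a minimal sketch on an invented sample string. The likely motivation, judging from the to_csv(..., encoding='gbk') call at the end of this article, is to make sure everything collected can later be written as gbk.

```python
raw_text = '沈腾 Teng Shen 🦖'

# errors='ignore' silently drops every character that gbk cannot encode.
cleaned = raw_text.encode('gbk', 'ignore').decode('gbk')

print(cleaned)
print('🦖' in cleaned)
```

The Chinese characters and ASCII text survive, while the emoji, which has no gbk encoding, disappears without raising an error.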

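2.2 Poster-Download Function

Every film has its own poster, which is nice to look at. The image data returned is binary, so the file is saved with 'wb' (binary write) mode. The rest of the code works the same way as the page-fetching function.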

# define a function to download a film poster
def download_image(url, save_path):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36 QIHU 360SE'}
    try:
        # .content holds the raw bytes of the response
        image = requests.get(url, headers=headers).content
        with open(save_path, 'wb') as f:
            f.write(image)
    except (requests.RequestException, OSError):
        print('Sorry! Failed to download the image')
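
As a small offline illustration of why binary mode matters, the sketch below writes a few bytes with 'wb' and reads them back with 'rb'. The bytes are the standard 8-byte PNG signature, used here only as stand-in image data; the file name and temporary directory are assumptions made for the example.

```python
import os
import tempfile

# The first 8 bytes of every PNG file, used here as stand-in image data.
fake_image = b'\x89PNG\r\n\x1a\n'

with tempfile.TemporaryDirectory() as tmp_dir:
    save_path = os.path.join(tmp_dir, 'poster.png')
    # 'wb' writes the bytes exactly as given, with no text encoding
    # or newline translation.
    with open(save_path, 'wb') as f:
        f.write(fake_image)
    with open(save_path, 'rb') as f:
        round_trip = f.read()

print(round_trip == fake_image)
```

Opening the file in text mode ('w') would instead reject a bytes object, which is why download_image() uses 'wb'.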

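3. Fetching Basic Data on the Works

Use the page-fetching function get_url_info() to retrieve any one of the pages; the last page is used here as the example. First collect the title, URL, and year of each work (not all of them are films), then use each work's URL to fetch its details.
Inspecting the returned HTML shows that the data we want sits under the h6 tag, so BeautifulSoup's find() can retrieve it. For example, the year can be extracted with html_content.find('span').text.strip('()'), where .text returns the text content and strip('()') removes the surrounding parentheses. (This assumes html_content is the result of find('h6') and holds the HTML below.)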

<img alt="案发现场2" class="" src="https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2466501379.jpg" title="案发现场2"/></a></dt><dd><h6><a class="" href="https://movie.douban.com/subject/3151813/">案发现场2</a><span class="">(2007)</span><span class="">[ 演员 (饰 夏晓强) ]</span></h6>
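
The extraction described above can be tried directly on the h6 element from this excerpt. This is a minimal sketch that parses a hard-coded copy of the snippet instead of a live page; the variable name snippet is an assumption made for the example.

```python
from bs4 import BeautifulSoup

# The <h6> element from the excerpt above.
snippet = (
    '<h6><a class="" href="https://movie.douban.com/subject/3151813/">案发现场2</a>'
    '<span class="">(2007)</span>'
    '<span class="">[ 演员 (饰 夏晓强) ]</span></h6>'
)

html_content = BeautifulSoup(snippet, 'html.parser').find('h6')

movie_name = html_content.find('a').text
movie_url = html_content.find('a')['href']
# find() returns only the first <span>, which holds the year.
movie_year = html_content.find('span').text.strip('()')

print(movie_name, movie_url, movie_year)
```

Note that find('span') works here only because the year is the first span inside h6; the second span holds the role information.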

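The title and URL are obtained the same way. The code below collects the basic information for every work on pages 2 through 14, following the same approach.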

import pandas as pd
import time

# construct a dataframe to store information on the works Shen Teng appeared in
shenteng_movies_df = pd.DataFrame(columns=['Film_Name', 'URL', 'Year'])

'''
the page urls differ only in "start"; 13 pages are fetched (pages 2 to 14)
'https://movie.douban.com/celebrity/1325700/movies?start=10&format=pic&sortby=time&'
'https://movie.douban.com/celebrity/1325700/movies?start=20&format=pic&sortby=time&'
'''
df_index = 0
website_list = list(range(1, 14))
for i in website_list:
    movie_info = get_url_info(f'https://movie.douban.com/celebrity/1325700/movies?start={i}0&format=pic&sortby=time&')
    # skip pages that could not be fetched
    if movie_info is None:
        continue
    interest_info = movie_info.find_all('h6')
    # print(interest_info[0].find('span'))
    # break
    for k in range(len(interest_info)):
        movie_year = interest_info[k].find('span').text.strip('()')
        movie_url = interest_info[k].find('a')['href']
        movie_name = interest_info[k].find('a').text
        shenteng_movies_df.loc[df_index] = [movie_name, movie_url, movie_year]
        df_index += 1
    # wait between pages to avoid hammering the server
    time.sleep(10)

The corresponding CSV file can be downloaded from the CSDN resource library under the name 沈腾参演影视作品基础信息 (basic information on works featuring Shen Teng).

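4. Fetching Detailed Film Data

The goal is data visualization, so the fields to collect are title, URL, year, director, actors, genre, number of ratings, rating, IMDb ID, description, number of people who want to watch, and number of people who have watched. Title, URL, and year were collected in the previous step; this step collects the rest. Some entries are not films, and some film pages lack fields such as the number of ratings or the want-to-watch count, so the code took repeated debugging, and the explanation of the full code at the end may not cover everything. If anything is unclear, feel free to leave a comment or send a private message.

4.1 Director, Actors, Description, Genre, Rating Count, Rating, and Poster

4.1.1 Poster (using 超能一家人 as the example):

This step uses the get_url_info() and download_image() functions defined in 2.1 and 2.2. In the HTML shown at the end of section 4.1, "image": "https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2890369636.jpg" sits inside the <script type="application/ld+json"> element. So first extract that element with find(), then use json.loads() to convert its JSON text into a dict, and finally look up the value by key. The code is: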

import json

# movie_info is assumed to be the film page returned by get_url_info()
json_data = json.loads(movie_info.find('script', type="application/ld+json").string.strip())
image_url = json_data['image']  # extract the poster URL
download_image(image_url, save_path)  # download the poster

27 posters were downloaded in total, which means 27 films in all~~

4.1.2 Director and actors:

With the data already converted to a dict, the director and actors are easy to read out. Only the Chinese names are kept here.

director = json_data['director'][0]['name'].split()[0]
actors = str([i['name'].split()[0] for i in json_data['actor']]).strip('[]')

4.1.3 Description, genre, rating count, and rating:

In the same way, look up each value by its key.

genre = str(json_data['genre']).strip('[]')  # genre
rating_count = json_data['aggregateRating']['ratingCount']  # number of ratings
rating_value = json_data['aggregateRating']['ratingValue']  # rating
description = json_data['description']  # description

HTML:

<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "name": "超能一家人",
  "url": "/subject/35228789/",
  "image": "https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2890369636.jpg",
  "director": 
  [
    {
      "@type": "Person",
      "url": "/celebrity/1350407/",
      "name": "宋阳 Yang Song"
    }
  ]
,
  "author": 
  [
    {
      "@type": "Person",
      "url": "/celebrity/1350407/",
      "name": "宋阳 Yang Song"
    }
    ,
    {
      "@type": "Person",
      "url": "/celebrity/1375192/",
      "name": "毕慷 Kang Bi"
    }
  ]
,
  "actor": 
  [
    {
      "@type": "Person",
      "url": "/celebrity/1350408/",
      "name": "艾伦 Allen"
    }
    ,
    {
      "@type": "Person",
      "url": "/celebrity/1325700/",
      "name": "沈腾 Teng Shen"
    }
  ]
,
  "datePublished": "2023-07-21",
  "genre": ["\u559c\u5267", "\u5bb6\u5ead", "\u5947\u5e7b"],
  "duration": "PT1H53M",
  "description": "郑前（艾伦 饰）新开发的APP被狡猾又诚实的反派乞乞科夫（沈腾 饰）盯上了。幸好郑前一家人意外获得了超能力，姐姐会飞天，爸爸能隐身，爷爷不死术，妹妹力大无穷。郑前本指望家人们出手帮忙，一家人却常常出糗...",
  "@type": "Movie",
  "aggregateRating": {
    "@type": "AggregateRating",
    "ratingCount": "60348",
    "bestRating": "10",
    "worstRating": "2",
    "ratingValue": "4.0"
  }
}
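
To make the dictionary lookups concrete, the sketch below parses a trimmed copy of the JSON-LD block above with the standard json module. It is a variation on the article's code, not a replacement for it: it joins the lists with ', ' and converts the rating fields to numbers, both of which are choices made for this example (the names ld_json, actors_text, and genre_text are hypothetical).

```python
import json

# A trimmed copy of the JSON-LD block above (raw string keeps \uXXXX as-is).
ld_json = r'''
{
  "name": "超能一家人",
  "director": [{"@type": "Person", "name": "宋阳 Yang Song"}],
  "actor": [
    {"@type": "Person", "name": "艾伦 Allen"},
    {"@type": "Person", "name": "沈腾 Teng Shen"}
  ],
  "genre": ["\u559c\u5267", "\u5bb6\u5ead", "\u5947\u5e7b"],
  "aggregateRating": {"ratingCount": "60348", "ratingValue": "4.0"}
}
'''

json_data = json.loads(ld_json)

# split()[0] keeps the part before the first space, i.e. the Chinese name.
director = json_data['director'][0]['name'].split()[0]
actors = [person['name'].split()[0] for person in json_data['actor']]

# Joining the lists gives cleaner text than str(list).strip('[]'),
# which leaves the quote marks around each item.
actors_text = ', '.join(actors)
genre_text = ', '.join(json_data['genre'])

# The rating fields are stored as strings, so convert them for analysis.
rating_count = int(json_data['aggregateRating']['ratingCount'])
rating_value = float(json_data['aggregateRating']['ratingValue'])

print(director, actors_text, genre_text, rating_count, rating_value)
```

json.loads() also decodes the \uXXXX escapes in genre, so the values come back as readable Chinese text.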

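4.2 IMDb ID, Want-to-Watch Count, and Watched Count

4.2.1 IMDb ID:

As the HTML below shows, the IMDb ID comes right after a span with class="pl". Use find() to locate that span, then use next_sibling to read the text node that follows it. The code is: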

movie_info.find('span', class_='pl', text="IMDb:").next_sibling.strip()

HTML:

<span class="pl">IMDb:</span> tt12787014<br/>
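
The lookup can be checked against the one-line HTML excerpt above. This minimal sketch uses string= where the article's code uses text=; string= is the name Beautiful Soup has used for this argument since version 4.4, and text= is the older spelling. The guard for a missing label is an addition made for this example.

```python
from bs4 import BeautifulSoup

snippet = '<span class="pl">IMDb:</span> tt12787014<br/>'
movie_info = BeautifulSoup(snippet, 'html.parser')

# string= matches on the tag's text content.
label = movie_info.find('span', class_='pl', string='IMDb:')

# next_sibling is the text node ' tt12787014' that follows the <span>;
# strip() removes the leading space.
imdb = label.next_sibling.strip() if label is not None else None

print(imdb)
```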

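4.2.2 Want-to-watch and watched counts:

This information sits under <div class="subject-others-interests-ft">. Locate that div with find(), collect its links with find_all('a'), and take the number before 人 in each link's text. In the HTML below, the first link is the watched count (看过) and the second is the want-to-watch count (想看). The code is:

tem_info = movie_info.find("div", class_="subject-others-interests-ft").find_all('a')
watched_count = tem_info[0].text.split('人')[0]   # first link: "...人看过"
interest_count = tem_info[1].text.split('人')[0]  # second link: "...人想看"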

HTML:

<div class="subject-others-interests-ft"><a href="https://movie.douban.com/subject/35228789/comments?status=P">62456人看过</a>
                 / 
            <a href="https://movie.douban.com/subject/35228789/comments?status=F">36999人想看</a></div>
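
Because the two counts are told apart only by their position, it is easy to mix them up. One alternative, sketched here as an assumption and not as the article's method, is to match each link's text by its label (看过 for watched, 想看 for want to watch) with a regular expression. The list link_texts stands in for the .text of the two links above.

```python
import re

# Link texts as they appear in the HTML above.
link_texts = ['62456人看过', '36999人想看']

counts = {}
for text in link_texts:
    # Capture the number and the label that follows 人.
    match = re.fullmatch(r'(\d+)人(看过|想看)', text.strip())
    if match:
        counts[match.group(2)] = int(match.group(1))

watched_count = counts.get('看过')    # people who have watched the film
interest_count = counts.get('想看')   # people who want to watch it

print(watched_count, interest_count)
```

With this approach a page that lists the links in a different order, or omits one of them, still yields the right value or None.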

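4.3 Full Code for Fetching the Details

Many of the filter conditions in the full code exist to screen out entries that are not films. To keep the script from failing when a film's page lacks some fields, the code also checks results against Tag, a BeautifulSoup type: find() returns a Tag when it finds a match and None when it does not.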
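
The Tag check can be seen in isolation in the sketch below. The one-line page and its text 暂无评分 are invented for the example; only the class names come from this article.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

page = BeautifulSoup('<div class="rating_sum">暂无评分</div>', 'html.parser')

found = page.find('div', class_='rating_sum')
missing = page.find('div', class_='subject-others-interests-ft')

print(isinstance(found, Tag))
print(missing is None)

# Guard before touching .text so a missing element yields None, not an error.
interest_text = missing.text if isinstance(missing, Tag) else None
print(interest_text)
```

Without the guard, missing.text would raise AttributeError because None has no text attribute. Writing `missing is not None` would work equally well here.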

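The character class [\x00-\x1F\x7F-\x9F] matches control characters. A raw control character inside a JSON string makes json.loads() raise an error, so these characters are removed first to keep the script from failing.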

json_data = re.sub(r'[\x00-\x1F\x7F-\x9F]','', movie_info.find('script',type="application/ld+json").string.strip())
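
The effect of this substitution can be reproduced on a small invented JSON string that contains a raw newline inside a value. The sketch also shows json.loads(..., strict=False), a standard-library option that accepts control characters inside strings; it is mentioned only as an alternative and is not what this article's code uses.

```python
import json
import re

# A JSON string value containing a raw newline, which strict JSON forbids.
broken = '{"description": "line one\nline two"}'

try:
    json.loads(broken)
    parsed_raw = True
except json.JSONDecodeError:
    parsed_raw = False

# Remove the control characters, then parse.
cleaned = re.sub(r'[\x00-\x1F\x7F-\x9F]', '', broken)
data = json.loads(cleaned)

# Alternative: keep the newline and let the parser accept it.
lenient = json.loads(broken, strict=False)

print(parsed_raw)
print(data['description'])
print(repr(lenient['description']))
```

The trade-off is that the regex approach deletes the line breaks from descriptions, while strict=False keeps them.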

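The corresponding CSV file for the final table can be downloaded from the CSDN resource library under the name 沈腾参演电影详细信息 (detailed information on films featuring Shen Teng):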

import json
import os
import re

from bs4.element import Tag

# create a directory to store the film posters
dir_path = '/mnt/c/Users/ouyangkang/Desktop/film_poster/'
if not os.path.exists(dir_path):
    os.makedirs(dir_path)

# construct a dataframe to store the detailed information on each film
films_detail_df = pd.DataFrame(columns=['Film_name', 'URL', 'Year', 'Director', 'Actors', 'Genre', 'Rating_count', 'Rating_value', 'IMDb', 'Description', 'Interesting_count', 'Watched_count'])

# row index of the next film to add
initial_number = 0
for single_movie_url in shenteng_movies_df['URL'].tolist():
    time.sleep(4)
    movie_info = get_url_info(single_movie_url)
    # skip pages that could not be fetched
    if movie_info is None:
        continue
    # screen out non-film entries and films not yet released
    if isinstance(movie_info.find('div', class_="rating_sum"), Tag):
        if "暂无" not in movie_info.find('div', class_="rating_sum").text and "尚未" not in movie_info.find('div', class_="rating_sum").text:
            # strip control characters, then convert the JSON-LD block into a dict
            json_data = re.sub(r'[\x00-\x1F\x7F-\x9F]', '', movie_info.find('script', type="application/ld+json").string.strip())
            json_data = json.loads(json_data)
            if json_data['@type'] == 'Movie' and json_data['aggregateRating']['ratingValue'] != "" and json_data['description'] != "" and "真人秀" not in json_data['genre'] and "脱口秀" not in json_data['genre'] and '歌舞' not in json_data['genre']:
                # name
                name = shenteng_movies_df[shenteng_movies_df["URL"] == single_movie_url]['Film_Name'].tolist()[0]
                # url
                url = single_movie_url
                # year
                year = shenteng_movies_df[shenteng_movies_df["URL"] == single_movie_url]['Year'].tolist()[0]
                # director
                director = json_data['director'][0]['name'].split()[0]
                # actors (Chinese name only)
                actors = str([i['name'].split()[0] for i in json_data['actor']]).strip('[]')
                # genre
                genre = str(json_data['genre']).strip('[]')
                # rating count
                rating_count = json_data['aggregateRating']['ratingCount']
                # rating value
                rating_value = json_data['aggregateRating']['ratingValue']
                # IMDb
                if isinstance(movie_info.find('span', class_='pl', text="IMDb:"), Tag):
                    imdb = movie_info.find('span', class_='pl', text="IMDb:").next_sibling.strip()
                else:
                    imdb = None
                # description
                description = json_data['description']
                # how many people have watched the film and how many want to watch it
                if isinstance(movie_info.find("div", class_="subject-others-interests-ft"), Tag):
                    tem_info = movie_info.find("div", class_="subject-others-interests-ft").find_all('a')
                    watched_count = tem_info[0].text.split('人')[0]   # first link: "...人看过"
                    interest_count = tem_info[1].text.split('人')[0]  # second link: "...人想看"
                else:
                    interest_count = None
                    watched_count = None
                # poster url
                image_url = json_data['image']

                films_detail_df.loc[initial_number] = [name, url, year, director, actors, genre, rating_count, rating_value, imdb, description, interest_count, watched_count]
                initial_number += 1

                time.sleep(8)
                save_path = dir_path + name + '.jpg'
                download_image(image_url, save_path)
                time.sleep(8)

films_detail_df.head()
# save the file
# films_detail_df.to_csv('/mnt/c/Users/ouyangkang/Desktop/films_info.csv', index=None, encoding='gbk')

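Summary

This article showed how to collect information from web pages, using film data as the example, but it did not explain the functions involved in detail. If you have questions, leave a comment, send a private message, or search online. What is offered here is an approach: first locate where the information sits, then convert the HTML data into a dict and extract what you need. You can of course also use regular expressions to extract the information you want. Thanks for reading, and if you are looking forward to the visualization articles that follow, give me a follow so you don't miss them~

Tags: web crawler, python

Reposted from: https://blog.csdn.net/ouyangk1026/article/details/132016703
Copyright belongs to the original author, Bio大恐龙. If there is any infringement, please contact us for removal.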
