Skill Tree - Web Crawling - BeautifulSoup

Preface

Hi everyone, I'm 空空star. This post works through the BeautifulSoup exercises from the web crawling branch of the Python beginner skill tree.

1. Get all p tags

Get the text inside every p tag.

# -*- coding: UTF-8 -*-
from bs4 import BeautifulSoup


def fetch_p(html):
    # TODO(You): implement this function
    return results


if __name__ == '__main__':
    html = '''
        <html>
            <head>
                <title>这是一个简单的测试页面</title>
            </head>
            <body>
                <p class="item-0">body 元素的内容会显示在浏览器中。</p>
                <p class="item-1">title 元素的内容会显示在浏览器的标题栏中。</p>
            </body>
        </html>
        '''
    p_text = fetch_p(html)
    print(p_text)

Choose the options below that correctly implement this function.

A.

def fetch_p(html):
    soup = BeautifulSoup(html, 'lxml')
    p_list = soup.xpath("p")
    results = [p.text for p in p_list]
    return results

B.

def fetch_p(html):
    soup = BeautifulSoup(html, 'lxml')
    p_list = soup.find_all("p")
    results = [p.text for p in p_list]
    return results

C.

def fetch_p(html):
    soup = BeautifulSoup(html, 'lxml')
    results = soup.find_all("p")
    return results

D.

def fetch_p(html):
    soup = BeautifulSoup(html, 'lxml')
    p_list = soup.findAll("p")
    results = [p.text for p in p_list]
    return results

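Analysis:
A is wrong. BeautifulSoup has no xpath method: soup.xpath is treated as a search for an <xpath> tag and returns None, so calling it raises a TypeError.
B is correct. It prints:
['body 元素的内容会显示在浏览器中。', 'title 元素的内容会显示在浏览器的标题栏中。']
C is wrong. It returns the p tags themselves rather than just their text:
[<p class="item-0">body 元素的内容会显示在浏览器中。</p>, <p class="item-1">title 元素的内容会显示在浏览器的标题栏中。</p>]
D is also correct. In BeautifulSoup, find_all() and findAll() are equivalent: both return every tag in the document that matches the given conditions, and both accept a tag name, attribute filters, and so on.
The two spellings exist because early versions of BeautifulSoup used findAll(); find_all() was added later to follow Python's naming conventions (PEP 8). findAll() remains as a backward-compatible alias with the same behavior, and find_all() is the better choice in new code.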
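
The points above can be checked with a small script. This is a minimal sketch, not part of the original exercise: SAMPLE_HTML, fetch_p_legacy, and xpath_lookup are hypothetical names introduced for illustration, and the parser is swapped to 'html.parser' (bundled with Python) on the assumption that lxml may not be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, shortened from the test page in the exercise.
SAMPLE_HTML = '<body><p class="item-0">first</p><p class="item-1">second</p></body>'


def fetch_p(html):
    # Assumption: 'html.parser' is used instead of 'lxml' to avoid an extra install.
    soup = BeautifulSoup(html, 'html.parser')
    return [p.text for p in soup.find_all('p')]


def fetch_p_legacy(html):
    # Same lookup through the older camelCase alias.
    soup = BeautifulSoup(html, 'html.parser')
    return [p.text for p in soup.findAll('p')]


def xpath_lookup(html):
    # Unknown attribute names are treated as tag lookups,
    # so this searches for an <xpath> tag and finds nothing.
    soup = BeautifulSoup(html, 'html.parser')
    return soup.xpath


if __name__ == '__main__':
    print(fetch_p(SAMPLE_HTML))         # ['first', 'second']
    print(fetch_p_legacy(SAMPLE_HTML))  # ['first', 'second']
    print(xpath_lookup(SAMPLE_HTML))    # None
```

Because soup.xpath evaluates to None, option A fails at the call soup.xpath("p") rather than returning an empty list.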

2. Get all text

Get the text of the whole page.

# -*- coding: UTF-8 -*-
from bs4 import BeautifulSoup


def fetch_text(html):
    # TODO(You): implement this function
    return result


if __name__ == '__main__':
    html = '''
        <html>
            <head>
                <title>这是一个简单的测试页面</title>
            </head>
            <body>
                <p class="item-0">body 元素的内容会显示在浏览器中。</p>
                <p class="item-1">title 元素的内容会显示在浏览器的标题栏中。</p>
            </body>
        </html>
        '''
    text = fetch_text(html)
    print(text)

Choose the option below that correctly implements this function.

A.

def fetch_text(html):
    soup = BeautifulSoup(html, 'lxml')
    result = soup.find_all('text')
    return result

B.

def fetch_text(html):
    soup = BeautifulSoup(html, 'lxml')
    result = soup.text
    return result

C.

def fetch_text(html):
    soup = BeautifulSoup(html, 'lxml')
    result = soup.find_text()
    return result

D.

def fetch_text(html):
    soup = BeautifulSoup(html, 'lxml')
    result = soup.text()
    return result

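Analysis:
A is wrong. find_all() searches by tag, so find_all('text') looks for <text> tags and finds none here; the task asks for the text of the page.
B is correct. It prints the text of the whole document, including the title and the whitespace between tags:


这是一个简单的测试页面

body 元素的内容会显示在浏览器中。
title 元素的内容会显示在浏览器的标题栏中。

C is wrong. BeautifulSoup has no find_text() method.
D is wrong. text is a property that returns a string, not a method, so text() raises a TypeError.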
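
To make the difference between the property and a method call concrete, here is a minimal sketch. SAMPLE_HTML, fetch_text_lines, and calling_text_fails are hypothetical names added for illustration; 'html.parser' is used in place of 'lxml' as an assumption about what is installed. The sketch also shows get_text(), which returns the same text as the text property but accepts a separator and a strip flag, and which may be handy when the blank lines in the output above are unwanted.

```python
from bs4 import BeautifulSoup

# Hypothetical compact markup with no whitespace between tags.
SAMPLE_HTML = '<html><head><title>T</title></head><body><p>a</p><p>b</p></body></html>'


def fetch_text(html):
    # .text is a property: no parentheses.
    soup = BeautifulSoup(html, 'html.parser')
    return soup.text


def fetch_text_lines(html):
    # get_text() joins the stripped text pieces with the given separator.
    soup = BeautifulSoup(html, 'html.parser')
    return soup.get_text('\n', strip=True)


def calling_text_fails(html):
    # Option D: calling the string returned by .text raises TypeError.
    soup = BeautifulSoup(html, 'html.parser')
    try:
        soup.text()
    except TypeError:
        return True
    return False


if __name__ == '__main__':
    print(repr(fetch_text(SAMPLE_HTML)))        # 'Tab'
    print(repr(fetch_text_lines(SAMPLE_HTML)))  # 'T\na\nb'
    print(calling_text_fails(SAMPLE_HTML))      # True
```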

3. Get all image URLs

Find the URLs of all images in a page.

from bs4 import BeautifulSoup


def fetch_imgs(html):
    # TODO(You): implement this function
    return imgs


def test():
    imgs = fetch_imgs('<p><img src="http://example.com"/><img src="http://example.com"/></p>')
    print(imgs)


if __name__ == '__main__':
    test()

Choose the option below that correctly implements this function.

A.

def fetch_imgs(html):
    soup = BeautifulSoup('html.parser', html)
    imgs = [tag['src'] for tag in soup.find_all('img')]
    return imgs

B.

def fetch_imgs(html):
    soup = BeautifulSoup(html, 'html.parser')
    imgs = [tag['src'] for tag in soup.find_all('img')]
    return imgs

C.

def fetch_imgs(html):
    soup = BeautifulSoup(html, 'html.parser')
    imgs = [tag for tag in soup.find_all('img')]
    return imgs

D.

def fetch_imgs(html):
    soup = BeautifulSoup(html, 'html.parser')
    imgs = soup.find_all('img')
    return imgs

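Analysis:
A is wrong. The arguments to BeautifulSoup are swapped: the markup must come first and the parser name second.
B is correct. It prints:
['http://example.com', 'http://example.com']
C is wrong. It returns the img tags themselves rather than their src values:
[<img src="http://example.com"/>, <img src="http://example.com"/>]
D is wrong for the same reason. It returns the img tags rather than their src values:
[<img src="http://example.com"/>, <img src="http://example.com"/>]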
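
Option B assumes every img tag carries a src attribute; indexing with tag['src'] raises a KeyError when it is missing. The following sketch shows two ways that case might be handled. SAMPLE_HTML and fetch_imgs_with_get are hypothetical additions for illustration and are not part of the original exercise.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second <img> has no src attribute.
SAMPLE_HTML = '<p><img src="http://example.com/a.png"/><img alt="no source"/></p>'


def fetch_imgs(html):
    soup = BeautifulSoup(html, 'html.parser')
    # src=True keeps only <img> tags that actually have a src attribute.
    return [tag['src'] for tag in soup.find_all('img', src=True)]


def fetch_imgs_with_get(html):
    soup = BeautifulSoup(html, 'html.parser')
    # tag.get() returns None instead of raising KeyError for a missing attribute.
    return [tag.get('src') for tag in soup.find_all('img')]


if __name__ == '__main__':
    print(fetch_imgs(SAMPLE_HTML))           # ['http://example.com/a.png']
    print(fetch_imgs_with_get(SAMPLE_HTML))  # ['http://example.com/a.png', None]
```

Filtering with src=True fits the task of listing URLs; the get() variant is useful when the caller wants to know that some images lack a source.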

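Summary

The three exercises cover find_all() for collecting tags, the text property for extracting text, and tag['src'] for reading an attribute value.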

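This article is reposted from: https://blog.csdn.net/weixin_38093452/article/details/131354415
Copyright belongs to the original author, 空空star. If there is any infringement, please contact us for removal.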
