Scraping a Website with Embedded JavaScript Using Selenium

Problem background:

I am new to Selenium and am trying to scrape the contents of this website. However, the page appears to be a template that is populated by JavaScript at load time, and I don't know how to use Selenium to access the content I see in the browser, such as the title (Auf dem Bahnhof) or the Objective.

I can find the tags of the elements I need in the browser's Web Developer Tools, but my sample script below returns nothing for them:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import Select,WebDriverWait

class Demo():

    def demo_get_contents(self):

        # create webdriver object
        service = Service(executable_path=ChromeDriverManager().install())
        driver = webdriver.Chrome(service=service)

        driver.get('https://gloss.dliflc.edu/LessonViewer.aspx?lessonId=26143&lessonName=ger_soc434&linkTypeId=0')
        element = WebDriverWait(driver, 2).until(EC.visibility_of_all_elements_located((By.CLASS_NAME,'gloss_Overview')))
        print(element.get_attribute('text'))

demo = Demo()
demo.demo_get_contents()

I am using Python 3.8.

Looking at the page source, I can see the JavaScript and the iframe that presumably runs the accessActivity() function, but I don't know how to run it with Selenium to access the actual page contents.

Solution:

Actually, as an alternative, there's no need to use Selenium. If you inspect the network calls, you'll see that the data is available as an XML file from:

https://gloss.dliflc.edu/GlossHtml/templates/linksLO/glossLOs/ger_soc434.xml
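
The file name in this URL matches the lessonName query parameter of the lesson page (ger_soc434). Below is a minimal sketch of deriving one from the other. It assumes that other lessons follow the same naming pattern, which the answer does not confirm, and xml_url_for_lesson is a hypothetical helper name introduced only for illustration.

```python
from urllib.parse import urlparse, parse_qs

# Base path copied from the XML URL above. That other lessons share it is an
# assumption, not something confirmed by the site.
XML_BASE = "https://gloss.dliflc.edu/GlossHtml/templates/linksLO/glossLOs/"

def xml_url_for_lesson(page_url):
    """Build the XML data URL from a LessonViewer page URL (hypothetical helper)."""
    query = parse_qs(urlparse(page_url).query)
    names = query.get("lessonName")
    if not names:
        raise ValueError("page URL has no lessonName parameter")
    return f"{XML_BASE}{names[0]}.xml"

page = "https://gloss.dliflc.edu/LessonViewer.aspx?lessonId=26143&lessonName=ger_soc434&linkTypeId=0"
print(xml_url_for_lesson(page))
```

For the lesson page in the question, this prints the XML URL shown above.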

You can use Python's built-in ElementTree library to extract the relevant quiz data.

import requests
import xml.etree.ElementTree as ET

url = 'https://gloss.dliflc.edu/GlossHtml/templates/linksLO/glossLOs/ger_soc434.xml'

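def get_element_text(element):
    # Join the text of the element and all of its descendants
    return ''.join(element.itertext()).strip()

def find_elements_texts(root, tag):
    # Match only elements of this tag with dir="ltr" and esbox="0"
    elements = root.findall(f".//{tag}[@dir='ltr'][@esbox='0']")
    return [get_element_text(elem) for elem in elements]

response = requests.get(url, timeout=10)
response.raise_for_status()  # fail early on an HTTP error
root = ET.fromstring(response.content)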

objectives_texts = find_elements_texts(root, "OBJECTIVES")
descriptions_texts = find_elements_texts(root, "ACTY_DESCRIPTION")

print(f"Objective:\n {''.join(objectives_texts)}\n")

print(f"Descriptions:\n {descriptions_texts}")

Prints:

Objective:
 Strengthen listening skills and improve comprehension by focusing on terms related to train travel in an audio about a family at a train station before a trip.

Descriptions:
 ['Identify relevant vocabulary and get a more detailed idea of the topic.', 'Preview useful terms and expressions that appear in the upcoming dialogue.', 'Become familiar with the specifics of the situation by listening to several dialogues.', 'Transcribe portions of another dialogue.', 'Assess your knowledge by matching questions with answers.']
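
To see how the attribute filter and itertext() behave without hitting the network, here is a small offline sketch. The SAMPLE document is hypothetical: it only borrows the tag and attribute names used in the code above, and the real file's structure may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample that mimics the tag and attribute names used above.
SAMPLE = """
<LESSON>
  <OBJECTIVES dir="ltr" esbox="0">Strengthen <b>listening</b> skills.</OBJECTIVES>
  <OBJECTIVES dir="ltr" esbox="1">Ignored because esbox is not 0.</OBJECTIVES>
  <ACTIVITY>
    <ACTY_DESCRIPTION dir="ltr" esbox="0"> Identify relevant vocabulary. </ACTY_DESCRIPTION>
  </ACTIVITY>
</LESSON>
""".strip()

def get_element_text(element):
    # itertext() walks nested tags too, so inline markup such as <b> is flattened
    return "".join(element.itertext()).strip()

def find_elements_texts(root, tag):
    # Both predicates must match: dir="ltr" and esbox="0"
    elements = root.findall(f".//{tag}[@dir='ltr'][@esbox='0']")
    return [get_element_text(elem) for elem in elements]

root = ET.fromstring(SAMPLE)
objectives = find_elements_texts(root, "OBJECTIVES")
descriptions = find_elements_texts(root, "ACTY_DESCRIPTION")

print(objectives)    # ['Strengthen listening skills.']
print(descriptions)  # ['Identify relevant vocabulary.']
```

The second OBJECTIVES element is skipped because its esbox attribute is "1", and the nested ACTY_DESCRIPTION is still found because the ".//" prefix searches all descendants.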


Reposted from: https://blog.csdn.net/suiusoar/article/details/144265632
Copyright belongs to the original author, 营赢盈英. If there is any infringement, please contact us for removal.
