
Trouble clicking the sort button while web scraping with Selenium and BeautifulSoup


Problem description

I am trying to scrape the website "url='https://angel.co/life-sciences'", which lists more than 8,000 entries. From this page I need information such as the company name and link, the joined date, and the number of followers. Before that, I need to sort the Followers column by clicking its header, and then load more results by clicking the "more hidden" button. That button can be clicked at most 20 times; after that the page loads no further entries. Sorting at least lets me capture the companies with the most followers. I implemented the click() event here, but it is showing an error:

Unable to locate element: {"method":"xpath","selector":"//div[@class="column followers sortable sortable"]"} # before the edit this was my problem: I was using the wrong class name

So do I need to give it more sleep time here? (I tried that, but got the same error.)
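
As a side note, a fixed sleep only guesses at load time, while an explicit wait polls until the element actually appears. Below is a minimal sketch of that idea, assuming the corrected "column followers sortable" class name from the question:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get('https://angel.co/life-sciences')

# Poll for up to 30 seconds until the Followers header is present in the DOM,
# rather than sleeping a fixed amount of time and hoping the page is ready.
wait = WebDriverWait(driver, 30)
followers_header = wait.until(
    EC.presence_of_element_located(
        (By.XPATH, '//div[@class="column followers sortable"]')
    )
)
followers_header.click()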

I need to parse all of the above information and then visit the individual link of each company to scrape only the content div of that HTML page.

Please suggest an approach.
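
For the "visit individual links" step, one possible pattern is to collect the links from the listing page first and then load each one and parse only its content div with BeautifulSoup. This is a rough sketch; the 'startup-link' and 'content' class names are assumptions, not verified against the live page:

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://angel.co/life-sciences')

# Collect the company links from the listing page first.
soup = BeautifulSoup(driver.page_source, 'html.parser')
links = []
for a in soup.find_all('a', class_='startup-link'):  # assumed class name
    if a.get('href'):
        links.append(a['href'])

# Then visit each link and scrape only the content div of that page.
for link in links:
    driver.get(link)
    page = BeautifulSoup(driver.page_source, 'html.parser')
    content = page.find('div', class_='content')  # assumed class name
    if content is not None:
        print(content.get_text(strip=True))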

Here is my current code; I have not yet added the HTML-parsing part that uses BeautifulSoup.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from time import sleep
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
url = 'https://angel.co/life-sciences'
driver.get(url)
sleep(10)

# Sort by followers by clicking the column header (edited: fixed the class name)
driver.find_element_by_xpath('//div[@class="column followers sortable"]').click()
sleep(5)

# Load more rows by clicking the "more hidden" button
for i in range(2):
    driver.find_element_by_xpath('//div[@class="more hidden"]').click()
    sleep(8)

sleep(8)
element = driver.find_element_by_id("root").get_attribute('innerHTML')
#driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
#WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CLASS_NAME, 'more hidden')))
'''
results = html.find_elements_by_xpath('//div[@class="name"]')
# wait for the page to load

for result in results:
    startup = result.find_elements_by_xpath('.//a')
    link = startup.get_attribute('href')
    print(link)
'''
page_source = driver.page_source

html = BeautifulSoup(element, 'html.parser')
#for link in html.findAll('a', {'class': 'startup-link'}):
#    print(link)

divs = html.find_all("div", class_=" dts27 frw44 _a _jm")

The above code was working and gave me the HTML source before I added the Followers click event.

My final goal is to export all five pieces of information, namely the name of the company, its link, the joined date, the number of followers, and the company description (obtained after visiting each individual link), into a CSV or XLS file.
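
Assuming the five fields have been collected into a list of dictionaries, writing them to CSV with the standard csv module could look like this sketch (the field names and the example row are placeholders, not data from the site):

import csv

# Placeholder rows; in the real script these would come from the scraping loop.
rows = [
    {'name': 'ExampleCo', 'link': 'https://angel.co/exampleco',
     'joined': '2015-01-01', 'followers': 1234, 'description': '...'},
]

with open('companies.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(
        f, fieldnames=['name', 'link', 'joined', 'followers', 'description'])
    writer.writeheader()
    writer.writerows(rows)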

Help and comments are appreciated. This is my first Python and Selenium work, so I am a little confused and need guidance.

Thanks :-)

Accepted answer

#1

The click method is intended to emulate a mouse click; it is for use on elements that can be clicked, such as buttons, drop-down lists, check boxes, and so on. You have applied this method to a div element, which is not clickable. Elements like div, span, frame and so on are used to organise the HTML and provide for decoration of fonts, etc.

To make this code work, you will need to identify the elements in the page that are actually clickable.
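
One way to act on that advice is to let Selenium wait until it considers an element clickable before calling click(); if the div itself never qualifies, inspect the page and target the inner link or button that actually handles the click. A minimal sketch, reusing the XPath from the question:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get('https://angel.co/life-sciences')

# Wait until the element is visible and enabled; element_to_be_clickable
# raises TimeoutException if nothing qualifies within 30 seconds.
wait = WebDriverWait(driver, 30)
sort_button = wait.until(
    EC.element_to_be_clickable(
        (By.XPATH, '//div[@class="column followers sortable"]')
    )
)
sort_button.click()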
