Python Selenium, scraping LinkedIn: looping through work and education history

Posted 2024-04-19 17:27:53


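I am using Selenium in Python to scrape data from LinkedIn profiles. It mostly works, but I can't figure out how to extract the information for each employer or school in the history sections.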

I am following this tutorial: https://www.linkedin.com/pulse/how-easy-scraping-data-from-linkedin-profiles-david-craven/

I am looking at this profile: https://www.linkedin.com/in/pauljgarner/?originalSubdomain=uk

Here is a fragment of the part of the HTML I am working with:

<section id="experience-section" class="pv-profile-section experience-section ember-view"><header class="pv-profile-section__card-header">
  <h2 class="pv-profile-section__card-heading">
    Experience
  </h2>

<!----></header>

  <ul class="pv-profile-section__section-info section-info pv-profile-section__section-info--has-more">
<li id="ember136" class="pv-entity__position-group-pager pv-profile-section__list-item ember-view">        <section id="1762786165" class="pv-profile-section__card-item-v2 pv-profile-section pv-position-entity ember-view">  <div class="display-flex justify-space-between full-width">
    <div class="display-flex flex-column full-width">
<a data-control-name="background_details_company" href="/company/wagestream/" id="ember138" class="full-width ember-view">          <div class="pv-entity__logo company-logo">
  <img src="https://media-exp1.licdn.com/dms/image/C560BAQEkzVWoORqWFQ/company-logo_100_100/0/1615996325297?e=1631145600&amp;v=beta&amp;t=SoZQKV09PqqYxYTzbjqV4XTJa7HkGUZRe4QT0jU5hmE" loading="lazy" alt="Wagestream" id="ember140" class="pv-entity__logo-img EntityPhoto-square-5 lazy-image ember-view">
</div>
<div class="pv-entity__summary-info pv-entity__summary-info--background-section ">
  <h3 class="t-16 t-black t-bold">Senior Software Engineer</h3>
  <p class="visually-hidden">Company Name</p>
  <p class="pv-entity__secondary-title t-14 t-black t-normal">
      Wagestream
        <span class="pv-entity__secondary-title separator">Full-time</span>
  </p>
    <div class="display-flex">
    <h4 class="pv-entity__date-range t-14 t-black--light t-normal">
      <span class="visually-hidden">Dates Employed</span>
      <span>Apr 2021 – Present</span>
    </h4>
      <h4 class="t-14 t-black--light t-normal">
        <span class="visually-hidden">Employment Duration</span>
        <span class="pv-entity__bullet-item-v2">3 mos</span>
      </h4>
  </div>

  <h4 class="pv-entity__location t-14 t-black--light t-normal block">
    <span class="visually-hidden">Location</span>
    <span>London, England, United Kingdom</span>
  </h4>
<!---->
</div>

</a>
<!---->    </div>

<!---->  </div>
</section>

This is followed by more "li" sections. So the whole history section can be identified by id="experience-section".

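I am trying to get the job title, company, years worked and so on from this section, but can't figure out how to do it. Here is some Python code showing what I have tried (login skipped):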

from parsel import Selector
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import requests

path = r'C:\Program Files (x86)\chromedriver_win32\chromedriver.exe'
driver = webdriver.Chrome(path)

# driver.get method() will navigate to a page given by the URL address
driver.get('https://www.linkedin.com/in/pauljgarner/?originalSubdomain=uk')

text=driver.page_source
sel = Selector(text) 

# Using the "Copy xPath" option in Inspect in Google Chrome, I can manually extract the company name
sel.xpath('//*[@id="ember187"]/div[2]/p[2]/text()').extract_first()  

# This will give me all of the text in the Work Experience section
html_list = driver.find_element_by_id("experience-section")
items = html_list.find_elements_by_tag_name("ul")
items = html_list.find_elements_by_tag_name("h3")
for item in items:
    print(type(item))
    text = item.text
    print(text)

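However, these approaches are not good for automatically and systematically extracting the information for each job across profiles. What I want to do is loop over the "li" sections inside each "ul" section and, within each "li", extract only the company name, which has class="pv-entity__secondary-title t-14 t-black t-normal". But finding elements by class name only yields NoneType.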

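Conceptually, I am not sure how to use Selenium to produce an iterable list of "ul" and "li" elements and, on each iteration, extract a specific piece of text by class name.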


1 answer
User
Answer 1 · posted 2024-04-19 17:27:53

I figured out a solution. I should note that I "cross-posted" this in a comment on the following YouTube tutorial: https://www.youtube.com/watch?v=W4Md-koupmE

Run the whole code, but substitute your own email and password.

First, open the browser, log in to LinkedIn, and navigate to the relevant profile:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import requests
from time import sleep

# Path to the chromedriver.exe
path = r'C:\Program Files (x86)\chromedriver_win32\chromedriver.exe'
driver = webdriver.Chrome(path)

driver.get('https://www.linkedin.com')

# Log into LinkedIn
username = driver.find_element_by_id('session_key')
username.send_keys('mail@mail.com')

sleep(0.5)

password = driver.find_element_by_id('session_password')
password.send_keys('password')

sleep(0.5)

log_in_button = driver.find_element_by_class_name('sign-in-form__submit-button')
log_in_button.click()

sleep(3)

# The example profile I am trying to scrape
driver.get('https://www.linkedin.com/in/pauljgarner/?originalSubdomain=uk')
sleep(3)
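The login code above types the email and password directly into the script. One possible alternative, sketched here under the assumption that you are free to set environment variables, is to read them at run time. The helper name `load_credentials` and the variable names `LINKEDIN_EMAIL` and `LINKEDIN_PASSWORD` are made up for this illustration and are not part of Selenium or LinkedIn.

```python
import os


def load_credentials(env=os.environ):
    """Return (email, password) read from the environment.

    LINKEDIN_EMAIL and LINKEDIN_PASSWORD are hypothetical variable names
    chosen for this sketch; use whatever names suit your setup.
    """
    try:
        return env["LINKEDIN_EMAIL"], env["LINKEDIN_PASSWORD"]
    except KeyError as missing:
        # Fail early with a readable message instead of sending empty keys
        raise RuntimeError(
            f"Set the {missing.args[0]} environment variable first"
        ) from None
```

With this in place, `email, password = load_credentials()` could feed `username.send_keys(email)` and `password_field.send_keys(password)` instead of the literal strings.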

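If I try to extract things right away, I get an error. It turns out I need to scroll down to the relevant section for it to load; otherwise the data is never created: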

# The experience section doesn't load until you scroll to it, this will scroll to the section
l= driver.find_element_by_xpath('//*[@id="oc-background-section"]')
driver.execute_script("arguments[0].scrollIntoView(true);", l)
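Scrolling only triggers the load, so the list items may show up a moment later. Selenium ships its own `WebDriverWait` for this situation; the library-free sketch below only illustrates the underlying idea of polling until something is found. The helper name `wait_for` and its default timings are assumptions made for this example.

```python
import time


def wait_for(find, timeout=10.0, poll=0.5):
    """Call find() repeatedly until it returns something truthy.

    `find` is any zero-argument callable, for example a lambda wrapping a
    Selenium find_elements call (which returns an empty list on no match).
    Raises TimeoutError if nothing is found within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = find()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"nothing found within {timeout} seconds")
        time.sleep(poll)
```

A possible use with the Selenium 3 style calls from this answer would be `wait_for(lambda: driver.find_elements_by_css_selector("#experience-section li"))`, which returns the list of "li" elements as soon as at least one exists.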

To loop through the work experience, I first identify its "id" value, in this case "experience-section", and fetch it with the find_element_by_id method:

# Get stuff in work experience section
html_list = driver.find_element_by_id("experience-section")

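This section contains a list of "li" elements (i.e. tag name "li"), each holding all the information for one past job. Use find_elements_by_tag_name to create a list of these WebElement objects: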

# Jobs listed as li sections, create list of li 
items = html_list.find_elements_by_tag_name("li")

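Looking at the source, I noticed that, for example, the employer name can be identified by the tag "p". This yields a list that sometimes contains more than one item, so make sure you pick the one you need: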

x = items[0].find_elements_by_tag_name("p")
print(x[0].text)
# "Company Name"
print(x[1].text)
# "Wagestream Full-time"
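The second "p" returns the company name and the employment type as one string, because in the HTML shown in the question "Full-time" sits in a nested span with class "pv-entity__secondary-title separator". A minimal sketch of separating the two, assuming you have fetched the span's text separately (for instance from the first result of `find_elements_by_class_name("separator")` on the same item); `split_company` is a hypothetical helper name:

```python
def split_company(p_text, separator_text=""):
    """Split the <p> text into (company, employment_type).

    `separator_text` is the text of the nested span (e.g. "Full-time").
    If it is empty or not found at the end, the whole text is the company.
    """
    # Collapse the line breaks and indentation the page markup may leave
    p_text = " ".join(p_text.split())
    separator_text = separator_text.strip()
    if separator_text and p_text.endswith(separator_text):
        company = p_text[: -len(separator_text)].strip()
        return company, separator_text
    return p_text, ""
```

For the example profile, `split_company("Wagestream Full-time", "Full-time")` yields the company and the employment type as two separate values.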

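Finally, loop through the "li" sections, extract the relevant elements, pull out their strings, and print the information you need (or save it as rows in a CSV):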

# Loop through li list, extract each piece by tag name
for item in items:
    name_job = item.find_elements_by_tag_name("h3")
    name_emp = item.find_elements_by_tag_name("p")
    more = item.find_elements_by_tag_name("h4")
    job = name_job[0].text
    emp = name_emp[1].text
    # Keep only the second line of the text (the first line is the hidden label)
    yrs = more[0].text.split('\n')[1]
    loc = more[2].text.split('\n')[1]
    
    print(job)
    print(emp)
    print(yrs)
    print(loc)

# terminates the application
driver.quit()
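The loop above indexes straight into each list, so a job entry without a location or date range would raise an IndexError, and the "save as rows in a CSV" option is only mentioned in passing. The sketch below shows one way both could be handled. It assumes the same tag positions as the answer (second "p" for the employer, first and third "h4" for dates and location) and the same Selenium 3 style `find_elements_by_tag_name` calls; in Selenium 4.3 and later those helpers were removed in favour of `find_elements(By.TAG_NAME, ...)`. The function names and column names are invented for this example.

```python
import csv

FIELDS = ["job", "employer", "dates", "location"]


def visible_line(element_text):
    """Return the line after the hidden label ("Dates Employed", ...), or ""."""
    lines = [line.strip() for line in element_text.split("\n") if line.strip()]
    return lines[1] if len(lines) > 1 else ""


def text_at(elements, index):
    """Return elements[index].text, or "" when the list is too short."""
    return elements[index].text if index < len(elements) else ""


def extract_job(item):
    """Build one row from an <li> element, tolerating missing pieces."""
    h3 = item.find_elements_by_tag_name("h3")
    p = item.find_elements_by_tag_name("p")
    h4 = item.find_elements_by_tag_name("h4")
    return {
        "job": text_at(h3, 0),
        "employer": text_at(p, 1),
        "dates": visible_line(text_at(h4, 0)),
        "location": visible_line(text_at(h4, 2)),
    }


def write_jobs_csv(rows, path):
    """Write the extracted rows to a CSV file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
```

A call such as `write_jobs_csv([extract_job(item) for item in items], "jobs.csv")` would need to run before `driver.quit()`, since the elements can no longer be read once the browser session has ended.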
