Python Selenium, scraping LinkedIn: looping through work and education history

<section id="experience-section" class="pv-profile-section experience-section ember-view">
  <header class="pv-profile-section__card-header">
    <h2 class="pv-profile-section__card-heading"> Experience </h2>
  </header>
  <ul class="pv-profile-section__section-info section-info pv-profile-section__section-info--has-more">
    <li id="ember136" class="pv-entity__position-group-pager pv-profile-section__list-item ember-view">
      <section id="1762786165" class="pv-profile-section__card-item-v2 pv-profile-section pv-position-entity ember-view">
        <div class="display-flex justify-space-between full-width">
          <div class="display-flex flex-column full-width">
            <a data-control-name="background_details_company" href="/company/wagestream/" id="ember138" class="full-width ember-view">
              <div class="pv-entity__logo company-logo">
                <img src="https://media-exp1.licdn.com/dms/image/C560BAQEkzVWoORqWFQ/company-logo_100_100/0/1615996325297?e=1631145600&v=beta&t=SoZQKV09PqqYxYTzbjqV4XTJa7HkGUZRe4QT0jU5hmE" loading="lazy" alt="Wagestream" id="ember140" class="pv-entity__logo-img EntityPhoto-square-5 lazy-image ember-view">
              </div>
              <div class="pv-entity__summary-info pv-entity__summary-info--background-section ">
                <h3 class="t-16 t-black t-bold">Senior Software Engineer</h3>
                <p class="visually-hidden">Company Name</p>
                <p class="pv-entity__secondary-title t-14 t-black t-normal"> Wagestream <span class="pv-entity__secondary-title separator">Full-time</span> </p>
                <div class="display-flex">
                  <h4 class="pv-entity__date-range t-14 t-black--light t-normal"> <span class="visually-hidden">Dates Employed</span> <span>Apr 2021 – Present</span> </h4>
                  <h4 class="t-14 t-black--light t-normal"> <span class="visually-hidden">Employment Duration</span> <span class="pv-entity__bullet-item-v2">3 mos</span> </h4>
                </div>
                <h4 class="pv-entity__location t-14 t-black--light t-normal block"> <span class="visually-hidden">Location</span> <span>London, England, United Kingdom</span> </h4>
              </div>
            </a>
          </div>
        </div>
      </section>

from parsel import Selector
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import requests

path = r'C:\Program Files (x86)\chromedriver_win32\chromedriver.exe'
driver = webdriver.Chrome(path)

# driver.get() navigates to the page at the given URL
driver.get('https://www.linkedin.com/in/pauljgarner/?originalSubdomain=uk')

text = driver.page_source
sel = Selector(text)

# Using the "Copy XPath" option in Inspect in Google Chrome, I can manually extract the company name
sel.xpath('//*[@id="ember187"]/div[2]/p[2]/text()').extract_first()

# This will give me all of the text in the Work Experience section
html_list = driver.find_element_by_id("experience-section")
items = html_list.find_elements_by_tag_name("ul")
items = html_list.find_elements_by_tag_name("h3")

for item in items:
    print(type(item))
    text = item.text
    print(text)

1 answer

User

#1 · Posted 2024-04-19 17:27:53

I worked out a solution. I should point out that I also cross-posted it in the comments on the following YouTube tutorial: https://www.youtube.com/watch?v=W4Md-koupmE

Run the whole code, but substitute your own email and password.

First, open the browser, log in to LinkedIn, and navigate to the relevant profile:

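from selenium import webdriver
from time import sleep

# Note: this uses the Selenium 3 API. The find_element_by_* helpers were removed
# in Selenium 4.3; newer versions use driver.find_element(By.ID, ...) instead
# and pass the driver path through a Service object.

# Path to the chromedriver.exe
path = r'C:\Program Files (x86)\chromedriver_win32\chromedriver.exe'
driver = webdriver.Chrome(path)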

driver.get('https://www.linkedin.com')

# Log into LinkedIn
username = driver.find_element_by_id('session_key')
username.send_keys('mail@mail.com')

sleep(0.5)

password = driver.find_element_by_id('session_password')
password.send_keys('password')

sleep(0.5)

log_in_button = driver.find_element_by_class_name('sign-in-form__submit-button')
log_in_button.click()

sleep(3)

# The example profile I am trying to scrape
driver.get('https://www.linkedin.com/in/pauljgarner/?originalSubdomain=uk')
sleep(3)
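
Since the login step needs a real email and password, one option is to keep them out of the script and read them from environment variables. This is only a sketch: `load_credentials` and the variable names `LINKEDIN_EMAIL` and `LINKEDIN_PASSWORD` are made up for illustration and are not part of the original answer.

```python
import os


def load_credentials(env=os.environ):
    """Return (email, password) read from a mapping such as os.environ.

    LINKEDIN_EMAIL and LINKEDIN_PASSWORD are hypothetical names chosen
    for this sketch; use whatever names suit your setup.
    """
    email = env.get("LINKEDIN_EMAIL")
    password = env.get("LINKEDIN_PASSWORD")
    if not email or not password:
        raise RuntimeError("Set LINKEDIN_EMAIL and LINKEDIN_PASSWORD first")
    return email, password
```

The two returned values could then be passed to the `send_keys` calls in place of the hard-coded strings.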

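If I try to extract things right away, I get an error. It turns out I need to scroll down to the relevant section so that it loads; otherwise the data is never created: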

# The experience section doesn't load until you scroll to it; this scrolls to the section
background_section = driver.find_element_by_xpath('//*[@id="oc-background-section"]')
driver.execute_script("arguments[0].scrollIntoView(true);", background_section)
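
The fixed `sleep()` calls used so far are a guess at how long the page needs to load. An alternative is to poll until the element actually exists. The helper below is a plain-Python sketch, not part of Selenium; `wait_until` and its parameters are names invented for illustration. Selenium's own `WebDriverWait` (in `selenium.webdriver.support.ui`) serves the same purpose.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed. Exceptions raised by condition() count as "not ready".
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = condition()
            if result:
                return result
        except Exception:
            pass  # e.g. the element does not exist yet
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)
```

With the `driver` from the login code, it might be used as `wait_until(lambda: driver.find_element_by_id("experience-section"))`, which keeps retrying while the lookup raises an exception.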

To loop through the work experience, I first identify its "id" value, in this case "experience-section". Use the find_element_by_id method to get it:

# Get stuff in work experience section
html_list = driver.find_element_by_id("experience-section")

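This section contains a list of "li" elements (i.e. tag name "li"), each of which holds all the information for one past job. Use find_elements_by_tag_name to create a list of these WebElement objects: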

# Jobs listed as li sections, create list of li 
items = html_list.find_elements_by_tag_name("li")

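Looking at the page source, I noticed that the employer name, for example, can be identified by the tag "p". This produces a list, which sometimes contains more than one item. Make sure you pick the one you need: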

x = items[0].find_elements_by_tag_name("p")
print(x[0].text)
# "Company Name"
print(x[1].text)
# "Wagestream Full-time"
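
To see which index holds which piece of text without opening a browser, the HTML from the question can be inspected offline. The sketch below uses only the standard library's `html.parser`; `SAMPLE` is a trimmed copy of one job entry, and `TagTextCollector` and `texts_for_tag` are hypothetical helper names, not Selenium APIs.

```python
from html.parser import HTMLParser

# Trimmed copy of one job entry from the question's HTML
SAMPLE = """
<li>
  <h3 class="t-16 t-black t-bold">Senior Software Engineer</h3>
  <p class="visually-hidden">Company Name</p>
  <p class="pv-entity__secondary-title t-14 t-black t-normal"> Wagestream
    <span class="pv-entity__secondary-title separator">Full-time</span> </p>
</li>
"""


class TagTextCollector(HTMLParser):
    """Collect the whitespace-normalised text of every <tag> element."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0    # > 0 while inside a matching element
        self.parts = []   # text fragments of the current element
        self.texts = []   # finished text, one entry per element

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1
            if self.depth == 1:
                self.parts = []

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.texts.append(" ".join("".join(self.parts).split()))

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def texts_for_tag(html, tag):
    """Return the text of each <tag> element found in the html string."""
    parser = TagTextCollector(tag)
    parser.feed(html)
    parser.close()
    return parser.texts


print(texts_for_tag(SAMPLE, "p"))
# ['Company Name', 'Wagestream Full-time']
```

The first "p" is the hidden label and the second is the employer, which is why index 1 is the one to use.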

Finally, loop through the "li" sections, extract the relevant information as strings, and print what you need (or save it as a row in a CSV):

# Loop through li list, extract each piece by tag name
for item in items:
    name_job = item.find_elements_by_tag_name("h3")
    name_emp = item.find_elements_by_tag_name("p")
    more = item.find_elements_by_tag_name("h4")
    job = name_job[0].text
    emp = name_emp[1].text
    # Each h4's text is "label\nvalue"; keep only the value on the second line
    yrs = more[0].text.split('\n')[1]
    loc = more[2].text.split('\n')[1]

    print(job)
    print(emp)
    print(yrs)
    print(loc)

# terminates the application
driver.quit()
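
For the "save it as a row in a CSV" option, a minimal sketch could look like the following. It assumes each job has already been collected into a dict; `value_after_label`, `jobs_to_csv`, and the column names are hypothetical and chosen only for this example. `value_after_label` is a more forgiving version of the `split('\n')[1]` step, returning a default instead of raising `IndexError` when an entry has no second line.

```python
import csv
import io


def value_after_label(text, default=""):
    """Return the second non-empty line of 'label/value' text, or default."""
    lines = [line.strip() for line in text.split("\n") if line.strip()]
    return lines[1] if len(lines) > 1 else default


def jobs_to_csv(jobs, fileobj):
    """Write a list of job dicts to fileobj as CSV, with a header row."""
    fieldnames = ["job", "employer", "dates", "location"]
    writer = csv.DictWriter(fileobj, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(jobs)


# Example with values taken from the HTML in the question
jobs = [{
    "job": "Senior Software Engineer",
    "employer": "Wagestream Full-time",
    "dates": value_after_label("Dates Employed\nApr 2021 – Present"),
    "location": value_after_label("Location\nLondon, England, United Kingdom"),
}]
buffer = io.StringIO()
jobs_to_csv(jobs, buffer)
print(buffer.getvalue())
```

To write to disk instead of an in-memory buffer, open the file with `newline=""` and an explicit encoding, for example `open("jobs.csv", "w", newline="", encoding="utf-8")`, and pass it as `fileobj`.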
