How to scrape all topics from Twitter

Posted 2024-04-26 05:52:13


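All the topics on Twitter can be found at this link. I want to scrape every one of them, together with all the subcategories inside each main category.

BeautifulSoup does not seem to be of much use here. I tried Selenium, but I don't know how to match the XPaths that appear after clicking a main category.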

from selenium import webdriver
from selenium.common import exceptions

url = 'https://twitter.com/i/flow/topics_selector'
driver = webdriver.Chrome('absolute path to chromedriver')
driver.get(url)
driver.maximize_window()

main_topics = driver.find_elements_by_xpath('/html/body/div[1]/div/div/div[1]/div[2]/div/div/div/div/div/div[2]/div[2]/div/div/div[2]/div[2]/div/div/div/div/span')

topics = {}
for main_topic in main_topics[2:]:
    print(main_topic.text.strip())
    topics[main_topic.text.strip()] = {}

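I know I can click a main category with main_topics[3].click(), but I don't know how to keep clicking recursively until I am left with only the categories that have a Follow button on the right.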


2 answers

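To scrape all the main topics (e.g. Arts & culture, Business & finance, etc.) using Selenium, you have to induce WebDriverWait for visibility_of_all_elements_located(), and you can use either of the following Locator Strategies: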

  • Using XPATH and the text attribute:

    driver.get("https://twitter.com/i/flow/topics_selector")
    print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//span[contains(., 'see top Tweets about them in your timeline')]//following::div[@role='button']/div/span")))])
    
  • Using XPATH and get_attribute():

    driver.get("https://twitter.com/i/flow/topics_selector")
    print([my_elem.get_attribute("textContent") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//span[contains(., 'see top Tweets about them in your timeline')]//following::div[@role='button']/div/span")))])
    
  • Console output:

    ['Arts & culture', 'Business & finance', 'Careers', 'Entertainment', 'Fashion & beauty', 'Food', 'Gaming', 'Lifestyle', 'Movies and TV', 'Music', 'News', 'Outdoors', 'Science', 'Sports', 'Technology', 'Travel']
    

To scrape all the subtopics using Selenium and WebDriver, you can use the following Locator Strategy:

  • Using XPATH and get_attribute("textContent"):

    driver.get("https://twitter.com/i/flow/topics_selector")
    elements =  WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//span[contains(., 'see top Tweets about them in your timeline')]//following::div[@role='button']/div/span")))
    for element in elements:
        element.click()
    print([my_elem.get_attribute("textContent") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@role='button']/div/span[text()]")))])
    driver.quit()
    
  • Console output:

    ['Arts & culture', 'Animation', 'Art', 'Books', 'Dance', 'Horoscope', 'Theater', 'Writing', 'Business & finance', 'Business personalities', 'Business professions', 'Cryptocurrencies', 'Careers', 'Education', 'Fields of study', 'Entertainment', 'Celebrities', 'Comedy', 'Digital creators', 'Entertainment brands', 'Podcasts', 'Popular franchises', 'Theater', 'Fashion & beauty', 'Beauty', 'Fashion', 'Food', 'Cooking', 'Cuisines', 'Gaming', 'Esports', 'Game development', 'Gaming hardware', 'Gaming personalities', 'Tabletop gaming', 'Video games', 'Lifestyle', 'Animals', 'At home', 'Collectibles', 'Family', 'Fitness', 'Unexplained phenomena', 'Movies and TV', 'Movies', 'Television', 'Music', 'Alternative', 'Bollywood music', 'C-pop', 'Classical music', 'Country music', 'Dance music', 'Electronic music', 'Hip-hop & rap', 'J-pop', 'K-hip hop', 'K-pop', 'Metal', 'Musical instruments', 'Pop', 'R&B and soul', 'Radio stations', 'Reggae', 'Reggaeton', 'Rock', 'World music', 'News', 'COVID-19', 'Local news', 'Social movements', 'Outdoors', 'Science', 'Biology', 'Sports', 'American football', 'Australian rules football', 'Auto racing', 'Baseball', 'Basketball', 'Combat Sports', 'Cricket', 'Extreme sports', 'Fantasy sports', 'Football', 'Golf', 'Gymnastics', 'Hockey', 'Lacrosse', 'Pub sports', 'Rugby', 'Sports icons', 'Sports journalists & coaches', 'Tennis', 'Track & field', 'Water sports', 'Winter sports', 'Technology', 'Computer programming', 'Cryptocurrencies', 'Data science', 'Information security', 'Operating system', 'Tech brands', 'Tech personalities', 'Travel', 'Adventure travel', 'Destinations', 'Transportation']
    
  • Note: you have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
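
The console output above is a single flat list in which each main topic is followed by its subtopics, whereas the question builds a topics dictionary keyed by main topic. A minimal sketch of regrouping the flat list, assuming the names really do come back in that order and that no subtopic has the same name as a main topic (group_subtopics and the shortened sample lists are hypothetical, not part of the original answer):

```python
def group_subtopics(flat_names, main_names):
    """Split a flat, ordered list of names into {main topic: [subtopics]}.

    Assumes every main topic appears in flat_names immediately before
    its own subtopics, as in the console output shown above.
    """
    main_set = set(main_names)
    grouped = {}
    current = None
    for name in flat_names:
        if name in main_set:
            # A main topic starts a new group.
            current = name
            grouped.setdefault(current, [])
        elif current is not None:
            # Anything else belongs to the most recent main topic.
            grouped[current].append(name)
    return grouped


# Shortened sample data taken from the two console outputs above.
main_names = ['Arts & culture', 'Business & finance', 'Careers']
flat_names = ['Arts & culture', 'Animation', 'Art',
              'Business & finance', 'Cryptocurrencies',
              'Careers', 'Education', 'Fields of study']

topics = group_subtopics(flat_names, main_names)
print(topics)
```

In practice main_names would be the list printed by the first snippet and flat_names the list printed by the second one.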
    

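Take a look at how XPath works. You can write just //element[@attribute="foo"] instead of spelling out the whole path. Be careful, because the main topics and the subtopics (which become visible after clicking a main topic) share the same class names. That is what was causing the error. Here is how I clicked the subtopics, though I'm sure there is a better way:

I found the topic elements using: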

topics = WebDriverWait(browser, 5).until(
        EC.presence_of_all_elements_located((By.XPATH, '//div[@class="css-901oao r-13gxpu9 r-1qd0xha r-1b6yd1w r-1vr29t4 r-ad9z0x r-bcqeeo r-qvutc0"]'))
    )

Then I created an empty list named:

main_topics = []

Then I looped over the topics, appended each element.text to the main_topics list, and clicked each element to expand that main topic:

for topic in topics:
    main_topics.append(topic.text)
    topic.click()

Then I created a new variable named sub_topics (it now holds all of the expanded topics):

sub_topics = WebDriverWait(browser, 5).until(
        EC.presence_of_all_elements_located((By.XPATH, '//span[@class="css-901oao css-16my406 r-1qd0xha r-ad9z0x r-bcqeeo r-qvutc0"]'))
    )

Then I created two more lists, an empty one for the results and one holding the strings to skip:

subs_list = []

skip_these_words = ["Done", "Follow your favorite Topics", "You’ll see top Tweets about them in your timeline. Don’t see your favorite Topics yet? New Topics are added every week.", "Follow"]

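Then I looped over sub_topics with a for loop and used an if statement to append element.text to subs_list only when the text is in neither main_topics nor skip_these_words. I did this to filter out the main topics and the unnecessary text at the top, since all these darn elements share the same class names. Finally, I clicked each subtopic. That last part is confusing, so here is an example: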

for sub in sub_topics:
    if sub.text not in main_topics and sub.text not in skip_these_words:
        subs_list.append(sub.text)
        sub.click()

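There are also some hidden subtopics. See whether you can click the remaining subtopics, then see whether you can find the Follow button elements and click each of them.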
