Python get-links script needs a wildcard search

Posted 2024-04-20 05:26:51


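I have the code below: when you give it the URL of a page with a bunch of links, it returns the list of links. This works fine, except that I only want links that start with ... and this returns every link, including home/back/etc. Is there a way to use a wildcard or a "starts with" function?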

from bs4 import BeautifulSoup
import requests

url = ""

# Getting the webpage, creating a Response object.
response = requests.get(url)

# Extracting the source code of the page.
data = response.text

# Passing the source code to BeautifulSoup to create a BeautifulSoup object for it.
soup = BeautifulSoup(data, 'lxml')

# Extracting all the <a> tags into a list.
tags = soup.find_all('a')

# Extracting URLs from the attribute href in the <a> tags.
for tag in tags:
    print(tag.get('href'))

Also, is there a way to export to Excel? I am not very good with Python, and honestly I am not sure how I got this far.

Thanks


3 answers

Regarding your second question, whether there is a way to export to Excel: I have been using the Python module XlsxWriter.

import xlsxwriter

# Create a workbook and add a worksheet.
workbook = xlsxwriter.Workbook('Expenses01.xlsx')
worksheet = workbook.add_worksheet()

# Some data we want to write to the worksheet.
expenses = (
    ['Rent', 1000],
    ['Gas',   100],
    ['Food',  300],
    ['Gym',    50],
)

# Start from the first cell. Rows and columns are zero indexed.
row = 0
col = 0

# Iterate over the data and write it out row by row.
for item, cost in (expenses):
    worksheet.write(row, col,     item)
    worksheet.write(row, col + 1, cost)
    row += 1

# Write a total using a formula.
worksheet.write(row, 0, 'Total')
worksheet.write(row, 1, '=SUM(B1:B4)')

workbook.close()

XlsxWriter lets the code follow basic Excel conventions. I am new to Python, and it was easy to set up, run, and get working on the first try.
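
To tie this back to the question, here is a minimal sketch of writing a list of links to a worksheet, one per row. The `links` list, the `write_links_to_xlsx` helper, and the `links.xlsx` file name are all hypothetical and only for illustration; in practice the list would come from the scraping loop.

```python
import os
import tempfile

import xlsxwriter

# Hypothetical sample data standing in for the scraper's output.
links = [
    "https://example.com/a",
    "https://example.com/b",
]


def write_links_to_xlsx(links, path):
    """Write one link per row under a 'URL' header; return the number of links written."""
    workbook = xlsxwriter.Workbook(path)
    worksheet = workbook.add_worksheet()

    # Header in the first cell. Rows and columns are zero indexed.
    worksheet.write(0, 0, "URL")

    # write_string() stores each link as plain text instead of
    # letting write() convert it into a clickable hyperlink.
    for row, link in enumerate(links, start=1):
        worksheet.write_string(row, 0, link)

    workbook.close()
    return len(links)


# Write to a temporary directory so the example does not clutter the working directory.
out_path = os.path.join(tempfile.mkdtemp(), "links.xlsx")
rows_written = write_links_to_xlsx(links, out_path)
print(rows_written, "links written to", out_path)
```

If clickable links are preferred, `worksheet.write_url(row, 0, link)` could be used in place of `write_string`.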

Here is an updated version of your code that gets all the https hrefs from the page:

from bs4 import BeautifulSoup
import requests

url = "https://www.google.com"

# Getting the webpage, creating a Response object.
response = requests.get(url)

# Extracting the source code of the page.
data = response.text

# Passing the source code to BeautifulSoup to create a BeautifulSoup object for it.
soup = BeautifulSoup(data, 'lxml')

# Extracting all the <a> tags into a list.
tags = soup.find_all('a')

# Extracting URLs from the attribute href in the <a> tags.
for tag in tags:
    href = tag.get('href')  # None when the <a> tag has no href attribute
    if href and href.startswith('https'):
        print(href)

If you want hrefs that start with something other than https, change the prefix in the second-to-last line :)

Reference: https://www.tutorialspoint.com/python/string_startswith.htm

You can use startswith():

for tag in tags:
    href = tag.get('href')  # None when the <a> tag has no href attribute
    if href and href.startswith('pre'):
        print(href)
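
As an alternative to filtering inside the loop, BeautifulSoup can apply the pattern during the search itself, which is closer to the "wildcard" the question asks about. The sketch below is only an illustration: the sample HTML and the `links_starting_with` helper are made up, and it uses the built-in `html.parser` so that no extra parser is needed.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample page, used here instead of a live request.
html = """
<a href="https://example.com/a">A</a>
<a href="/home">Home</a>
<a name="top">No href</a>
<a href="https://example.com/b">B</a>
<a href="http://example.com/c">C</a>
"""


def links_starting_with(html_text, prefix):
    """Return every href in html_text that begins with prefix."""
    soup = BeautifulSoup(html_text, "html.parser")

    # "^" anchors the match at the start of the attribute value;
    # re.escape keeps characters such as "." or "/" literal.
    pattern = re.compile("^" + re.escape(prefix))

    # Passing the compiled pattern as href= keeps only <a> tags whose
    # href matches it; tags without an href are skipped.
    return [tag["href"] for tag in soup.find_all("a", href=pattern)]


https_links = links_starting_with(html, "https")
print(https_links)
```

For a looser wildcard, the pattern can be any regular expression, for example `re.compile(r"^https://.*\.pdf$")` to keep only https links that end in `.pdf`.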
