Extracting URLs with BeautifulSoup: the same URL is repeated

0 votes

1 answer

564 views

Asked 2025-04-18 11:21

I am trying to extract URLs from a web page using BeautifulSoup and regular expressions. Here is my code:

Ref_pattern = re.compile('<TD width="200"><A href="(.*?)" target=')
Ref_data = Ref_pattern.search(web_page)
if Ref_data:
    Ref_data.group(1)
data = [item for item in csv.reader(output_file)]
new_column1 = ["Reference", Ref_data.group(1)]
new_data = []
for i, item in enumerate(data):
    try:
        item.append(new_column1[i])
    except IndexError, e:
        item.append(Ref_data.group(1)).next()
    new_data.append(item)

Although the page contains many URLs, my code only repeats the first one. I know the problem is in this part,

except IndexError, e:
    item.append(Ref_data.group(1)).next()

because if I remove it, I get only the first URL (without repetition). Could you help me extract all the URLs and write them to a CSV file? Thanks.

regex url-extraction web-scraping beautifulsoup csv

1 Answer

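It isn't entirely clear what you are after, but from your description, if the links you want share a distinguishing feature (such as a class name, an ID, or their text), you could try something like this: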

from bs4 import BeautifulSoup

string = """\
        <a href="http://example.com">Linked Text</a>
        <a href="http://example.com/link" class="pooper">Linked Text</a>
        <a href="http://example.com/page" class="pooper">Image</a>
        <a href="http://anotherexmpaple.com/page">Phone Number</a>"""

# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(string, 'html.parser')

# Keep only <a> tags that have an href, the class "pooper", and the text "Linked Text".
for link in soup.find_all('a', class_='pooper', href=True, string='Linked Text'):
    print(link['href'])

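As you can see, I used bs4's attribute filtering to select only the link tags that have the class "pooper" (class="pooper"). I then narrowed the results further by also matching on the link text (selecting "Linked Text" rather than "Image").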
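To make the narrowing concrete, here is a small sketch that runs the same sample markup through three progressively stricter filters. The variable names (`html`, `all_links`, `by_class`, `by_class_and_text`) are made up for illustration, and it assumes a bs4 version that accepts the `string` argument (older releases call it `text`).

```python
from bs4 import BeautifulSoup

html = """\
<a href="http://example.com">Linked Text</a>
<a href="http://example.com/link" class="pooper">Linked Text</a>
<a href="http://example.com/page" class="pooper">Image</a>
<a href="http://anotherexmpaple.com/page">Phone Number</a>"""

soup = BeautifulSoup(html, "html.parser")

# Every <a> that has an href attribute: all four links.
all_links = [a["href"] for a in soup.find_all("a", href=True)]

# Only <a class="pooper">: the second and third links.
by_class = [a["href"] for a in soup.find_all("a", class_="pooper", href=True)]

# class="pooper" AND text exactly "Linked Text": only the second link.
by_class_and_text = [
    a["href"]
    for a in soup.find_all("a", class_="pooper", href=True, string="Linked Text")
]

print(all_links)
print(by_class)
print(by_class_and_text)
```

Each extra keyword argument is combined with the others, so a tag has to satisfy all of them to be returned.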

Based on your feedback below, try the following code and let me know how it goes.

# For each <td width="200"> cell, print the href of every <a target="_blank"> inside it.
for cell in soup.select('td[width="200"]'):
    for link in cell.find_all('a', target='_blank', href=True):
        print(link['href'])
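As for the symptom in the question: `re.search` returns only the first match, so `Ref_data.group(1)` is always the same URL, and the `except` branch appends that one value to every remaining row. Below is a minimal sketch, assuming the goal is one URL per row under a "Reference" header. The sample markup, the `write_urls` helper, and the in-memory buffer are all hypothetical stand-ins, not taken from the real page.

```python
import csv
import io
import re

# Hypothetical sample page; the real markup is an assumption
# based on the pattern shown in the question.
web_page = """
<TR><TD width="200"><A href="http://example.com/a" target="_blank">A</A></TD></TR>
<TR><TD width="200"><A href="http://example.com/b" target="_blank">B</A></TD></TR>
<TR><TD width="200"><A href="http://example.com/c" target="_blank">C</A></TD></TR>
"""

ref_pattern = re.compile(r'<TD width="200"><A href="(.*?)" target=')

# search() stops at the first match; findall() returns every captured group.
first_only = ref_pattern.search(web_page).group(1)
all_urls = ref_pattern.findall(web_page)


def write_urls(urls, out_file):
    """Write a "Reference" header followed by one URL per row."""
    writer = csv.writer(out_file)
    writer.writerow(["Reference"])
    for url in urls:
        writer.writerow([url])


# StringIO stands in for a real file opened with open(path, "w", newline="").
buffer = io.StringIO()
write_urls(all_urls, buffer)
print(buffer.getvalue())
```

A list of hrefs collected with BeautifulSoup could be passed to the same helper, which is usually more robust than matching HTML with a regular expression.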

Answered 2025-04-18 by Python大师
