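I want to write a simple program that extracts URLs from a website and dumps them into a .txt file.
The code below works fine, but when I try to dump the output to a file I get errors.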
from bs4 import BeautifulSoup, SoupStrainer
import requests

url = "https://stackoverflow.com"
page = requests.get(url)
data = page.text
soup = BeautifulSoup(data)
cr = r'C:\Users\Admin\Desktop\extracted.txt'
for link in soup.find_all('a'):
    print(link.get('href'))
I tried:
with open(cr, 'w') as f:
    for link in soup.find_all('a'):
        print(link.get('href'))
        f.write(link.get('href'))
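It dumps some of the links, not all of them, and they all end up on one line (I get TypeError: expected a string or other character buffer object).
The result in the .txt file should look like this: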
/teams/customers
/teams/use-cases
/questions
/teams
/enterprise
https://www.stackoverflowbusiness.com/talent
https://www.stackoverflowbusiness.com/advertising
https://stackoverflow.com/users/login?ssrc=head&returnurl=https%3a%2f%2fstackoverflow.com%2f
https://stackoverflow.com/users/signup?ssrc=head&returnurl=%2fusers%2fstory%2fcurrent
https://stackoverflow.com
https://stackoverflow.com
https://stackoverflow.com/help
https://chat.stackoverflow.com
https://meta.stackoverflow.com
https://stackoverflow.com/users/signup?ssrc=site_switcher&returnurl=%2fusers%2fstory%2fcurrent
https://stackoverflow.com/users/login?ssrc=site_switcher&returnurl=https%3a%2f%2fstackoverflow.com%2f
https://stackexchange.com/sites
https://stackoverflow.blog
https://stackoverflow.com/legal/cookie-policy
https://stackoverflow.com/legal/privacy-policy
https://stackoverflow.com/legal/terms-of-service/public
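from bs4 import BeautifulSoup
import requests

url = "https://stackoverflow.com"
page = requests.get(url)
data = page.text
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(data, 'html.parser')
# Raw string keeps the backslashes in the Windows path literal
cr = r'C:\Users\Admin\Desktop\crawler\extracted.txt'
with open(cr, 'w') as f:
    for link in soup.find_all('a'):
        href = link.get('href')
        if href is None:  # <a> tags without an href would break f.write()
            continue
        print(href)
        f.write(href + '\n')  # one link per line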
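A minimal sketch of why the attempt in the question fails, and one way to avoid it. `link.get('href')` returns `None` for an `<a>` tag that has no `href` attribute, and `f.write()` accepts only strings, so the loop stops with a TypeError at the first such tag. `write()` also adds no line break, which is why the links written before that point end up on one line. The helper name `write_hrefs`, the demo file name, and the sample values are made up for illustration; in the real script the values would come from `link.get('href')`. The `encoding` argument assumes Python 3.

```python
import os
import tempfile


def write_hrefs(hrefs, path):
    """Write each non-empty href on its own line; return how many were written."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for href in hrefs:
            if not href:  # None (tag had no href attribute) or empty string
                continue
            f.write(href + "\n")  # write() adds no newline by itself
            count += 1
    return count


# Hypothetical sample standing in for the values of link.get('href')
sample = ["/questions", None, "https://stackoverflow.com/help"]
out_path = os.path.join(tempfile.gettempdir(), "extracted_demo.txt")
written = write_hrefs(sample, out_path)

# Read the file back to check that there is one link per line
with open(out_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
```

Skipping `None` values rather than converting them with `str()` keeps the literal text "None" out of the output file.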
So... it is just as Simon Fink said. I did find another way, but I think the approach Simon Fink suggested is better. Thank you very much.
Try this: