How do I dump the output to a .txt file?

Posted 2024-03-28 14:22:19


I want to write a simple program that extracts the URLs from a website and then dumps them into a .txt file.

The code below runs fine, but when I try to dump the output to a file I get an error.

from bs4 import BeautifulSoup, SoupStrainer
import requests

url = "https://stackoverflow.com"

page = requests.get(url)
data = page.text
soup = BeautifulSoup(data, "html.parser")  # name a parser explicitly to avoid the bs4 warning
cr = r'C:\Users\Admin\Desktop\extracted.txt'  # raw string: \U in a plain string is an escape error

for link in soup.find_all('a'):
    print(link.get('href'))

I tried:

with open(cr, 'w') as f:
    for link in soup.find_all('a'):
        print(link.get('href'))
        f.write(link.get('href'))

It dumps some of the links, but not all of them, and they all end up on one line (then I get `TypeError: expected a string or other character buffer object`).
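For context (my reconstruction of the failure, not part of the original question): `find_all('a')` also matches `<a>` tags that have no `href` attribute, so `link.get('href')` returns `None` for those, and passing `None` to `f.write()` raises exactly that `TypeError`. The single-line output comes from `write()` not appending a newline. A stdlib-only sketch, using `io.StringIO` as a stand-in for the opened file:

```python
import io

buf = io.StringIO()          # stands in for the opened .txt file
buf.write("/questions")      # a str is fine, but write() adds no newline,
buf.write("/teams")          # so successive links run together on one line
print(buf.getvalue())        # -> /questions/teams

try:
    buf.write(None)          # what happens when link.get('href') is None
except TypeError as exc:
    print("TypeError:", exc)
```

This is why the fixes below either filter out `None` values or catch the exception, and append `'\n'` on each write.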

The result in the .txt should look like this:

/teams/customers
/teams/use-cases
/questions
/teams
/enterprise
https://www.stackoverflowbusiness.com/talent
https://www.stackoverflowbusiness.com/advertising
https://stackoverflow.com/users/login?ssrc=head&returnurl=https%3a%2f%2fstackoverflow.com%2f
https://stackoverflow.com/users/signup?ssrc=head&returnurl=%2fusers%2fstory%2fcurrent
https://stackoverflow.com
https://stackoverflow.com
https://stackoverflow.com/help
https://chat.stackoverflow.com
https://meta.stackoverflow.com
https://stackoverflow.com/users/signup?ssrc=site_switcher&returnurl=%2fusers%2fstory%2fcurrent
https://stackoverflow.com/users/login?ssrc=site_switcher&returnurl=https%3a%2f%2fstackoverflow.com%2f
https://stackexchange.com/sites
https://stackoverflow.blog
https://stackoverflow.com/legal/cookie-policy
https://stackoverflow.com/legal/privacy-policy
https://stackoverflow.com/legal/terms-of-service/public
from bs4 import BeautifulSoup, SoupStrainer
import requests

url = "https://stackoverflow.com"

page = requests.get(url)
data = page.text
soup = BeautifulSoup(data, "html.parser")
cr = r'C:\Users\Admin\Desktop\crawler\extracted.txt'

with open(cr, 'w') as f:
    for link in soup.find_all('a'):
        print(link.get('href'))
        f.write(link.get('href'))

3 Answers
from bs4 import BeautifulSoup, SoupStrainer
import requests

url = "https://stackoverflow.com"

page = requests.get(url)
data = page.text
soup = BeautifulSoup(data, "html.parser")
cr = r'C:\Users\Admin\Desktop\extracted.txt'
links = []

for link in soup.find_all('a'):
    print(link.get('href'))
    if link.get('href'):
        links.append(link.get('href'))


with open(cr, 'w') as f:
    for link in links:
        print(link)
        f.write(link + '\n')

So... it's just as Simon Fink said. But I found another way too:

with open(cr, 'w') as f:
    for link in soup.find_all('a'):
        print(link.get('href'))
        try:
            f.write(link.get('href') + '\n')
        except TypeError:  # href is None for <a> tags without an href attribute
            continue

But I think the approach Simon Fink suggested is better. Thanks a lot!

Try this:

with open(cr, 'w') as f:
    for link in soup.find_all('a'):
        link_text = link.get('href')
        if link_text is not None:
            print(link_text)
            f.write(link_text + '\n')
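A more compact variant (my own sketch, not from the thread): `find_all('a', href=True)` filters out tags without an `href` up front, so no `None` check is needed, and `writelines` handles all the lines in one call. The HTML string and output filename here are made-up stand-ins; in the question the HTML would come from `requests.get(url).text`:

```python
from bs4 import BeautifulSoup

# Stand-in HTML for illustration; one <a> deliberately has no href
html = '<a href="/questions">Q</a><a name="top">no href</a><a href="/teams">T</a>'
soup = BeautifulSoup(html, "html.parser")

# href=True makes find_all skip <a> tags that lack an href attribute,
# so every item in the list is a str and safe to write
hrefs = [a["href"] for a in soup.find_all("a", href=True)]

with open("extracted.txt", "w") as f:  # hypothetical output path
    f.writelines(h + "\n" for h in hrefs)
```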
