Python Web Scrape: Writing the Output to a File

import requests
from bs4 import BeautifulSoup as BS
import json

data='C:/test.json'
url="http://sfbay.craigslist.org/search/sby/sss?sort=rel&query=baby"
r=requests.get(url)
soup=BS(r.content)
links=soup.find_all("p")
#print soup.prettify()
for link in links:
    connections=link.text
    f=open(data,'a')
    f.write(json.dumps(connections,indent=1))
    f.close()

3 Answers

User

Answer 1 · edited 2024-04-26 00:37:43

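1. Decide exactly what data you want. Price? Description? Listing date?
2. Decide on a good data structure to hold that information. I recommend a class with the relevant fields, or a list.
3. Use regular expressions, or one of the many other methods, to extract the data you need.
4. Throw away what you don't need.

5a. Write the list contents to a file in a format you can use later (XML, comma-separated, etc.)

or

5b. Pickle the objects, following Mike Ounsworth's suggestion.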
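Steps 2 and 5a can be sketched roughly as follows. This is only an illustration, not the answerer's own code: the `Listing` class, its field names, the helper functions, and the sample values are all assumptions, and JSON is used as the "format you can use later" simply because the question already imports `json`.

```python
import json
import os
import tempfile
from dataclasses import dataclass, asdict


@dataclass
class Listing:
    """One scraped listing; the field names are illustrative assumptions."""
    price: int
    list_date: str
    description: str
    location: str


def save_listings(listings, path):
    """Write the listings to `path` as a JSON array of objects."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump([asdict(item) for item in listings], f, indent=1)


def load_listings(path):
    """Read the JSON array back and rebuild Listing objects."""
    with open(path, encoding="utf-8") as f:
        return [Listing(**record) for record in json.load(f)]


# Hypothetical sample data standing in for values pulled from the page.
listings = [
    Listing(50, "Sep 5", "Baby Carrier - Black/Silver", "san jose"),
    Listing(25, "Sep 5", "Porcelain Baby Deer", "sunnyvale"),
]

# A temporary directory keeps the sketch from touching real files.
path = os.path.join(tempfile.mkdtemp(), "listings.json")
save_listings(listings, path)
restored = load_listings(path)
print(restored[0].description)
```

Keeping only the fields declared on the class is also one way to handle step 4: anything you never assign is simply not stored.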

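If you are not yet comfortable with XML parsing, just write one line per link and separate the fields you want with a single character, so that you can split them apart later. E.g.: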

import re  # I'm going to use regular expressions here

# Raw strings keep the backslashes intact for the regex engine.
# The price uses [0-9], not [1-9]: a price such as $50 contains a zero.
link_content_matcher = re.compile(
    r"\$(?P<price>[0-9]{1,4})\s+"
    r"(?P<list_date>[A-Z][a-z]{2}\s+[0-9]{1,2})\s+"
    r"(?P<description>.*?)\s*"
    r"\((?P<location>.*)\)"
)

some_link = "$50    Sep 5 Baby Carrier - Black/Silver (san jose)"

# Grab the matches (search returns None if the line does not fit the pattern)
matched_fields = link_content_matcher.search(some_link)

# Write what you want to a file using a delimiter that
# probably won't exist in the description. This is risky,
# but will do in a pinch.
with open('results.txt', 'w') as output_file:
    output_file.write("{price}^{date}^{desc}^{location}\n".format(
        price=matched_fields.group('price'),
        date=matched_fields.group('list_date'),
        desc=matched_fields.group('description'),
        location=matched_fields.group('location')))

When you want to revisit the data, read it from the file line by line and parse each line with split.
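Reading the file back might look something like this. It is a minimal sketch that assumes the `^`-delimited layout written above (price, date, description, location) and that no field contains a `^`. The temporary path and the dictionary keys are illustrative choices.

```python
import os
import tempfile

# Recreate a one-line results file in the format written above.
path = os.path.join(tempfile.mkdtemp(), "results.txt")
with open(path, "w") as output_file:
    output_file.write("50^Sep 5^Baby Carrier - Black/Silver^san jose\n")

records = []
with open(path) as input_file:
    for line in input_file:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Assumes exactly four fields per line, in the order they were written.
        price, date, desc, location = line.split("^")
        records.append(
            {"price": price, "date": date, "desc": desc, "location": location}
        )

print(records)
```

Every value comes back as a string, so convert the price with `int(price)` if you need to do arithmetic on it.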


User

Answer 2 · edited 2024-04-26 00:37:43

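It sounds like your question is more about how to parse the data you get from craigslist than about how to work with files. One approach is to take each <p> element and tokenize its text on spaces. For example, the string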

"$25 Sep 5 Porcelain Baby Deer $25 (sunnyvale) pic household items - by owner"

can be tokenized with split:

s = " $25 Sep 5 Porcelain Baby Deer $25 (sunnyvale) pic household items - by owner "
L = s.strip().split(' ') #remove whitespace at ends and break string apart by spaces

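L is now a list containing the values

['$25', 'Sep', '5', 'Porcelain', 'Baby', 'Deer', '$25', '(sunnyvale)', 'pic', 'household', 'items', '-', 'by', 'owner']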

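From here, you can try to work out what the list elements mean from the order in which they appear. L[0] probably always holds the price, L[1] the month, L[2] the day of the month, and so on. If you are interested in writing these values to a file and parsing them again later, consider reading up on the csv module.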
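As a rough illustration of that idea, the sketch below reads the first three tokens by position and round-trips them through the csv module. The column names and the decision to lump everything after the day into a single `rest` column are assumptions made for the example, and an in-memory buffer stands in for a real file.

```python
import csv
import io

s = " $25 Sep 5 Porcelain Baby Deer $25 (sunnyvale) pic household items - by owner "
L = s.strip().split(' ')

# Assumed positions: price first, then month, then day of the month.
price, month, day = L[0], L[1], L[2]
rest = " ".join(L[3:])  # everything else, kept together for now

# io.StringIO stands in for a file opened with open(path, "w", newline="").
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["price", "month", "day", "rest"])
writer.writerow([price, month, day, rest])

# Read the rows back; csv.reader undoes any quoting the writer applied.
buffer.seek(0)
rows = list(csv.reader(buffer))
print(rows[1])
```

The csv module quotes fields that contain commas or quotes, which avoids the risk of the delimiter showing up inside a description.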

User

Answer 3 · edited 2024-04-26 00:37:43

If you want to write the data to a file from Python and later read it back into Python, you can use Pickle: Pickle Tutorial.
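A minimal pickle round trip could look like the following. The `connections` list and the temporary path are made-up stand-ins for whatever your scraper collects.

```python
import os
import pickle
import tempfile

# Hypothetical scraped strings standing in for the real link texts.
connections = [
    "$50 Sep 5 Baby Carrier - Black/Silver (san jose)",
    "$25 Sep 5 Porcelain Baby Deer $25 (sunnyvale)",
]

path = os.path.join(tempfile.mkdtemp(), "connections.pkl")

# Pickle data is bytes, so the file must be opened in binary mode.
with open(path, "wb") as f:
    pickle.dump(connections, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == connections)
```

Only unpickle files you created yourself or otherwise trust, since loading a pickle can execute arbitrary code.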

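Pickle files are binary and not human-readable. If that matters to you, take a look at yaml. I admit it has a bit of a learning curve, but it produces nicely formatted files.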

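import yaml  # PyYAML, a third-party package

# filename and data are assumed to be defined earlier in your script
with open(filename, 'w') as f:
    f.write(yaml.dump(data))

...

# safe_load builds only plain Python types; yaml.load without a
# Loader argument raises an error on PyYAML 6 and later
with open(filename, 'r') as stream:
    data = yaml.safe_load(stream)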
