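I'm scraping the names and addresses of massage therapists from a directory. Each address is saved to a single CSV column as one whole string, but each therapist's title/name is split across 2 or 3 columns, one word per column.
What do I need to change so that the extracted name is saved in a single column, the same way the addresses are? (The first two lines of code are sample HTML from the page; the code after that is an excerpt of the script that targets that element.)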
<span class="name">
<img src="/images/famt-placeholder-sm.jpg" class="thumb" alt="Tiffani D Abraham"> Tiffani D Abraham</span>
import mechanize
from lxml import html
import csv
import io
from time import sleep
def save_products(products, writer):
    # Each writerow call emits one CSV row; each list element is one column.
    for product in products:
        for price in product['prices']:
            writer.writerow([product["title"].encode('utf-8')])
            writer.writerow([price["contact"].encode('utf-8')])
            writer.writerow([price["services"].encode('utf-8')])

f_out = open('mtResult.csv', 'wb')
writer = csv.writer(f_out)
links = ["https://www.amtamassage.org/findamassage/results.html?match=exact&l=NY","https://www.amtamassage.org/findamassage/results.html?match=exact&l=NY&PageIndex=2&PageSize=10","https://www.amtamassage.org/findamassage/results.html?match=exact&l=NY&PageIndex=3&PageSize=10","https://www.amtamassage.org/findamassage/results.html?match=exact&l=NY&PageIndex=4&PageSize=10","https://www.amtamassage.org/findamassage/results.html?match=exact&l=NY&PageIndex=5&PageSize=10","https://www.amtamassage.org/findamassage/results.html?match=exact&l=NY&PageIndex=6&PageSize=10","https://www.amtamassage.org/findamassage/results.html?match=exact&l=NY&PageIndex=7&PageSize=10", "https://www.amtamassage.org/findamassage/results.html?match=exact&l=NY&PageIndex=8&PageSize=10", "https://www.amtamassage.org/findamassage/results.html?match=exact&l=NY&PageIndex=9&PageSize=10", "https://www.amtamassage.org/findamassage/results.html?match=exact&l=NY&PageIndex=10&PageSize=10" ]
br = mechanize.Browser()
for link in links:
    print(link)
    r = br.open(link)
    content = r.read()
    products = []
    tree = html.fromstring(content)
    # One <li> per therapist in the results list
    product_nodes = tree.xpath('//ul[@class="famt-results"]/li')
    for product_node in product_nodes:
        product = {}
        price_nodes = product_node.xpath('.//a')
        product['prices'] = []
        for price_node in price_nodes:
            price = {}
            # xpath() returns a list; [0] raises IndexError when nothing matches
            try:
                product['title'] = product_node.xpath('.//span[1]/text()')[0]
            except IndexError:
                product['title'] = ""
            try:
                price['services'] = price_node.xpath('./span[2]/text()')[0]
            except IndexError:
                price['services'] = ""
            try:
                price['contact'] = price_node.xpath('./span[3]/text()')[0]
            except IndexError:
                price['contact'] = ""
            product['prices'].append(price)
        products.append(product)
    save_products(products, writer)

f_out.close()
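I'm not sure whether this solves your problem, but either way there are some improvements and modifications that may interest you.
For example, since the links differ only in their page index, you can loop over the page indices instead of copying all 50 links into a list. Each therapist on a page also has its own index, so you can loop over each therapist's information through XPath as well.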
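As a rough illustration of looping over the page index, here is a minimal sketch. The names `BASE_URL` and `build_links` are hypothetical, and it assumes that every page after the first follows the `&PageIndex=N&PageSize=10` pattern seen in the question's list; the last page number is left as a parameter.

```python
# Hypothetical helper: builds the result-page URLs instead of hard-coding them.
BASE_URL = "https://www.amtamassage.org/findamassage/results.html?match=exact&l=NY"


def build_links(last_page):
    """Return the result-page URLs for pages 1..last_page."""
    links = [BASE_URL]  # page 1 is the bare URL, as in the question's list
    for page_index in range(2, last_page + 1):
        # Assumption: later pages only differ by PageIndex
        links.append("{0}&PageIndex={1}&PageSize=10".format(BASE_URL, page_index))
    return links


links = build_links(10)
```

The resulting `links` list could be dropped into the existing `for link in links:` loop unchanged.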
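The script loops through all 50 links for the given page and appears to collect all the relevant information for each therapist, wherever it is provided. Finally, it writes everything to a single CSV, with the data stored in the corresponding "Name", "technology(s)" and "Contact Info" columns, in case that was the problem you originally ran into.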
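The column layout described above can be sketched as follows. `csv.writer.writerow` writes each element of the sequence it receives as a separate column, so the name has to be passed as a single list element to stay in one column. This is only a sketch for Python 3: the function name `write_therapists`, the dictionary keys, and the placeholder service/contact values are hypothetical, and an in-memory buffer stands in for the output file.

```python
import csv
import io


def write_therapists(out_file, therapists):
    """Write one CSV row per therapist, with the full name in a single column."""
    writer = csv.writer(out_file)
    writer.writerow(["Name", "technology(s)", "Contact Info"])
    for therapist in therapists:
        # Collapse stray whitespace/newlines so the name is one clean string
        name = " ".join(therapist["name"].split())
        # One list element per column: the whole name stays in column 1
        writer.writerow([name, therapist["services"], therapist["contact"]])


# Placeholder data: only the name comes from the question's sample HTML
sample = [{"name": " Tiffani D Abraham",
           "services": "example service",
           "contact": "example contact"}]

buffer = io.StringIO()
write_therapists(buffer, sample)
csv_text = buffer.getvalue()
```

With a real file, `out_file` would be something like `open('mtResult.csv', 'w', newline='')` on Python 3.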
Hope this helps!