I wrote a simple crawler that reads package names (e.g. com.viber.voip) from a CSV file, builds the full Google Play link, e.g. https://play.google.com/store/apps/details?id=com.viber.voip&hl=en, and then scrapes the title, publisher, downloads, etc. and stores them in a list. The problem is that when I try to save the results to a CSV file, exporting to CSV with pandas gives me an error, or a UnicodeError is raised when an unknown character is encountered. I tried adding .encode()/.decode() but it didn't work. Can anyone help?
import bs4 as bs
import urllib.request
import pandas as pd
import csv

base_url = 'https://play.google.com/store/apps/details?id='
post_url = '&hl=en'

def searcher(bundles):
    html = urllib.request.urlopen(base_url + bundles + post_url).read()
    soup = bs.BeautifulSoup(html, 'html.parser')
    title_app = soup.title.get_text()
    publisher_name = soup.find('a', {'class': 'document-subtitle primary'}).get_text()
    category = soup.find('a', {'class': 'document-subtitle category'}).get_text()
    ratings = soup.find('meta', {'itemprop': 'ratingValue'}).get('content')
    reviews = soup.find('span', {'class': 'reviews-num'}).get_text()
    downloads = soup.find('div', {'itemprop': 'numDownloads'}).get_text()
    updated_last_time = soup.find('div', {'class': 'content'}).get_text()
    return (bundles, title_app, publisher_name, category, ratings,
            reviews, downloads, updated_last_time)

def store(crawled_data):
    labels = ['bundles', 'title_app', 'publisher_name', 'category',
              'ratings', 'reviews', 'downloads', 'updated_last_time']
    # Open the output file explicitly as UTF-8 to avoid UnicodeError on
    # non-ASCII characters; newline='' prevents blank rows on Windows
    with open('output.csv', 'w', newline='', encoding='utf-8') as out:
        writer = csv.writer(out)
        writer.writerow(labels)
        for row in crawled_data:
            if row is not None:
                writer.writerow(row)

crawled_data = []
crawled_packages = 0
with open('links.csv', 'r') as f:
    df = pd.read_csv(f)
urls = df['URLs']
for bundles in urls:
    if bundles is not None:
        crawled_data.append(searcher(bundles))
        crawled_packages += 1
        print(crawled_packages)

store(crawled_data)
You can use to_csv() and specify the output file to write to. You should also specify the column names when constructing the DataFrame; this will give you an output.csv file with properly labeled columns.
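A minimal sketch of this approach (the sample row and the output.csv filename are illustrative; the column labels mirror the tuple returned by searcher()):

```python
import pandas as pd

# Column labels matching the order of fields in each scraped tuple
labels = ['bundles', 'title_app', 'publisher_name', 'category',
          'ratings', 'reviews', 'downloads', 'updated_last_time']

# Hypothetical crawled data standing in for the real scrape results
crawled_data = [
    ('com.viber.voip', 'Viber', 'Viber Media', 'Communication',
     '4.3', '10,000,000', '500,000,000+', 'June 1, 2017'),
]

# Build the DataFrame with explicit column names, then export.
# to_csv writes UTF-8 by default in Python 3, which sidesteps the
# UnicodeError seen when writing rows manually with csv.writer.
df = pd.DataFrame(crawled_data, columns=labels)
df.to_csv('output.csv', index=False, encoding='utf-8')
```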