Saving Google Play Crawler Results to CSV

Published 2024-05-13 20:32:16


I made a simple crawler that pulls Google Play package names, e.g. com.viber.voip, from a CSV file and then goes to the full link, e.g. https://play.google.com/store/apps/details?id=com.viber.voip&hl=en.

It then crawls the title, publisher, downloads, and so on, and stores them in a list. The problem is that when I try to save the results to a CSV file, exporting with pandas gives me an error, or it throws a UnicodeError when it hits an unknown character. I tried adding .encode()/.decode() calls, but that didn't work. Can anyone help?

import bs4 as bs
import urllib.request
import pandas as pd
import csv

def searcher(bundles):
    html = urllib.request.urlopen(base_url+bundles+post_url).read()
    soup = bs.BeautifulSoup(html, 'html.parser')
    title_app = soup.title.get_text()
    publisher_name = soup.find('a', {'class':'document-subtitle primary'}).get_text()
    category = soup.find('a', {'class':'document-subtitle category'}).get_text()
    ratings = soup.find('meta', {'itemprop':'ratingValue'}).get('content')
    reviews = soup.find('span', {'class':'reviews-num'}).get_text()
    downloads = soup.find('div', {'itemprop':'numDownloads'}).get_text()
    updated_last_time = soup.find('div', {'class':'content'}).get_text()
    text = (bundles, title_app, publisher_name, category, ratings, reviews, downloads, updated_last_time)
    return (text)

def store(crawled_data):
    writer = csv.writer(f)
    labels = ['bundles', 'title_app', 'publisher_name', 'category', 'ratings', 'reviews', 'downloads', 'updated_last_time']
    writer.writerow(labels)
    df = pd.DataFrame(crawled_data)
    for row in df:
        if row != None:
            writer.writerow(row)

base_url = 'https://play.google.com/store/apps/details?id='
post_url = '&hl=en'
crawled_data = []
crawled_packages = 0

with open('links.csv', 'r') as f:
    df = pd.read_csv(f)
    urls = df['URLs']
    for bundles in urls:
        if bundles != None:
            aaa = searcher(bundles)
            print(crawled_packages)
            crawled_packages += 1
            if crawled_data != None:
                crawled_data.append(aaa)

store(crawled_data)
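
A side note on the UnicodeError mentioned above: in Python 3 the usual fix is to open the output file with an explicit UTF-8 encoding, rather than calling .encode()/.decode() on individual fields. A minimal sketch of that pattern (the file name, column names, and sample row are illustrative):

import csv

# An explicit encoding on the file handle keeps non-ASCII app titles from
# raising UnicodeEncodeError; newline='' is the csv-module convention that
# avoids blank lines on Windows.
rows = [('com.viber.voip', 'Viber', '500,000,000+')]  # illustrative data
with open('output.csv', 'w', encoding='utf-8', newline='') as out:
    writer = csv.writer(out)
    writer.writerow(['bundle', 'title_app', 'downloads'])
    writer.writerows(rows)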

Tags: csv, store, text, import, com, url, data, get
1 Answer
Forum user
#1 · Posted 2024-05-13 20:32:16

The error in your original store() comes from csv.writer(f): f is the links.csv handle, which was opened for reading and has already been closed by the time store() is called. Instead, use to_csv() and point it at the output file you want to write; also pass the column names when building the DataFrame:

import bs4 as bs
import urllib.request
import pandas as pd

def searcher(bundles):
    html = urllib.request.urlopen(base_url+bundles+post_url).read()
    soup = bs.BeautifulSoup(html, 'html.parser')
    title_app = soup.title.get_text()

    publisher_name = soup.find('a', {'class':'document-subtitle primary'}).get_text(strip=True)
    category = soup.find('a', {'class':'document-subtitle category'}).get_text(strip=True)
    ratings = soup.find('meta', {'itemprop':'ratingValue'}).get('content')
    reviews = soup.find('span', {'class':'reviews-num'}).get_text(strip=True)
    downloads = soup.find('div', {'itemprop':'numDownloads'}).get_text(strip=True)
    updated_last_time = soup.find('div', {'class':'content'}).get_text(strip=True)
    return (bundles, title_app, publisher_name, category, ratings, reviews, downloads, updated_last_time)


def store(crawled_data):
    labels = ['bundles', 'title_app', 'publisher_name', 'category', 'ratings', 'reviews', 'downloads', 'updated_last_time']
    df = pd.DataFrame(crawled_data, columns=labels)
    df.to_csv('output.csv', index=False)


base_url = 'https://play.google.com/store/apps/details?id='
post_url = '&hl=en'
crawled_data = []
crawled_packages = 0

with open('links.csv', 'r') as f:
    df = pd.read_csv(f)
    urls = df['URLs']
    for bundles in urls:
        if pd.notna(bundles):  # skip empty cells, which pandas reads as NaN
            crawled_data.append(searcher(bundles))
            crawled_packages += 1
            print(crawled_packages)

store(crawled_data)

This will give you an output.csv file containing the header row followed by one row per crawled package.
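
If to_csv() itself raises a UnicodeError on your system, it also accepts an explicit encoding. A minimal variant, where 'utf-8-sig' is an assumption chosen only so that Excel detects the encoding:

# 'utf-8-sig' writes a BOM so Excel opens the file cleanly; plain 'utf-8'
# works for most other consumers.
df.to_csv('output.csv', index=False, encoding='utf-8-sig')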

