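I'm new to programming and Python. I'm adapting code (https://github.com/rileypredum/East-Bay-Housing-Web-Scrape/blob/master/EB_Room_Prices.ipynb) to scrape Craigslist. My goal is to retrieve and store all car postings in Chicago. I can already store the post title, posting time, price, and neighborhood. My next goal is to create a new column that holds only the vehicle's make (Toyota, Nissan, Honda, etc.) by searching the Post Title. How can I do this?
I believe this is where I would add the logic: in In [13], searching "post_title" to fill a variable "post_make".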
#build out the loop
from time import sleep
from random import randint
from warnings import warn
from time import time
from IPython.display import clear_output
from requests import get
from bs4 import BeautifulSoup
import numpy as np

#find the total number of posts to find the limit of the pagination
#(html_soup is the parsed first results page from an earlier notebook cell)
results_num = html_soup.find('div', class_='search-legend')
results_total = int(results_num.find('span', class_='totalcount').text)
#the offset parameter "s" advances in steps of 120 posts per page
pages = np.arange(0, results_total, 120)

iterations = 0
post_timing = []
post_hoods = []
post_title_texts = []
post_links = []
post_prices = []

for page in pages:
    #get request
    response = get("https://sfbay.craigslist.org/search/eby/roo?"
                   + "s="
                   + str(page)
                   + "&hasPic=1"
                   + "&availabilityMode=0")
    #pause between requests so the site is not hammered
    sleep(randint(1,5))

    #throw warning for status codes that are not 200
    if response.status_code != 200:
        warn('Request: {}; Status code: {}'.format(iterations + 1, response.status_code))

    #define the html text
    page_html = BeautifulSoup(response.text, 'html.parser')

    #define the posts (parse the page just fetched, not the first page)
    posts = page_html.find_all('li', class_='result-row')

    #extract data item-wise
    for post in posts:
        #skip posts that have no neighborhood
        if post.find('span', class_='result-hood') is not None:
            #posting date
            #grab the datetime element 0 for date and 1 for time
            post_datetime = post.find('time', class_='result-date')['datetime']
            post_timing.append(post_datetime)

            #neighborhoods
            post_hood = post.find('span', class_='result-hood').text
            post_hoods.append(post_hood)

            #title text
            post_title = post.find('a', class_='result-title hdrlnk')
            post_title_text = post_title.text
            post_title_texts.append(post_title_text)

            #post link
            post_link = post_title['href']
            post_links.append(post_link)

            #price (text of the first link in the row)
            post_price = post.a.text.strip()
            post_prices.append(post_price)

    iterations += 1
    print("Finished iteration: " + str(iterations))
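I'm trying to figure out how to display the output.
The current output in Excel is: posted, neighborhood, post title, URL, price
My goal is to add "post make" after the price.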
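One simple way to get a make column is to search each title for a list of known makes. The following is a minimal sketch, not the notebook author's method: `KNOWN_MAKES`, `ALIASES`, and `extract_make` are hypothetical names, and the list of makes is deliberately short and would need extending for real Chicago data.

```python
import re

# Hypothetical, deliberately short list of makes; extend it for real data.
KNOWN_MAKES = ["Toyota", "Nissan", "Honda", "Ford", "Chevrolet", "Mercedes-Benz"]

# Hypothetical aliases that map common shorthand to a canonical make.
ALIASES = {"chevy": "Chevrolet", "mercedes": "Mercedes-Benz", "vw": "Volkswagen"}

# Lower-cased search term -> canonical make name.
_LOOKUP = {make.lower(): make for make in KNOWN_MAKES}
_LOOKUP.update(ALIASES)


def _build_pattern(terms):
    # Put longer terms first so "mercedes-benz" is tried before "mercedes".
    ordered = sorted(terms, key=len, reverse=True)
    alternatives = "|".join(re.escape(term) for term in ordered)
    # \b keeps "ford" from matching inside a word such as "affordable".
    return re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)


_PATTERN = _build_pattern(_LOOKUP)


def extract_make(title):
    """Return the canonical make of the leftmost match in a title, or None."""
    match = _PATTERN.search(title)
    return _LOOKUP[match.group(1).lower()] if match else None


# Stand-in titles; in the notebook this would be post_title_texts.
titles = ["2015 TOYOTA Camry LE", "Clean chevy silverado", "Low miles, runs great"]
post_makes = [extract_make(title) for title in titles]
print(post_makes)
```

Inside the scraping loop, the same function could be called right after `post_title_text` is read, appending the result to a `post_makes` list alongside the other lists. Titles with no recognised make yield `None`, which keeps every list the same length.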
I'm also looking for suggestions on how to display the output from a Jupyter notebook.
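For viewing results inside the notebook, a common approach is to collect the lists into a pandas DataFrame. The sketch below assumes pandas is installed and uses made-up stand-in rows; `post_makes` and the column names are assumptions chosen to mirror the Excel columns listed above.

```python
import pandas as pd

# Stand-in rows; in the notebook these lists are filled by the scraping loop.
post_timing = ["2019-06-01 09:15", "2019-06-01 10:40"]
post_hoods = [" (Lincoln Park)", " (Evanston)"]
post_title_texts = ["2015 Toyota Camry LE", "Nissan Altima 2012"]
post_links = ["https://example.org/post/1", "https://example.org/post/2"]
post_prices = ["$9,500", "$4200"]
post_makes = ["Toyota", "Nissan"]  # hypothetical list, one make per title

# Column order follows the existing export, with the make added after price.
cars = pd.DataFrame({
    "posted": post_timing,
    "neighborhood": post_hoods,
    "post title": post_title_texts,
    "URL": post_links,
    "price": post_prices,
    "post make": post_makes,
})

# Strip "$" and thousands separators so price can be treated as a number.
cars["price"] = (
    cars["price"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(int)
)

# Remove the surrounding spaces and parentheses from the neighborhood text.
cars["neighborhood"] = cars["neighborhood"].str.strip(" ()")

# In Jupyter, ending a cell with `cars.head()` renders a formatted table;
# print() works anywhere, including plain scripts.
print(cars.head())
```

Calling `cars.to_csv("chicago_cars.csv", index=False)` (the file name is only an example) would write the same table to a file that Excel can open.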
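Pulling the make out is fairly tricky. I tried another package, spaCy, to extract entities tagged as organizations (car companies). It isn't perfect, but it's a start: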
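The general idea can be sketched as follows. This is a reconstruction under stated assumptions, not the answerer's original code: it assumes spaCy and its small English model `en_core_web_sm` are installed, and `load_pipeline`, `org_entities`, and `first_org` are hypothetical helper names. What the model tags as an organization varies by title and model version, so no output is shown.

```python
def load_pipeline(model_name="en_core_web_sm"):
    """Load a spaCy pipeline (assumes the named model has been downloaded)."""
    # Imported lazily because spaCy is a heavy, optional dependency.
    import spacy
    return spacy.load(model_name)


def org_entities(doc):
    """Return the text of every entity labelled ORG in a processed document."""
    return [ent.text for ent in doc.ents if ent.label_ == "ORG"]


def first_org(title, nlp):
    """Run the pipeline on a title and return its first ORG entity, or None."""
    orgs = org_entities(nlp(title))
    return orgs[0] if orgs else None


# Usage (requires spaCy and the model to be installed):
# nlp = load_pipeline()
# post_makes = [first_org(title, nlp) for title in post_title_texts]
```

Because entity recognition is statistical, makes can be missed or mislabelled, especially in short, all-caps, or misspelled titles, so the results are worth spot-checking against the raw titles.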