Iterating over a list of URLs with bs4 in Python
I have a .txt file (named test_1.txt) formatted as follows:
https://maps.googleapis.com/maps/api/directions/xml?origin=Bethesda,MD&destination=Washington,DC&sensor=false&mode=walking
https://maps.googleapis.com/maps/api/directions/xml?origin=Miami,FL&destination=Mobile,AL&sensor=false&mode=walking
https://maps.googleapis.com/maps/api/directions/xml?origin=Chicago,IL&destination=Scranton,PA&sensor=false&mode=walking
https://maps.googleapis.com/maps/api/directions/xml?origin=Baltimore,MD&destination=Charlotte,NC&sensor=false&mode=walking
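If you open any of the links above, you will see output in XML format. The code below gets as far as the second directions request (Miami to Mobile), but the data it prints looks random and is not what I want. I can get the code to work, printing exactly the data I need, when I put a single URL directly in the code instead of reading it from the .txt file. Is there a reason why it only visits the second URL and prints the wrong information? Here is the Python code: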
import urllib2
from bs4 import BeautifulSoup

with open('test_1.txt', 'r') as f:
    f.readline()
    mapcalc = f.readline()
    response = urllib2.urlopen(mapcalc)
    soup = BeautifulSoup(response)

for leg in soup.select('route > leg'):
    duration = leg.duration.text.strip()
    distance = leg.distance.text.strip()
    start = leg.start_address.text.strip()
    end = leg.end_address.text.strip()
    print duration
    print distance
    print start
    print end
Edit:
This is the output of the Python code in the shell:
56
1 min
77
253 ft
Miami, FL, USA
Mobile, AL, USA
1 Answer
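Here is a link that should help you understand what happens when you open a file and read lines from it, which relates to Lev Levitsky's comment. In short, every call to f.readline() moves the file forward by one line, so the first call in your code reads and discards the first URL, and mapcalc only ever holds the second one.
One approach is: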
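To make the line-reading behaviour concrete, here is a minimal sketch that needs no network access. It uses an in-memory file (`io.StringIO`) in place of test_1.txt, and the URLs, `second_line_only`, and `all_lines` are hypothetical names made up for illustration. It is written for Python 3, unlike the Python 2 code in the question.

```python
import io

# Hypothetical stand-in for test_1.txt: one placeholder URL per line.
SAMPLE = (
    "https://example.com/a\n"
    "https://example.com/b\n"
    "https://example.com/c\n"
)

def second_line_only(f):
    """Mimics the question's code: two readline() calls in a row."""
    f.readline()                 # reads line 1 and throws it away
    return f.readline().strip()  # returns line 2 only

def all_lines(f):
    """Iterating over the file object yields every line exactly once."""
    return [line.strip() for line in f if line.strip()]

print(second_line_only(io.StringIO(SAMPLE)))  # https://example.com/b
print(all_lines(io.StringIO(SAMPLE)))
```

The first function shows why only the second URL is ever requested; the second shows the pattern that visits all of them.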
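import httplib2
from bs4 import BeautifulSoup

http = httplib2.Http()

with open('test_1.txt', 'r') as f:
    # Iterating over the file object yields one line (one URL) per pass.
    for mapcalc in f:
        # Each line ends with a newline; strip it before making the request.
        status, response = http.request(mapcalc.strip())
        soup = BeautifulSoup(response)
        # Same selector as in the question: every <leg> directly under a <route>.
        for leg in soup.select('route > leg'):
            duration = leg.duration.text.strip()
            distance = leg.distance.text.strip()
            start = leg.start_address.text.strip()
            end = leg.end_address.text.strip()
            print duration
            print distance
            print start
            print end
# No f.close() is needed: the with block closes the file on exit.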
I am still fairly new to this sort of thing, but I got the code above to run and it produced the following output:
4877
1 hour 21 mins
6582
4.1 mi
Bethesda, MD, USA
Washington, DC, USA
56
1 min
77
253 ft
Miami, FL, USA
Mobile, AL, USA
190
3 mins
269
0.2 mi
Chicago, IL, USA
Scranton, PA, USA
12
1 min
15
49 ft
Baltimore, MD, USA
Charlotte, NC, USA
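A note on the "wrong information" part of the question: figures such as 1 min and 49 ft for Baltimore to Charlotte look like the values for the first step of the route rather than for the whole trip. One possible explanation, assuming the response nests step elements (each with its own duration and distance) inside every leg ahead of the leg's own totals, is that `leg.duration` in Beautiful Soup returns the first `duration` tag anywhere below `leg`. The sketch below illustrates the difference on a small hand-written XML sample. The sample and the numbers 900 and 1200 are invented, `leg_summaries` is a hypothetical helper, and the code is Python 3 with the built-in `html.parser`.

```python
from bs4 import BeautifulSoup

# Hand-written sample shaped like the assumed response; values are invented.
SAMPLE_XML = """
<DirectionsResponse>
 <route>
  <leg>
   <step>
    <duration><value>56</value><text>1 min</text></duration>
    <distance><value>77</value><text>253 ft</text></distance>
   </step>
   <duration><value>900</value><text>15 mins</text></duration>
   <distance><value>1200</value><text>0.7 mi</text></distance>
   <start_address>Miami, FL, USA</start_address>
   <end_address>Mobile, AL, USA</end_address>
  </leg>
 </route>
</DirectionsResponse>
"""

def leg_summaries(xml_text):
    """Return, per leg, the first nested duration and the leg's own duration."""
    soup = BeautifulSoup(xml_text, "html.parser")
    results = []
    for leg in soup.select("route > leg"):
        # leg.duration is shorthand for leg.find("duration"): it searches all
        # descendants, so it lands on the <duration> inside the first <step>.
        first_nested = leg.duration
        # recursive=False restricts the search to direct children of <leg>.
        own = leg.find("duration", recursive=False)
        results.append({
            "first_nested_seconds": first_nested.find("value").get_text(strip=True),
            "leg_seconds": own.find("value").get_text(strip=True),
            "start": leg.find("start_address", recursive=False).get_text(strip=True),
            "end": leg.find("end_address", recursive=False).get_text(strip=True),
        })
    return results

for summary in leg_summaries(SAMPLE_XML):
    print(summary)
```

If the real responses are shaped this way, swapping `leg.duration` and `leg.distance` for `leg.find(..., recursive=False)` in the loop above should give the totals for each trip instead of the first step.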