I have a spreadsheet of patent numbers that I'm pulling extra data for from Google Patents, the USPTO website, and a few other sources. Most of it is running fine, but there's one thing I've been stuck on all day. When I fetch the page source from the USPTO site, it sometimes gives me the whole thing and works beautifully, but other times it only gives me roughly the second half (and what I'm looking for is in the first half). Any ideas?
I've searched around quite a bit and haven't seen anyone else with this problem. Here's the relevant chunk of code (it has some redundancy since I've been tinkering with it for a while, but I'm sure that's the least of its problems):
from bs4 import BeautifulSoup
import html5lib
import re
import csv
import urllib.request
import requests

# Base URLs for Google Patents and the USPTO full-text search
gpatbase = "https://www.google.com/patents/US"
ptobase = "http://patft.uspto.gov/netacgi/nph-Parser?Sect2=PTO1&Sect2=HITOFF&p=1&u=/netahtml/PTO/search-bool.html&r=1&f=G&l=50&d=PALL&RefSrch=yes&Query=PN/"

# Bring in the patent numbers and define the writer we'll use to add the new info we get
with open(r'C:\Users\Filepathblahblahblah\Patent Data\scrapeThese.csv', newline='') as csvfile:
    patreader = csv.reader(csvfile)
    writer = csv.writer(csvfile)
    for row in patreader:
        patnum = row[0]
        #print(row)
        print(patnum)
        # Take each patent number and append it to the base URL to get the actual page
        gpaturl = gpatbase + patnum
        ptourl = ptobase + patnum
        gpatreq = requests.get(gpaturl)
        gpatsource = gpatreq.text
        soup = BeautifulSoup(gpatsource, "html5lib")
        # Find the number of academic citations on that patent
        # From the Google Patents page, find the link labeled USPTO and extract its URL
        for tag in soup.find_all("a"):
            if tag.next_element == "USPTO":
                uspto_link = tag.get('href')
        #uspto_link = ptourl
        requested = urllib.request.urlopen(uspto_link)
        source = requested.read()
        pto_soup = BeautifulSoup(source, "html5lib")
        print(uspto_link)
        # From the USPTO page, find the examiner's name and save it
        for italics in pto_soup.find_all("i"):
            if italics.next_element == "Primary Examiner:":
                prim = italics.next_element
            else:
                prim = "Not found"
        if prim != "Not found":
            examiner = prim.next_element
        else:
            examiner = "Not found"
        print(examiner)
So far it's been roughly 50/50 whether I get the examiner's name or "Not found", and I can't see anything the members of either group have in common, so I'm out of ideas.
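One way to narrow this down would be to check, right after each download, whether the page even contains the marker the parser relies on. This is a minimal diagnostic helper I'd suggest (the function and key names are illustrative, not from the scraper above), which could be called on `source` inside the loop:

```python
def diagnose(source):
    """Summarize whether a fetched USPTO page looks complete.

    Illustrative heuristic: a truncated response that starts mid-document
    will usually be short and missing the opening <html> tag.
    """
    return {
        "length": len(source),
        "has_html_open": "<html" in source.lower(),
        "has_examiner": "Primary Examiner" in source,
    }
```

Logging these three values for the patents that succeed versus the ones that fail would show whether the "Not found" cases really are the truncated downloads.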
I still don't know what's causing the problem, but in case anyone else runs into something similar, I was able to find a workaround: if you write the source to a text file instead of using it directly, it doesn't get cut off. My guess is the problem happens after the data is downloaded but before it's pulled into the "workspace". Here's the snippet I added to the scraper:
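The snippet itself wasn't preserved in this copy of the post, but a minimal sketch of the workaround as described (function and file names are mine, not the original author's) would be:

```python
def roundtrip_source(source_text, path="pto_source.txt"):
    """Write the downloaded page source to a text file, then read it back.

    This mirrors the workaround described above: dump the response to disk
    before handing it to the parser, instead of using it directly.
    """
    with open(path, "w", encoding="utf-8") as f:
        f.write(source_text)
    with open(path, encoding="utf-8") as f:
        return f.read()
```

Passing the round-tripped text to BeautifulSoup in place of `source` in the loop above is how the fix would slot in.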