Identifying the names of files downloaded from Google Patents

Published 2024-04-29 03:48:04


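I have a CSV file with about 500 Google Patents links, and I iterate over them with Scrapy to download a CSV file from each one (every link contains a download link). That part works. What I want to do now is discover the name of each downloaded file from the HTML markup, so that I can edit the files with Python. An example link is https://patents.google.com/?q=O1C(%3dCCCC1C)C&oq=O1C(%3dCCCC1C)C. The name of the downloaded file is generated dynamically, so is there a way to find it?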


2 answers

The name is just the date: gp-search-20210816-142027.csv corresponds to 2021-08-16 14:20:27.
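If the name really follows the `gp-search-YYYYMMDD-HHMMSS.csv` pattern seen in that example, the timestamp can be recovered from the file name, and the most recent export in a download folder can be picked out by it. The following is a minimal sketch under that assumption; the helper names `parse_export_time` and `newest_export` are made up for illustration, and whether the timestamp is local time or UTC is not something the example tells us.

```python
import re
from datetime import datetime
from pathlib import Path

# Assumed naming pattern, inferred from the single example above.
NAME_RE = re.compile(r"^gp-search-(\d{8})-(\d{6})\.csv$")


def parse_export_time(filename):
    """Return the datetime encoded in an export file name, or None if it does not match."""
    match = NAME_RE.match(filename)
    if match is None:
        return None
    return datetime.strptime(match.group(1) + match.group(2), "%Y%m%d%H%M%S")


def newest_export(download_dir):
    """Return the Path of the export with the latest timestamp in its name, or None."""
    dated = []
    for path in Path(download_dir).glob("gp-search-*.csv"):
        stamp = parse_export_time(path.name)
        if stamp is not None:
            dated.append((stamp, path))
    if not dated:
        return None
    return max(dated, key=lambda item: item[0])[1]


print(parse_export_time("gp-search-20210816-142027.csv"))  # 2021-08-16 14:20:27
```

Calling `newest_export` right after each download is one way to learn which file a given link produced, provided the downloads run one at a time.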

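As a demonstration of what you may be trying to do, if I have understood the question, you can proceed as in the code below. Note: this is only a suggested idea, and it only scrapes the PDF links from the first page to illustrate it.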

Code:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import pandas as pd


# Collect the links to the PDF files from the search results at the URL below
# (just the FIRST PAGE as a demo)
url = "https://patents.google.com/?q=O1C(%3dCCCC1C)C&oq=O1C(%3dCCCC1C)C"
path = r'chromedriver'  # path to the chromedriver executable
options = webdriver.ChromeOptions()
options.add_argument("--start-maximized")
# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(executable_path=path), options=options)
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()  # the page source is already captured, so the browser can be closed
links = []
print('# this just gets the FIRST PAGE for a demo')
for link in soup.find_all('a', attrs={'class': 'pdfLink style-scope search-result-item'}):
    print(link['href'])
    links.append(link['href'])


# Build one dataframe from the CSV file downloaded from the Google search page
# and a second one from the scraped links. The two frames are placed side by side
# with pd.concat(axis=1), i.e. by row position, which assumes the scraped links
# come in the same order as the CSV rows. 'partial_file_name' (extracted from
# 'result_link' and from the PDF URL) is a shared key that could be used to match
# rows instead, but it is dropped here without being used.
pattern = r'/([A-Z]{2}\w{7})'
df = pd.read_csv('gp-search-20210816-190925.csv', skiprows=1)
df.columns = df.columns.str.replace(' ', '_')
df['partial_file_name'] = df['result_link'].str.extract(pattern)
df1 = pd.DataFrame(links, columns=['pdf_link'])
df1['partial_file_name'] = df1['pdf_link'].str.extract(pattern)
df = pd.concat([df, df1], axis=1)
df['filename'] = df['pdf_link'].str.extract(r'/([A-Z]{2}\w+)\.')
del df['partial_file_name']
print('\n\n', df.columns)

# 12 columns in total, but only five are shown for the demo
print(df[['filename', 'id', 'title', 'filing/creation_date', 'pdf_link']].head())

Output:

# this just gets the FIRST PAGE for a demo
https://patentimages.storage.googleapis.com/01/d1/77/6b0b7640eaccda/US7550931.pdf
https://patentimages.storage.googleapis.com/a2/32/15/69cf7713e8e2bf/JP2008525498A.pdf
https://patentimages.storage.googleapis.com/7e/6b/b7/001a8040e216ee/TWI686424B.pdf
https://patentimages.storage.googleapis.com/0f/14/fc/ecb56564f14f6b/WO2005009447A1.pdf
https://patentimages.storage.googleapis.com/95/fd/d5/ed4fe960bdec1c/KR20140096378A.pdf
https://patentimages.storage.googleapis.com/7e/29/01/231cc0813a0f6a/US5026677.pdf
https://patentimages.storage.googleapis.com/ff/f9/c9/7b775d6534d9cb/EP0628427A1.pdf
https://patentimages.storage.googleapis.com/bd/b3/ba/f38866e0b298e2/KR960004857B1.pdf
https://patentimages.storage.googleapis.com/79/e2/11/78aea87078687f/US5942486.pdf
https://patentimages.storage.googleapis.com/62/f5/da/f291e7552a45a6/US5142089.pdf


 Index(['id', 'title', 'assignee', 'inventor/author', 'priority_date',
       'filing/creation_date', 'publication_date', 'grant_date', 'result_link',
       'representative_figure_link', 'pdf_link', 'filename'],
      dtype='object')

    filename          id                   title                                              filing/creation_date      pdf_link
0   US7550931         US-7550931-B2        Controlled lighting methods and apparatus          2007-03-15                https://patentimages.stora.....ccda/US7550931.pdf
1   JP2008525498A     JP-2008525498-A      Enzyme modulators and therapy                      2005-12-23                https://patentimages.stora....f/JP2008525498A.pdf
2   TWI686424B        TW-I686424-B         Polymer containing triazine ring and compositi...  2016-01-15                https://patentimages.storage.googleapis.com/7e...
3   WO2005009447A1    WO-2005009447-A1     Single dose fast dissolving azithromycin           2004-07-22                https://patentimages.storage.googleapis.com/0f...
4   KR20140096378A    KR-20140096378-A     Low chloride compositions of olefinically func...  2012-11-19                https://patentimages.storage.googleapis.com/95...

This shows one way to line up the file names, links, and other fields.
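The code above places the two frames side by side by row position, so it relies on the scraped links arriving in the same order as the CSV rows. If that order cannot be guaranteed, one alternative is to merge on a key derived from both sides. The sketch below uses the `id` and PDF link values from the output above and the same nine-character prefix idea as the original pattern; applying the pattern to `id` with its hyphens removed, and treating that prefix as unique within a result set, are assumptions made here rather than something the answer states.

```python
import pandas as pd

# Sample values copied from the output shown above.
ids = ["US-7550931-B2", "JP-2008525498-A", "TW-I686424-B"]
pdf_links = [
    # deliberately listed in a different order from `ids`
    "https://patentimages.storage.googleapis.com/7e/6b/b7/001a8040e216ee/TWI686424B.pdf",
    "https://patentimages.storage.googleapis.com/01/d1/77/6b0b7640eaccda/US7550931.pdf",
    "https://patentimages.storage.googleapis.com/a2/32/15/69cf7713e8e2bf/JP2008525498A.pdf",
]

# Two-letter country code followed by seven word characters.
KEY = r"([A-Z]{2}\w{7})"

meta = pd.DataFrame({"id": ids})
# "US-7550931-B2" -> "US7550931B2" -> key "US7550931"
meta["key"] = meta["id"].str.replace("-", "", regex=False).str.extract(KEY, expand=False)

pdfs = pd.DataFrame({"pdf_link": pdf_links})
# The leading "/" anchors the key at the start of the file name in the URL.
pdfs["key"] = pdfs["pdf_link"].str.extract("/" + KEY, expand=False)
pdfs["filename"] = pdfs["pdf_link"].str.extract(r"/([A-Z]{2}\w+)\.pdf$", expand=False)

# A left merge keeps every CSV row and its order; rows without a PDF get NaN.
merged = meta.merge(pdfs, on="key", how="left").drop(columns="key")
print(merged[["id", "filename"]])
```

With real data it would be worth checking `pdfs["key"].is_unique` before merging, since a duplicated key would multiply rows.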
