Scraping only the paragraphs that contain certain words from PDF-embedded URLs


I'm currently working on some code to scrape text from websites. I don't want to scrape the entire page, only the parts of the page that contain certain words. I've managed to do this for most URLs using the .find_all("p") command, but it doesn't work for URLs that point to a PDF.

I can't seem to find a way to open a PDF-embedded URL as text and then split that text into paragraphs. That is what I want to do: first 1) open a PDF-embedded URL as text, then 2) split the text into paragraphs. That way I can scrape only the paragraphs that contain certain words.

Below is the code I currently use to scrape paragraphs containing certain words from "normal" URLs. Any tips for handling PDF-embedded URLs (such as the variable "url2" in the code below) would be much appreciated.

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import re

url1 = "https://brainybackpackers.com/best-places-for-whale-watching-in-the-world/"
url2 = "https://www.environment.gov.au/system/files/resources/7f15bfc1-ed3d-40b6-a177-c81349028ef6/files/aust-national-guidelines-whale-dolphin-watching-2017.pdf"
url = url1
req = Request(url, headers={"User-Agent": 'Mozilla/5.0'})
page = urlopen(req, timeout=5) # open the page; the 5-second timeout skips unresponsive sites
htmlParse = BeautifulSoup(page.read(), 'lxml') 
SearchWords = ["orca", "killer whale", "humpback"] # text must contain these words

# Check if the article text mentions the SearchWord(s). If so, continue the analysis. 
if any(word in htmlParse.text for word in SearchWords):
    textP = ""
    text = ""
    
    # Look for paragraphs ("p") that contain a SearchWord
    for word in SearchWords:
        print(word)
        for para in htmlParse.find_all("p", text=re.compile(word)):
            textParagraph = para.get_text()
            textP = textP + textParagraph
    text = text + textP
    print(text)

2 Answers

One thing you could try is the pdfminer.six package. Once it's imported, we can use the pdfminer.high_level.extract_text() function to grab the text of a PDF:

import pdfminer.high_level as pdfminer

infile = "my/file/path.pdf" # file you want to turn into text

out_text = pdfminer.extract_text(infile) # extract the text into the out_text var

# out_text now contains a string of your pdf contents

It should be noted that the extract_text function works on local files, so we need to save the PDF to some local buffer that you can delete later. If you're on a Unix-like OS, I'd suggest somewhere like /tmp/.

As for your implementation, I believe you'll want something like this:

import pdfminer.high_level as pdfminer
import requests

# get the pdf and save it
url = "https://www.environment.gov.au/system/files/resources/7f15bfc1-ed3d-40b6-a177-c81349028ef6/files/aust-national-guidelines-whale-dolphin-watching-2017.pdf"
response = requests.get(url)
pdf_name = url.split('/')[-1] # everything right of the last slash
pdf_path = "/tmp/" + pdf_name # CHANGE TO WHATEVER "BUFFER" FOLDER YOU WANT

# save the pdf locally to be used with the pdf parser
with open(pdf_path,'wb') as outfile:
    outfile.write(response.content)

# read the contents of the pdf into the out_text var
out_text = pdfminer.extract_text(pdf_path)

# out_text now contains a string of your pdf contents

From here you're free to scrape whatever you want.
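To tie this back to the original goal of keeping only the paragraphs that mention the search words, here is a minimal sketch, assuming your installed pdfminer.six accepts a file-like object for extract_text (recent versions do), in which case you can skip the temporary file and filter the text in memory. PDFs carry no real paragraph markup, so splitting on blank lines is only a rough heuristic:

import io
import re
import requests
import pdfminer.high_level as pdfminer

url = "https://www.environment.gov.au/system/files/resources/7f15bfc1-ed3d-40b6-a177-c81349028ef6/files/aust-national-guidelines-whale-dolphin-watching-2017.pdf"
search_words = ["orca", "killer whale", "humpback"]

# download the PDF and hand pdfminer an in-memory buffer
response = requests.get(url)
out_text = pdfminer.extract_text(io.BytesIO(response.content))

# split on blank lines as a rough stand-in for paragraph breaks
paragraphs = re.split(r"\n\s*\n", out_text)

# keep only the paragraphs that mention a search word
matching = [p for p in paragraphs
            if any(word in p.lower() for word in search_words)]
print(matching)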

You can read the PDF and search its pages for what you're looking for:

# pip install PyPDF2

import io
import requests
import PyPDF2


URI = "https://www.environment.gov.au/system/files/resources/7f15bfc1-ed3d-40b6-a177-c81349028ef6/files/aust-national-guidelines-whale-dolphin-watching-2017.pdf"

r = requests.get(URI)
with io.BytesIO(r.content) as f:
    # PdfReader replaces the PdfFileReader/getPage API deprecated in PyPDF2 3.x
    reader = PyPDF2.PdfReader(f)

    # place the text of each page into data
    data = []
    for page in reader.pages:
        data.append(page.extract_text())

# words to look up
search_words = {"orca", "killer whale", "humpback"}

# get the pages containing any search word (once each, even if
# several words match)
wanted_pages = []
for page_contents in data:
    if any(word in page_contents.lower() for word in search_words):
        wanted_pages.append(page_contents)

print(wanted_pages)
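Since the question asks for paragraphs rather than whole pages, you could go one step further and split each page's text before filtering. This is only a rough sketch: extracted PDF text has no real paragraph markers, so the blank-line split below is a heuristic and may need a different delimiter for your document:

import re

# rough heuristic: treat blank lines in the extracted text as
# paragraph breaks (PyPDF2 often emits single newlines only, in
# which case adjust the pattern)
wanted_paragraphs = []
for page_contents in data:
    for para in re.split(r"\n\s*\n", page_contents):
        if any(word in para.lower() for word in search_words):
            wanted_paragraphs.append(para)

print(wanted_paragraphs)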
