Trying to scrape the date of the last document on a web page with Python

User

Reply #1 · edited 2024-04-19 23:22:01

Try the following, sending headers with the request:

import requests
from bs4 import BeautifulSoup

url = "https://iapps.courts.state.ny.us/nyscef/DocumentList?docketId=npvulMdOYzFDYIAomW_PLUS_elw==&display=all"

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'
}

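response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
# find() returns only the first matching span, or None if the page has no such span
date = soup.find('span', {'class': 'grayItalic'}).get_text().strip()
# The text looks like "Received: 04/25/2019", so this picks out the day of the month
converted_date = int(date.split("/")[-2])
print(converted_date)
print(date)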
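
The split("/")[-2] step above yields only the day of the month. If a full date is wanted, a minimal sketch along these lines may help. It assumes the span text has the form "Received: MM/DD/YYYY"; the helper name parse_received is hypothetical and not part of the original answer.

```python
import re
from datetime import datetime

# Matches an MM/DD/YYYY date anywhere in the span text.
DATE_RE = re.compile(r"\b(\d{2}/\d{2}/\d{4})\b")


def parse_received(text):
    """Return the first MM/DD/YYYY date found in text as a date, or None."""
    match = DATE_RE.search(text)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%m/%d/%Y").date()


# Example input shaped like the span text on the page (assumed format).
received = parse_received("\n    Received: 04/25/2019\n")
print(received)      # 2019-04-25
print(received.day)  # 25
```

Returning None for text without a date keeps the caller in charge of what to do when the page layout changes.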

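User

Reply #2 · edited 2024-04-19 23:22:01

I couldn't load the URL with either the requests or the urllib module. My guess is that the site blocks automated connection requests. So I opened the page in a browser, saved its source to a file named page.html, and ran the BeautifulSoup operations on that file. This seems to work: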

from bs4 import BeautifulSoup

# Parse the page source that was saved from the browser
with open("page.html") as html:
    soup = BeautifulSoup(html, 'html.parser')
date_span = soup.find('span', {'class': 'grayItalic'})

if date_span is not None:
    print(date_span.text.strip().replace("Received: ", ""))
    # output: 04/25/2019
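
Note that soup.find returns only the first matching span, while the question asks about the last document. With BeautifulSoup, soup.find_all returns every match and the last element can be taken from that list. The sketch below shows the same idea without third-party dependencies, using the standard library's html.parser instead of BeautifulSoup (a swapped-in technique, not what the answers use). The class name GrayItalicCollector and the sample HTML are hypothetical, and it assumes the last grayItalic span in document order belongs to the last document.

```python
from html.parser import HTMLParser


class GrayItalicCollector(HTMLParser):
    """Collect the text of every <span class="grayItalic"> element."""

    def __init__(self):
        super().__init__()
        self.texts = []   # stripped text of each matching span, in document order
        self._depth = 0   # > 0 while inside a matching span (tracks nested spans)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._depth:
            self._depth += 1
        elif "grayItalic" in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "span" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.texts.append("".join(self._buf).strip())

    def handle_data(self, data):
        if self._depth:
            self._buf.append(data)


# Hypothetical sample standing in for the saved page.html.
sample_html = """
<table>
  <tr><td>Document 1 <span class="grayItalic">Received: 04/25/2019</span></td></tr>
  <tr><td>Document 2 <span class="grayItalic">Received: 01/19/2021</span></td></tr>
</table>
"""

collector = GrayItalicCollector()
collector.feed(sample_html)
collector.close()

last_text = collector.texts[-1] if collector.texts else None
print(last_text)  # Received: 01/19/2021
```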

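I also tried to fetch the page source with the requests library as shown below, but it didn't work (the site is probably blocking the request). See whether it works on your machine: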

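import requests

url = "..."
# Note: the Access-Control-* entries below are CORS response headers that a server sends;
# User-Agent is the only entry here that is actually a request header.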
headers = {
    'Access-Control-Allow-Origin': '*',
    'Access-Control-Allow-Methods': 'GET',
    'Access-Control-Allow-Headers': 'Content-Type',
    'Access-Control-Max-Age': '3600',
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'
}

response = requests.get(url, headers=headers)
html = response.content

print(html)

User

Reply #3 · edited 2024-04-19 23:22:01

import dateutil.parser
from bs4 import BeautifulSoup
html_doc = """<span class="grayItalic">
    Received: 01/19/2021
</span>"""
soup=BeautifulSoup(html_doc,'html.parser')

date_ = soup.find('span', {'class': 'grayItalic'}).get_text()

dateutil.parser.parse(date_,fuzzy=True)

Output:

datetime.datetime(2021, 1, 19, 0, 0)

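date_ evaluates to '\n Received: 01/19/2021\n'. Instead of string slicing, you can use dateutil.parser.parse with fuzzy=True, which returns a datetime.datetime object. In this example I assume you only need the date. If you also need the text, you can use fuzzy_with_tokens=True. From the dateutil documentation: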

 if the fuzzy_with_tokens option is True, returns a tuple, the first element being a datetime.datetime object, the second a tuple containing the fuzzy tokens.

dateutil.parser.parse(date_,fuzzy_with_tokens=True)

(datetime.datetime(2021, 1, 19, 0, 0), (' Received: ', ' '))
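
Once the datetime.datetime object is available, the individual parts of the date can be read off it directly. A small sketch using only the standard library, starting from the value shown in the output above (the variable name parsed is just for illustration):

```python
from datetime import datetime

# Same value as the datetime returned by dateutil.parser.parse in the example above.
parsed = datetime(2021, 1, 19, 0, 0)

date_only = parsed.date()                  # drop the time component
day_of_month = parsed.day                  # integer day, like converted_date in Reply #1
formatted = parsed.strftime("%m/%d/%Y")    # back to the page's MM/DD/YYYY format

print(date_only)     # 2021-01-19
print(day_of_month)  # 19
print(formatted)     # 01/19/2021
```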
