Unable to fetch a dynamically populated number from a webpage after performing some steps

Posted 2024-05-12 18:12:01


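I've created a script using the requests module and the BeautifulSoup library to fetch some tabular content from a webpage. To generate the table, one has to manually follow the steps shown in the attached image. The code I've pasted below works, but the main problem I'm trying to solve is getting the title number programmatically, in this case 628086906, which is appended to the table_link that I've hard-coded here.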

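After clicking the tool button in step 6, when you hover the cursor over the map you can see the option Multiple; clicking it leads you to the URL containing the title number.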

start page

These are exactly the steps the script follows.

This is the linc number 0030278592, which needs to be entered in the input box at step 6.

I've tried with (a working version, since I've used the hard-coded title number within table_link):

import requests
from bs4 import BeautifulSoup

link = 'https://alta.registries.gov.ab.ca/spinii/logon.aspx'
lnotice = 'https://alta.registries.gov.ab.ca/spinii/legalnotice.aspx'
search_page = 'https://alta.registries.gov.ab.ca/SpinII/SearchSelectType.aspx'
map_page = 'http://alta.registries.gov.ab.ca/SpinII/mapindex.aspx'
map_find = 'http://alta.registries.gov.ab.ca/SpinII/mapfinds.aspx'
table_link = 'https://alta.registries.gov.ab.ca/SpinII/popupTitleSearch.aspx?title=628086906'

def get_content(s,link):   
    r = s.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    payload = {i['name']:i.get('value','') for i in soup.select('input[name]')}
    payload['uctrlLogon:cmdLogonGuest.x'] = '80'
    payload['uctrlLogon:cmdLogonGuest.y'] = '20'

    r = s.post(link,data=payload)
    soup = BeautifulSoup(r.text,"lxml")
    payload = {i['name']:i.get('value','') for i in soup.select('input[name]')}
    payload['cmdYES.x'] = '52'
    payload['cmdYES.y'] = '8'

    s.post(lnotice,data=payload)
    s.headers['Referer'] = 'https://alta.registries.gov.ab.ca/spinii/welcomeguest.aspx'
    
    s.get(search_page)
    s.headers['Referer'] = 'https://alta.registries.gov.ab.ca/SpinII/SearchSelectType.aspx'
    
    s.get(map_page)
    
    r = s.get(map_find)
    s.headers['Referer'] = 'http://alta.registries.gov.ab.ca/SpinII/mapfinds.aspx'
    soup = BeautifulSoup(r.text,"lxml")
    payload = {i['name']:i.get('value','') for i in soup.select('input[name]')}
    payload['__EVENTTARGET'] = 'Finds$lstFindTypes'
    payload['Finds:lstFindTypes'] = 'Linc'
    payload['Finds:ctlLincNumber:txtLincNumber'] = '0030278592'
    
    r = s.post(map_find,data=payload)
    
    r = s.get(table_link)
    print(r.text)


if __name__ == "__main__":
    with requests.Session() as s:
        s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'
        get_content(s,link)

How can I grab the title number from the url?

How can I fetch all the linc numbers from that site so that I don't need to use map at all?

The only problem with this site is that it is unavailable in daytime for maintenance.


2 answers

There are two options for getting the information you're looking for; one of them, which you probably already know, is Selenium.

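Open the Network tab and monitor the requests the browser sends when you hover over the map (to see whether it makes a request to the server at all). With requests and BS4, your best bet is that the data is already loaded with the page. The check below may not work, but it tells you which case you are in: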

import re
# r is the response returned by the last request in your script
print(re.findall(r'628086906', r.text))

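If it prints the number, the data is available in JSON format and loaded with the page, so you can either load the JSON or find the value with a regular expression. Otherwise, your only option is Selenium.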
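As a minimal sketch of that check, the snippet below searches for any `title=<digits>` query parameter instead of one hard-coded number. The sample HTML and the helper name `find_title_numbers` are made up for illustration; the real page may not contain such a link at all if the map is drawn from binary data.

```python
import re

def find_title_numbers(html):
    """Return every run of digits that follows 'title=' in the given text."""
    return re.findall(r"\btitle=(\d+)", html)

# Hypothetical HTML fragment, only for demonstration.
sample = '<a href="popupTitleSearch.aspx?title=628086906">Multiple</a>'
print(find_title_numbers(sample))
```

An empty list means the number is not in the page source and has to come from another request.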

The data is fetched from:

POST http://alta.registries.gov.ab.ca/SpinII/mapserver.aspx

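The content is encoded in a custom format before being consumed by the OpenLayers library. All the decoding is located in this JS file. If you beautify it, you can look for WayTo.Wtb.Format.WTBOpenLayers.Class, which decodes it. The binary is decoded byte by byte like this: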

switch(elementType){
    case 1:
        var lineColor = new WayTo.Wtb.Element.LineColor();
        byteOffset = lineColor.parse(dataReader, byteOffset);
        outputElement = lineColor;
        break;
    case 2:
        var lineStyle = new WayTo.Wtb.Element.LineStyle();
        byteOffset = lineStyle.parse(dataReader, byteOffset);
        outputElement = lineStyle;
        break;
    case 3:
        var ellipse = new WayTo.Wtb.Element.Ellipse();
        byteOffset = ellipse.parse(dataReader, byteOffset);
        outputElement = ellipse;
        break;
    ........
}
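The same dispatch can be sketched in Python with a lookup table rather than a long switch. The table covers only the fixed-size element types, with payload sizes copied from the Python decoding script in this answer; the names `FIXED_PAYLOAD_SIZES` and `skip_fixed_element` are illustrative and are not part of the site's JavaScript.

```python
# Payload size in bytes for element types whose size never varies.
FIXED_PAYLOAD_SIZES = {
    1: 4, 2: 2, 3: 20, 4: 28, 5: 12,
    13: 4, 14: 2, 15: 2,
    100: 0, 101: 20, 102: 2, 103: 0, 104: 6, 105: 0,
    111: 40, 112: 52, 113: 24, 257: 0,
}

def skip_fixed_element(element_type, offset):
    """Return the offset just past a fixed-size element payload.

    Raises KeyError for types whose size depends on the element
    (text, polylines, polygons), which need their own handling.
    """
    return offset + FIXED_PAYLOAD_SIZES[element_type]

print(skip_fixed_element(3, 10))  # type 3 (ellipse) has a 20-byte payload
```

A table like this keeps the offset bookkeeping in one place, so only the variable-length elements need explicit branches.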

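To get the raw data, we have to reproduce this decoding algorithm. We don't need to decode every object; we only need to keep the offsets right and extract the strings correctly. Here is a script for the decoding part, which decodes the data from a file (the output of the request above):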

with open("wtb.bin", mode='rb') as file:
    encodedData = file.read()
    offset = 0
    objects = []

    while offset < len(encodedData):

        elementSize = encodedData[offset]
        offset+=1
        elementType = encodedData[offset]
        offset+=1

        if elementType == 0:
            break

        curElemSize = elementSize
        curElemType = elementType

        if elementType== 114:
            largeElementSize = int.from_bytes(encodedData[offset:offset + 4], "big")
            offset+=4
            largeElementType = int.from_bytes(encodedData[offset:offset+2], "little")
            offset+=2
            curElemSize = largeElementSize
            curElemType = largeElementType

        print(f"type {curElemType} | size {curElemSize}")
        offsetInit = offset

        if curElemType == 1:
            offset+=4
        elif curElemType == 2:
            offset+=2
        elif curElemType == 3:
            offset+=20
        elif curElemType == 4:
            offset+=28
        elif curElemType == 5:
            offset+=12
        elif curElemType == 6:
            textLength = curElemSize - 3
            objects.append({
                "type": "Text",
                "x_position": int.from_bytes(encodedData[offset:offset+2], "little"),
                "y_position": int.from_bytes(encodedData[offset+2:offset+4], "little"),
                "rotation": int.from_bytes(encodedData[offset+4:offset+6], "little"),
                "text": encodedData[offset+6:offset+6+(textLength*2)].decode("utf-8").replace('\x00','')
            })
            offset+=6+(textLength*2)
        elif curElemType == 7:
            numPoint = int(curElemSize / 2)
            offset+=4*numPoint
        elif curElemType == 27:
            numPoint = int(curElemSize / 4)
            offset+=8*numPoint
        elif curElemType == 8:
            numPoint = int(curElemSize / 2)
            offset+=4*numPoint
        elif curElemType == 28:
            numPoint = int(curElemSize / 4)
            offset+=8*numPoint
        elif curElemType == 13:
            offset+=4
        elif curElemType == 14:
            offset+=2
        elif curElemType == 15:
            offset+=2
        elif curElemType == 100:
            pass
        elif curElemType == 101:
            offset+=20
        elif curElemType == 102:
            offset+=2
        elif curElemType == 103:
            pass
        elif curElemType == 104:
            highShort = int.from_bytes(encodedData[offset+2:offset+4], "little")
            lowShort = int.from_bytes(encodedData[offset+4:offset+6], "little")
            objects.append({
                "type": "StartNumericCell",
                "entity": int.from_bytes(encodedData[offset:offset+2], "little"),
                "occurrence": (highShort << 16) + lowShort
            })
            offset+=6
        elif curElemType == 105:
            #end cell
            pass
        elif curElemType == 109:
            textLength = curElemSize - 1
            objects.append({
                "type": "StartAlphanumericCell",
                "entity": int.from_bytes(encodedData[offset:offset+2], "little"),
                "occurrence":encodedData[offset+2:offset+2+(textLength*2)].decode("utf-8").replace('\x00','')
            })
            offset+=2+(textLength*2)
        elif curElemType == 111:
            offset+=40
        elif curElemType == 112:
            objects.append({
                "type": "CoordinatePlane",
                "projection_code": encodedData[offset+48:offset+52].decode("utf-8").replace('\x00','')
            })
            offset+=52
        elif curElemType == 113:
            offset+=24
        elif curElemType == 256:
            nameLength = int.from_bytes(encodedData[offset+14:offset+16], "little")
            objects.append({
                "type": "LargePolygon",
                "name": encodedData[offset+16:offset+16+nameLength].decode("utf-8").replace('\x00',''),
                "occurence": int.from_bytes(encodedData[offset+2:offset+6], "little")
            })
            if nameLength > 0:
                offset+= 16 + nameLength
                if encodedData[offset] == 0:
                    offset+=1
            else:
                offset+= 16
            numberOfPoints = int.from_bytes(encodedData[offset:offset+2], "little")
            offset+=2
            offset+=numberOfPoints*8
        elif curElemType == 257:
            pass
        else:
            offset+= curElemSize*2
        print(f"offset diff {offset-offsetInit}")
        print("                ")

    print(objects)
    print(len(encodedData))
    print(offset)

(Side note: the element size is in big endian, while all other values are in little endian.)
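To illustrate the mixed byte order, here is a small sketch of reading one element header the way the script above does. The function name `read_header` and the sample bytes are made up for the demonstration; only the layout (size byte, type byte, and the type 114 "large element" extension) comes from the script.

```python
def read_header(data, offset=0):
    """Return (element_type, element_size, next_offset) for one element."""
    size = data[offset]
    elem_type = data[offset + 1]
    offset += 2
    if elem_type == 114:  # large element: the real size and type follow
        size = int.from_bytes(data[offset:offset + 4], "big")
        elem_type = int.from_bytes(data[offset + 4:offset + 6], "little")
        offset += 6
    return elem_type, size, offset

# Synthetic bytes: marker 114, size 300 (big endian), type 256 (little endian).
sample = bytes([0, 114]) + (300).to_bytes(4, "big") + (256).to_bytes(2, "little")
print(read_header(sample))
```

Reading the four size bytes as little endian would give a huge, wrong size, and every later offset would be off.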

Run this repl.it to see how it decodes the file.
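The text elements (types 6 and 109) take two bytes per character, which is why the script strips `\x00` after decoding as UTF-8. Assuming those strings are UTF-16 little endian (an inference from the byte layout, not something the site documents), decoding them as `utf-16-le` gives the same result for ASCII text and would also handle non-ASCII characters. The sample value is synthetic.

```python
# Synthetic sample: a short legal description encoded with two bytes per character.
raw = "0420091;16".encode("utf-16-le")

as_in_script = raw.decode("utf-8").replace("\x00", "")  # approach used above
as_utf16 = raw.decode("utf-16-le")                      # assumed real encoding

print(as_in_script == as_utf16)
```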

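Based on this, we can build the steps to fetch the data. For clarity, I'll describe all the steps (even the ones you already know):

Login

Log in to the website using: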

GET https://alta.registries.gov.ab.ca/spinii/logon.aspx

Scrape the input names/values, add uctrlLogon:cmdLogonGuest.x and uctrlLogon:cmdLogonGuest.y, then call

POST https://alta.registries.gov.ab.ca/spinii/logon.aspx
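The scraping part of this step can be sketched without any network access. The snippet below uses the standard library `html.parser` instead of BeautifulSoup (a substitution made so the example has no dependencies), and the form markup is a made-up stand-in for the logon page; the click coordinates are the ones used in the question.

```python
from html.parser import HTMLParser

class InputCollector(HTMLParser):
    """Collect name/value pairs from every <input> that has a name."""

    def __init__(self):
        super().__init__()
        self.payload = {}

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            attributes = dict(attrs)
            if "name" in attributes:
                # A missing or empty value attribute becomes an empty string.
                self.payload[attributes["name"]] = attributes.get("value") or ""

# Hypothetical form markup standing in for the logon page.
html = '''
<form>
  <input type="hidden" name="__VIEWSTATE" value="abc123">
  <input type="text" name="user">
  <input type="image" id="no-name-here">
</form>
'''

collector = InputCollector()
collector.feed(html)
payload = collector.payload
# Simulate the click on the guest logon image button.
payload["uctrlLogon:cmdLogonGuest.x"] = "80"
payload["uctrlLogon:cmdLogonGuest.y"] = "20"
print(payload)
```

Inputs without a `name` are skipped, since they are never submitted with the form.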

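Legal notice

The legal notice isn't required to get the map values, but it is required to get the item information (the last step in the post).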

GET https://alta.registries.gov.ab.ca/spinii/legalnotice.aspx

Scrape the input tag names/values, set cmdYES.x and cmdYES.y, then call

POST https://alta.registries.gov.ab.ca/spinii/legalnotice.aspx

Map data

Call the map server API:

POST http://alta.registries.gov.ab.ca/SpinII/mapserver.aspx

with the following data:

{
    "mt":"titleresults",
    "qt":"lincNo",
    "LINCNumber": lincNumber,
    "rights": "B", #not required
    "cx": 1920, #screen definition
    "cy": 1080,
}

cx/cy is the canvas size.
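A small helper for building that form data might look like the sketch below. The function name `build_map_payload` and its default canvas size are illustrative. The LINC number should be passed as a string, since a value such as 0030278592 loses its leading zeros once it becomes an integer.

```python
def build_map_payload(linc_number, width=1920, height=1080):
    """Build the form data for the mapserver.aspx request."""
    return {
        "mt": "titleresults",
        "qt": "lincNo",
        "LINCNumber": str(linc_number),  # pass text so leading zeros survive
        "rights": "B",                   # not required, per the answer
        "cx": width,                     # canvas width
        "cy": height,                    # canvas height
    }

print(build_map_payload("0030278592"))
```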

Decode the encoded data using the method above. You will get:

[{'type': 'LargePolygon', 'name': '0010495134 8722524;1;162', 'entity': 23, 'occurence': 628079167, 'line_color_green': 0, 'line_color_red': 129, 'line_color_blue': 129, 'fill_color_green': 255, 'fill_color_red': 255, 'fill_color_blue': 180}, {'type': 'LargePolygon', 'name': '0012170859 8022146;8;99', 'entity': 23, 'occurence': 628048595, 'line_color_green': 0, 'line_color_red': 129, 'line_color_blue': 129, 'fill_color_green': 255, 'fill_color_red': 255, 'fill_color_blue': 180}, {'type': 'LargePolygon', 'name': '0010691822 8722524;1;163', 'entity': 23, 'occurence': 628222354, 'line_color_green': 0, 'line_color_red': 129, 'line_color_blue': 129, 'fill_color_green': 255, 'fill_color_red': 255, 'fill_color_blue': 180}, {'type': 'LargePolygon', 'name': '0012169736 8022146;8;89', 'entity': 23, 'occurence': 628021327, 'line_color_green': 0, 'line_color_red': 129, 'line_color_blue': 129, 'fill_color_green': 255, 'fill_color_red': 255, 'fill_color_blue': 180}, {'type': 'LargePolygon', 'name': '0010694454 8722524;1;179', 'entity': 23, 'occurence': 628191678, 'line_color_green': 0, 'line_color_red': 129, 'line_color_blue': 129, 'fill_color_green': 255, 'fill_color_red': 255, 'fill_color_blue': 180}, {'type': 'LargePolygon', 'name': '0010694362 8722524;1;178', 'entity': 23, 'occurence': 628307403, 'line_color_green': 0, 'line_color_red': 129, 'line_color_blue': 129, 'fill_color_green': 255, 'fill_color_red': 255, 'fill_color_blue': 180}, {'type': 'LargePolygon', 'name': '0010433381 8722524;1;177', 'entity': 23, 'occurence': 628209696, 'line_color_green': 0, 'line_color_red': 129, 'line_color_blue': 129, 'fill_color_green': 255, 'fill_color_red': 255, 'fill_color_blue': 180}, {'type': 'LargePolygon', 'name': '0012169710 8022146;8;88A', 'entity': 23, 'occurence': 628021328, 'line_color_green': 0, 'line_color_red': 129, 'line_color_blue': 129, 'fill_color_green': 255, 'fill_color_red': 255, 'fill_color_blue': 180}, {'type': 'LargePolygon', 'name': '0010694355 8722524;1;176', 
'entity': 23, 'occurence': 628315826, 'line_color_green': 0, 'line_color_red': 129, 'line_color_blue': 129, 'fill_color_green': 255, 'fill_color_red': 255, 'fill_color_blue': 180}, {'type': 'LargePolygon', 'name': '0012170866 8022146;8;100', 'entity': 23, 'occurence': 628163431, 'line_color_green': 0, 'line_color_red': 129, 'line_color_blue': 129, 'fill_color_green': 255, 'fill_color_red': 255, 'fill_color_blue': 180}, {'type': 'LargePolygon', 'name': '0010694347 8722524;1;175', 'entity': 23, 'occurence': 628132810, 'line_color_green': 0, 'line_color_red': 129, 

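Extracting the information

If you want to target one specific lincNumber, you need to look at the polygon's style, because for "Multiple" values (i.e. those with several items) the response doesn't mention the lincNumber id, only a link reference. The following gets the selected item: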

selectedZone = [
    t 
    for t in objects 
    if t.get("fill_color_green", 255) < 255 and t.get("line_color_red") == 255
][0]
print(selectedZone)
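The list comprehension above raises `IndexError` when no polygon matches. A slightly more defensive sketch uses `next` with a default; the function name `find_selected_zone` and the two sample polygons are made up, with the second one using hypothetical "selected" colours.

```python
def find_selected_zone(objects):
    """Return the first polygon styled as selected, or None if there is none."""
    return next(
        (
            o for o in objects
            if o.get("fill_color_green", 255) < 255 and o.get("line_color_red") == 255
        ),
        None,
    )

# The first polygon uses the default colours seen in the decoded output,
# the second uses hypothetical colours for a selected polygon.
objects = [
    {"occurence": 628079167, "line_color_red": 129, "fill_color_green": 255},
    {"occurence": 628086906, "line_color_red": 255, "fill_color_green": 0},
]
print(find_selected_zone(objects))
print(find_selected_zone(objects[:1]))
```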

Call the URL you mentioned in your post to get the data and extract the table:

GET https://alta.registries.gov.ab.ca/SpinII/popupTitleSearch.aspx?title={selectedZone["occurence"]}

Full code:

import requests
from bs4 import BeautifulSoup
import pandas as pd

lincNumber = "0030278592"
#lincNumber = "0010661156"

s = requests.Session()

# 1) login
r = s.get("https://alta.registries.gov.ab.ca/spinii/logon.aspx")
soup = BeautifulSoup(r.text, "html.parser")

# collect every named input (hidden ASP.NET fields included)
payload = dict([
    (t["name"], t.get("value", ""))
    for t in soup.find_all("input", attrs={"name": True})
])
payload["uctrlLogon:cmdLogonGuest.x"] = 76
payload["uctrlLogon:cmdLogonGuest.y"] = 25
s.post("https://alta.registries.gov.ab.ca/spinii/logon.aspx",data=payload)

# 2) legal notice
r = s.get("https://alta.registries.gov.ab.ca/spinii/legalnotice.aspx")
soup = BeautifulSoup(r.text, "html.parser")
# collect every named input (hidden ASP.NET fields included)
payload = dict([
    (t["name"], t.get("value", ""))
    for t in soup.find_all("input", attrs={"name": True})
])
payload["cmdYES.x"] = 82
payload["cmdYES.y"] = 3
s.post("https://alta.registries.gov.ab.ca/spinii/legalnotice.aspx", data = payload)

# 3) map data
r = s.post("http://alta.registries.gov.ab.ca/SpinII/mapserver.aspx",
    data= {
        "mt":"titleresults",
        "qt":"lincNo",
        "LINCNumber": lincNumber,
        "rights": "B", #not required
        "cx": 1920, #screen definition
        "cy": 1080,
    })

def decodeWtb(encodedData):
    offset = 0

    objects = []
    iteration = 0

    while offset < len(encodedData):

        elementSize = encodedData[offset]
        offset+=1
        elementType = encodedData[offset]
        offset+=1

        if elementType == 0:
            break

        curElemSize = elementSize
        curElemType = elementType

        if elementType== 114:
            largeElementSize = int.from_bytes(encodedData[offset:offset + 4], "big")
            offset+=4
            largeElementType = int.from_bytes(encodedData[offset:offset+2], "little")
            offset+=2
            curElemSize = largeElementSize
            curElemType = largeElementType

        offsetInit = offset

        if curElemType == 1:
            offset+=4
        elif curElemType == 2:
            offset+=2
        elif curElemType == 3:
            offset+=20
        elif curElemType == 4:
            offset+=28
        elif curElemType == 5:
            offset+=12
        elif curElemType == 6:
            textLength = curElemSize - 3
            offset+=6+(textLength*2)
        elif curElemType == 7:
            numPoint = int(curElemSize / 2)
            offset+=4*numPoint
        elif curElemType == 27:
            numPoint = int(curElemSize / 4)
            offset+=8*numPoint
        elif curElemType == 8:
            numPoint = int(curElemSize / 2)
            offset+=4*numPoint
        elif curElemType == 28:
            numPoint = int(curElemSize / 4)
            offset+=8*numPoint
        elif curElemType == 13:
            offset+=4
        elif curElemType == 14:
            offset+=2
        elif curElemType == 15:
            offset+=2
        elif curElemType == 100:
            pass
        elif curElemType == 101:
            offset+=20
        elif curElemType == 102:
            offset+=2
        elif curElemType == 103:
            pass
        elif curElemType == 104:
            offset+=6
        elif curElemType == 105:
            pass
        elif curElemType == 109:
            textLength = curElemSize - 1
            offset+=2+(textLength*2)
        elif curElemType == 111:
            offset+=40
        elif curElemType == 112:
            offset+=52
        elif curElemType == 113:
            offset+=24
        elif curElemType == 256:
            nameLength = int.from_bytes(encodedData[offset+14:offset+16], "little")
            objects.append({
                "type": "LargePolygon",
                "name": encodedData[offset+16:offset+16+nameLength].decode("utf-8").replace('\x00',''),
                "entity": int.from_bytes(encodedData[offset:offset+2], "little"),
                "occurence": int.from_bytes(encodedData[offset+2:offset+6], "little"),
                "line_color_green": encodedData[offset + 8],
                "line_color_red": encodedData[offset + 7],
                "line_color_blue": encodedData[offset + 9],
                "fill_color_green": encodedData[offset + 10],
                "fill_color_red": encodedData[offset + 11],
                "fill_color_blue": encodedData[offset + 13]
            })
            if nameLength > 0:
                offset+= 16 + nameLength
                if encodedData[offset] == 0:
                    offset+=1
            else:
                offset+= 16
            numberOfPoints = int.from_bytes(encodedData[offset:offset+2], "little")
            offset+=2
            offset+=numberOfPoints*8
        elif curElemType == 257:
            pass
        else:
            offset+= curElemSize*2

    return objects

# 4) decode custom format
objects = decodeWtb(r.content)

# 5) get the selected area
selectedZone = [
    t 
    for t in objects 
    if t.get("fill_color_green", 255) < 255 and t.get("line_color_red") == 255
][0]
print(selectedZone)

# 6) get the info about item
r = s.get(f'https://alta.registries.gov.ab.ca/SpinII/popupTitleSearch.aspx?title={selectedZone["occurence"]}')
df = pd.read_html(r.content, attrs = {'class': 'bodyText'}, header =0)[0]
del df['Add to Cart']
del df['View']
print(df[:-1])

Run this on repl.it

Output

  Title Number           Type LINC Number Short Legal   Rights Registration Date Change/Cancel Date
0    052400228  Current Title  0030278592  0420091;16  Surface        19/09/2005         13/11/2019
1    072294084  Current Title  0030278551  0420091;12  Surface        22/05/2007         21/08/2007
2    072400529  Current Title  0030278469   0420091;3  Surface        05/07/2007         28/08/2007
3    072498228  Current Title  0030278501   0420091;7  Surface        18/08/2007         08/02/2008
4    072508699  Current Title  0030278535  0420091;10  Surface        23/08/2007         13/12/2007
5    072559500  Current Title  0030278477   0420091;4  Surface        17/09/2007         19/11/2007
6    072559508  Current Title  0030278576  0420091;14  Surface        17/09/2007         09/01/2009
7    072559521  Current Title  0030278519   0420091;8  Surface        17/09/2007         07/11/2007
8    072559530  Current Title  0030278493   0420091;6  Surface        17/09/2007         25/08/2008
9    072559605  Current Title  0030278485   0420091;5  Surface        17/09/2007         23/12/2008

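If you want more entries, you can look at the objects field. If you want more information about the items, such as coordinates, you can improve the decoder.

It's also possible to match the other lincNumbers around the target by looking at the name field, which contains the lincNumber, except for those with a "Multiple" name.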
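As a sketch of that matching, the snippet below assumes each name has the form "LINC number, space, short legal description", which is how the decoded names above appear to be laid out. The helper names are made up, and the "Multiple" entry is a hypothetical example of a name that carries no LINC number.

```python
def parse_polygon_name(name):
    """Split a polygon name into (linc_number, short_legal)."""
    linc_number, _, short_legal = name.partition(" ")
    return linc_number, short_legal

def index_by_linc(objects):
    """Map LINC number -> occurrence for names that start with 10 digits."""
    index = {}
    for obj in objects:
        linc_number, _ = parse_polygon_name(obj.get("name", ""))
        if len(linc_number) == 10 and linc_number.isdigit():
            index[linc_number] = obj["occurence"]
    return index

# The first two entries are copied from the decoded output above;
# the "Multiple" entry is hypothetical.
objects = [
    {"name": "0010495134 8722524;1;162", "occurence": 628079167},
    {"name": "0012170859 8022146;8;99", "occurence": 628048595},
    {"name": "Multiple", "occurence": 628086906},
]
print(index_by_linc(objects))
```

Entries that do not start with a 10-digit number are left out, so they still need the style-based lookup described earlier.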

Fun fact:

No HTTP headers need to be set in this flow.
