Submitting data through a web form and extracting the results

17 votes

3 answers

50301 views

Asked 2025-04-17 07:40

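I am a Python novice and have never written a web scraper or crawler. I have written Python code that connects to an API and extracts the data I want. For some of the extracted data, though, I also want to know the gender of the author. I found this site, http://bookblog.net/gender/genie.php, but the downside is that it does not offer an API. How can I write Python code that submits data to the form on that page and extracts the returned result? Any guidance would be a great help.

Here is the structure of the form: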

<form action="analysis.php" method="POST">
<textarea cols="75" rows="13" name="text"></textarea>
<div class="copyright">(NOTE: The genie works best on texts of more than 500 words.)</div>
<p>
<b>Genre:</b>
<input type="radio" value="fiction" name="genre">
fiction&nbsp;&nbsp;
<input type="radio" value="nonfiction" name="genre">
nonfiction&nbsp;&nbsp;
<input type="radio" value="blog" name="genre">
blog entry
</p>
<p>
</form>

Structure of the results page:

<p>
<b>The Gender Genie thinks the author of this passage is:</b>
male!
</p>

Tags: data extraction, data parsing, web scraping, form submission, API connection, web forms

3 Answers

You can use mechanize for this. Here is an example:

from mechanize import ParseResponse, urlopen, urljoin

uri = "http://bookblog.net"

response = urlopen(urljoin(uri, "/gender/genie.php"))
forms = ParseResponse(response, backwards_compat=False)
form = forms[0]

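# print(form)  # uncomment to inspect the parsed form and its controls

form['text'] = 'cheese'
form['genre'] = ['fiction']

# form.click() builds the POST request; urlopen() sends it
print(urlopen(form.click()).read())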

Answered 2025-04-17 by Python大师


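You can use mechanize to submit the form and retrieve the response, and the re module to extract the piece you want. For example, the script below does this for the text of your own question: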

import re
from mechanize import Browser

text = """
My python level is Novice. I have never written a web scraper 
or crawler. I have written a python code to connect to an api and 
extract the data that I want. But for some the extracted data I want to 
get the gender of the author. I found this web site 
http://bookblog.net/gender/genie.php but downside is there isn't an api 
available. I was wondering how to write a python to submit data to the 
form in the page and extract the return data. It would be a great help 
if I could get some guidance on this."""

browser = Browser()
browser.open("http://bookblog.net/gender/genie.php")

browser.select_form(nr=0)
browser['text'] = text
browser['genre'] = ['nonfiction']

response = browser.submit()

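# read() may return bytes; decode so the str pattern below can match
content = response.read().decode('utf-8', 'replace')

# \s* allows the line break between </b> and the verdict on the result page
result = re.findall(
    r'<b>The Gender Genie thinks the author of this passage is:</b>\s*(\w*)!', content)

print(result[0])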

What does this script do? It creates a mechanize.Browser object and opens the given URL:

browser = Browser()
browser.open("http://bookblog.net/gender/genie.php")

Next, it selects the form (the page has only one form to fill in, so it selects the first one):

browser.select_form(nr=0)

Then it sets the form fields...

browser['text'] = text
browser['genre'] = ['nonfiction']

...and submits the form:

response = browser.submit()

Now we can read the result (decoding it, since read() may return bytes):

content = response.read().decode('utf-8', 'replace')

We know the result appears in the page in this form:

<b>The Gender Genie thinks the author of this passage is:</b> male!

So we build a regular expression to match it and use re.findall(). The \s* allows for either a space or a line break after </b>:

result = re.findall(
    r'<b>The Gender Genie thinks the author of this passage is:</b>\s*(\w*)!',
    content)

Now the result is ready for you to use:

print(result[0])
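
The extraction step can be tried offline, without contacting the site. The following is a minimal Python 3 sketch, assuming the result page looks exactly like the markup quoted in the question, where a line break follows </b>. The names SAMPLE_RESULT, PATTERN, and extract_verdict are made up for this illustration.

```python
import re

# Result markup as quoted in the question; note the line break after </b>.
SAMPLE_RESULT = """<p>
<b>The Gender Genie thinks the author of this passage is:</b>
male!
</p>"""

# \s* tolerates a space or a newline between </b> and the verdict.
PATTERN = re.compile(
    r'<b>The Gender Genie thinks the author of this passage is:</b>\s*(\w+)!')


def extract_verdict(html):
    """Return the word captured after the caption, or None if it is absent."""
    match = PATTERN.search(html)
    return match.group(1) if match else None


print(extract_verdict(SAMPLE_RESULT))
```

Returning None when nothing matches avoids the IndexError that result[0] would raise on an unexpected page.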

Answered 2025-04-17 by Python大师


You don't actually need mechanize; just send the right form data in a POST request.
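
To see what "the right form data" amounts to, here is a small Python 3 sketch that builds the URL-encoded body a browser would send for this form. It uses only the standard library and the two field names from the form in the question (text and genre); build_form_body is a hypothetical helper name, and nothing is sent over the network.

```python
from urllib.parse import parse_qs, urlencode


def build_form_body(text, genre):
    """Encode the form fields as application/x-www-form-urlencoded."""
    fields = {'text': text, 'genre': genre}
    return urlencode(fields)


body = build_form_body('I have a beard!', 'blog')
print(body)

# Decoding the body again recovers the original field values.
print(parse_qs(body))
```

An HTTP client library produces an equivalent body when given the same fields as a dict, so this step rarely needs to be written by hand.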

Also, parsing HTML with regular expressions is not a good idea. You are better off using an HTML parser such as lxml.html.

import requests
import lxml.html as lh


def gender_genie(text, genre):
    url = 'http://bookblog.net/gender/analysis.php'
    caption = 'The Gender Genie thinks the author of this passage is:'

    form_data = {
        'text': text,
        'genre': genre,
        'submit': 'submit',
    }

    response = requests.post(url, data=form_data)

    tree = lh.document_fromstring(response.content)

    return tree.xpath("//b[text()=$caption]", caption=caption)[0].tail.strip()


if __name__ == '__main__':
    print(gender_genie('I have a beard!', 'blog'))
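
If lxml is not installed, the same idea (take the text that follows the caption's <b> element) can be sketched with the standard library's html.parser. This is a Python 3 stand-in for the lxml approach, not a replacement for it, and it assumes the result markup matches what the question quotes. VerdictParser and parse_verdict are hypothetical names.

```python
from html.parser import HTMLParser

CAPTION = 'The Gender Genie thinks the author of this passage is:'

# Result markup as quoted in the question.
SAMPLE_RESULT = """<p>
<b>The Gender Genie thinks the author of this passage is:</b>
male!
</p>"""


class VerdictParser(HTMLParser):
    """Collect the first non-blank text that follows the caption's <b> element."""

    def __init__(self):
        super().__init__()
        self.in_bold = False       # currently inside a <b> element
        self.caption_seen = False  # the caption text has been found
        self.verdict = None

    def handle_starttag(self, tag, attrs):
        if tag == 'b':
            self.in_bold = True

    def handle_endtag(self, tag):
        if tag == 'b':
            self.in_bold = False

    def handle_data(self, data):
        if self.in_bold:
            if data.strip() == CAPTION:
                self.caption_seen = True
        elif self.caption_seen and self.verdict is None and data.strip():
            self.verdict = data.strip()


def parse_verdict(html):
    """Return the text after the caption (e.g. 'male!'), or None if not found."""
    parser = VerdictParser()
    parser.feed(html)
    parser.close()
    return parser.verdict


print(parse_verdict(SAMPLE_RESULT))
```

Like the .tail.strip() call in the lxml version, this keeps the trailing exclamation mark; strip it with rstrip('!') if only the word is wanted.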

Answered 2025-04-17 by Python大师

