Submitting data through a web form and extracting the result

<form action="analysis.php" method="POST">
  <textarea cols="75" rows="13" name="text"></textarea>
  <div class="copyright">(NOTE: The genie works best on texts of more than 500 words.)</div>
  <p>
    <b>Genre:</b>
    <input type="radio" value="fiction" name="genre"> fiction
    <input type="radio" value="nonfiction" name="genre"> nonfiction
    <input type="radio" value="blog" name="genre"> blog entry
  </p>
  <p>
</form>

3 answers

User

#1 · edited 2024-05-16 13:39:00

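There is no need for mechanize; just send the correct form data in a POST request.

Also, parsing HTML with regular expressions is a bad idea. It is better to use an HTML parser such as lxml.html.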

import requests
import lxml.html as lh


def gender_genie(text, genre):
    # Target of the form's action attribute (analysis.php).
    url = 'http://bookblog.net/gender/analysis.php'
    caption = 'The Gender Genie thinks the author of this passage is:'

    # 'text' and 'genre' match the name attributes of the form's
    # textarea and radio buttons.
    form_data = {
        'text': text,
        'genre': genre,
        'submit': 'submit',
    }

    response = requests.post(url, data=form_data)

    tree = lh.document_fromstring(response.content)

    # The verdict is the text that follows the <b> element holding the caption.
    return tree.xpath("//b[text()=$caption]", caption=caption)[0].tail.strip()


if __name__ == '__main__':
    print(gender_genie('I have a beard!', 'blog'))
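If requests and lxml are not available, the same two steps (encode the form fields into a POST body, then read the text that follows the caption's <b> element with a real parser) can be sketched with the standard library alone. This is only an illustration under assumptions: build_request, VerdictParser and extract_verdict are names made up here, the sample HTML is modelled on the result line quoted elsewhere on this page, and nothing is actually sent to the site.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode
from urllib.request import Request

URL = 'http://bookblog.net/gender/analysis.php'
CAPTION = 'The Gender Genie thinks the author of this passage is:'


def build_request(text, genre):
    """Build (but do not send) the POST request the form would produce."""
    body = urlencode({'text': text, 'genre': genre, 'submit': 'submit'})
    # Supplying data makes urllib treat the request as a POST.
    return Request(URL, data=body.encode('ascii'))


class VerdictParser(HTMLParser):
    """Collect the text that directly follows <b>CAPTION</b>."""

    def __init__(self):
        super().__init__()
        self.in_b = False        # currently inside a <b> element
        self.b_text = ''         # text seen inside the current <b>
        self.want_tail = False   # the caption's </b> was just closed
        self.verdict = None

    def handle_starttag(self, tag, attrs):
        self.want_tail = False
        if tag == 'b':
            self.in_b, self.b_text = True, ''

    def handle_endtag(self, tag):
        self.want_tail = (tag == 'b' and self.in_b and self.b_text == CAPTION)
        if tag == 'b':
            self.in_b = False

    def handle_data(self, data):
        if self.in_b:
            self.b_text += data
        elif self.want_tail and self.verdict is None:
            self.verdict = data.strip()
            self.want_tail = False


def extract_verdict(html):
    """Return the text after the caption, or None if the caption is absent."""
    parser = VerdictParser()
    parser.feed(html)
    parser.close()
    return parser.verdict


if __name__ == '__main__':
    request = build_request('I have a beard!', 'blog')
    print(request.get_method(), request.data)
    print(extract_verdict('<p><b>' + CAPTION + '</b> male!</p>'))
```

To actually submit, the request object could be passed to urllib.request.urlopen; that step is left out here so the sketch runs offline.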

User

#2 · edited 2024-05-16 13:39:00

You can use mechanize; see its examples for details.

from mechanize import ParseResponse, urlopen, urljoin

uri = "http://bookblog.net"

response = urlopen(urljoin(uri, "/gender/genie.php"))
forms = ParseResponse(response, backwards_compat=False)
form = forms[0]

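# print(form)  # uncomment to inspect the controls that were found

# Fill in the textarea and pick a radio button (the radio group takes a list).
form['text'] = 'cheese'
form['genre'] = ['fiction']

# form.click() builds the request that submitting the form would send.
print(urlopen(form.click()).read())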
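To see roughly what the form-parsing step has to work with, the form from the question can be inspected with the standard library. This is a rough sketch, not mechanize's implementation: FormSniffer, PAGE_URL and FORM_HTML are names invented here, and the HTML is an abridged copy of the form quoted in the question (which is itself truncated, so any submit button is missing). It collects the form's action, method and control names, then resolves the action against the page URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

PAGE_URL = 'http://bookblog.net/gender/genie.php'

# Abridged copy of the form shown in the question.
FORM_HTML = '''
<form action="analysis.php" method="POST">
<textarea cols="75" rows="13" name="text"></textarea>
<p><b>Genre:</b>
<input type="radio" value="fiction" name="genre"> fiction
<input type="radio" value="nonfiction" name="genre"> nonfiction
<input type="radio" value="blog" name="genre"> blog entry
</p>
</form>
'''


class FormSniffer(HTMLParser):
    """Record the first form's action, method and named controls."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.method = None
        self.controls = {}   # control name -> list of offered values

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'form' and self.action is None:
            self.action = attrs.get('action', '')
            self.method = attrs.get('method', 'GET').upper()
        elif tag in ('input', 'textarea', 'select') and 'name' in attrs:
            values = self.controls.setdefault(attrs['name'], [])
            if 'value' in attrs:
                values.append(attrs['value'])


sniffer = FormSniffer()
sniffer.feed(FORM_HTML)
sniffer.close()

# A relative action is resolved against the URL of the page holding the form.
submit_url = urljoin(PAGE_URL, sniffer.action)
print(sniffer.method, submit_url)
print(sniffer.controls)
```

The resolved URL is the same analysis.php address that the first answer posts to directly.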

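User

#3 · edited 2024-05-16 13:39:00

You can use mechanize to submit the form and retrieve the response, and the re module to pull out the part you need. For example, the script below runs on the text of your own question: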

import re
from mechanize import Browser

text = """
My python level is Novice. I have never written a web scraper 
or crawler. I have written a python code to connect to an api and 
extract the data that I want. But for some the extracted data I want to 
get the gender of the author. I found this web site 
http://bookblog.net/gender/genie.php but downside is there isn't an api 
available. I was wondering how to write a python to submit data to the 
form in the page and extract the return data. It would be a great help 
if I could get some guidance on this."""

browser = Browser()
browser.open("http://bookblog.net/gender/genie.php")

browser.select_form(nr=0)
browser['text'] = text
browser['genre'] = ['nonfiction']

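response = browser.submit()

# read() returns bytes on Python 3; decode so the str pattern below can match
# (assumes the page is UTF-8 compatible).
content = response.read().decode('utf-8', 'replace')

result = re.findall(
    r'<b>The Gender Genie thinks the author of this passage is:</b> (\w*)!', content)

print(result[0])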

What does it do? It creates a mechanize.Browser and opens the given URL:

browser = Browser()
browser.open("http://bookblog.net/gender/genie.php")

Then it selects the form (there is only one form to fill in, so it is the first one):

browser.select_form(nr=0)

Next, it sets the form's fields...

browser['text'] = text
browser['genre'] = ['nonfiction']

...and submits it:

response = browser.submit()

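Now we read the result, decoding it to text so that the str regex used in the next step can match it on Python 3 (this assumes the page is UTF-8 compatible):

content = response.read().decode('utf-8', 'replace')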

We know the result looks like this:

<b>The Gender Genie thinks the author of this passage is:</b> male!

So we build a regex that matches it and call re.findall():

result = re.findall(
    r'<b>The Gender Genie thinks the author of this passage is:</b> (\w*)!',
    content)

Now the result is available for you to use:

print(result[0])
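The extraction step can be tried offline, without contacting the site. The sketch below wraps the same pattern in a hypothetical helper, extract_gender (a name invented here), that accepts either bytes or text and returns None instead of raising IndexError when the page does not contain the expected sentence. The sample input is the result line shown above; decoding as UTF-8 is an assumption about the page.

```python
import re

PATTERN = re.compile(
    r'<b>The Gender Genie thinks the author of this passage is:</b> (\w*)!')


def extract_gender(content):
    """Return the word captured by PATTERN, or None if it is absent."""
    if isinstance(content, bytes):
        # Assumption: the page is UTF-8 compatible; bad bytes are replaced.
        content = content.decode('utf-8', 'replace')
    match = PATTERN.search(content)
    return match.group(1) if match else None


sample = b'<b>The Gender Genie thinks the author of this passage is:</b> male!'
print(extract_gender(sample))
print(extract_gender('<html>unexpected page</html>'))
```

Returning None makes a changed or error page easy to detect, whereas result[0] on an empty list would stop the script with an IndexError.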
