How to find element text in an XHTML document using lxml

Published 2024-04-23 11:34:41


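I have been at this for ages and must be doing something stupid.

I am trying to retrieve all of the possible Wikipedia supported languages and output them to a text file by traversing the tables on List_of_Wikipedias.

Here is my Python code so far, which only tries to retrieve one of the tables: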

import httplib
from lxml import etree

def main():
    conn = httplib.HTTPConnection("meta.wikimedia.org")
    conn.request("GET","/wiki/List_of_Wikipedias")
    res = conn.getresponse()
    root = etree.fromstring(res.read())
    table = root.xpath('//table')
    print table

main()

On my machine this prints only an empty list. To speed things up, I cached the page locally and used:

^{pr2}$

but this made no difference (apart from the obvious speed-up). I have also tried

lxml.find('table')

and:

for element in root.iter():
    print("%s - %s" % (element.tag, element.text))

which successfully prints out all the elements, so I know the tree is being built.

What am I doing wrong?

Any help would be much appreciated. Thanks.


3 answers
I am trying to retrieve all of the possible Wikipedia supported languages and output them to a text file by traversing the tables on List_of_Wikipedias

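Your problem is that the element names in the document are in a default namespace. How to write XPath expressions that contain such element names is the most frequently asked question about XPath, and there are many good answers under the xpath tag on SO. Just search for them.

Here is a complete solution:

Use: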

^{pr2}$

where the XHTML namespace ("http://www.w3.org/1999/xhtml") has been registered and bound to the prefix "x".
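In lxml, one way to register such a prefix is to pass a prefix-to-URI mapping to xpath(). The sketch below is only an illustration: the tiny inline document and the variable names are made up, and only the XHTML namespace URI and the "x" prefix come from the answer above.

```python
from lxml import etree

# Hypothetical miniature XHTML document standing in for the real page.
XHTML = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<body><table>'
    '<tr><th>No</th><th>Language</th></tr>'
    '<tr><td>1</td><td><a href="#en">English</a></td></tr>'
    '<tr><td>2</td><td><a href="#de">German</a></td></tr>'
    '</table></body></html>'
)

root = etree.fromstring(XHTML)

# Bind the prefix "x" to the XHTML namespace for these queries.
ns = {"x": "http://www.w3.org/1999/xhtml"}

# An unprefixed name only matches elements in no namespace, so this is empty.
unprefixed = root.xpath("//table")

# With the prefix bound, the same steps match the XHTML elements.
prefixed = root.xpath("//x:table", namespaces=ns)
languages = root.xpath("//x:table/x:tr/x:td[2]//text()", namespaces=ns)

print(len(unprefixed), len(prefixed), languages)
```

The prefix itself is arbitrary; what has to match the document is the namespace URI it is bound to.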

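I evaluated this XPath expression against the document obtained from: http://s23.org/wikistats/wikipedias_html

I had to add the following at the start of the document, because I work locally and do not have the XHTML DTD; you may not need this: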

<!DOCTYPE html [
<!ENTITY uarr "&#8593;">
<!ENTITY darr "&#8595;">
<!ENTITY ccedil "&#231;">
<!ENTITY oslash "&#248;">
<!ENTITY aacute "&#225;">
<!ENTITY aring "&#229;">
<!ENTITY agrave "&#224;">
<!ENTITY egrave "&#232;">
<!ENTITY ograve "&#242;">
<!ENTITY ocirc "&#244;">
]>
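To see why these declarations matter when the DTD is not available, here is a small sketch, assuming lxml's default XML parser settings. The one-paragraph document is made up; only the uarr entity is taken from the list above.

```python
from lxml import etree

# Hypothetical fragment that uses a named entity, as the real page does.
BODY = "<html><body><p>&uarr; sort</p></body></html>"

# Without any DTD the entity is undeclared, so the XML parser rejects the input.
try:
    etree.fromstring(BODY)
    parsed_without_dtd = True
except etree.XMLSyntaxError:
    parsed_without_dtd = False

# Declaring the entity in an internal DTD subset lets the document parse,
# and the entity reference is replaced by the character it stands for.
DOCTYPE = '<!DOCTYPE html [ <!ENTITY uarr "&#8593;"> ]>'
root = etree.fromstring(DOCTYPE + BODY)
text = root.findtext("body/p")

print(parsed_without_dtd, repr(text))
```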

The result of applying the above XPath expression to this document is:

                    English

                    German

                    French

                    Polish

                    Italian

                    Japanese

                    Spanish

                    Portuguese

                    Dutch

                    Russian

                    Swedish

                    Chinese

                    Catalan

                    Norwegian (Bokmål)

                    Finnish

                    Ukrainian

                    Czech

                    Hungarian

                    Romanian

                    Korean

                    Turkish

                    Vietnamese

                    Indonesian

                    Danish

                    Arabic

                    Esperanto

                    Serbian

                    Lithuanian

                    Slovak

                    Volapük

                    Persian

                    Hebrew

                    Bulgarian

                    Slovenian

                    Malay

                    Waray-Waray

                    Croatian

                    Estonian

                    Newar / Nepal Bhasa

                    Simple English

                    Hindi

                    Galician

                    Thai

                    Basque

                    Norwegian (Nynorsk)

                    Aromanian

                    Greek

                    Haitian

                    Azerbaijani

                    Tagalog

                    Latin

                    Telugu

                    Georgian

                    Macedonian

                    Cebuano

                    Serbo-Croatian

                    Breton

                    Piedmontese

                    Marathi

                    Latvian

                    Luxembourgish

                    Javanese

                    Belarusian (Taraškievica)

                    Welsh

                    Icelandic

                    Bosnian

                    Albanian

                    Tamil

                    Belarusian

                    Bishnupriya Manipuri

                    Aragonese

                    Occitan

                    Bengali

                    Swahili

                    Ido

                    Lombard

                    West Frisian

                    Gujarati

                    Afrikaans

                    Low Saxon

                    Malayalam

                    Quechua

                    Sicilian

                    Urdu

                    Kurdish

                    Cantonese

                    Sundanese

                    Asturian

                    Neapolitan

                    Samogitian

                    Armenian

                    Yoruba

                    Irish

                    Chuvash

                    Walloon

                    Nepali

                    Ripuarian

                    Western Panjabi

                    Kannada

                    Tajik

                    Tarantino

                    Venetian

                    Yiddish

                    Scottish Gaelic

                    Tatar

                    Min Nan

                    Ossetian

                    Uzbek

                    Alemannic

                    Kapampangan

                    Sakha

                    Egyptian Arabic

                    Kazakh

                    Maori

                    Limburgian

                    Amharic

                    Nahuatl

                    Upper Sorbian

                    Gilaki

                    Corsican

                    Gan

                    Mongolian

                    Scots

                    Interlingua

                    Central_Bicolano

                    Burmese

                    Faroese

                    Võro

                    Dutch Low Saxon

                    Sinhalese

                    Turkmen

                    West Flemish

                    Sanskrit

                    Bavarian

                    Malagasy

                    Manx

                    Ilokano

                    Divehi

                    Norman

                    Pangasinan

                    Banyumasan

                    Sorani

                    Romansh

                    Northern Sami

                    Zazaki

                    Mazandarani

                    Wu

                    Friulian

                    Uyghur

                    Ligurian

                    Maltese

                    Bihari

                    Novial

                    Tibetan

                    Anglo-Saxon

                    Kashubian

                    Sardinian

                    Classical Chinese

                    Fiji Hindi

                    Khmer

                    Ladino

                    Zamboanga Chavacano

                    Pali

                    Franco-Provençal/Arpitan

                    Pashto

                    Hakka

                    Cornish

                    Punjabi

                    Navajo

                    Silesian

                    Kalmyk

                    Pennsylvania German

                    Hawaiian

                    Saterland Frisian

                    Interlingue

                    Somali

                    Komi

                    Karachay-Balkar

                    Crimean Tatar

                    Tongan

                    Acehnese

                    Meadow Mari

                    Picard

                    Erzya

                    Lingala

                    Kinyarwanda

                    Extremaduran

                    Guarani

                    Kirghiz

                    Emilian-Romagnol

                    Assyrian Neo-Aramaic

                    Papiamentu

                    Aymara

                    Chechen

                    Lojban

                    Wolof

                    Banjar

                    Bashkir

                    North Frisian

                    Greenlandic

                    Tok Pisin

                    Udmurt

                    Kabyle

                    Tahitian

                    Sranan

                    Zealandic

                    Hill Mari

                    Komi-Permyak

                    Lower Sorbian

                    Abkhazian

                    Gagauz

                    Igbo

                    Oriya

                    Lao

                    Kongo

                    Avar

                    Moksha

                    Mirandese

                    Romani

                    Old Church Slavonic

                    Karakalpak

                    Samoan

                    Moldovan

                    Tetum

                    Gothic

                    Kashmiri

                    Bambara

                    Inupiak

                    Sindhi

                    Bislama

                    Lak

                    Nauruan

                    Norfolk

                    Inuktitut

                    Pontic

                    Assamese

                    Cherokee

                    Min Dong

                    Swati

                    Palatinate German

                    Hausa

                    Ewe

                    Tigrinya

                    Oromo

                    Zulu

                    Zhuang

                    Venda

                    Tsonga

                    Kirundi

                    Dzongkha

                    Sango

                    Cree

                    Chamorro

                    Luganda

                    Buginese

                    Buryat (Russia)

                    Fijian

                    Chichewa

                    Akan

                    Sesotho

                    Xhosa

                    Fula

                    Tswana

                    Kikuyu

                    Tumbuka

                    Shona

                    Twi

                    Cheyenne

                    Ndonga

                    Sichuan Yi

                    Choctaw

                    Marshallese

                    Afar

                    Kuanyama

                    Hiri Motu

                    Muscogee

                    Kanuri

                    Herero

Note: every second selected node is a whitespace-only text node. If you do not want these selected, use:

(//x:table)[1]/x:tr[not(x:th)]/x:td[2]//text()[normalize-space()]
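The effect of the normalize-space() predicate can be checked on a small sample. This is a sketch only: the inline document is invented and deliberately indented so that the cells contain whitespace-only text nodes, and the expression without the predicate is an assumption that simply drops [normalize-space()] from the one above.

```python
from lxml import etree

# Hypothetical XHTML sample; the line breaks inside the cells create
# whitespace-only text nodes around each link.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml"><body><table>
<tr><th>No</th><th>Language</th></tr>
<tr><td>1</td><td>
  <a href="#en">English</a>
</td></tr>
<tr><td>2</td><td>
  <a href="#de">German</a>
</td></tr>
</table></body></html>"""

root = etree.fromstring(XHTML)
ns = {"x": "http://www.w3.org/1999/xhtml"}

base = "(//x:table)[1]/x:tr[not(x:th)]/x:td[2]//text()"

# Every text node under the second cell, including pure whitespace.
all_text = root.xpath(base, namespaces=ns)

# Only text nodes that are non-empty after whitespace normalisation.
non_blank = root.xpath(base + "[normalize-space()]", namespaces=ns)

print(len(all_text), non_blank)
```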

XPath needs namespaces. The page you downloaded starts with:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" dir="ltr">

So you really want

^{pr2}$

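where html is a prefix bound to "http://www.w3.org/1999/xhtml".

You will have to find out how to bind namespaces in lxml - I am not a Python expert.

If this is your problem, I sympathise - it has caught out me and many others!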
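For the find() and iter() attempts in the question, lxml also accepts namespace-qualified names written as {uri}name. The following is a minimal sketch on a made-up one-row document; the variable names are illustrative, not taken from the question.

```python
from lxml import etree

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical one-row document in the XHTML default namespace.
root = etree.fromstring(
    '<html xmlns="%s"><body><table><tr><td>English</td></tr></table></body></html>'
    % XHTML_NS
)

# Each tag carries its namespace, which is why a bare 'table' never matches.
tags = [el.tag for el in root.iter()]

plain = root.find(".//table")                      # no namespace: finds nothing
qualified = root.find(".//{%s}table" % XHTML_NS)   # {uri}name form
with_prefix = root.find(".//x:table", namespaces={"x": XHTML_NS})

print(tags[0], plain, etree.QName(qualified).localname)
```

Printing element.tag, as the loop in the question already does, is a quick way to check whether a document uses a namespace at all.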

Parse it as HTML.

from lxml import html

url = 'http://meta.wikimedia.org/wiki/List_of_Wikipedias'
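# The HTML parser does not put elements in a namespace, so plain names match.
tree = html.parse(url)
# Second cell of each table row; the link text is the language name.
languages = tree.xpath('//table/tr/td[2]/a/text()')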
print('\n'.join(languages))

Output:

^{pr2}$
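The same idea can be tried without network access by parsing a string. This is a sketch under assumptions: the page content below is invented, and only the DOCTYPE, the html start tag and the XPath expression are copied from the answers above.

```python
from lxml import html

# Hypothetical page; the first two lines mirror the start of the real one.
PAGE = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" dir="ltr">
<body><table>
<tr><th>No</th><th>Language</th></tr>
<tr><td>1</td><td><a href="#en">English</a></td></tr>
<tr><td>2</td><td><a href="#de">German</a></td></tr>
</table></body></html>"""

# lxml.html uses the HTML parser, which treats xmlns as an ordinary attribute,
# so the unprefixed XPath from the answer works unchanged.
tree = html.fromstring(PAGE)
languages = tree.xpath('//table/tr/td[2]/a/text()')

print('\n'.join(languages))
```

Compared with the namespace-aware XML approach, this trades strictness for convenience: the HTML parser also tolerates undeclared entities and markup that is not well-formed.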
