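How to find element text in an XHTML document using lxml
I have been struggling with this for a while and feel like I must be doing something silly.
I want to retrieve all of the languages Wikipedia supports and write them to a text file, by traversing the tables on the List of Wikipedias page.
Here is the Python code I have so far, which is just trying to get hold of one of the tables: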
import httplib
from lxml import etree
def main():
    conn = httplib.HTTPConnection("meta.wikimedia.org")
    conn.request("GET", "/wiki/List_of_Wikipedias")
    res = conn.getresponse()
    root = etree.fromstring(res.read())
    table = root.xpath('//table')
    print table
main()
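On my machine this prints only an empty list. To speed things up I cached the page locally and used:
wikipage = open("wikipage.html")
root = etree.parse(wikipage)
but that made no difference (apart from the obvious speed increase). I also tried
root.find('table')
and: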
for element in root.iter():
    print("%s - %s" % (element.tag, element.text))
The loop successfully printed every element, so I know the tree is being built.
What am I doing wrong?
Any help is much appreciated. Thanks.
3 Answers
0
Your XPath expression needs to take the namespace into account. The page you downloaded starts with:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" dir="ltr">
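so what you actually want is
xpath('//html:table')
where html is a prefix bound to "http://www.w3.org/1999/xhtml".
You will need to look up how to bind namespaces in lxml; I'm not a Python expert.
If this is your problem, I sympathize: it has caught me and many others out too!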
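For what it's worth, here is a minimal sketch of how the prefix binding could look in lxml, using the namespaces keyword argument of xpath(). The SAMPLE document is a made-up stand-in for the real page, and the variable names are only for illustration.

```python
from lxml import etree

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical stand-in for the downloaded page: same default namespace,
# but only one tiny table.
SAMPLE = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<body><table><tr><td>1</td><td><a>English</a></td></tr></table></body>"
    "</html>"
)

root = etree.fromstring(SAMPLE)

# An unprefixed name in XPath 1.0 means "in no namespace", so nothing matches.
unprefixed = root.xpath("//table")

# Binding the prefix "html" to the XHTML namespace makes the table visible.
prefixed = root.xpath("//html:table", namespaces={"html": XHTML_NS})

print(len(unprefixed), len(prefixed))
```

The prefix itself is arbitrary; only the namespace URI it is bound to has to match the one declared in the document.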
3
Parse it as HTML.
from lxml import html
url = 'http://meta.wikimedia.org/wiki/List_of_Wikipedias'
tree = html.parse(url)
languages = tree.xpath('//table/tr/td[2]/a/text()')
print('\n'.join(languages))
Output
English
German
French
Polish
Italian
Japanese
Spanish
Portuguese
Dutch
Russian
Swedish
Chinese
Catalan
Norwegian (Bokmål)
Finnish
Ukrainian
Czech
Hungarian
Romanian
Korean
Turkish
Vietnamese
Indonesian
Danish
Arabic
Esperanto
Serbian
Lithuanian
Slovak
Volapük
Persian
Hebrew
Bulgarian
Slovenian
Malay
Waray-Waray
Croatian
Estonian
Newar / Nepal Bhasa
Simple English
Hindi
Galician
Thai
Basque
Norwegian (Nynorsk)
Aromanian
Greek
Haitian
Azerbaijani
Tagalog
Latin
Telugu
Georgian
Macedonian
Cebuano
Serbo-Croatian
Breton
Piedmontese
Marathi
Latvian
Luxembourgish
Javanese
Belarusian (Taraškievica)
Welsh
Icelandic
Bosnian
Albanian
Tamil
Belarusian
Bishnupriya Manipuri
Aragonese
Occitan
Bengali
Swahili
Ido
Lombard
West Frisian
Gujarati
Afrikaans
Low Saxon
Malayalam
Quechua
Sicilian
Urdu
Kurdish
Cantonese
Sundanese
Asturian
Neapolitan
Samogitian
Armenian
Yoruba
Irish
Chuvash
Walloon
Nepali
Ripuarian
Western Panjabi
Kannada
Tajik
Tarantino
Venetian
Yiddish
Scottish Gaelic
Tatar
Min Nan
Ossetian
Uzbek
Alemannic
Kapampangan
Sakha
Kazakh
Egyptian Arabic
Maori
Amharic
Limburgian
Nahuatl
Upper Sorbian
Gilaki
Corsican
Gan
Mongolian
Scots
Interlingua
Central_Bicolano
Burmese
Faroese
Võro
Dutch Low Saxon
Sinhalese
Turkmen
West Flemish
Sanskrit
Bavarian
Malagasy
Manx
Ilokano
Divehi
Norman
Pangasinan
Banyumasan
Sorani
Romansh
Northern Sami
Zazaki
Mazandarani
Wu
Friulian
Uyghur
Ligurian
Maltese
Bihari
Novial
Tibetan
Anglo-Saxon
Kashubian
Sardinian
Classical Chinese
Fiji Hindi
Khmer
Ladino
Zamboanga Chavacano
Pali
Franco-Provençal/Arpitan
Pashto
Hakka
Cornish
Punjabi
Navajo
Silesian
Kalmyk
Pennsylvania German
Hawaiian
Saterland Frisian
Interlingue
Somali
Komi
Karachay-Balkar
Crimean Tatar
Tongan
Acehnese
Meadow Mari
Picard
Kinyarwanda
Erzya
Lingala
Extremaduran
Guarani
Kirghiz
Emilian-Romagnol
Assyrian Neo-Aramaic
Papiamentu
Aymara
Chechen
Lojban
Wolof
Banjar
Bashkir
North Frisian
Greenlandic
Tok Pisin
Udmurt
Kabyle
Tahitian
Sranan
Zealandic
Hill Mari
Komi-Permyak
Lower Sorbian
Abkhazian
Gagauz
Igbo
Oriya
Lao
Kongo
Avar
Moksha
Mirandese
Romani
Old Church Slavonic
Karakalpak
Samoan
Moldovan
Tetum
Gothic
Kashmiri
Bambara
Inupiak
Sindhi
Bislama
Lak
Nauruan
Norfolk
Inuktitut
Pontic
Assamese
Cherokee
Min Dong
Palatinate German
Swati
Hausa
Ewe
Tigrinya
Oromo
Zulu
Zhuang
Venda
Tsonga
Kirundi
Cree
Dzongkha
Sango
Chamorro
Luganda
Buginese
Buryat (Russia)
Fijian
Chichewa
Akan
Sesotho
Xhosa
Fula
Tswana
Kikuyu
Tumbuka
Shona
Twi
Cheyenne
Ndonga
Sichuan Yi
Choctaw
Marshallese
Afar
Kuanyama
Hiri Motu
Muscogee
Kanuri
Herero
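As a rough illustration of why the HTML approach above needs no prefixes: lxml's HTML parser treats xmlns as an ordinary attribute, so the elements end up in no namespace and a plain //table matches. The sketch below uses a small made-up SAMPLE string instead of the live page, so it runs without network access.

```python
from lxml import html

# Hypothetical stand-in for the real page; note the xmlns declaration is
# still present, exactly as in the XHTML source.
SAMPLE = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body><table>'
    '<tr><td>1</td><td><a href="#">English</a></td></tr>'
    '<tr><td>2</td><td><a href="#">German</a></td></tr>'
    "</table></body></html>"
)

tree = html.fromstring(SAMPLE)

# Same unprefixed expression as in the answer; it matches because the
# HTML parser does not place elements in the XHTML namespace.
languages = tree.xpath("//table/tr/td[2]/a/text()")

print("\n".join(languages))
```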
3
I am trying to retrieve all of the possible Wikipedia supported languages and output them to a text file by traversing the tables on List_of_Wikipedias
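Your problem is that the element names in the document are in a default namespace. How to write XPath expressions that involve such element names is one of the most frequently asked XPath questions, and there are many good answers online; a quick search will find them.
Here is a complete solution.
Use: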
(//x:table)[1]/x:tr[not(x:th)]/x:td[2]//text()
Here you need to register the XHTML namespace ("http://www.w3.org/1999/xhtml") and bind it to the prefix "x".
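I evaluated this XPath expression against the document retrieved from this link: http://s23.org/wikistats/wikipedias_html
I had to add the following at the start of the document, because I am working locally and don't have the XHTML DTD (you may not need this):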
<!DOCTYPE html [
<!ENTITY uarr "↑">
<!ENTITY darr "↓">
<!ENTITY ccedil "Ç">
<!ENTITY oslash "Ø">
<!ENTITY aacute "á">
<!ENTITY aring "å">
<!ENTITY agrave "À">
<!ENTITY egrave "è">
<!ENTITY ograve "Ò">
<!ENTITY ocirc "ô">
]>
Applying the XPath expression above to this document gives:
English
German
French
Polish
Italian
Japanese
Spanish
Portuguese
Dutch
Russian
Swedish
Chinese
Catalan
Norwegian (Bokmål)
Finnish
Ukrainian
Czech
Hungarian
Romanian
Korean
Turkish
Vietnamese
Indonesian
Danish
Arabic
Esperanto
Serbian
Lithuanian
Slovak
Volapük
Persian
Hebrew
Bulgarian
Slovenian
Malay
Waray-Waray
Croatian
Estonian
Newar / Nepal Bhasa
Simple English
Hindi
Galician
Thai
Basque
Norwegian (Nynorsk)
Aromanian
Greek
Haitian
Azerbaijani
Tagalog
Latin
Telugu
Georgian
Macedonian
Cebuano
Serbo-Croatian
Breton
Piedmontese
Marathi
Latvian
Luxembourgish
Javanese
Belarusian (Taraškievica)
Welsh
Icelandic
Bosnian
Albanian
Tamil
Belarusian
Bishnupriya Manipuri
Aragonese
Occitan
Bengali
Swahili
Ido
Lombard
West Frisian
Gujarati
Afrikaans
Low Saxon
Malayalam
Quechua
Sicilian
Urdu
Kurdish
Cantonese
Sundanese
Asturian
Neapolitan
Samogitian
Armenian
Yoruba
Irish
Chuvash
Walloon
Nepali
Ripuarian
Western Panjabi
Kannada
Tajik
Tarantino
Venetian
Yiddish
Scottish Gaelic
Tatar
Min Nan
Ossetian
Uzbek
Alemannic
Kapampangan
Sakha
Egyptian Arabic
Kazakh
Maori
Limburgian
Amharic
Nahuatl
Upper Sorbian
Gilaki
Corsican
Gan
Mongolian
Scots
Interlingua
Central_Bicolano
Burmese
Faroese
Võro
Dutch Low Saxon
Sinhalese
Turkmen
West Flemish
Sanskrit
Bavarian
Malagasy
Manx
Ilokano
Divehi
Norman
Pangasinan
Banyumasan
Sorani
Romansh
Northern Sami
Zazaki
Mazandarani
Wu
Friulian
Uyghur
Ligurian
Maltese
Bihari
Novial
Tibetan
Anglo-Saxon
Kashubian
Sardinian
Classical Chinese
Fiji Hindi
Khmer
Ladino
Zamboanga Chavacano
Pali
Franco-Provençal/Arpitan
Pashto
Hakka
Cornish
Punjabi
Navajo
Silesian
Kalmyk
Pennsylvania German
Hawaiian
Saterland Frisian
Interlingue
Somali
Komi
Karachay-Balkar
Crimean Tatar
Tongan
Acehnese
Meadow Mari
Picard
Erzya
Lingala
Kinyarwanda
Extremaduran
Guarani
Kirghiz
Emilian-Romagnol
Assyrian Neo-Aramaic
Papiamentu
Aymara
Chechen
Lojban
Wolof
Banjar
Bashkir
North Frisian
Greenlandic
Tok Pisin
Udmurt
Kabyle
Tahitian
Sranan
Zealandic
Hill Mari
Komi-Permyak
Lower Sorbian
Abkhazian
Gagauz
Igbo
Oriya
Lao
Kongo
Avar
Moksha
Mirandese
Romani
Old Church Slavonic
Karakalpak
Samoan
Moldovan
Tetum
Gothic
Kashmiri
Bambara
Inupiak
Sindhi
Bislama
Lak
Nauruan
Norfolk
Inuktitut
Pontic
Assamese
Cherokee
Min Dong
Swati
Palatinate German
Hausa
Ewe
Tigrinya
Oromo
Zulu
Zhuang
Venda
Tsonga
Kirundi
Dzongkha
Sango
Cree
Chamorro
Luganda
Buginese
Buryat (Russia)
Fijian
Chichewa
Akan
Sesotho
Xhosa
Fula
Tswana
Kikuyu
Tumbuka
Shona
Twi
Cheyenne
Ndonga
Sichuan Yi
Choctaw
Marshallese
Afar
Kuanyama
Hiri Motu
Muscogee
Kanuri
Herero
Note: every second selected node is a whitespace-only text node. If you don't want to select those, use:
(//x:table)[1]/x:tr[not(x:th)]/x:td[2]//text()[normalize-space()]
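A small sketch of how both expressions might be evaluated from Python with lxml, assuming the prefix "x" is bound through the namespaces argument of xpath(). The SAMPLE table is invented for the example; its line breaks inside the cells are there to produce the whitespace-only text nodes mentioned above.

```python
from lxml import etree

NS = {"x": "http://www.w3.org/1999/xhtml"}

# Hypothetical two-row table with a header row and whitespace around links.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<table>
<tr><th>No</th><th>Language</th></tr>
<tr><td>1</td><td>
<a href="#">English</a>
</td></tr>
<tr><td>2</td><td>
<a href="#">German</a>
</td></tr>
</table>
</body></html>"""

root = etree.fromstring(SAMPLE)

# First expression: includes the whitespace-only text nodes.
all_text = root.xpath(
    "(//x:table)[1]/x:tr[not(x:th)]/x:td[2]//text()", namespaces=NS
)

# Second expression: the normalize-space() predicate filters them out.
non_blank = root.xpath(
    "(//x:table)[1]/x:tr[not(x:th)]/x:td[2]//text()[normalize-space()]",
    namespaces=NS,
)

print(len(all_text), list(non_blank))
```

The filtered result can then be joined with newlines and written to a text file.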