Python package for processing MediaWiki XML content dumps



mediawiki-dump


pip install mediawiki_dump

Python 3 package for processing MediaWiki XML content dumps.

Supports Wikipedia (bz2-compressed) and Wikia (7zip) content dumps.
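For the bz2 case, Python's standard library can decompress such a stream on its own. The sketch below is only an illustration of that idea on an in-memory sample; it does not show how this package reads dumps internally, and the XML snippet is made up for the example.

```python
import bz2
import io

# Made-up miniature "dump", compressed in memory so the example needs no files.
xml_text = "<mediawiki><page><title>Forsíða</title></page></mediawiki>"
compressed = io.BytesIO(bz2.compress(xml_text.encode("utf-8")))

# bz2.open accepts a file object and decompresses lazily as it is read.
with bz2.open(compressed, mode="rt", encoding="utf-8") as stream:
    restored = stream.read()

print(restored == xml_text)
```

Reading through a file-like object rather than decompressing to disk first keeps memory use low, which matters for full Wikipedia dumps.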

Dependencies

In order to read 7zip archives (used by Wikia's XML dumps) you need to install the libarchive development package:

sudo apt install libarchive-dev

API

Tokenizer

Allows you to clean up the wikitext:

from mediawiki_dump.tokenizer import clean

clean('[[Foo|bar]] is a link')
# 'bar is a link'
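To make concrete what cleaning does to the link in the example above, here is a minimal sketch that handles only wikitext links. It is not the package's implementation: `strip_links` is a hypothetical helper, and real wikitext (templates, tables, references) needs far more than one regular expression.

```python
import re

# Hypothetical helper, not part of mediawiki_dump: handles only [[...]] links.
_LINK = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]|]*)\]\]")


def strip_links(wikitext: str) -> str:
    """Replace [[target|label]] with label and [[target]] with target."""
    return _LINK.sub(r"\1", wikitext)


print(strip_links("[[Foo|bar]] is a link"))    # bar is a link
print(strip_links("See [[Foo]] for details"))  # See Foo for details
```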

And then tokenize the text:

from mediawiki_dump.tokenizer import tokenize

tokenize('11. juni 2007 varð kunngjørt, at Svínoyar kommuna verður løgd saman við Klaksvíkar kommunu eftir komandi bygdaráðsval.')
# ['juni', 'varð', 'kunngjørt', 'at', 'Svínoyar', 'kommuna', 'verður', 'løgd', 'saman', 'við', 'Klaksvíkar', 'kommunu', 'eftir', 'komandi', 'bygdaráðsval']
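The output above keeps words and drops numbers and punctuation. A rough way to get the same effect on this kind of input is a Unicode-aware regular expression, sketched below. `simple_tokenize` is a hypothetical name, and the package's own tokenizer may well work differently.

```python
import re

# Hypothetical helper, not the package's tokenizer: runs of letters only,
# excluding digits and underscores, so "11." and "2007" are dropped.
_WORD = re.compile(r"[^\W\d_]+")


def simple_tokenize(text: str) -> list:
    return _WORD.findall(text)


print(simple_tokenize("11. juni 2007 varð kunngjørt, at Svínoyar kommuna"))
# ['juni', 'varð', 'kunngjørt', 'at', 'Svínoyar', 'kommuna']
```

The pattern `[^\W\d_]` reads as "a word character that is neither a digit nor an underscore", which in Python 3 covers non-ASCII letters such as ð and ø without listing them.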

Dump reader

Fetch and parse dumps (using a local file cache):

from mediawiki_dump.dumps import WikipediaDump
from mediawiki_dump.reader import DumpReader

dump = WikipediaDump('fo')
pages = DumpReader().read(dump)

[page.title for page in pages][:10]
# ['Main Page', 'Brúkari:Jon Harald Søby', 'Forsíða', 'Ormurin Langi', 'Regin smiður', 'Fyrimynd:InterLingvLigoj', 'Heimsyvirlýsingin um mannarættindi', 'Bólkur:Kvæði', 'Bólkur:Yrking', 'Kjak:Forsíða']

The read method yields a DumpEntry object for each revision.
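The idea of yielding one entry per revision can be sketched with the standard library alone. Everything below is an assumption made for illustration: `Entry` is a hypothetical stand-in for DumpEntry, the sample XML is hand-written in the general shape of a MediaWiki export (XML namespace attributes omitted), and the "Anonymous" fallback is a guess, not documented behaviour.

```python
import io
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Hand-written sample; real exports also carry an XML namespace and more fields.
SAMPLE = b"""<mediawiki>
  <page>
    <title>Lua</title>
    <ns>0</ns>
    <revision>
      <timestamp>2018-09-11T14:04:15Z</timestamp>
      <contributor><username>Macbre</username></contributor>
      <text>first draft</text>
    </revision>
    <revision>
      <timestamp>2018-09-11T14:14:24Z</timestamp>
      <contributor><username>Macbre</username></contributor>
      <text>second draft</text>
    </revision>
  </page>
</mediawiki>"""


@dataclass
class Entry:  # hypothetical stand-in for DumpEntry
    title: str
    namespace: int
    timestamp: str
    author: str
    content: str


def iter_revisions(stream):
    """Yield one Entry per <revision>, clearing each page once processed."""
    for _, page in ET.iterparse(stream, events=("end",)):
        if page.tag != "page":
            continue
        title = page.findtext("title")
        namespace = int(page.findtext("ns"))
        for rev in page.iter("revision"):
            yield Entry(
                title,
                namespace,
                rev.findtext("timestamp"),
                rev.findtext("contributor/username") or "Anonymous",
                rev.findtext("text") or "",
            )
        page.clear()  # keep memory flat on large dumps


entries = list(iter_revisions(io.BytesIO(SAMPLE)))
print([(e.title, e.timestamp) for e in entries])
```

Using a generator with `iterparse` means the whole dump never has to sit in memory at once, which is the main reason to stream revisions rather than build a list.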

By using the DumpReaderArticles class you can read article pages only:

import logging; logging.basicConfig(level=logging.INFO)

from mediawiki_dump.dumps import WikipediaDump
from mediawiki_dump.reader import DumpReaderArticles

dump = WikipediaDump('fo')
reader = DumpReaderArticles()
pages = reader.read(dump)

print([page.title for page in pages][:25])

print(reader.get_dump_language())  # fo

Will give you:

INFO:DumpReaderArticles:Parsing XML dump...
INFO:WikipediaDump:Checking /tmp/wikicorpus_62da4928a0a307185acaaa94f537d090.bz2 cache file...
INFO:WikipediaDump:Fetching fo dump from <https://dumps.wikimedia.org/fowiki/latest/fowiki-latest-pages-meta-current.xml.bz2>...
INFO:WikipediaDump:HTTP 200 (14105 kB will be fetched)
INFO:WikipediaDump:Cache set
...
['WIKIng', 'Føroyar', 'Borðoy', 'Eysturoy', 'Fugloy', 'Forsíða', 'Løgmenn í Føroyum', 'GNU Free Documentation License', 'GFDL', 'Opið innihald', 'Wikipedia', 'Alfrøði', '2004', '20. juni', 'WikiWiki', 'Wiki', 'Danmark', '21. juni', '22. juni', '23. juni', 'Lívfrøði', '24. juni', '25. juni', '26. juni', '27. juni']
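Compared with the earlier listing, titles with prefixes such as "Brúkari:" or "Bólkur:" are gone. One plausible way to get that result is to keep only pages in MediaWiki's main namespace (number 0). The sketch below assumes exactly that; whether DumpReaderArticles applies this test is not stated here, and the namespace numbers attached to the sample titles are the usual MediaWiki defaults, assumed rather than read from a dump.

```python
MAIN_NAMESPACE = 0  # MediaWiki's main (article) namespace

# (title, namespace number) pairs; titles come from the listings above,
# the numbers are assumed for illustration.
entries = [
    ("Brúkari:Jon Harald Søby", 2),    # user page
    ("Forsíða", 0),
    ("Ormurin Langi", 0),
    ("Fyrimynd:InterLingvLigoj", 10),  # template
    ("Bólkur:Kvæði", 14),              # category
]


def articles_only(pairs):
    """Yield titles whose namespace is the main one."""
    for title, namespace in pairs:
        if namespace == MAIN_NAMESPACE:
            yield title


print(list(articles_only(entries)))  # ['Forsíða', 'Ormurin Langi']
```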

Reading Wikia's dumps

import logging; logging.basicConfig(level=logging.INFO)

from mediawiki_dump.dumps import WikiaDump
from mediawiki_dump.reader import DumpReaderArticles

dump = WikiaDump('plnordycka')
pages = DumpReaderArticles().read(dump)

print([page.title for page in pages][:25])

Will give you:

INFO:DumpReaderArticles:Parsing XML dump...
INFO:WikiaDump:Checking /tmp/wikicorpus_f7dd3b75c5965ee10ae5fe4643fb806b.7z cache file...
INFO:WikiaDump:Fetching plnordycka dump from <https://s3.amazonaws.com/wikia_xml_dumps/p/pl/plnordycka_pages_current.xml.7z>...
INFO:WikiaDump:HTTP 200 (129 kB will be fetched)
INFO:WikiaDump:Cache set
INFO:WikiaDump:Reading wikicorpus_f7dd3b75c5965ee10ae5fe4643fb806b file from dump
...
INFO:DumpReaderArticles:Parsing completed, entries found: 615
['Nordycka Wiki', 'Strona główna', '1968', '1948', 'Ormurin Langi', 'Mykines', 'Trollsjön', 'Wyspy Owcze', 'Nólsoy', 'Sandoy', 'Vágar', 'Mørk', 'Eysturoy', 'Rakfisk', 'Hákarl', '1298', 'Sztokfisz', '1978', '1920', 'Najbardziej na północ', 'Svalbard', 'Hamferð', 'Rok w Skandynawii', 'Islandia', 'Rissajaure']

Fetching full history

Pass full_history to the BaseDump constructor to fetch the XML content dump with full history:

import logging; logging.basicConfig(level=logging.INFO)

from mediawiki_dump.dumps import WikiaDump
from mediawiki_dump.reader import DumpReaderArticles

dump = WikiaDump('macbre', full_history=True)  # fetch full history, including old revisions
pages = DumpReaderArticles().read(dump)

print('\n'.join([repr(page) for page in pages]))

Will give you:

INFO:DumpReaderArticles:Parsing completed, entries found: 384
<DumpEntry "Macbre Wiki" by Default at 2016-10-12T19:51:06+00:00>
<DumpEntry "Macbre Wiki" by Wikia at 2016-10-12T19:51:05+00:00>
<DumpEntry "Macbre Wiki" by Macbre at 2016-11-04T10:33:20+00:00>
<DumpEntry "Macbre Wiki" by FandomBot at 2016-11-04T10:37:17+00:00>
<DumpEntry "Macbre Wiki" by FandomBot at 2017-01-25T14:47:37+00:00>
<DumpEntry "Macbre Wiki" by Ryba777 at 2017-04-10T11:20:25+00:00>
<DumpEntry "Macbre Wiki" by Ryba777 at 2017-04-10T11:21:20+00:00>
<DumpEntry "Macbre Wiki" by Macbre at 2018-03-07T12:51:12+00:00>
<DumpEntry "Main Page" by Wikia at 2016-10-12T19:51:05+00:00>
<DumpEntry "FooBar" by Anonymous at 2016-11-08T10:15:33+00:00>
<DumpEntry "FooBar" by Anonymous at 2016-11-08T10:15:49+00:00>
...
<DumpEntry "YouTube tag" by FANDOMbot at 2018-06-05T11:45:44+00:00>
<DumpEntry "Maps" by Macbre at 2018-06-06T08:51:24+00:00>
<DumpEntry "Maps" by Macbre at 2018-06-07T08:17:13+00:00>
<DumpEntry "Maps" by Macbre at 2018-06-07T08:17:36+00:00>
<DumpEntry "Scary transclusion" by Macbre at 2018-07-24T14:52:20+00:00>
<DumpEntry "Lua" by Macbre at 2018-09-11T14:04:15+00:00>
<DumpEntry "Lua" by Macbre at 2018-09-11T14:14:24+00:00>
<DumpEntry "Lua" by Macbre at 2018-09-11T14:14:37+00:00>
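With a full-history dump each page appears once per revision, so a common next step is to reduce the stream to the newest revision of each title. The sketch below does this on plain tuples copied from the output above (deliberately shuffled); it does not rely on any DumpEntry attribute names, since those are not shown here.

```python
from datetime import datetime

# (title, author, timestamp) triples taken from the listing above,
# with the "Lua" rows shuffled to show that input order does not matter.
revisions = [
    ("Maps", "Macbre", "2018-06-06T08:51:24+00:00"),
    ("Maps", "Macbre", "2018-06-07T08:17:13+00:00"),
    ("Maps", "Macbre", "2018-06-07T08:17:36+00:00"),
    ("Lua", "Macbre", "2018-09-11T14:04:15+00:00"),
    ("Lua", "Macbre", "2018-09-11T14:14:37+00:00"),
    ("Lua", "Macbre", "2018-09-11T14:14:24+00:00"),
]

latest = {}
for title, author, stamp in revisions:
    when = datetime.fromisoformat(stamp)  # timezone-aware, so comparable
    if title not in latest or when > latest[title][1]:
        latest[title] = (author, when)

for title, (author, when) in latest.items():
    print(title, author, when.isoformat())
```

Parsing the timestamps instead of comparing strings keeps the comparison correct even if a dump mixes UTC offsets.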

Reading dumps of selected articles

You can use the mwclient Python library to fetch a "live" dump of selected articles from any MediaWiki-powered site.

import mwclient
site = mwclient.Site('vim.fandom.com', path='/')

from mediawiki_dump.dumps import MediaWikiClientDump
from mediawiki_dump.reader import DumpReaderArticles

dump = MediaWikiClientDump(site, ['Vim documentation', 'Tutorial'])

pages = DumpReaderArticles().read(dump)

print('\n'.join([repr(page) for page in pages]))

Will give you:

<DumpEntry "Vim documentation" by Anonymous at 2019-07-05T09:39:47+00:00>
<DumpEntry "Tutorial" by Anonymous at 2019-07-05T09:41:19+00:00>
