Python cdxj-indexer package - PyPI

CDXJ indexer for WARC and ARC files

Detailed description of the cdxj-indexer Python project

CDXJ Indexer

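A command-line tool for generating CDXJ (and CDX) indexes from WARC and ARC files. The indexer is a new tool, redesigned for fast and flexible indexing. (It is based on the indexing functionality in pywb.)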

Install with pip install cdxj-indexer, or install locally with python setup.py install

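The indexer supports the classic CDX index format as well as the more flexible CDXJ format. With CDXJ, the indexer supports custom fields and access to the request records in WARC files. For the latest features, see the examples below and the command-line -h option. (This is a work in progress.)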

Usage examples

Generate a CDXJ index:

> cdxj-indexer /path/to/archive-file.warc.gz
com,example)/ 20170730223850 {"url": "http://example.com/", "mime": "text/html", "status": "200", "digest": "G7HRM7BGOKSKMSXZAHMUQTTV53QOFSMK", "length": "1219", "offset": "771", "filename": "example-20170730223917.warc.gz"}
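
To work with this output programmatically, each CDXJ line can be treated as three space-separated parts: a URL key, a 14-digit timestamp, and a JSON object. The following is a minimal sketch that reads the line above using only the Python standard library. The function name parse_cdxj_line is hypothetical and is not part of cdxj-indexer.

```python
import json


def parse_cdxj_line(line):
    """Split one CDXJ line into (key, timestamp, fields).

    Hypothetical helper, not part of cdxj-indexer. It assumes the layout
    shown above: "<key> <timestamp> <json>".
    """
    # Split only on the first two spaces, because the JSON part contains spaces.
    key, timestamp, json_part = line.rstrip("\n").split(" ", 2)
    return key, timestamp, json.loads(json_part)


line = (
    'com,example)/ 20170730223850 {"url": "http://example.com/", '
    '"mime": "text/html", "status": "200", '
    '"digest": "G7HRM7BGOKSKMSXZAHMUQTTV53QOFSMK", "length": "1219", '
    '"offset": "771", "filename": "example-20170730223917.warc.gz"}'
)

key, timestamp, fields = parse_cdxj_line(line)
print(key, timestamp, fields["url"], fields["offset"])
```

Note that the values in the JSON object are strings, so numeric fields such as offset and length need an explicit int() before they are used for seeking.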

CDX index (11 fields):

> cdxj-indexer -11 /path/to/archive-file.warc.gz
CDX N b a m s k r M S V g
com,example)/ 20170730223850 http://example.com/ text/html 200 G7HRM7BGOKSKMSXZAHMUQTTV53QOFSMK - - 1219 771 example-20170730223917.warc.gz
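
In the classic format, the header line names each column with a single letter and every following line holds the values in the same order. As a rough sketch, the header can be used to turn a line into a dictionary. The readable names in CDX_FIELD_NAMES are labels chosen here for illustration, following the common CDX legend; they are an assumption and are not defined by cdxj-indexer.

```python
# Illustrative labels for the CDX column letters (an assumption based on the
# common CDX legend, not names taken from cdxj-indexer).
CDX_FIELD_NAMES = {
    "N": "urlkey",
    "b": "timestamp",
    "a": "url",
    "m": "mime",
    "s": "status",
    "k": "digest",
    "r": "redirect",
    "M": "meta",
    "S": "length",
    "V": "offset",
    "g": "filename",
}


def parse_cdx_line(header, line):
    """Map one CDX line to a dict, using the column letters from the header."""
    letters = header.split()[1:]  # drop the leading "CDX" marker
    values = line.split()
    if len(letters) != len(values):
        raise ValueError("line does not match the header column count")
    return {CDX_FIELD_NAMES.get(c, c): v for c, v in zip(letters, values)}


header = "CDX N b a m s k r M S V g"
line = (
    "com,example)/ 20170730223850 http://example.com/ text/html 200 "
    "G7HRM7BGOKSKMSXZAHMUQTTV53QOFSMK - - 1219 771 "
    "example-20170730223917.warc.gz"
)

record = parse_cdx_line(header, line)
print(record["url"], record["status"], record["offset"])
```

A value of - marks a column with no data for that record, as in the redirect and meta columns here.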

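A more advanced use case: adding extra HTTP headers as fields. The http: prefix selects a header from the current record, while req.http: selects a header from the corresponding request record. The following adds the date, the Referer header, and the request method to the index: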

> cdxj-indexer -f req.http:method,http:date,req.http:referer /path/to/archive-file.warc.gz
com,example)/ 20170801032435 {"url": "http://example.com/", "mime": "text/html", "status": "200", "digest": "A6DESOVDZ3WLYF57CS5E4RIC4ARPWRK7", "length": "1207", "offset": "834", "filename": "temp-20170801032445.warc.gz", "req.http:method": "GET", "http:date": "Tue, 01 Aug 2017 03:24:35 GMT", "referrer": "https://webrecorder.io/temp-NU34HBNO/temp/recording-session/record/http://example.com/"}
org,iana)/domains/example 20170801032437 {"url": "http://www.iana.org/domains/example", "mime": "text/html", "status": "302", "digest": "RP3Y66FDBYBZKSFYQ4VJ4RMDA5BPDJX2", "length": "675", "offset": "2652", "filename": "temp-20170801032445.warc.gz", "req.http:method": "GET", "http:date": "Tue, 01 Aug 2017 02:35:05 GMT", "referrer": "http://example.com/"}
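
Custom fields appear in the same JSON object as the default ones, so a consumer can separate them by key. The sketch below assumes the default field set is the one seen in the plain CDXJ example (url, mime, status, digest, length, offset, filename); the names DEFAULT_FIELDS and custom_fields are hypothetical and are not part of cdxj-indexer.

```python
import json

# Assumed default field set, taken from the plain CDXJ example output.
DEFAULT_FIELDS = {"url", "mime", "status", "digest", "length", "offset", "filename"}


def custom_fields(line):
    """Return only the fields of a CDXJ line that are not in DEFAULT_FIELDS."""
    _key, _timestamp, json_part = line.rstrip("\n").split(" ", 2)
    fields = json.loads(json_part)
    return {name: value for name, value in fields.items() if name not in DEFAULT_FIELDS}


line = (
    'org,iana)/domains/example 20170801032437 '
    '{"url": "http://www.iana.org/domains/example", "mime": "text/html", '
    '"status": "302", "digest": "RP3Y66FDBYBZKSFYQ4VJ4RMDA5BPDJX2", '
    '"length": "675", "offset": "2652", '
    '"filename": "temp-20170801032445.warc.gz", "req.http:method": "GET", '
    '"http:date": "Tue, 01 Aug 2017 02:35:05 GMT", '
    '"referrer": "http://example.com/"}'
)

extra = custom_fields(line)
print(extra)
```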

The CDXJ Indexer extends the Indexer functionality in warcio and is meant to be flexible to extend.

cdxj-indexer 1.0
