A collection of scripts and utilities that support stream processing of MediaWiki data.
mwstreaming: detailed Python project description
A set of utilities for stream-processing MediaWiki data.
Usage
^{tt1}$
^{tt2}$
Data processing utilities
- ^{tt3}$
- Generates token persistence statistics using revision JSON blobs with diff information.
- ^{tt4}$
- Converts an XML dump to a stream of revision JSON blobs.
- ^{tt5}$
- Computes diffs directly from an XML dump.
- ^{tt6}$
- Computes and adds a “diff” field to a stream of revision JSON blobs.
- ^{tt7}$
- Mends diffs that were computed in chunks and out of order.
- ^{tt8}$
- Aggregates token persistence statistics into revision statistics.
- ^{tt9}$
- Converts a Wikihadoop-processed stream of XML pages to JSON blobs.
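
To make the idea of adding a “diff” field to a stream of revision JSON blobs more concrete, here is a minimal sketch. It is not the package's own implementation: it swaps in Python's standard `difflib` for the diff step, splits text on whitespace instead of using a real wikitext tokenizer, and assumes hypothetical field names (`page_id`, and an `ops` list inside `diff`). Only the `text` and `diff` field names come from the descriptions above. It also assumes that the revisions of each page arrive in chronological order.

```python
import difflib


def tokenize(text):
    """Split text on whitespace (a stand-in for a real wikitext tokenizer)."""
    return text.split()


def add_diffs(docs):
    """Yield copies of revision documents with an added 'diff' field.

    Assumes each doc has a hypothetical 'page_id' field and that the
    revisions of a page are seen in chronological order.
    """
    last_tokens = {}  # page_id -> tokens of the previous revision
    for doc in docs:
        page_id = doc["page_id"]
        tokens = tokenize(doc.get("text") or "")
        previous = last_tokens.get(page_id, [])

        # Compare the previous revision's tokens with the current ones.
        matcher = difflib.SequenceMatcher(a=previous, b=tokens, autojunk=False)
        ops = [
            {"name": tag, "a1": a1, "a2": a2, "b1": b1, "b2": b2}
            for tag, a1, a2, b1, b2 in matcher.get_opcodes()
        ]

        out = dict(doc)  # shallow copy so the input doc is left untouched
        out["diff"] = {"ops": ops}
        last_tokens[page_id] = tokens
        yield out


revisions = [
    {"page_id": 1, "id": 10, "text": "Hello world"},
    {"page_id": 1, "id": 11, "text": "Hello brave world"},
    {"page_id": 2, "id": 12, "text": "Other page"},
]
with_diffs = list(add_diffs(revisions))
```

Because the function is a generator that keeps only the last token list per page, memory use stays proportional to the number of pages seen, not the number of revisions. That is the property that makes this style of processing suitable for streams.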
General utilities
- ^{tt10}$
- Converts a stream of JSON blobs to tab-separated values based on a set of field names.
- ^{tt11}$
- Normalizes old versions of RevisionDocument JSON schemas to correspond to the most recent schema version.
- ^{tt12}$
- Validates JSON against a provided schema.
- ^{tt13}$
- Truncates the ‘text’ field of JSON blobs to a limited length in Unicode characters (addresses content dump vandalism issues) and adds a boolean ‘truncated’ field.
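
As an illustration of two of the general utilities described above (text truncation and JSON-to-TSV conversion), here is a short sketch in plain Python. It is not taken from the package. The function names, the escaping rules for tabs and newlines, and the choice to write missing values as empty strings are all assumptions made for this example. Only the ‘text’ and ‘truncated’ field names come from the descriptions above.

```python
import json


def truncate_text(doc, max_chars):
    """Return a copy of doc with 'text' cut to max_chars Unicode characters
    and a boolean 'truncated' field recording whether anything was cut."""
    out = dict(doc)
    text = out.get("text")
    if isinstance(text, str) and len(text) > max_chars:
        out["text"] = text[:max_chars]
        out["truncated"] = True
    else:
        out["truncated"] = False
    return out


def encode_value(value):
    """Encode one value for a TSV cell (escaping scheme is an assumption)."""
    if value is None:
        return ""
    if isinstance(value, str):
        # Escape characters that would otherwise break the row structure.
        return (value.replace("\\", "\\\\")
                     .replace("\t", "\\t")
                     .replace("\n", "\\n"))
    return json.dumps(value)  # numbers, booleans, lists, nested objects


def to_tsv_row(doc, fieldnames):
    """Build one tab-separated row containing the requested fields, in order."""
    return "\t".join(encode_value(doc.get(name)) for name in fieldnames)


def process(lines, fieldnames, max_chars):
    """Read JSON blobs one per line, truncate 'text', and yield TSV rows."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        doc = truncate_text(json.loads(line), max_chars)
        yield to_tsv_row(doc, fieldnames)


sample_lines = [
    '{"id": 7, "text": "hello world"}',
    '',
    '{"id": 8, "text": "a\\tb"}',
]
rows = list(process(sample_lines, ["id", "text", "truncated"], 5))
```

Slicing a Python `str` counts Unicode code points, not bytes, which matches the "length in Unicode characters" wording. A utility that measured bytes would need to encode the text first.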
Installation
^{tt14}$