Get clean data easily, straight from text or Python source

Detailed description of the textdata Python project



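One often needs to state data in program source. Python, however, requires its lines to be indented just so. Multi-line strings therefore often carry extra spaces and newline characters you didn't really want. Many developers "fix" this by using Python list literals, but that is tedious, verbose, and often less legible.

The textdata package makes it easy to specify clean, nicely whitespaced data in your program and to get that data without extra syntax cluttering things up. It tolerates the layout needed to make Python code look and work right, without reflecting that layout in the resulting data.

Text (Strings and Lists)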
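To make the problem concrete, here is a small sketch that uses only the standard library. It shows what an indented triple-quoted string really contains, and how much manual cleanup it takes to get the data back out. The function name `rhyme_source` is made up for this illustration, and the cleanup shown is not textdata's implementation.

```python
import textwrap


def rhyme_source():
    # The literal is indented to match the function body, so every line
    # carries that indentation, plus a leading and a trailing newline.
    return """
        There was an old woman who lived in a shoe.
        She had so many children, she didn't know what to do;
    """


raw = rhyme_source()

# Manual cleanup: remove the common indentation, then trim the ends.
cleaned = textwrap.dedent(raw).strip()

print(repr(raw[:15]))        # leading newline and indentation are in the data
print(cleaned.splitlines())  # two clean lines
```

Writing this dedent-and-strip step at every call site is the kind of repetition textdata is meant to remove.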

>>> lines("""
...     There was an old woman who lived in a shoe.
...     She had so many children, she didn't know what to do;
...     She gave them some broth without any bread;
...     Then whipped them all soundly and put them to bed.
... """)
['There was an old woman who lived in a shoe.',
 "She had so many children, she didn't know what to do;",
 'She gave them some broth without any bread;',
 'Then whipped them all soundly and put them to bed.']

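Note that the "extra" newlines and leading spaces have been taken care of and discarded. Or do you want that as just one string? Okay:

>>> text("""
...     There was an old woman who lived in a shoe.
...     She had so many children, she didn't know what to do;
...     She gave them some broth without any bread;
...     Then whipped them all soundly and put them to bed.
... """)
"There was an old woman who lived in a shoe.\nShe ...put them to bed."

Here text() does the same stripping of pointless whitespace at the beginning and end of lines, returning the data as a clean, convenient string. Or, if you don't want most of the line endings, try textline on the same input to get a single line with no breaks.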
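The relationship between the three forms (a list of lines, one multi-line string, one unbroken line) can be sketched with the standard library alone. The helpers `clean_lines`, `clean_text`, and `clean_textline` below are hypothetical names for this illustration. They only approximate the behavior described above and are not textdata's code. In particular, joining with a single space for the one-line form is an assumption.

```python
import textwrap


def clean_lines(source):
    """Dedent, drop the blank first and last lines, strip trailing spaces."""
    body = textwrap.dedent(source).strip("\n")
    return [line.rstrip() for line in body.splitlines()]


def clean_text(source):
    """Same cleanup, returned as one newline-joined string."""
    return "\n".join(clean_lines(source))


def clean_textline(source):
    """Same cleanup, returned as a single line (assumed: joined by spaces)."""
    return " ".join(line.strip() for line in clean_lines(source))


rhyme = """
    Hey diddle diddle,
    The cat and the fiddle,
"""

print(clean_lines(rhyme))
print(repr(clean_text(rhyme)))
print(clean_textline(rhyme))
```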

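Words and Phrases

Other times, the data you need is almost, but not quite, a series of words. A list of names or a list of colors, for example: values that are mostly single words but sometimes contain an embedded space. textdata has you covered:

>>> words(' Billy Bobby "Mr. Smith" "Mrs. Jones"  ')
['Billy', 'Bobby', 'Mr. Smith', 'Mrs. Jones']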

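Embedded quotes (either single or double) can be used to construct "words" (or phrases) that contain whitespace, including tabs and newlines.

words, like the other textdata facilities, lets you comment individual lines, something that would otherwise muck up string literals: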

exclude = words("""
    __pycache__ *.pyc *.pyo     # compilation artifacts
    .hg* .git*                  # repository artifacts
    .coverage                   # code tool artifacts
    .DS_Store                   # platform artifacts
""")

Yields:

['__pycache__', '*.pyc', '*.pyo', '.hg*', '.git*',
 '.coverage', '.DS_Store']
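For comparison, the standard library's `shlex.split` offers similar quote-aware splitting and, with `comments=True`, drops `#` comments. This is only a stdlib analogue shown for orientation. Nothing here claims that textdata is built on `shlex`, and the two differ in details (for instance, `shlex` treats a lone apostrophe as an opening quote).

```python
import shlex

# Quoted phrases survive as single items, quotes removed.
names = shlex.split(' Billy Bobby "Mr. Smith" "Mrs. Jones"  ')

# With comments=True, everything from '#' to end of line is skipped.
exclude = shlex.split(
    """
    __pycache__ *.pyc *.pyo     # compilation artifacts
    .hg* .git*                  # repository artifacts
    """,
    comments=True,
)

print(names)
print(exclude)
```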

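Paragraphs

You may want to collect "paragraphs" rather than words: contiguous runs of text lines delimited by blank lines. The Markdown and RST document formats, for example, use this convention.

>>> rhyme = """
...     Hey diddle diddle,
...
...     The cat and the fiddle,
...     The cow jumped over the moon.
...     The little dog laughed,
...     To see such sport,
...
...     And the dish ran away with the spoon.
... """
>>> paras(rhyme)
[['Hey diddle diddle,'],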
 ['The cat and the fiddle,',
  'The cow jumped over the moon.',
  'The little dog laughed,',
  'To see such sport,'],
 ['And the dish ran away with the spoon.']]

Or if you'd like paragraphs, but with each paragraph as a single string:

>>> paras(rhyme, join="\n")
['Hey diddle diddle,',
 'The cat and the fiddle,\nThe cow jumped over the moon.\nThe little dog laughed,\nTo see such sport,',
 'And the dish ran away with the spoon.']
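The blank-line convention itself is simple to sketch. The function `split_paragraphs` below is a hypothetical stand-in written for this illustration, not textdata's `paras`. It groups non-blank lines into paragraphs and, when a `join` string is given, collapses each group into one string.

```python
def split_paragraphs(source, join=None):
    """Group stripped, non-blank lines into paragraphs split on blank lines."""
    groups, current = [], []
    for line in source.splitlines():
        line = line.strip()
        if line:
            current.append(line)
        elif current:
            # A blank line closes the paragraph collected so far.
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    if join is None:
        return groups
    return [join.join(group) for group in groups]


verse = """
    Hey diddle diddle,

    The cat and the fiddle,
    The cow jumped over the moon.
"""

print(split_paragraphs(verse))
print(split_paragraphs(verse, join=" "))
```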

Dictionaries

Or maybe you want a dict. The attrs function makes it easy to grab one:

>>> attrs("a=1 b=2 c='something more'")
{'a': 1, 'b': 2, 'c': 'something more'}
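As a rough idea of what parsing the `key=value` form involves, here is a standard-library sketch. The name `parse_attrs` is hypothetical and this is not how textdata's `attrs` is implemented. It handles only the `key=value` spelling, and, unlike the behavior documented for `attrs`, it does not remember whether a value was quoted, so a quoted `"1"` would come back as the integer 1.

```python
import ast
import shlex


def parse_attrs(source):
    """Parse space-separated key=value pairs into a dict."""
    result = {}
    for item in shlex.split(source):  # keeps quoted values together
        key, _, raw = item.partition("=")
        try:
            # Turn '1' into 1, '2.5' into 2.5, and so on.
            value = ast.literal_eval(raw)
        except (ValueError, SyntaxError):
            # Anything that is not a Python literal stays a string.
            value = raw
        result[key] = value
    return result


print(parse_attrs("a=1 b=2 c='something more'"))
```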

If you want to cut and paste data directly from JavaScript, JSON, HTML, CSS, or XML, that's easy. No text editing required.

>>> # JavaScript
>>> attrs("a: 1, b: 2, c: 'something more'")
{'a': 1, 'b': 2, 'c': 'something more'}

>>> # JSON
>>> attrs('"a": 1, "b": 2, "c": "something more"')
{'a': 1, 'b': 2, 'c': 'something more'}

>>> # HTML or XML
>>> attrs('a="1" b="2" c="something more"')
{'a': '1', 'b': '2', 'c': 'something more'}

>>> # above returns strings, because values quoted, which denotes strings
>>> # 'full' evaluation needed to transform strings into values
>>> attrs('a="1" b="2" c="something more"', evaluate='full')
{'a': 1, 'b': 2, 'c': 'something more'}

>>> # CSS
>>> attrs("a: 1; b: 2; c: 'something more'")
{'a': 1, 'b': 2, 'c': 'something more'}

Tables

Or you have tabular data.

>>> tabledata = """
...     name  age  strengths
...     ----  ---  ---------------
...     Joe   12   woodworking
...     Jill  12   slingshot
...     Meg   13   snark, snapchat
... """
>>> table(tabledata)
[['name', 'age', 'strengths'],
 ['Joe', 12, 'woodworking'],
 ['Jill', 12, 'slingshot'],
 ['Meg', 13, 'snark, snapchat']]

>>> records(tabledata)
[{'name': 'Joe', 'age': 12, 'strengths': 'woodworking'},
 {'name': 'Jill', 'age': 12, 'strengths': 'slingshot'},
 {'name': 'Meg', 'age': 13, 'strengths': 'snark, snapchat'}]
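The two outputs above carry the same information: each record is a data row keyed by the header row. A minimal sketch of that relationship, using a hypothetical helper `rows_to_records` that is not part of textdata:

```python
def rows_to_records(rows):
    """Pair every data row with the header row to build dicts."""
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


rows = [
    ["name", "age", "strengths"],
    ["Joe", 12, "woodworking"],
    ["Jill", 12, "slingshot"],
    ["Meg", 13, "snark, snapchat"],
]

for record in rows_to_records(rows):
    print(record)
```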

This works even if your table has a lot of extra fluff:

>>> fancy = """
... +------+-----+-----------------+
... | name | age | strengths       |
... +------+-----+-----------------+
... | Joe  |  12 | woodworking     |
... | Jill |  12 | slingshot       |
... | Meg  |  13 | snark, snapchat |
... +------+-----+-----------------+
... """
>>> assert table(tabledata) == table(fancy)
>>> assert records(tabledata) == records(fancy)

It handles tables formatted in a variety of ways, including Markdown, RST, ANSI/Unicode line-drawing characters, and plain-text columns and borders. You might think table parsing is a dicey proposition, prone to failure, but textdata's dozens of tests, including rather complex cases, show that it is a reliable, high-probability heuristic.
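To give a feel for why such a heuristic can work, here is a deliberately naive sketch for the simplest case only: plain-text columns separated by two or more spaces, with an optional rule line. The function `parse_plain_table` is hypothetical and far less capable than textdata's table parsing. It does not handle bordered tables or cells that contain runs of spaces.

```python
import re


def parse_plain_table(source):
    """Parse whitespace-aligned columns; skip blank and rule-only lines."""
    rows = []
    for line in source.splitlines():
        line = line.strip()
        if not line or set(line) <= set("-=+ "):
            continue  # blank line, or a separator such as '----  ---'
        cells = re.split(r"\s{2,}", line)  # a single space stays inside a cell
        rows.append([int(c) if c.isdecimal() else c for c in cells])
    return rows


plain = """
    name  age  strengths
    ----  ---  ---------------
    Joe   12   woodworking
    Meg   13   snark, snapchat
"""

print(parse_plain_table(plain))
```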

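In Summary

textdata is all about conveniently grabbing the data you want from text files and program source, and doing it in a highly functional, convenient, well-tested way. Take it for a spin today!

See the full documentation at Read the Docs.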
