Easily get clean data, directly from text or Python source
Detailed description of the textdata Python project
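One often needs to state data in program source. Python, however, needs its lines indented just so. Multi-line strings therefore often carry extra spaces and newline characters you didn't really want. Many developers "fix" this by using Python list literals, but that is tedious, verbose, and often less legible.
The textdata package makes it easy to have clean, nicely whitespaced data specified in your program, and to get that data back without extra whitespace cluttering things up. It permits the layouts needed to make Python code look and work right, without reflecting those layout requirements in the resulting data.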
Text (strings and lists)
>>> lines("""
...     There was an old woman who lived in a shoe.
...     She had so many children, she didn't know what to do;
...     She gave them some broth without any bread;
...     Then whipped them all soundly and put them to bed.
... """)
['There was an old woman who lived in a shoe.',
 "She had so many children, she didn't know what to do;",
 'She gave them some broth without any bread;',
 'Then whipped them all soundly and put them to bed.']
Note that the "extra" newlines and leading spaces have been taken care of and discarded. Or maybe you just want a single string. Fine:
>>> text("""
...     There was an old woman who lived in a shoe.
...     She had so many children, she didn't know what to do;
...     She gave them some broth without any bread;
...     Then whipped them all soundly and put them to bed.
... """)
"There was an old woman who lived in a shoe.\nShe ...put them to bed."
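Here text() does the same stripping of pointless whitespace at the beginning and end of lines, and returns the data as one clean, convenient string. If you don't want most of the line endings either, try textline on the same input to get a single line with no breaks.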
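For comparison, here is a minimal standard-library sketch of the kind of cleanup described above. It is not how textdata is implemented; the helper names `clean_lines`, `clean_text`, and `one_line` are hypothetical and exist only for this illustration.

```python
import textwrap

def clean_lines(source):
    """Dedent a triple-quoted block and return its lines (hypothetical helper)."""
    # Remove the common indentation, then the blank first and last lines
    body = textwrap.dedent(source).strip("\n")
    # Drop any trailing whitespace left on individual lines
    return [line.rstrip() for line in body.splitlines()]

def clean_text(source):
    """Same cleanup, returned as one newline-joined string."""
    return "\n".join(clean_lines(source))

def one_line(source):
    """Same cleanup, collapsed to a single line with no breaks."""
    return " ".join(line.strip() for line in clean_lines(source))

rhyme = """
    There was an old woman who lived in a shoe.
    She had so many children, she didn't know what to do;
"""

print(clean_lines(rhyme))
print(one_line(rhyme))
```

Having to rewrite and retest helpers like these in every project is the chore the package is meant to remove.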
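Words and phrases
Other times, the data you need is almost, but not quite, a series of words: a list of names, say, or a list of colors. The values are mostly single words, but sometimes one has an embedded space. textdata has you covered: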
>>> words(' Billy Bobby "Mr. Smith" "Mrs. Jones" ')
['Billy', 'Bobby', 'Mr. Smith', 'Mrs. Jones']
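Embedded quotes (either single or double) can be used to construct "words" (or phrases) that contain whitespace, including tabs and newlines.
words, like the other textdata facilities, also lets you comment individual lines, something that would otherwise dirty up a string literal: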
exclude = words("""
__pycache__ *.pyc *.pyo # compilation artifacts
.hg* .git* # repository artifacts
.coverage # code tool artifacts
.DS_Store # platform artifacts
""")
Yields:
['__pycache__', '*.pyc', '*.pyo', '.hg*', '.git*',
'.coverage', '.DS_Store']
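The two behaviors shown above, quoted phrases and per-line comments, can be roughly approximated with the standard library's `shlex` module. This is a sketch for comparison only, not textdata's implementation; `split_words` is a hypothetical name.

```python
import shlex

def split_words(source):
    """Split on whitespace, honoring quotes and '#' comments (hypothetical helper)."""
    # comments=True makes shlex discard everything from '#' to the end of the line
    return shlex.split(source, comments=True)

names = split_words(' Billy Bobby "Mr. Smith" \'Mrs. Jones\' ')

exclude = split_words("""
    __pycache__ *.pyc *.pyo   # compilation artifacts
    .hg* .git*                # repository artifacts
    .coverage                 # code tool artifacts
    .DS_Store                 # platform artifacts
""")

print(names)
print(exclude)
```

One caveat of this sketch: `shlex` follows shell quoting rules, so a lone apostrophe (as in `didn't`) is read as an unclosed quote and raises `ValueError`.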
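Paragraphs
You might want to collect "paragraphs" rather than words: contiguous runs of text lines delimited by blank lines. The Markdown and RST document formats, for example, use this convention.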
>>> rhyme = """
...     Hey diddle diddle,
...
...     The cat and the fiddle,
...     The cow jumped over the moon.
...     The little dog laughed,
...     To see such sport,
...
...     And the dish ran away with the spoon.
... """
>>> paras(rhyme)
[['Hey diddle diddle,'],
 ['The cat and the fiddle,',
  'The cow jumped over the moon.',
  'The little dog laughed,',
  'To see such sport,'],
 ['And the dish ran away with the spoon.']]
Or if you'd like paragraphs, but with each paragraph as a single string:
>>> paras(rhyme, join="\n")
['Hey diddle diddle,',
 'The cat and the fiddle,\nThe cow jumped over the moon.\nThe little dog laughed,\nTo see such sport,',
 'And the dish ran away with the spoon.']
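The blank-line convention is simple to state but easy to get subtly wrong by hand. The following standard-library sketch shows one way to group lines into paragraphs; it is an illustration under my own assumptions, not textdata's code, and `split_paragraphs` is a hypothetical name.

```python
from itertools import groupby

def split_paragraphs(source, join=None):
    """Group non-blank lines into paragraphs separated by blank lines (hypothetical helper)."""
    stripped = (line.strip() for line in source.splitlines())
    # groupby yields alternating runs of blank and non-blank lines; keep the non-blank runs
    groups = [list(run) for has_text, run in groupby(stripped, key=bool) if has_text]
    if join is None:
        return groups
    # Optionally collapse each paragraph into one string
    return [join.join(group) for group in groups]

verse = """
    Hey diddle diddle,

    The cat and the fiddle,
    The cow jumped over the moon.

    And the dish ran away with the spoon.
"""

print(split_paragraphs(verse))
print(split_paragraphs(verse, join="\n"))
```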
Dictionaries
Or maybe you want a dict. The attrs function makes it easy to grab one:
>>> attrs("a=1 b=2 c='something more'")
{'a': 1, 'b': 2, 'c': 'something more'}
If you want to cut and paste data directly from JavaScript, JSON, HTML, CSS, or XML, that's easy too. No text editing is required:
>>> # JavaScript
>>> attrs("a: 1, b: 2, c: 'something more'")
{'a': 1, 'b': 2, 'c': 'something more'}
>>> # JSON
>>> attrs('"a": 1, "b": 2, "c": "something more"')
{'a': 1, 'b': 2, 'c': 'something more'}
>>> # HTML or XML
>>> attrs('a="1" b="2" c="something more"')
{'a': '1', 'b': '2', 'c': 'something more'}
>>> # above returns strings, because values quoted, which denotes strings
>>> # 'full' evaluation needed to transform strings into values
>>> attrs('a="1" b="2" c="something more"', evaluate='full')
{'a': 1, 'b': 2, 'c': 'something more'}
>>> # CSS
>>> attrs("a: 1; b: 2; c: 'something more'")
{'a': 1, 'b': 2, 'c': 'something more'}
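To make the idea of one parser accepting several attribute notations concrete, here is a small standard-library sketch that handles the `=` and `:` separators and the `,` and `;` delimiters seen above. It is only an illustration: `parse_attrs` and `PAIR` are hypothetical names, the regular expression is my own assumption, and it does not reproduce textdata's `evaluate` option.

```python
import ast
import re

# One key/value pair: optional quotes around the key, '=' or ':' as separator,
# then either a quoted string or a bare token ending at whitespace, ',' or ';'
PAIR = re.compile(r"""
    ["']?(?P<key>\w+)["']?
    \s*[=:]\s*
    (?P<value>"[^"]*"|'[^']*'|[^\s,;]+)
""", re.VERBOSE)

def parse_attrs(source):
    """Parse key/value pairs into a dict (hypothetical helper)."""
    result = {}
    for match in PAIR.finditer(source):
        raw = match.group("value")
        try:
            # Turn Python-style literals (numbers, quoted strings) into values
            result[match.group("key")] = ast.literal_eval(raw)
        except (ValueError, SyntaxError):
            # Anything that is not a literal is kept as the raw string
            result[match.group("key")] = raw
    return result

print(parse_attrs("a=1 b=2 c='something more'"))
print(parse_attrs('"a": 1, "b": 2, "c": "something more"'))
print(parse_attrs('a="1" b="2" c="something more"'))
```

As in the HTML example above, quoted values such as `"1"` stay strings in this sketch, because the quotes mark them as string literals.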
Tables
Or maybe you have tabular data.
>>> tabledata = """
...     name  age  strengths
...     ----  ---  ---------------
...     Joe   12   woodworking
...     Jill  12   slingshot
...     Meg   13   snark, snapchat
... """
>>> table(tabledata)
[['name', 'age', 'strengths'],
 ['Joe', 12, 'woodworking'],
 ['Jill', 12, 'slingshot'],
 ['Meg', 13, 'snark, snapchat']]
>>> records(tabledata)
[{'name': 'Joe', 'age': 12, 'strengths': 'woodworking'},
 {'name': 'Jill', 'age': 12, 'strengths': 'slingshot'},
 {'name': 'Meg', 'age': 13, 'strengths': 'snark, snapchat'}]
It works even if your table has a lot of extra fluff:
>>> fancy = """
...     +------+-----+-----------------+
...     | name | age | strengths       |
...     +------+-----+-----------------+
...     | Joe  | 12  | woodworking     |
...     | Jill | 12  | slingshot       |
...     | Meg  | 13  | snark, snapchat |
...     +------+-----+-----------------+
... """
>>> assert table(tabledata) == table(fancy)
>>> assert records(tabledata) == records(fancy)
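It can handle tables formatted in a variety of ways, including Markdown, RST, ANSI/Unicode line-drawing characters, plain text columns and borders…. You might think table parsing is an iffy proposition, prone to failure, but textdata has dozens of tests, including fairly complicated cases, showing that it is a reliable, high-probability heuristic.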
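To give a feel for the kind of heuristic involved, here is a deliberately small standard-library sketch that copes with just the two layouts shown above (space-aligned columns and `+---+` bordered tables). It is not textdata's algorithm; `parse_table` and `to_records` are hypothetical names, and the rule "cells are separated by a pipe or by two or more spaces" is my own simplifying assumption.

```python
import re

def parse_table(source):
    """Return rows of cells from a simple text table (hypothetical helper)."""
    rows = []
    for line in source.strip().splitlines():
        # Trim indentation and any outer border pipes
        line = line.strip().strip("|").strip()
        # Skip blank lines and pure rule lines such as '----  ---' or '+----+'
        if not line or set(line) <= set("-+=| "):
            continue
        # Assumed cell separator: a pipe, or a run of two or more spaces
        cells = [cell.strip() for cell in re.split(r"\s*\|\s*|\s{2,}", line)]
        # Convert purely numeric cells to integers
        rows.append([int(cell) if cell.isdigit() else cell for cell in cells])
    return rows

def to_records(rows):
    """Pair each data row with the header row to build dicts."""
    header, *data = rows
    return [dict(zip(header, row)) for row in data]

plain = """
    name  age  strengths
    ----  ---  ---------------
    Joe   12   woodworking
    Meg   13   snark, snapchat
"""

fancy = """
    +------+-----+-----------------+
    | name | age | strengths       |
    +------+-----+-----------------+
    | Joe  | 12  | woodworking     |
    | Meg  | 13  | snark, snapchat |
    +------+-----+-----------------+
"""

print(parse_table(plain))
print(to_records(parse_table(fancy)))
```

A sketch this small breaks quickly on real input, for example on cells that contain double spaces or on Unicode box-drawing borders, which is where a well-tested parser earns its keep.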
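In summary
textdata is all about conveniently grabbing the data you want from text files and program source, and doing it in a powerful, convenient, well-tested way. Take it for a spin today!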