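Parsing CSS in Python
I am merging hundreds of HTML pages, each with an embedded style element in its head. I use BeautifulSoup to extract the style content, but now I need to parse that string into a dictionary of the form {selector string: property string}. I looked at tinycss, which makes it easy to get the selector '.c0', but I could not get the property string '{...}' out of it.
Here is an example string: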
'.c0 { padding: 1px 0px 0px; font-size: 11px } .c1 { margin: 0px; font-size: 11px } .c2 { font-size: 11px } .c3 { font-size: 11px; font-style: italic; font-weight: bold } '
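Any suggestions? A regular expression would be fine too. This is the entire CSS. Every page has class selectors from .c0 to .c100, and the format is the same on every page.
2 Answers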
3
Something like this?
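# example_str is the CSS string from the question.
properties = {}
for item in example_str.split("}"):
    # The text after the final "}" contains no "{", so skip it.
    if "{" not in item:
        continue
    selector, declarations = item.split("{", 1)
    properties[selector.strip()] = "{" + declarations + "}"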
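For comparison, here is a minimal regex-based sketch of the same idea. It assumes the CSS is as flat as the example in the question (no nested braces, comments, or @-rules); the names RULE_RE and css_to_dict are made up for illustration and do not come from any library.

```python
import re

# Hypothetical helper; assumes flat CSS with no nested braces, comments or @-rules.
RULE_RE = re.compile(r'([^{}]+)\{([^{}]*)\}')


def css_to_dict(css):
    """Map each selector to its declaration block, braces included."""
    rules = {}
    for match in RULE_RE.finditer(css):
        selector = match.group(1).strip()
        declarations = match.group(2).strip()
        rules[selector] = '{ ' + declarations + ' }'
    return rules


example_str = ('.c0 { padding: 1px 0px 0px; font-size: 11px } '
               '.c1 { margin: 0px; font-size: 11px } '
               '.c2 { font-size: 11px } '
               '.c3 { font-size: 11px; font-style: italic; font-weight: bold } ')

rules = css_to_dict(example_str)
print(rules['.c0'])  # { padding: 1px 0px 0px; font-size: 11px }
```

Because the pattern only matches complete "selector { ... }" pairs, the trailing whitespace after the last closing brace is ignored rather than producing an empty entry.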
0
This is my final solution. I went with BadKarma's approach of splitting the string.
import re

from bs4 import BeautifulSoup


class RichText(BeautifulSoup):
    """
    Subclass of BeautifulSoup that yields selectors and
    declaration blocks from the page's <style> element.
    """

    def __init__(self, html_page):
        # Name the parser explicitly to avoid bs4's "no parser specified" warning.
        super().__init__(html_page, 'html.parser')

    @property
    def rules_as_str(self):
        return str(self.style.string)

    def rules(self):
        # The capturing group makes re.split keep the selectors in the result.
        split_rules = re.split(r'(\.c[0-9]+)', self.rules_as_str.strip())
        # Side effect of the split: the first element is the text before the
        # first selector. Enforce that it is empty, then skip over it.
        assert split_rules[0] == ''
        for i in range(1, len(split_rules), 2):
            yield (split_rules[i].strip(), split_rules[i + 1].strip())


if __name__ == '__main__':
    with open('rich-text.html', 'r') as f:
        html_file = f.read()
    rich_text = RichText(html_file)
    for selector, declaration_block in rich_text.rules():
        print(selector)
        print(declaration_block)
>>> with open("test.py") as f:
... code = compile(f.read(), "test.py", 'exec')
... exec(code)
...
.c0
{ padding: 1px 0px 0px; font-size: 11px }
.c1
{ margin: 0px; font-size: 11px }
.c2
{ font-size: 11px }
.c3
{ font-size: 11px; font-style: italic; font-weight: bold }
>>>
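The splitting step can also be tried on its own, without BeautifulSoup, to get the dictionary the question asks for. This is only a sketch under the same assumption as above (every selector has the form .cN); the function name rules_to_dict is hypothetical.

```python
import re


def rules_to_dict(css):
    """Split on '.cN' selectors and pair each one with the block that follows."""
    # With a capturing group, re.split returns:
    # [text before first selector, selector, block, selector, block, ...]
    parts = re.split(r'(\.c[0-9]+)', css)
    selectors = parts[1::2]
    blocks = parts[2::2]
    return {sel: block.strip() for sel, block in zip(selectors, blocks)}


css = ('.c0 { padding: 1px 0px 0px; font-size: 11px } '
       '.c1 { margin: 0px; font-size: 11px } ')
result = rules_to_dict(css)
for selector, block in result.items():
    print(selector, block)
```

Note that this pattern would also split inside a declaration value that happened to contain text like ".c5", so it is only safe for CSS as regular as the example.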