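<p><a href="https://pandas.pydata.org/" rel="nofollow noreferrer"><code>pandas</code></a> is definitely the go-to library for handling detailed tabular data. For those seeking a non-<code>pandas</code> option, you can build your own <em>mapping</em> and <em>reduction</em> functions. I use these terms to mean the following:</p>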
<ul>
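<li><em>mapping</em>: reorganize the data, grouped by the column you want to query</li>
<li><em>reduction</em>: an aggregation function that tallies or condenses many values into one</li>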
</ul>
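<p>These are analogous to the <em>groupby</em>/<em>aggregation</em> concepts in <code>pandas</code>.</p>
<p><strong>Given</strong></p>
<p>Cleaned data in which runs of multiple spaces have been replaced with a single delimiter, e.g. <code>","</code>.</p>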
<pre><code>%%file "test.txt"
status,gender,age_range,occ,rating
ma,M,young,student,PG
ma,F,adult,teacher,R
sin,M,young,student,PG
sin,M,adult,teacher,R
ma,M,young,student,PG
sin,F,adult,teacher,R
</code></pre>
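<p>If your raw file is still whitespace-aligned, the cleaning step could look roughly like the sketch below. The <code>raw</code> sample and the <code>clean_line</code> helper are hypothetical names used only for illustration, and the sketch assumes no field contains spaces of its own.</p>
<pre><code>import re

# Hypothetical raw input: columns separated by runs of spaces
raw = """status  gender   age_range  occ      rating
ma      M        young      student  PG
ma      F        adult      teacher  R"""


def clean_line(line, delimiter=","):
    """Collapse each run of whitespace into a single delimiter."""
    return re.sub(r"\s+", delimiter, line.strip())


cleaned = [clean_line(line) for line in raw.splitlines()]
cleaned
# ['status,gender,age_range,occ,rating', 'ma,M,young,student,PG', 'ma,F,adult,teacher,R']
</code></pre>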
<p><strong>Code</strong></p>
<pre><code>import csv
import collections as ct
</code></pre>
<p><em>Step 1: Read data</em></p>
<p/>
<pre><code>def read_file(fname):
    with open(fname, "r") as f:
        reader = csv.DictReader(f)
        for line in reader:
            yield line


iterable = [line for line in read_file("test.txt")]
iterable
</code></pre>
<p>Output</p>
<pre><code>[OrderedDict([('status', 'ma'),
('gender', 'M'),
('age_range', 'young'),
('occ', 'student'),
('rating', 'PG')]),
OrderedDict([('status', 'ma'),
('gender', 'F'),
('age_range', 'adult'),
...]
...
]
</code></pre>
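<p>To try this step without creating a file, the same reader can be pointed at an in-memory buffer. This is only a sketch: the <code>text</code> sample is a shortened copy of the data above, and <code>rows</code> is an illustrative name.</p>
<pre><code>import csv
import io

# Shortened in-memory copy of the sample data
text = """status,gender,age_range,occ,rating
ma,M,young,student,PG
ma,F,adult,teacher,R
"""

# Each row becomes a dictionary keyed by the header names
rows = list(csv.DictReader(io.StringIO(text)))
rows[1]["occ"]
# 'teacher'
</code></pre>
<p>On Python 3.8 and later the rows are plain <code>dict</code> objects rather than <code>OrderedDict</code>, so the printed output differs slightly from the one shown above.</p>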
<p/>
<p><em>Step 2: Remap data</em></p>
<p/>
<pre><code>def mapping(data, column):
    """Return a dict of regrouped data."""
    dd = ct.defaultdict(list)
    for d in data:
        key = d[column]
        value = {k: v for k, v in d.items() if k != column}
        dd[key].append(value)
    return dict(dd)


mapping(iterable, "gender")
</code></pre>
<p>Output</p>
<pre><code>{'M': [
{'age_range': 'young', 'occ': 'student', 'rating': 'PG', ...},
...]
'F': [
{'status': 'ma', 'age_range': 'adult', ...},
...]
}
</code></pre>
<p/>
<p><em>Step 3: Reduce data</em></p>
<p/>
<pre><code>def reduction(data):
    """Return a reduced mapping of Counters."""
    final = {}
    for key, val in data.items():
        agg = ct.defaultdict(ct.Counter)
        for d in val:
            for k, v in d.items():
                agg[k][v] += 1
        final[key] = dict(agg)
    return final


reduction(mapping(iterable, "gender"))
</code></pre>
<p>Output</p>
<pre><code>{'F': {
'age_range': Counter({'adult': 2}),
'occ': Counter({'teacher': 2}),
'rating': Counter({'R': 2}),
'status': Counter({'ma': 1, 'sin': 1})},
'M': {
'age_range': Counter({'adult': 1, 'young': 3}),
'occ': Counter({'student': 3, 'teacher': 1}),
'rating': Counter({'PG': 3, 'R': 1}),
'status': Counter({'ma': 2, 'sin': 2})}
}
</code></pre>
<p/>
<p><strong>Demo</strong></p>
<p>With these tools, you can build a data pipeline and query the data by feeding the result of one function into another:</p>
<pre><code># Find the top age range among males
pipeline = reduction(mapping(iterable, "gender"))
pipeline["M"]["age_range"].most_common(1)
# [('young', 3)]

# Find the top ratings among teachers
pipeline = reduction(mapping(iterable, "occ"))
pipeline["teacher"]["rating"].most_common()
# [('R', 3)]

# Find the number of married people
pipeline = reduction(mapping(iterable, "gender"))
sum(v["status"]["ma"] for k, v in pipeline.items())
# 3
</code></pre>
<p>Overall, you can tailor the output by changing how you define the reduction function.</p>
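<p>As one possible variation, a reduction could keep only the most common value in each column instead of the full counts. The sketch below is an assumption about what such a function might look like: <code>top_values</code> is a hypothetical name, and <code>grouped</code> is a hand-written copy of the remapped data for the <code>"gender"</code> column.</p>
<pre><code>import collections as ct

# Hand-written sample, shaped like the remapped data grouped by "gender"
grouped = {
    "M": [
        {"status": "ma", "age_range": "young", "occ": "student", "rating": "PG"},
        {"status": "sin", "age_range": "young", "occ": "student", "rating": "PG"},
        {"status": "sin", "age_range": "adult", "occ": "teacher", "rating": "R"},
        {"status": "ma", "age_range": "young", "occ": "student", "rating": "PG"},
    ],
    "F": [
        {"status": "ma", "age_range": "adult", "occ": "teacher", "rating": "R"},
        {"status": "sin", "age_range": "adult", "occ": "teacher", "rating": "R"},
    ],
}


def top_values(data):
    """Return the most common value of each column, per group."""
    final = {}
    for key, rows in data.items():
        agg = ct.defaultdict(ct.Counter)
        for row in rows:
            for k, v in row.items():
                agg[k][v] += 1
        # Keep only the single most common value in each column
        final[key] = {k: c.most_common(1)[0][0] for k, c in agg.items()}
    return final


top = top_values(grouped)
top["M"]["occ"]
# 'student'
</code></pre>
<p>When two values tie, as with <code>"status"</code> among males here, <code>most_common</code> returns the one encountered first.</p>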
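<p>Note that the code for this generalized process is more verbose than the <a href="https://stackoverflow.com/questions/48680608/function-to-return-the-highest-count-value-using-a-rule">former example</a>, although it applies well to data with many columns. <code>pandas</code> encapsulates these concepts succinctly. Its learning curve may be steeper at first, but it can greatly speed up data analysis.</p>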
<hr/>
<p><strong>Details</strong></p>
<ol>
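<li><em>Read data</em>: we parse each line of the cleaned file with <a href="https://docs.python.org/3/library/csv.html#csv.DictReader" rel="nofollow noreferrer"><code>csv.DictReader</code></a>, which keeps the header names as the keys of each row's dictionary. This structure makes it easy to access columns by name.</li>
<li><em>Remap data</em>: we group the data into a dictionary.
<ul>
<li>The keys are the items in the selected/queried column, e.g. <code>"M"</code>, <code>"F"</code>.</li>
<li>Each value is a list of dictionaries. Each dictionary represents one row of all the remaining column data (excluding the key).</li>
</ul></li>
<li><em>Reduce data</em>: we aggregate the values of the remapped data by tabulating the related entries across all listed dictionaries. Together, <a href="https://docs.python.org/3/library/collections.html#collections.defaultdict" rel="nofollow noreferrer"><code>defaultdict</code></a> and <a href="https://docs.python.org/3/library/collections.html#collections.Counter" rel="nofollow noreferrer"><code>Counter</code></a> make an excellent reducing data structure: new entries in the <code>defaultdict</code> initialize a <code>Counter</code>, and repeated entries simply tally observations.</li>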
</ol>
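<p>A small illustration of that reducing structure on its own, using made-up observations (the <code>observations</code> list is hypothetical):</p>
<pre><code>import collections as ct

agg = ct.defaultdict(ct.Counter)

# Hypothetical (column, value) observations
observations = [
    ("occ", "student"),
    ("occ", "teacher"),
    ("occ", "student"),
    ("rating", "PG"),
]

for column, value in observations:
    # A new column creates an empty Counter; a repeat increments its tally
    agg[column][value] += 1

dict(agg)
# {'occ': Counter({'student': 2, 'teacher': 1}), 'rating': Counter({'PG': 1})}
</code></pre>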
<p><strong>Application</strong></p>
<p>Pipelines are optional. Here we build a function that processes serial queries:</p>
<pre><code>def serial_reduction(iterable, val_queries):
    """Return a `Counter` that is reduced after serial queries."""
    # Look up the column that holds each value, e.g. "M" is in "gender"
    val_to_key = {v: k for row in iterable for k, v in row.items()}

    # Apply the queries in order, keeping only the rows that match each one
    rows = list(iterable)
    for q in val_queries:
        rows = [row for row in rows if row.get(val_to_key.get(q)) == q]
        if not rows:
            raise ValueError("'{}' not found. Try a new query.".format(q))

    # Tally the remaining values in the columns that were not queried
    queried = {val_to_key[q] for q in val_queries}
    counter = ct.Counter()
    for row in rows:
        counter.update(v for k, v in row.items() if k not in queried)
    return counter


c = serial_reduction(iterable, "ma M young".split())
c.most_common()
# [('student', 2), ('PG', 2)]

serial_reduction(iterable, "ma M young teacher".split())
# ValueError: 'teacher' not found. Try a new query.
</code></pre>