Processing a large text file with Python

Asked 2025-04-16 22:22

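I have a very large file (3.8 GB) that is an extract of the users in my school's system. I need to reprocess it so that it contains only each user's ID and email address, separated by a comma.

I have very little experience with this and would like to use it as an exercise for learning Python.

The contents of the file look roughly like this: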

dn: uid=123456789012345,ou=Students,o=system.edu,o=system
LoginId: 0099886
mail: fflintstone@system.edu

dn: uid=543210987654321,ou=Students,o=system.edu,o=system
LoginId: 0083156
mail: brubble@system.edu

I want to end up with a file that looks like this:

0099886,fflintstone@system.edu
0083156,brubble@system.edu

Any suggestions or code I could use as a reference?


4 Answers

Assuming your file is well-formed:

with open(inputfilename) as inputfile, open(outputfilename, 'w') as outputfile:
    mail = loginid = ''
    for line in inputfile:
        # Split on the first colon only, so values containing ':' stay intact.
        fields = line.split(':', 1)
        if fields[0] not in ('LoginId', 'mail'):
            continue
        if fields[0] == 'LoginId':
            loginid = fields[1].strip()
        if fields[0] == 'mail':
            mail = fields[1].strip()
        # Write a row once both values for the current entry are known.
        if mail and loginid:
            outputfile.write(loginid + ',' + mail + '\n')
            mail = loginid = ''

This is essentially the same as the other approaches.

Answered 2025-04-16 by Python大师


Assuming every entry always has the same structure, you could do something like this:

import csv

# Open the large file and create the output file; both are closed
# automatically. newline="" lets the csv module control line endings.
with open("/path/to/large.file", "r") as f, \
        open("/desired/path/to/final/file", "w", newline="") as output_file:

    # Use the CSV module to make use of existing functionality.
    final_file = csv.writer(output_file)

    # Write the header row - can be skipped if headers not needed.
    final_file.writerow(["LoginID", "EmailAddress"])

    # Set up our temporary cache for a user
    current_user = []

    # Iterate over the large file
    # Note that we are avoiding loading the entire file into memory
    for line in f:
        # The attribute is spelled "LoginId" in the file; the match is case-sensitive.
        if line.startswith("LoginId"):
            # Skip the 9 characters of "LoginId: "
            current_user.append(line[9:].strip())
        # If more information is desired, simply add it to the conditions here
        # (additional elif's should do)
        # and add it to the current user.

        elif line.startswith("mail"):
            # Skip the 6 characters of "mail: "
            current_user.append(line[6:].strip())
            # Once you know you have reached the end of a user entry
            # write the row to the final file
            # and clear your temporary list.
            final_file.writerow(current_user)
            current_user = []

        # Skip lines that aren't interesting.
        else:
            continue
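
If the assumption that every entry has the same structure does not hold, an entry with a missing mail line could end up paired with the next entry's address. One possible variation, sketched below, treats the blank line between entries as the entry boundary and only writes entries that have both fields. The helper names iter_users and convert are hypothetical, and the sketch assumes entries are separated by blank lines as in the sample from the question.

```python
import csv
import io


def iter_users(lines):
    """Yield (login_id, mail) once per blank-line-separated entry."""
    login_id = mail = None
    for line in lines:
        if not line.strip():
            # A blank line ends the entry: emit it only if it is complete.
            if login_id and mail:
                yield login_id, mail
            login_id = mail = None
            continue
        # Split on the first colon only; lines without a colon are skipped.
        key, sep, value = line.partition(":")
        if not sep:
            continue
        if key == "LoginId":
            login_id = value.strip()
        elif key == "mail":
            mail = value.strip()
    # The last entry may not be followed by a blank line.
    if login_id and mail:
        yield login_id, mail


def convert(lines, out):
    """Write one 'LoginId,mail' row per complete entry to the file-like out."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerows(iter_users(lines))
```

Because iter_users is a generator that reads one line at a time, memory use stays small regardless of file size. With real files, convert would be called with two objects returned by open(), the output one opened with newline="".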

Answered 2025-04-16 by Python大师


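This looks like an LDIF file. The python-ldap library includes a pure-Python LDIF module, which helps a lot if your file contains awkward content such as Base64-encoded values or folded lines.

You can use it like this: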

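import csv
import ldif

class ParseRecords(ldif.LDIFParser):
    def __init__(self, input_file, csv_writer):
        # Let LDIFParser set up its own state before storing the writer.
        super().__init__(input_file)
        self.csv_writer = csv_writer

    def handle(self, dn, entry):
        # Each attribute maps to a list of values (bytes in python-ldap 3.x);
        # take the first value of each and decode it.
        self.csv_writer.writerow([entry['LoginId'][0].decode('utf-8'),
                                  entry['mail'][0].decode('utf-8')])

with open('/path/to/large_file') as input_file, open('output_file', 'w', newline='') as output:
    csv_writer = csv.writer(output)
    csv_writer.writerow(['LoginId', 'Mail'])
    ParseRecords(input_file, csv_writer).parse()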
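
To illustrate why Base64 values and folded lines break simple line-by-line splitting, here is a minimal stdlib-only sketch of what handling them involves. It relies on two LDIF conventions: a line that starts with a single space continues the previous line, and "name:: value" marks a Base64-encoded value. The function names unfold and parse_attribute are hypothetical, the sketch assumes values are UTF-8 text, and it ignores other LDIF features such as "name:< url" values and comment lines. It is not a substitute for the ldif module.

```python
import base64


def unfold(lines):
    """Join LDIF continuation lines (those starting with one space)."""
    current = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(" ") and current is not None:
            # Drop the leading space and append to the previous line.
            current += line[1:]
            continue
        if current is not None:
            yield current
        current = line
    if current is not None:
        yield current


def parse_attribute(line):
    """Return (name, value) for one unfolded attribute line, or None."""
    name, sep, rest = line.partition(":")
    if not sep:
        return None
    if rest.startswith(":"):
        # 'name:: ...' marks a Base64-encoded value (assumed UTF-8 here).
        return name, base64.b64decode(rest[1:].strip()).decode("utf-8")
    return name, rest.strip()
```

Feeding the output of unfold into parse_attribute gives clean name/value pairs even when a long or encoded mail value has been wrapped across several lines.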

Edit

If you want to pull the data from a live LDAP directory instead, you can do something like the following with the python-ldap library:

import csv
import ldap

con = ldap.initialize('ldap://server.fqdn.system.edu')
# if your LDAP directory requires authentication
# con.bind_s(username, password)

try:
    with open('output_file', 'w', newline='') as output:
        csv_writer = csv.writer(output)
        csv_writer.writerow(['LoginId', 'Mail'])

        results = con.search_s('ou=Students,o=system.edu,o=system',
                               ldap.SCOPE_SUBTREE, attrlist=['LoginId', 'mail'])
        for dn, attrs in results:
            # Attribute values come back as lists of bytes (python-ldap 3.x);
            # take the first value of each and decode it.
            csv_writer.writerow([attrs['LoginId'][0].decode('utf-8'),
                                 attrs['mail'][0].decode('utf-8')])
finally:
    # even if you don't have credentials, it's usually good to unbind
    con.unbind_s()

I recommend reading the documentation for the ldap module carefully, especially the examples in it.

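Note that the example above supplies no filter at all, which you will probably want in practice. An LDAP filter is similar to the WHERE clause of an SQL statement: it restricts which objects are returned. Microsoft has a good guide to LDAP filters, and the authoritative reference for LDAP filters is RFC 4515.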
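
As a rough illustration of what such a filter might look like, the sketch below builds a filter string that matches entries having a LoginId attribute and a mail address in a given domain. The helper names escape_filter_value and students_with_mail_domain are hypothetical, and whether the directory supports substring matching on mail is an assumption that depends on its schema. The escaping follows RFC 4515, which requires the characters *, (, ), backslash and NUL to be written as a backslash followed by two hex digits inside a value. python-ldap also provides ldap.filter.escape_filter_chars for this purpose.

```python
def escape_filter_value(value):
    """Escape the characters RFC 4515 reserves inside a filter value."""
    replacements = {
        "\\": r"\5c",
        "*": r"\2a",
        "(": r"\28",
        ")": r"\29",
        "\0": r"\00",
    }
    # Replace character by character so nothing is escaped twice.
    return "".join(replacements.get(ch, ch) for ch in value)


def students_with_mail_domain(domain):
    """Build a filter for entries with a LoginId and a mail in the domain."""
    # (LoginId=*) is a presence test; (mail=*@domain) is a substring match.
    return "(&(LoginId=*)(mail=*@{}))".format(escape_filter_value(domain))
```

The resulting string would be passed as the filterstr argument of search_s, for example filterstr=students_with_mail_domain('system.edu').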

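Likewise, if a suitable filter can still return several thousand entries, you may need to look into the LDAP paged results control, although using it would make the example more complicated. Hopefully this is enough to get you started. If you have any questions, feel free to ask or open a new question.

Good luck.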

Answered 2025-04-16 by Python大师
