Removing lines that contain a tab

Published 2024-04-29 14:42:12


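How can I remove lines that contain a tab?

I have a file like this (each numbered header line has a tab between the number and the name):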

0   absinth
Bohemian-style absinth
Bohemian-style or Czech-style absinth (also called anise-free absinthe, or just “absinth” without the “e”) is an ersatz version of the traditional spirit absinthe, though is more accurately described as a kind of wormwood bitters.
It is produced mainly in the Czech Republic, from which it gets its designations as “Bohemian” or “Czech,” although not all absinthe from the Czech Republic is Bohemian-style.

1   acidophilus milk
Sweet acidophilus milk is consumed by individuals who suffer from lactose intolerance or maldigestion, which occurs when enzymes (lactase) cannot break down lactose (milk sugar) in the intestine.
To aid digestion in those with lactose intolerance, milk with added bacterial cultures such as "Lactobacillus acidophilus" ("acidophilus milk") and bifidobacteria ("a/B milk") is available in some areas.
High Activity of Lactobacillus Acidophilus Milk

2   adobo
Adobo
Adobo (Spanish: marinade, sauce, or seasoning) is the immersion of raw food in a stock (or sauce) composed variously of paprika, oregano, salt, garlic, and vinegar to preserve and enhance its flavor.
In the Philippines, the name "adobo" was given by the Spanish colonists to an indigenous cooking method that also uses vinegar, which although superficially similar had developed independent of Spanish influence.

The desired output is the same file with the lines that contain a tab removed, i.e.:

Bohemian-style absinth
Bohemian-style or Czech-style absinth (also called anise-free absinthe, or just “absinth” without the “e”) is an ersatz version of the traditional spirit absinthe, though is more accurately described as a kind of wormwood bitters.
It is produced mainly in the Czech Republic, from which it gets its designations as “Bohemian” or “Czech,” although not all absinthe from the Czech Republic is Bohemian-style.

Sweet acidophilus milk is consumed by individuals who suffer from lactose intolerance or maldigestion, which occurs when enzymes (lactase) cannot break down lactose (milk sugar) in the intestine.
To aid digestion in those with lactose intolerance, milk with added bacterial cultures such as "Lactobacillus acidophilus" ("acidophilus milk") and bifidobacteria ("a/B milk") is available in some areas.
High Activity of Lactobacillus Acidophilus Milk

Adobo
Adobo (Spanish: marinade, sauce, or seasoning) is the immersion of raw food in a stock (or sauce) composed variously of paprika, oregano, salt, garlic, and vinegar to preserve and enhance its flavor.
In the Philippines, the name "adobo" was given by the Spanish colonists to an indigenous cooking method that also uses vinegar, which although superficially similar had developed independent of Spanish influence.

I can do the following in Python to get that result:

with open('file.txt', 'r') as fin, open('file2.txt', 'w') as fout:
  for line in fin:
    if '\t' in line:
      continue
    else:
      fout.write(line)

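But I have millions of lines, and this is not very efficient. So I tried dropping the second column with cut and then removing the lines that consist of a single character: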

$ cut -f1 WIKI_WN_food | awk 'length>1' | less

What is a more suitable way to get the desired output?

Is there a more efficient way than the cut + awk pipeline shown above?


3 Answers
grep -v "$(printf '\t')" file   # pass a real tab; grep's basic regex does not read '\t' as a tab


Your code is fine. As an optimization, you could try looking for the tab only at the beginning of the string:

if '\t' not in l[:5]: fout.write(l)

where the length of the substring depends on the largest record number. It may make a difference for long strings that do not match, who knows...
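The prefix idea above could be sketched as follows. This is a minimal sketch, not the answerer's exact code: the function name `strip_indexed_lines` and the `prefix_len` parameter are hypothetical, and it assumes that on every line to be dropped the tab falls within the first `prefix_len` characters (as in the sample, where the tab directly follows a short record number). A tab deeper in a line would go unnoticed.

```python
import io


def strip_indexed_lines(fin, fout, prefix_len=5):
    """Copy lines from fin to fout, skipping those with a tab in the prefix.

    Assumption: header lines look like "<record number><TAB><name>", so the
    tab always falls inside the first prefix_len characters. prefix_len=5
    covers record numbers of up to four digits; larger files need more.
    """
    kept = 0
    for line in fin:
        # Slicing limits the search to the prefix, so long body lines
        # are not scanned end to end.
        if '\t' not in line[:prefix_len]:
            fout.write(line)
            kept += 1
    return kept


# Small in-memory demonstration instead of real files.
sample = io.StringIO(
    "0\tabsinth\nBohemian-style absinth\n\n1\tacidophilus milk\nSweet milk\n"
)
result = io.StringIO()
kept = strip_indexed_lines(sample, result)
print(kept)                       # number of lines written
print(result.getvalue(), end="")  # the filtered text
```

With millions of records the record number can be seven or more digits long, so `prefix_len` would have to grow accordingly, which is the dependency on the largest record number mentioned above.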

Furthermore, you may want to test mawk, grep, etc., as in

# Edit: the following won't work; it also strips blank lines
# mawk -F"\t" "NF==1"  original > stripped
# With -F, "\t" would mean a literal backslash followed by t, so pass a real tab
grep -vF "$(printf '\t')" original > stripped
sed -e "/\t/d"       original > stripped

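to see whether it is faster than the Python solution.

Tests

On my system, using a file made by repeatedly replicating your sample (1418973184 bytes in size), I got the following approximate times: grep 1.6 s, sed 6.4 s, Python 4.6 s.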
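For anyone who wants to repeat the measurement, a large test file can be built by appending copies of a small sample. The sketch below is one hypothetical way to do it; the function name `replicate`, the file names and the target size are assumptions for illustration, and it is not necessarily how the file used in this answer was produced.

```python
import os
import tempfile


def replicate(sample_path, out_path, target_bytes):
    """Write whole copies of the sample to out_path until it holds at least target_bytes."""
    with open(sample_path, 'rb') as f:
        chunk = f.read()
    if not chunk:
        raise ValueError("sample file is empty")
    written = 0
    with open(out_path, 'wb') as out:
        while written < target_bytes:
            out.write(chunk)
            written += len(chunk)
    return written


# Demonstration on a tiny temporary sample (hypothetical file names).
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, 'sample.txt')
    big_path = os.path.join(tmp, 'original')
    with open(sample_path, 'wb') as f:
        f.write(b"0\tabsinth\nBohemian-style absinth\n\n")
    total = replicate(sample_path, big_path, 1000)
    size_on_disk = os.path.getsize(big_path)

print(total, size_on_disk)
```

Because only whole copies are written, the result ends on a complete line; a file cut at an exact byte count would instead end with a truncated line.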

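Addendum

I tested Jidder's awk solution (given in a comment) using mawk; my approximate time is 3.2 s... and the winner is grep -vF.

Test transcript

The run time varies by about 0.1 s between executions, and here I report only one run per command... for results that are close to each other, one cannot make a clear-cut decision.

On the other hand, different tools give results that differ by far more than the experimental error, and I think we can draw some conclusions...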

% ls -l original 
-rw-r--r-- 1 boffi boffi 1418973184 Dec  8 21:33 original
% cat doit.py
from sys import stdout
with open('original', 'r') as fin:
  for line in fin:
    if '\t' in line: continue
    else: stdout.write(line)
% time wc -l original 
15731133 original

real    0m0.407s
user    0m0.184s
sys     0m0.220s
% time python doit.py | wc -l
12584034

real    0m5.334s
user    0m4.880s
sys     0m1.428s
% time grep -vF "       "  original | wc -l
12584035

real    0m1.527s
user    0m1.112s
sys     0m1.400s
% time grep -v "        "  original | wc -l
12584035

real    0m1.556s
user    0m1.120s
sys     0m1.436s
% time sed -e "/\t/d"  original | wc -l
12584034

real    0m6.481s
user    0m6.104s
sys     0m1.404s
% time mawk '!/\t/'  original | wc -l
12584035

real    0m3.059s
user    0m2.608s
sys     0m1.488s
% time gawk '!/\t/'  original | wc -l
12584035

real    0m9.148s
user    0m8.680s
sys     0m1.468s
% 
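Since the run times above vary by about 0.1 s from one execution to the next, repeating each measurement gives a better picture than a single run. The following is a minimal sketch of that idea in Python; `time_runs` and `filter_in_memory` are hypothetical names, and the workload is a small in-memory stand-in rather than the 1.4 GB file.

```python
import statistics
import time


def time_runs(func, repeats=5):
    """Call func() repeats times and return the wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return durations


def filter_in_memory():
    # Stand-in workload: the tab filter applied to an in-memory list of lines.
    lines = ["0\tabsinth\n", "Bohemian-style absinth\n"] * 10000
    return [line for line in lines if '\t' not in line]


durations = time_runs(filter_in_memory, repeats=5)
print(f"min {min(durations):.4f}s  "
      f"median {statistics.median(durations):.4f}s  "
      f"spread {max(durations) - min(durations):.4f}s")
```

Two tools can be ranked with some confidence only when the gap between their typical times is clearly larger than the spread of the individual runs.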

My sample file has a truncated last line, which is why the line counts for Python and sed differ by one from those of all the other tools.
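A likely explanation for the off-by-one, which the answer does not spell out: `wc -l` counts newline characters, and a tool that writes the truncated last line exactly as it was read emits one newline fewer than a tool that terminates that line. The sketch below imitates both behaviours on a tiny input; `filter_tab_lines` and its `terminate_last_line` flag are hypothetical names used only for this illustration.

```python
def filter_tab_lines(text, terminate_last_line):
    """Drop lines containing a tab; optionally end the final line with a newline."""
    kept = [line for line in text.splitlines(keepends=True) if '\t' not in line]
    out = ''.join(kept)
    if terminate_last_line and out and not out.endswith('\n'):
        out += '\n'
    return out


# Input whose last line is cut off, so it has no trailing newline.
data = "0\tabsinth\nBohemian-style absinth\nIt is produced mai"

as_is = filter_tab_lines(data, terminate_last_line=False)
terminated = filter_tab_lines(data, terminate_last_line=True)

# Counting newline characters mirrors what wc -l reports.
print(as_is.count('\n'))       # 1: the truncated line is written without a newline
print(terminated.count('\n'))  # 2: the final line has been terminated
```

Both outputs contain the same two lines of text; only the count of newline characters differs.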

You can do this with sed:

sed '/\t/d' 'my_file'

This looks for "\t" (which GNU sed reads as a tab) and deletes every line that contains it.
