File is not editable and not readable after processing (why?)

Question

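:) I know this looks like a very long question, but trust me, it isn't. I can't figure out why this text can no longer be read or edited once it has been processed. I used Python's ord() function to check whether the text contains any non-ASCII Unicode characters, and it turns out there are quite a few of them.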
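A minimal sketch of such an ord() check is shown here for reference. It is written for Python 3, the function names non_ascii_counts and report_non_ascii are hypothetical, and the UTF-8 default is only an assumption about how "acle5v1.txt" is encoded.

```python
from collections import Counter


def non_ascii_counts(text):
    """Count each character in `text` whose code point lies outside ASCII (> 127)."""
    return Counter(ch for ch in text if ord(ch) > 127)


def report_non_ascii(path, encoding="utf-8", top=20):
    """Print the most frequent non-ASCII characters found in the file at `path`.

    The UTF-8 default is an assumption about the input, not a known fact.
    """
    with open(path, encoding=encoding) as handle:
        counts = non_ascii_counts(handle.read())
    for ch, n in counts.most_common(top):
        # Code point, printable representation, number of occurrences.
        print(f"U+{ord(ch):04X}\t{ch!r}\t{n}")
```

Calling report_non_ascii("acle5v1.txt") would list the code points that fall outside ASCII; if the file is not actually UTF-8, the open() call is expected to raise UnicodeDecodeError instead.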

Input file: you can copy and paste it into a file named "acle5v1.txt".

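The code below is meant to find uppercase letters and convert them to lowercase, and to strip out all punctuation, so that the text is ready for word alignment afterwards.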

#include<iostream>
#include<fstream>
#include<ctype.h>
#include<cstring>

using namespace std;

ifstream fin2("acle5v1.txt");
ofstream fin3("acle5v1_op.txt");
ofstream fin4("chkcharadded.txt");
ofstream fin5("chkcharntadded.txt");
ofstream fin6("chkprintchar.txt");
ofstream fin7("chknonasci.txt");
ofstream fin8("nonprinchar.txt");

int main()
{
    char ch,ch1;
    fin2.seekg(0);
    fin3.seekp(0);
    int flag = 0;

    while(!fin2.eof())
    {
        ch1=ch;
        fin2.get(ch);

        if (isprint(ch))// if the character is printable
            flag = 1;

        if(flag)
        {
            fin6<<"Printable character:\t"<<ch<<"\t"<<(int)ch<<endl;
            flag = 0;
        }
        else
        {
            fin8<<"Non printable character caught:\t"<<ch<<"\t"<<int(ch)<<endl;
        }

        if( isalnum(ch) || ch == '@' || ch == ' ' )// checks for alpha numeric characters
        {
            fin4<<"char added: "<<ch<<"\tits ascii value: "<<int(ch)<<endl;
            if(isupper(ch))
            {
                //tolower(ch);
                fin3<<(char)tolower(ch);
            }
            else
            {
                fin3<<ch;
            }
        }
        else if( ( ch=='\t' || ch=='.' || ch==',' || ch=='#' || ch=='?' || ch=='!' || ch=='"' || ch != ';' || ch != ':') && ch1 != ' ' )
        {
            fin3<<' ';
        }
        else if( (ch=='\t' || ch=='.' || ch==',' || ch=='#' || ch=='?' || ch=='!' || ch=='"' || ch != ';' || ch != ':') && ch1 == ' ' )
        {
            //fin3<<" ';
        }
        else if( !(int(ch)>=0 && int(ch)<=127) )
        {
            fin5<<"Char of ascii within range not added: "<<ch<<"\tits ascii value: "<<int(ch)<<endl;
        }
        else
        {
            fin7<<"Non ascii character caught(could be a -ve value also)\t"<<ch<<int(ch)<<endl; 
        }   
    }
    return 0;
}

I have similar Python code, and its output is likewise unreadable and uneditable.

The Python code looks like this:

#!/usr/bin/python
# -*- coding: UTF-8 -*-

import sys

input_file=sys.argv[1]
output_file=sys.argv[2]

list1=[]

f=open(input_file)
for line in f:
    line=line.strip()   
    #line=line.rstrip('.')   
    line=line.replace('.','')
    line=line.replace(',','')
    line=line.replace('#','')
    line=line.replace('?','')
    line=line.replace('!','')
    line=line.replace('"','')
    line=line.replace('।','')
    line=line.replace('|','')       
    line = line.lower() 
    list1.append(line)
f.close()

f1=open(output_file,'w')

f1.write(' '.join(list1))

f1.close()

The script takes the input and output file names on the command line, like this:

python punc_remover.py acle5v1.txt acle5v1_op.txt

Its output is saved in "acle5v1_op.txt".

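Now I need this output file for further processing. This file, "acle5v1_op.txt", is the unreadable and uneditable one that I can't use, and I need it for word alignment in natural language processing. I tried reading it with the following program: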

#include<iostream>
#include<fstream>

using namespace std;

ifstream fin1("acle5v1_op.txt");
ofstream fout1("chckread_acle5v1_op.txt");
ofstream fout2("chcknotread_acle5v1_op.txt");

int main()
{
    char ch;
    int flag = 0;
    long int r = 0; long int nr = 0;

    while(!(fin1))
    {
        fin1.get(ch);

        if(ch)
        {
            flag = 1;
        }

        if(flag)
        {
            fout1<<ch;
            flag = 0;
            r++;
        }
        else
        {
            fout2<<"Char not been able to be read from source file\n";
            nr++;
        }
    }

    cout<<"Number of characters able to be read: "<<r;
    cout<<endl<<"Number of characters not been able to be read: "<<nr;

    return 0;
}

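This program prints the characters that can be read and skips the ones that cannot. But I observed that both output files come out blank, so I concluded that "acle5v1_op.txt" really is unreadable and uneditable. Can you help me solve this?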
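A readability check along these lines can also be sketched in Python 3. The version below reads the file as raw bytes and then tries to decode them, which separates "the file cannot be read at all" from "the bytes are there but are not valid text in the assumed encoding". The names inspect_bytes and inspect_file are hypothetical, and UTF-8 is an assumption about the file.

```python
def inspect_bytes(raw, encoding="utf-8"):
    """Return (byte_count, char_count, error_message) for a bytes object.

    char_count is None and error_message is set when decoding fails.
    """
    try:
        text = raw.decode(encoding)
    except UnicodeDecodeError as exc:
        return len(raw), None, str(exc)
    return len(raw), len(text), None


def inspect_file(path, encoding="utf-8"):
    """Read the file in binary mode, so every byte is counted, then inspect it."""
    with open(path, "rb") as handle:
        return inspect_bytes(handle.read(), encoding)
```

With this sketch, inspect_file("acle5v1_op.txt") would return a non-zero byte count together with a decoding error message if the file holds data that is not valid UTF-8, for instance because multi-byte characters were split during processing.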

Some statistics about the original input file "acle5v1.txt": it has about 3441 lines and roughly 3 million characters.

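Given the number of characters in the file, your editor may not be able to open it. I am able to open this file in gedit on Fedora 10, which I am currently using. This is just to say that, at least on my side, the particular editor is not the problem...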

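Can I solve this with a scripting language such as Python or Perl? If so, please be specific, because I am new to both Perl and Python. Or could you tell me how to solve it in C++? Thanks... :) I am really looking forward to some help or guidance on how to handle this...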
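One possible direction in Python, offered as a sketch under stated assumptions rather than a confirmed fix: decode the input explicitly, work on whole Unicode characters instead of single bytes, and write the result back in the same encoding. The sketch assumes Python 3 and a UTF-8 input file. The names KEEP, ALSO_REMOVE, clean_line and clean_file are hypothetical. Unlike the Python script above, it replaces punctuation with a space (as the C++ version does) and keeps '@'.

```python
import unicodedata

KEEP = {"@"}         # assumption: '@' is kept, as in the C++ version above
ALSO_REMOVE = {"|"}  # '|' is classified as a math symbol, not punctuation


def clean_line(line):
    """Lowercase a line, turn punctuation into spaces, collapse whitespace."""
    out = []
    for ch in line.lower():
        # Unicode categories starting with "P" are punctuation; this covers
        # ASCII marks such as '.' and ',' as well as the danda '।'.
        is_punct = unicodedata.category(ch).startswith("P")
        if ch in ALSO_REMOVE or (is_punct and ch not in KEEP):
            out.append(" ")
        else:
            out.append(ch)
    return " ".join("".join(out).split())


def clean_file(src, dst, encoding="utf-8"):
    """Clean every line of `src` and write the result to `dst` as one line."""
    with open(src, encoding=encoding) as fin, \
            open(dst, "w", encoding=encoding) as fout:
        cleaned = (clean_line(line) for line in fin)
        # Skip lines that became empty so no double spaces are written.
        fout.write(" ".join(c for c in cleaned if c))
```

A call such as clean_file("acle5v1.txt", "acle5v1_op.txt") would then produce the output file. Because characters are handled after decoding, a multi-byte character is never split into separate bytes, which is one way an output file can end up as invalid text.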

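text processing, string manipulation, scripting languages, data cleaning, text editing, natural language processing, file encoding, Unicode characters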
