Problem extracting data from a CSV

Posted 2021-11-29 21:56:39


import csv
import re

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "googlemailverif"

    # Build the search URLs from the third column of the CSV
    with open('input.csv', "r") as csvfile:
        datareader = csv.reader(csvfile)
        start_urls = ['https://www.google.fr/search?q=email' + str(row[2]) for row in datareader]

    # Start parsing
    def parse(self, response):
        body_text = ''.join(response.xpath("//body//text()").extract()).strip()
        yield {
            'url': response.url,
            'nom': "nom",
            'emails': re.findall(r"[a-zA-Z0-9_\.+-]+@[a-zA-Z0-9_\.+-]+\.[a-zA-Z]{2,6}", body_text),
            'SIRET': "SIRET",  # placeholder: should be the first-column value of the matching row
        }

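This code tries to look up email addresses on Google, taking a company name from the third column of a CSV file. The first column contains the information I would like to output as "SIRET". How can I do that?

If I extract it into start_urls while reading the CSV, my URLs will be broken. If I read the file again inside parse, I won't get the data that belongs to the response being parsed, and I may get an error from opening the file twice.

How can I pass the information from the first read into SIRET in the parse function?

I have been struggling with this for hours :(

Best regards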

2 answers
User
Answer 1 ·
"SIRET","NIC","L1_NORMALISEE","L2_NORMALISEE","L3_NORMALISEE","L4_NORMALISEE","L5_NORMALISEE","L6_NORMALISEE","L7_NORMALISEE","L1_DECLAREE","L2_DECLAREE","L3_DECLAREE","L4_DECLAREE","L5_DECLAREE","L6_DECLAREE","L7_DECLAREE","NUMVOIE","INDREP","TYPVOIE","LIBVOIE","CODPOS","CEDEX","RPET","LIBREG","DEPET","ARRONET","CTONET","COMET","LIBCOM","DU","TU","UU","EPCI","TCD","ZEMET","SIEGE","ENSEIGNE","IND_PUBLIPO","DIFFCOM","AMINTRET","NATETAB","LIBNATETAB","APET700","LIBAPET","DAPET","TEFET","LIBTEFET","EFETCENT","DEFET","ORIGINE","DCRET","DATE_DEB_ETAT_ADM_ET","ACTIVNAT","LIEUACT","ACTISURF","SAISONAT","MODET","PRODET","PRODPART","AUXILT","NOMEN_LONG","SIGLE","NOM","PRENOM","CIVILITE","RNA","NICSIEGE","RPEN","DEPCOMEN","ADR_MAIL","NJ","LIBNJ","APEN700","LIBAPEN","DAPEN","APRM","ESSEN","DATEESS","TEFEN","LIBTEFEN","EFENCENT","DEFEN","CATEGORIE","DCREN","AMINTREN","MONOACT","MODEN","PRODEN","ESAANN","TCA","ESAAPEN","ESASEC1N","ESASEC2N","ESASEC3N","ESASEC4N","VMAJ","VMAJ1","VMAJ2","VMAJ3","DATEMAJ"
"005720164","00028","SA SAINTE ISABELLE","","","236 ROUTE D AMIENS","","80100 ABBEVILLE","FRANCE","SA SAINTE-ISABELLE","","","236 RTE D AMIENS","","80100 ABBEVILLE","","236","","RTE","D AMIENS","80100","","32","Nord-Pas-de-Calais-Picardie","80","1","98","001","ABBEVILLE","80","4","01","248000556","41","2209","1","","1","O","201209","","","8610Z","Activités hospitalières","2008","22","100 à 199 salariés","100","2015","1","19830928","19830928","NR","99","","P","S","O","","0","SA SAINTE-ISABELLE","","","","","","00028","32","80001","","5599","SA à conseil d'administration (s.a.i.)","8610Z","Activités hospitalières","2008","","","","22","100 à 199 salariés","100","2015","ETI","19570101","201209","1","S","O","","","","","","","","","","","","2014-07-30T00:00:00"
"005720784","00031","ETABLISSEMENTS DECAYEUX","","","ZONE INDUSTRIELLE","","80210 FEUQUIERES EN VIMEU","FRANCE","ETABLISSEMENTS DECAYEUX","","","ZONE INDUSTRIELLE","","80210 FEUQUIERES EN VIMEU","","","","","ZONE INDUSTRIELLE","80210","","32","Nord-Pas-de-Calais-Picardie","80","1","17","308","FEUQUIERES EN VIMEU","80","1","18","248000630","15","0055","0","","1","O","201209","","","2572Z","Fabrication de serrures et de ferrures","2008","22","100 à 199 salariés","100","2015","4","19930401","19930401","NR","99","","P","S","O","","0","ETABLISSEMENTS DECAYEUX","","","","","","00015","32","80308","","5710","SAS/// société par actions simplifiée","2599A","Fabrication d'articles métalliques ménagers","2008","","N","20160915","32","250 à 499 salariés","200","2015","ETI","19570101","201209","3","S","O","2012","6","2599A","2599A","2599B","2572Z","4649Z","","","","","2001-12-13T00:00:00"

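The above is an excerpt of the CSV.

Every time, I get "SIRET" as the sirets value, while the other variable increments and changes on each request.

Thanks a lot ++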
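
One possible explanation for always seeing "SIRET" is that the header row of the CSV is read as data, or that one fixed value is used instead of the value belonging to each request. The sketch below uses only the standard library: it skips the header and keeps each SIRET next to the URL built from the same row. The helper name build_request_args, the "+" separator and URL quoting, and the three-column in-memory sample are illustrative assumptions, not part of the original spider. In Scrapy, each dictionary could be passed as scrapy.Request(**args) from start_requests, so that a callback declared as parse(self, response, siret) receives the matching value.

```python
import csv
import io
from urllib.parse import quote_plus

# Base URL from the question; the trailing "+" separator is an added assumption
SEARCH_URL = "https://www.google.fr/search?q=email+"


def build_request_args(csvfile):
    """Return one dict per data row, pairing the search URL with its SIRET."""
    reader = csv.reader(csvfile)
    next(reader, None)  # skip the header row ("SIRET", "NIC", ...)
    args = []
    for row in reader:
        siret, name = row[0], row[2]
        args.append({
            "url": SEARCH_URL + quote_plus(name),
            # Scrapy forwards cb_kwargs to the callback as keyword arguments
            "cb_kwargs": {"siret": siret},
        })
    return args


# Hypothetical sample standing in for input.csv (first three columns only)
sample = io.StringIO(
    '"SIRET","NIC","L1_NORMALISEE"\n'
    '"005720164","00028","SA SAINTE ISABELLE"\n'
    '"005720784","00031","ETABLISSEMENTS DECAYEUX"\n'
)
request_args = build_request_args(sample)
for args in request_args:
    print(args["cb_kwargs"]["siret"], args["url"])
```

Because the SIRET travels with its own request, the file is read only once and the order in which responses come back does not matter.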

User
Answer 2 ·

We can use zip() for this.

sirets, start_urls = zip(*[(row[0], 'https://www.google.fr/search?q=email'+str(row[2])) for row in datareader])

You now have one sequence containing the SIRET values and another containing the URLs (zip returns them as tuples).
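
As a rough, standalone illustration of that unpacking, the sketch below runs the same expression against a small in-memory CSV. The sample data and the next(datareader) call that drops the header are assumptions added for the demonstration; the URL string is kept exactly as in the answer.

```python
import csv
import io

# Hypothetical sample standing in for input.csv (first three columns only)
sample = io.StringIO(
    '"SIRET","NIC","L1_NORMALISEE"\n'
    '"005720164","00028","SA SAINTE ISABELLE"\n'
    '"005720784","00031","ETABLISSEMENTS DECAYEUX"\n'
)
datareader = csv.reader(sample)
next(datareader)  # drop the header so "SIRET" is not treated as a value

# Build (siret, url) pairs, then transpose them into two tuples with zip(*...)
pairs = [(row[0], 'https://www.google.fr/search?q=email' + str(row[2])) for row in datareader]
sirets, start_urls = zip(*pairs)

print(sirets)
print(start_urls)
```

The two tuples stay aligned by index, so sirets[i] belongs to start_urls[i]. If the file has no data rows, the unpacking raises ValueError, so an empty input needs its own check.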
