Check whether the values in a CSV file are URLs

Posted 2024-06-16 13:28:56

I want to remove the values that are not URLs from a CSV file. Our df['url'] column contains values such as 'https://stackoverflow.com/questions/ask', 'https://www.linkedin.com/feed/' and '345', and I want to remove the '345'.

import re
import pandas as pd

def Find_url(string):
    # Return every http(s) URL found in the string (a list, empty if none match)
    url = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\), ]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', string)
    return url


if __name__ == "__main__":
    # read_csv already returns a DataFrame
    df = pd.read_csv('url_file.csv')
    for i in range(len(df)):
        url = Find_url(df.loc[i]['url'])
        df.loc[i]['url'] = url
    df.to_csv('clean_url.csv')

Sample input:

 'https://www.zaubacorp.com/company/HINDUSTAN-CABLES-LTD/L31300WB1952GOI020560'
'http://www.indianrailways.gov.in/railwayboard/view_section.jsp?lang=0&id=0
1
304
365'
 'https://en.wikipedia.org/wiki/Railway_Board'
 'https://en.wikipedia.org/wiki/Railway_Board#History'

I would like the output to look like this:

 'https://www.zaubacorp.com/company/HINDUSTAN-CABLES-LTD/L31300WB1952GOI020560'
 'http://www.indianrailways.gov.in/railwayboard/view_section.jsp?lang=0&id=0'
 'https://en.wikipedia.org/wiki/Railway_Board'
 'https://en.wikipedia.org/wiki/Railway_Board#History'

1 answer
Answer #1 · Posted 2024-06-16 13:28:56

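You can use urllib.parse.urlparse from the standard library to try to parse each string as a URL, and keep only the values whose parse result has the necessary attributes (scheme, network location and path).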

from io import StringIO
from urllib.parse import urlparse
import pandas as pd

def url_validator(x):
    try:
        result = urlparse(x)
        # check non-empty attributes
        return all((result.scheme, result.netloc, result.path))
    except AttributeError:
        return False

mystr = StringIO("""https://www.zaubacorp.com/company/HINDUSTAN-CABLES-LTD/L31300WB1952GOI020560
http://www.indianrailways.gov.in/railwayboard/view_section.jsp?lang=0&id=0
1
304
365
https://en.wikipedia.org/wiki/Railway_Board
https://en.wikipedia.org/wiki/Railway_Board#History""")

# replace mystr with 'file.csv'
df = pd.read_csv(mystr, header=None, names=['values'])

# apply filter based on checker function
df = df[df['values'].apply(url_validator)]

print(df)

                                              values
0  https://www.zaubacorp.com/company/HINDUSTAN-CA...
1  http://www.indianrailways.gov.in/railwayboard/...
5        https://en.wikipedia.org/wiki/Railway_Board
6  https://en.wikipedia.org/wiki/Railway_Board#Hi...
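
One thing to be aware of with the check above: because it requires a non-empty path, a bare address such as https://en.wikipedia.org would be dropped, while any scheme (ftp, for example) would be kept. If that is not what you want, a slightly different check might look like the sketch below. The name is_http_url and the sample values are made up for illustration; the assumption here is that only http and https addresses with a host should count as URLs.

```python
from urllib.parse import urlparse

import pandas as pd


def is_http_url(value):
    """Return True if value is a string that parses as an http(s) URL with a host."""
    if not isinstance(value, str):
        # Numbers or NaN read from the CSV are never URLs
        return False
    try:
        parts = urlparse(value.strip())
    except ValueError:
        # urlparse raises ValueError for some malformed inputs (e.g. a bad IPv6 host)
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


# Hypothetical sample data mirroring the question's 'url' column
df = pd.DataFrame({"url": [
    "https://en.wikipedia.org/wiki/Railway_Board",
    "345",
    "https://en.wikipedia.org",    # no path: rejected by url_validator, accepted here
    "ftp://example.com/file.txt",  # non-http scheme: accepted by url_validator, rejected here
]})

# Keep only the rows whose value passes the check
clean = df[df["url"].apply(is_http_url)]
print(clean["url"].tolist())
```

Which of the two checks is right depends on what the data is supposed to contain, so treat this as one option rather than a replacement for the answer's function.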
