I need to remove extra characters from BeautifulSoup's string output

Asked 2025-04-18 09:46

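I need to get rid of the [u' in front of my data and the '] after it, because these characters are useless to me. I want to store the data in a database, but the database stores the extra characters as well. How can I remove them? I tried calling .replace on the variable, but that raised an error.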

import urllib
import mechanize
from bs4 import BeautifulSoup
import requests
import re
import MySQLdb
import time

db = MySQLdb.connect(
  host=" ",
  user=" ",
  passwd=" ",
  db=" ")

inc = 0

# while inc != 3289:
c = db.cursor()
c.execute("""SELECT `symbol` FROM `stocks` LIMIT %s,1""", (inc,))
result = c.fetchall()
result = str(result)

user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
br = mechanize.Browser()
br.set_handle_robots(False)
br.addHeaders = [('User-agent',user_agent)]

term = result.replace('((','').replace(',)','').replace("'",'')
url = "http://www.marketwatch.com/investing/stock/"+term
soup = BeautifulSoup(requests.get(url).text)
search = soup.find('p', attrs = {'class':'data bgLast'})
cur = search.findAll(text = True)
search2 = soup.find('span', attrs = {'class':'bgChange'})
diff = search2.findAll(text = True)
print term
print cur
print diff

c.execute("""UPDATE stocks SET cur = %s WHERE symbol = %s""", (cur,term))
c.execute("""UPDATE stocks SET diff = %s WHERE symbol = %s""", (diff,term))
db.commit()

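Thanks to @jonrsharpe, I found the answer. In the original code, .findAll was returning a list of results. I just needed to convert it to a string so that I could use strip on it. Here is the modified code: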

import urllib
import mechanize
from bs4 import BeautifulSoup
import requests
import re
import MySQLdb
import time

db = MySQLdb.connect(
  host=" ",
  user=" ",
  passwd=" ",
  db=" ")

inc = 0

# while inc != 3289:
c = db.cursor()
c.execute("""SELECT `symbol` FROM `stocks` LIMIT %s,1""", (inc,))
result = c.fetchall()
result = str(result)

user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
br = mechanize.Browser()
br.set_handle_robots(False)
br.addHeaders = [('User-agent',user_agent)]

term = result.replace('((','').replace(',)','').replace("'",'')
url = "http://www.marketwatch.com/investing/stock/"+term
soup = BeautifulSoup(requests.get(url).text)
search = soup.find('p', attrs = {'class':'data bgLast'})
cur = str(search.findAll(text = True))
search2 = soup.find('span', attrs = {'class':'bgChange'})
diff = str(search2.findAll(text = True))
cur = cur.strip("'[]u")
diff = diff.strip("'[]u")
print term
print cur
print diff

c.execute("""UPDATE stocks SET cur = %s WHERE symbol = %s""", (cur,term))
c.execute("""UPDATE stocks SET diff = %s WHERE symbol = %s""", (diff,term))
db.commit()

1 Answer

result = str(result)
...
cur = str(search.findAll(text = True))

Stop doing this! There are data types other than strings!

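result is a sequence of rows (with MySQLdb, a tuple of tuples), and search.findAll gives you a list of text nodes. For example, you can get the symbol value of the first row with result[0][0], and if you want the text of an element, just use search.getText().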
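As a rough sketch of that suggestion, the snippet below indexes into the rows and reads the element text directly. The rows tuple and the HTML fragment are made-up stand-ins for c.fetchall() and the MarketWatch page, so no database or network access is needed. get_text is the current spelling of getText in bs4, and the parser name "html.parser" is passed explicitly as an assumption about the setup. It is written for Python 3; the same calls exist under the Python 2 used in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for c.fetchall(): a tuple of row tuples.
rows = (("AAPL",),)
term = rows[0][0]  # first column of the first row; no str()/replace() needed

# Hypothetical markup shaped like the elements the question selects.
html = """
<p class="data bgLast">101.25</p>
<span class="bgChange">+0.75</span>
"""
soup = BeautifulSoup(html, "html.parser")

# get_text() returns the element's text as one string, not a list of nodes.
cur = soup.find("p", attrs={"class": "data bgLast"}).get_text(strip=True)
diff = soup.find("span", attrs={"class": "bgChange"}).get_text(strip=True)

print(term, cur, diff)
```

Because term, cur and diff are already plain strings here, they can be passed to c.execute as parameters without any further cleanup.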

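Flattening a structured object such as a list into a plain string and then trying to pull the information back out of it is not a sensible approach.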
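To illustrate why, here is a small sketch with an invented value ("up 1.2" is an assumption, not data from the question). strip treats its argument as a set of characters to remove from both ends, so the str-then-strip approach can eat into the real value, while working on the list itself does not.

```python
# Hypothetical stand-in for what findAll(text=True) returns: a list of text nodes.
nodes = ["up 1.2"]

# str() on a list yields its repr, brackets and quotes included.
as_text = str(nodes)

# strip() removes any of the listed characters from both ends,
# so the leading "u" of the real value is removed as well.
stripped = as_text.strip("'[]u")

# Joining the nodes directly keeps the value intact.
joined = "".join(nodes).strip()

print(repr(stripped))
print(repr(joined))
```

The same loss happens under Python 2, where the repr carries an extra u prefix for unicode strings.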
