尝试在多条固定长度的行上打印单个字符串并尽量减少成本
首先,给大家一些背景信息。我刚开始学习算法(现在觉得自己在逻辑和推理方面还不够强),我一直在尝试把“This is a sample text”这句话分成多行,每行最多7个字符。所以第一行应该是:
this is (no spaces left in the end so cost 0)
a
[cost=6*6*6(The spaces left at the end of each line are cubed which will be the cost) ]
sample [cost=1*1*1]
text [cost= 3*3*3]
(Total cost = 0+216+1+27=244)
现在,这个过程可以优化一下:
this [cost 3*3*3]
is a [cost 3*3*3]
sample [cost 1*1*1]
text [cost 3*3*3]
[Total cost = 27+27+1+27 = 82]
很明显,我们不能用贪心算法来解决这个问题,而是应该使用动态规划。不过,我的问题是我无法找到可以重复使用的子结构。我真的不知道怎么把成本条件和在Python中打印的过程联系起来。我可以给每个单词编号,也可以得到每个单词的长度,但接下来该怎么做我就卡住了。当我打印的时候,所有的内容都在一行上显示(这就是我目前的进展)。
如果这个问题听起来很傻,我很抱歉,但我真的卡住了,需要一些帮助。
谢谢!
这是我尝试实现的代码,虽然我对代码进行了测试,这些测试是我朋友写的,但我觉得我没有做到正确。任何帮助或建议都非常感谢。
print_test.py import os
import sys
from glob import glob
#TODO -- replace this with your solution
from printing import print_neatly
log = open('output.log', 'w')
#This tests the code against my own text
maxline = 80
for source in glob('*.txt'):
with open(source) as f:
fulltext = f.read()
words = fulltext.split()
(cost, text) = print_neatly(words, maxline)
#double check the cost
#lines = text.split('\n')
truecost = 0
for line in text[0:-1]:
truecost += (maxline - len(line))**3
#print the output and cost
print >>log, '----------------------'
print >>log, source
print >>log, '----------------------'
print >>log, text
print >>log, '----------------------'
print >>log, 'cost = ', cost
print >>log, 'true cost = ', truecost
print >>log, '----------------------'
log.close()
#print the log
with open('output.log') as f: print f.read()
printing.py
def print_neatly(wordlist, max):
#strings='This is a sample string'
#splitting the string and taking out words from it
#wordlist=strings.split()
(cost, dyn_print) = print_line(wordlist, len(wordlist), max)
for dyn in dyn_print:
print dyn
return cost, dyn_print
def cost(lines, max):
return sum([(max-len(x)) ** 3 for x in lines])
def print_line(wordlist, count, max, results = {}):
results = [([],0)]
for count in range(1, len(wordlist) + 1):
best = wordlist[:count]
best_cost = cost(best, max)
mycount = count - 1
line = wordlist[mycount]
while len(line) <= max:
attempt, attempt_cost = results[mycount]
attempt = attempt + [line]
attempt_cost += cost([line],max)
if attempt_cost < best_cost:
best = attempt
best_cost = attempt_cost
if mycount > 0:
mycount -= 1
line = wordlist[mycount] + ' ' + line
else:
break
results += [(best, best_cost)]
#print best
#print best_cost
return (best_cost, best)
#print_neatly(0,7)
需要测试的文本文件给我这个输出,这里两个成本应该是相同的,但我没有得到,能有人指出我哪里出错了吗?
cost = 16036
true cost = 15911
3 个回答
这个算法的基础假设是,如果我们知道文本中最后 N-1、N-2、...、2、1 个单词的最佳解决方案,那么就很容易为 N 个单词构建出最佳解决方案。通过记忆化技术,我们可以避免对相同输入重复计算 best_partition()
的结果:
import functools
def wrap(text, width):
"""
>>> wrap('This is a sample text', 7)
['This', 'is a', 'sample', 'text']
"""
return [' '.join(line) for line in best_partition(
tuple(text.split()), functools.partial(cost, width=width))]
def best_partition(words, cost):
"""The best partition of words into lines according to the cost function."""
best = [words] # start with all words on a single line
for i in reversed(range(1, len(words))): # reverse to avoid recursion limit
lines = [words[:i]] + best_partition(words[i:], cost)
if cost(lines) < cost(best):
best = lines
return best
def memoize(func):
cache = {}
@functools.wraps(func)
def wrapper(*args):
try: return cache[args]
except KeyError:
ret = cache[args] = func(*args)
return ret
return wrapper
best_partition = memoize(best_partition)
其中 cost()
是:
def linelen(words):
"""Number of characters in a line created from words."""
if not words: return 0
# words + spaces between them
return sum(map(len, words)) + len(words) - 1
def cost(lines, width):
"""
- each line except last costs `(width - w)**3`, where `w` is the
line width
- cost is infinite if `w > width` and the line has more than one word
>>> cost([['a'], ['b']], 1)
0
>>> cost([['a','b']], 1)
inf
>>> cost([['a'], ['b']], 3)
8
>>> cost([['a', 'b']], 2)
inf
"""
if not lines: return 0
s = 0
for i, words in enumerate(lines, 1):
w = linelen(words)
if width >= w:
if i != len(lines): # last line has zero cost
s += (width - w)**3
elif len(words) != 1: # more than one word in the line
return float("inf") # penalty for w > width
return s
示例:
print('\n'.join(wrap("""
In olden times when wishing still helped one, there lived a king whose
daughters were all beautiful, but the youngest was so beautiful that
the sun itself, which has seen so much, was astonished whenever it
shone in her face. Close by the king's castle lay a great dark forest,
and under an old lime-tree in the forest was a well, and when the day
was very warm, the king's child went out into the forest and sat down
by the side of the cool fountain, and when she was bored she took a
golden ball, and threw it up on high and caught it, and this ball was
her favorite plaything.
""", int(sys.argv[1]) if len(sys.argv) > 1 else 70)))
输出
In olden times when wishing still helped one, there lived a king whose daughters were all beautiful, but the youngest was so beautiful that the sun itself, which has seen so much, was astonished whenever it shone in her face. Close by the king's castle lay a great dark forest, and under an old lime-tree in the forest was a well, and when the day was very warm, the king's child went out into the forest and sat down by the side of the cool fountain, and when she was bored she took a golden ball, and threw it up on high and caught it, and this ball was her favorite plaything.
如果有一种“最佳”方式来把一个词、两个词等排成行,这种方式不会因为后面出现的行而改变。它可能会因为后面出现的词而改变,如果这些词足够小,可以和其他词放在同一行。但如果我们单独考虑这些词并试图把它们排成行,那么同一组解决方案总是最优的。(可能会有等效的答案;例如,按照标准,“cats in hats”在7个字符的行上有两个解决方案。两个都是“最佳”的,永远都是——我们可以选择其中一个并坚持使用,而不会影响正确性。)
"This"
总是最好放成["This"]
。 (注意,我并不是说它总是最好单独放在一行!我想说的是,如果你只有一个词,最好的方式就是把它放在一行。)"This is"
可以排成["This", "is"]
或["This is"]
。不过,后者是最好的。所以从现在开始,每当我们只考虑这两个词时,我们可以完全忽略 ["This", "is"]——它永远不会更好。"This is a"
可以排成["This", "is", "a"]
,["This is", "a"]
或["This", "is a"]
。 (我们已经知道["This is"]
比["This", "is"]
更好——见上一个要点!)结果是["This", "is a"]
是最好的。所以从现在开始,我们可以忽略 ["This is", "a"]。"This is a sample"
可以排成:(见要点#2——我们甚至不需要考虑这个)["This", "is", "a", "sample"]
(见要点#3)["This is", "a", "sample"]
["This", "is a", "sample"]
我不懂Python;我只是把这个拼凑在一起。所以如果它看起来“不够Pythonic”或者其他什么,请多多包涵。:P
def cost(lines, limit):
# figures the cost of the current arrangement of words in lines.
return sum([(limit-len(x)) ** 3 for x in lines])
def lineify(words, limit):
# splits up words into lines of at most (limit) chars.
# should find an optimal solution, assuming all words are < limit chars long
results = [([], 0)]
for count in range(1, len(words) + 1):
best = words[:count] # (start off assuming one word per line)
best_cost = cost(best, limit)
mycount = count - 1
line = words[mycount] # start with one word
while len(line) <= limit:
# figure the optimal cost, assuming the other words are on another line
attempt, attempt_cost = results[mycount]
attempt = attempt + [line]
attempt_cost += cost([line],limit)
# print attempt
if attempt_cost < best_cost:
best = attempt
best_cost = attempt_cost
# steal another word. if there isn't one, we're done
if mycount > 0:
mycount -= 1
line = words[mycount] + ' ' + line
else:
break
# once we have an optimal result for (count) words, save it for posterity
results += [(best, best_cost)]
return results[len(words)][0]
def wrap(phrase, limit):
# helper function...so the caller doesn't have to pass an array of words.
# they shouldn't need to know to do that
words = phrase.split()
return lineify(words, limit)
我最开始有一个递归的解决方案,但结果发现Python对递归有一些限制,这让它在处理较大文本和现实世界的长度限制时不太适用。(你必须回溯到开始,才能进行记忆化处理,如果我有超过1000个词,就会触及递归限制。这可以通过从足够的词开始填满最后一行来扩展,但仍然会限制最大值为原始限制的某个倍数。)我发现自己使用了一种技巧来逐步构建结果,直到递归限制不再是问题。不过,如果你必须这样做,那可能说明递归本身就是个问题。
一种方法是列出所有可能的选择,然后挑选出成本最低的那个:
from functools import wraps
def cache(origfunc):
d = {}
@wraps(origfunc)
def wrapper(*args):
if args in d:
return d[args]
result = origfunc(*args)
d[args] = result
return result
return wrapper
@cache
def alternatives(t, m=7):
''' Given a tuple of word lengths and a maximum line length,
return a list of all possible line groupings
showing the total length of each line.
>>> alternatives((4, 2, 1, 3), 7)
[[4, 2, 1, 3], [4, 2, 5], [4, 4, 3], [7, 1, 3], [7, 5]]
'''
if not t:
return []
alts = []
s = 0
for i, x in enumerate(t):
s += x
if s > m:
break
tail = t[i+1:]
if not tail:
alts.append([s])
break
for subalt in alternatives(tail, m):
alts.append([s] + subalt)
s += 1
return alts
def cost(t, m=7):
''' Evaluate the cost of lines given to line lengths
>>> cost((7, 1, 6, 4), m=7) # 'this is', 'a', 'sample', 'text'
244
>>> cost((4, 4, 6, 4)) # 'this', 'is a', 'sample', 'text'
82
'''
return sum((m - x) ** 3 for x in t)
def textwrap(s, m=7):
''' Given a string, result a list of strings with optimal line wrapping
>>> print textwrap('This is a sample text', 7)
['This', 'is a', 'sample', 'text']
'''
words = s.split()
t = tuple(map(len, words))
lengths = min(alternatives(t, m), key=cost)
result = []
worditer = iter(words)
for length in lengths:
line = []
s = 0
while s < length:
word = next(worditer)
line.append(word)
s += len(word) + 1
result.append(' '.join(line))
return result
if __name__ == '__main__':
import doctest
print doctest.testmod()
为了加快代码的运行速度,可以限制搜索的选择数量(比如每行只考虑三个最长的选择)。