Java web crawler runs out of heap space


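I wrote a small crawler and found that it runs out of heap space (even though I currently limit the number of URLs in my list to 300).

Using the Java Memory Analyzer, I found that the consumer is char[] (45 MB out of 64 MB, and more if I increase the allowed size; it just keeps growing).

The analyzer also shows the contents of the char[] arrays. They contain the HTML pages read by the crawler.

Digging deeper with different settings for -Xmx[...]m, I found that Java uses almost all of the available space and then runs out of heap as soon as I try to download an image of 3 MB.

When I give Java 16 MB, it uses 14 MB and fails; when I give it 64 MB, it uses 59 MB and fails when trying to download a large image.

Reading the pages is done with this code (edited to add .close()):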

private String readPage(Website url) throws CrawlerException {
    StringBuffer sourceCodeBuffer = new StringBuffer();
    try {
        URLConnection con = url.getUrl().openConnection();
        con.setConnectTimeout(2000);
        con.setReadTimeout(2000);

        BufferedReader br = new BufferedReader(new InputStreamReader(con.getInputStream()));
        String strTemp = "";
        try {
            while(null != (strTemp = br.readLine())) {
                sourceCodeBuffer = sourceCodeBuffer.append(strTemp);
            }
        } finally {
            br.close();
        }
    } catch (IOException e) {
        throw new CrawlerException();
    }

    return sourceCodeBuffer.toString();
}

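Another function uses the returned string in a while loop, but as far as I know the space should be freed as soon as the string is overwritten by the next page.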

public void run() {
    boolean stop = false;

    while (stop == false) {
        try {
            Website nextPage = getNextPage();

            String source = visitAndReadPage(nextPage);
            List<Website> links = new LinkExtractor(nextPage).extract(source);
            List<Website> images = new ImageExtractor(nextPage).extract(source);

            // do something with links and images, source is not used anymore
        } catch (CrawlerException e) {
            logger.warning("could not crawl a url");
        }
    }
}

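Below is a sample of the analyzer's output. When I ask where these char[] arrays are still needed, the analyzer cannot tell. So I assume they are no longer needed and should be garbage collected. Since usage always stays just below the maximum, Java does seem to garbage collect, but only as much as is needed to keep the program running at that moment (without allowing for a large input that might arrive).

Also, explicitly calling System.gc() every 5 seconds, or even after setting source = null;, does not help.

The page source seems to be stored for as long as possible, one way or another.

Am I using some construct that forces the read strings to be kept permanently? Otherwise, how can Java keep these website Strings in char[] arrays for so long?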

Class Name | Shallow Heap | Retained Heap | Percentage
------------------------------------------------------
char[60750] @ 0xb02c3ee0 <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><head><title>Wallpaper Kostenlos - 77.777 E-Wallpapers: Widescreen, 3D, Handy, Sexy Frauen</title><link rel="shortcut icon" href="http://img.e-wallp...| 121.512 | 121.512 | 1,06%
char[60716] @ 0xb017c9b8 <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><head><title>Wallpaper Kostenlos - 77.777 E-Wallpapers: Widescreen, 3D, Handy, Sexy Frauen</title><link rel="shortcut icon" href="http://img.e-wallp...| 121.448 | 121.448 | 1,06%
char[60686] @ 0xb01f3c88 <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><head><title>Wallpaper Kostenlos - 77.777 E-Wallpapers: Widescreen, 3D, Handy, Sexy Frauen</title><link rel="shortcut icon" href="http://img.e-wallp...| 121.384 | 121.384 | 1,06%
char[60670] @ 0xb015ec48 <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><head><title>Wallpaper Kostenlos - 77.777 E-Wallpapers: Widescreen, 3D, Handy, Sexy Frauen</title><link rel="shortcut icon" href="http://img.e-wallp...| 121.352 | 121.352 | 1,06%
char[60655] @ 0xb01d5d08 <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><head><title>Wallpaper Kostenlos - 77.777 E-Wallpapers: Widescreen, 3D, Handy, Sexy Frauen</title><link rel="shortcut icon" href="http://img.e-wallp...| 121.328 | 121.328 | 1,06%
char[60651] @ 0xb009d9c0 <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><head><title>Wallpaper Kostenlos - 77.777 E-Wallpapers: Widescreen, 3D, Handy, Sexy Frauen</title><link rel="shortcut icon" href="http://img.e-wallp...| 121.320 | 121.320 | 1,06%
char[60637] @ 0xb022f418 <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><head><title>Wallpaper Kostenlos - 77.777 E-Wallpapers: Widescreen, 3D, Handy, Sexy Frauen</title><link rel="shortcut icon" href="http://img.e-wallp...| 121.288 | 121.288 | 1,06%

Edit

After testing with more memory, I found URLs like this in the dominator tree:

Class Name | Shallow Heap | Retained Heap | Percentage
crawling.Website @ 0xa8d28cb0 | 16 | 759.776 | 0,15%
|- java.net.URL @ 0xa8d289c0 https://www.google.com/recaptcha/api/image?c=03AHJ_VuuT4CmbxjAoKzWEKOqLaTCyhT-89l3WOeVjekKWW81tdZsnCvpIrQ52aLTw92rP-EUP9ThnzwBwHcRLXG6A0Bpwu11cGttRAUtarmWXhdcTVRoUMLNnJNZeuuA7LedgfTou76nl8ULyuIR3tgo7_lQ21tzzBhpaTSqwYHWyuZGfuRK3z9pgmqRqvI7gE4_4lexjYbkpd62kN... | 56 | 759.736 | 0,15%
| |- char[379486] @ 0xa8c6f4f8 <!DOCTYPE html><html lang="en"> <head> <meta charset="utf-8"> <meta http-equiv="X-UA-Compatible" content="IE=EmulateIE9"> <title>Google Accounts</title><style type="text/css"> html, body, div, h1, h2, h3, h4, h5, h6, p, img, dl, dt, dd, ol, ul, li, t... | 758.984 | 758.984 | 0,15%
| |- java.lang.String @ 0xa8d28a40 /recaptcha/api/image?c=03AHJ_VuuT4CmbxjAoKzWEKOqLaTCyhT-89l3WOeVjekKWW81tdZsnCvpIrQ52aLTw92rP-EUP9ThnzwBwHcRLXG6A0Bpwu11cGttRAUtarmWXhdcTVRoUMLNnJNZeuuA7LedgfTou76nl8ULyuIR3tgo7_lQ21tzzBhpaTSqwYHWyuZGfuRK3z9pgmqRqvI7gE4_4lexjYbkpd62kNBZ7UIDccO5bx6TqFpf-7Sl...| 24 | 624 | 0,00%
| | '- char[293] @ 0xa8d28a58 /recaptcha/api/image?c=03AHJ_VuuT4CmbxjAoKzWEKOqLaTCyhT-89l3WOeVjekKWW81tdZsnCvpIrQ52aLTw92rP-EUP9ThnzwBwHcRLXG6A0Bpwu11cGttRAUtarmWXhdcTVRoUMLNnJNZeuuA7LedgfTou76nl8ULyuIR3tgo7_lQ21tzzBhpaTSqwYHWyuZGfuRK3z9pgmqRqvI7gE4_4lexjYbkpd62kNBZ7UIDccO5bx6TqFpf-7Sl... | 600 | 600 | 0,00%
| |- java.lang.String @ 0xa8d289f8 c=03AHJ_VuuT4CmbxjAoKzWEKOqLaTCyhT-89l3WOeVjekKWW81tdZsnCvpIrQ52aLTw92rP-EUP9ThnzwBwHcRLXG6A0Bpwu11cGttRAUtarmWXhdcTVRoUMLNnJNZeuuA7LedgfTou76nl8ULyuIR3tgo7_lQ21tzzBhpaTSqwYHWyuZGfuRK3z9pgmqRqvI7gE4_4lexjYbkpd62kNBZ7UIDccO5bx6TqFpf-7Sl6YmMgFC77kWZR7vvZIPkS...| 24 | 24 | 0,00%
| |- java.lang.String @ 0xa8d28a10 www.google.com | 24 | 24 | 0,00%
| |- java.lang.String @ 0xa8d28a28 /recaptcha/api/image | 24 | 24 | 0,00%

What I really want to know: why is the HTML source part of the java.net.URL? Does it come from the URL connection I opened?

6 answers

# Answer 1

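I would first try closing the reader and the URL connection at the end of the readPage method. It is best to put this logic in a finally clause.

A connection that is kept open uses memory and, depending on its internals, the GC may not be able to reclaim it even if you no longer reference it in your code.

Update (based on comments): the connection itself has no close() method; it is closed once all readers on it are closed.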
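
A minimal sketch of the cleanup this answer describes, assuming Java 7 or later so that try-with-resources can replace the explicit finally block. The class name PageReader, the method readAll and the maxChars cap are hypothetical and not part of the asker's crawler; the cap is an added assumption to keep one oversized response from filling the heap. The stream is passed in so the sketch runs without network access.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the asker's crawler.
final class PageReader {

    // Reads at most maxChars characters from the stream and always closes it.
    static String readAll(InputStream in, Charset charset, int maxChars) throws IOException {
        StringBuilder page = new StringBuilder();
        // try-with-resources closes the reader (and the wrapped stream) even if reading fails.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            char[] chunk = new char[4096];
            int n;
            while ((n = reader.read(chunk)) != -1) {
                // Assumed safety cap: give up instead of buffering an arbitrarily large body.
                if (page.length() + n > maxChars) {
                    throw new IOException("page exceeds " + maxChars + " characters");
                }
                page.append(chunk, 0, n);
            }
        }
        return page.toString();
    }

    public static void main(String[] args) throws IOException {
        // In the crawler the stream would come from URLConnection.getInputStream().
        InputStream in = new ByteArrayInputStream("<html>\n</html>".getBytes(StandardCharsets.UTF_8));
        System.out.println(readAll(in, StandardCharsets.UTF_8, 1_000_000));
    }
}
```

Unlike the readLine() loop in the question, reading in chunks keeps the line breaks of the page.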
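# Answer 2

How many threads are running at any given time? The char arrays in the pastebin you sent appear to be thread-local (which means nothing is leaking). If you run too many at the same time, you will naturally run out of memory. Try running two threads with the same number of URLs.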
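
One way to try the two-thread experiment is a fixed-size thread pool, sketched below under the assumption of Java 8 or later. BoundedCrawl and crawlAll are hypothetical names, and the task body is only a placeholder for the crawler's real per-page work.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical harness, not part of the asker's crawler.
final class BoundedCrawl {

    // Runs one task per URL on at most `threads` workers and returns how many tasks finished.
    static int crawlAll(List<String> urls, int threads) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger done = new AtomicInteger();
        for (String url : urls) {
            pool.execute(() -> {
                // Placeholder: the real crawler would download and parse `url` here,
                // so at most `threads` pages are held in memory at once.
                done.incrementAndGet();
            });
        }
        pool.shutdown(); // accept no new tasks, let queued ones finish
        if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
            pool.shutdownNow();
        }
        return done.get();
    }
}
```

Calling crawlAll(urls, 2) with the same URL list and comparing heap usage against the original thread count would show whether concurrency alone explains the growth.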
# Answer 3

Another possible cause I found is that substring uses the same old large char array used by the original string. So if you keep a substring, the whole string is retained.
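
This sharing applies to older JVMs; from Java 7 update 6 on, substring() copies the characters it returns. It would be consistent with the dominator tree in the question, where a java.net.URL retains a char[379486] holding a whole page, although the thread does not confirm that this is the cause. A minimal sketch of the usual workaround on such JVMs, copying the extracted text before storing it; SubstringCopy and extractQuoted are hypothetical stand-ins for whatever the asker's LinkExtractor does:

```java
// Hypothetical helper, not the asker's LinkExtractor.
final class SubstringCopy {

    // Returns the text between the first pair of double quotes at or after `from`,
    // or null if there is no such pair.
    static String extractQuoted(String page, int from) {
        int start = page.indexOf('"', from);
        if (start < 0) {
            return null;
        }
        int end = page.indexOf('"', start + 1);
        if (end < 0) {
            return null;
        }
        // new String(...) gives the result its own character storage. On JVMs where
        // substring() shares the parent's array, this detaches the link from the page,
        // so the page can be collected once `page` itself is unreachable.
        return new String(page.substring(start + 1, end));
    }

    public static void main(String[] args) {
        String page = "<a href=\"http://example.com/\">x</a>";
        System.out.println(extractQuoted(page, 0));
    }
}
```

On current JVMs the extra copy is redundant but harmless.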
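# Answer 4

Most likely a reference is being held somewhere, which prevents garbage collection. Tracking this down always takes some digging. I usually start with a profiler that offers heap analysis. If possible, write a small test program that loads a page and does little else. It can simply go through a list of 3-4 URLs that include some large pictures. If a page contains a large picture, say 10+ MB, it should be easy to find in the profiler. In the worst case, a library you use is holding the reference. A small test program is the best way to debug this.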
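
A rough sketch of such a test program, using only Runtime to print heap usage around one step. HeapProbe is a hypothetical name, and the char array is a stand-in for a downloaded page; in a real test the allocation would be replaced by a call to the crawler's page-loading code. System.gc() is only a hint to the JVM, so the numbers are indicative rather than exact.

```java
// Hypothetical test harness for watching heap usage around a single step.
final class HeapProbe {

    // Heap currently in use, as reported by the JVM.
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        long before = usedHeapBytes();

        // Stand-in for loading one page; replace with the code under test.
        char[] page = new char[5_000_000];
        long during = usedHeapBytes();

        // Drop the only reference and ask for a collection (a hint, not a guarantee).
        page = null;
        System.gc();
        long after = usedHeapBytes();

        System.out.printf("before=%d during=%d after=%d%n", before, during, after);
    }
}
```

If usage after the step stays close to the "during" value across many iterations, something still references the data, and a heap dump taken at that point is small enough to inspect easily.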
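# Answer 5

When I give Java 16MB, it uses 14MB and fails, when I give it 64MB it used 59MB and fails when trying to download a large image.

This is not surprising, since you are already close to the limit. A 3 MB image can expand to 60 MB or more when it is loaded (decompressed). Can you increase the maximum to 1 GB?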
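
The arithmetic behind that estimate can be sketched as follows. The dimensions are an illustrative assumption (the thread does not state the image's size), as is the 4 bytes per pixel of a typical decoded ARGB bitmap; ImageSize and decodedBytes are hypothetical names.

```java
// Hypothetical helper illustrating the decoded size of a bitmap.
final class ImageSize {

    // Bytes needed to hold an uncompressed bitmap of the given dimensions.
    static long decodedBytes(int width, int height, int bytesPerPixel) {
        // Cast first so the multiplication is done in long and cannot overflow int.
        return (long) width * height * bytesPerPixel;
    }

    public static void main(String[] args) {
        // Assumed example: a 5000 x 3000 pixel image decoded to 4 bytes per pixel.
        System.out.println(decodedBytes(5000, 3000, 4)); // prints 60000000
    }
}
```

This only matters if the crawler decodes the image into a bitmap; copying the compressed bytes through a small buffer to disk needs far less heap.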
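# Answer 6

I am not sure your information supports the conclusion that garbage collection is not working. You run out of memory when more memory is allocated. You say you think some objects are eligible for GC but the JVM does not collect them. I would trust the JVM rather than guess.

Somewhere in your application there is a memory leak. Somewhere, in some object, you are keeping a reference to the entire content of a web page. That fills up your free memory.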
