How can Java get text from a web browser?


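I'm wondering whether anyone knows a good technique for getting all of the text on the current web page from a Java application.

I have tried two approaches:

OCR: This was not accurate enough for me; only about 60% of the text came out correct. It also captures only the text visible in the screenshot, and I need all of the text on the page.

Robot class: My current approach uses the Robot class to send Ctrl-A and Ctrl-C key presses and then reads the text from the clipboard. This has proven effective at getting the text. My only problem is that the user sees the text highlighted for a split second, which I don't want them to see.

This may sound like some form of spyware to some people, but it is a final-year university project for countering cyberbullying and child grooming, and it stores information only when malicious behavior is detected.

Can anyone think of a better way to get the text out of the browser?

Many thanks.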

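import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.GetMethod;

// Uses Apache Commons HttpClient 3.x. The request has to be executed
// before the response body can be read.
HttpClient client = new HttpClient();
GetMethod get = new GetMethod("http://ThePage.com");
client.executeMethod(get);
InputStream in = get.getResponseBodyAsStream();
String htmlText = readString(in);
get.releaseConnection();

// Reads the whole stream into a String, decoding it as UTF-8.
static String readString(InputStream is) throws IOException {
    char[] buf = new char[2048];
    Reader r = new InputStreamReader(is, "UTF-8");
    StringBuilder s = new StringBuilder();
    while (true) {
        int n = r.read(buf);
        if (n < 0) break;
        s.append(buf, 0, n);
    }
    return s.toString();
}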

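# Answer 2

Here is a utility class I created for this purpose. It has a version that throws only runtime exceptions and a version that throws checked exceptions, and it can also verify that the retrieved source ends with an expected string.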

   import  java.io.BufferedInputStream;
   import  java.io.IOException;
   import  java.io.InputStream;
   import  java.net.MalformedURLException;
   import  java.io.EOFException;
   import  java.net.URL;

/**
   <P>Append the source-code from a web-page into a <CODE>java.lang.Appendable</CODE>.</P>

   <P>Demo: {@code java AppendWebPageSource}</P>
 **/
public class AppendWebPageSource  {
   public static final void main(String[] igno_red)  {
      String sHtml = AppendWebPageSource.get("http://usatoday.com", null);
      System.out.println(sHtml);   

      //Alternative:
      AppendWebPageSource.append(System.out, "http://usatoday.com", null);
   }
   /**
      <P>Get the source-code from a web page, with runtime-errors only.</P>

      @return  {@link #append(Appendable, String, String) append}{@code ((new StringBuilder()), s_httpUrl, s_endingString)}
    **/
   public static final String get(String s_httpUrl, String s_endingString)  {
      return  append((new StringBuilder()), s_httpUrl, s_endingString).toString();
   }
   /**
      <P>Append the source-code from a web page, with runtime-errors only.</P>

      @return  {@link #appendX(Appendable, String, String) appendX}{@code (ap_bl, s_httpUrl, s_endingString)}
      @exception  RuntimeException  Whose {@link Throwable#getCause() getCause()} contains the original {@link java.io.IOException} or {@code java.net.MalformedURLException}.
    **/
   public static final Appendable append(Appendable ap_bl, String s_httpUrl, String s_endingString)  {
      try  {
         return  appendX(ap_bl, s_httpUrl, s_endingString);
      }  catch(IOException iox)  {
         throw  new RuntimeException(iox);
      }
   }
   /**
      <P>Append the source-code from a web-page into a <CODE>java.lang.Appendable</CODE>.</P>

      <P><I>I got this from {@code <A HREF="http://www.davidreilly.com/java/java_network_programming/">http://www.davidreilly.com/java/java_network_programming/</A>}, item 2.3.</I></P>

      @param  ap_bl  May not be {@code null}.
      @param  s_httpUrl  May not be {@code null}, and must be a valid url.
      @param  s_endingString  If non-{@code null}, the web-page's source-code must end with this. May not be empty.
      @see  #get(String, String)
      @see  #append(Appendable, String, String)
    **/
   public static final Appendable appendX(Appendable ap_bl, String s_httpUrl, String s_endingString)  throws MalformedURLException, IOException  {
      if(s_httpUrl == null  ||  s_httpUrl.length() == 0)  {
         throw  new IllegalArgumentException("s_httpUrl (\"" + s_httpUrl + "\") is null or empty.");
      }
      if(s_endingString != null  &&  s_endingString.length() == 0)  {
         throw  new IllegalArgumentException("s_endingString is non-null and empty.");
      }

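      // Create a URL instance
      URL url = new URL(s_httpUrl);

      // The ending string never changes, so convert it once, outside the loop
      char[] ac = ((s_endingString == null) ? null : s_endingString.toCharArray());
      int ixEndStr = 0;

      // Open a buffered input stream for efficiency. The try-with-resources
      // statement closes it on every exit path, including the early return below.
      try(BufferedInputStream bis = new BufferedInputStream(url.openStream()))  {

         // Repeat until end of file
         while(true)  {
            int iChar = bis.read();

            // Check for EOF
            if (iChar == -1)  {
               break;
            }

            //Each byte becomes one char, so only single-byte encodings
            //(ASCII, ISO-8859-1) are decoded correctly.
            char c = (char)iChar;

            try  {
               ap_bl.append(c);
            }  catch(NullPointerException npx)  {
               throw  new NullPointerException("ap_bl");
            }

            if(ac != null)  {
               //There is an ending string;

               if(c == ac[ixEndStr])  {
                  //The character just retrieved is equal to the
                  //next character in the ending string.

                  if(ixEndStr == (ac.length - 1))  {
                     //The entire string has been found. Done.
                     return ap_bl;
                  }

                  ixEndStr++;
               }  else  {
                  //Mismatch: restart the match, letting the current character
                  //begin a new one. Ending strings with a repeated prefix
                  //(such as "aab") can still be missed by this simple restart.
                  ixEndStr = ((c == ac[0]) ? 1 : 0);
               }
            }
         }
      }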

      if(s_endingString != null)  {
         //Should have exited at the "return" above.
         throw  new EOFException("s_endingString (\"" + s_endingString + "\") is non-null, and was not found at the end of the web-page's source-code.");
      }
      return  ap_bl;
   }
}

5 answers

# Answer 1

You could try something like this (the GetMethod snippet shown directly below the question).

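# Answer 3

You can use URLConnection or Apache's HttpClient to fetch all of the HTML from the website. The following question explains how to do this: Get html file Java

Of course, this will not give you text that is embedded in binary content such as Flash files or images; only OCR will work for that.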
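For the URLConnection route, a minimal sketch might look like the following. The class name UrlConnectionFetch, the 10-second timeouts, and the assumption that the page is UTF-8 encoded are all mine, not from the answer; a real page may declare a different charset in its Content-Type header.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not from the answer): fetches a page's HTML via URLConnection. */
class UrlConnectionFetch {

    /** Reads the whole response body as text, assuming it is UTF-8 encoded. */
    static String fetch(URL url) throws IOException {
        URLConnection conn = url.openConnection();
        conn.setConnectTimeout(10_000); // illustrative timeouts, in milliseconds
        conn.setReadTimeout(10_000);

        StringBuilder html = new StringBuilder();
        // try-with-resources closes the stream even if reading fails
        try (InputStream in = conn.getInputStream();
             Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            char[] buf = new char[2048];
            int n;
            while ((n = reader.read(buf)) >= 0) {
                html.append(buf, 0, n);
            }
        }
        return html.toString();
    }

    public static void main(String[] args) throws IOException {
        // Pass the page URL as the only argument, e.g. http://ThePage.com
        if (args.length == 1) {
            System.out.println(fetch(new URL(args[0])));
        }
    }
}
```

Note that this returns the raw HTML as the server sent it, so text that the browser adds later through scripts will not be included.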

# Answer 4

Take the URL and read the page with an HTTP client class, e.g. Apache Commons HttpGet.

For more information, read here: http://hc.apache.org/httpclient-3.x/tutorial.html
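If adding the Apache library is not an option, the same idea can be sketched with the HTTP client built into Java 11 and later (java.net.http). This swaps in a different client from the one the answer names; the class name HttpClientFetch and the choice to treat any status other than 200 as an error are my assumptions.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Hypothetical helper (not from the answer): fetches a page with java.net.http.HttpClient. */
class HttpClientFetch {

    /** Sends a GET request and returns the response body as a String. */
    static String fetch(String url) throws IOException, InterruptedException {
        // The default client does not follow redirects.
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();

        // BodyHandlers.ofString() decodes using the charset from the
        // Content-Type header, falling back to UTF-8 when none is given.
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());

        if (response.statusCode() != 200) {
            throw new IOException(
                    "Unexpected HTTP status " + response.statusCode() + " for " + url);
        }
        return response.body();
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        // Pass the page URL as the only argument, e.g. http://ThePage.com
        if (args.length == 1) {
            System.out.println(fetch(args[0]));
        }
    }
}
```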

# Answer 5

The most general solution is a traffic sniffer.
