How can I get the text from a web browser in Java?

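I was wondering if anyone knows a good technique for getting all of the text on the current web page from a Java application.

I have tried two approaches:

  1. OCR: this is not accurate enough for me, since only about 60% of the text comes out right. It also only captures the text visible in the screenshot, and I need all of the text on the page.

  2. Robot class: my current approach uses the Robot class to send Control-A and Control-C, then reads the text from the clipboard. This has proved useful for getting the text. My only problem is that the user sees the highlighted text for a moment, which I don't want them to see.

This is a final-year university project, and an anti-cyberbullying/anti-child-grooming project that only stores information when malicious behaviour is detected, but to some people it may sound like some form of spyware.

Can anyone think of a better way to get the text out of the browser?

Many thanks.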


5 answers in total

  1. # Answer 1

    You could try something like this:

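    // Uses Apache Commons HttpClient 3.x (org.apache.commons.httpclient)
    HttpClient client = new HttpClient();
    GetMethod get = new GetMethod("http://ThePage.com");
    // The request has to be executed before the response body can be read
    client.executeMethod(get);
    InputStream in = get.getResponseBodyAsStream();
    String htmlText = readString(in);
    get.releaseConnection();

    // Reads the whole stream into a String, decoding it as UTF-8
    static String readString(InputStream is) throws IOException {
        char[] buf = new char[2048];
        Reader r = new InputStreamReader(is, "UTF-8");
        StringBuilder s = new StringBuilder();
        while (true) {
            int n = r.read(buf);
            if (n < 0)
                break;
            s.append(buf, 0, n);
        }
        return s.toString();
    }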
    
  2. # Answer 2

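    Here is a utility class I created for this purpose. It comes in two versions, one that throws only runtime exceptions and one that throws checked exceptions, and it can also verify the tail of the retrieved source.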

    import java.io.BufferedInputStream;
    import java.io.EOFException;
    import java.io.IOException;
    import java.io.InputStream;
    import java.net.MalformedURLException;
    import java.net.URL;
    
    /**
       <P>Append the source-code from a web-page into a <CODE>java.lang.Appendable</CODE>.</P>
    
       <P>Demo: {@code java AppendWebPageSource}</P>
     **/
    public class AppendWebPageSource  {
       public static final void main(String[] igno_red)  {
          String sHtml = AppendWebPageSource.get("http://usatoday.com", null);
          System.out.println(sHtml);   
    
          //Alternative:
          AppendWebPageSource.append(System.out, "http://usatoday.com", null);
       }
       /**
          <P>Get the source-code from a web page, with runtime-errors only.</P>
    
          @return  {@link #append(Appendable, String, String) append}{@code ((new StringBuilder()), s_httpUrl, s_endingString)}
        **/
       public static final String get(String s_httpUrl, String s_endingString)  {
          return  append((new StringBuilder()), s_httpUrl, s_endingString).toString();
       }
       /**
          <P>Append the source-code from a web page, with runtime-errors only.</P>
    
          @return  {@link #appendX(Appendable, String, String) appendX}{@code (ap_bl, s_httpUrl, s_endingString)}
          @exception  RuntimeException  Whose {@link Throwable#getCause() getCause()} contains the original {@link java.io.IOException} or {@code java.net.MalformedURLException}.
        **/
       public static final Appendable append(Appendable ap_bl, String s_httpUrl, String s_endingString)  {
          try  {
             return  appendX(ap_bl, s_httpUrl, s_endingString);
          }  catch(IOException iox)  {
             throw  new RuntimeException(iox);
          }
       }
       /**
          <P>Append the source-code from a web-page into a <CODE>java.lang.Appendable</CODE>.</P>
    
          <P><I>I got this from {@code <A HREF="http://www.davidreilly.com/java/java_network_programming/">http://www.davidreilly.com/java/java_network_programming/</A>}, item 2.3.</I></P>
    
          @param  ap_bl  May not be {@code null}.
          @param  s_httpUrl  May not be {@code null}, and must be a valid url.
          @param  s_endingString  If non-{@code null}, the web-page's source-code must end with this. May not be empty.
          @see  #get(String, String)
          @see  #append(Appendable, String, String)
        **/
       public static final Appendable appendX(Appendable ap_bl, String s_httpUrl, String s_endingString)  throws MalformedURLException, IOException  {
          if(s_httpUrl == null  ||  s_httpUrl.length() == 0)  {
             throw  new IllegalArgumentException("s_httpUrl (\"" + s_httpUrl + "\") is null or empty.");
          }
          if(s_endingString != null  &&  s_endingString.length() == 0)  {
             throw  new IllegalArgumentException("s_endingString is non-null and empty.");
          }
    
          // Create a URL instance
          URL url = new URL(s_httpUrl);

          // The ending string as a char array (null if there is none)
          char[] ac = (s_endingString == null) ? null : s_endingString.toCharArray();
          int ixEndStr = 0;

          // Open a buffered input stream for efficiency; try-with-resources closes it
          try (InputStream bis = new BufferedInputStream(url.openStream()))  {
             // Repeat until end of file
             while(true)  {
                int iChar = bis.read();

                // Check for EOF
                if (iChar == -1)  {
                   break;
                }

                // Each byte is treated as one character (effectively ISO-8859-1)
                char c = (char)iChar;

                try  {
                   ap_bl.append(c);
                }  catch(NullPointerException npx)  {
                   throw  new NullPointerException("ap_bl");
                }

                if(ac != null)  {
                   if(c == ac[ixEndStr])  {
                      //The character just retrieved is equal to the
                      //next character in the ending string.

                      if(ixEndStr == (ac.length - 1))  {
                         //The entire string has been found. Done.
                         return ap_bl;
                      }

                      ixEndStr++;
                   }  else  {
                      //Mismatch: restart, counting this character if it
                      //matches the start of the ending string.
                      ixEndStr = (c == ac[0]) ? 1 : 0;
                   }
                }
             }
          }

          if(s_endingString != null)  {
             //Should have exited at the "return" above.
             throw  new EOFException("s_endingString (\"" + s_endingString + "\") is non-null, and was not found in the web-page's source-code.");
          }
          return  ap_bl;
       }
    }
    
  3. # Answer 3

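    You can use URLConnection or Apache's HttpClient to fetch all of the HTML from the website. The following question explains how to do this: Get html file Java

    Of course, this will not give you text that is inside binary files (e.g. Flash files), images and so on; for those, only OCR will work.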
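
    As a rough sketch of the URLConnection route, something along these lines should work. The class name `PageFetcher`, the 5-second timeouts and the UTF-8 assumption are illustrative choices of mine, not part of the original answer; a real page may declare a different charset in its headers.

    ```java
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.io.Reader;
    import java.net.URL;
    import java.net.URLConnection;
    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    // Hypothetical helper class, named here only for illustration.
    public class PageFetcher {

        /** Reads the whole stream into a String using the given charset, then closes it. */
        public static String readAll(InputStream in, Charset charset) throws IOException {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[2048];
            try (Reader r = new InputStreamReader(in, charset)) {
                int n;
                while ((n = r.read(buf)) != -1) {
                    sb.append(buf, 0, n);
                }
            }
            return sb.toString();
        }

        /** Fetches the raw HTML behind pageUrl. UTF-8 is assumed, not detected. */
        public static String fetch(String pageUrl) throws IOException {
            URLConnection conn = new URL(pageUrl).openConnection();
            conn.setConnectTimeout(5000); // milliseconds, arbitrary choice
            conn.setReadTimeout(5000);    // milliseconds, arbitrary choice
            return readAll(conn.getInputStream(), StandardCharsets.UTF_8);
        }

        public static void main(String[] args) throws IOException {
            if (args.length == 0) {
                System.out.println("usage: java PageFetcher <url>");
                return;
            }
            System.out.println(fetch(args[0]));
        }
    }
    ```

    Note that this returns the HTML markup as served, not the text the user sees, so the tags would still have to be stripped afterwards, and anything the page builds with scripts after loading will be missing.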

  4. # Answer 5

    The most general solution is a traffic sniffer.