<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
<!--*** This is a generated file.  Do not edit.  ***-->
<link rel="stylesheet" href="skin/tigris.css" type="text/css">
<link rel="stylesheet" href="skin/mysite.css" type="text/css">
<link rel="stylesheet" href="skin/site.css" type="text/css">
<link media="print" rel="stylesheet" href="skin/print.css" type="text/css">
<title>POI Text Extraction</title>
</head>
<body bgcolor="white" class="composite">
<!--================= start Banner ==================-->
<div id="banner">
<table width="100%" cellpadding="8" cellspacing="0" summary="banner" border="0">
<tbody>
<tr>
<!--================= start Group Logo ==================-->
<td align="left">
<div class="groupLogo">
<a href="http://poi.apache.org"><img border="0" class="logoImage" alt="Apache POI" src="resources/images/group-logo.jpg"></a>
</div>
</td>
<!--================= end Group Logo ==================-->
<!--================= start Project Logo ==================--><td align="right">
<div class="projectLogo">
<a href="http://poi.apache.org/"><img border="0" class="logoImage" alt="POI" src="resources/images/project-logo.jpg"></a>
</div>
</td>
<!--================= end Project Logo ==================-->
</tr>
</tbody>
</table>
</div>
<!--================= end Banner ==================-->
<!--================= start Main ==================-->
<table width="100%" cellpadding="0" cellspacing="0" border="0" summary="nav" id="breadcrumbs">
<tbody>
<!--================= start Status ==================-->
<tr class="status">
<td>
<!--================= start BreadCrumb ==================--><a href="http://www.apache.org/">Apache</a> | <a href="http://poi.apache.org/">POI</a><a href=""></a>
<!--================= end BreadCrumb ==================--></td><td id="tabs">
<!--================= start Tabs ==================-->
<div class="tab">
<span class="selectedTab"><a class="base-selected" href="index.html">Home</a></span> | <script language="Javascript" type="text/javascript">
function printit() {  
if (window.print) {
    window.print() ;  
} else {
    var WebBrowser = '<OBJECT ID="WebBrowser1" WIDTH="0" HEIGHT="0" CLASSID="CLSID:8856F961-340A-11D0-A96B-00C04FD705A2"></OBJECT>';
document.body.insertAdjacentHTML('beforeEnd', WebBrowser);
    WebBrowser1.ExecWB(6, 2); // Use a 1 vs. a 2 for a prompting dialog box
    WebBrowser1.outerHTML = "";
}
}
</script><script language="Javascript" type="text/javascript">
var NS = (navigator.appName == "Netscape");
var VERSION = parseInt(navigator.appVersion);
if (VERSION > 3) {
    document.write('  <a title="PRINT this page OUT" href="javascript:printit()">PRINT</a>');
}
</script> | <a title="PDF file of this page" href="text-extraction.pdf">PDF</a>
</div>
<!--================= end Tabs ==================-->
</td>
</tr>
</tbody>
</table>
<!--================= end Status ==================-->
<table id="main" width="100%" cellpadding="8" cellspacing="0" summary="" border="0">
<tbody>
<tr valign="top">
<!--================= start Menu ==================-->
<td id="leftcol">
<div id="navcolumn">
<div class="menuBar">
<div class="menu">
<span class="menuLabel">Apache POI</span>
        
<div class="menuItem">
<a href="index.html">TOP</a>
</div>
    
</div>
<div class="menu">
<span class="menuLabel">Marketing</span>
        
<div class="menuItem">
<a href="casestudies.html">Case Studies</a>
</div>
    
</div>
<div class="menu">
<span class="menuLabel">Project</span>
        
<div class="menuItem">
<a href="overview.html">Overview</a>
</div>
        
<div class="menuItem">
<a href="poifs/index.html">POIFS</a>
</div>
        
<div class="menuItem">
<a href="hssf/index.html">HSSF</a>
</div>
        
<div class="menuItem">
<a href="hwpf/index.html">HWPF</a>
</div>
        
<div class="menuItem">
<a href="hpsf/index.html">HPSF</a>
</div>
        
<div class="menuItem">
<a href="hslf/index.html">HSLF</a>
</div>
        
<div class="menuItem">
<a href="hsmf/index.html">HSMF</a>
</div>
        
<div class="menuItem">
<a href="hdgf/index.html">HDGF</a>
</div>
		
<div class="menuItem">
<a href="poi-ruby.html">POI-Ruby</a>
</div>
        
<div class="menuItem">
<a href="utils/index.html">POI-Utils</a>
</div>
        
<div class="menuItem">
<span class="menuSelected">Text Extraction</span>
</div>
        
<div class="menuItem">
<a href="http://www.apache.org/dyn/closer.cgi/poi/">Download</a>
</div>
    
</div>
<div class="menu">
<span class="menuLabel">Community</span>
        
<div class="menuItem">
<a href="news.html">News</a>
</div>
        
<div class="menuItem">
<a href="mirrors.html">Mirrors</a>
</div>
        
<div class="menuItem">
<a href="changes.html">Changes</a>
</div>
        
<div class="menuItem">
<a href="todo.html">To Do</a>
</div>
        
<div class="menuItem">
<a href="getinvolved/index.html">Get Involved</a>
</div>
        
<div class="menuItem">
<a href="mailinglists.html">Mailing Lists</a>
</div>
        
<div class="menuItem">
<a href="plan/POI20Vision.html">Vision</a>
</div>
        
<div class="menuItem">
<a href="historyandfuture.html">History and Future</a>
</div>
        
<div class="menuItem">
<a href="who.html">Who We Are</a>
</div>
        
<div class="menuItem">
<a href="resolutions/index.html">Resolutions</a>
</div>
        
<div class="menuItem">
<a href="http://www.apache.org/foundation/thanks.html">Sponsors</a>
</div>
    
</div>
<div class="menu">
<span class="menuLabel">Docs</span>
        
<div class="menuItem">
<a href="apidocs/index.html">Javadocs</a>
</div>
        
<div class="menuItem">
<a href="faq.html">FAQ</a>
</div>
        
<div class="menuItem">
<a href="legal.html">Legal</a>
</div>
        
<div class="menuItem">
<a href="references/index.html">References</a>
</div>
        
<div class="menuItem">
<a href="howtobuild.html">How to Build</a>
</div>
    
</div>
<div class="menu">
<span class="menuLabel">Translations</span>
        
<div class="menuItem">
<a href="trans/index.html">Index</a>
</div>
        
<div class="menuItem">
<a href="trans/guidelines.html">Guidelines</a>
</div>
        
<div class="menuItem">
<a href="trans/de/index.html">German (DE)</a>
</div>
        
<div class="menuItem">
<a href="trans/es/index.html">Spanish (ES)</a>
</div>
        
<div class="menuItem">
<a href="http://jakarta.terra-intl.com/poi/">Japanese (Web)</a>
</div>
        
<div class="menuItem">
<a href="http://jakarta.apache-korea.org/poi/">Korean (Web)</a>
</div>
    
</div>
<div class="menu">
<span class="menuLabel">Code</span>
        
<div class="menuItem">
<a href="subversion.html">Subversion (current source code)</a>
</div>
        
<div class="menuItem">
<a href="http://issues.apache.org/bugzilla/buglist.cgi?votes=1&product=POI&order=bugs.votes">Top Voted Bugs</a>
</div>
        
<div class="menuItem">
<a href="http://issues.apache.org/bugzilla/buglist.cgi?product=POI">Bug Database</a>
</div>
        
<div class="menuItem">
<a href="http://issues.apache.org/bugzilla/buglist.cgi?product=POI&short_desc=%5BPATCH%5D&short_desc_type=allwordssubstr">Patches</a>
</div>
        
<div class="menuItem">
<a href="junit/index.html">Junit Test Results</a>
</div>
        
<div class="menuItem">
<a href="jdepend/index.html">Dependency Metrics</a>
</div>
        
    
</div>
</div>
</div>
<form target="_blank" action="http://www.google.com/search" method="get">
<table summary="search" border="0" cellspacing="0" cellpadding="0">
<tr>
<td><img height="1" width="1" alt="" src="skin/images/spacer.gif" class="spacer"></td><td nowrap="nowrap"><input value="poi.apache.org" name="sitesearch" type="hidden"><input size="10" name="q" id="query" type="text"><img height="1" width="5" alt="" src="skin/images/spacer.gif" class="spacer"><input name="Search" value="GO" type="submit">
<br>
                          Search poi</td><td><img height="1" width="1" alt="" src="skin/images/spacer.gif" class="spacer"></td>
</tr>
<tr>
<td colspan="3"><img height="7" width="1" alt="" src="skin/images/spacer.gif" class="spacer"></td>
</tr>
<tr>
<td class="bottom-left-thick"></td><td bgcolor="#a5b6c6"><img height="1" width="1" alt="" src="skin/images/spacer.gif" class="spacer"></td><td class="bottom-right-thick"></td>
</tr>
</table>
</form>
</td>
<!--================= end Menu ==================-->
<!--================= start Content ==================--><td>
<div id="bodycol">
<div class="app">
<div align="center">
<h1>POI Text Extraction</h1>
</div>
<div class="h3">
  
  
  
    
<div class="h3">
<h3>Overview</h3>
</div>
      
<p>POI provides text extraction for all the supported file
       formats. In addition, it provides access to the metadata
       associated with a given file, such as title and author.</p>
      
<p>As well as providing direct text extraction classes,
       POI works closely with the 
       <a href="http://incubator.apache.org/tika/">Apache Tika</a>
       text extraction library. Users may prefer simply to use 
       the functionality provided by Tika.</p>
    

    
<div class="h3">
<h3>Common functionality</h3>
</div>
     
<p>All of the POI text extractors extend
      <em>org.apache.poi.POITextExtractor</em>. This provides a common
      method across all extractors, getText(). In many cases, the text
      returned will be all you need. However, many extractors also provide
      more targeted text extraction methods, which you may wish to use
      in some cases.</p>
     
<p>All POIFS / OLE 2 based text extractors also extend
      <em>org.apache.poi.POIOLE2TextExtractor</em>. This additionally
      provides common methods to get at the <a href="hpsf/index.html">HPSF
      document metadata</a>.</p>
     
<p>All OOXML based text extractors (available in POI 3.5 and later) 
      also extend
      <em>org.apache.poi.POIOOXMLTextExtractor</em>. This additionally
      provides common methods to get at the OOXML metadata.</p>
    

    
<div class="h3">
<h3>Text Extractor Factory - POI 3.5 or later</h3>
</div>
     
<p>A new class in POI 3.5, 
      <em>org.apache.poi.extractor.ExtractorFactory</em>, provides a
      similar function to WorkbookFactory. You simply pass it an
      InputStream, a File, a POIFSFileSystem or an OOXML Package. It
      works out the correct text extractor for you, and returns it.</p>
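<p>As a rough illustration of how a factory of this kind can choose between the OLE 2 and OOXML families, the sketch below inspects the first bytes of a file. It uses only the JDK and is not POI's implementation: the class <em>FormatSniffer</em>, its <em>Container</em> enum and the <em>sniff</em> method are hypothetical names introduced here. The signatures are the standard ones: an OLE 2 compound document begins with the eight bytes D0 CF 11 E0 A1 B1 1A E1, and an OOXML file is a ZIP archive, which begins with 50 4B 03 04.</p>

```java
public class FormatSniffer {
    /** Container kinds this sketch can tell apart (hypothetical names). */
    public enum Container { OLE2, ZIP_BASED, UNKNOWN }

    // First eight bytes of every OLE 2 compound document (.xls, .doc, .ppt, .vsd).
    private static final int[] OLE2_MAGIC =
        {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};

    // Local file header signature at the start of a ZIP archive ("PK" 03 04).
    private static final int[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    /** Classifies a file from its leading bytes; extra bytes are ignored. */
    public static Container sniff(byte[] firstBytes) {
        if (startsWith(firstBytes, OLE2_MAGIC)) {
            return Container.OLE2;
        }
        if (startsWith(firstBytes, ZIP_MAGIC)) {
            return Container.ZIP_BASED;
        }
        return Container.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, int[] magic) {
        if (data == null || data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            // Mask to compare the byte as an unsigned value.
            if ((data[i] & 0xFF) != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] zipLike = {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00};
        System.out.println(sniff(zipLike)); // prints ZIP_BASED
    }
}
```

<p>A ZIP signature only shows that a file may be OOXML. A real factory still has to open the package to tell .xlsx, .docx and .pptx apart, and to reject ZIP files that are not Office documents at all.</p>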
    

    
<div class="h3">
<h3>Excel</h3>
</div>
     
<p>For .xls files, there is 
      <em>org.apache.poi.hssf.extractor.ExcelExtractor</em>, which will 
      return text, optionally with formulas instead of their results. 
      Those using POI 3.5 or later can also use 
      <em>org.apache.poi.xssf.extractor.XSSFExcelExtractor</em> to perform
      a similar task for .xlsx files.</p>
     
<p>In addition, there is a second text extractor for .xls files,
      <em>org.apache.poi.hssf.extractor.EventBasedExcelExtractor</em>. This
      is based on the streaming EventUserModel code, and will generally
      deliver a lower memory footprint for extraction. However, it will
      have problems correctly outputting more complex formulas, as it 
      works with records as they pass, and so doesn't have access to all
      parts of complex and shared formulas.</p>
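<p>The trade-off described above can be pictured with a toy model. The sketch below is not POI code and does not follow the real Excel record layout: <em>StreamingVsInMemory</em>, <em>Rec</em> and both methods are hypothetical names, and it simply assumes that the text of a shared formula may arrive after a cell that refers to it. A single pass that emits each cell as it is seen cannot fill in what it has not read yet, whereas code that holds every record in memory can.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StreamingVsInMemory {
    /** Toy record: a shared formula definition, or a cell that refers to one. */
    public static final class Rec {
        final String cell;     // cell reference, or null for a definition record
        final int sharedId;    // which shared formula this record is about
        final String formula;  // formula text, or null for a cell record

        private Rec(String cell, int sharedId, String formula) {
            this.cell = cell;
            this.sharedId = sharedId;
            this.formula = formula;
        }

        public static Rec cell(String ref, int sharedId) {
            return new Rec(ref, sharedId, null);
        }

        public static Rec definition(int sharedId, String formula) {
            return new Rec(null, sharedId, formula);
        }
    }

    /** One pass: each cell is emitted immediately, using only what was seen so far. */
    public static List<String> streaming(List<Rec> records) {
        Map<Integer, String> seen = new HashMap<>();
        List<String> out = new ArrayList<>();
        for (Rec r : records) {
            if (r.cell == null) {
                seen.put(r.sharedId, r.formula);
            } else {
                String f = seen.get(r.sharedId);
                out.add(r.cell + "=" + (f != null ? f : "?"));
            }
        }
        return out;
    }

    /** Two passes over a fully loaded list: collect definitions, then emit cells. */
    public static List<String> inMemory(List<Rec> records) {
        Map<Integer, String> defs = new HashMap<>();
        for (Rec r : records) {
            if (r.cell == null) {
                defs.put(r.sharedId, r.formula);
            }
        }
        List<String> out = new ArrayList<>();
        for (Rec r : records) {
            if (r.cell != null) {
                String f = defs.get(r.sharedId);
                out.add(r.cell + "=" + (f != null ? f : "?"));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Rec> records = Arrays.asList(
            Rec.cell("A1", 0), Rec.definition(0, "B1*2"), Rec.cell("A2", 0));
        System.out.println(streaming(records)); // prints [A1=?, A2=B1*2]
        System.out.println(inMemory(records));  // prints [A1=B1*2, A2=B1*2]
    }
}
```

<p>The streaming version keeps only the definitions it has passed, which is why its memory use stays low; the price is the unresolved first cell.</p>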
    

    
<div class="h3">
<h3>Word</h3>
</div>
     
<p>For .doc files, in scratchpad there is 
      <em>org.apache.poi.hwpf.extractor.WordExtractor</em>, which will 
      return text for your document. Those using POI 3.5 or later can also use 
      <em>org.apache.poi.xwpf.extractor.XWPFWordExtractor</em> to perform
      a similar task for .docx files.</p>
    

    
<div class="h3">
<h3>PowerPoint</h3>
</div>
     
<p>For .ppt files, in scratchpad there is 
      <em>org.apache.poi.hslf.extractor.PowerPointExtractor</em>, which 
      will return text for your slideshow, optionally restricted to just
      the slide text or the notes text. Those using POI 3.5 or later can also use 
      <em>org.apache.poi.xslf.extractor.XSLFPowerPointExtractor</em> to 
      perform a similar task for .pptx files.</p>
    

    
<div class="h3">
<h3>Visio</h3>
</div>
     
<p>For .vsd files, in scratchpad there is 
      <em>org.apache.poi.hdgf.extractor.VisioTextExtractor</em>, which 
      will return text for your file.</p>
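<p>The file types and extractor classes named on this page can be collected into a small lookup table. The sketch below uses only the JDK and holds the class names as plain strings, so it runs without POI on the classpath. <em>ExtractorCatalog</em> and <em>extractorFor</em> are hypothetical names, and choosing by file extension is a simplification made here for illustration; it is not how POI selects an extractor.</p>

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class ExtractorCatalog {
    // File extension (lower case, no dot) mapped to the extractor class name.
    private static final Map<String, String> BY_EXTENSION = new LinkedHashMap<>();

    static {
        BY_EXTENSION.put("xls", "org.apache.poi.hssf.extractor.ExcelExtractor");
        BY_EXTENSION.put("xlsx", "org.apache.poi.xssf.extractor.XSSFExcelExtractor");
        BY_EXTENSION.put("doc", "org.apache.poi.hwpf.extractor.WordExtractor");
        BY_EXTENSION.put("docx", "org.apache.poi.xwpf.extractor.XWPFWordExtractor");
        BY_EXTENSION.put("ppt", "org.apache.poi.hslf.extractor.PowerPointExtractor");
        BY_EXTENSION.put("pptx", "org.apache.poi.xslf.extractor.XSLFPowerPointExtractor");
        BY_EXTENSION.put("vsd", "org.apache.poi.hdgf.extractor.VisioTextExtractor");
    }

    /** Returns the extractor class name for a file name, or null if none is listed. */
    public static String extractorFor(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return null; // no extension to look up
        }
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return BY_EXTENSION.get(extension);
    }

    public static void main(String[] args) {
        System.out.println(extractorFor("budget.xls"));
        System.out.println(extractorFor("diagram.VSD"));
    }
}
```

<p>The .doc, .ppt and .vsd entries live in scratchpad, and the .xlsx, .docx and .pptx entries need POI 3.5 or later.</p>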
    
  

  

<div id="authors" align="right">by&nbsp;Nick Burch</div>
</div>
</div>
</div>
</td>
<!--================= end Content ==================-->
</tr>
</tbody>
</table>
<!--================= end Main ==================-->
<!--================= start Footer ==================-->
<div id="footer">
<table summary="footer" cellspacing="0" cellpadding="4" width="100%" border="0">
<tbody>
<tr>
<!--================= start Copyright ==================-->
<td colspan="2">
<div align="center">
<div class="copyright">
              Copyright &copy; 2002-2007&nbsp;The Apache Software Foundation. All rights reserved.
            </div>
</div>
</td>
<!--================= end Copyright ==================-->
</tr>
<tr>
<td align="left">
<!--================= start Host ==================-->
<!--================= end Host ==================--></td><td align="right">
<!--================= start Credits ==================-->
<div align="right">
<div class="credit"></div>
</div>
<!--================= end Credits ==================-->
</td>
</tr>
</tbody>
</table>
</div>
<!--================= end Footer ==================-->
</body>
</html>