org.alfresco.repo.content.metadata
Class TikaPoweredMetadataExtracter

java.lang.Object
  extended by org.alfresco.repo.content.metadata.AbstractMappingMetadataExtracter
      extended by org.alfresco.repo.content.metadata.TikaPoweredMetadataExtracter
All Implemented Interfaces:
ContentWorker, MetadataExtracter
Direct Known Subclasses:
DWGMetadataExtracter, MailMetadataExtracter, MP3MetadataExtracter, OfficeMetadataExtracter, OpenDocumentMetadataExtracter, PdfBoxMetadataExtracter, PoiMetadataExtracter, TikaAutoMetadataExtracter, TikaSpringConfiguredMetadataExtracter

public abstract class TikaPoweredMetadataExtracter
extends AbstractMappingMetadataExtracter

The parent of all metadata extractors that use Apache Tika under the hood. This class handles the common parts of processing the files, along with the common mappings listed below. Individual extractors extend this class to add their own custom mappings.

   author:                 --      cm:author
   title:                  --      cm:title
   subject:                --      cm:description
   created:                --      cm:created
   comments:
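
As a rough illustration of the table above, the following plain-Java sketch models the default mapping as a map from raw key to repository property. It is not the class's actual implementation: `DefaultMappingSketch` and `applyMapping` are hypothetical names, values are simplified to strings, and the raw key names are taken directly from the table. The `comments` key appears in the table without a target, so the sketch assumes it is left unmapped.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: a plain-Java model of the mapping table, not Alfresco code.
class DefaultMappingSketch {
    // Raw key -> repository property, as listed in the class description.
    // "comments" has no target in the table, so it is deliberately absent here.
    static final Map<String, String> DEFAULT_MAPPING = new LinkedHashMap<>();
    static {
        DEFAULT_MAPPING.put("author", "cm:author");
        DEFAULT_MAPPING.put("title", "cm:title");
        DEFAULT_MAPPING.put("subject", "cm:description");
        DEFAULT_MAPPING.put("created", "cm:created");
    }

    // Applies the mapping to raw values; raw keys without a target are dropped.
    static Map<String, String> applyMapping(Map<String, String> raw) {
        Map<String, String> mapped = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : raw.entrySet()) {
            String target = DEFAULT_MAPPING.get(entry.getKey());
            if (target != null) {
                mapped.put(target, entry.getValue());
            }
        }
        return mapped;
    }
}
```

In this sketch a raw `subject` value ends up under `cm:description`, while a raw `comments` value is extracted but not stored anywhere unless an installation configures a mapping for it.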
 


Nested Class Summary
protected static class TikaPoweredMetadataExtracter.HeadContentHandler
          This content handler will capture entries from within the header of the Tika content XHTML, but ignore the rest.
protected static class TikaPoweredMetadataExtracter.MapCaptureContentHandler
          This content handler will grab all tags and attributes, and record the textual content of the last seen one of them.
protected static class TikaPoweredMetadataExtracter.NullContentHandler
          A content handler that ignores all the content it finds.
 
Nested classes/interfaces inherited from interface org.alfresco.repo.content.metadata.MetadataExtracter
MetadataExtracter.OverwritePolicy
 
Field Summary
protected static java.lang.String KEY_AUTHOR
           
protected static java.lang.String KEY_COMMENTS
           
protected static java.lang.String KEY_CREATED
           
protected static java.lang.String KEY_DESCRIPTION
           
protected static java.lang.String KEY_SUBJECT
           
protected static java.lang.String KEY_TITLE
           
protected static org.apache.commons.logging.Log logger
           
 
Fields inherited from class org.alfresco.repo.content.metadata.AbstractMappingMetadataExtracter
NAMESPACE_PROPERTY_PREFIX, supportedDateFormats
 
Constructor Summary
TikaPoweredMetadataExtracter(java.util.ArrayList supportedMimeTypes)
           
TikaPoweredMetadataExtracter(java.util.HashSet supportedMimeTypes)
           
 
Method Summary
protected static java.util.ArrayList buildSupportedMimetypes(java.lang.String[] explicitTypes, org.apache.tika.parser.Parser tikaParser)
          Builds up a list of supported MIME types by merging an explicit list with any that Tika also claims to support.
protected  java.util.Map extractRaw(org.alfresco.service.cmr.repository.ContentReader reader)
          Override to provide the raw extracted metadata values.
protected  java.util.Map extractSpecific(org.apache.tika.metadata.Metadata metadata, java.util.Map properties, java.util.Map headers)
          Allows implementation specific mappings to be done.
protected abstract  org.apache.tika.parser.Parser getParser()
          Returns the correct Tika Parser to process the document.
protected  java.util.Date makeDate(java.lang.String dateStr)
          Version of makeDate which also tries, in order, the ISO-8601 formats and similar formats that Tika makes use of.
protected  boolean needHeaderContents()
          Indicates whether the contents of the extracted header are needed, or whether no content is needed at all.
 
Methods inherited from class org.alfresco.repo.content.metadata.AbstractMappingMetadataExtracter
checkIsSupported, extract, extract, extract, filterSystemProperties, getDefaultMapping, getExtractionTime, getMapping, getMimetypeService, getReliability, init, isSupported, newRawMap, putRawValue, readMappingProperties, readMappingProperties, register, setDictionaryService, setFailOnTypeConversion, setInheritDefaultMapping, setMapping, setMappingProperties, setMimetypeService, setOverwritePolicy, setOverwritePolicy, setRegistry, setSupportedDateFormats, setSupportedMimetypes
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

logger

protected static org.apache.commons.logging.Log logger

KEY_AUTHOR

protected static final java.lang.String KEY_AUTHOR
See Also:
Constant Field Values

KEY_TITLE

protected static final java.lang.String KEY_TITLE
See Also:
Constant Field Values

KEY_SUBJECT

protected static final java.lang.String KEY_SUBJECT
See Also:
Constant Field Values

KEY_CREATED

protected static final java.lang.String KEY_CREATED
See Also:
Constant Field Values

KEY_DESCRIPTION

protected static final java.lang.String KEY_DESCRIPTION
See Also:
Constant Field Values

KEY_COMMENTS

protected static final java.lang.String KEY_COMMENTS
See Also:
Constant Field Values
Constructor Detail

TikaPoweredMetadataExtracter

public TikaPoweredMetadataExtracter(java.util.ArrayList supportedMimeTypes)

TikaPoweredMetadataExtracter

public TikaPoweredMetadataExtracter(java.util.HashSet supportedMimeTypes)
Method Detail

buildSupportedMimetypes

protected static java.util.ArrayList buildSupportedMimetypes(java.lang.String[] explicitTypes,
                                                             org.apache.tika.parser.Parser tikaParser)
Builds up a list of supported MIME types by merging an explicit list with any that Tika also claims to support.
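
The merge described here can be sketched in plain Java as follows. This is a minimal illustration under stated assumptions, not the method's actual body: `MimetypeMergeSketch` and `merge` are hypothetical names, and the parser's claimed types are passed in as a list of strings instead of being read from a real `org.apache.tika.parser.Parser`. The sketch also assumes that explicit types come first and that duplicates are skipped.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: models the merge with plain strings instead of a Tika Parser.
class MimetypeMergeSketch {
    // explicitTypes: types the extracter declares itself.
    // parserTypes: stand-in for the types the parser claims to support (may be null).
    static ArrayList<String> merge(String[] explicitTypes, List<String> parserTypes) {
        ArrayList<String> types = new ArrayList<>();
        // Explicit types go in first, preserving their order.
        for (String type : explicitTypes) {
            if (!types.contains(type)) {
                types.add(type);
            }
        }
        // Parser-claimed types are appended only if not already present.
        if (parserTypes != null) {
            for (String type : parserTypes) {
                if (!types.contains(type)) {
                    types.add(type);
                }
            }
        }
        return types;
    }
}
```

With a real Tika parser, the second argument would presumably be derived from `Parser.getSupportedTypes(ParseContext)`, which returns a set of `MediaType` values.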


makeDate

protected java.util.Date makeDate(java.lang.String dateStr)
Version of makeDate which also tries, in order, the ISO-8601 formats and similar formats that Tika makes use of.

Overrides:
makeDate in class AbstractMappingMetadataExtracter
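
To illustrate what trying formats "in order" can look like, here is a minimal self-contained sketch using `java.text.SimpleDateFormat`. The class name `IsoDateSketch` and the three patterns are assumptions chosen for illustration; the actual list of formats used by this class is not given in this documentation. The sketch returns `null` when nothing matches, standing in for whatever fallback the overridden parent method provides.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Illustrative only: the patterns below are assumed, not taken from Alfresco.
class IsoDateSketch {
    // Most specific pattern first: SimpleDateFormat.parse(String) may stop before
    // the end of the input, so "yyyy-MM-dd" would also accept a full timestamp.
    private static final String[] PATTERNS = {
        "yyyy-MM-dd'T'HH:mm:ss'Z'", // UTC timestamp with a literal Z
        "yyyy-MM-dd'T'HH:mm:ssZ",   // timestamp with a numeric offset such as +0100
        "yyyy-MM-dd"                // date only, interpreted as midnight UTC
    };

    // Returns the first successful parse, or null if no pattern matches.
    static Date makeDate(String dateStr) {
        for (String pattern : PATTERNS) {
            SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ROOT);
            format.setLenient(false);
            format.setTimeZone(TimeZone.getTimeZone("UTC"));
            try {
                return format.parse(dateStr);
            } catch (ParseException e) {
                // This pattern did not match; fall through to the next one.
            }
        }
        return null;
    }
}
```

The ordering matters in this sketch: if the date-only pattern came first, a full timestamp would be accepted by it and its time component silently discarded.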

getParser

protected abstract org.apache.tika.parser.Parser getParser()
Returns the correct Tika Parser to process the document. If you don't know which you want, use TikaAutoMetadataExtracter which makes use of the Tika auto-detection.


needHeaderContents

protected boolean needHeaderContents()
Indicates whether the contents of the extracted header are needed, or whether no content is needed at all.


extractSpecific

protected java.util.Map extractSpecific(org.apache.tika.metadata.Metadata metadata,
                                        java.util.Map properties,
                                        java.util.Map headers)
Allows implementation specific mappings to be done.
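
The relationship between needHeaderContents and extractSpecific can be pictured as a template-method pattern. The sketch below models it with hypothetical stand-in types only: `HookSketchBase`, `HookSketchCustom`, the `extract` method, the `generator` header key, and the use of string maps in place of `org.apache.tika.metadata.Metadata` are all assumptions made for illustration. The default return values shown for the two hooks are also assumptions; this documentation does not state them.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the base class; models only the override points.
abstract class HookSketchBase {
    // Stand-in for needHeaderContents(); returning false by default is an assumption.
    protected boolean needHeaderContents() {
        return false;
    }

    // Stand-in for extractSpecific(); this default returns the properties unchanged.
    protected Map<String, String> extractSpecific(Map<String, String> metadata,
                                                  Map<String, String> properties,
                                                  Map<String, String> headers) {
        return properties;
    }

    // Stand-in for the shared flow: common keys first, then the subclass hook.
    final Map<String, String> extract(Map<String, String> metadata,
                                      Map<String, String> headers) {
        Map<String, String> properties = new HashMap<>();
        for (String key : new String[] {"author", "title", "subject", "created"}) {
            if (metadata.containsKey(key)) {
                properties.put(key, metadata.get(key));
            }
        }
        // Headers are handed to the hook only when the subclass asked for them.
        return extractSpecific(metadata, properties, needHeaderContents() ? headers : null);
    }
}

// Hypothetical subclass that adds one format-specific value from the headers.
class HookSketchCustom extends HookSketchBase {
    @Override
    protected boolean needHeaderContents() {
        return true;
    }

    @Override
    protected Map<String, String> extractSpecific(Map<String, String> metadata,
                                                  Map<String, String> properties,
                                                  Map<String, String> headers) {
        if (headers != null && headers.containsKey("generator")) {
            properties.put("generator", headers.get("generator"));
        }
        return properties;
    }
}
```

In this model, a subclass that does not override either hook gets only the common properties, while `HookSketchCustom` receives the header map and can add to the result.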


extractRaw

protected java.util.Map extractRaw(org.alfresco.service.cmr.repository.ContentReader reader)
                            throws java.lang.Throwable
Description copied from class: AbstractMappingMetadataExtracter
Override to provide the raw extracted metadata values. An extracter should extract as many of the available properties as is realistically possible. Even if the default mapping doesn't handle all properties, it is possible for each instance of the extracter to be configured differently and more or less of the properties may be used in different installations.

Raw values must not be trimmed or removed for any reason. Null values and empty strings are

Properties extracted and their meanings and types should be thoroughly described in the class-level javadocs of the extracter implementation, for example:

 editor: - the document editor        -->  cm:author
 title:  - the document title         -->  cm:title
 user1:  - the document summary
 user2:  - the document description   -->  cm:description
 user3:  -
 user4:  -
 

Specified by:
extractRaw in class AbstractMappingMetadataExtracter
Parameters:
reader - the document to extract the values from. The stream provided by the reader must be closed if accessed directly.
Returns:
Returns a map of document property values keyed by property name.
Throws:
java.lang.Throwable
See Also:
AbstractMappingMetadataExtracter.getDefaultMapping()


Copyright © 2005 - 2010 Alfresco Software, Inc. All Rights Reserved.