org.alfresco.repo.content.metadata
Class AbstractMappingMetadataExtracter

java.lang.Object
  extended by org.alfresco.repo.content.metadata.AbstractMappingMetadataExtracter
All Implemented Interfaces:
ContentWorker, MetadataExtracter
Direct Known Subclasses:
HtmlMetadataExtracter, MappingMetadataExtracterTest.DummyMappingMetadataExtracter, OpenOfficeMetadataExtracter, RFC822MetadataExtracter, TikaPoweredMetadataExtracter, XmlMetadataExtracter, XPathMetadataExtracter

public abstract class AbstractMappingMetadataExtracter
extends java.lang.Object
implements MetadataExtracter

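Support class for metadata extracters that support dynamic and config-driven mapping between extracted values and model properties. Extraction is broken up into two phases: the raw values are first extracted from the document (see extractRaw(ContentReader)), and are then mapped onto system properties using the configured mapping (see getDefaultMapping() and setMapping(Map)).

Migrating an existing extracter to use this class is straightforward: implement extractRaw(ContentReader) to return the raw values, and provide a default mapping as described by getDefaultMapping().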

Since:
2.1
See Also:
AbstractMappingMetadataExtracter.getDefaultMapping(), AbstractMappingMetadataExtracter.extractRaw(ContentReader), AbstractMappingMetadataExtracter.setMapping(Map)
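
The two extraction phases can be illustrated with plain collections. This is a minimal sketch and not the Alfresco implementation: the class TwoPhaseSketch and the method mapRawToSystem are hypothetical names, system properties are represented as plain prefixed strings, and overwrite policies and type conversion are left out.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; none of these names exist in Alfresco.
final class TwoPhaseSketch {

    /** Phase 2: copy each raw value onto every system property it is mapped to. */
    static Map<String, Object> mapRawToSystem(Map<String, Object> raw,
                                              Map<String, Set<String>> mapping) {
        Map<String, Object> system = new LinkedHashMap<>();
        for (Map.Entry<String, Object> entry : raw.entrySet()) {
            Set<String> targets = mapping.get(entry.getKey());
            if (targets == null) {
                continue; // extracted but not mapped: ignored in this sketch
            }
            for (String target : targets) {
                system.put(target, entry.getValue());
            }
        }
        return system;
    }

    public static void main(String[] args) {
        // Phase 1 stand-in: pretend these values came from extractRaw(...)
        Map<String, Object> raw = new HashMap<>();
        raw.put("editor", "Jane Doe");
        raw.put("user3", "not mapped");

        Map<String, Set<String>> mapping = new HashMap<>();
        mapping.put("editor", new HashSet<>(Arrays.asList("cm:author", "my:editor")));

        System.out.println(mapRawToSystem(raw, mapping));
    }
}
```

Keeping the raw map separate from the mapping is what allows the same extracter to be configured differently per installation.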

Nested Class Summary
 
Nested classes/interfaces inherited from interface org.alfresco.repo.content.metadata.MetadataExtracter
MetadataExtracter.OverwritePolicy
 
Field Summary
protected static org.apache.commons.logging.Log logger
           
static java.lang.String NAMESPACE_PROPERTY_PREFIX
           
protected  java.util.Set supportedDateFormats
           
 
Constructor Summary
protected AbstractMappingMetadataExtracter()
          Default constructor.
protected AbstractMappingMetadataExtracter(java.util.Set supportedMimetypes)
          Constructor that can be used when the list of supported mimetypes is known up front.
 
Method Summary
protected  void checkIsSupported(org.alfresco.service.cmr.repository.ContentReader reader)
          Checks if the mimetype is supported.
 java.util.Map extract(org.alfresco.service.cmr.repository.ContentReader reader, java.util.Map destination)
          Extracts the metadata values from the content provided by the reader and source mimetype to the supplied map.
 java.util.Map extract(org.alfresco.service.cmr.repository.ContentReader reader, MetadataExtracter.OverwritePolicy overwritePolicy, java.util.Map destination)
          Extracts the metadata values from the content provided by the reader and source mimetype to the supplied map.
 java.util.Map extract(org.alfresco.service.cmr.repository.ContentReader reader, MetadataExtracter.OverwritePolicy overwritePolicy, java.util.Map destination, java.util.Map mapping)
          Extracts the metadata from the content provided by the reader and source mimetype to the supplied map.
protected abstract  java.util.Map extractRaw(org.alfresco.service.cmr.repository.ContentReader reader)
          Override to provide the raw extracted metadata values.
protected  void filterSystemProperties(java.util.Map systemProperties, java.util.Map targetProperties)
          Filters the system properties that are going to be applied.
protected  java.util.Map getDefaultMapping()
          This method provides a best guess of where to store the values extracted from the documents.
 long getExtractionTime()
          Provides an estimate, usually a worst case guess, of how long an extraction will take.
protected  java.util.Map getMapping()
          Helper method for derived classes to obtain the mappings that will be applied to raw values.
protected  org.alfresco.service.cmr.repository.MimetypeService getMimetypeService()
           
 double getReliability(java.lang.String mimetype)
          TODO - This doesn't appear to be used, so should be removed / deprecated / replaced
protected  void init()
          Provides a hook point for implementations to perform initialization.
 boolean isSupported(java.lang.String sourceMimetype)
          Determines if the extracter works against the given mimetype.
protected  java.util.Date makeDate(java.lang.String dateStr)
          Convert a date String to a Date object
protected  java.util.Map newRawMap()
          Helper method to fetch a clean map into which raw values can be dumped.
protected  boolean putRawValue(java.lang.String key, java.io.Serializable value, java.util.Map destination)
          Adds a value to the map, conserving null values.
protected  java.util.Map readMappingProperties(java.util.Properties mappingProperties)
          A utility method to convert mapping properties to the Map form.
protected  java.util.Map readMappingProperties(java.lang.String propertiesUrl)
          A utility method to read mapping properties from a resource file and convert to the map form.
 void register()
          Registers this instance of the extracter with the registry.
 void setDictionaryService(org.alfresco.service.cmr.dictionary.DictionaryService dictionaryService)
           
 void setFailOnTypeConversion(boolean failOnTypeConversion)
          Set whether the extractor should discard metadata that fails to convert to the target type defined in the data dictionary model.
 void setInheritDefaultMapping(boolean inheritDefaultMapping)
          Set if the property mappings augment or override the mapping generically provided by the extracter implementation.
 void setMapping(java.util.Map mapping)
          Set the mapping from document metadata to system metadata.
 void setMappingProperties(java.util.Properties mappingProperties)
          Set the properties that contain the mapping from document metadata to system metadata.
 void setMimetypeService(org.alfresco.service.cmr.repository.MimetypeService mimetypeService)
           
 void setOverwritePolicy(MetadataExtracter.OverwritePolicy overwritePolicy)
          Set the policy to use when existing values are encountered.
 void setOverwritePolicy(java.lang.String overwritePolicyStr)
          Set the policy to use when existing values are encountered.
 void setRegistry(MetadataExtracterRegistry registry)
          Set the registry to register with.
 void setSupportedDateFormats(java.util.List supportedDateFormats)
          Set the date formats, over and above the ISO8601 format, that will be supported for string to date conversions.
 void setSupportedMimetypes(java.util.Collection supportedMimetypes)
          Set the mimetypes that are supported by the extracter.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

NAMESPACE_PROPERTY_PREFIX

public static final java.lang.String NAMESPACE_PROPERTY_PREFIX
See Also:
Constant Field Values

logger

protected static org.apache.commons.logging.Log logger

supportedDateFormats

protected java.util.Set supportedDateFormats
Constructor Detail

AbstractMappingMetadataExtracter

protected AbstractMappingMetadataExtracter()
Default constructor. If this is called, then AbstractMappingMetadataExtracter.isSupported(String) should be implemented. This is useful when the list of supported mimetypes is not known when the instance is constructed. Alternatively, once the set becomes known, call AbstractMappingMetadataExtracter.setSupportedMimetypes(Collection).

See Also:
AbstractMappingMetadataExtracter.isSupported(String), AbstractMappingMetadataExtracter.setSupportedMimetypes(Collection)

AbstractMappingMetadataExtracter

protected AbstractMappingMetadataExtracter(java.util.Set supportedMimetypes)
Constructor that can be used when the list of supported mimetypes is known up front.

Parameters:
supportedMimetypes - the set of mimetypes supported by default
Method Detail

setRegistry

public void setRegistry(MetadataExtracterRegistry registry)
Set the registry to register with. If this is not set, then the default initialization will not auto-register the extracter for general use. It can still be used directly.

Parameters:
registry - a metadata extracter registry

setMimetypeService

public void setMimetypeService(org.alfresco.service.cmr.repository.MimetypeService mimetypeService)
Parameters:
mimetypeService - the mimetype service. Set this if required.

getMimetypeService

protected org.alfresco.service.cmr.repository.MimetypeService getMimetypeService()
Returns:
Returns the mimetype helper

setDictionaryService

public void setDictionaryService(org.alfresco.service.cmr.dictionary.DictionaryService dictionaryService)
Parameters:
dictionaryService - the dictionary service to determine which data conversions are necessary

setSupportedMimetypes

public void setSupportedMimetypes(java.util.Collection supportedMimetypes)
Set the mimetypes that are supported by the extracter.

Parameters:
supportedMimetypes - the mimetypes supported by the extracter

isSupported

public boolean isSupported(java.lang.String sourceMimetype)
Determines if the extracter works against the given mimetype.

Specified by:
isSupported in interface MetadataExtracter
Parameters:
sourceMimetype - the document mimetype
Returns:
Returns true if the mimetype is supported, otherwise false.
See Also:
AbstractMappingMetadataExtracter.setSupportedMimetypes(Collection)

getReliability

public double getReliability(java.lang.String mimetype)
TODO - This doesn't appear to be used, so should be removed / deprecated / replaced

Specified by:
getReliability in interface MetadataExtracter
Parameters:
mimetype - the mimetype to check
Returns:
Returns 1.0 if the mimetype is supported, otherwise 0.0
See Also:
AbstractMappingMetadataExtracter.isSupported(String)

setOverwritePolicy

public void setOverwritePolicy(MetadataExtracter.OverwritePolicy overwritePolicy)
Set the policy to use when existing values are encountered. Depending on how the extracter is called, this may not be relevant, i.e. an empty map of existing properties may be passed in by the client code, which may follow its own overwrite strategy.

Parameters:
overwritePolicy - the policy to apply when there are existing system properties

setOverwritePolicy

public void setOverwritePolicy(java.lang.String overwritePolicyStr)
Set the policy to use when existing values are encountered. Depending on how the extracter is called, this may not be relevant, i.e. an empty map of existing properties may be passed in by the client code, which may follow its own overwrite strategy.

Parameters:
overwritePolicyStr - the policy to apply when there are existing system properties

setFailOnTypeConversion

public void setFailOnTypeConversion(boolean failOnTypeConversion)
Set whether the extracter should fail or discard metadata when a value cannot be converted to the target type defined in the data dictionary model. This is true by default, i.e. if the extracted data is not compatible with the target model then the extraction will fail. If this is false, any extracted data that fails to convert will be discarded.

Parameters:
failOnTypeConversion - false to discard properties that can't get converted to the dictionary-defined type, or true (default) to fail the extraction if the type doesn't convert
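
The difference between the two settings can be sketched as follows. This is an illustration under simplifying assumptions, not the class's actual conversion code: TypeConversionSketch and convertAll are hypothetical names, and the target type is fixed to Integer rather than looked up in the data dictionary.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; the real conversion is driven by the dictionary model.
final class TypeConversionSketch {

    /** Converts every raw string to an Integer, failing or discarding on error. */
    static Map<String, Integer> convertAll(Map<String, String> raw, boolean failOnTypeConversion) {
        Map<String, Integer> converted = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : raw.entrySet()) {
            String value = entry.getValue();
            if (value == null) {
                converted.put(entry.getKey(), null); // nothing to convert
                continue;
            }
            try {
                converted.put(entry.getKey(), Integer.valueOf(value.trim()));
            } catch (NumberFormatException e) {
                if (failOnTypeConversion) {
                    // true: the whole extraction fails
                    throw new IllegalStateException(
                            "Cannot convert property '" + entry.getKey() + "'", e);
                }
                // false: the offending value is silently discarded
            }
        }
        return converted;
    }

    public static void main(String[] args) {
        Map<String, String> raw = new LinkedHashMap<>();
        raw.put("pages", "12");
        raw.put("words", "many");
        System.out.println(convertAll(raw, false));
    }
}
```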

setSupportedDateFormats

public void setSupportedDateFormats(java.util.List supportedDateFormats)
Set the date formats, over and above the ISO8601 format, that will be supported for string to date conversions. The supported syntax is described by the SimpleDateFormat Javadocs.

Parameters:
supportedDateFormats - a list of supported date formats.
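
A string-to-date conversion that tries ISO 8601 first and then falls back to a list of SimpleDateFormat patterns might look like the sketch below. It is an assumption about the general approach, not the code behind makeDate(String): the names DateFormatSketch and parse are hypothetical, the UTC time zone and English locale are fixed only to make the result deterministic, and returning null on failure is a choice made for this example.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.time.Instant;
import java.time.format.DateTimeParseException;
import java.util.Arrays;
import java.util.Date;
import java.util.List;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical illustration only; not the Alfresco implementation.
final class DateFormatSketch {

    /** Returns the parsed date, or null if no format matches. */
    static Date parse(String dateStr, List<String> extraPatterns) {
        try {
            // ISO 8601 instant, for example 2009-12-25T10:15:30Z
            return Date.from(Instant.parse(dateStr));
        } catch (DateTimeParseException e) {
            // not ISO 8601: fall through to the additional patterns
        }
        for (String pattern : extraPatterns) {
            SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ENGLISH);
            format.setLenient(false);
            format.setTimeZone(TimeZone.getTimeZone("UTC"));
            try {
                return format.parse(dateStr);
            } catch (ParseException e) {
                // try the next pattern
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> patterns = Arrays.asList("yyyy-MM-dd", "dd/MM/yyyy");
        System.out.println(parse("25/12/2009", patterns));
    }
}
```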

setInheritDefaultMapping

public void setInheritDefaultMapping(boolean inheritDefaultMapping)
Set if the property mappings augment or override the mapping generically provided by the extracter implementation. The default is false, i.e. any mapping set completely replaces the default mappings.

Parameters:
inheritDefaultMapping - true to add the configured mapping to the list of default mappings.
See Also:
AbstractMappingMetadataExtracter.getDefaultMapping(), AbstractMappingMetadataExtracter.setMapping(Map), AbstractMappingMetadataExtracter.setMappingProperties(Properties)
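
The effect of the flag can be sketched with plain maps. This is a hedged illustration: InheritMappingSketch and effectiveMapping are hypothetical names, and the per-key union used when inheriting is an assumption, since this page does not state how entries for the same document property are combined.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; the merge strategy is an assumption.
final class InheritMappingSketch {

    static Map<String, Set<String>> effectiveMapping(Map<String, Set<String>> defaults,
                                                     Map<String, Set<String>> configured,
                                                     boolean inheritDefaultMapping) {
        Map<String, Set<String>> result = new HashMap<>();
        if (inheritDefaultMapping) {
            // start from a copy of the defaults so they are never modified
            for (Map.Entry<String, Set<String>> entry : defaults.entrySet()) {
                result.put(entry.getKey(), new HashSet<>(entry.getValue()));
            }
        }
        // add the configured mapping; without inheritance this is all there is
        for (Map.Entry<String, Set<String>> entry : configured.entrySet()) {
            result.computeIfAbsent(entry.getKey(), key -> new HashSet<>())
                  .addAll(entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> defaults = new HashMap<>();
        defaults.put("title", new HashSet<>(Arrays.asList("cm:title")));
        Map<String, Set<String>> configured = new HashMap<>();
        configured.put("title", new HashSet<>(Arrays.asList("my:title")));
        System.out.println(effectiveMapping(defaults, configured, true));
        System.out.println(effectiveMapping(defaults, configured, false));
    }
}
```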

setMapping

public void setMapping(java.util.Map mapping)
Set the mapping from document metadata to system metadata. It is possible to direct an extracted document property to several system properties. The conversion between the document property types and the system property types will be done by the default converter.

Parameters:
mapping - a mapping from document metadata to system metadata

setMappingProperties

public void setMappingProperties(java.util.Properties mappingProperties)
Set the properties that contain the mapping from document metadata to system metadata. This is an alternative to the AbstractMappingMetadataExtracter.setMapping(Map) method. Any mappings already present will be cleared out. The property mapping is of the form:
 # Namespaces prefixes
 namespace.prefix.cm=http://www.alfresco.org/model/content/1.0
 namespace.prefix.my=http://www....com/alfresco/1.0
 
 # Mapping
 editor=cm:author, my:editor
 title=cm:title
 user1=cm:summary
 user2=cm:description
 
The mapping can therefore be from a single document property onto several system properties.

Parameters:
mappingProperties - the properties that map document properties to system properties
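
To make the file format concrete, the sketch below turns such properties into a map from document property names to sets of fully qualified names. It is an illustration only: MappingPropertiesSketch and parse are hypothetical names, the prefix string "namespace.prefix." is inferred from the example above rather than read from NAMESPACE_PROPERTY_PREFIX, qualified names are rendered as plain "{uri}localName" strings, and the "my" namespace URI is a placeholder.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Properties;
import java.util.Set;

// Hypothetical illustration only; not the Alfresco parser.
final class MappingPropertiesSketch {

    // Inferred from the example file; the real constant value is not shown on this page.
    static final String NS_PREFIX = "namespace.prefix.";

    static Map<String, Set<String>> parse(Properties props) {
        // First pass: collect the namespace prefix declarations
        Map<String, String> namespaces = new HashMap<>();
        for (String key : props.stringPropertyNames()) {
            if (key.startsWith(NS_PREFIX)) {
                namespaces.put(key.substring(NS_PREFIX.length()), props.getProperty(key).trim());
            }
        }
        // Second pass: resolve each comma-separated target against the prefixes
        Map<String, Set<String>> mapping = new HashMap<>();
        for (String key : props.stringPropertyNames()) {
            if (key.startsWith(NS_PREFIX)) {
                continue;
            }
            Set<String> targets = new LinkedHashSet<>();
            for (String token : props.getProperty(key).split(",")) {
                String name = token.trim();
                if (name.isEmpty()) {
                    continue;
                }
                int colon = name.indexOf(':');
                String uri = colon < 0 ? null : namespaces.get(name.substring(0, colon));
                if (uri == null) {
                    throw new IllegalArgumentException("No namespace declared for: " + name);
                }
                targets.add("{" + uri + "}" + name.substring(colon + 1));
            }
            mapping.put(key, targets);
        }
        return mapping;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("namespace.prefix.cm", "http://www.alfresco.org/model/content/1.0");
        props.setProperty("namespace.prefix.my", "http://example.org/my/1.0"); // placeholder URI
        props.setProperty("editor", "cm:author, my:editor");
        props.setProperty("title", "cm:title");
        System.out.println(parse(props));
    }
}
```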

getMapping

protected final java.util.Map getMapping()
Helper method for derived classes to obtain the mappings that will be applied to raw values. This should be called after initialization in order to guarantee the complete map is given.

Normally, the list of properties that can be extracted from a document is fixed and well-known; in that case, just extract everything. However, some implementations may have an extra, indeterminate set of values available for extraction. If the extraction of these runtime parameters is expensive, then the keys provided by the return value can be used to extract values from the documents. The metadata extraction becomes fully configuration-driven, i.e. declaring further mappings will result in more values being extracted from the documents.

Most extractors will not be using this method. For an example of its use, see the OpenDocument extractor, which uses the mapping to select specific user properties from a document.
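
The configuration-driven approach described above can be sketched as follows. This is a hypothetical illustration, not the OpenDocument extractor's code: KeyDrivenExtractionSketch, extractSelected and the lookup function are invented for the example, and the mapped keys stand in for the key set of the map returned by getMapping().

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical illustration only; not taken from any Alfresco extracter.
final class KeyDrivenExtractionSketch {

    /** Performs the (possibly expensive) lookup only for keys that are actually mapped. */
    static Map<String, String> extractSelected(Set<String> mappedKeys,
                                               Function<String, String> expensiveLookup) {
        Map<String, String> raw = new HashMap<>();
        for (String key : mappedKeys) {
            String value = expensiveLookup.apply(key);
            if (value != null) {
                raw.put(key, value); // the document really has this property
            }
        }
        return raw;
    }

    public static void main(String[] args) {
        Map<String, String> document = new HashMap<>();
        document.put("user1", "summary text");
        document.put("user2", "description text");
        Set<String> mappedKeys = new HashSet<>(Arrays.asList("user1"));
        System.out.println(extractSelected(mappedKeys, document::get));
    }
}
```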


readMappingProperties

protected java.util.Map readMappingProperties(java.lang.String propertiesUrl)
A utility method to read mapping properties from a resource file and convert to the map form.

Parameters:
propertiesUrl - A standard Properties file URL location
See Also:
AbstractMappingMetadataExtracter.setMappingProperties(Properties)

readMappingProperties

protected java.util.Map readMappingProperties(java.util.Properties mappingProperties)
A utility method to convert mapping properties to the Map form.

See Also:
AbstractMappingMetadataExtracter.setMappingProperties(Properties)

register

public final void register()
Registers this instance of the extracter with the registry. This will call the AbstractMappingMetadataExtracter.init() method and then register if the registry is available.

See Also:
AbstractMappingMetadataExtracter.setRegistry(MetadataExtracterRegistry), AbstractMappingMetadataExtracter.init()

init

protected void init()
Provides a hook point for implementations to perform initialization. The base implementation must be invoked or the extracter will fail during extraction. The default mappings will be requested during initialization.


getExtractionTime

public long getExtractionTime()
Provides an estimate, usually a worst case guess, of how long an extraction will take.

This method is used to determine, up front, which of a set of equally reliable extracters will be used for a specific extraction.

Specified by:
getExtractionTime in interface MetadataExtracter
Returns:
Returns the approximate number of milliseconds per extraction

checkIsSupported

protected void checkIsSupported(org.alfresco.service.cmr.repository.ContentReader reader)
Checks if the mimetype is supported.

Parameters:
reader - the reader to check
Throws:
org.alfresco.error.AlfrescoRuntimeException - if the mimetype is not supported

extract

public final java.util.Map extract(org.alfresco.service.cmr.repository.ContentReader reader,
                                   java.util.Map destination)
Extracts the metadata values from the content provided by the reader and source mimetype to the supplied map. The internal mapping and overwrite policy between document metadata and system metadata will be used.

The extraction viability can be determined by an up front call to MetadataExtracter.isSupported(String).

The source mimetype must be available on the ContentAccessor.getMimetype() method of the reader.

Specified by:
extract in interface MetadataExtracter
Parameters:
reader - the source of the content
destination - the map of properties to populate (essentially a return value)
Returns:
Returns a map of all properties on the destination map that were added or modified. If the return map is empty, then no properties were modified.
See Also:
MetadataExtracter.extract(ContentReader, OverwritePolicy, Map, Map)

extract

public final java.util.Map extract(org.alfresco.service.cmr.repository.ContentReader reader,
                                   MetadataExtracter.OverwritePolicy overwritePolicy,
                                   java.util.Map destination)
Extracts the metadata values from the content provided by the reader and source mimetype to the supplied map.

The extraction viability can be determined by an up front call to MetadataExtracter.isSupported(String).

The source mimetype must be available on the ContentAccessor.getMimetype() method of the reader.

Specified by:
extract in interface MetadataExtracter
Parameters:
reader - the source of the content
overwritePolicy - the policy stipulating how the system properties must be overwritten if present
destination - the map of properties to populate (essentially a return value)
Returns:
Returns a map of all properties on the destination map that were added or modified. If the return map is empty, then no properties were modified.
See Also:
MetadataExtracter.extract(ContentReader, OverwritePolicy, Map, Map)

extract

public java.util.Map extract(org.alfresco.service.cmr.repository.ContentReader reader,
                             MetadataExtracter.OverwritePolicy overwritePolicy,
                             java.util.Map destination,
                             java.util.Map mapping)
Extracts the metadata from the content provided by the reader and source mimetype to the supplied map. The mapping from document metadata to system metadata is explicitly provided. The overwrite policy is also explicitly set.

The extraction viability can be determined by an up front call to MetadataExtracter.isSupported(String).

The source mimetype must be available on the ContentAccessor.getMimetype() method of the reader.

Specified by:
extract in interface MetadataExtracter
Parameters:
reader - the source of the content
overwritePolicy - the policy stipulating how the system properties must be overwritten if present
destination - the map of properties to populate (essentially a return value)
mapping - a mapping of document-specific properties to system properties.
Returns:
Returns a map of all properties on the destination map that were added or modified. If the return map is empty, then no properties were modified.
See Also:
MetadataExtracter.extract(ContentReader, Map)

filterSystemProperties

protected void filterSystemProperties(java.util.Map systemProperties,
                                      java.util.Map targetProperties)
Filters the system properties that are going to be applied. Gives the metadata extracter an opportunity to remove properties that may not be appropriate in a given context.

Parameters:
systemProperties - map of system properties to be applied
targetProperties - map of target properties, which may be used to provide the required context

makeDate

protected java.util.Date makeDate(java.lang.String dateStr)
Convert a date String to a Date object


putRawValue

protected boolean putRawValue(java.lang.String key,
                              java.io.Serializable value,
                              java.util.Map destination)
Adds a value to the map, conserving null values. Values are converted to null if:
  • it is an empty string value after trimming
  • it is an empty collection
  • it is an empty array
String values are trimmed before being put into the map. Otherwise, it is up to the extracter to ensure that the value is a Serializable. It is not appropriate to implicitly convert values in order to make them Serializable - the best conversion method will depend on the value's specific meaning.

Parameters:
key - the destination key
value - the serializable value
destination - the map to put values into
Returns:
Returns true if set, otherwise false
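
The normalization rules listed above can be sketched like this. It is an illustration of the documented rules only: RawValueSketch, normalize and put are hypothetical names, and the boolean return value of putRawValue is deliberately not modelled because its exact conditions are not spelled out here.

```java
import java.io.Serializable;
import java.lang.reflect.Array;
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not the Alfresco implementation.
final class RawValueSketch {

    /** Trims strings and turns empty strings, collections and arrays into null. */
    static Serializable normalize(Serializable value) {
        if (value instanceof String) {
            String trimmed = ((String) value).trim();
            return trimmed.isEmpty() ? null : trimmed;
        }
        if (value instanceof Collection && ((Collection<?>) value).isEmpty()) {
            return null;
        }
        if (value != null && value.getClass().isArray() && Array.getLength(value) == 0) {
            return null;
        }
        return value;
    }

    /** Stores the normalized value; a null value is kept in the map, not dropped. */
    static void put(String key, Serializable value, Map<String, Serializable> destination) {
        destination.put(key, normalize(value));
    }

    public static void main(String[] args) {
        Map<String, Serializable> raw = new HashMap<>();
        put("title", "  Annual Report  ", raw);
        put("keywords", new ArrayList<String>(), raw);
        System.out.println(raw);
    }
}
```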

newRawMap

protected final java.util.Map newRawMap()
Helper method to fetch a clean map into which raw values can be dumped.

Returns:
Returns an empty map

getDefaultMapping

protected java.util.Map getDefaultMapping()
This method provides a best guess of where to store the values extracted from the documents. The list of properties mapped by default need not include all properties extracted from the document; just the obvious set of mappings need be supplied. Implementations must either provide the default mapping properties in the expected location or override the method to provide the default mapping.

The default implementation looks for the default mapping file in the location given by the class name and .properties. If the extracter's class is x.y.z.MyExtracter then the default properties will be picked up at classpath:/x/y/z/MyExtracter.properties. Inner classes are supported, but the '$' in the class name is replaced with '-', so default properties for x.y.z.MyStuff$MyExtracter will be located using x.y.z.MyStuff-MyExtracter.properties.
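
The naming rule can be expressed in a few lines. This is a sketch of the rule as described, not the actual lookup code: DefaultMappingLocationSketch and defaultMappingResource are hypothetical names, and the "classpath:/" form of the result follows the first example in the paragraph above.

```java
// Hypothetical illustration only; not the Alfresco implementation.
final class DefaultMappingLocationSketch {

    /** Derives the default properties location from a binary class name. */
    static String defaultMappingResource(String className) {
        // packages become directories, and '$' of inner classes becomes '-'
        return "classpath:/" + className.replace('.', '/').replace('$', '-') + ".properties";
    }

    static String defaultMappingResource(Class<?> extracterClass) {
        return defaultMappingResource(extracterClass.getName());
    }

    public static void main(String[] args) {
        System.out.println(defaultMappingResource("x.y.z.MyExtracter"));
        System.out.println(defaultMappingResource("x.y.z.MyStuff$MyExtracter"));
    }
}
```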

The default mapping implementation should include thorough Javadocs so that the system administrators can accurately determine how to best enhance or override the default mapping.

If the default mapping is declared in a properties file other than the one named after the class, then the AbstractMappingMetadataExtracter.readMappingProperties(String) method can be used to quickly generate the return value:


      protected Map<String, Set<QName>> getDefaultMapping()
      {
          return readMappingProperties(DEFAULT_MAPPING);
      }
 
The map can also be created in code either statically or during the call.

Returns:
Returns the default, static mapping. It may not be null.
See Also:
AbstractMappingMetadataExtracter.setInheritDefaultMapping(boolean inherit)

extractRaw

protected abstract java.util.Map extractRaw(org.alfresco.service.cmr.repository.ContentReader reader)
                                     throws java.lang.Throwable
Override to provide the raw extracted metadata values. An extracter should extract as many of the available properties as is realistically possible. Even if the default mapping doesn't handle all properties, it is possible for each instance of the extracter to be configured differently and more or less of the properties may be used in different installations.

Raw values must not be trimmed or removed for any reason. Null values, empty strings and non-serializable values are handled as follows:

  • Null: Removed
  • Empty String: Passed to the OverwritePolicy
  • Non Serializable: Converted to String or fails if that is not possible

Properties extracted and their meanings and types should be thoroughly described in the class-level javadocs of the extracter implementation, for example:

 editor: - the document editor        -->  cm:author
 title:  - the document title         -->  cm:title
 user1:  - the document summary
 user2:  - the document description   -->  cm:description
 user3:  -
 user4:  -
 

Parameters:
reader - the document to extract the values from. This stream provided by the reader must be closed if accessed directly.
Returns:
Returns a map of document property values keyed by property name.
Throws:
java.lang.Throwable - all exception conditions can be handled.
See Also:
AbstractMappingMetadataExtracter.getDefaultMapping()


Copyright © 2005 - 2010 Alfresco Software, Inc. All Rights Reserved.