IExtract - Extracting information out of documents



This program is distributed under the terms of the GNU General Public License in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.


IExtract is a small utility designed to extract the properties (title, author and comment) from different kinds of documents and present them in a list for further processing.

The following documents are supported (further types can be added with plugins):

HTML

Searches for the text between the <title> and </title> tags and the content of the meta tags (both according to HTML 4.0 and Dublin Core).
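The search described above can be sketched with Python's standard html.parser module. This is only an illustration, not IExtract's own code; which meta names end up as author and comment (author, description, DC.Creator, DC.Description) is an assumption made for this example.

```python
from html.parser import HTMLParser


class PropertyParser(HTMLParser):
    """Collects the <title> text and all <meta name=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name"):
            # Meta names are compared case-insensitively.
            self.meta[attrs["name"].lower()] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def html_properties(text):
    """Return title, author and comment (the mapping is an assumption)."""
    parser = PropertyParser()
    parser.feed(text)
    parser.close()
    meta = parser.meta
    return {
        "title": parser.title.strip() or meta.get("dc.title", ""),
        "author": meta.get("author") or meta.get("dc.creator", ""),
        "comment": meta.get("description") or meta.get("dc.description", ""),
    }


sample = (
    "<html><head><title>My page</title>"
    '<meta name="author" content="Jane Doe">'
    '<meta name="DC.Description" content="A test page">'
    "</head><body></body></html>"
)
properties = html_properties(sample)
```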

PNG

PNG images can contain text chunks with keyword-value pairs.

The (uncompressed) content of the keywords Title, Author and Description is extracted.
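As a rough illustration of how such text chunks can be read, the following Python sketch walks the chunk list of a PNG file and collects the uncompressed tEXt chunks. It is not taken from IExtract; the helper names are hypothetical, and the sketch does not verify the chunk checksums.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_text_chunks(data):
    """Return {keyword: text} for all uncompressed tEXt chunks in PNG bytes."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    result = {}
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(data):
        # Every chunk starts with a 4-byte length and a 4-byte type.
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + length]
        if ctype == b"tEXt":
            # Keyword and text are separated by a single NUL byte.
            keyword, _, text = body.partition(b"\x00")
            result[keyword.decode("latin-1")] = text.decode("latin-1")
        elif ctype == b"IEND":
            break
        pos += 12 + length  # length field + type + data + CRC
    return result


def make_chunk(ctype, body):
    """Hypothetical helper: build one chunk for the in-memory demo below."""
    crc = zlib.crc32(ctype + body) & 0xFFFFFFFF
    return struct.pack(">I", len(body)) + ctype + body + struct.pack(">I", crc)


sample = (
    PNG_SIGNATURE
    + make_chunk(b"IHDR", struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0))
    + make_chunk(b"tEXt", b"Title\x00Sunset")
    + make_chunk(b"tEXt", b"Author\x00Jane Doe")
    + make_chunk(b"IEND", b"")
)
chunks = png_text_chunks(sample)
properties = {key: chunks.get(key, "") for key in ("Title", "Author", "Description")}
```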

GIF

GIF images can contain "Comment extensions". The content of these entries is returned in the comment field. Author and title are left empty (the format provides no place for them).

JPEG

The application recognizes comments stored after a comment marker (0xFFFE), in an APP1 Exif marker (as stored by Microsoft Windows XP) and in an APPD marker (as some versions of Photoshop seem to write).
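The first of these cases, the plain comment marker, can be sketched in Python as follows. The sketch only scans the segments in front of the image data and ignores the APP1 and APPD variants; it illustrates the file layout and is not IExtract's implementation.

```python
import struct


def jpeg_comments(data):
    """Return the text of all COM (0xFFFE) segments before the scan data."""
    if data[:2] != b"\xff\xd8":  # every JPEG starts with the SOI marker
        raise ValueError("not a JPEG file")
    comments = []
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            break  # lost synchronisation, give up
        marker = data[pos + 1]
        if marker == 0xFF:  # fill byte in front of a marker
            pos += 1
            continue
        if marker == 0x01 or 0xD0 <= marker <= 0xD7:
            pos += 2  # stand-alone markers carry no length field
            continue
        if marker in (0xD9, 0xDA):  # end of image or start of scan
            break
        # The 2-byte length counts itself but not the marker.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if marker == 0xFE:
            comments.append(data[pos + 4:pos + 2 + length].decode("latin-1"))
        pos += 2 + length
    return comments


sample = (
    b"\xff\xd8"
    + b"\xff\xe0" + struct.pack(">H", 4) + b"JF"      # shortened APP0 stand-in
    + b"\xff\xfe" + struct.pack(">H", 7) + b"hello"   # comment segment
    + b"\xff\xd9"
)
found = jpeg_comments(sample)
```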

Microsoft Office documents

Searches for the content of the properties dialog.

Thanks to the Apache Jakarta POI project and their documentation of the "OLE2 Document compound format" (as used by MS Office), it should be possible to parse all documents (possibly with the exception of documents bigger than 6.8 MB).

Office Open XML (OOXML) documents can be processed if the zlib library is installed.

MP3

Extracts the content of the ID3 tag (versions 1.x and 2.x). The title of the album is returned in the comment field.
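For the simpler of the two tag versions, the extraction can be sketched in a few lines of Python. An ID3 version 1 tag occupies the last 128 bytes of the file and starts with the letters "TAG". Returning the artist as the author is an assumption of this example; the album goes into the comment field as described above.

```python
def id3v1_properties(data):
    """Return title, author and comment from an ID3v1 tag, or None if absent."""
    if len(data) < 128 or data[-128:-125] != b"TAG":
        return None
    tag = data[-128:]

    def field(start, length):
        # Fields have a fixed width and are padded with NUL bytes or spaces.
        raw = tag[start:start + length].split(b"\x00", 1)[0]
        return raw.decode("latin-1").strip()

    return {
        "title": field(3, 30),
        "author": field(33, 30),   # artist (assumed mapping)
        "comment": field(63, 30),  # album title
    }


sample = (
    b"\x00" * 64  # stand-in for the audio data
    + b"TAG"
    + b"Song".ljust(30, b"\x00")
    + b"Artist".ljust(30, b"\x00")
    + b"Album".ljust(30, b"\x00")
    + b"2001"
    + b"\x00" * 30  # ID3 comment field, unused here
    + b"\x00"       # genre
)
properties = id3v1_properties(sample)
```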

OGG

Extracts the content of the comment header. The title of the album is returned in the comment field.

PDF

PDF documents contain a so-called document information dictionary with various keys. The content of this dictionary is extracted (with the "Subject" key as the comment).

Encrypted documents are not decrypted!

StarOffice documents

Searches for the content of the properties dialog.

OpenOffice documents

Searches for the content of the properties dialog.

AbiWord documents

Searches for the content of the properties dialog.

RTF documents

Searches for the content of the "info" block.

The output can be in HTML format (a table), XML (default: XHTML), LaTeX (tabular) format or plain text (either human-readable (separated by spaces) or easily parseable (quoted, separated by quotes)). Note that special characters in the extracted information are converted.

The behaviour of the program is controlled with an INI file (~/.IExtract for UNICES or %HOMEDRIVE%%HOMEPATH%IExtract.ini for Windows). See "Format of the INI file" for further information.

This behaviour can be overridden either by specifying a different INI file or by passing further options to the program.

The file specifications can contain the typical UNIX wildcards (this also holds for the Windows version):

  • an asterisk (*) matches any number of any characters
  • a question mark (?) matches any single character
  • brackets ([ and ]) enclose a set of valid characters, given either by listing all of them, by specifying the bounds of a range separated with a minus (-), or by naming a character class (between '[:' and ':]'); a leading caret (^) or exclamation mark (!) inverts the set

Examples are:

*.mp3
Inspect only MP3 files.
[A-Za-z]*
Inspect files which start with a letter.
[^[:alnum:]]*
Inspect files which don't start with a letter or a number.
???.txt
Inspect text-files having a name with exactly 3 characters.
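The effect of such patterns can be tried out with Python's fnmatch module, which implements a similar (but not identical) wildcard syntax. Note that fnmatch only knows the exclamation mark for inverting a set and has no character classes such as [:alnum:], so the third example is spelled out as a range here. The file names are made up for this illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical directory content used for the demonstration.
names = ["song.mp3", "Readme", "_draft", "abc.txt", "abcd.txt", "9lives"]


def select(pattern, candidates):
    """Return the candidates matching the wildcard pattern (case-sensitive)."""
    return [name for name in candidates if fnmatchcase(name, pattern)]


mp3_files = select("*.mp3", names)
starting_with_letter = select("[A-Za-z]*", names)
not_alphanumeric = select("[!A-Za-z0-9]*", names)
three_char_text = select("???.txt", names)
```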

Usage

   IExtract [OPTIONS] <File(s)>

Options

The following options are recognized by the program (short options can be combined together; if the long option needs an argument, so does its short counterpart. Long options can be abbreviated as long as they stay unique):

   -r, --recursive Recurse into subdirectories after processing the current directory
 
   -o, --output=STYLE Sets the style of the output (text, quoted, HTML or LaTeX).
 
   -f, --format=FORMAT Format of the output (default: %n¦-¦%t¦%a¦%c¦%d)

The output can be separated into columns, indicated by the pipe character (¦) in the format string.

The percent sign (%) indicates that the following character has a special meaning:

  • %a is substituted with the author
  • %c is substituted with the comment
  • %d is substituted with the modification time of the file
  • %D is substituted with the modification time of the file (day only)
  • %e is substituted with the extension of the file
  • %E is substituted with the name of the file without extension
  • %n is substituted with the name of the file
  • %N is substituted with path and name of the file
  • %p is substituted with the path of the file
  • %P is substituted with the path of the file in UNIX style (the nodes separated with a slash (/))
  • %s is substituted with the size of the file
  • %S is substituted with the size of the file (human readable)
  • %t is substituted with the title
  • %U is substituted with path and name of the file in UNIX style (the nodes separated with a slash (/))
  • %(LETTERS) is substituted with the first of the above substitutions that produces a non-empty string (e.g. %(tn) is the title if not empty, or else the filename)
  • %*LETTER changes the substitution slightly. If LETTER is a filename substitution, it additionally changes "special characters" (which depend on the output mode) in the file name; for the others it suppresses this change.

In any other combination the '%' is simply removed!
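To make the rules above more tangible, here is a small Python sketch of such a format expansion. It is a simplified model and not the program's actual code: it only knows the letters passed in, ignores the %*LETTER form, and interprets "the '%' is removed" as dropping the percent sign while keeping the character that follows it.

```python
def expand(fmt, props):
    """Expand %LETTER and %(LETTERS) in fmt; props maps letters to values."""
    out = []
    i = 0
    while i < len(fmt):
        if fmt[i] != "%":
            out.append(fmt[i])
            i += 1
            continue
        i += 1  # skip the '%'
        if i >= len(fmt):
            break  # a trailing '%' is dropped
        if fmt[i] == "(":
            end = fmt.find(")", i)
            if end != -1:
                letters = fmt[i + 1:end]
                # Take the first substitution that yields a non-empty string.
                out.append(next((props[c] for c in letters if props.get(c)), ""))
                i = end + 1
            continue  # unmatched '(' is treated as ordinary text
        if fmt[i] in props:
            out.append(props[fmt[i]])
            i += 1
        # Otherwise only the '%' is dropped; the character itself is kept.
    return "".join(out)


def format_row(fmt, props):
    """Split the format into columns at '¦' and expand each column."""
    return [expand(column, props) for column in fmt.split("¦")]


props = {"n": "song.mp3", "t": "", "a": "Jane", "c": "Album"}
row = format_row("%n¦%(tn)¦%a¦%c", props)
```

Here the second column falls back to the file name because the title is empty.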

 
   -T, --title=TITLE Title of the output (written even if there's no further output).

TITLE specifies the columns for the output, separated with the pipe character (¦); every column must contain at least one character.

 
   -s, --separate=TEXT Separate subdirectories with TEXT (default: empty); implies recursion into subdirectories (--recursive).

The percent sign (%) indicates that the following character has a special meaning:

  • %e prints the end-of-output for the specified output style
  • %n is substituted with the name of the directory
  • %N is substituted with the full path of the directory
  • %p is substituted with the path to the directory
  • %P is substituted with the path to the directory in UNIX style (separated with a slash (/))
  • %s prints the start-of-output for the specified output style
  • %U is substituted with the full path of the directory in UNIX style (separated with a slash (/))
 
   -p, --prepend=TEXT Specifies a text which is written before any other output (e.g. for a headline).
 
   -P, --pre-file=FILE Specifies a file, whose content is written before any other output (e.g. for a header).
 
   -a, --append=TEXT Specifies a text which is written after any other output (e.g. for a footer line).
 
   -A, --app-file=FILE Specifies a file, whose content is written after any other output (e.g. for a footer).
 
   -e, --show-errors Puts error messages (additionally) into the output
 
   -u, --add-unknown Show all files (including unknown types) in the output
 
   -n, --new=[DAYS:]TEXT Show (leading) TEXT (in its own column) for files younger than DAYS days (default: 30)

DAYS may be omitted or may carry a multiplier suffix: m multiplies the value by 30.
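One possible reading of this argument syntax is sketched below in Python. How the program itself tells a DAYS prefix apart from a TEXT that merely contains a colon is not documented here, so the rule used (the part before the first colon must be a number, optionally followed by m) is an assumption.

```python
def parse_new_option(value, default_days=30):
    """Split a [DAYS:]TEXT argument into (days, text)."""
    head, separator, text = value.partition(":")
    if separator:
        number, factor = head, 1
        if head.endswith("m"):
            number, factor = head[:-1], 30  # 'm' multiplies by 30
        if number.isdigit():
            return int(number) * factor, text
    # No usable DAYS prefix: the whole argument is the text.
    return default_days, value


one_week = parse_new_option("7:NEW")
two_months = parse_new_option("2m:NEW")
default_age = parse_new_option("NEW")
```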

 
   -i, --include=LIST Specifies which files should be inspected; this can also be a list of files, separated with the path separator of the operating system (the colon (:) for UNICES; the semicolon (;) for Windows).

File specifications can contain the typical UNIX wildcards. Details can be found at the top of this document.

 
   -x, --exclude=LIST Specifies which files should not be inspected; this can also be a list of files, separated with the path separator of the operating system (the colon (:) for UNICES; the semicolon (;) for Windows).

File specifications can contain the typical UNIX wildcards. Details can be found at the top of this document.

 
   -I, --ini-file=FILE Read further options from the specified file. See "Format of the INI file" for further information.
 
   -t, --threads=NR Set the number of threads for examining files. These threads are in addition to the main thread, which searches for the files to examine.

This option is only available if the program has been configured (or compiled) with --enable-threads (or -DENABLE_THREADS)!

 
   -S, --sort Sorts the found files alphabetically.
 
   -M, --mode=[MODE] Specifies how the type of the files is determined. Possible values are:
Ext
From the last extension.
EXT
From the last extension (ignoring case).
AllExt
From the last known extension.
AllEXT
From the last known extension (ignoring case).
Content
From the content of the file. This searches for certain identifiers characterising the different file-types.
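The idea behind the Content mode can be illustrated with the well-known start sequences of some of the supported formats. The Python sketch below is not IExtract's detection code; the table is a selection made for this example, and container formats (OLE2 for MS Office, ZIP for OOXML and OpenOffice documents) would need a further look inside the file.

```python
# (start sequence, type label) pairs; the labels are chosen for this example.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"%PDF-", "PDF"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2"),  # MS Office compound file
    (b"PK\x03\x04", "ZIP"),                         # OOXML, OpenOffice
    (b"OggS", "OGG"),
    (b"ID3", "MP3"),                                # MP3 with an ID3v2 tag
    (b"{\\rtf", "RTF"),
]


def detect_type(header):
    """Return the label of the first matching signature, or None."""
    for magic, label in SIGNATURES:
        if header.startswith(magic):
            return label
    return None


pdf_type = detect_type(b"%PDF-1.4\n")
gif_type = detect_type(b"GIF89a\x01\x00\x01\x00")
unknown_type = detect_type(b"plain text")
```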
 
   -V, --version Output version information and exit.
 
   -h, -?, --help Display this help and exit.
 
   File(s) Specifies the starting directory and/or the files to inspect.

The options append and app-file (and likewise prepend and pre-file) can be repeated. Each new option adds its text to the previous ones.

Plugins

Plugins are small extensions that add support for further types of documents.

They are implemented as shared libraries (UNIX) or DLLs (Windows), which offer two functions:

processFile
Reads the properties from the passed file
getFileType
Checks whether the passed file has the right type

An example can be found in src/Plugins/Text.cpp.

Format of the INI file

The format of the INI file is like this (entries can be missing):

   [Output]
   Format=FORMAT
   Title=TEXT
   TextForNewFiles=TEXT
   MaxAgeForNewFiles=DAYS
   DirSeparatorText=TEXT
   Style=STYLE
   SortFiles=1

   [FileType]
   Mode=Content

   [Handler]
   <Extension1>=<Library1>
   <Extension2>=<Library2>
   <ExtensionN>=<LibraryN>

The same substitutions as within the options are performed!
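A file in this format can be read with any INI parser. The following Python sketch uses the standard configparser module on a sample file; the concrete values (and the handler entry txt=libText.so in particular) are invented for this example. Interpolation is switched off because the format strings contain percent signs.

```python
import configparser

# Sample content; all values are made up for this illustration.
SAMPLE = """\
[Output]
Format=%n¦%t¦%a¦%c
Title=File¦Title¦Author¦Comment
Style=HTML
SortFiles=1

[FileType]
Mode=Content

[Handler]
txt=libText.so
"""


def load_settings(text):
    """Parse INI text into a small dictionary of settings."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep the case of the keys
    parser.read_string(text)
    handlers = dict(parser["Handler"]) if parser.has_section("Handler") else {}
    return {
        "format": parser.get("Output", "Format", fallback=None),
        "sort": parser.getboolean("Output", "SortFiles", fallback=False),
        "mode": parser.get("FileType", "Mode", fallback=None),
        "handlers": handlers,
    }


settings = load_settings(SAMPLE)
```

Since entries can be missing, every lookup falls back to a default instead of raising an error.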


Requirements

To compile the sources, you'll need my libYGP library, which is also available as a SourceForge project. See http://libymp.sourceforge.net for details.

With zlib installed, Office Open XML (OOXML) documents can be processed as well.

The Windows executable has no requirements. But keep in mind that this application is cross-compiled with MinGW and as such behaves very much like a UNIX utility (i.e. directories are separated by slashes). This version is completely unsupported!


Download

Get it from the Sourceforge download area.

