This program is distributed under the terms of the GNU General Public License; in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
IExtract is a small utility that extracts the properties (title, author and comment) from different types of documents and presents them in a list for further processing.
The following documents are supported (further types can be added with plugins):
In HTML documents, the program searches for the text between the <title> and </title> tags and for the content of the meta tags (both according to HTML 4.0 and Dublin Core).
PNG images can contain text chunks with keyword-value pairs.
The (uncompressed) content of the keywords Title, Author and Description is extracted.
GIF images can contain "Comment extensions". The content of these entries is returned in the comment field. Author and title are left empty (because they are not available).
In JPEG images, the application recognizes comments stored after a comment marker (0xFFFE), in an APP1 Exif marker (as stored by Microsoft/Windows XP) and in an APPD marker (as some versions of Photoshop seem to write).
Searches for the content of the properties dialog.
Thanks to the Apache Jakarta POI project and its documentation of the "OLE2 Document compound format" (as used by MS Office), all documents should be parseable (possibly with the exception of documents larger than 6.8 MB).
Office Open XML (OOXML) documents can be processed if the zlib library is installed.
Extracts the content of the ID3 tag (version 1.x and 2.x). The title of the album is returned in the comment field.
Extracts the content of the comment header. The title of the album is returned in the comment field.
PDF documents contain a so-called Document Information dictionary with various keys. The content of this dictionary is extracted (with the "Subject" key as comment).
Encrypted documents are not decrypted!
Searches for the content of the properties dialog.
Searches for the content of the properties dialog.
Searches for the content of the properties dialog.
Searches for the content of the "info" block.
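Two of the simpler cases above, the uncompressed PNG text chunks and the ID3 version 1.x tag, can be illustrated with a short Python sketch. This is not IExtract's own code. The function names and the sample data are hypothetical, ID3 version 2.x is not covered, and the mapping of the album to the comment field simply follows the description above.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_text_chunks(data: bytes) -> dict:
    """Return {keyword: text} for all uncompressed tEXt chunks of a PNG stream."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG stream")
    result = {}
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(data):
        # Every chunk starts with a 4-byte big-endian length and a 4-byte type.
        length, chunk_type = struct.unpack(">I4s", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + length]
        if chunk_type == b"tEXt":
            # tEXt data is "keyword NUL text", both encoded as Latin-1.
            keyword, _, text = body.partition(b"\x00")
            result[keyword.decode("latin-1")] = text.decode("latin-1")
        elif chunk_type == b"IEND":
            break
        pos += 8 + length + 4  # header + data + CRC
    return result


def id3v1_properties(data: bytes):
    """Return title/author/comment from a trailing ID3v1 tag, or None."""
    tag = data[-128:]
    if len(tag) < 128 or not tag.startswith(b"TAG"):
        return None

    def field(start: int, size: int) -> str:
        raw = tag[start:start + size].split(b"\x00")[0]
        return raw.decode("latin-1").strip()

    # Layout: "TAG", title (30), artist (30), album (30), year, comment, genre.
    return {"title": field(3, 30), "author": field(33, 30), "comment": field(63, 30)}


def make_chunk(chunk_type: bytes, body: bytes) -> bytes:
    """Build one PNG chunk (hypothetical helper, used only for sample data)."""
    crc = zlib.crc32(chunk_type + body)
    return struct.pack(">I", len(body)) + chunk_type + body + struct.pack(">I", crc)


# Hypothetical in-memory samples; a real program would read files instead.
sample_png = (PNG_SIGNATURE
              + make_chunk(b"tEXt", b"Title\x00Sunset")
              + make_chunk(b"tEXt", b"Author\x00Jane")
              + make_chunk(b"IEND", b""))
sample_mp3 = (b"\xff" * 16 + b"TAG"
              + b"Song".ljust(30, b"\x00") + b"Band".ljust(30, b"\x00")
              + b"Album".ljust(30, b"\x00") + b"2001" + bytes(30) + b"\x00")

print(png_text_chunks(sample_png))
print(id3v1_properties(sample_mp3))
```

Compressed text chunks (zTXt, iTXt) are deliberately skipped, matching the restriction to uncompressed content stated above.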
The output can be in HTML format (a table), XML (default: XHTML), LaTeX (tabular) format or plain text (either human-readable (separated by spaces) or easily parseable (quoted, separated by quotes)). Note that special characters in the extracted information are converted.
The behaviour of the program is controlled by an INI file (~/.IExtract on Unix systems or %HOMEDRIVE%%HOMEPATH%IExtract.ini on Windows). See Format of the INI file for further information.
This handling can be overridden by specifying either a different INI file or by passing further options to the program.
The file specifications can contain the typical UNIX wildcards: an asterisk (*) matches any number of any characters, a question mark (?) matches any single character, and brackets ([ and ]) enclose a set of valid characters. The set is given either by listing all of its characters, by specifying its bounds separated with a minus (-), or by specifying its class (between '[:' and ':]'). A leading caret (^) or exclamation mark (!) inverts the selection. This also holds for the Windows version.
Examples are:
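As a rough illustration of these wildcard rules (not the program's own matching code), Python's standard fnmatch module implements a similar syntax. The file names below are hypothetical. Note that fnmatch only accepts the exclamation mark for inverting a set and knows neither the caret form nor the '[:class:]' notation, so those two features are not shown.

```python
from fnmatch import fnmatchcase

# Hypothetical file names, used only to demonstrate the patterns.
NAMES = ["report.pdf", "photo1.png", "photo2.png", "photoA.png", "notes.txt"]


def select(pattern: str, names=NAMES) -> list:
    """Return the names matching a shell-style wildcard pattern."""
    return [name for name in names if fnmatchcase(name, pattern)]


print(select("*.png"))            # asterisk: any number of any characters
print(select("photo?.png"))       # question mark: exactly one character
print(select("photo[0-9].png"))   # range given by its bounds
print(select("photo[!0-9].png"))  # inverted selection
print(select("[nr]*"))            # set given by listing its characters
```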
IExtract [OPTIONS] <File(s)>
The following options are recognized by the program (short options can be combined together; if the long option needs an argument, so does its short counterpart. Long options can be abbreviated as long as they stay unique):
-r, --recursive | Recurse into subdirectories after processing the current directory |
-o, --output=STYLE | Sets the style of the output (text, quoted, HTML or LaTeX). |
-f, --format=FORMAT | Format of the output (default: %n¦-¦%t¦%a¦%c¦%d)
The output can be separated into columns, indicated by the pipe character (¦) in the format string. The percent sign (%) indicates that the following character has a special meaning:
In every other combination the '%' is removed! |
-T, --title=TITLE | Title of the output (written even if there's no further output).
TITLE specifies the columns for the output, separated with the pipe character (¦); every column must contain at least one character. |
-s, --separate=TEXT | Separate subdirectories with TEXT (default: empty); implies
recursion into subdirectories (--recursive).
The percent sign (%) indicates that the following character has a special meaning:
|
-p, --prepend=TEXT | Specifies text that is written before any other output (e.g. for a headline). |
-P, --pre-file=FILE | Specifies a file whose content is written before any other output (e.g. for a header). |
-a, --append=TEXT | Specifies text that is written after any other output (e.g. for a footer line). |
-A, --app-file=FILE | Specifies a file whose content is written after any other output (e.g. for a footer). |
-e, --show-errors | Puts error messages (additionally) into the output |
-u, --add-unknown | Show all files (including unknown types) in the output |
-n, --new=[DAYS:]TEXT | Show (leading) TEXT (in its own column) for files younger than
DAYS days (default: 30)
DAYS may be omitted or may have a multiplier suffix: m for 30. |
-i, --include=LIST | Specifies which files should be inspected; this can also be a
list of files, separated with the path separator of the operating
system (the colon (:) on Unix systems; the semicolon (;) on Windows).
File specifications can contain the typical UNIX wildcards. Details can be found at the top of the document. |
-x, --exclude=LIST | Specifies which files should not be inspected; this can
also be a list of files, separated with the path separator of the operating
system (the colon (:) on Unix systems; the semicolon (;) on Windows).
File specifications can contain the typical UNIX wildcards. Details can be found at the top of the document. |
-I, --ini-file=FILE | Read further options from the specified file. See Format of the INI file for further information. |
-t, --threads=NR | Set the number of threads for examining files. These threads are in addition
to the main thread, which searches for the files to examine.
This option is only available if the program has been configured
(or compiled) with |
-S, --sort | Sorts the found files alphabetically. |
-M, --mode=[MODUS] | Specifies how the type of the files is determined.
Possible values are:
|
-V, --version | Output version information and exit. |
-h, -?, --help | Display this help and exit. |
File(s) | Specifies the starting directory and/or the files to inspect. |
The options append and app-file (or equivalently prepend and pre-file) can be repeated. Each new option adds its text to the previous ones.
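The argument syntax of the --new option described in the table, [DAYS:]TEXT with an optional m suffix, can be sketched in Python as follows. This is only an interpretation of the description, not the program's actual parser. The function name is hypothetical, and treating a leading number followed by a colon as DAYS is an assumption about how the two parts are told apart.

```python
import re


def parse_new_option(value: str, default_days: int = 30):
    """Split a '[DAYS:]TEXT' argument into (days, text).

    Assumption: the part before the first colon counts as DAYS only if it
    is a number, optionally followed by the multiplier suffix 'm' (x 30).
    """
    match = re.fullmatch(r"(\d+)(m?):(.*)", value, flags=re.DOTALL)
    if match is None:
        # DAYS was omitted, so the documented default of 30 applies.
        return default_days, value
    days = int(match.group(1))
    if match.group(2):
        days *= 30  # 'm' multiplies the number by 30
    return days, match.group(3)


print(parse_new_option("NEW"))
print(parse_new_option("7:NEW"))
print(parse_new_option("2m:NEW"))
```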
Plugins are small extensions that add support for further types of documents.
They are implemented as shared libraries (UNIX) or DLLs (Windows), which offer two functions:
An example can be found in src/Plugins/Text.cpp.
The format of the INI file is like this (entries can be missing):
[Output]
Format=FORMAT
Title=TEXT
TextForNewFiles=TEXT
MaxAgeForNewFiles=DAYS
DirSeparatorText=TEXT
Style=STYLE
SortFiles=1

[FileType]
Mode=Content

[Handler]
<Extension1>=<Library1>
<Extension2>=<Library2>
<ExtensionN>=<LibraryN>
The same substitutions as within the options are performed!
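A file in this format can also be read by other tools. The following Python sketch uses the standard configparser module. All values in the sample, including the handler entry and its library path, are hypothetical. Interpolation is switched off because the Format entry contains percent signs, which configparser would otherwise try to expand itself.

```python
import configparser

# Hypothetical sample; the keys follow the format described above.
SAMPLE_INI = """\
[Output]
Format=%n¦-¦%t¦%a¦%c¦%d
TextForNewFiles=NEW
MaxAgeForNewFiles=14
Style=HTML
SortFiles=1

[FileType]
Mode=Content

[Handler]
txt=/path/to/hypothetical/Text.so
"""


def load_settings(text: str) -> configparser.ConfigParser:
    """Parse INI text while keeping '%' sequences and key case untouched."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep keys such as 'SortFiles' case-sensitive
    parser.read_string(text)
    return parser


settings = load_settings(SAMPLE_INI)
output = settings["Output"]
print(output["Format"])
print(output.getint("MaxAgeForNewFiles"))
print(output.getboolean("SortFiles"))
print(dict(settings["Handler"]))
```

Missing entries simply do not appear in the parsed sections, which fits the statement that entries can be missing.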
To compile the sources, you'll need my libYGP library, which is also available as a SourceForge project. See http://libymp.sourceforge.net for details.
With zlib installed, Office Open XML (OOXML) documents can be processed as well.
The Windows executable has no requirements. Keep in mind, however, that this application is cross-compiled with MinGW and as such behaves very much like a Unix utility (i.e. directories are separated by slashes). This version is completely unsupported!
Get it from the Sourceforge download area.