User Commands GOCR(1)
NAME
gocr - command line text recognition tool
SYNOPSIS
gocr [OPTION] [-i] pnm-file
DESCRIPTION
gocr is an optical character recognition program that can be
used from the command line. It takes input in PNM, PGM,
PBM, PPM, or PCX format and writes recognized text to
stdout. If the pnm file is a single dash, PNM data is read
from stdin. If gzip, bzip2 and netpbm-progs are installed
and your system supports popen(3), then pnm.gz, pnm.bz2,
png, jpg, jpeg, tiff, gif, bmp, ps (single pages only) and
eps are also supported as input files (but not as an input
stream), where pnm can be replaced by any of ppm, pgm and
pbm.
OPTIONS
-h show usage information
-i file
read input from file (or stdin if file is a single
dash)
-o file
send output to file instead of stdout
-e file
send errors to file instead of stderr, or to stdout if
file is a dash
-x file
write progress output to file (file can be a file name, a
fifo name or a file descriptor 1...255); this is useful
for GUI developers who want to show the OCR progress. The
file descriptor argument is only available if gocr was
compiled with __USE_POSIX defined
-p path
database path; a final slash must be included (default:
./db/). This path will be populated with images of
learned characters
-f format
output format of the recognized text (ISO8859_1 TeX
HTML XML UTF8 ASCII), XML will also output position and
probability data
-l level
set grey level threshold to level (0 < level <= 255, for
example 160; default: 0 for autodetect); darker pixels
belong to characters, brighter pixels are interpreted as
background of the input image
-d size
set dust size in pixels (clusters smaller than this are
removed), 0 means no clusters are removed, the default
is -1 for auto detection
-s num
set spacewidth between words in units of dots (default:
0 for autodetect), wider widths are interpreted as word
spaces, smaller as character spaces
-v verbosity
be verbose to stderr; verbosity is a bitfield
-c string
restrict verbose output on stderr to the characters in
string; more output is generated for all characters
within the string, and the underscore stands for unknown
chars. This is useful for limiting debug information to
what is necessary
-C string
only recognise characters from string; this is a filter
for cases where only part of the character alphabet is
of interest. You can use 0-9 or a-z to specify ranges;
use -- to detect the minus sign
-a certainty
set value for certainty of recognition (0..100;
default: 95); characters with a higher certainty are
accepted, characters with a lower certainty are treated
as unknown (not recognized). Set a higher value if you
want only characters recognized with greater certainty
-u string
output this string for every unrecognized character
(default is "_")
-m mode
set operational mode; mode is a bitfield (default: 0)
-n bool
if bool is non-zero, only recognise numbers (this is
now obsolete, use -C "0123456789")
The verbosity is specified as a bitfield:
1 print more info
2 list shapes of boxes (see -c) to stderr
4 list pattern of boxes (see -c) to stderr
8 print pattern after recognition for debugging
16 print debug information about recognition of lines
to stderr
32 create outXX.png with boxes and lines marked on
each general OCR-step
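Because verbosity is a bitfield, several of the values above can
be combined by adding (or OR-ing) them. The following is a
minimal shell sketch; the variable names are illustrative and not
part of gocr, and text1.pbm is the sample file name used in
EXAMPLES.

```shell
# Illustrative names for two verbosity bits from the list above.
V_INFO=1      # print more info
V_PNG=32      # create outXX.png with boxes and lines marked

# OR the bits together; 1 | 32 gives 33.
verbosity=$((V_INFO | V_PNG))

# Print the resulting command line instead of running it.
echo "gocr -v $verbosity text1.pbm"
```

The command is only printed here, so the sketch can be tried
without an input image at hand.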
The operation modes are:
2 use database to recognize characters which are not
recognized by other algorithms (early development)
4 switch on layout analysis or zoning (development)
8 don't compare unrecognized characters to recognized
ones
16 don't try to divide overlapping characters to two
or three single characters
32 don't do context correction
64 character packing: before recognition starts,
similar characters are searched for and only one of
these characters will be sent to the recognition
engine (development)
130 extend database: prompts the user for unidentified
characters and extends the database with the user's
answer (128+2, early development)
256 switch off the recognition engine (makes sense
together with -m 2)
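The operation modes combine in the same way as the verbosity
bits: 130 is 128+2, and "-m 256 together with -m 2" amounts to a
single value of 258. A minimal shell sketch follows; the variable
names are illustrative only, and text1.pbm is the sample file
name used in EXAMPLES.

```shell
# Illustrative names for mode bits taken from the list above.
M_DATABASE=2      # use database for otherwise unrecognized characters
M_EXTEND=128      # prompt the user and extend the database (used as 128+2)
M_NO_ENGINE=256   # switch off the recognition engine

# Extend the database: 128 | 2 = 130.
extend_mode=$((M_EXTEND | M_DATABASE))

# Database-only recognition: 256 | 2 = 258.
db_only_mode=$((M_NO_ENGINE | M_DATABASE))

# Print the resulting command lines instead of running them.
echo "gocr -m $extend_mode -p ./db/ text1.pbm"
echo "gocr -m $db_only_mode -p ./db/ text1.pbm"
```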
AUTHOR
Joerg Schulenburg (see http://jocr.sourceforge.net/ for
EMAIL)
First version of man page by Tim Waugh
VERSION INFORMATION
This man page documents gocr, version 0.41.
REPORTING BUGS
Report bugs to Joerg Schulenburg
ATTRIBUTES
See attributes(5) for descriptions of the following
attributes:
+---------------+------------------+
|ATTRIBUTE TYPE | ATTRIBUTE VALUE |
+---------------+------------------+
|Availability | image/gocr |
+---------------+------------------+
|Stability | Volatile |
+---------------+------------------+
SEE ALSO
More details can be found at
/usr/share/doc/gocr-X.XX/gocr.html. Also read
/usr/share/doc/gocr-X.XX/README to learn how to improve
results.
EXAMPLES
gocr -v 33 text1.pbm
output verbose information, out30.png is created to see
details of recognition process
gocr -v 7 -c _YV text1.pbm
verbose output for unknown chars and chars Y and V
djpeg -pnm -gray text.jpg | gocr
convert a jpeg file to pnm format and input via pipe
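Since formats such as png are supported as input files but not as
an input stream, a stream has to be converted to PNM before it
reaches gocr. The sketch below wraps this in a shell function and
restricts recognition to digits. It assumes that pngtopnm from
netpbm is installed; the function name ocr_digits and the file
names are hypothetical.

```shell
# ocr_digits PNG-FILE TEXT-FILE
# Convert a PNG to PNM with pngtopnm (netpbm), pipe it into gocr,
# recognise only the digits 0-9 and write UTF8 text to TEXT-FILE.
ocr_digits() {
    pngtopnm "$1" | gocr -C "0-9" -f UTF8 -o "$2" -i -
}

# Usage (hypothetical file names):
#   ocr_digits meter.png meter.txt
```

The trailing "-i -" tells gocr to read the PNM data from stdin, as
described for the -i option.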
NOTES
This software was built from source available at
https://java.net/projects/solaris-userland. The original
community source was downloaded from
http://prdownloads.sourceforge.net/jocr/gocr-0.48.tar.gz
Further information about this software can be found on the
open source community website at
http://jocr.sourceforge.net/.
Linux Last change: 29 Mar 2009 4