Gap Framework - Natural Language Processing for PDF/TIFF/Image Documents
SPLITTER Module
High Precision PDF Page Splitting/OCR/Text Extraction
Technical Specification, Gap v0.9.2
1 Document
1.1 Document Overview
The document classifier contains the following primary classes and their relationships:
- Document – This is the base class for the representation of a stored document. The constructor takes as parameters the stored path to the document and, optionally, a directory path for storing extracted pages and text, an event completion handler for processing the document asynchronously, and a config parameter for configuring the NLP preprocessing.
document = Document("/somedir/mydocument.pdf", "/mypages/mydocument")
The constructor calls the _exists() and _collate() private methods for the specified document.
- Page – This is a base class for the representation of an extracted page from the document. The Document class contains a list (index) of the extracted pages as Page objects.
Fig. 1a High Level view of Document Class Object Relationships
1.2 Initializer (Constructor)
Synopsis
Document(document=None, dir='./', ehandler=None, config=None)
Parameters
document: If not None, a string that is either:
1. local path to document
2. remote path to document (i.e., http[s]://…)
The document must be one of the following types: PDF, JPG (JPEG, J2K), PNG, BMP or TIF (TIFF)
dir: The directory where to store the machine learning ready data.
ehandler: If not None, the processing of the document into machine learning ready data will be asynchronous, and the value of the parameter is the function (or method) that is the event handler when processing is complete.
The event handler takes the form:
def myHandler(document):
    # where document is the Document object that was preprocessed.
    pass
config: If not None, a list of one or more configuration settings as strings:
bare
pos
roman
segment
stem=gap|porter|lancaster|snowball|lemma
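As an illustration of how such a list of settings could be validated, the following minimal sketch parses the strings into a dictionary. The function name `parse_config` and the layout of the returned dictionary are hypothetical and are not part of the Gap API; only the setting names and the use of AttributeError for an invalid setting come from this specification. Matching the stemmer name case-insensitively is also an assumption.

```python
# Hypothetical helper, not part of Gap: validates a config list as described above.
VALID_FLAGS = {"bare", "pos", "roman", "segment"}
VALID_STEMMERS = {"gap", "porter", "lancaster", "snowball", "lemma"}


def parse_config(config=None):
    """Return a dict of settings parsed from a list of config strings."""
    settings = {"bare": False, "pos": False, "roman": False,
                "segment": False, "stem": None}
    if config is None:
        return settings
    if not isinstance(config, list):
        raise TypeError("config must be a list of strings")
    for entry in config:
        if not isinstance(entry, str):
            raise TypeError("each config setting must be a string")
        if entry in VALID_FLAGS:
            settings[entry] = True
        elif entry.startswith("stem="):
            # Assumption: the stemmer name is matched case-insensitively.
            stemmer = entry[len("stem="):].lower()
            if stemmer not in VALID_STEMMERS:
                raise AttributeError("invalid stemmer: " + stemmer)
            settings["stem"] = stemmer
        else:
            raise AttributeError("invalid configuration setting: " + entry)
    return settings


print(parse_config(["bare", "stem=porter"]))
```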
Usage
When specified with no parameters, an empty Document object is created. The Document object may then be used to subsequently load (retrieve) previously stored preprocessed machine learning ready data (see load()).
Otherwise, the document parameter must be specified.
The document specified by the document parameter will be preprocessed according to the optional parameters and configuration settings.
By default, the document will be preprocessed as follows:
- The document will be split into individual pages.
- A Page object will be created for each page.
- If the document (or page) is an image (e.g., scanned PDF), it will be OCR'd.
- The digital text will be extracted from each page and stored in the Page object.
- The text will be segmented into regions if the configuration setting segment is specified.
- The text from each page object will be preprocessed into machine learning ready data (see syntax module specification), according to the optional parameters and configuration settings.
- If the document was a scanned or image document, the quality of the scan will be estimated, unless Document.SCANCHECK is set to zero.
The machine learning ready data will be stored on a per page basis in the directory specified by the parameter dir.
The following files are created and stored:
<document><pageno>.<suffix>
<document><pageno>.txt
<document><pageno>.json
The <document> is the root name of the document, and <pageno> is the corresponding page number starting at page 1. The file ending in the original file suffix <suffix> is the split page. The file ending in the file suffix .txt is the extracted text. The file ending in the file suffix .json is the NLP preprocessed machine learning data stored in a JSON format.
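The naming scheme can be sketched as follows. The helper `page_filenames` is hypothetical (it is not a Gap function), and the sketch assumes the page number follows the root name directly with no separator.

```python
import os


def page_filenames(document_path, pageno, out_dir="./"):
    """Build the three per-page file paths for a 1-based page number."""
    root, suffix = os.path.splitext(os.path.basename(document_path))
    stem = os.path.join(out_dir, root + str(pageno))
    return {
        "page": stem + suffix,   # the split page in the original format
        "text": stem + ".txt",   # the extracted text
        "json": stem + ".json",  # the NLP preprocessed data
    }


names = page_filenames("/somedir/mydocument.pdf", 1, "/mypages")
print(os.path.basename(names["page"]))  # mydocument1.pdf
```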
If the ehandler parameter is not None, then the above will occur asynchronously, and when completed, the corresponding event handler will be called with the Document object passed as a parameter.
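The asynchronous pattern can be illustrated with a background thread. This is only a sketch of the callback flow: the specification does not say which concurrency mechanism Gap uses, so `threading` is an assumption, and `process_async` and the dictionary standing in for a Document object are hypothetical.

```python
import threading


def process_async(document, work, ehandler):
    """Run work(document) on a background thread, then call ehandler(document)."""
    def runner():
        work(document)
        ehandler(document)  # signal completion, passing the processed object

    thread = threading.Thread(target=runner)
    thread.start()
    return thread


done = []
thread = process_async({"name": "mydocument"},
                       lambda doc: doc.update(pages=3),  # stand-in for the real work
                       done.append)                      # stand-in event handler
thread.join()  # wait here only so the demo output is deterministic
print(done)
```

The caller gets control back immediately after `process_async` returns; the handler runs on the worker thread once processing finishes.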
If the path to the document file is remote (i.e., starts with http), an HTTP request will be made to fetch the contents of the file from the remote location.
Exceptions
A TypeError is raised if the type of the parameter is not the expected type.
An AttributeError is raised if an invalid configuration setting is specified.
A FileNotFoundError is raised if the document file does not exist.
An IOError is raised if an error occurs reading in the document file.
1.3 Document Properties
1.3.1 document
Synopsis
# Getter
path = document.document
# Setter
document.document = path
Usage
When used as a getter the property returns the path to the document file.
When used as a setter the property specifies the path of the document file to preprocess into machine learning ready data (see initializer).
Exceptions
A TypeError is raised if the type of the parameter is not the expected type.
A FileNotFoundError is raised if the document file does not exist.
An IOError is raised if an error occurs reading in the document file.
1.3.2 name
Synopsis
# Getter
root = document.name
Usage
When used as a getter the property returns the root name of the document file (e.g., /mydir/mydocument.pdf -> mydocument).
1.3.3 type
Synopsis
# Getter
suffix = document.type
Usage
When used as a getter the property returns the file suffix of the document file (e.g., pdf).
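The values returned by the name and type properties can be derived from the path with the standard library. This is a minimal sketch; `root_name_and_type` is a hypothetical helper, and how Gap computes these values internally is not stated here.

```python
import os


def root_name_and_type(path):
    """Split a document path into its root name and its suffix without the dot."""
    root, ext = os.path.splitext(os.path.basename(path))
    return root, ext.lstrip(".")


print(root_name_and_type("/mydir/mydocument.pdf"))  # ('mydocument', 'pdf')
```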
1.3.4 size
Synopsis
# Getter
size = document.size
Usage
When used as a getter the property returns the file size of the document file in bytes.
1.3.5 dir
Synopsis
# Getter
subfolder = document.dir
# Setter
document.dir = subfolder
Usage
When used as a getter the property returns the directory path where the corresponding files of the associated page objects are stored.
When used as a setter, it is only applicable when used in conjunction with the load() method, indicating the path where the files associated with the page objects are stored. Otherwise, it is ignored.
Exceptions
A TypeError is raised if the type of the parameter is not the expected type.
A FileNotFoundError is raised if the directory does not exist.
1.3.6 label
Synopsis
# Getter
label = document.label
# Setter
document.label = label
Usage
When used as a getter the property returns the integer label specified for the document.
When used as a setter the property sets the label of the document to the specified integer value.
Exceptions
A TypeError is raised if the type of the parameter is not the expected type.
1.3.7 lang
Synopsis
# Getter
lang = document.lang
Usage
When used as a getter the property returns the language of the document, which may be either 'en' (English), 'es' (Spanish), 'fr' (French), 'de' (German), or 'it' (Italian).
1.3.8 scanned
Synopsis
# Getter
scanned, quality = document.scanned
Usage
When used as a getter the property returns whether the document is a scanned image (True) or a digital text (False) document, and the estimated quality of the scan as a fraction between 0 and 1.
1.3.9 time
Synopsis
# Getter
secs = document.time
Usage
When used as a getter the property returns the amount of time (in seconds) it took to preprocess the document into machine learning ready data.
1.3.10 text
Synopsis
# Getter
text = document.text
Usage
When used as a getter the property returns a list, one entry per page, of the extracted text from the document in its original Unicode format.
1.3.11 pages
Synopsis
# Getter
pages = document.pages
1.3.12 bagOfWords
Synopsis
# Getter
bag = document.bagOfWords
Usage
When used as a getter the property returns the document's word sequences as a Bag of Words, represented as an unordered dictionary, where the key is the word and the value is the number of occurrences:
{ '<word>' : <no. of occurrences>, … }
1.3.13 freqDist
Synopsis
# Getter
freq = document.freqDist
Usage
When used as a getter the property returns the sorted tuples of a frequency distribution of words (from bagOfWords), in descending order (i.e., highest first):
[ ( '<word>', <no. of occurrences> ), … ]
1.3.14 termFreq
Synopsis
# Getter
tf = document.termFreq
Usage
When used as a getter the property returns the sorted tuples of a term frequency distribution (i.e., the percentage of occurrences of each term), in descending order (i.e., highest first):
[ ( '<word>', <percentage of occurrences> ), … ]
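The relationship between the three properties can be sketched with `collections.Counter`. The function names below are hypothetical, and the sketch assumes the term frequency is expressed as a fraction of all word occurrences; it is not Gap's implementation.

```python
from collections import Counter


def bag_of_words(words):
    """Count occurrences of each word."""
    return dict(Counter(words))


def freq_dist(bag):
    """Sort (word, count) tuples by count, highest first."""
    return sorted(bag.items(), key=lambda item: item[1], reverse=True)


def term_freq(bag):
    """Sort (word, fraction of all occurrences) tuples, highest first."""
    total = sum(bag.values())
    return [(word, count / total) for word, count in freq_dist(bag)]


words = ["page", "text", "page", "ocr", "page", "text"]
bag = bag_of_words(words)
print(freq_dist(bag))
print(term_freq(bag))
```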
1.3.15 Static Variables
The Document class contains the following static variables:
- RESOLUTION – The image resolution when converting PDF to PNG for OCR (default 300).
- SCANCHECK – The number of OCR words to check to estimate the quality of the scan.
- WORDDICT – The word dictionary to use for scan spell check (defaults to norvig).
1.4 Document Overridden Operators
1.4.1 len()
Synopsis
npages = len(document)
Usage
The len() (__len__) operator is overridden to return the number of pages in the document.
1.4.2 +=
Synopsis
document += page
Usage
The += (__iadd__) method is overridden to append a Page object to the document.
1.4.3 []
Synopsis
page = document[n]
document[n] = page
Usage
The [] (__getitem__) operator is overridden to return the Page object at the specified index.
The __setitem__() method is overridden to replace the Page object at the specified index (i.e., page number – 1).
Exceptions
An IndexError is raised if the index is out of range.
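The overridden operators map onto Python's special methods. The class below is a hypothetical stand-in that models only the page index, using plain strings in place of Page objects; it is a sketch of the pattern, not the Document class itself.

```python
class PageList:
    """Hypothetical stand-in for Document that only models the page index."""

    def __init__(self):
        self._pages = []

    def __len__(self):
        return len(self._pages)

    def __iadd__(self, page):
        self._pages.append(page)
        return self  # += rebinds the name to whatever __iadd__ returns

    def __getitem__(self, index):
        return self._pages[index]  # raises IndexError when out of range

    def __setitem__(self, index, page):
        self._pages[index] = page  # raises IndexError when out of range


doc = PageList()
doc += "page one"
doc += "page two"
doc[1] = "page two (revised)"
print(len(doc), doc[0], doc[1])
```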
1.4.4 str()
Synopsis
label = str(document)
Usage
The str() (__str__) operator is overridden to return the label of the document as a string.
1.5 Document Public Methods
1.5.1 load()
Synopsis
document.load(name, dir=None)
Parameters
name: The name of the document.
Usage
This method will load into memory the preprocessed machine learning ready data from the corresponding JSON files specified by the document (root) name. The method will load the JSON files by the filename <name><pageno>.json. If dir is None, then it will look for the files where the current value for dir is defined (either locally or reset by the dir property). Otherwise, it will look for the files under the directory specified by the dir parameter.
Once loaded, the Document object will have the same characteristics as when the Document object was created.
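Locating and reading the per-page JSON files can be sketched as follows. The function `load_pages` is hypothetical, and the JSON content used in the demo (a list of words per page) is an assumption, since the stored layout is not defined in this section.

```python
import json
import os
import re
import tempfile


def load_pages(name, directory="./"):
    """Return the parsed JSON of every <name><pageno>.json file, in page order."""
    if name is None:
        raise ValueError("name must not be None")
    pattern = re.compile(re.escape(name) + r"(\d+)\.json$")
    found = []
    for filename in os.listdir(directory):
        match = pattern.match(filename)
        if match:
            found.append((int(match.group(1)), filename))
    pages = []
    for _, filename in sorted(found):  # numeric sort, so page 10 follows page 9
        with open(os.path.join(directory, filename), encoding="utf-8") as f:
            pages.append(json.load(f))
    return pages


# Demo with two throwaway page files written out of order.
with tempfile.TemporaryDirectory() as tmp:
    for pageno, words in [(2, ["second"]), (1, ["first"])]:
        path = os.path.join(tmp, "mydocument%d.json" % pageno)
        with open(path, "w", encoding="utf-8") as f:
            json.dump(words, f)
    loaded = load_pages("mydocument", tmp)
print(loaded)
```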
Exceptions
A TypeError is raised if the type of the parameter is not the expected type.
A ValueError is raised if the name parameter is None.
1.6 Document Private Methods
The Document class contains the following private methods:
- _exists() – This method checks if the document exists at the specified stored path. If not, a FileNotFoundError exception is raised.
- _collate() – This method performs the collation task, which includes:
  - Determines the number of pages in the document.
  - Splits the document into individual pages, where each page is individually stored in the same format as the document. The pages are named as follows:
    <name><pageno>.<suffix>
    Each page is stored in the subdirectory specified by the property dir. If dir is None, then the page is stored in the same directory where the program is run; otherwise, if the subdirectory does not exist, it is created.
    - If the page is a scanned PDF page, the scanned image is extracted and saved as a PNG image. The PNG image is then OCR'd to convert to text.
      <name><pageno>.png
    - If the page is a TIFF facsimile, the TIFF image is then OCR'd to convert to text.
      <name><pageno>.tif
    - If the page is an image capture (e.g., camera capture), the captured image (e.g., JPG) is then OCR'd to convert to text.
      <name><pageno>.jpg
  - Extracts the raw text from the page, where each page is individually stored in a raw text format. The pages are named as follows:
    <name><pageno>.txt
    Each page is stored in the subdirectory specified by the property dir. If dir is None, then the page is stored in the same directory where the program is run.
  - Creates a Page object for each page and adds it to the pages index property.
  - If the document format is raw text, then:
    - Treats as a single page.
    - Stores only a single page text file.
  - If the document format is PDF, then page splitting and extraction of the raw text per page is done with the open source version of Ghostscript. If the document is a scanned PDF, the image is extracted and converted to PNG using Ghostscript and then OCR'd using open source Tesseract.
  - If the document format is TIFF, then page splitting is done with the open source Magick and then OCR'd using open source Tesseract.
- _langcheck() – This method is called after NLP preprocessing of the document has been completed. The method will sample up to twenty words to probabilistically determine the language of the document. The detected languages are English, French, German, Italian, and Spanish.
- _scancheck() – This method is called after NLP preprocessing of the document has been completed, and the document was a scanned image. The method will sample up to SCANCHECK number of words for recognition in the detected language dictionary (i.e., English, French, German, Italian, or Spanish). The method will check the words on either page 1 or page 2, depending on which page has a greater number of words. Punctuation, symbols, acronyms or single letter words are excluded. The method then sets the internal variable _quality to the fraction of the words that were recognized (between 0 and 1).
- _async() – This method performs asynchronous processing of the _collate() function when the optional ehandler parameter to the constructor is not None. When processing is completed, the ehandler parameter value is called as a function to signal completion of the processing, and the Document object is passed as a parameter.
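The scan quality estimate described for _scancheck() can be sketched as a dictionary lookup over a filtered sample of words. Everything named below is hypothetical: the tiny dictionary, the treatment of all upper case tokens as acronyms, and taking the first words on the page rather than a random sample are assumptions made for illustration.

```python
def pick_page(pages):
    """Choose page 1 or page 2 (lists of words), whichever has more words."""
    if not pages:
        return []
    return max(pages[:2], key=len)


def scan_quality(words, dictionary, limit):
    """Return the fraction (0 to 1) of sampled words found in the dictionary."""
    candidates = [w.lower() for w in words
                  if len(w) > 1                        # skip single letters and lone symbols
                  and any(ch.isalpha() for ch in w)    # skip pure punctuation or digits
                  and not w.isupper()]                 # assumption: all caps means acronym
    sample = candidates[:limit]
    if not sample:
        return 0.0
    recognized = sum(1 for w in sample if w in dictionary)
    return recognized / len(sample)


dictionary = {"the", "quick", "brown", "fox"}
pages = [["Page", "1"],
         ["The", "qu1ck", "brown", "f0x", "NLP", "a", "fox", "!"]]
quality = scan_quality(pick_page(pages), dictionary, limit=20)
print(quality)
```

In this demo three of the five sampled words are recognized, so OCR errors such as "qu1ck" lower the estimate.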
2 Page
2.1 Page Overview
The page classifier contains the following primary classes and their relationships:
- Page – This is a base class for the representation of an extracted page from a document. The constructor for the class object takes optionally as parameters the stored path to the page, and the extracted raw text.
page = Page('/mypages/page1.pdf', 'some text')
- Words – This is a base class for representation of the text as an NLP preprocessed list of words.
Fig. 2a High Level view of Page Class Object Relationships
2.2 Page Initializer (Constructor)
Synopsis
Page( page=None, text=None, pageno=None)
Parameters
page: If not None, the local path to the page.
text: If not None, the text corresponding to the page.
pageno: If not None, the page number in the corresponding Document object.
Usage
If the text parameter is not None, a Words object is created and instantiated with the corresponding text. The text is then NLP preprocessed according to the configuration settings stored as static members in the Page class (i.e., set by the parent Document object):
BARE: If True, then the bare configuration setting is passed to the Words object.
STEM: If not None, then the stem configuration setting is passed to the Words object.
ROMAN: If True, then the roman configuration setting is passed to the Words object.
POS: If True, then the pos configuration setting is passed to the Words object.
Exceptions
A TypeError is raised if the type of the parameter is not the expected type.
A FileNotFoundError is raised if the file specified by the page parameter does not exist.
2.3 Page Properties
2.3.1 path
Synopsis
# Getter
path = page.path
# Setter
page.path= path
Usage
When used as a getter the property returns the path of the corresponding page (i.e., split by the Document object) in its native format.
When used as a setter the property sets the path of the corresponding split page.
Exceptions
A TypeError is raised if the type of the parameter is not the expected type.
A FileNotFoundError is raised if the file specified by path does not exist.
2.3.2 pageno
Synopsis
# Getter
pageno = page.pageno
Usage
When used as a getter the property returns the pageno set for the Page object in the corresponding parent Document object.
2.3.3 size
Synopsis
# Getter
nbytes = page.size
Usage
When used as a getter the property returns the byte size of the text parameter.
2.3.4 label
Synopsis
# Getter
label = page.label
# Setter
page.label = label
Usage
When used as a getter the property returns the integer label that has been assigned to the page.
When used as a setter the property assigns a label to the page.
Exceptions
A TypeError is raised if the type of the parameter is not the expected type.
2.3.5 words
Synopsis
# Getter
words = page.words
Usage
When used as a getter the property returns the Words object of the corresponding NLP preprocessed text.
2.3.6 bagOfWords
Synopsis
# Getter
bag = page.bagOfWords
Usage
When used as a getter the property returns the page's word sequences as a Bag of Words, represented as an unordered dictionary, where the key is the word and the value is the number of occurrences:
{ '<word>' : <no. of occurrences>, … }
2.3.7 freqDist
Synopsis
# Getter
freq = page.freqDist
Usage
When used as a getter the property returns the sorted tuples of a frequency distribution of words (from bagOfWords), in descending order (i.e., highest first):
[ ( '<word>', <no. of occurrences> ), … ]
2.3.8 termFreq
Synopsis
# Getter
tf = page.termFreq
Usage
When used as a getter the property returns the sorted tuples of a term frequency distribution (i.e., the percentage of occurrences of each term), in descending order (i.e., highest first):
[ ( '<word>', <percentage of occurrences> ), … ]
2.3.9 Static Variables
The Page class contains the following static variables:
BARE: If True, then the bare configuration setting is passed to the Words object.
STEM: If not None, then the stem configuration setting is passed to the Words object.
ROMAN: If True, then the roman configuration setting is passed to the Words object.
POS: If True, then the pos configuration setting is passed to the Words object.
2.4 Page Overridden Operators
2.4.1 len()
Synopsis
nwords = len(page)
Usage
The len() (__len__) operator is overridden to return the number of NLP tokenized words in the page.
2.4.2 +=
Synopsis
page += text
Usage
The += (__iadd__) method is overridden to append text to the page, which is then NLP preprocessed.
Exceptions
A TypeError is raised if the type of the parameter is not the expected type.
2.4.3 str()
Synopsis
label = str(page)
Usage
The str() (__str__) operator is overridden to return the label of the page as a string.
2.5 Page Private Methods
The Page class contains no private methods.
2.6 Page Public Methods
2.6.1 store()
Synopsis
page.store(path)
Parameters
path: the file path to write to.
Usage
The store() method writes the NLP tokenized sequence as a JSON object to the specified file.
Exceptions
A TypeError is raised if the type of the parameter is not the expected type.
A FileNotFoundError is raised if the file path is invalid.
2.6.2 load()
Synopsis
page.load(path)
Parameters
path: the file path to read from.
Usage
The load() method reads the NLP tokenized sequence as a JSON object from the specified file.
Exceptions
A TypeError is raised if the type of the parameter is not the expected type.
A FileNotFoundError is raised if the file path is invalid.
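A store and load round trip can be sketched with the standard `json` module. The functions `store_words` and `load_words` are hypothetical, and the stored layout (a flat list of words) is an assumption. The sketch opens the files with an explicit UTF-8 encoding so that non-ASCII text survives the round trip.

```python
import json
import os
import tempfile


def store_words(path, words):
    """Write the tokenized words to path as JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(words, f, ensure_ascii=False)


def load_words(path):
    """Read the tokenized words back from a JSON file."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "mydocument1.json")
    store_words(path, ["страница", "text"])
    restored = load_words(path)
print(restored)
```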
APPENDIX I: Updates
Pre-Gap (Epipog) v1.1
1. Added time property.
2. Added scanned property.
3. Added support for TIFF and JPG/PNG.
Pre-Gap (Epipog) v1.3
1. Add direct read of PDF resource element to determine if scanned page.
2. Fix not detecting scanned PDF if text extraction produced noise.
Pre-Gap (Epipog) v1.4
1. Added pageno property to Page class.
2. Added methods store() and load() to Page class to store/load NLP tokenized words to file.
3. Added method load() to Document class to reload NLP tokenized words from storage.
4. Added config keyword argument to Document initializer to configure NLP preprocessing.
Pre-Gap (Epipog) v1.5
1. Added bagOfWords, freqDist, and termFreq properties to Document and Page class.
Gap v0.9.1 (alpha)
1. Rewrote Specification.
2. Add OCR quality estimate.
Gap v0.9.2 (alpha)
1. Add language detection for English, Spanish, French, German and Italian.
APPENDIX II: Anticipated Engineering
The following have been identified as enhancements/issues to be addressed in a subsequent update:
- What does it mean to add text to a document.
- Break raw text into pages for > 50 lines.
- Refactor page counting for faster performance.
- Add page split endpoint for streaming interface and URL.
- Add more pdf test files.
- Fix bug of not handling Cyrillic characters in page load() method.
Copyright ©2018, Epipog, All Rights Reserved