Gap Framework - Natural Language Processing for PDF/TIFF/Image Documents
SPLITTER Module
High Precision PDF Page Splitting/OCR/Text Extraction
Technical Specification, Gap v0.9.2
1 Document
1.1 Document Overview
The document classifier contains the following primary classes, and their relationships:
- Document – This is the base class for the representation of a stored document. The constructor for the class object takes as parameters the stored path to the document, optionally a directory path for storing extracted pages and text, and optionally an event completion handler when processing the document asynchronously, and optionally a config parameter for configuring the NLP preprocessing.
 
document = Document(“/somedir/mydocument.pdf”, “/mypages/mydocument”)
The constructors calls the _exists() and _collate() private methods for the specified document.
- Page – This is a base class for the representation of an extracted page from the document. The 
Documentclass contains a list (index) of the extracted pages asPageobjects. 

Fig. 1a High Level view of Document Class Object Relationships
1.2 Initializer (Constructor)
Synopsis
Document( document=None, dir=’./’, ehandler=None,  config=None)
Parameters
document: If not None, a string that is either:
1. local path to document  
2. remote path to document ((i.e., http[s]://…)
The document must be one of the following types: PDF, JPG (JPEG, J2K), PNG, BMP or TIF (TIFF)
dir: The directory where to store the machine learning ready data.
ehandler: If not None, the processing of the images into machine learning ready data will be asynchronous, and the value of the parameter is the function (or method) that is the event handler when processing is complete.
The event handler takes the form:
def myHandler(images): 
    # where images is the Images object that was preprocessed.
config: If not None, a list of one or more configuration settings as strings:
bare
pos
roman
segment
stem=gap|porter|Lancaster|snowball|lemma
Usage
When specified with no parameters, an empty Document object is created. The Document object may then be used to subsequent load (retrieve) previously stored preprocessed machine learning ready data (see load()).
Otherwise, the document parameter must be specified.
The document specified by the document parameter will be preprocessed according to the optional parameters and configuration settings.
By default, the document will be preprocessed as follows:
- The document will be split into individual pages.
 - A 
Pageobject will be created for each page. - If the document (or page) is an image (e.g., scanned PDF), it will be OCR’d.
 - The digital text will be extracted from each page and stored in the 
Pageobject. - The text will be optionally segmented into regions if the configuration setting segment is specified.
 - The text from each page object will be preprocessed into machine learning ready data (see syntax module specification), according to the optional parameters and configuration settings.
 - If the document was a scanned or image document, the quality of the scan will be estimated, unless 
Document.SCANCHECKis set to zero. 
The machine learning ready data will be stored on a per page basis in the directory specified by the parameter dir.
The following files are created and stored:
<document><pageno>.<suffix>
<document><pageno>.txt
<document>.<pageno>.json
The <document> is the root name of the document, and <pageno> is the corresponding page number starting at page 1. The file ending in the original file suffix <suffix> is the split page. The file ending in the file suffix .txt is the extracted text. The file ending in the file suffix .json is the NLP preprocessed machine learning data stored in a JSON format.
If the ehandler parameter is not None, then the above will occur asynchronously, and when completed, the corresponding event handler will be called with the Document object passed as a parameter.
If the path to the document file is remote (i.e., starts with http), an HTTP request will be made to fetch the contents of the file from the remote location.
Exceptions
A TypeError is raised if the type of the parameter is not the expected type.
A AttributeError is raised if an invalid configuration setting is specified.
A FileNotFoundError is raised if the document file does not exist.
A IOError is raised if an error occurs reading in the document file.
1.3 Document Properties
1.3.1 document
Synopsis
# Getter
path = document.document
# Setter
document.document = path    
Usage
When used as a getter the property returns the path to the document file.
When used as a setter the property specifies the path of the document file to preprocess into machine learning ready data (see initializer).
Exceptions
A TypeError is raised if the type of the parameter is not the expected type.
A FileNotFoundError is raised if the document file does not exist.
A IOError is raised if an error occurs reading in the document file.
1.3.2 name
Synopsis
# Getter
root = document.name
Usage
When used as a getter the property returns the root name of the document file (e.g., /mydir/mydocument.pdf -> mydocument).
1.3.3 type
Synopsis
# Getter
suffix = document.type
Usage
When used as a getter the property returns the file suffix of the document file (e.g., pdf).
1.3.4 size
Synopsis
# Getter
size = document.size
Usage
When used as a getter the property returns the file size of the document file in bytes.
1.3.5 dir
Synopsis
# Getter
subfolder = document.dir
# Setter
document.dir = subfolder    
Usage
When used as a getter the property returns the directory path where the corresponding files of the associated page objects are stored.
When used as a setter, it is only applicable when used in conjunction with the load() method, indicating where the path where the files associated with the page objects are stored. Otherwise, it is ignored.
Exceptions
A TypeError is raised if the type of the parameter is not the expected type.
A FileNotFoundError is raised if the directory does not exist.
1.3.6 label
Synopsis
# Getter
label = document.label
# Setter
document.label = label  
Usage
When used as a getter the property returns the integer label specified for the document.
When used as a setter the property sets the label of the document to the specified integer value.
Exceptions
A TypeError is raised if the type of the parameter is not the expected type.
1.3.7 lang
Synopsis
# Getter
lang = document.lang
Usage
When used as a getter the property returns the language of the document, which may be either 'en' (English), 'es' (Spanish), 'fr' (French), 'de' (German), or 'it' (Italian).
1.3.8 scanned
Synopsis
# Getter
scanned, quality = document.scanned
Usage
When used as a getter the property returns whether the document is a scanned image True or digital text False document, and the estimated quality of the scan as a percentage (between 0 and 1).
1.3.9 time
Synopsis
# Getter
secs = document.time
Usage
When used as a getter the property returns the amount of time (in seconds) it took to preprocess the document into machine learning ready data.
1.3.10 text
Synopsis
# Getter
text = document.text
Usage
When used as a getter the property returns a list, one entry per page, of the extracted text from the document in its original Unicode format.
1.3.11 pages
Synopsis
# Getter
pages = document.pages
1.3.12 bagOfWords
Synopsis
# Getter
bag = document.bagOfWords   
Usage
When used as a getter the property returns the document’s word sequences as a Bag of Words, represented as an unordered dictionary, where the key is the word and the value is the number of occurrences:
{ ‘<word’> : <no. of occurrences>, … }
1.3.13 freqDist
Synopsis
# Getter
freq = document.freqDist    
Usage
When used as a getter the property returns the sorted tuples of a frequency distribution of words (from BagOfWords), in descending order (i.e., highest first)
[ ( ‘<word’>: <no.  of occurrences> ), … ]
1.3.14 termFreq
Synopsis
# Getter
tf = document.termFreq
Usage
When used as a getter the property returns the sorted tuples of a term frequency distribution (i.e., percent that term occurs), in descending order (i.e., highest first)
[ ( ‘<word’>: <percentage  of occurrences> ), …. ]
1.3.15 Static Variables
The Document class contains the following static variables:
- RESOLUTION – The image resolution when converting 
PDFtoPNGforOCR(default300). - SCANCHECK  – The number of 
OCRwords to check to estimate the quality of the scan. - WORDDICT   - The word dictionary to use for scan spell check (default to 
norvig). 
1.4 Document Overridden Operators
1.4.1 len()
Synopsis
npages = len(document)
Usage
The len() (__len__) operator is overridden to return the number of pages in the document.
1.4.2 +=
Synopsis
document += page
Usage
The += (__iadd__) method is overridden to append a Page object to the document.
1.4.3 []
Synopsis
page= documents[n] 
document[n] = page
Usage
The [] (__getitem__) operator is overridden to return the Page object at the specified index.
The __setitem__() method is overridden to replace the Page object at the specified index (i.e., page number – 1).
Exceptions
A IndexError is raised if the index is out of range.
1.4.4 str()
Synopsis
label = str(image)
Usage
The str() (__str__) operator is overridden to return the label of the document as a string.
1.5 Document Public Methods
1.5.1 load()
Synopsis
document.load(name, dir=None) 
Parameters
name: The name of the document.
Usage
This method will load into memory a preprocessed machine learning ready data from the corresponding JSON files specified by the document (root) name. The method will load the JSON files by the filename <name><pageno>.json. If dir is None, then it will look for the files where the current value for dir is defined (either locally or reset by the dir property). Otherwise, it will look for the files under the directory specified by the dir parameter.
Once loaded, the Document object will have the same characteristics as when the Document object was created.
Exceptions
A TypeError is raised if the type of the parameter is not the expected type.
A ValueError is raised if the name parameter is None.
1.6 Document Private Methods
The Document class contains the following private methods:
_exists()– This method checks if the document exists at the specified stored path. If not, aFileNotFoundexception is thrown._collate()– This method performs the collation task, which includes:- Determines the number of pages in the document.
 - 
Splits the document into individual pages, where each page is individually stored in the same format as the document. The pages are named as follows:
<name><pageno>.<suffix>Each page is stored in the subdirectory specified by the property dir. If
dirisNone, then the page is stored in the same directory where program is ran; otherwise, if the subdirectory does not exist, it is created.- If the page is a scanned PDF page, the scanned image is extracted and saved as a PNG image. The PNG image is then OCR’d to convert to text.
 
<name><pageno>.png- If the page is a TIFF facsimile, the TIFF image is then OCR’d to convert to text.
 
<name><pageno>.tif- If the page is an image capture (e.g. camera capture), the captured image (e.g., JPG) is then OCR’d to convert to text.
 
<name><pageno>.jpg- Extracts the raw text from the page , where each page is individually stored in a raw text format. The pages are named as follows:
 
<name><pageno>.txtEach page is stored in the subdirectory specified by the property
dir. IfdirisNone, then the page is stored in the same directory where program is ran.- 
Create a
Pageobject for each page and adds them to the pages index property. - 
If the document format is raw text, then:
- Treats as a single page.
 - Stores only a single page text file.
 
 - If the document format is PDF, then page splitting and extraction of the raw text per page is done with the open source version of Ghostscript. If the document is a scanned PDF, the image is extracted and converted to PNG using Ghostscript and then OCR’d using open source Tesseract.
 - If the document format is TIFF, then page splitting is done with the open source Magick and then OCR’d using open source Tesseract.
 _langcheck()– This method is called after NLP preprocessing of the document has been completed. The method will sample upto twenty words to probabilistically determine the language of the document. The detected languages are English, French, German, Italian, and Spanish._scancheck()– This method is called after NLP preprocessing of the document has been completed, and the document was a scanned image. The method will sample uptoSCANCHECKnumber of words for recognition in the detected language dictionary (i.e., English, French, German, Italian, or Spanish). The method will check the words on either page 1 or page 2, depending on which page has a greater number of words. Punctuation, symbols, acronyms or single letter words are excluded. The method then sets the internal variable_qualityto the percentage of the words that were recognized (between 0 and 1)._async()– This method performs asynchronous processing of the_collate()function, when the optional ehandler parameter to the constructor is notNone. When processing is completed, theehandlerparameter value is called as a function to signal completion of the processing, and theDocumentobject is passed as a parameter.
 
2 Page
2.1 Page Overview
The page classifier contains the following primary classes, and their relationships:
Page– This is a base class for the representation of an extracted page from a document. The constructor for the class object takes optionally as parameters the stored path to the page, and the extracted raw text.
page = Page( ‘/mypages/page1.pdf’, ‘some text’)
Words– This is a base class for representation of the text as NLP preprocessed list of words.

Fig. 2a High Level view of Page Class Object Relationships
2.2 Page Initializer (Constructor)
Synopsis
Page( page=None, text=None,  pageno=None)
Parameters
page:   If not None, the local path to the page.
text:   If not None, the text corresponding to the page.
pageno: If not None, the page number in the corresponding Document object.
Usage
If the text parameter is not None, a Words object is created and instantiated with the corresponding text. The text is then NLP preprocessed according to the configuration settings stored as static members in the Page class (i.e., set by the parent Document object):
BARE  : If True, then the bare configuration setting is passed to the Words object.
STEM  : If not None, then the stem configuration setting is passed to the Words object.
ROMAN : If True, then the roman configuration setting is passed to the Words object.
POS   : If True, then the pos configuration setting is passed to the Words object.
Exceptions
A TypeError is raised if the type of the parameter is not the expected type.
A FileNotFoundError is raised if the file specified by page parameter does not exist.
2.3 Page Properties
2.3.1 path
Synopsis
# Getter
path = page.path
# Setter
page.path= path 
Usage
When used as a getter the property returns the path of the corresponding page (i.e., split by Document object) in its native format.
When used as a setter the property sets the path of the corresponding split page.
Exceptions
A TypeError is raised if the type of the parameter is not the expected type.
A FileNotFoundError is raised if the file specified by path does not exist.
2.3.2 pageno
Synopsis
# Getter
pageno = page.pageno
Usage
When used as a getter the property returns the pageno set for the Page object in the corresponding parent Document object.
2.3.3 size
Synopsis
# Getter
nbytes = page.size
Usage
When used as a getter the property returns the byte size of the text parameter.
2.3.4 label
Synopsis
# Getter
label = page.label
# Setter
page.label = label
Usage
When used as a getter the property returns the integer label that has been assigned to the page.
When used as a setter the property assigns a label to the page.
Exceptions
A TypeError is raised if the type of the parameter is not the expected type.
2.3.5 words
Synopsis
# Getter
words = page.words
Usage
When used as a getter the property returns the Words object of the corresponding NLP preprocessed text.
2.3.6 bagOfWords
Synopsis
# Getter
bag = page.bagOfWords   
Usage
When used as a getter the property returns the page’s word sequences as a Bag of Words, represented as an unordered dictionary, where the key is the word and the value is the number of occurrences:
{ ‘<word’> : <no. of occurrences>, … }
2.3.7 freqDist
Synopsis
# Getter
freq = page.freqDist    
Usage
When used as a getter the property returns the sorted tuples of a frequency distribution of words (from BagOfWords), in descending order (i.e., highest first)
[ ( ‘<word’>: <no.  of occurrences> ), …. ]
2.3.8 termFreq
Synopsis
# Getter
tf = page.termFreq
Usage
When used as a getter the property returns the sorted tuples of a term frequency distribution (i.e., percent that term occurs), in descending order (i.e., highest first)
[ ( ‘<word’>: <percentage  of occurrences> ), …. ]
2.3.9 Static Variables
The Page class contains the following static variables:
BARE  : If True, then the bare configuration setting is passed to the Words object.
STEM  : If not None, then the stem configuration setting is passed to the Words object.
ROMAN : If True, then the roman configuration setting is passed to the Words object.
POS     : If True, then the pos configuration setting is passed to the Words object.
2.4 Page Overwritten Operators
2.4.1 len()
Synopsis
nwords = len(pages)
Usage
The len() (__len__) operator is overridden to return  the number of NLP tokenized words in the page.
2.4.2 +=
Synopsis
page += text
Usage
The += (__iadd__) method is overridden to append text to the page, which is then NLP preprocessed.
Exceptions
A TypeError is raised if the type of the parameter is not the expected type.
2.4.3 str()
Synopsis
label = str(image)
Usage
The str() (__str__) operator is overridden to return the label of the page as a string.
2.5 Page Private Methods
The Page class contains no private methods.
2.6 Page Public Methods
2.6.1 store()
Synopsis
image.store(path)
Parameters
path: the file path to write to.
Usage
The store() method writes the NLP tokenized sequence as a JSON object to the specified file.
Exceptions
A TypeError is raised if the type of the parameter is not the expected type.
A FileNotFoundError is raised if the file path is invalid.
2.6.2 load()
Synopsis
image.load(path) 
Parameters
path: the file path to read from.
Usage
The load() method writes the NLP tokenized sequence as a JSON object from the specified file.
Exceptions
A TypeError is raised if the type of the parameter is not the expected type.
A FileNotFoundError is raised if the file path is invalid.
APPENDIX I: Updates
Pre-Gap (Epipog) v1.1
1.  Added time property.
2.  Added scanned property.
3.  Added support for TIFF and JPG/PNG.
Pre-Gap (Epipog) v1.3
1.  Add direct read of PDF resource element to determine if scanned page.
2.  Fix not detecting scanned PDF if text extraction produced noise.
Pre-Gap (Epipog) v1.4
1.  Added pageno property to Page class.
2.  Added methods store() and load() to Page class to store/load NLP tokenized words to file.
3.  Added method load() to Document class to reload NLP tokenized words from storage.
4.  Added config keyword arguent to Document initializer to configure NLP preprocessing.
Pre-Gap (Epipog) v1.5
1.  Added bagOfWords, freqDist, and termFreq properties to Document and Page class.
Gap v0.9.1 (alpha)
1.  Rewrote Specification.
2.  Add OCR quality estimate.
Gap v0.9.2 (alpha)
1. Add language detection for English, Spanish, French, German and Italian.
APPENDIX II: Anticipated Engineering
The following has been identified as enhancement/issues to be addressed in subsequent update:
- What does it mean to add text to a document.
 - Break raw text into pages for > 50 lines.
 - Refactor page counting for faster performance.
 - Add page split endpoint for streaming interface and URL.
 - Add more pdf test files.
 - Fix bug of not handling Cryllic characters in page 
load()method. 
Copyright ©2018, Epipog, All Rights Reserved