wxIScan
wxIScanHocr2Pdf Class Reference

DOM tree traverser class for wxIScanFrame::AddPdfPage().

#include <wxiscanhocr2pdf.h>

Inherited by wxIScanSmartHocr2Pdf.

Public Member Functions

 wxIScanHocr2Pdf (wxXmlDocument *poXmlDoc, wxPdfDocument *poPdfDoc, int nResolution, const wxString &strHocrClassFilter=wxT("ocr_line"))
 Standard constructor.
 ~wxIScanHocr2Pdf ()
 Virtual destructor.
virtual void Run ()
 Traverse the DOM tree by calling TraverseXmlNodes(), then flush outstanding operations at the end.

Public Attributes

wxString m_strHocrClassFilter
 Filter for the hOCR 'class' attribute.

Protected Member Functions

virtual void TraverseXmlNodes (wxXmlNode *poNode)
 Traverse recursively through the DOM tree beginning with the given node.
virtual wxString GetNodeContent (wxXmlNode *poNode)
 Get all text content from all levels below this node.
virtual void Print2Pdf (double x, double y, const wxString &strText)
 Print the given text at the given coordinates.
virtual void Flush2Pdf ()
 Flush outstanding print-to-PDF commands.

Protected Attributes

wxXmlDocument * m_poXmlDoc
 The pointer to the XML document.
wxPdfDocument * m_poPdfDoc
 The pointer to the PDF document.
int m_nResolution
 The (virtual) resolution of an image in dpi.

Detailed Description

DOM tree traverser class for wxIScanFrame::AddPdfPage().

This is a helper class that traverses an hOCR DOM tree and "prints" the text of the hOCR XML file onto the PDF page, hidden behind the corresponding position on the image.

NOTE:

1) This class is effectively private to wxIScanFrame and should not be used outside wxIScanFrame::AddPdfPage().

2) No validity check is done on poXmlDoc, poPdfDoc and nResolution. That is, all constructor parameters are assumed to be valid.

Definition at line 41 of file wxiscanhocr2pdf.h.


Constructor & Destructor Documentation

wxIScanHocr2Pdf::wxIScanHocr2Pdf ( wxXmlDocument *  poXmlDoc,
wxPdfDocument *  poPdfDoc,
int  nResolution,
const wxString &  strHocrClassFilter = wxT( "ocr_line" ) 
)

Standard constructor.

Parameters:
poXmlDoc            the (valid!) pointer to the XML DOM tree
poPdfDoc            the (valid!) pointer to the PDF document
nResolution         the (virtual) resolution of the image
strHocrClassFilter  the class to use for hOCR information (e.g. whole lines or words)

Definition at line 30 of file wxiscanhocr2pdf.cpp.

 : m_strHocrClassFilter( strHocrClassFilter ),
   m_poXmlDoc( poXmlDoc ),
   m_poPdfDoc( poPdfDoc ),
   m_nResolution( nResolution )
{
    // Initialization. (Nothing to do, yet.)
}
wxIScanHocr2Pdf::~wxIScanHocr2Pdf ( ) [inline]

Virtual destructor.

Definition at line 56 of file wxiscanhocr2pdf.h.

{}

Member Function Documentation

virtual void wxIScanHocr2Pdf::Flush2Pdf ( ) [inline, protected, virtual]

Flush outstanding print-to-PDF commands.

NOTE: This function does nothing, but can be overridden.

Reimplemented in wxIScanSmartHocr2Pdf.

Definition at line 90 of file wxiscanhocr2pdf.h.

Referenced by Run(), and TraverseXmlNodes().

{}
wxString wxIScanHocr2Pdf::GetNodeContent ( wxXmlNode *  poNode) [protected, virtual]

Get all text content from all levels below this node.

Parameters:
poNode  pointer to the current node in the DOM tree.

Definition at line 96 of file wxiscanhocr2pdf.cpp.

Referenced by TraverseXmlNodes().

{
    // Get the current node's text content...
    wxString strContent= poNode->GetNodeContent();

    // ... and concatenate with the child node's text content.
    for( wxXmlNode *poIteratorNode= poNode->GetChildren(); poIteratorNode; poIteratorNode= poIteratorNode->GetNext() )
    {
        strContent += GetNodeContent( poIteratorNode );
    }
    return strContent;
}
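The recursion above can be illustrated without wxWidgets. The following is a minimal sketch, assuming a simplified node type that holds its own text and a list of children; `Node` and `CollectText` are hypothetical names used for illustration only and are not part of wxIScan or wxXmlNode.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a DOM node: its own text plus child nodes.
struct Node
{
    std::string text;
    std::vector<Node> children;
};

// Concatenate the node's own text with the text of all descendants,
// visiting children in document order (depth first).
std::string CollectText( const Node &node )
{
    std::string content = node.text;

    for( const Node &child : node.children )
    {
        content += CollectText( child );
    }
    return content;
}
```

For a line node whose children hold "Hello" and " " (the latter with two further children "wor" and "ld"), this sketch yields the single string "Hello world".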
void wxIScanHocr2Pdf::Print2Pdf ( double  x,
double  y,
const wxString &  strText 
) [protected, virtual]

Print the given text at the given coordinates.

Parameters:
x        abscissa of the origin
y        ordinate of the origin
strText  text to print

NOTE: To change how the text is placed, override this function.

Reimplemented in wxIScanSmartHocr2Pdf.

Definition at line 111 of file wxiscanhocr2pdf.cpp.

References m_poPdfDoc.

Referenced by TraverseXmlNodes().

{
    m_poPdfDoc->Text( x, y, strText );
}
void wxIScanHocr2Pdf::Run ( ) [virtual]

Traverse the DOM tree by calling TraverseXmlNodes(), then flush outstanding operations at the end.

Definition at line 43 of file wxiscanhocr2pdf.cpp.

References Flush2Pdf(), m_poXmlDoc, and TraverseXmlNodes().

Referenced by wxIScanFrame::AddPdfPage().

{
    TraverseXmlNodes( m_poXmlDoc->GetRoot() );
    Flush2Pdf();
}
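Run() together with the overridable Print2Pdf() and Flush2Pdf() hooks follows the template method pattern. The sketch below shows that pattern in plain C++ under simplifying assumptions: the traversal is reduced to a loop over strings, and `Traverser`, `BufferingTraverser`, `Print`, `Flush` and `output` are hypothetical names. It does not reproduce what wxIScanSmartHocr2Pdf actually does in its overrides.

```cpp
#include <string>
#include <vector>

// Hypothetical base class: prints each item immediately, Flush() is a no-op.
class Traverser
{
public:
    virtual ~Traverser() {}

    // Visit all items, then flush whatever the hooks may have delayed.
    void Run( const std::vector<std::string> &items )
    {
        for( const std::string &item : items )
        {
            Print( item );
        }
        Flush();
    }

    std::vector<std::string> output;  // stands in for the PDF page

protected:
    virtual void Print( const std::string &text ) { output.push_back( text ); }
    virtual void Flush() {}
};

// Hypothetical subclass: delays output and emits one joined string on Flush().
class BufferingTraverser : public Traverser
{
protected:
    void Print( const std::string &text ) override
    {
        if( !m_pending.empty() ) m_pending += ' ';
        m_pending += text;
    }

    void Flush() override
    {
        if( !m_pending.empty() )
        {
            output.push_back( m_pending );
            m_pending.clear();
        }
    }

private:
    std::string m_pending;
};
```

Because Run() always ends with a call to Flush(), a subclass that delays output in this way does not lose the last buffered text.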
void wxIScanHocr2Pdf::TraverseXmlNodes ( wxXmlNode *  poNode) [protected, virtual]

Traverse recursively through the DOM tree beginning with the given node.

Parameters:
poNode  pointer to the start node in the DOM tree.

Definition at line 51 of file wxiscanhocr2pdf.cpp.

References Flush2Pdf(), GetNodeContent(), m_nResolution, m_strHocrClassFilter, and Print2Pdf().

Referenced by Run().

{
    // If this is an XML tag containing a 'title' attribute
    // beginning with 'bbox' extract the bounding box and
    // the content (the text) and print it on the PDF page
    // using the (lower left corner of the) bounding box.
    if( poNode->GetType() == wxXML_ELEMENT_NODE )
    {
        wxString strAttrClass= poNode->GetAttribute( wxT( "class" ), wxEmptyString );
        wxString strAttrTitle= poNode->GetAttribute( wxT( "title" ), wxEmptyString );

        if(    ( strAttrClass.IsEmpty() || !strAttrClass.Cmp( m_strHocrClassFilter ) )
            && strAttrTitle.StartsWith( wxT( "bbox" ) ) )
        {
            // Parse string, ...
            wxArrayString astrTokens= wxStringTokenize( strAttrTitle );

            // ... get the coordinates of the bounding box, and ...
            long x, y;

            astrTokens[1].ToLong( &x );
            astrTokens[4].ToLong( &y );

            // ... "print" the text on the PDF page.
            Print2Pdf( (double)x / (double)m_nResolution * 25.4,
                       (double)y / (double)m_nResolution * 25.4,
                       GetNodeContent( poNode ) );
        }
    }
    else if( poNode->IsWhitespaceOnly() )
    {
        // Flush any delayed output.
        Flush2Pdf();
    }

    // Do the same for all children of this XML node
    // (so doing a depth first search).
    for( wxXmlNode *poChildNode= poNode->GetChildren(); poChildNode; poChildNode= poChildNode->GetNext() )
    {
        TraverseXmlNodes( poChildNode );
    }
}
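The title parsing and the pixel-to-millimetre conversion used above can be sketched as a standalone function. This is an illustration only, assuming a title of the form "bbox x0 y0 x1 y1"; `BBoxOriginInMm` is a hypothetical helper, not part of wxIScan. The token-count check and the error handling are additions that the listing above does not contain.

```cpp
#include <cassert>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: extract the lower left corner (x0, y1) of an hOCR
// bounding box and convert it from pixels to millimetres (1 inch = 25.4 mm).
// Returns false if the title does not look like "bbox x0 y0 x1 y1".
bool BBoxOriginInMm( const std::string &title, int dpi, double &xMm, double &yMm )
{
    assert( dpi > 0 );  // as in the original class, a valid resolution is assumed

    if( title.rfind( "bbox", 0 ) != 0 ) return false;  // must start with "bbox"

    // Split the title at whitespace.
    std::istringstream in( title );
    std::vector<std::string> tokens;
    for( std::string token; in >> token; ) tokens.push_back( token );

    if( tokens.size() < 5 ) return false;  // assumption: guard against short titles

    try
    {
        // std::stol stops at the first non-digit, so "600;" still parses as 600.
        const long x = std::stol( tokens[1] );
        const long y = std::stol( tokens[4] );

        xMm = static_cast<double>( x ) / dpi * 25.4;
        yMm = static_cast<double>( y ) / dpi * 25.4;
    }
    catch( const std::exception & )
    {
        return false;  // token was not a number
    }
    return true;
}
```

With a resolution of 300 dpi, the title "bbox 300 100 900 600" maps to roughly (25.4 mm, 50.8 mm), since 300 px is one inch and 600 px is two inches.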

Member Data Documentation

int wxIScanHocr2Pdf::m_nResolution [protected]

The (virtual) resolution of an image in dpi.

Definition at line 98 of file wxiscanhocr2pdf.h.

Referenced by TraverseXmlNodes().

wxPdfDocument* wxIScanHocr2Pdf::m_poPdfDoc [protected]

The pointer to the PDF document.

Definition at line 97 of file wxiscanhocr2pdf.h.

Referenced by wxIScanSmartHocr2Pdf::Flush2Pdf(), and Print2Pdf().

wxXmlDocument* wxIScanHocr2Pdf::m_poXmlDoc [protected]

The pointer to the XML document.

Definition at line 96 of file wxiscanhocr2pdf.h.

Referenced by Run().

wxString wxIScanHocr2Pdf::m_strHocrClassFilter

Filter for the hOCR 'class' attribute.

Definition at line 93 of file wxiscanhocr2pdf.h.

Referenced by TraverseXmlNodes().


The documentation for this class was generated from the following files: