DOM tree traverser class for wxIScanFrame::AddPdfPage().

#include <wxiscanhocr2pdf.h>

Public Member Functions
	wxIScanHocr2Pdf (wxXmlDocument *poXmlDoc, wxPdfDocument *poPdfDoc, int nResolution, const wxString &strHocrClassFilter=wxT("ocr_line"))
	Standard constructor.
	~wxIScanHocr2Pdf ()
	Virtual destructor.
virtual void	Run ()
	Traverse the DOM tree by calling TraverseXmlNodes(), then flush any outstanding operations.
Public Attributes
wxString	m_strHocrClassFilter
	Filter for the hOCR 'class' attribute.
Protected Member Functions
virtual void	TraverseXmlNodes (wxXmlNode *poNode)
	Traverse recursively through the DOM tree beginning with the given node.
virtual wxString	GetNodeContent (wxXmlNode *poNode)
	Get all text content from all levels below this node.
virtual void	Print2Pdf (double x, double y, const wxString &strText)
	Print the given text at the given coordinates.
virtual void	Flush2Pdf ()
	Flush outstanding print-to-PDF-commands.
Protected Attributes
wxXmlDocument *	m_poXmlDoc
	The pointer to the XML document.
wxPdfDocument *	m_poPdfDoc
	The pointer to the PDF document.
int	m_nResolution
	The (virtual) resolution of the image in dpi.

Detailed Description

DOM tree traverser class for wxIScanFrame::AddPdfPage().

This helper class traverses an hOCR DOM tree and "prints" the text of the hOCR XML file onto the PDF page, hidden behind the image at the position given by each element's bounding box.

NOTE:

1) This class is effectively private to wxIScanFrame and should not be used outside wxIScanFrame::AddPdfPage().

2) No validity checks are performed on poXmlDoc, poPdfDoc and nResolution; all constructor parameters are assumed to be valid.

Definition at line 41 of file wxiscanhocr2pdf.h.

Constructor & Destructor Documentation

wxIScanHocr2Pdf::wxIScanHocr2Pdf	(	wxXmlDocument *	poXmlDoc,
		wxPdfDocument *	poPdfDoc,
		int	nResolution,
		const wxString &	strHocrClassFilter = `wxT( "ocr_line" )`
	)

Standard constructor.

Parameters:

poXmlDoc	the (valid!) pointer to the XML DOM tree
poPdfDoc	the (valid!) pointer to the PDF document
nResolution	the (virtual) resolution of the image
strHocrClassFilter	the hOCR class whose elements are printed (e.g. ocr_line for whole lines or ocrx_word for single words)

Definition at line 30 of file wxiscanhocr2pdf.cpp.

 : m_strHocrClassFilter( strHocrClassFilter ),
   m_poXmlDoc( poXmlDoc ),
   m_poPdfDoc( poPdfDoc ),
   m_nResolution( nResolution )
{
    // Initialization. (Nothing to do, yet.)
}

wxIScanHocr2Pdf::~wxIScanHocr2Pdf ( ) [inline]

Virtual destructor.

Definition at line 56 of file wxiscanhocr2pdf.h.

{}

Member Function Documentation

virtual void wxIScanHocr2Pdf::Flush2Pdf ( ) [inline, protected, virtual]

Flush outstanding print-to-PDF-commands.

NOTE: This function does nothing, but can be overridden.

Reimplemented in wxIScanSmartHocr2Pdf.

Definition at line 90 of file wxiscanhocr2pdf.h.

Referenced by Run(), and TraverseXmlNodes().

{}

wxString wxIScanHocr2Pdf::GetNodeContent ( wxXmlNode * poNode ) [protected, virtual]

Get all text content from all levels below this node.

Parameters:

poNode pointer to current node in the DOM tree.

Definition at line 96 of file wxiscanhocr2pdf.cpp.

Referenced by TraverseXmlNodes().

{
    // Get the current node's text content...
    wxString strContent= poNode->GetNodeContent();

    // ... and concatenate with the child node's text content.
    for( wxXmlNode *poIteratorNode= poNode->GetChildren(); poIteratorNode; poIteratorNode= poIteratorNode->GetNext() )
    {
        strContent += GetNodeContent( poIteratorNode );
    }
    return strContent;
}
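
The recursion above can be tried without wxWidgets. The following is a minimal sketch in which Node and CollectText are hypothetical stand-ins for wxXmlNode and GetNodeContent(). It simplifies by giving each node its own text field, which may differ from how wxXmlNode stores text content.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for wxXmlNode: own text plus children in document order.
struct Node {
    std::string text;
    std::vector<Node> children;
};

// Return the node's own text followed by the text of all descendants,
// visited depth first in document order.
std::string CollectText(const Node &node)
{
    std::string content = node.text;
    for (const Node &child : node.children) {
        content += CollectText(child);
    }
    return content;
}
```

Because the child text is appended in document order, the fragments of a line come out in reading order even when the words sit in nested elements.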

void wxIScanHocr2Pdf::Print2Pdf	(	double	x,
		double	y,
		const wxString &	strText
	)		`[protected, virtual]`

Print the given text at the given coordinates.

Parameters:

x	abscissa of the origin
y	ordinate of the origin
strText	text to print

NOTE: If you want to change the behaviour of the text placement you should override this function.

Reimplemented in wxIScanSmartHocr2Pdf.

Definition at line 111 of file wxiscanhocr2pdf.cpp.

References m_poPdfDoc.

Referenced by TraverseXmlNodes().

{
    m_poPdfDoc->Text( x, y, strText );
}
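
To illustrate why Print2Pdf() and Flush2Pdf() are separate virtual hooks, here is a sketch of a subclass that delays output until a flush. It does not show how wxIScanSmartHocr2Pdf is implemented. ImmediateSink, BufferingSink and PlacedText are hypothetical names, and a plain vector stands in for the PDF page.

```cpp
#include <string>
#include <vector>

// One piece of text placed at a position on the (simulated) page.
struct PlacedText {
    double x;
    double y;
    std::string text;
};

// Hypothetical stand-in for the base class: prints at once, flush is a no-op.
class ImmediateSink {
public:
    virtual ~ImmediateSink() = default;
    virtual void Print(double x, double y, const std::string &text) { Emit(x, y, text); }
    virtual void Flush() {}
    const std::vector<PlacedText> &Output() const { return m_output; }

protected:
    void Emit(double x, double y, const std::string &text) { m_output.push_back({x, y, text}); }

private:
    std::vector<PlacedText> m_output;
};

// Hypothetical override: joins fragments with spaces, anchors them at the
// position of the first fragment and emits them only when flushed.
class BufferingSink : public ImmediateSink {
public:
    void Print(double x, double y, const std::string &text) override
    {
        if (m_pending.empty()) {
            m_x = x;
            m_y = y;
        } else {
            m_pending += ' ';
        }
        m_pending += text;
    }

    void Flush() override
    {
        if (!m_pending.empty()) {
            Emit(m_x, m_y, m_pending);
            m_pending.clear();
        }
    }

private:
    double m_x = 0.0;
    double m_y = 0.0;
    std::string m_pending;
};
```

With this split the traversal code does not need to know whether output is immediate or delayed; it only has to call the flush hook at suitable points.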

void wxIScanHocr2Pdf::Run ( ) [virtual]

Traverse the DOM tree by calling TraverseXmlNodes(), then flush any outstanding operations.

Definition at line 43 of file wxiscanhocr2pdf.cpp.

References Flush2Pdf(), m_poXmlDoc, and TraverseXmlNodes().

Referenced by wxIScanFrame::AddPdfPage().

{
    TraverseXmlNodes( m_poXmlDoc->GetRoot() );
    Flush2Pdf();
}

void wxIScanHocr2Pdf::TraverseXmlNodes ( wxXmlNode * poNode ) [protected, virtual]

Traverse recursively through the DOM tree beginning with the given node.

Parameters:

poNode pointer to start node in the DOM tree.

Definition at line 51 of file wxiscanhocr2pdf.cpp.

References Flush2Pdf(), GetNodeContent(), m_nResolution, m_strHocrClassFilter, and Print2Pdf().

Referenced by Run().

{
    // If this is an XML tag containing a 'title' attribute
    // beginning with 'bbox' extract the bounding box and
    // the content (the text) and print it on the PDF page
    // using the (lower left corner of the) bounding box.
    if( poNode->GetType() == wxXML_ELEMENT_NODE )
    {
        wxString strAttrClass= poNode->GetAttribute( wxT( "class" ), wxEmptyString );
        wxString strAttrTitle= poNode->GetAttribute( wxT( "title" ), wxEmptyString );

        if(    ( strAttrClass.IsEmpty() || !strAttrClass.Cmp( m_strHocrClassFilter ) )
            && strAttrTitle.StartsWith( wxT( "bbox" ) ) )
        {
            // Parse string, ...
            wxArrayString astrTokens= wxStringTokenize( strAttrTitle );

            // ... get the coordinates of the bounding box, and ...
            long x, y;

            astrTokens[1].ToLong( &x );
            astrTokens[4].ToLong( &y );

            // ... "print" the text on the PDF page.
            Print2Pdf( (double)x / (double)m_nResolution * 25.4,
                       (double)y / (double)m_nResolution * 25.4,
                       GetNodeContent( poNode ) );
        }
    }
    else if( poNode->IsWhitespaceOnly() )
    {
        // Flush any delayed output.
        Flush2Pdf();
    }

    // Do the same for all children of this XML node
    // (so doing a depth first search).
    for( wxXmlNode *poChildNode= poNode->GetChildren(); poChildNode; poChildNode= poChildNode->GetNext() )
    {
        TraverseXmlNodes( poChildNode );
    }
}
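
The listing above reads tokens 1 and 4 of the title attribute without checking how many tokens there are, and converts pixels to millimetres as pixels / dpi * 25.4. The sketch below shows the same arithmetic in standard C++ with a length check added. PointMm, PixelsToMm and LowerLeftFromTitle are hypothetical names, not part of the class, and the sketch assumes a title of the form "bbox x0 y0 x1 y1", optionally followed by further properties.

```cpp
#include <sstream>
#include <string>

// Position in millimetres on the PDF page.
struct PointMm {
    double x;
    double y;
};

// Pixels divided by dpi gives inches; one inch is 25.4 mm.
double PixelsToMm(long pixels, int resolutionDpi)
{
    return static_cast<double>(pixels) / static_cast<double>(resolutionDpi) * 25.4;
}

// Parse "bbox x0 y0 x1 y1" and store the lower left corner (x0, y1) in 'out'.
// Returns false, leaving 'out' untouched, if the resolution is not positive,
// the title does not start with "bbox", or fewer than four integers follow.
bool LowerLeftFromTitle(const std::string &title, int resolutionDpi, PointMm &out)
{
    if (resolutionDpi <= 0) {
        return false;
    }

    std::istringstream in(title);
    std::string keyword;
    long x0 = 0, y0 = 0, x1 = 0, y1 = 0;
    if (!(in >> keyword) || keyword != "bbox" || !(in >> x0 >> y0 >> x1 >> y1)) {
        return false;
    }

    // hOCR measures y downwards from the top, so y1 is the bottom edge.
    out = PointMm{PixelsToMm(x0, resolutionDpi), PixelsToMm(y1, resolutionDpi)};
    return true;
}
```

For example, the title "bbox 300 150 900 600" at 300 dpi gives x = 300 / 300 * 25.4 = 25.4 mm and y = 600 / 300 * 25.4 = 50.8 mm. Stream extraction stops at the first non-digit character, so a semicolon directly after the fourth number does not make the parse fail.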