wxIScan
DOM tree traverser class for wxIScanFrame::AddPdfPage().
#include <wxiscanhocr2pdf.h>
Public Member Functions | |
wxIScanHocr2Pdf (wxXmlDocument *poXmlDoc, wxPdfDocument *poPdfDoc, int nResolution, const wxString &strHocrClassFilter=wxT("ocr_line")) | |
Standard constructor. | |
~wxIScanHocr2Pdf () | |
Virtual destructor. | |
virtual void | Run () |
Traverse the DOM tree by calling TraverseXmlNodes() and flush outstanding operations at the end. | |
Public Attributes | |
wxString | m_strHocrClassFilter |
Filter for the hOCR 'class' attribute. | |
Protected Member Functions | |
virtual void | TraverseXmlNodes (wxXmlNode *poNode) |
Traverse recursively through the DOM tree beginning with the given node. | |
virtual wxString | GetNodeContent (wxXmlNode *poNode) |
Get all text content from all levels below this node. | |
virtual void | Print2Pdf (double x, double y, const wxString &strText) |
Print the given text at the given coordinates. | |
virtual void | Flush2Pdf () |
Flush outstanding print-to-PDF-commands. | |
Protected Attributes | |
wxXmlDocument * | m_poXmlDoc |
The pointer to the XML document. | |
wxPdfDocument * | m_poPdfDoc |
The pointer to the PDF document. | |
int | m_nResolution |
The (virtual) resolution of the image in dpi. |
DOM tree traverser class for wxIScanFrame::AddPdfPage().
This helper class traverses an hOCR DOM tree and "prints" the text of the hOCR XML file onto the PDF page, hidden behind the corresponding position on the image.
NOTE:
1) This class is effectively private to wxIScanFrame and should not be used outside wxIScanFrame::AddPdfPage().
2) No validity check is done on poXmlDoc, poPdfDoc and nResolution. That is, all parameters of the constructor are assumed to be valid.
Definition at line 41 of file wxiscanhocr2pdf.h.
wxIScanHocr2Pdf::wxIScanHocr2Pdf | ( | wxXmlDocument * | poXmlDoc, |
wxPdfDocument * | poPdfDoc, | ||
int | nResolution, | ||
const wxString & | strHocrClassFilter = wxT( "ocr_line" ) |
||
) |
Standard constructor.
poXmlDoc | the (valid!) pointer to the XML DOM tree |
poPdfDoc | the (valid!) pointer to the PDF document |
nResolution | the (virtual) resolution of the image |
strHocrClassFilter | the class to use for hOCR information (e.g. whole lines or words) |
Definition at line 30 of file wxiscanhocr2pdf.cpp.
: m_strHocrClassFilter( strHocrClassFilter ),
  m_poXmlDoc( poXmlDoc ),
  m_poPdfDoc( poPdfDoc ),
  m_nResolution( nResolution )
{
    // Initialization. (Nothing to do, yet.)
}
wxIScanHocr2Pdf::~wxIScanHocr2Pdf | ( | ) | [inline] |
virtual void wxIScanHocr2Pdf::Flush2Pdf | ( | ) | [inline, protected, virtual] |
Flush outstanding print-to-PDF-commands.
NOTE: This function does nothing, but can be overridden.
Reimplemented in wxIScanSmartHocr2Pdf.
Definition at line 90 of file wxiscanhocr2pdf.h.
Referenced by Run(), and TraverseXmlNodes().
{}
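The empty default above is a hook: a derived class can hold text back in Print2Pdf() and emit it only when Flush2Pdf() is called. The following is a minimal sketch of that pattern using only the standard library. `FakePdf`, `TextCall`, `BasePrinter` and `BufferedPrinter` are hypothetical stand-ins invented for illustration; the joining strategy is an assumption and is not taken from wxIScanSmartHocr2Pdf.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the PDF target: it only records Text() calls.
struct TextCall { double x; double y; std::string text; };

struct FakePdf {
    std::vector<TextCall> calls;
    void Text(double x, double y, const std::string& text) {
        calls.push_back({x, y, text});
    }
};

// Mirrors the two overridable hooks of the documented class.
class BasePrinter {
public:
    explicit BasePrinter(FakePdf* pdf) : m_pdf(pdf) { assert(pdf != nullptr); }
    virtual ~BasePrinter() = default;

    // Default behaviour: print immediately.
    virtual void Print2Pdf(double x, double y, const std::string& text) {
        m_pdf->Text(x, y, text);
    }
    // Default behaviour: nothing is pending, so nothing to flush.
    virtual void Flush2Pdf() {}

protected:
    FakePdf* m_pdf;
};

// Assumed strategy: join consecutive fragments and emit them as one
// string at the position of the first fragment when flushed.
class BufferedPrinter : public BasePrinter {
public:
    using BasePrinter::BasePrinter;

    void Print2Pdf(double x, double y, const std::string& text) override {
        if (m_hasPending) {
            m_pending.text += " " + text;
        } else {
            m_pending = {x, y, text};
            m_hasPending = true;
        }
    }

    void Flush2Pdf() override {
        if (!m_hasPending) return;  // flushing twice is harmless
        m_pdf->Text(m_pending.x, m_pending.y, m_pending.text);
        m_hasPending = false;
    }

private:
    TextCall m_pending{};
    bool m_hasPending = false;
};
```

Because the traversal calls Flush2Pdf() both on whitespace-only nodes and at the end of Run(), a buffering subclass of this kind would not need to change the traversal itself.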
wxString wxIScanHocr2Pdf::GetNodeContent | ( | wxXmlNode * | poNode | ) | [protected, virtual] |
Get all text content from all levels below this node.
poNode | pointer to current node in the DOM tree. |
Definition at line 96 of file wxiscanhocr2pdf.cpp.
Referenced by TraverseXmlNodes().
{
    // Get the current node's text content...
    wxString strContent= poNode->GetNodeContent();

    // ... and concatenate with the child node's text content.
    for( wxXmlNode *poIteratorNode= poNode->GetChildren(); poIteratorNode; poIteratorNode= poIteratorNode->GetNext() )
    {
        strContent += GetNodeContent( poIteratorNode );
    }
    return strContent;
}
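To make the recursion easier to follow in isolation, here is a small sketch of the same depth-first concatenation on a simplified tree. `Node` and `CollectText` are hypothetical names; `Node` is a plain stand-in and does not reproduce the behaviour of wxXmlNode. The sketch assumes C++17.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified tree node: own text plus ordered children.
struct Node {
    std::string content;
    std::vector<Node> children;
};

// Returns the node's own text followed by the text of all descendants,
// visited depth-first in document order.
std::string CollectText(const Node& node) {
    std::string result = node.content;
    for (const Node& child : node.children) {
        result += CollectText(child);
    }
    return result;
}
```

For a line element whose children carry "Hello", " " and a nested element containing "world", this yields the flat string "Hello world".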
void wxIScanHocr2Pdf::Print2Pdf | ( | double | x, |
double | y, | ||
const wxString & | strText | ||
) | [protected, virtual] |
Print the given text at the given coordinates.
x | abscissa of the origin |
y | ordinate of the origin |
strText | text to print |
NOTE: Override this function to change how the text is placed.
Reimplemented in wxIScanSmartHocr2Pdf.
Definition at line 111 of file wxiscanhocr2pdf.cpp.
References m_poPdfDoc.
Referenced by TraverseXmlNodes().
{ m_poPdfDoc->Text( x, y, strText ); }
void wxIScanHocr2Pdf::Run | ( | ) | [virtual] |
Traverse the DOM tree by calling TraverseXmlNodes() and flush outstanding operations at the end.
Definition at line 43 of file wxiscanhocr2pdf.cpp.
References Flush2Pdf(), m_poXmlDoc, and TraverseXmlNodes().
Referenced by wxIScanFrame::AddPdfPage().
{ TraverseXmlNodes( m_poXmlDoc->GetRoot() ); Flush2Pdf(); }
void wxIScanHocr2Pdf::TraverseXmlNodes | ( | wxXmlNode * | poNode | ) | [protected, virtual] |
Traverse recursively through the DOM tree beginning with the given node.
poNode | pointer to start node in the DOM tree. |
Definition at line 51 of file wxiscanhocr2pdf.cpp.
References Flush2Pdf(), GetNodeContent(), m_nResolution, m_strHocrClassFilter, and Print2Pdf().
Referenced by Run().
{
    // If this is an XML tag containing a 'title' attribute
    // beginning with 'bbox' extract the bounding box and
    // the content (the text) and print it on the PDF page
    // using the (lower left corner of the) bounding box.
    if( poNode->GetType() == wxXML_ELEMENT_NODE )
    {
        wxString strAttrClass= poNode->GetAttribute( wxT( "class" ), wxEmptyString );
        wxString strAttrTitle= poNode->GetAttribute( wxT( "title" ), wxEmptyString );

        if( ( strAttrClass.IsEmpty() || !strAttrClass.Cmp( m_strHocrClassFilter ) )
            && strAttrTitle.StartsWith( wxT( "bbox" ) ) )
        {
            // Parse string, ...
            wxArrayString astrTokens= wxStringTokenize( strAttrTitle );

            // ... get the coordinates of the bounding box, and ...
            long x, y;

            astrTokens[1].ToLong( &x );
            astrTokens[4].ToLong( &y );

            // ... "print" the text on the PDF page.
            Print2Pdf( (double)x / (double)m_nResolution * 25.4,
                       (double)y / (double)m_nResolution * 25.4,
                       GetNodeContent( poNode ) );
        }
    }
    else if( poNode->IsWhitespaceOnly() )
    {
        // Flush eventually delayed output.
        Flush2Pdf();
    }

    // Do the same for all children of this XML node
    // (so doing a depth first search).
    for( wxXmlNode *poChildNode= poNode->GetChildren(); poChildNode; poChildNode= poChildNode->GetNext() )
    {
        TraverseXmlNodes( poChildNode );
    }
}
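The listing above reads tokens 1 and 4 of the title attribute (the left edge and the bottom edge of the box) without checking how many tokens exist or whether the conversions succeeded. As an illustration only, the sketch below shows one way such a title could be parsed defensively with the standard library. `ParseBBox` is a hypothetical helper, not part of wxIScan, and it assumes C++17 and a title of the form `bbox x0 y0 x1 y1`, optionally followed by `;` and further properties.

```cpp
#include <array>
#include <optional>
#include <sstream>
#include <string>

// Parses a title such as "bbox 10 20 110 45; baseline 0 -3".
// Returns {x0, y0, x1, y1} in pixels, or std::nullopt if the title does
// not start with the keyword "bbox" followed by four integers.
std::optional<std::array<long, 4>> ParseBBox(const std::string& title) {
    // Only the first property (up to the first ';') is considered.
    std::istringstream in(title.substr(0, title.find(';')));

    std::string keyword;
    if (!(in >> keyword) || keyword != "bbox") {
        return std::nullopt;
    }

    std::array<long, 4> box{};
    for (long& value : box) {
        if (!(in >> value)) {
            return std::nullopt;  // too few or non-numeric coordinates
        }
    }
    return box;
}
```

With this helper, elements 0 and 3 of the result correspond to the two values the listing takes from `astrTokens[1]` and `astrTokens[4]`, and a malformed title can simply be skipped.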
int wxIScanHocr2Pdf::m_nResolution [protected] |
The (virtual) resolution of the image in dpi.
Definition at line 98 of file wxiscanhocr2pdf.h.
Referenced by TraverseXmlNodes().
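TraverseXmlNodes() uses this value to turn pixel coordinates into page coordinates by dividing by the resolution and multiplying by 25.4, which suggests that the PDF document is expected to work in millimetres. A minimal sketch of that conversion follows; `PixelsToMm` and `kMmPerInch` are hypothetical names chosen for illustration.

```cpp
#include <cassert>

// One inch is exactly 25.4 mm.
constexpr double kMmPerInch = 25.4;

// Converts a pixel coordinate to millimetres for an image scanned at
// the given resolution (dots per inch).
double PixelsToMm(long pixels, int dpi) {
    assert(dpi > 0);  // the documented class performs no such check
    return static_cast<double>(pixels) / static_cast<double>(dpi) * kMmPerInch;
}
```

For example, a coordinate of 300 pixels at 300 dpi is one inch from the origin, that is 25.4 mm.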
wxPdfDocument* wxIScanHocr2Pdf::m_poPdfDoc [protected] |
The pointer to the PDF document.
Definition at line 97 of file wxiscanhocr2pdf.h.
Referenced by wxIScanSmartHocr2Pdf::Flush2Pdf(), and Print2Pdf().
wxXmlDocument* wxIScanHocr2Pdf::m_poXmlDoc [protected] |
The pointer to the XML document.
Definition at line 96 of file wxiscanhocr2pdf.h.
Referenced by Run().
wxString wxIScanHocr2Pdf::m_strHocrClassFilter
Filter for the hOCR 'class' attribute.
Definition at line 93 of file wxiscanhocr2pdf.h.
Referenced by TraverseXmlNodes().
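In TraverseXmlNodes() the filter is applied as "class attribute is empty, or equals the filter exactly", so elements without a class attribute also pass. The predicate below restates that condition with std::string; `MatchesClassFilter` is a hypothetical name, and the sketch covers only the class test, not the additional check that the title starts with "bbox".

```cpp
#include <string>

// True if the element's class attribute is empty or is equal to the
// filter (case-sensitive comparison).
bool MatchesClassFilter(const std::string& classAttr, const std::string& filter) {
    return classAttr.empty() || classAttr == filter;
}
```

With the default filter "ocr_line", an element of class "ocrx_word" is skipped, while an element with no class attribute is still considered.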