QRegExp for Searching HTML Files?
I have a pool of HTML files and want to search through them for some target text. The search has to run on their text contents and ignore the HTML tags. I tried QRegExp, but could not find a good pattern to do this. So, I’d appreciate any help in this regard.
Searching the raw markup directly is almost impossible, as the formatting gets in the way all the time. One possible solution is to load the HTML into a QTextDocument [doc.qt.nokia.com] and use find on the document.
This has the drawback that the HTML might be completely altered if you need to manipulate the contents and save the file afterwards.
First of all, using a QRegExp to search through files on disk isn’t something that is supported directly by Qt. You’d have to load the files one by one and then search the contents. Second, using a regexp to parse HTML or XML is a bad idea. You really don’t want to do that. I would recommend that you use some HTML tidy program to create valid XML from it, and then use Qt’s XML classes to search for your text.
Even if it is valid XHTML, searching will not work out well with the raw XHTML source. Imagine you search for “foo bar” and have in your markup:
- <em>foo</em> <span class='hugo'>ba</span><span class='superduper'>r</span>
You still want to match this (obviously silly) construct, as its plain text representation is still “foo bar”. Best approach IMO would be the text search of QTextDocument.
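To make the point about the construct above concrete, here is a minimal sketch in plain C++ (standard library only, no Qt). The helpers stripTags and containsText are hypothetical names made up for this illustration. The tag stripping is deliberately naive: it ignores entities, comments, script/style blocks and a '>' inside attribute values, so it only shows why the search has to run on the plain text rather than on the markup. It is not a replacement for QTextDocument or a real HTML parser.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: returns a rough plain-text representation of `html`.
// It drops everything between '<' and '>' and collapses runs of whitespace
// into a single space. Assumption: simple, well-formed markup only.
std::string stripTags(const std::string& html) {
    std::string text;
    bool inTag = false;
    for (char c : html) {
        if (c == '<') {            // a tag starts: skip until the closing '>'
            inTag = true;
            continue;
        }
        if (c == '>' && inTag) {   // the tag ends
            inTag = false;
            continue;
        }
        if (inTag) {
            continue;              // tag names and attributes are not text
        }
        if (std::isspace(static_cast<unsigned char>(c))) {
            // collapse whitespace and skip it entirely at the very start
            if (!text.empty() && text.back() != ' ') {
                text.push_back(' ');
            }
        } else {
            text.push_back(c);
        }
    }
    return text;
}

// Hypothetical helper: searches the plain text instead of the raw markup.
bool containsText(const std::string& html, const std::string& needle) {
    return stripTags(html).find(needle) != std::string::npos;
}
```

With the markup from above, a search for "foo bar" in the raw string fails because the tags split the words, while containsText finds it in the stripped text. A word that only appears inside a tag, such as the class name hugo, is not matched.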