April 3, 2011

fargo fargo
Lab Rat
1 posts

QRegExp for Searching HTML Files?

 

I have a pool of HTML files and want to search through them for some targeted text. The search has to run on their text contents and ignore the HTML tags. I tried QRegExp, but could not find a good pattern to do this. So, I’d appreciate any help in this regard.

Thank you.

4 replies

April 3, 2011

Volker Volker
Ant Farmer
5428 posts

Searching the raw HTML directly is almost impossible, as the markup gets in the way all the time. One possible solution is to load the HTML into a QTextDocument [doc.qt.nokia.com] and use find() on the document.

This has the drawback that the HTML might be completely altered if you need to manipulate the contents and save the file afterwards.

April 4, 2011

Andre Andre
Robot Herder
6399 posts

First of all, using a QRegExp to search through files on disk isn’t something that is supported directly by Qt. You’d have to load the files one by one and then search the contents. Second, using a regexp to parse HTML or XML is a bad idea. You really don’t want to do that. I would recommend that you use some HTML tidy program to create valid XML from it, and then use Qt’s XML classes to search for your text.

April 4, 2011

Volker Volker
Ant Farmer
5428 posts

Even if it is valid XHTML, searching will not work out well with the raw XHTML source. Imagine you search for “foo bar” and have in your markup:

  <em>foo</em> <span class='hugo'>ba</span><span class='superduper'>r</span>

You still want to match this (obviously silly) construct, as its plain text representation is still “foo bar”. Best approach IMO would be the text search of QTextDocument.
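To make the problem concrete, here is a small illustration in plain C++ (no Qt) of why a search on the raw markup misses the phrase while a search on the text content finds it. The stripTags() helper is hypothetical and deliberately naive: it only drops everything between '<' and '>', and does not handle entities, comments, scripts, or block-level whitespace. It is not the QTextDocument approach suggested above, just a sketch of the mismatch.

```cpp
#include <string>

// Hypothetical, naive helper for illustration only: removes everything
// between '<' and '>' and keeps all other characters unchanged.
std::string stripTags(const std::string &html)
{
    std::string text;
    bool inTag = false;
    for (char c : html) {
        if (inTag) {
            if (c == '>')
                inTag = false;   // end of the current tag
        } else if (c == '<') {
            inTag = true;        // start of a tag, skip until '>'
        } else {
            text += c;           // ordinary text content
        }
    }
    return text;
}

// Returns true if needle occurs in the text content of html
// (as approximated by stripTags).
bool textContains(const std::string &html, const std::string &needle)
{
    return stripTags(html).find(needle) != std::string::npos;
}
```

With the markup from the post, a plain std::string::find for "foo bar" on the raw source finds nothing, while textContains() reports a match, because the tags split the phrase in the source but not in the text content.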

April 4, 2011

Andre Andre
Robot Herder
6399 posts

Good point. I think you are right, and QTextDocument::find() is the way to go.

 