Hi there guys, I am creating a web crawler using C++ and the Qt library. Everything is working fine so far; however, I am having difficulty extracting the data I am interested in from the HTML document. I was wondering if there is a C++ library that can encapsulate JavaScript code just for the sake of extracting data from an HTML document. The Qt framework has QWebEngine for this, but I don't feel it's efficient enough for what I need. Is there another, simpler library for this?
libxml2 is a pretty standard choice for HTML parsing. http://xmlsoft.org/
Keep in mind, this is literally just an HTML parser. If a website contains JS that manipulates the DOM, a parser will not execute that code, so you will not be able to see computed contents. You need something closer to a full-fledged web browser for that.
Thank you for your reply. I did some research on libxml2 and I read somewhere that libxml2 does not support HTML5 tags. Apparently to parse an HTML document using libxml2 you must first convert the HTML document to XML and then perform the parsing.
It is quite easy to write a grammar for a subset of HTML/JS and attach callbacks that fetch exactly what you need. You can also easily support user extensions, so that they are skipped correctly, just by adding a little extra grammar.
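To make the callback idea concrete, here is a rough sketch using only the standard library. It is not a real HTML parser and not anyone's library API: the names scan_start_tags, attribute_value and collect_links are made up for this example. It assumes reasonably well-formed markup, double-quoted attribute values, and no '>' inside attribute values. It reports each start tag to a callback and skips comments and the bodies of script and style elements.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical callback type: receives the lower-cased tag name and the raw
// text between the tag name and the closing '>'.
using TagCallback =
    std::function<void(const std::string& name, const std::string& attrs)>;

static std::string to_lower(std::string s) {
    for (char& c : s)
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    return s;
}

// Walks the document and calls on_tag once per start tag.
// Comments and the contents of <script>/<style> are skipped, not parsed.
void scan_start_tags(const std::string& html, const TagCallback& on_tag) {
    const std::string lower = to_lower(html);  // used for case-insensitive search
    std::size_t pos = 0;
    while ((pos = lower.find('<', pos)) != std::string::npos) {
        if (lower.compare(pos, 4, "<!--") == 0) {  // comment: jump past "-->"
            const std::size_t end = lower.find("-->", pos + 4);
            if (end == std::string::npos) return;
            pos = end + 3;
            continue;
        }
        const std::size_t name_begin = pos + 1;
        std::size_t name_end = name_begin;
        while (name_end < lower.size() &&
               std::isalnum(static_cast<unsigned char>(lower[name_end])))
            ++name_end;
        if (name_end == name_begin) {  // end tag, doctype, or a stray '<'
            ++pos;
            continue;
        }
        const std::size_t close = lower.find('>', name_end);
        if (close == std::string::npos) return;
        const std::string name = lower.substr(name_begin, name_end - name_begin);
        on_tag(name, html.substr(name_end, close - name_end));
        pos = close + 1;
        if (name == "script" || name == "style") {  // skip raw-text elements
            const std::size_t end = lower.find("</" + name, pos);
            if (end == std::string::npos) return;
            pos = end;
        }
    }
}

// Returns the value of key="..." from a raw attribute string, or "" if absent.
// Simplification: double quotes only, and the key is matched as a plain
// substring, so "href" would also match inside "data-href".
std::string attribute_value(const std::string& attrs, const std::string& key) {
    const std::string lower = to_lower(attrs);
    const std::string needle = key + "=\"";
    std::size_t start = lower.find(needle);
    if (start == std::string::npos) return "";
    start += needle.size();
    const std::size_t end = attrs.find('"', start);
    if (end == std::string::npos) return "";
    return attrs.substr(start, end - start);
}

// Example callback use: collect the href of every <a> start tag.
std::vector<std::string> collect_links(const std::string& html) {
    std::vector<std::string> links;
    scan_start_tags(html, [&](const std::string& name, const std::string& attrs) {
        if (name != "a") return;
        const std::string href = attribute_value(attrs, "href");
        if (!href.empty()) links.push_back(href);
    });
    return links;
}
```

Calling collect_links(page) on a downloaded page gives you the link targets in document order. Supporting another construct you want skipped is one more branch next to the script/style check. For anything beyond simple extraction, a real HTML5 parser will cope with malformed markup far better than this.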
Google/Gumbo: C99, but the claims are impressive:
'Passes all html5lib tests, including the template tag.' and 'Tested on over 2.5 billion pages from Google's index'. https://github.com/google/gumbo-parser