C++ Web Extraction

Jan 6, 2011 at 12:54am
Hello Everyone,

I was wondering how to extract data from websites and use it in my program. For example, how can I have my program read a page's HTML code and save it as a text file? I know I can do this manually in Internet Explorer under Page > View Source. I would like my program to do that for me.

Any help would be greatly appreciated.

Thanks, Arthur
Jan 7, 2011 at 9:32am
www.cplusplus.com
Jan 7, 2011 at 9:34am
Someone asked this question before. Use cURL.
Jan 7, 2011 at 12:30pm
In Linux, you can just use the get command for that.
Jan 7, 2011 at 1:55pm
In Linux, you can just use the get command for that.

But surely you won't use system()...
Jan 20, 2011 at 9:28pm
Another thing you could do is download the HTML file from the website (not sure how you'd do that, by the way) and then just use some good old file I/O.

If that's any help...
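
To make the file I/O part concrete, here is a minimal sketch that only uses the standard library. It assumes the page has already been saved to disk somehow. The helper names `read_file` and `write_file` are made up for this example, not from any library.

```cpp
#include <cassert>
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: read an entire file (e.g. a saved HTML page)
// into a std::string. Throws std::runtime_error if it cannot be opened.
std::string read_file(const std::string& path) {
    std::ifstream in(path.c_str(), std::ios::in | std::ios::binary);
    if (!in) {
        throw std::runtime_error("cannot open " + path);
    }
    std::ostringstream buffer;
    buffer << in.rdbuf();  // copy the whole stream into the buffer
    return buffer.str();
}

// Hypothetical helper: write a string to a file, replacing any
// existing contents. Throws std::runtime_error if it cannot be opened.
void write_file(const std::string& path, const std::string& text) {
    std::ofstream out(path.c_str(),
                      std::ios::out | std::ios::binary | std::ios::trunc);
    if (!out) {
        throw std::runtime_error("cannot open " + path);
    }
    out << text;
}
```

With these, something like `write_file("page.txt", read_file("page.html"));` would copy a saved page into a text file. Both file names are placeholders.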
Jan 21, 2011 at 7:41am
If you want it to be automated, then once cURL is installed you can use it like any other Unix command in a shell script.

A lot of organizations use cURL for web extraction. Besides the curl binary, the project also provides a library (libcurl) that you can integrate into your own programs.

Once you have extracted the HTML, you will need an HTML parser to strip the tags and get at the contents. I don't have a candidate for that yet.

Does anyone want to recommend a C++ HTML tag parser? I know there is one for Java.
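
Until someone suggests a proper parser, here is a deliberately naive sketch of tag stripping, just to show the idea. It is not a real HTML parser: it simply drops everything between a `<` and the next `>`. The name `strip_tags` is made up for this example.

```cpp
#include <cassert>
#include <string>

// Naive tag stripper: drops everything between '<' and the next '>'.
// Not a real HTML parser. It does not handle comments, the contents of
// <script>/<style> elements, entities such as &amp;, or a '>' inside
// an attribute value.
std::string strip_tags(const std::string& html) {
    std::string text;
    bool in_tag = false;
    for (std::string::size_type i = 0; i < html.size(); ++i) {
        const char c = html[i];
        if (c == '<') {
            in_tag = true;       // start skipping
        } else if (c == '>' && in_tag) {
            in_tag = false;      // tag closed, resume copying
        } else if (!in_tag) {
            text += c;           // ordinary content
        }
    }
    return text;
}
```

For example, `strip_tags("<p>Hello <b>world</b></p>")` gives `Hello world`. For real pages a proper parser library is the safer choice, since real-world HTML breaks simple approaches like this one quickly.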
Jan 21, 2011 at 7:45am
I'm having trouble picking the cURL download I need and working out how to install it. I do all my C++ in Visual Studio 2010. What is the right download for me? And can anyone guide me through adding it to my project? Thanks for all the help!
Jan 21, 2011 at 7:53am
Do you want a ready-made binary that you just run to get results, or do you want to add the cURL libraries to your project so your program can use cURL's functionality?

Either way, try contacting the cURL authors; I believe they can help you. Last time I visited, they also had dedicated cURL forums there.

curl.haxx.se
Jan 21, 2011 at 7:57am
I want my program to be able to function fully as an exe, so... I would think that means adding the cURL libraries to my project.
Jan 21, 2011 at 8:46am
I want my program to be able to function fully as an exe, so... I would think that means adding the cURL libraries to my project.

In that case you do need cURL in library form. Contact the cURL authors to see if they distribute cURL libraries for Windows. I know they do for Linux/Unix, but I'm not sure about Windows.
Jan 21, 2011 at 8:50am
Okay great! Thanks for all your help!
Jan 22, 2011 at 12:49am
I hope this isn't off-topic, but if this is more about productivity, you might want to consider using another language (e.g. Perl) for something like this.