Status
Not open for further replies.
1 - 8 of 8 Posts

·
Registered
Joined
·
6,227 Posts
Discussion Starter · #1 ·
Is there a simple scripting language way of doing this ???


I can't code for a damn, but I have a plan in mind that involves scraping data that is currently presented in HTML format... The format is simple and I have hope that I could extract the data, but I know nix about this subject...


Any pointers / hints / URLs for reading / etc. very welcome...
 

·
Registered
Joined
·
2,379 Posts
PP: What you are wanting to do is use a regular expression. The good news is that regular expressions are extremely powerful. The bad news is that they are somewhat tough to understand.


For instance, in a proof of concept DVD database, I parsed a YXY.dat file using the following PHP command:

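Code:
$handle = fopen("YXY.dat", "r");  // assumes the file sits in the current directory
$counter = 0;
$resultsarray = array();
// Read the file one line at a time (up to 511 bytes per read).
while (($buffer = fgets($handle, 512)) !== false)
  {
    $counter++;
    // Capture every value that sits between "=" and the next ";".
    preg_match_all("/=([^;]*)/", $buffer, $s);
    $resultsarray[$counter][1] = $s[1][3];  // fourth captured value
    $resultsarray[$counter][2] = $s[1][0];  // first captured value
  }
fclose($handle);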
The most comprehensive resource for this is Friedl's Mastering Regular Expressions.


A quicker, dirtier way is to go through the docs at php.net.
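
To make the pattern above a bit more concrete, here is a minimal JavaScript sketch of the same idea. The sample line and the HTML fragment are invented for illustration (the real layout of the .dat file isn't shown in this thread), and `capturesOf` is just a helper name made up for this example. Treat it as a starting point under those assumptions, not a finished scraper.

```javascript
// Collect the first capture group of every match of `pattern` in `text`.
// `capturesOf` is a hypothetical helper name, not part of any library.
function capturesOf(pattern, text) {
  // matchAll needs the global flag, so add it if the caller left it off.
  const flags = pattern.flags.includes("g") ? pattern.flags : pattern.flags + "g";
  const globalPattern = new RegExp(pattern.source, flags);
  return Array.from(text.matchAll(globalPattern), (m) => m[1]);
}

// Same pattern as the PHP snippet: everything between "=" and the next ";".
// The line below is invented sample data.
const line = "title=Alien;year=1979;genre=SciFi;id=42;";
const fields = capturesOf(/=([^;]*)/, line);

// The same idea applied to simple HTML: grab the text of each table cell.
// Only reliable when the markup is as regular as this made-up fragment.
const row = "<tr><td>Alien</td><td>1979</td></tr>";
const cells = capturesOf(/<td[^>]*>([^<]*)<\/td>/i, row);

console.log(fields); // the four values: Alien, 1979, SciFi, 42
console.log(cells);  // the two cell texts: Alien, 1979
```

Regular expressions like these work well when the page layout is simple and consistent; once tags start nesting or attributes vary a lot, a real parser is usually less painful.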


Good luck.
 

·
Registered
Joined
·
3,496 Posts
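You could try an XML parser. HTML (up through HTML 4) is an application of SGML and XML is a subset of it, so the two are closely related. There are a few HTML tags that do not follow the XML standard, such as <br> and <p>, because they are usually written without a closing tag.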


You could also try brute force with vbscript or javascript.


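VB would be something like:

' html is a string that contains the document body
' "<tag>" and "</tag>" stand in for whichever 5-character opening tag
' (and its closing tag) you are looking for
p = InStr(html, "<tag>")
If p > 0 Then
    html = Mid(html, p + 5)              ' keep everything after the opening tag
    p2 = InStr(html, "</tag>")
    If p2 > 0 Then
        tagContent = Mid(html, 1, p2 - 1)   ' text up to, but not including, the closing tag
        ' process your content
    End If
End If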


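JavaScript would be similar:

// "<tag>" and "</tag>" stand in for whichever 5-character opening tag
// (and its closing tag) you are looking for
var p = html.indexOf("<tag>");
if (p >= 0) {                            // indexOf returns -1 when nothing is found
    html = html.substring(p + 5);        // keep everything after the opening tag
    var p2 = html.indexOf("</tag>");
    if (p2 >= 0) {
        var tagContent = html.substring(0, p2);
        // process your content
    }
}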
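
The brute-force search above only picks up the first match. A rough sketch of how it might be looped to collect every occurrence is below. `extractAll` is a made-up helper name and the sample page is invented; it also assumes the tags appear exactly as typed (same case, no attributes), which real pages often violate.

```javascript
// Return the text between every openTag/closeTag pair found in html.
// `extractAll` is a hypothetical helper; tags are matched as plain strings.
function extractAll(html, openTag, closeTag) {
  const results = [];
  let pos = 0;
  while (true) {
    const start = html.indexOf(openTag, pos);
    if (start === -1) break;                 // no more opening tags
    const contentStart = start + openTag.length;
    const end = html.indexOf(closeTag, contentStart);
    if (end === -1) break;                   // unclosed tag, stop here
    results.push(html.substring(contentStart, end));
    pos = end + closeTag.length;             // keep searching after this pair
  }
  return results;
}

// Invented sample input.
const page = "<ul><li>First</li><li>Second</li></ul>";
console.log(extractAll(page, "<li>", "</li>")); // logs the two item texts, First and Second
```

Using `openTag.length` instead of a hard-coded offset means the same function works for tags of any length.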
 

·
Registered
Joined
·
174 Posts
It's not quite a scripting language solution, but the lynx browser is pretty good for simple stuff. I've had success using it in conjunction with a scripting language to parse and/or save simple HTML documents as text without the tags.
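
For what it's worth, lynx can do the conversion by itself (its -dump option writes the rendered text to standard output). If you'd rather stay inside a script, the sketch below is a much cruder stand-in: it is not what lynx does, just a few regular-expression replacements that throw away the tags. `htmlToText` is a made-up name, the sample page is invented, and only a handful of entities are decoded.

```javascript
// Crude HTML-to-text conversion. `htmlToText` is a hypothetical helper name.
function htmlToText(html) {
  return html
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1>/gi, "") // drop script/style blocks entirely
    .replace(/<[^>]*>/g, " ")   // replace every remaining tag with a space
    .replace(/&nbsp;/g, " ")    // decode a few common entities
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&amp;/g, "&")     // decode &amp; last so "&amp;lt;" ends up as "&lt;"
    .replace(/\s+/g, " ")       // collapse runs of whitespace
    .trim();
}

// Invented sample page.
const sample =
  "<html><body><h1>My DVDs</h1><p>Alien &amp; Aliens</p>" +
  "<script>var x = 1;</script></body></html>";
console.log(htmlToText(sample)); // My DVDs Alien & Aliens
```

This loses all layout (tables, lists, line breaks), so it is only worth using when you just need the words.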


---
 

·
Registered
Joined
·
108 Posts
I've actually written a C++ COM object that does this (sort of) - I call it "Netscrape". At its base is a heavily modified (by me) version of "HTML Tidy", a powerful command line tool that can translate HTML into XHTML. The object reads the HTML returned from any URL (including HTTPS), translates the HTML into XML, and returns to you a powerful but easy to use XML DOMDocument (MSXML) object. You can then use the DOM or XPath to dissect or pare the markup down in any manner you want. It is lightning fast and very reliable. However, you do have to know *SOME* code (like VBScript or JavaScript) to be able to use it and the related XML objects.


I could email you the binary and/or the source if you think you can use it. It is not something that I "support" or anything - just code I have lying around from previous projects.


The "documentation" is nearly nonexistent, but the interface is simple - pretty much one method that returns a DOMDocument - then you can use the MSXML documentation from MSDN and go nuts. There are other helper methods for quick selection and search/replace kinda stuff (useful for reformatting someone else's HTML - turn italics to bold and stuff, remove line breaks, etc.).


-Jeff
 

·
Registered
Joined
·
105 Posts
Here is the best I have found... I am certainly no programmer, but I have been working on a way of displaying my photo, DVD, and CD collection on my HTPC using a browser. My goal was to take a unique piece of info (like an ISBN or UPC code), which I store in an Access database, and get cover art, etc. from Amazon or something similar. But I wanted it displayed directly in my web interface. I have had some success using these utilities:


ASP Tear:
http://www.alphasierrapapa.com/IisDe...nents/AspTear/


WebDigger:
http://www.aspmodules.com/product_about.asp?ProductID=7



JW
 

·
Registered
Joined
·
389 Posts
Wow, it seems there are already several excellent suggestions here! I'll add just one more: there's a utility that comes with the free Enhydra application server called XMLC. It's actually an HTML parser which gives you a DOM tree for your manipulation pleasure. It's quite excellent and seems to fit the bill perfectly. It's also supported by an entire community of developers, and has extensive documentation.


So there.
 

·
Registered
Joined
·
6,227 Posts
Discussion Starter · #8 ·
WOW... Thanks guys... lotta reading to do...
 