Get HTML - Functional Specification
I want to be able to go to the internet and grab data from web pages. This is not straightforward. Web pages come in all sorts of shapes and sizes. Some require a secure login. Some have badly formed HTML. Others are under constant development and have structural changes that we need to accommodate. Still, you've got to do what you can.
The problems arise when you want a particular piece of data from a particular web site. Given the possibilities, it looks like we will have to write a completely new suite of code for each page accessed, and then keep maintaining it whenever someone decides to pretty the page up or add something new.
So, let's try to simplify the process a bit - and at the same time write some generic code so that we don't have to keep maintaining large chunks of it.
To get the data, we need to:
- Access the page using whatever security is in place (sometimes none).
- Fetch the HTML using Oracle packages and store this in a table.
- Process the table to extract our data.
The second stage is fairly generic too. After all, all we are doing is fetching HTML, regardless of what it contains. Let's do a raw HTML fetch - generically - and worry about what's in it later.
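To make the "fetch now, worry later" idea concrete, here is a minimal sketch. It is written in Python with an in-memory SQLite table purely for illustration; the real sub-system is meant to use Oracle packages, and none of the names below (raw_html, create_raw_table, fetch_raw_html) come from this specification. It also assumes a page that needs no login.

```python
import sqlite3
import urllib.request
from datetime import datetime, timezone


def create_raw_table(conn):
    # Hypothetical staging table: one row per fetch, body stored untouched.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_html ("
        " url TEXT NOT NULL,"
        " fetched_at TEXT NOT NULL,"
        " body TEXT NOT NULL)"
    )


def fetch_raw_html(conn, url, timeout=30):
    # Fetch the page without looking at its contents at all.
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = resp.headers.get_content_charset() or "utf-8"
        body = resp.read().decode(charset, errors="replace")
    # Store the raw text; parsing is left to a later stage.
    conn.execute(
        "INSERT INTO raw_html (url, fetched_at, body) VALUES (?, ?, ?)",
        (url, datetime.now(timezone.utc).isoformat(), body),
    )
    conn.commit()
    return body
```

The point of the design is that this routine knows nothing about the page it fetches, so a change to the page's layout never touches it.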
We can also help ourselves with stage 3. This cannot be generic, because each data extraction has to look at particular structures within the HTML and break them down to a particular item. However, raw HTML comes with a lot of stuff we know we don't need if we are just getting the data - mostly to do with presentation and rendering. Why don't we - generically - strip that stuff out and give the last stage something clean to work on? That way we take a lot of the burden off the data-specific code by simplifying it and making it less subject to change.
Our new process now looks like this:
- Access the page using whatever security is in place - Generic.
- Fetch the raw HTML using Oracle packages and store this in a table - Generic.
- Parse the raw HTML, stripping out everything that isn't in a data-specific tag, and format it into a clean HTML table - Generic.
- Process the clean HTML table to extract our data - Page specific.
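The generic clean-up stage in the list above could look something like the following. This is only a Python sketch of the logic, not the Oracle implementation, and it makes one big assumption: that "data-specific tag" means the table tags (table, tr, th, td). The names TableCleaner and clean_tables are made up for the example.

```python
import html
from html.parser import HTMLParser

# Assumption: only these tags carry data; everything else is presentation.
DATA_TAGS = {"table", "tr", "th", "td"}
# Tags whose text content is never data and must be dropped entirely.
SKIP_CONTENT = {"script", "style"}


class TableCleaner(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.table_depth = 0
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self.skip_depth += 1
        elif tag in DATA_TAGS:
            if tag == "table":
                self.table_depth += 1
            # Attributes (class, border, style...) are deliberately dropped.
            self.out.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in DATA_TAGS:
            if tag == "table":
                self.table_depth = max(0, self.table_depth - 1)
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Keep text only when it sits inside a table; collapse whitespace.
        # Note: adjacent text runs in one cell are joined without a space.
        text = " ".join(data.split())
        if text and self.table_depth and not self.skip_depth:
            self.out.append(html.escape(text, quote=False))


def clean_tables(raw_html):
    cleaner = TableCleaner()
    cleaner.feed(raw_html)
    cleaner.close()
    return "".join(cleaner.out)
```

Fed a page with navigation, styling and scripts wrapped around one table, clean_tables returns just the bare table markup with its cell text, which is what the page-specific stage then has to deal with.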
Further, there are downstream processes that are also generic, specifically to do with validation. HTML contains 'Special Common Entity Codes': character entities such as &amp; for an ampersand or &nbsp; for a non-breaking space. Included with this sub-system is a set of APIs to return the actual characters in their place.
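As a rough idea of what such an API does, here is a small Python sketch built on the standard library's html.unescape. The function name decode_entities and the nbsp_to_space option are inventions for this example, not part of the sub-system described here.

```python
import html


def decode_entities(text, nbsp_to_space=True):
    # html.unescape handles named (&amp;), decimal (&#169;) and hex (&#x41;) forms.
    decoded = html.unescape(text)
    if nbsp_to_space:
        # Assumption: downstream validation prefers plain spaces to U+00A0.
        decoded = decoded.replace("\u00a0", " ")
    return decoded
```

For example, decode_entities("Fish &amp; Chips") gives back "Fish & Chips", so the validation code never has to know about entity codes at all.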