Humble Trader

Sunday, July 30, 2006

Get HTML - get_html_ctl_d.sql

-- ------------------------------------------------------------------------- --
-- --
-- Title: get_html_ctl_d.sql --
-- Author: Steve Roach --
-- Date: 25-JUL-2006 --
-- Version: 01.00 --
-- --
-- Description: Create Control Database seed data for sub-system GET_HTML. --
-- --
-- History: --
-- --
-- Date Ver Author Description --
-- --------- ----- ------ -------------------------------------------------- --
-- 30-JUL-06 01.00 SRR First release --
-- --
-- ------------------------------------------------------------------------- --

DELETE FROM html_page_access
WHERE run_no IN
(SELECT run_no
FROM mapping_histories
WHERE mapping_library IN ('M_H2S_RAW_HTML'
,'M_S2S_PARSE_HTML'));

DELETE FROM mapping_histories
WHERE mapping_library IN ('M_H2S_RAW_HTML'
,'M_S2S_PARSE_HTML');

DELETE FROM mapping_libraries
WHERE mapping_name IN ('M_H2S_RAW_HTML'
,'M_S2S_PARSE_HTML')
/

COMMIT
/

INSERT INTO mapping_libraries
(
mapping_name
,rep_from
,rep_to
)
VALUES
(
'M_H2S_RAW_HTML'
,'EXTERNAL_HTML'
,'STAGING'
)
/

INSERT INTO mapping_libraries
(
mapping_name
,rep_from
,rep_to
)
VALUES
(
'M_S2S_PARSE_HTML'
,'STAGING'
,'STAGING'
)
/

COMMIT
/

Get HTML - GET_HTML

Scripts that support this object:

Saturday, July 29, 2006

Get HTML - HTML_TAGS

Scripts that support this object:

Get HTML - get_html_y.sql

-- ------------------------------------------------------------------------- --
-- --
-- Title: get_html_y.sql --
-- Author: Steve Roach --
-- Date: 25-JUL-2006 --
-- Version: 01.00 --
-- --
-- Description: Create required synonyms for sub system Get HTML. --
-- --
-- History: --
-- --
-- Date Ver Author Description --
-- --------- ----- ------ -------------------------------------------------- --
-- 25-JUL-06 01.00 SRR First release --
-- --
-- ------------------------------------------------------------------------- --

DROP SYNONYM repositories
/

DROP SYNONYM mapping_libraries
/

DROP SYNONYM mapping_histories
/

DROP SYNONYM html_page_access
/

DROP SYNONYM html_parse_passes
/

DROP SYNONYM html_pages
/

DROP SYNONYM html_tags
/

DROP SYNONYM ctl_gen
/

DROP SYNONYM ctl_get_html
/

CREATE SYNONYM repositories FOR ctl.repositories
/

CREATE SYNONYM mapping_libraries FOR ctl.mapping_libraries
/

CREATE SYNONYM mapping_histories FOR ctl.mapping_histories
/

CREATE SYNONYM html_page_access FOR ctl.html_page_access
/

CREATE SYNONYM html_parse_passes FOR ctl.html_parse_passes
/

CREATE SYNONYM html_pages FOR ctl.html_pages
/

CREATE SYNONYM html_tags FOR ctl.html_tags
/

CREATE SYNONYM ctl_gen FOR ctl.ctl_gen
/

CREATE SYNONYM ctl_get_html FOR ctl.ctl_get_html
/

EXIT SQL.SQLCODE

Get HTML - PARSED_HTML

Scripts that support this object:

Thursday, July 27, 2006

Get HTML - RAW_HTML

Scripts that support this object:

Get HTML - HTML_PAGE_ACCESS

Scripts that support this object:

Wednesday, July 26, 2006

Get HTML - Implementation Notes

Implementing in Oracle XE:

STA requires execute privilege on the supplied package; UTL_HTTP.

Note: Ignore warning that 'EXIT SQL.SQLCODE' will be ignored when running scripts.

  • Upload to CTL:
    • html_page_access_a.sql
    • html_page_access_t.sql
    • html_page_access_i.sql
    • html_page_access_c.sql
    • html_parse_passes_a.sql
    • html_parse_passes_t.sql
    • html_parse_passes_i.sql
    • html_parse_passes_c.sql
    • get_html_ctl_d.sql
    • ctl_get_html_a.sql
    • ctl_get_html_h.sql
    • ctl_get_html_b.sql
  • Upload to STA:
    • html_pages_a.sql
    • html_pages_t.sql
    • html_pages_i.sql
    • html_pages_c.sql
    • html_special_entity_codes_a.sql
    • html_special_entity_codes_t.sql
    • html_special_entity_codes_i.sql
    • html_special_entity_codes_c.sql
    • html_special_entity_codes_d.sql
    • html_tags_a.sql
    • html_tags_t.sql
    • html_tags_i.sql
    • html_tags_c.sql
    • html_tags_d.sql
    • raw_html_a.sql
    • raw_html_t.sql
    • raw_html_i.sql
    • raw_html_c.sql
    • parsed_html_a.sql
    • parsed_html_t.sql
    • parsed_html_i.sql
    • parsed_html_c.sql
    • get_html_a.sql
    • get_html_h.sql
    • get_html_b.sql
  • Run the following in order:
    • STA script: html_pages_a.sql
    • STA command: EXEC lib_util.create_grants;
    • CTL command: EXEC lib_util.create_synonyms;
    • CTL script: html_page_access_a.sql
    • CTL script: html_parse_passes_a.sql
    • CTL script: get_html_ctl_d.sql
    • CTL script: ctl_get_html_a.sql
    • CTL command: EXEC lib_util.create_grants;
    • STA command: EXEC lib_util.create_synonyms;
    • STA script: html_special_entity_codes_a.sql
    • STA script: html_special_entity_codes_d.sql
    • STA script: html_tags_a.sql
    • STA script: html_tags_d.sql
    • STA script: raw_html_a.sql
    • STA script: parsed_html_a.sql
    • STA script: get_html_a.sql
    • STA command: EXEC lib_util.create_grants;
    • CTL command: EXEC lib_util.create_synonyms;
  • Run the test script; CTL: ctl_get_html_test.sql. NOTE: Best run from the SQL Command Line.
    • Each test outputs a result and, at the end, a pass/fail summary.
  • Run the test script; STA: get_html_test.sql. NOTE: Best run from the SQL Command Line.
    • Each test outputs a result and, at the end, a pass/fail summary.

Get HTML - HTML_PAGES

Scripts that support this object:

Get HTML - Configuration Items

This is a list of links to posts that comprise the configuration of the sub-system:

Get HTML - Technical Specification

A number of objects will be built to support this sub-system. Some objects are related to the location of the target web-site and auditing the fetch and parse process, and these are better placed in the CTL schema. The other objects, which hold and carry actual data are placed in the STA schema because these activities are all about staging the web-page into the data warehouse.

CTL:

STA:

Get HTML - Functional Specification

I want to be able to go to the internet and grab data from web pages. This is not straight forward. Web pages come in all sorts of shapes and sizes. Some require secure login. Some have badly formed HTML. Others are under constant development and have structural changes that we need to accommodate. Still, you've got to do what you can.

The problems arise when you want a particular piece of data from a particular web site. Given the possibilities, it looks like we will have to write a completely new suite for each page accessed and then keep maintaining it if someone decides to pretty it up or add something new.

So, let's try and simplify the process down a bit - and at the same time write some generic code so that we don't have to keep maintaining large bits of it.

To get the data, we need to:

  • Access the page using whatever security is in place (sometimes none).
  • Fetch the HTML using Oracle packages and store this in a table.
  • Process the table to extract our data.
Well, apart from a few quirks regarding security, the first stage is common to whatever page we will access. That bit looks pretty generic already.

The second part is fairly generic too. After all, all we are doing is fetching HTML regardless of what it contains. Let's do a raw HTML fetch - generically - and worry about what's in it later.

We can also help ourselves for stage 3. This cannot be generic because each data extraction has to look at particular structures within the HTML and break them down to a particular bit. However, raw HTML comes with a lot of stuff we know we don't need if we are just getting the data - mostly to do with presentation and rendering. Why don't we - generically - strip that stuff out and give the last stage something clean to work on. That way we have taken a lot of the burden off the data-specific by code simplifying it and making it less subject to change.

Our new process now looks like this:
  • Access the page using whatever security is in place - Generic.
  • Fetch the raw HTML using Oracle packages and store this in a table - Generic.
  • Parse the raw HTML stripping out everything that isn't in a data-specific tag and format it into a clean HTML table - Generic
  • Process the clean HTML table to extract our data - Page specific.
This sub-system will implement the first three stages of this and leave it as something that will be called by any code that wants web page based data.

Further, there are downstream processes that are also generic, specifically to do with validation. HTML contains 'Special Common Entity Codes'. Included with this sub-system is a set of APIs to return actual characters in their place.

Get HTML

Links to posts for this sub-system:

Control Repository - RUN_NO_S

-- ------------------------------------------------------------------------- --
-- --
-- Title: run_no_s.sql --
-- Author: Steve Roach --
-- Date: 23-AUG-2006 --
-- Version: 01.00 --
-- --
-- Description: Create sequence RUN_NO_S. --
-- --
-- History: --
-- --
-- Date Ver Author Description --
-- --------- ----- ------ -------------------------------------------------- --
-- 23-AUG-06 01.00 SRR First release --
-- --
-- ------------------------------------------------------------------------- --

DROP SEQUENCE run_no_s
/

CREATE SEQUENCE run_no_s
START WITH 1
INCREMENT BY 1
NOCACHE
/

Control Repository - MAPPING_HISTORIES

Scripts that support this object:

Control Repository - MAPPING_LIBRARIES

Scripts that support this object:

Tuesday, July 25, 2006

Control Repository - REPOSITORIES

Scripts that support this object:

Wednesday, July 12, 2006

Country Dimension - Mapping

HTML Table -> Country Dimension:

Source
Target
Description
 IDSequence generated
Country
COU_NAMESUBSTR(Country, 1, 100)
ISO Alpha 2
COU_ISO2_CODE1:1
ISO Alpha 3
COU_ISO3_CODE1:1
UN Numeric 3
COU_ISO3_NUMBER1:1
Windows Country Region
 Loaded to staging but not mapped to dimension.
Windows Code
 Loaded to staging but not mapped to dimension.
Mac Name
 Loaded to staging but not mapped to dimension.
Mac Code
 Loaded to staging but not mapped to dimension.
 START_DATE01-JAN-0100
 END_DATE31-DEC-3000
 RUN_NOpk_ctl_gen.initialise_mapping ('M_S2W_D_COUNTRY') -> p_run_no
 RUN_NOpk_ctl_gen.initialise_mapping ('M_S2W_D_COUNTRY') -> p_run_date

Country Dimension - Sources

Initial:

The initial source data for this dimension is the UniCode Country Codes: ISO 3166, Microsoft, and Macintosh; here.

This page contains an 8 column table containing the target data.

Source - HTML Table:

Column
Datatype
Description
Country
CHARName of the country.
ISO Alpha 2
CHAR(2)2 character ISO code.
ISO Alpha 3
CHAR(3)3 character ISO code.
UN Numeric 3
CHAR(3)3 digit numeric UN code.
Windows Country Region
CHARWindows Country Region.
Windows Code
CHAR(3)Windows Code.
Mac Name
CHARMac Name.
Mac Code
CHARMac Code.


Ongoing:

This source is very slowly changing - in the order of magnitude that the web page itself may well change. Therefore, any future updating of the dimension is seen as a special one-off exercise. If the structure of the source web page has changed in the interim, this will be addressed at the time.

Tuesday, July 11, 2006

Country Dimension - Definition

Table:

Table
Description
D_COUNTRYThe Country Dimension contains the names of countries and their various codes as defined by the ISO.


Levels:
Level
Prefix
Description
ISO_COU
COUThe ISO Country.


Level Attributes:

Country Level:

Attribute
Datatype
Description
ID
NUMBERUID of the row.
NAME
VARCHAR2(100)Full name of the country.
ISO2_CODE
VARCHAR2(2)2 character ISO code.
ISO3_CODE
VARCHAR2(3)3 character ISO code.
UN3_NUMBER
VARCHAR2(3)3 digit numeric UN code.
START_DATE
DATEDate that the Country came into being.
END_DATE
DATEDate that the Country ceased.


OTHER :

These are audit and control attributes. While not strictly belonging to a Level, they are assigned to the most granular level - in this case; ISO_COU:

Attribute
Datatype
Description
RUN_NO
NUMBERThe Run Number that created this row.
INS_TSP
DATEInsert timestamp.

Country Dimension - Granularity

Level
Relative Granularity
Country250

Country Dimension - Roadmap

The Country Dimension

Links to posts for this sub-system: