Humble Trader

Friday, September 01, 2006

Get HTML - Package STA.GET_HTML Specification

  • Package: get_html
  • Description: Container for all procedures and functions relating to accessing data from remote web pages.
  • Procedure: fetch_html (Overload 1)
    • Description: Wrapper to do fetch with housekeeping.
    • Parameters:
      • p_web_page
        • Datatype: VARCHAR2
        • Direction: IN
        • Description: Local name of the web page to be accessed. Matches HTML_PAGES.NAME.
    • Sub-procedures:
      • Procedure: initialise_raw_html
        • Description: Set up mapping and page access rows for this run.
        • Parameters:
          • p_web_page
            • Datatype: VARCHAR2
            • Direction: IN
            • Description: Local name of the web page to be accessed. Matches HTML_PAGES.NAME.
          • p_run_no
            • Datatype: VARCHAR2
            • Direction: OUT
            • Description: Generated run_no for this run.
        • Action:
          • Run ctl_gen.initialise_mapping with parameters; mapping_name = 'M_H2S_RAW_HTML'. p_run_no is returned.
          • Run ctl_get_html.init_html_page_access with parameters; p_web_page = p_web_page and p_run_no = p_run_no returned in the previous call.
      • Procedure: finalise_raw_html
        • Description: Close page access and mapping for this run.
        • Parameters:
          • p_run_no
            • Datatype: VARCHAR2
            • Direction: IN
            • Description: run_no for this run.
        • Action:
          • Run ctl_get_html.html_page_access_update_status with parameters; p_run_no = p_run_no and p_new_status = 'FETCHED'.
          • Run ctl_gen.finalise_mapping with parameters; p_run_no = p_run_no.
    • Action:
      • Run sub-procedure initialise_raw_html with parameter; p_web_page = p_web_page. This returns p_run_no into a local variable, l_run_no.
      • Run fetch_html (Overload 2) with parameters; p_web_page = p_web_page and p_run_no = l_run_no.
      • Run sub-procedure finalise_raw_html with parameter; p_run_no = l_run_no.
  • Procedure: fetch_html (Overload 2)
    • Description: Fetch web page.
    • Parameters:
      • p_web_page
        • Datatype: VARCHAR2
        • Direction: IN
        • Description: Local name of the web page to be accessed. Matches HTML_PAGES.NAME.
      • p_run_no
        • Datatype: NUMBER
        • Direction: IN
        • Description: Run number of this fetch run.
    • Action:
      • Fetch the URL from HTML_PAGES by dereferencing p_web_page.
      • Fetch the HTML pieces from the web page into a collection using UTL_HTTP.REQUEST_PIECES.
      • For each piece in the collection:
        • Increment the piece sequence counter (forms part of the primary key for the target table and keeps the pieces in their page order).
        • Insert the piece into RAW_HTML using the following: run_no = p_run_no, piece_seq = piece sequence, html_piece = current collection element, and ins_tsp = SYSDATE.
      • End; For each piece in the collection.
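The piece loop above can be illustrated with a small Python sketch. This is not the package's PL/SQL: the function name `raw_html_rows`, the dictionary layout, and the in-memory splitting are assumptions made for illustration, with the split standing in for UTL_HTTP.REQUEST_PIECES (which returns the page in pieces of up to 2000 bytes).

```python
from datetime import datetime

PIECE_SIZE = 2000  # UTL_HTTP.REQUEST_PIECES returns pieces of up to 2000 bytes


def raw_html_rows(run_no, html, piece_size=PIECE_SIZE):
    """Build RAW_HTML-style rows from a page (hypothetical helper).

    The page is split in memory here; the real procedure receives the
    pieces from UTL_HTTP.REQUEST_PIECES instead.
    """
    rows = []
    piece_seq = 0  # completes the primary key and preserves page order
    for start in range(0, len(html), piece_size):
        piece_seq += 1
        rows.append({
            "run_no": run_no,
            "piece_seq": piece_seq,
            "html_piece": html[start:start + piece_size],
            "ins_tsp": datetime.now(),  # stands in for SYSDATE
        })
    return rows
```

Concatenating `html_piece` in `piece_seq` order reproduces the original page, which is what the later parse step relies on.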
  • Procedure: parse_html (Overload 1)
    • Description: Wrapper to do parse with housekeeping.
    • Parameters: None
    • Sub-procedures:
      • Procedure: initialise_parse_html
        • Description: Set up mapping for this run.
        • Parameters:
          • p_run_no
            • Datatype: VARCHAR2
            • Direction: OUT
            • Description: Generated run_no for this run.
        • Action:
          • Run ctl_gen.initialise_mapping with parameters; mapping_name = 'initialise_parse_html'. p_run_no is returned.
      • Procedure: finalise_parse_html
        • Description: Close mapping for this run.
        • Parameters:
          • p_run_no
            • Datatype: VARCHAR2
            • Direction: IN
            • Description: run_no for this run.
        • Action:
          • Run ctl_gen.finalise_mapping with parameters; p_run_no = p_run_no.
    • Action:
      • Run sub-procedure initialise_parse_html. This returns p_run_no into a local variable, l_run_no.
      • Run parse_html (Overload 2) with parameter; p_run_no = l_run_no.
      • Run sub-procedure finalise_parse_html with parameter; p_run_no = l_run_no.
  • Procedure: parse_html (Overload 2)
    • Description: Parse all page accesses with status of 'FETCHED'.
    • Parameters:
      • p_parse_run_no
        • Datatype: NUMBER
        • Direction: IN
        • Description: Run number of the parse run.
    • Action:
      • For each html_access row with a status of FETCHED identified by its run_no (locking the status for update):
        • Run ctl_get_html.init_html_parse_passes with parameters; p_parse_run_no = p_parse_run_no, p_raw_run_no = the identifying run_no of this row.
        • Run parse_html (Overload 3) with parameters; p_parse_run_no = p_parse_run_no, p_raw_run_no = the identifying run_no of this row.
        • Run ctl_get_html.html_parse_passes_update_stat with parameters; p_parse_run_no = p_parse_run_no, p_raw_run_no = the identifying run_no of this row, p_new_status = PARSED.
        • Run ctl_get_html.html_page_access_update_status with parameters; p_raw_run_no = the identifying run_no of this row, p_new_status = PARSED.
      • End; For each html_access row with a status of FETCHED identified by its run_no.
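As a rough illustration of the driver loop above, the following Python sketch walks the page accesses and moves each FETCHED one to PARSED. The dictionary of statuses and the `parse_page` callback are hypothetical stand-ins for the page access table and parse_html (Overload 3); row locking and the parse pass bookkeeping are left out.

```python
def parse_fetched(page_access, parse_page):
    """Parse every page access whose status is 'FETCHED'.

    page_access: dict mapping raw run_no -> status (hypothetical stand-in
                 for the page access table).
    parse_page:  callable taking a raw run_no (stands in for Overload 3).
    Returns the list of raw run numbers that were parsed.
    """
    parsed = []
    for raw_run_no, status in sorted(page_access.items()):
        if status != "FETCHED":
            continue
        parse_page(raw_run_no)
        page_access[raw_run_no] = "PARSED"  # status update after the parse
        parsed.append(raw_run_no)
    return parsed
```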
  • Procedure: parse_html (Overload 3)
    • Description: Parse web page.
    • Parameters:
      • p_parse_run_no
        • Datatype: NUMBER
        • Direction: IN
        • Description: Run number of the parse run.
      • p_raw_run_no
        • Datatype: NUMBER
        • Direction: IN
        • Description: Run number of the raw html.
    • Sub-procedures:
      • Procedure: push_tag
        • Description: Push a tag on to the tag stack.
        • Parameters:
          • p_tag_name
            • Datatype: VARCHAR2
            • Direction: IN
            • Description: Name of the tag.
        • Action:
          • Extend, by 1 element, the depth of the stack.
          • Write the tag name to the new element.
      • Procedure: pop_tag
        • Description: Pop a tag from the tag stack.
        • Parameters:
          • p_tag_name
            • Datatype: VARCHAR2
            • Direction: IN
            • Description: Name of the tag.
        • Action:
          • If the tag on the top of the stack is not the same as the one to be popped, raise an exception; the nested structure of tags is maintained in the stack and this test checks that the HTML is well structured.
          • Destroy the top element on the stack.
      • Procedure: process_tag
        • Description: Process a complete tag.
        • Parameters:
          • p_tag_name
            • Datatype: VARCHAR2
            • Direction: IN
            • Description: Name of the tag.
        • Action:
          • Convert the tag name to upper case.
          • If the second character in the tag string is not a forward slash ('/'), then...
            • Set the tag type to 'opening'.
            • Get the tag name ( = the tag string stripped of its first and last characters).
            • If the first and second characters of the tag name are 'A ', then...
              • This is the special case Anchor tag. Set the tag name to 'A'.
            • End; If the first and second characters of the tag name are 'A '.
          • Else; If the second character in the tag string is not a forward slash ('/') (i.e. this is a closing tag), then...
            • Set the tag type to 'closing'.
            • Get the tag name ( = the tag string stripped of its first, second and last characters).
          • End; If the second character in the tag string is not a forward slash ('/').
          • Fetch the treatment and paired data from HTML_TAGS by dereferencing the tag name.
            • If this fetch did not return exactly one row, raise an exception; the tag is either not an HTML tag or, more likely, its specification has not been added to HTML_TAGS yet.
          • If the tag is specified as 'paired' (i.e. where an opening tag is encountered, there must also be a closing tag), then...
            • If this is an opening tag, then...
              • Run push_tag with parameter; p_tag_name = the tag name.
            • Else; If this is an opening tag (i.e. this is a closing tag), then...
              • Run pop_tag with parameter; p_tag_name = the tag name.
            • End; If this is an opening tag.
          • End; If the tag is specified as 'paired'.
          • If the tag is specified as 'keep'; (i.e. The tag is data or structure related and will be written to PARSED_HTML), then...
            • If the tag is specified as 'paired', then...
              • If this is an opening tag, then...
                • Increment the indent (i.e. the local variable that maintains the indent structure of nested HTML).
              • Else; If this is an opening tag (i.e. this is a closing tag), then...
                • Decrement the indent.
              • End; If this is an opening tag.
            • End; If the tag is specified as 'paired'.
            • Increment the piece sequence (i.e. the local variable that maintains a counter to complete the primary key of PARSED_HTML).
            • Insert the tag into PARSED_HTML using the following; parse_run_no = p_parse_run_no (global to the outer parse_html (Overload 3) procedure), raw_run_no = p_raw_run_no (global to the outer parse_html (Overload 3) procedure), component_seq = piece sequence, indent = local indent, html_component = tag name, and ins_tsp = SYSDATE.
          • End; If the tag is specified as 'keep'.
      • Procedure: process_dat
        • Description: Process a piece of data.
        • Parameters:
          • p_data
            • Datatype: VARCHAR2
            • Direction: IN
            • Description: Container for a piece of data.
        • Action:
          • Strip leading and trailing spaces from the data and concatenate a single space to the end (this both removes spurious spaces and ensures that, where data is to wrap to the next row, a space separator exists).
          • If what remains of the data is neither a single space (in which case it was originally all spaces and is, therefore, unwanted) nor a lone non-breaking space entity, '&nbsp;' (individual non-breaking spaces are also unwanted), then...
            • Increment the piece sequence (i.e. the local variable that maintains a counter to complete the primary key of PARSED_HTML).
            • Insert the data into PARSED_HTML using the following; parse_run_no = p_parse_run_no (global to the outer parse_html (Overload 3) procedure), raw_run_no = p_raw_run_no (global to the outer parse_html (Overload 3) procedure), component_seq = piece sequence, indent = local indent, html_component = data, and ins_tsp = SYSDATE.
          • End; If what remains of the data is neither a single space nor a lone non-breaking space entity.
    • Action:
      • Set the parse state to 'component required', i.e. looking for the start of an HTML component.
      • For each HTML piece in RAW_HTML where the run_no = p_raw_run_no ordered by piece_seq, loop...
        • Increment piece_num; records the number of HTML pieces processed.
        • For each character in the HTML piece, loop...
          • Get the current character.
          • If the current character is ASCII(10) (newline), then...
            • Convert it to a space character.
          • End; If the current character is ASCII(10) (newline).
          • If the parse state is 'component required', then...
            • If the current character is 'lt', then...
              • Add the current character to the tag buffer.
              • Set the parse state to 'in tag', i.e. processing a tag.
            • Else (the current character is not 'lt')
              • Add the current character to the data buffer.
              • Set the parse state to 'in data', i.e. processing data.
            • End; If the current character is 'lt'.
          • Else; If the parse state is 'in tag', then...
            • If the current character is 'lt', then...
              • Raise exception 'tag opener found in a tag'.
            • Else; If the current character is ' ', then...
              • Set the parse state to 'got tag', i.e. the tag buffer contains a 'lt' followed by a tag name.
            • Else; If the current character is 'gt', then...
              • Add the current character to the tag buffer.
              • Run process_tag with parameter; p_tag = the tag buffer.
              • Set the parse state to 'component required'.
            • Else; None of the above, then...
              • Add the current character to the tag buffer.
            • End; If the current character is...
          • Else; If the parse state is 'got tag', then...
            • If the current character is 'gt', then...
              • Add the current character to the tag buffer.
              • Run process_tag with parameter; p_tag = the tag buffer.
              • Set the parse state to 'component required'.
            • End; If the current character is 'gt'.
            • If the tag buffer contains 'lt' || 'A' - i.e. the start of an Anchor tag, then...
              • Add ' ' to the tag buffer.
            • End; If the tag buffer contains 'lt' || 'A'.
            • If the tag buffer contains 'lt' || 'A' || ' ' - i.e. the start of an Anchor tag followed by a space, then...
              • Add the current character to the tag buffer.
            • End; If the tag buffer contains 'lt' || 'A' || ' '.
          • Else; If the parse state is 'in data', then...
            • If the current character is 'gt', then...
              • Raise exception 'tag closer found in data'.
            • Else; If the current character is 'lt', then...
              • Run process_dat with parameter; p_data = the data buffer.
              • Assign the current character to the tag buffer.
              • Set the parse state to 'in tag'.
            • Else; None of the above, then...
              • Add the current character to the data buffer.
            • End; If the current character is ...
          • End; If the parse state is ...
        • End; For each character in the HTML piece.
      • End; For each HTML piece in RAW_HTML where the run_no = p_raw_run_no ordered by piece_seq.
      • If the piece_num indicates that no HTML has been processed, then...
        • Raise exception 'no html parsed'.
      • End; If the piece_num indicates that no HTML has been processed.
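The state machine and its sub-procedures can be sketched in Python as follows, treating '<' ('lt') as the tag opener and '>' ('gt') as the tag closer. This is a simplified illustration rather than the package's PL/SQL: the `HTML_TAGS` dictionary is a hypothetical stand-in for the HTML_TAGS table, every tag is treated as 'keep', the indent is taken from the stack depth, the '&nbsp; ' comparison is an assumption about how a lone non-breaking space appears after stripping, and all tag attributes are skipped (anchor attributes included, since the stored tag name is reduced to 'A' in any case).

```python
# Hypothetical stand-in for HTML_TAGS: tag name -> is the tag 'paired'?
HTML_TAGS = {"TABLE": True, "TR": True, "TD": True, "A": True, "BR": False}


def parse(pieces):
    """Return (indent, component) rows parsed from an iterable of HTML pieces."""
    rows, stack = [], []

    def process_tag(tag):
        tag = tag.upper()
        if tag[1] != "/":
            kind, name = "opening", tag[1:-1]
        else:
            kind, name = "closing", tag[2:-1]
        paired = HTML_TAGS[name]  # KeyError: tag not specified yet
        if paired and kind == "opening":
            stack.append(name)  # push_tag
        elif paired:
            if not stack or stack[-1] != name:  # pop_tag structure check
                raise ValueError(f"badly nested closing tag {name}")
            stack.pop()
        rows.append((len(stack), name))  # stack depth doubles as indent

    def process_dat(data):
        data = data.strip(" ") + " "
        if data not in (" ", "&nbsp; "):  # assumed form of a lone nbsp
            rows.append((len(stack), data))

    state, tag_buf, data_buf = "component required", "", ""
    piece_num = 0
    for piece in pieces:
        piece_num += 1
        for ch in piece:
            if ch == "\n":
                ch = " "
            if state == "component required":
                if ch == "<":
                    tag_buf, state = ch, "in tag"
                else:
                    data_buf, state = ch, "in data"
            elif state == "in tag":
                if ch == "<":
                    raise ValueError("tag opener found in a tag")
                if ch == " ":
                    state = "got tag"  # name complete; attributes follow
                elif ch == ">":
                    process_tag(tag_buf + ch)
                    state = "component required"
                else:
                    tag_buf += ch
            elif state == "got tag":
                if ch == ">":  # attributes are skipped in this sketch
                    process_tag(tag_buf + ch)
                    state = "component required"
            else:  # in data
                if ch == ">":
                    raise ValueError("tag closer found in data")
                if ch == "<":
                    process_dat(data_buf)
                    tag_buf, state = ch, "in tag"
                else:
                    data_buf += ch
    if piece_num == 0:
        raise ValueError("no html parsed")
    return rows
```

Because the buffers live outside the piece loop, a tag or a run of data that straddles two pieces is still handled as one component.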
  • Function: get_scec
    • Description: Get character from Special Common Entity Code.
    • Parameters:
      • p_scec
        • Datatype: VARCHAR2
        • Direction: IN
        • Description: Special Common Entity Code. Matches html_special_entity_codes.special_entity_code.
      • RETURN
        • Datatype: VARCHAR2
        • Description: Character dereferenced from the Special Common Entity Code.
    • Action:
      • Fetch the character from html_special_entity_codes using p_scec.
      • Return the character.
  • Function: replace_scec
    • Description: Replace all Special Common Entity Codes in a string.
    • Parameters:
      • p_string
        • Datatype: VARCHAR2
        • Direction: IN
        • Description: String to be replaced.
      • RETURN
        • Datatype: VARCHAR2
        • Description: Replaced string.
    • Action:
      • Capture in-string locally.
      • Loop forever...
        • Get the position of the first '&' character.
        • If there are no more scec to process, exit loop.
        • Concatenate all in-string characters before the '&' to out-string.
        • Strip those characters assigned to out-string from in-string plus the '&'.
        • Concatenate the dereferenced scec to the out-string.
        • Strip the remainder of the scec from the in-string.
      • End; Loop forever.
      • Concatenate any remaining in-string characters to the out-string.
      • Return the out-string.
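The replace_scec loop can be sketched in Python as below. Several details are assumptions, since the outline does not state them: that a code ends at the next ';', that the lookup key is the text between '&' and ';', and the contents of the `SPECIAL_ENTITY_CODES` dictionary, which is a hypothetical stand-in for html_special_entity_codes.

```python
# Hypothetical stand-in for the html_special_entity_codes table.
SPECIAL_ENTITY_CODES = {
    "amp": "&", "lt": "<", "gt": ">", "quot": '"', "nbsp": "\u00a0",
}


def get_scec(scec):
    """Return the character for a special entity code (KeyError if unknown)."""
    return SPECIAL_ENTITY_CODES[scec]


def replace_scec(string):
    """Replace every '&code;' in string with its character."""
    in_string, out_string = string, ""
    while True:
        pos = in_string.find("&")
        if pos == -1:  # no more codes to process
            break
        out_string += in_string[:pos]    # text before the '&'
        in_string = in_string[pos + 1:]  # drop that text and the '&'
        end = in_string.index(";")       # assumption: code ends at next ';'
        out_string += get_scec(in_string[:end])
        in_string = in_string[end + 1:]  # drop the rest of the code
    return out_string + in_string
```

A bare '&' with no closing ';', or an unknown code, raises an exception in this sketch; how the real function treats those cases is not covered by the outline.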
