Get HTML - Package STA.GET_HTML Specification
- Package: get_html
- Description: Container for all procedures and functions relating to accessing data from remote web pages.
- Procedure: fetch_html (Overload 1)
- Description: Wrapper to do fetch with housekeeping.
- Parameters:
- p_web_page
- Datatype: VARCHAR2
- Direction: IN
- Description: Local name of the web page to be accessed. Matches HTML_PAGES.NAME.
- Sub-procedures:
- Procedure: initialise_raw_html
- Description: Set up mapping and page access rows for this run.
- Parameters:
- p_web_page
- Datatype: VARCHAR2
- Direction: IN
- Description: Local name of the web page to be accessed. Matches HTML_PAGES.NAME.
- p_run_no
- Datatype: VARCHAR2
- Direction: OUT
- Description: Generated run_no for this run.
- Action:
- Run ctl_gen.initialise_mapping with parameters; mapping_name = 'M_H2S_RAW_HTML'. p_run_no is returned.
- Run ctl_get_html.init_html_page_access with parameters; p_web_page = p_web_page and p_run_no = p_run_no returned in the previous call.
- Procedure: finalise_raw_html
- Description: Close page access and mapping for this run.
- Parameters:
- p_run_no
- Datatype: VARCHAR2
- Direction: IN
- Description: run_no for this run.
- Action:
- Run ctl_get_html.html_page_access_update_status with parameters; p_run_no = p_run_no and p_new_status = 'FETCHED'.
- Run ctl_gen.finalise_mapping with parameters; p_run_no = p_run_no.
- Action:
- Run sub-procedure initialise_raw_html with parameter; p_web_page = p_web_page. This returns p_run_no into a local variable, l_run_no.
- Run fetch_html (Overload 2) with parameters; p_web_page = p_web_page and p_run_no = l_run_no.
- Run sub-procedure finalise_raw_html with parameter; p_run_no = l_run_no.
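The wrapper's three steps can be sketched as follows. This is an illustrative Python outline rather than the package's PL/SQL: the three callables are hypothetical stand-ins for initialise_raw_html, fetch_html (Overload 2) and finalise_raw_html, and the return value is added only so the sketch can be checked.

```python
from typing import Callable


def fetch_html_with_housekeeping(
    web_page: str,
    initialise: Callable[[str], int],
    fetch: Callable[[str, int], None],
    finalise: Callable[[int], None],
) -> int:
    """Run the fetch between its set-up and close-down housekeeping steps."""
    # Stand-in for initialise_raw_html: returns the generated run number.
    l_run_no = initialise(web_page)
    # Stand-in for fetch_html (Overload 2): does the actual fetch.
    fetch(web_page, l_run_no)
    # Stand-in for finalise_raw_html: closes page access and mapping.
    finalise(l_run_no)
    # Returned for illustration only; the procedure described above returns nothing.
    return l_run_no
```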
- Procedure: fetch_html (Overload 2)
- Description: Fetch web page.
- Parameters:
- p_web_page
- Datatype: VARCHAR2
- Direction: IN
- Description: Local name of the web page to be accessed. Matches HTML_PAGES.NAME.
- p_run_no
- Datatype: NUMBER
- Direction: IN
- Description: Run number of this fetch run.
- Action:
- Fetch the URL from HTML_PAGES by dereferencing p_web_page.
- Fetch the HTML pieces from the web page into a collection using UTL_HTTP.REQUEST_PIECES.
- For each piece in the collection:
- Increment the piece sequence counter (forms part of the primary key for the target table and keeps the pieces in their page order).
- Insert the piece into RAW_HTML using the following: run_no = p_run_no, piece_seq = piece sequence, html_piece = current collection element, and ins_tsp = SYSDATE.
- End; For each piece in the collection.
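The piece numbering above can be sketched in Python as follows. The splitting function only imitates the way UTL_HTTP.REQUEST_PIECES hands a page back in pieces of at most 2000 bytes (characters are used here for simplicity); the function names and the dictionary row layout are assumptions made for the illustration, not part of the package.

```python
from datetime import datetime
from typing import Any, Dict, List

PIECE_SIZE = 2000  # assumed piece length, mirroring REQUEST_PIECES


def split_into_pieces(html: str, piece_size: int = PIECE_SIZE) -> List[str]:
    """Cut the page text into consecutive pieces of at most piece_size characters."""
    return [html[i:i + piece_size] for i in range(0, len(html), piece_size)]


def build_raw_html_rows(run_no: int, pieces: List[str]) -> List[Dict[str, Any]]:
    """Build one RAW_HTML-style row per piece, numbered in page order."""
    rows = []
    piece_seq = 0
    for piece in pieces:
        # The counter completes the key (run_no, piece_seq) and preserves order.
        piece_seq += 1
        rows.append(
            {
                "run_no": run_no,
                "piece_seq": piece_seq,
                "html_piece": piece,
                "ins_tsp": datetime.now(),  # stands in for SYSDATE
            }
        )
    return rows
```

Joining html_piece in piece_seq order reproduces the original page text, which is what the later parse step relies on.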
- Procedure: parse_html (Overload 1)
- Description: Wrapper to do parse with housekeeping.
- Parameters: None
- Sub-procedures:
- Procedure: initialise_parse_html
- Description: Set up mapping and page access rows for this run.
- Parameters:
- p_run_no
- Datatype: VARCHAR2
- Direction: OUT
- Description: Generated run_no for this run.
- Action:
- Run ctl_gen.initialise_mapping with parameters; mapping_name = 'initialise_parse_html'. p_run_no is returned.
- Procedure: finalise_parse_html
- Description: Close mapping for this run.
- Parameters:
- p_run_no
- Datatype: VARCHAR2
- Direction: IN
- Description: run_no for this run.
- Action:
- Run ctl_gen.finalise_mapping with parameters; p_run_no = p_run_no.
- Action:
- Run sub-procedure initialise_parse_html. This returns p_run_no into a local variable, l_run_no.
- Run parse_html (Overload 2) with parameter; p_run_no = l_run_no.
- Run sub-procedure finalise_parse_html with parameter; p_run_no = l_run_no.
- Procedure: parse_html (Overload 2)
- Description: Parse all page accesses with status of 'FETCHED'.
- Parameters:
- p_parse_run_no
- Datatype: NUMBER
- Direction: IN
- Description: Run number of the parse run.
- Action:
- For each html_access row with a status of FETCHED identified by its run_no (locking the status for update):
- Run ctl_get_html.init_html_parse_passes with parameters; p_parse_run_no = p_parse_run_no, p_raw_run_no = the identifying run_no of this row.
- Run parse_html (Overload 3) with parameters; p_parse_run_no = p_parse_run_no, p_raw_run_no = the identifying run_no of this row.
- Run ctl_get_html.html_parse_passes_update_stat with parameters; p_parse_run_no = p_parse_run_no, p_raw_run_no = the identifying run_no of this row, p_new_status = PARSED.
- Run ctl_get_html.html_page_access_update_status with parameters; p_raw_run_no = the identifying run_no of this row, p_new_status = PARSED.
- End; For each html_access row with a status of FETCHED identified by its run_no.
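A rough Python outline of this driver loop is given below. The list of dictionaries stands in for the page access rows, parse_page is a hypothetical stand-in for parse_html (Overload 3), and the parse-pass bookkeeping and row locking are left out of the sketch.

```python
from typing import Callable, Dict, List


def parse_fetched_pages(
    parse_run_no: int,
    page_access_rows: List[Dict],
    parse_page: Callable[[int, int], None],
) -> None:
    """Parse every page access still marked FETCHED, then mark it PARSED."""
    for row in page_access_rows:
        if row["status"] != "FETCHED":
            continue  # only fetched, not yet parsed, accesses are processed
        # The row's own run_no identifies the raw HTML to be parsed.
        parse_page(parse_run_no, row["run_no"])
        row["status"] = "PARSED"
```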
- Procedure: parse_html (Overload 3)
- Description: Parse web page.
- Parameters:
- p_parse_run_no
- Datatype: NUMBER
- Direction: IN
- Description: Run number of the parse run.
- p_raw_run_no
- Datatype: NUMBER
- Direction: IN
- Description: Run number of the raw html.
- Sub-procedures:
- Procedure: push_tag
- Description: Push a tag on to the tag stack.
- Parameters:
- p_tag_name
- Datatype: VARCHAR2
- Direction: IN
- Description: Name of the tag.
- Action:
- Extend, by 1 element, the depth of the stack.
- Write the tag name to the new element.
- Procedure: pop_tag
- Description: Pop a tag from the tag stack.
- Parameters:
- p_tag_name
- Datatype: VARCHAR2
- Direction: IN
- Description: Name of the tag.
- Action:
- If the tag on the top of the stack is not the same as the one to be popped, raise an exception. The nested structure of tags is maintained in the stack, and this test checks that the HTML is well structured.
- Destroy the top element on the stack.
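The tag stack can be sketched in Python as follows, assuming the check in pop_tag raises when the top of the stack differs from the tag being popped. The class name, the depth method and the exception type are choices made for the illustration.

```python
from typing import List


class TagStack:
    """Tracks currently open paired tags so that bad nesting is detected."""

    def __init__(self) -> None:
        self._tags: List[str] = []

    def push_tag(self, tag_name: str) -> None:
        # Extend the stack by one element holding the new tag name.
        self._tags.append(tag_name)

    def pop_tag(self, tag_name: str) -> None:
        # A closing tag must match the most recently opened tag.
        if not self._tags or self._tags[-1] != tag_name:
            raise ValueError(f"unexpected closing tag: {tag_name}")
        # Destroy the top element.
        self._tags.pop()

    def depth(self) -> int:
        return len(self._tags)
```

For example, pushing TABLE then TR and popping TR leaves a depth of 1, while popping TD at that point raises the exception.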
- Procedure: process_tag
- Description: Process a complete tag.
- Parameters:
- p_tag_name
- Datatype: VARCHAR2
- Direction: IN
- Description: Name of the tag.
- Action:
- Convert the tag name to upper case.
- If the second character in the tag string is not a forward slash ('/'), then...
- Set the tag type to 'opening'.
- Get the tag name ( = the tag string stripped of its first and last characters).
- If the first and second characters of the tag name are 'A ', then...
- This is the special case Anchor tag. Set the tag name to 'A'.
- End; If the first and second characters of the tag name are 'A '.
- Else (the second character in the tag string is a forward slash ('/')), then...
- Set the tag type to 'closing'.
- Get the tag name ( = the tag string stripped of its first, second and last characters).
- End; If the second character in the tag string is not a forward slash ('/').
- Fetch the treatment and paired data from HTML_TAGS by dereferencing the tag name.
- If this fetch didn't return exactly one row, raise an exception. The tag is either not an HTML tag or, more likely, its specification has not been added to HTML_TAGS yet.
- If the tag is specified as 'paired' (i.e. where an opening tag is encountered, there must also be a closing tag), then...
- If this is an opening tag, then...
- Run push_tag with parameter; p_tag_name = the tag name.
- Else; If this is an opening tag (i.e. this is a closing tag), then...
- Run pop_tag with parameter; p_tag_name = the tag name.
- End; If this is an opening tag.
- End; If the tag is specified as 'paired'.
- If the tag is specified as 'keep'; (i.e. The tag is data or structure related and will be written to PARSED_HTML), then...
- If the tag is specified as 'paired', then...
- If this is an opening tag, then...
- Increment the indent (i.e. the local variable that maintains the indent structure of nested HTML).
- Else; If this is an opening tag (i.e. this is a closing tag), then...
- Decrement the indent.
- End; If this is an opening tag.
- End; If the tag is specified as 'paired'.
- Increment the piece sequence (i.e. the local variable that maintains a counter to complete the primary key of PARSED_HTML).
- Insert the tag into PARSED_HTML using the following; parse_run_no = p_parse_run_no (global to the outer parse_html (3) procedure), raw_run_no = p_raw_run_no (global to the outer parse_html (3) procedure), component_seq = piece sequence, indent = local indent, html_component = tag name, and ins_tsp = SYSDATE.
- End; If the tag is specified as 'keep'.
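The first part of process_tag, working out the tag type and name from the complete tag string, can be sketched in Python as below. The sketch assumes that a closing tag is marked by a forward slash in the second position, as in standard HTML; the function name is hypothetical, and the HTML_TAGS lookup, stack handling and insert are not shown.

```python
from typing import Tuple


def classify_tag(tag_string: str) -> Tuple[str, str]:
    """Return (tag type, tag name) for a complete tag such as '<td>' or '</td>'."""
    tag = tag_string.upper()
    if tag[1] != "/":
        tag_type = "opening"
        # Strip the leading '<' and the trailing '>'.
        tag_name = tag[1:-1]
        # Special case: an anchor carries its attributes, so keep only the 'A'.
        if tag_name[:2] == "A ":
            tag_name = "A"
    else:
        tag_type = "closing"
        # Strip the leading '</' and the trailing '>'.
        tag_name = tag[2:-1]
    return tag_type, tag_name
```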
- Procedure: process_dat
- Description: Process a piece of data.
- Parameters:
- p_data
- Datatype: VARCHAR2
- Direction: IN
- Description: Container for a piece of data.
- Action:
- Strip leading and trailing spaces from the data and concatenate a single space to the end (this both removes spurious spaces and ensures that, where data is to wrap to the next row, a space separator exists).
- If all that remains of the data is not a single space (in which case it was originally all spaces and is, therefore, unwanted) and is not '&nbsp;' (individual non-breaking spaces are also unwanted), then...
- Increment the piece sequence (i.e. the local variable that maintains a counter to complete the primary key of PARSED_HTML).
- Insert the data into PARSED_HTML using the following; parse_run_no = p_parse_run_no (global to the outer parse_html (3) procedure), raw_run_no = p_raw_run_no (global to the outer parse_html (3) procedure), component_seq = piece sequence, indent = local indent, html_component = data, and ins_tsp = SYSDATE.
- End; If all that remains of the data is not a single space.
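The clean-up rule in process_dat can be sketched in Python as follows. Two points are assumptions: that the unwanted lone entity is the non-breaking space code '&nbsp;', and that it is compared before the separator space is appended. The function name is hypothetical, and the function returns None where the procedure would simply skip the insert.

```python
from typing import Optional


def normalise_data(data: str) -> Optional[str]:
    """Return the cleaned data piece, or None when it should be discarded."""
    stripped = data.strip(" ")
    # All-space data and a lone non-breaking space code carry no content.
    if stripped == "" or stripped == "&nbsp;":
        return None
    # The trailing space keeps words apart if the data wraps onto the next row.
    return stripped + " "
```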
- Action:
- Set the parse state to 'component required', i.e. looking for the start of an HTML component.
- For each HTML piece in RAW_HTML where the run_no = p_raw_run_no ordered by piece_seq, loop...
- Increment piece_num; this records the number of HTML pieces processed.
- For each character in the HTML piece, loop...
- Get the current character.
- If the current character is ASCII(10) (newline), then...
- Convert it to a space character.
- End; If the current character is ASCII(10) (newline).
- If the parse state is 'component required', then...
- If the current character is 'lt', then...
- Add the current character to the tag buffer.
- Set the parse state to 'in tag', i.e. processing a tag.
- Else (the current character is not 'lt')
- Add the current character to the data buffer.
- Set the parse state to 'in data', i.e. processing data.
- End; If the current character is 'lt'.
- Else; If the parse state is 'in tag', then...
- If the current character is 'lt', then...
- Raise exception 'tag opener found in a tag'.
- Else; If the current character is ' ', then...
- Set the parse state to 'got tag', i.e. the tag buffer contains a 'lt' followed by a tag name.
- Else; If the current character is 'gt', then...
- Add the current character to the tag buffer.
- Run process_tag with parameter; p_tag = the tag buffer.
- Set the parse state to 'component required'.
- Else; None of the above, then...
- Add the current character to the tag buffer.
- End; If the current character is...
- Else; If the parse state is 'got tag', then...
- If the current character is 'gt', then...
- Add the current character to the tag buffer.
- Run process_tag with parameter; p_tag = the tag buffer.
- Set the parse state to 'component required'.
- End; If the current character is 'gt'.
- If the tag buffer contains 'lt' || 'A' - i.e. the start of an Anchor tag, then...
- Add ' ' to the tag buffer.
- End; If the tag buffer contains 'lt' || 'A'.
- If the tag buffer contains 'lt' || 'A' || ' ' - i.e. the start of an Anchor tag followed by a space, then...
- Add the current character to the tag buffer.
- End; If the tag buffer contains 'lt' || 'A' || ' '.
- Else; If the parse state is 'in data', then...
- If the current character is 'gt', then...
- Raise exception 'tag closer found in data'.
- Else; If the current character is 'lt', then...
- Run process_dat with parameter; p_data = the data buffer.
- Assign the current character to the tag buffer.
- Set the parse state to 'in tag'.
- Else; None of the above, then...
- Add the current character to the data buffer.
- End; If the current character is ...
- End; If the parse state is ...
- End; For each character in the HTML piece.
- End; For each HTML piece in RAW_HTML where the run_no = p_raw_run_no ordered by piece_seq.
- If the piece_num indicates that no HTML has been processed, then...
- Raise exception 'no html parsed'.
- End; If the piece_num indicates that no HTML has been processed.
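The character-by-character state machine can be sketched in Python as below, reading 'lt' and 'gt' as the characters '<' and '>'. This is a simplified illustration, not the package's code: it collects components in a list instead of calling process_tag and process_dat, it discards the attributes of every tag (the special handling that keeps Anchor attributes is omitted), and the function name is hypothetical. As in the steps above, data left in the buffer when the input ends is not emitted.

```python
from typing import Iterable, List, Tuple


def tokenise_html(pieces: Iterable[str]) -> List[Tuple[str, str]]:
    """Split HTML pieces into ('tag', text) and ('data', text) components."""
    state = "component required"
    tag_buffer = ""
    data_buffer = ""
    components: List[Tuple[str, str]] = []
    piece_num = 0
    for piece in pieces:
        piece_num += 1
        for ch in piece:
            if ch == "\n":
                ch = " "  # newlines are treated as spaces
            if state == "component required":
                if ch == "<":
                    tag_buffer = ch
                    state = "in tag"
                else:
                    data_buffer = ch
                    state = "in data"
            elif state == "in tag":
                if ch == "<":
                    raise ValueError("tag opener found in a tag")
                elif ch == " ":
                    state = "got tag"  # buffer now holds '<' plus the tag name
                elif ch == ">":
                    components.append(("tag", tag_buffer + ch))
                    state = "component required"
                else:
                    tag_buffer += ch
            elif state == "got tag":
                # Simplification: attribute characters are skipped for all tags.
                if ch == ">":
                    components.append(("tag", tag_buffer + ch))
                    state = "component required"
            else:  # state == "in data"
                if ch == ">":
                    raise ValueError("tag closer found in data")
                elif ch == "<":
                    components.append(("data", data_buffer))
                    tag_buffer = ch
                    state = "in tag"
                else:
                    data_buffer += ch
    if piece_num == 0:
        raise ValueError("no html parsed")
    return components
```

Because the state and buffers live outside the piece loop, a tag or data item that straddles two pieces is still assembled correctly.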
- Function: get_scec
- Description: Get character from Special Common Entity Code.
- Parameters:
- p_scec
- Datatype: VARCHAR2
- Direction: IN
- Description: Special Common Entity Code. Matches html_special_entity_codes.special_entity_code.
- RETURN
- Datatype: VARCHAR2
- Description: Character dereferenced from the Special Common Entity Code.
- Action:
- Fetch the character from html_special_entity_codes using p_scec.
- Return the character.
- Function: replace_scec
- Description: Replace all Special Common Entity Codes in a string.
- Parameters:
- p_string
- Datatype: VARCHAR2
- Direction: IN
- Description: String to be replaced.
- RETURN
- Datatype: VARCHAR2
- Description: Replaced string.
- Action:
- Capture in-string locally.
- Loop forever...
- Get the position of the first '&' character.
- If there are no more scec to process, exit loop.
- Concatenate all in-string characters before the '&' to out-string.
- Strip those characters assigned to out-string from in-string plus the '&'.
- Concatenate the dereferenced scec to the out-string.
- Strip the remainder of the scec from the in-string.
- End; Loop forever.
- Concatenate any remaining in-string characters to the out-string.
- Return the out-string.
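The replacement loop can be sketched in Python as follows. Several details are assumptions, since the steps above do not state them: that each code ends at the next ';', that the lookup is keyed on the bare name (for example 'amp'), and that an unterminated code is an error. The small dictionary is a hypothetical stand-in for html_special_entity_codes.

```python
# Hypothetical stand-in for the html_special_entity_codes table.
ENTITY_CHARS = {"amp": "&", "lt": "<", "gt": ">", "quot": '"', "nbsp": "\u00a0"}


def get_scec(scec: str) -> str:
    """Look up the character for one entity code (KeyError if unknown)."""
    return ENTITY_CHARS[scec]


def replace_scec(string: str) -> str:
    """Replace every '&name;' entity code in the string with its character."""
    in_string = string
    out_string = ""
    while True:
        amp_pos = in_string.find("&")
        if amp_pos == -1:
            break  # no more codes to process
        # Move everything before the '&' to the output and drop the '&'.
        out_string += in_string[:amp_pos]
        in_string = in_string[amp_pos + 1:]
        end_pos = in_string.find(";")
        if end_pos == -1:
            raise ValueError("unterminated entity code")
        # Append the dereferenced character, then strip the code and its ';'.
        out_string += get_scec(in_string[:end_pos])
        in_string = in_string[end_pos + 1:]
    # Whatever follows the last code is copied through unchanged.
    return out_string + in_string
```

The loop always terminates because each pass removes at least the '&' from the in-string, and replaced characters go to the out-string, where they are never rescanned.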