The ParseString method is simple: it invokes the parser over the
string passed as a parameter and returns the resulting object model.
The parser finds only the elements defined through prior calls
to the AddTag method. All other content is treated as plain text,
whether or not it contains other HTML tags. This is
convenient when using an HTML file as a template, because the script
does not need to deal with the rest of the content. The script
defines the HTML tags that are meaningful for its work with a few
calls to the AddTag method and then calls the ParseString method
to obtain the object structure of the file. Even a trick such as
deliberately treating some HTML elements "incorrectly"
can help to simplify the returned structure. For example, if you
are interested only in some attributes of the <BODY> tag,
you may tell the parser that it has no closing tag
(</BODY>). This will not cause problems when the document is
generated again, because the </BODY> tag will be produced
from the "plain text" corresponding to the non-parsed
parts of the document, and the benefits can be considerable if you
need to simplify the document tree. It is nevertheless recommended to
define elements according to their usual HTML role in order to
keep the source code easier to understand. This is the only
reason, though: if the benefits are considerable, feel free to treat
some HTML elements in whatever way is most convenient.
How are the found tags represented?
The returned collection is the document root. Every tag found
is represented as an item in it, and this item is itself a
VarDictionary collection. Every HTML attribute specified for
the tag appears in that collection as a named item: the item
name is the attribute name and its value is the attribute value.
If the tag contains other tags (for example a <P> tag
containing several <A> tags), they are represented as
items in the same collection too. To distinguish between
sub-tags and HTML attributes the script can use the
IsObject VBScript function (or typeof in JScript): the
attributes are string values, while the sub-tags are
sub-collections, that is, VarDictionary objects.
The values with names beginning with "__" (two
underscores) are special values describing the tag itself. HTML
does not define any attributes with such names, so they will not
interfere with the regular HTML attributes. There is one
important special value for every tag, "__class":
its value contains the tag name (e.g. "P" for the
<P> tag or "TITLE" for the <TITLE> tag).
The ASP page that inspects the model returned by the
parser uses this item to determine what kind of tag is
represented by the current "node". In most cases the script
does not need to walk the entire tree recursively to
find a particular tag. Using the VarDictionary methods FindByValue
and FindByName
it can request a sub-node (at any
depth) from the current node by passing a simple criterion.
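The two rules above (sub-tags are objects, and names starting with "__" describe the tag itself) can be sketched in plain JavaScript. This is only a model under stated assumptions: ordinary arrays of { name, value } items stand in for VarDictionary collections, and the describe helper and the sample <A> sub-tag are hypothetical, not part of the component. In a real page the same test would be made with IsObject or typeof on the items of the collection.

```javascript
// A node is modeled as an ordered list of { name, value } items. This only
// imitates the shape of a collection that is both positional and named;
// it is NOT the VarDictionary API.
const aNode = [
  { name: "__class", value: "A" },
  { name: "HREF", value: "page.html" },
];
const pNode = [
  { name: "__class", value: "P" }, // special value: the tag name
  { name: "ID", value: "A1" },     // HTML attribute: a string
  { name: "", value: aNode },      // sub-tag: a nested collection
];

// Hypothetical helper: classify every item of a node.
function describe(node) {
  const out = { tag: null, attributes: {}, subTags: [] };
  for (const { name, value } of node) {
    if (typeof value === "object") {
      out.subTags.push(value);      // objects are sub-tags
    } else if (name === "__class") {
      out.tag = value;              // the tag name
    } else if (!name.startsWith("__")) {
      out.attributes[name] = value; // plain strings are HTML attributes
    }
  }
  return out;
}

const info = describe(pNode);
console.log(info.tag);                                // P
console.log(Object.keys(info.attributes).join(","));  // ID
console.log(info.subTags.length);                     // 1
```

The "__" check keeps any other special values out of the attribute list, which is the same guarantee the text gives for real HTML attributes.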
Example. Let's use this text:
<HTML><BODY>
<P ID="A1">A Paragraph</P>
</BODY></HTML>
Assume it is already in the src variable:
Set parser = Server.CreateObject("newObjects.utilctls.TextEmbedParser")
parser.AddTag "P",True,False,False,False,"",""
Set model = parser.ParseString(src)
We requested only the <P> elements (see AddTag
for details on its parameters). We specified that the element has a
closing tag (e.g. </P>) and that it may not contain
other <P> tags as sub-elements.
The resulting collection (model) will have the following
items:
- model(1) - VarDictionary collection
model(1)("__class") = "text/plain"
- model(2) - VarDictionary collection
model(2)("__class") = "P"
- model(3) - VarDictionary collection
model(3)("__class") = "text/plain"
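The listing above can be modeled in plain JavaScript to show how a script might walk it. Again this is a sketch under assumptions: arrays of { name, value } items stand in for VarDictionary collections, unnamed nodes are given an empty name, and findNodeByName and getValue are hypothetical helpers, not the component's FindByName signature. The <P> node carries the name "A1" because, as the following paragraph explains, the parser names a node after its ID attribute.

```javascript
// Model of the collection returned for the example document.
// JavaScript arrays are 0-based, so model[1] corresponds to model(2).
const model = [
  { name: "", value: [{ name: "__class", value: "text/plain" }] },
  {
    name: "A1", // assumed to come from ID="A1"
    value: [
      { name: "__class", value: "P" },
      { name: "ID", value: "A1" },
    ],
  },
  { name: "", value: [{ name: "__class", value: "text/plain" }] },
];

// Hypothetical helper: read a named string item from a node.
function getValue(node, name) {
  const item = node.find((i) => i.name === name);
  return item ? item.value : undefined;
}

// Hypothetical depth-first lookup of a sub-node by name. The real
// FindByName method may take different arguments or return another type.
function findNodeByName(node, wanted) {
  for (const { name, value } of node) {
    if (typeof value !== "object") continue; // skip attributes
    if (name === wanted) return value;
    const hit = findNodeByName(value, wanted);
    if (hit) return hit;
  }
  return null;
}

// Walk the top level and collect the class of every node.
const classes = model.map((item) => getValue(item.value, "__class"));
console.log(classes.join(" | ")); // text/plain | P | text/plain

const p = findNodeByName(model, "A1");
console.log(getValue(p, "ID")); // A1
```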
The FindByName method can be used very effectively together
with the ID HTML attribute. By default the parser checks whether
a tag has an ID attribute and, if the attribute is
present, names the node (the VarDictionary collection
representing the tag in the resulting tree) with the value of
this attribute. This follows the usual way HTML
elements are named, so you can use one naming convention for
server-side processing and client-side DHTML scripts: a
tag can be accessed by the same ID on the server side
(through the parser) and on the client side by a script in the
page. Therefore, in the above example model.Key(2) will return
"A1".
What is the representation of the non-parsed content?
There is one kind of node that has no HTML equivalent. Its __class
is "text/plain". It may in fact contain HTML
tags and attributes, but the parser treats them as plain text: if
the page did not use AddTag to request them, it is not interested
in them and does not need to access them. Likewise, the content of
an HTML element that has opening and closing
tags appears as a sub-node of the element's node in
the tree, and the __class of that sub-node is "text/plain".
When modifying the tree, the script can change the role of a
particular node by changing its __class. For example, the script
may want to find a <DIV> element (by ID, say),
replace its content and then strip the enclosing DIV tags in the
regenerated document. To do so the script just sets node("__class")
= "text/plain". Passing the tree to the
GenerateDoc method then places only the content of the node in
the generated document.
Where is the content of the text/plain nodes? They
have an item named __content whose (string) value is the
content. Pay attention to the specific characteristics of the
element when changing its __class to text/plain.
If the element has sub-nodes, changing its __class to text/plain
instructs the parser to skip them when the document is
regenerated! Therefore changing the __class to text/plain is very
effective when a self-closing HTML element is replaced with some
text (for example a tag you define for your own use only, something
like the CUSTOMFIELD tag used by the ASP Compiler), but it is
rarely convenient if the element being
modified is expected to contain sub-tags.
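The effect of switching a node to text/plain can be illustrated with a small plain-JavaScript model. It is a sketch under assumptions, not the component's GenerateDoc: the generate function below is a hypothetical stand-in that always writes a closing tag and quotes every attribute, and getValue, setValue and the sample DIV with ID "box" are likewise made up for the illustration.

```javascript
// Nodes are ordered lists of { name, value } items (a model only).
function getValue(node, name) {
  const item = node.find((i) => i.name === name);
  return item ? item.value : undefined;
}

function setValue(node, name, value) {
  const item = node.find((i) => i.name === name);
  if (item) item.value = value;
  else node.push({ name, value });
}

// Hypothetical regeneration: a text/plain node yields only its __content
// and its sub-nodes are skipped; any other node yields its tag,
// its attributes, and its regenerated sub-nodes.
function generate(node) {
  const cls = getValue(node, "__class");
  if (cls === "text/plain") {
    return getValue(node, "__content") || "";
  }
  let attrs = "";
  let inner = "";
  for (const { name, value } of node) {
    if (typeof value === "object") inner += generate(value);
    else if (!name.startsWith("__")) attrs += ` ${name}="${value}"`;
  }
  return `<${cls}${attrs}>${inner}</${cls}>`;
}

const div = [
  { name: "__class", value: "DIV" },
  { name: "ID", value: "box" },
  {
    name: "",
    value: [
      { name: "__class", value: "text/plain" },
      { name: "__content", value: "old text" },
    ],
  },
];

const before = generate(div);
console.log(before); // <DIV ID="box">old text</DIV>

// Replace the content and strip the enclosing DIV tags.
setValue(div, "__content", "new text");
setValue(div, "__class", "text/plain");
const after = generate(div);
console.log(after); // new text
```

Note that the old sub-node is still in the tree after the change; it is simply ignored, which is why the technique suits elements that are not expected to contain sub-tags.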