Python Assignment Writing: COMPSCI105-HTML-Tag-Checker

Requirement

A markup language is a language that annotates text so that the computer can manipulate the text. Most markup languages are human readable because the annotations are written in a way to distinguish them from the text. The most important feature of a markup language is that the tags it uses to indicate annotations should be easy to distinguish from the document content.

One of the most well-known markup languages is the one commonly used to create web pages, called HTML, or “Hypertext Markup Language”. In HTML, tags appear in “angle brackets”. When you load a Web page in your browser, you do not see the tags themselves: the browser interprets the tags as instructions on how to format the text for display.

Most tags in HTML are used in pairs to indicate where an effect starts and ends. For example:

<p>
this is a paragraph of text written in HTML
</p>

Here the tag <p> marks the start of a paragraph, and the tag </p> indicates where that paragraph ends.

Other tags include <b>, which places the enclosed text in bold font, and <i>, which indicates that the enclosed text is italic.
Note that “end” tags look just like the “start” tags, except for the addition of a forward slash ‘/’ after the “<” symbol.

Sets of tags are often nested inside other sets of tags. For example, an ordered list is a list of numbered bullets.

You specify the start of an ordered list with the tag <ol>, and the end with </ol>. Within the ordered list, you identify items to be numbered with the tags <li> (for “list item”) and </li>. For example, the following specification:

<ol>
<li>First item</li>
<li>Second item</li>
<li>Third item</li>
</ol>

would result in the following:

  1. First item
  2. Second item
  3. Third item

Notice how you start the ordered list with the <ol> tag, specify three list items with matching <li> and </li> tags, and then close the ordered list with the </ol> tag.

You may have noticed that the pattern of using matching tags strongly resembles the pattern of matching parentheses that we discussed in class: when you use parentheses, brackets, and braces, they have to match in reverse order, such as “{[()]}”. A pattern such as “[(])” would be incorrect since the right bracket does not match the left parenthesis. Similarly, an HTML pattern such as <ol> <li> </ol> </li> would be incorrect since the closing tags are in the wrong order.
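As a reminder of the bracket-matching idea, here is a minimal sketch. The function name is_balanced is hypothetical, and a plain Python list stands in for the stack; the version discussed in class may differ.

```python
def is_balanced(text):
    """Return True if every bracket in text is closed in reverse order."""
    pairs = {")": "(", "]": "[", "}": "{"}  # closing bracket -> expected opener
    stack = []                              # plain list used as a stack (assumption)
    for ch in text:
        if ch in "([{":
            stack.append(ch)                # opener: wait for its match
        elif ch in pairs:
            # closer: the most recent opener must be its partner
            if not stack or stack.pop() != pairs[ch]:
                return False
    return not stack                        # leftover openers mean unbalanced


print(is_balanced("{[()]}"))  # True
print(is_balanced("[(])"))    # False
```

The same "most recent opener must close first" rule is what the HTML checker applies to tags instead of single characters.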

The aim of this question is to write an “HTML Checker” program that takes as input an HTML file, and produces a report indicating whether or not the tags are correctly matched.

Just as the parenthesis checker uses a stack to store symbols waiting for a match to be found, your program should also use a stack. You should include the implementation of the Stack ADT discussed in class.
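For reference, a list-based Stack might look like the sketch below. This is an assumption about the ADT, not the version from class: the method names (push, pop, peek, is_empty, size) follow a common textbook layout, and __str__ is an addition so the stack can be printed in the report.

```python
class Stack:
    """A minimal list-based stack; the top of the stack is the end of the list."""

    def __init__(self):
        self.items = []

    def is_empty(self):
        return len(self.items) == 0

    def push(self, item):
        self.items.append(item)

    def pop(self):
        return self.items.pop()      # raises IndexError if the stack is empty

    def peek(self):
        return self.items[-1]        # raises IndexError if the stack is empty

    def size(self):
        return len(self.items)

    def __str__(self):
        # Added for display only: shows the contents bottom-to-top, e.g. [html, body, b]
        return "[" + ", ".join(str(item) for item in self.items) + "]"


s = Stack()
for tag in ["html", "body", "b"]:
    s.push(tag)
print(s)  # [html, body, b]
```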

Input: As input for your program, the sample test files (test1.html, test2.html, test3.html, test4.html, test5.html) can be downloaded from the course website. You can open the test files with a text editor, e.g. Notepad++. The test files cover different scenarios: test1.html and test2.html have balanced tags, whereas the rest of the test files have unbalanced tags.

Processing the input file

The first task your program must do is read in an HTML file and extract the tags. A simple strategy for doing this would be to write a function “getTags” that:

  • reads one character at a time from the data file, throwing everything away until it gets to a “<”. (Discard the “<” as well.)
  • reads one character at a time, appending it to a string, until it gets to a “>” or whitespace. (Discard the “>” as well.)
  • appends the tag to a list.
  • repeats the steps above until the end of the file, then returns the list of tags found.

Make sure you account for end-of-file conditions in getTags. If you have completed everything correctly, invoking getTags now gives you a list of all tags, both start and end tags.
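The strategy above can be sketched as follows. This is one possible reading, not a reference solution: it assumes getTags receives an already opened text stream (the assignment does not fix the parameter), and it drops a tag that is cut off by the end of the file.

```python
import io


def getTags(stream):
    """Return the tags found in a text stream, in order of appearance."""
    tags = []
    while True:
        # Throw characters away until a "<" is found (the "<" is discarded too).
        ch = stream.read(1)
        while ch and ch != "<":
            ch = stream.read(1)
        if not ch:                   # end of file: no more tags
            return tags
        # Build the tag until ">" or whitespace is reached.
        tag = ""
        ch = stream.read(1)
        while ch and ch != ">" and not ch.isspace():
            tag += ch
            ch = stream.read(1)
        if not ch:                   # end of file inside a tag: drop it (assumption)
            return tags
        if tag:
            tags.append(tag)


sample = io.StringIO('<html><body><p class="x">Hi<br></p></body></html>')
print(getTags(sample))
```

For a real file, the stream would come from something like `open("test1.html")`. Because reading stops at whitespace, attributes such as `class="x"` are skipped along with the ordinary text.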

HTML Tag Checker

Write a function called “checkTags” that iterates through your list of tags, looking for matches.

  • If there is a mismatch of beginning and ending tags, print an error message (see output section below) and terminate.
  • If, after processing the whole list, there is no mismatch, print a confirmation message (see output section below).
  • If, at the end of the list, there are tags remaining on the stack, print a message saying so (see output section below) along with the remaining tags on the stack.

In addition, have your program build a list called “VALIDTAGS”. As you iterate through your list of tags, check to see if the tag appears in VALIDTAGS. If it doesn’t, add it to VALIDTAGS and print a confirmation message (see output section below).
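A rough sketch of checkTags is shown below. Several details are assumptions: a plain list stands in for the Stack ADT, only start tags are recorded in VALIDTAGS, the function returns True or False instead of terminating the program, and the wording of the error and final messages is a placeholder because the assignment does not spell it out. The unpaired tags described under The Twist are not handled here.

```python
VALIDTAGS = []  # grows as new tags are encountered


def show(stack):
    """Format the stack contents bottom-to-top, e.g. [html, body, b]."""
    return "[" + ", ".join(stack) + "]"


def checkTags(tags):
    stack = []  # plain list standing in for the Stack ADT (assumption)
    for tag in tags:
        if not tag.startswith("/"):
            # Start tag: record it if new, then push it.
            if tag not in VALIDTAGS:
                VALIDTAGS.append(tag)
                print("New tag", tag, "found and added to list of valid tags")
            stack.append(tag)
            print("Tag", tag, "pushed: stack is now", show(stack))
        elif stack and stack[-1] == tag[1:]:
            # End tag that matches the most recent start tag.
            stack.pop()
            print("Tag", tag, "matches top of stack: stack is now", show(stack))
        else:
            # Placeholder wording for the mismatch message.
            print("Error: tag", tag, "does not match top of stack")
            return False
    if stack:
        # Placeholder wording for the leftover-tags message.
        print("Processing complete, unmatched tags remain on stack:", show(stack))
        return False
    # Placeholder wording for the confirmation message.
    print("Processing complete, no mismatches found")
    return True


checkTags(["html", "body", "b", "/b", "/body", "/html"])
```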

Output

The output of your program should include the following:

  • A printout of your list of tags (the result of getTags).
  • One line for each tag as you process it, explaining the action and showing the current contents of the stack. You may have to modify your ADT to allow for the information to be displayed properly. Some examples are:
    Tag b pushed: stack is now [html, body, b]
    Tag /b matches top of stack: stack is now [html, body]
    Tag ul pushed: stack is now [html, body, ul]

  • A message every time you add a tag to VALIDTAGS. For example:
    New tag XXX found and added to list of valid tags

The Twist

There are some tags that do not need matching start and end tags! One example is br. This tag is used to indicate a line break at the current location. Another is meta, which is used to provide special information (“metadata”) about a webpage, and one more (left for you to identify in your data files).

If you followed the instructions above correctly, your HTML checker will notice that there are three tags that don’t have a match. Teach your program that this is okay for these three cases by maintaining a list called EXCEPTIONS which you hard-code into your main program. They will appear in your list of tags just as any other tags. However, when you begin your iteration through the list and you encounter one of these, you do not need to push it on the stack since you won’t be waiting for a close tag. Instead, just print an output line such as:

Tag br does not need to match: stack is still [html, body, b]

and continue.
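A minimal sketch of that check for a single start tag is given below. The helper names process_start_tag and show are hypothetical, a plain list stands in for the stack, and EXCEPTIONS deliberately lists only the two tags named above; the third is left for you to add once you have found it in the data files.

```python
EXCEPTIONS = ["br", "meta"]  # add the third tag once identified in the data files


def show(stack):
    """Format the stack contents bottom-to-top, e.g. [html, body, b]."""
    return "[" + ", ".join(stack) + "]"


def process_start_tag(tag, stack):
    """Handle one start tag and return the report line for it."""
    if tag in EXCEPTIONS:
        # No closing tag is expected, so the stack is left untouched.
        return "Tag " + tag + " does not need to match: stack is still " + show(stack)
    stack.append(tag)
    return "Tag " + tag + " pushed: stack is now " + show(stack)


stack = ["html", "body", "b"]
print(process_start_tag("br", stack))
# Tag br does not need to match: stack is still [html, body, b]
```

Checking EXCEPTIONS before pushing keeps the rest of the matching logic unchanged, since these tags never reach the stack.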