IFB104-Print-Your-Own-Periodical

Motivation

Hardcopy periodicals such as newspapers, magazines, newsletters, etc. are all in decline as people increasingly turn to online media. Nonetheless, there is still a need for people to access regularly-updated information in an easy-to-read format. Here you will develop a program that produces a customised periodical in HTML format, using data downloaded from the World-Wide Web. The program will have a Graphical User Interface that allows the user to control production of the periodical, which can then be viewed in a standard web browser. Most importantly, your publication will comprise up-to-date data sourced from online “feeds” that are updated on a regular basis. To complete this request you will need to: (a) download web pages in Python and use regular expressions to extract particular elements from them, (b) create an HTML file containing the extracted elements, and (c) use Tkinter to provide a simple Graphical User Interface.

Illustrative Example

For the purposes of this task you have a totally free choice of what kind of periodical to produce. It could be:

  • a newspaper
  • a current affairs magazine
  • a fashion/lifestyle magazine
  • a newsletter for online gamers
  • a sports journal
  • a science and technology review
  • etc.

However, whatever theme you choose, you must be able to find at least four different online web pages that contain regularly-updated stories or articles in different categories under the overall theme. Each such story must contain a heading, a photograph, some text and a publication date. A good source for such data is Rich Site Summary (RSS) web-feed documents. The appendix below lists some such sites, but you are encouraged to find your own of personal interest.

To demonstrate the idea, we will publish our own newspaper, using data extracted from News Limited’s web site. Our demonstration program allows users to select from several categories: National News, Sports, World News, Business News, Entertainment and Technology. The program then downloads relevant data from the Web and uses it to produce an HTML document which can be read in a standard web browser.

The screenshot below shows our example solution’s GUI when it first starts.

The user is invited to select which categories of information they want included in their newspaper. In this case this is done by selecting check buttons, but other solutions are possible. Below the user has selected four news categories of interest.

When ready the user then presses the button to start “printing” the newspaper (i.e., to create an HTML file containing its contents). The system downloads current data from the web site and generates the file. The user can follow the “printing” process’s progress in the small text window.
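As a rough illustration of the interaction just described, the sketch below builds a window with one check button per category, a button to start “printing”, and a small text window for progress messages. It assumes Python 3 (where the module is named tkinter); the function names, widget layout and category list are our own illustrative choices, not part of the provided template.

```python
CATEGORIES = ["National News", "Sports", "World News", "Business News"]

def chosen_categories(flags):
    """Return the names of the categories whose flag is switched on."""
    return [name for name, flag in flags.items() if flag]

def build_gui(print_action):
    """Build the window; print_action(categories, log) runs when the button is pressed."""
    import tkinter as tk  # imported here so the helper above works without a display

    window = tk.Tk()
    window.title("Print Your Own Periodical")

    # One check button per category, each backed by its own IntVar
    variables = {}
    for name in CATEGORIES:
        variables[name] = tk.IntVar(master=window)
        tk.Checkbutton(window, text=name, variable=variables[name]).pack(anchor="w")

    # Small text window used to report progress
    progress = tk.Text(window, width=40, height=6)

    def log(message):
        """Append a progress message and keep the latest line visible."""
        progress.insert(tk.END, message + "\n")
        progress.see(tk.END)
        window.update_idletasks()

    def on_print():
        """Collect the ticked categories and hand them to print_action."""
        flags = {name: var.get() for name, var in variables.items()}
        print_action(chosen_categories(flags), log)

    tk.Button(window, text="Print", command=on_print).pack()
    progress.pack()
    return window
```

To try it, call something like build_gui(lambda cats, log: log("Printing: " + ", ".join(cats))).mainloop(). Passing the printing step in as a function keeps the GUI code separate from the downloading and file-writing code.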

As well as printing the latest top news items in each of the four categories, the system also generates a “masthead” which identifies the periodical.

Once the file has been created the user can open it in their preferred web browser. Alternatively, pressing the “Read” button in the GUI above will open the file in the host operating system’s default browser.
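One possible way to implement a “Read” button like this is with Python’s standard webbrowser module, sketched below. The function names are hypothetical, Python 3 is assumed, and the default file name is the one given in the requirements later in this document.

```python
import webbrowser
from pathlib import Path

def publication_url(file_name="publication.html"):
    """Return a file:// URL for the generated file in the current folder."""
    return Path(file_name).resolve().as_uri()

def read_publication(file_name="publication.html"):
    """Ask the operating system's default browser to open the generated file."""
    return webbrowser.open(publication_url(file_name))
```

Converting the file name to an absolute file:// URL first tends to behave more consistently across operating systems than passing a bare relative path.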

The generated document contains the masthead and the current top story in each of the selected categories. It is shown overleaf as viewed in the Firefox browser.

Above you can see the masthead with the name of the periodical, The Daily Planet in this case, and an image indicating the nature of its contents. (Our fictional newspaper’s slogan and editor are also shown, but these features are optional.)

Scrolling down in the HTML document shows the current top news item in each of the four selected categories when the program was run. Three of these are shown below, as they were when this demonstration was run. (Some of the images downloaded at this time were small “thumbnails”, hence their blurry appearance when enlarged.)

Notice that each top news item displayed above contains:

  1. the category of story;
  2. the URL where the original data was found;
  3. the story’s title (headline);
  4. a photo illustrating the story;
  5. a short summary of the story; and
  6. the date and time the story appeared online.

Most importantly, items 3 to 6 are all extracted “live” from the online web document indicated. This was done by downloading the document’s source code and using regular expressions to find the elements needed to construct our own version of the story. The first part of the HTML code generated by our Python program is shown below (as displayed in the Firefox browser).

Although not intended for human consumption, the generated HTML code is nonetheless laid out neatly, and with comments indicating the purpose of each part.

To compose our HTML document, Rich Site Summary (RSS) web-feed files are downloaded from the web site. RSS documents are XML files specifically intended to be machine-readable. They have a simple structure that makes it reasonably easy to extract their elements. An example of such a web document as it appears when examined in a web browser is shown below.

This was the source of the data used to produce our National News story shown above. To compose the corresponding page for our newspaper we extracted the latest story’s headline, story text, date and the address of the associated JPEG image. This data was then integrated into our HTML code.
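The extraction step can be sketched roughly as follows. This is only an illustration in Python 3: the sample feed is invented, and it assumes each story sits in an <item> element with <title>, <description> and <pubDate> children and an <enclosure> tag holding the JPEG address. Real feeds differ (some put the image in a different tag), so the patterns will need adjusting for the pages you choose.

```python
import re
from urllib.request import urlopen

def download(url):
    """Return the source of the web document at url as a string."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")

def first_match(pattern, text):
    """Return group 1 of the first match of pattern in text, or None."""
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1).strip() if match else None

def extract_latest_story(rss_source):
    """Find the first <item>, then extract each of its elements separately."""
    item = first_match(r"<item>(.*?)</item>", rss_source) or ""
    return {
        "headline": first_match(r"<title>(.*?)</title>", item),
        "summary": first_match(r"<description>(.*?)</description>", item),
        "date": first_match(r"<pubDate>(.*?)</pubDate>", item),
        "image": first_match(r'<enclosure[^>]*url="([^"]+\.jpe?g)"', item),
    }

# Invented sample standing in for a downloaded feed (not real data)
SAMPLE_FEED = """<rss><channel>
<title>Example feed</title>
<item>
  <title>First headline</title>
  <description>Summary of the first story.</description>
  <pubDate>Mon, 01 Jan 2018 09:00:00 +1000</pubDate>
  <enclosure url="http://example.com/one.jpg" type="image/jpeg"/>
</item>
<item>
  <title>Second headline</title>
  <description>Summary of the second story.</description>
  <pubDate>Sun, 31 Dec 2017 18:30:00 +1000</pubDate>
  <enclosure url="http://example.com/two.jpg" type="image/jpeg"/>
</item>
</channel></rss>"""

story = extract_latest_story(SAMPLE_FEED)
```

In a real run the sample string would be replaced by download(feed_url), where feed_url is the address of your chosen feed. Restricting the search to the first <item> avoids picking up the feed’s own <title> and keeps all four elements from the same story.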

We also discovered that sometimes the downloaded text contained unusual characters that are not handled properly in Python strings, most notably “smart” quotes, so we replaced these with plain characters before “printing” our newspaper.
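A minimal sketch of this kind of clean-up is shown below, assuming Python 3 and that the downloaded text has already been decoded to a string. The function name and the table of replacements are our own; extend the table if your feeds contain other troublesome characters.

```python
# Map each "smart" quote to its plain equivalent
REPLACEMENTS = {
    "\u2018": "'",  # left single quote
    "\u2019": "'",  # right single quote, also used as an apostrophe
    "\u201c": '"',  # left double quote
    "\u201d": '"',  # right double quote
}

def plain_text(text):
    """Return text with every character in REPLACEMENTS swapped for its plain form."""
    for fancy, plain in REPLACEMENTS.items():
        text = text.replace(fancy, plain)
    return text
```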

Requirements and marking guide

To complete this task you are required to develop an application in Python similar to that above, using the provided publisher.py template file as your starting point. Your solution must support at least the following features.

Generating a masthead

Your program must be able to generate an HTML file, publication.html, which begins with a ‘masthead’ identifying the nature of your periodical. When viewed in a web browser, the masthead part of the document must contain at least the following elements:

  • The name of the periodical.
  • An image evocative of the periodical’s theme.

The image must be sourced from online (you cannot attach image files to your solution). Since it will never change, the URL for this particular image can be “hardwired” in your Python code. The HTML source generated by your Python program must be laid out neatly.
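One way to keep the generated HTML source neat is to hold the page layout in a multi-line template string, as in the sketch below. This is an illustration only: the template, the function names and the image address used in the test are hypothetical, and Python 3 is assumed.

```python
# Page skeleton with the masthead; {stories} is filled in with the story HTML
PAGE_TEMPLATE = """<!DOCTYPE html>
<html>
  <head>
    <meta charset="UTF-8">
    <title>{name}</title>
  </head>
  <body>

    <!-- Masthead: name of the periodical and an image for its theme -->
    <h1>{name}</h1>
    <img src="{image_url}" alt="Masthead image">
{stories}
  </body>
</html>
"""

def publication_html(name, image_url, stories=""):
    """Return the whole document as a string, masthead first."""
    return PAGE_TEMPLATE.format(name=name, image_url=image_url, stories=stories)

def write_publication(html, file_name="publication.html"):
    """Write the generated HTML to the output file."""
    with open(file_name, "w", encoding="utf-8") as output:
        output.write(html)
```

Because the indentation and comments live in the template, whatever the program writes to publication.html is already laid out readably.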

Generating four stories

Your Python program must be capable of generating at least four distinct “stories” as part of your periodical. Each such story must be derived from a different online web page, and must represent the latest story in a particular category at the time when the program runs. When viewed in a web browser, each story must contain at least the following elements:

  • the category of story,
  • the URL where the original data was found,
  • the story’s title (headline),
  • an image illustrating the story,
  • a short summary of the story, and
  • the date and time the story appeared online.

The last four of these items must all be extracted from the online document and must all belong together (i.e., you can’t have an image from one story and the headline from another). Each of the elements must be extracted from the original document separately. It is not acceptable to simply copy large chunks of the original document’s source code. The HTML source code generated by your Python program must be laid out neatly.

The precise visual layout, colour and style of the story elements is up to you and is determined by the design of your generated HTML code. The periodical must be easy to read. No HTML markup tags or other odd characters should appear in any of the text displayed to the user.
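Feed text often arrives with embedded markup or character entities, so some clean-up is usually needed before display. The sketch below (Python 3, hypothetical function name) decodes entities, removes anything that looks like a tag and tidies the spacing. Note that it uses the standard html module, which may or may not be among the modules imported by the provided template; the tag-removal step on its own needs only re.

```python
import html
import re

def displayable(text):
    """Return text with entities decoded, markup tags removed and spacing tidied."""
    text = html.unescape(text)           # e.g. &amp; becomes &
    text = re.sub(r"<[^>]+>", "", text)  # drop anything that looks like a tag
    return re.sub(r"\s+", " ", text).strip()
```

For example, displayable("<p>Fish &amp; <b>chips</b></p>") gives "Fish & chips".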

Data on the web changes frequently, so your solution must continue to work even after the web documents you use have been updated. For this reason it is unacceptable to “hardwire” your solution to the particular text and images appearing on the web on a particular day. Instead you will need to use text searching functions and regular expressions to actively find the text and images in the document, regardless of any updates that may have occurred since you wrote your program.

Code quality and presentation

Your Python program code must be presented in a professional manner. See the coding guidelines in the IFB104 Code Presentation Guide (on Blackboard under Assessment) for suggestions on how to achieve this. In particular, each significant code segment must be clearly commented to say what it does, e.g., “Create the masthead”, “Extract the first headline from the web page’s source code”, etc.

Extra feature

Part B of this request will require you to make a ‘last-minute extension’ to your solution. The instructions for Part B will not be released until just before the final deadline for request 2.

You can add other features if you wish, as long as you meet these basic requirements. For instance, in our example above we included a button in the GUI which opened the generated HTML document in the default web browser. We also supported more than four story categories.

You must complete the task using only basic Python features and the modules already imported into the provided template. In particular, you may not import any local image files. All displayed images and story text must be downloaded from online sources each time your program is run.

However, your solution is not required to follow our example above precisely. Instead you are strongly encouraged to be creative in your choice of stories to display, the design of your Graphical User Interface, and the design of your periodical.