(0) Have a look at the readings for this week.  Pay attention especially to the Ralph Grishman paper on information extraction.  The point of this week's exercise is to examine how the form of a news story (its script or schema) can vary from one culture to the next.  We will examine this variation by thinking about how one might build an information extraction system for a given story type.

(1) Pick a news topic that is pertinent to international news.  Many different topics were listed by Gans and van Dijk in last week's readings.  You might want to use one of them, or you might want to pick one of your own.  For example, this week might be the week to pick a topic about the Olympics: e.g., the skating competition, or the number of medals a country is expected to win, etc.  Now, go to a domestic paper online (e.g., www.nytimes.com) and find a few stories on this topic.  Next, go to a foreign newspaper online and find a few stories on the same topic.  Many foreign newspapers are listed here: directory.google.com/Top/News/Newspapers/Regional/.

(2) Define a schema or script for domestic news stories on the topic of your choice.  You might want to use a more informal means to do this -- like the tree diagram notation employed by van Dijk; or you might want to define an XML DTD; or you might want to use a Lisp expression like those in the Schank et al. readings.  Go through the domestic stories and determine the important elements and attributes of the stories.  What are the main events reported in these stories?  What are the roles? (E.g., most Olympic stories have winners.)  In what order are the events of the story narrated?
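
If you would rather write your schema down in code than as a tree diagram, DTD, or Lisp expression, the following Python sketch shows one possible shape.  It is only an illustration: the story type (an Olympic event result), the class names, and every field name are assumptions made up for this example, not something taken from the readings.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Competitor:
    """A role in the story: someone who takes part in the event."""
    name: str
    country: str


@dataclass
class OlympicResultStory:
    """Hypothetical schema for a story reporting the result of an Olympic event."""
    event: str                                    # e.g. "women's figure skating"
    winner: Optional[Competitor] = None           # the "winner" role
    other_competitors: List[Competitor] = field(default_factory=list)
    # Order in which the story narrates its parts, e.g. ["result", "background"].
    narrative_order: List[str] = field(default_factory=list)

    def unfilled_roles(self) -> List[str]:
        """Return the names of roles that have not been filled in yet."""
        missing = []
        if self.winner is None:
            missing.append("winner")
        if not self.other_competitors:
            missing.append("other_competitors")
        return missing
```

Filling in one such object per story, and noting which roles stay empty, is one way to check whether the schema really fits the stories you collected.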

(3) Go back to the stories and mark them up.  Use the script you developed to label different pieces of the stories.  An easy way to do this might simply be to print out a copy of a story, mark it up with some colored pencils, then scan it back in so that we can have a look in class.

(4) Repeat (2) and (3) for the foreign newspaper.

(5) Think about the implied information extraction process: Review the Ralph Grishman paper on information extraction.  You might also look at my lecture on this topic if this is a new area for you: IS202 Lecture PowerPoint Slides.  Think about the kinds of decisions you had to make in order to mark up the news stories.  Think about each of the six different levels of analysis that Grishman names in his paper on information extraction (shown in Figure 2 of the paper): lexical analysis, name recognition, partial syntactic analysis, scenario matching, coreference analysis, and inference (your "template" is basically your script or schema).  What kinds of decisions did you have to make, at each of these levels, in order to mark up the news stories?  Would any of the steps you took in doing so be difficult to implement in a computer program?  Write your thoughts down and/or take a couple of notes so that we can discuss these issues in class.
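
To make the question about implementation more concrete, here is a toy Python sketch of three of the levels named above: lexical analysis, name recognition, and scenario matching.  It is not Grishman's system or any real library; the name list, the pattern, the function names, and the example athlete "Anna Berg" are all invented for illustration, and partial syntactic analysis, coreference analysis, and inference are left out entirely.

```python
import re

# Hypothetical gazetteer; a real system would need a far larger name list.
KNOWN_NAMES = {"Anna Berg": "PERSON", "Norway": "COUNTRY"}

# Hypothetical scenario pattern: "<PERSON> won (the) <event>".
WIN_PATTERN = re.compile(
    r"<PERSON:(?P<winner>[^>]+)> won (?:the )?(?P<event>[a-z' ]+)"
)


def tokenize(text):
    """Lexical analysis: split text into word and punctuation tokens."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)


def tag_names(text):
    """Name recognition: wrap each known name in a <TYPE:name> marker."""
    for name, kind in KNOWN_NAMES.items():
        text = text.replace(name, f"<{kind}:{name}>")
    return text


def match_scenario(tagged):
    """Scenario matching: fill the template slots, or return None."""
    m = WIN_PATTERN.search(tagged)
    if m is None:
        return None
    return {"winner": m.group("winner"), "event": m.group("event").strip()}


sentence = "Anna Berg won the women's slalom."
print(tokenize(sentence))
print(match_scenario(tag_names(sentence)))
```

Even this small sketch is brittle: a sentence such as "Anna Berg of Norway won the women's slalom." no longer matches the pattern, and a later "she" referring to the winner is not resolved at all.  Those are the kinds of difficulties worth noting for class.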

(6) Now comment on the differences you see between the schema you developed for the domestic news stories and the schema you developed for the foreign news stories.  Are there differences?  Maybe the schemas are exactly the same and the difference is simply in the people who are the focus (e.g., the Americans focus on the American Olympic competitors and the Norwegians focus on the Norwegians)?
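
If you wrote each schema as an ordered list of element names, a first pass at the comparison can be made mechanically.  The Python sketch below is one assumed way to do that; the function name and the element names in the usage example are hypothetical, and a mechanical comparison like this will not show differences in focus, such as which country's athletes fill the roles.

```python
def compare_schemas(domestic, foreign):
    """Compare two schemas given as ordered lists of element names."""
    shared = [e for e in domestic if e in foreign]
    return {
        "shared": shared,
        "only_domestic": [e for e in domestic if e not in foreign],
        "only_foreign": [e for e in foreign if e not in domestic],
        # True when the shared elements are narrated in the same relative order.
        "same_order": shared == [e for e in foreign if e in domestic],
    }


# Hypothetical element names, listed in the order the stories narrate them.
domestic = ["result", "winner", "medal_count", "background"]
foreign = ["winner", "result", "background", "judging_dispute"]
print(compare_schemas(domestic, foreign))
```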

(7) Upload your two schemas, your analyzed news stories, and your comments on how the news schema changes from one culture to the next.

Extra resources concerning Lisp and the Schank et al. work can be found on the course's resources page.