News Delivery Architecture

News is an essential resource for investors in both institutional and retail marketplaces. There is a common feeling among investors that while quotes and trades are important as a reflection of the current and past state of the markets, it is News that "makes the market". Consequently, Dataworks Enterprise, as a platform for the publication of financial data, must have an appropriate solution for the delivery of news information.

The Dataworks Enterprise system has a News Ingestion system known as News Real, which is currently in use in our retail business. This review was undertaken to see how this architecture could be adapted to better support the institutional marketplace. It has also afforded the opportunity to address some additional issues that have arisen internally, such as the requirement for a permission system.

The changes to News Real are significant enough to warrant a re-evaluation of the use of News Content in each of our primary marketplaces (Part 2) and a review of the existing News system and how it attempts to achieve the functionality required (Appendix A). In the third part of this document, we introduce revisions to the News Real system. This new system would replace the existing retail system and add the enhanced functionality we would seek to supply to the institutional marketplace.

News Content

What is News?

News items currently consist of a News Headline and a News Story. Each news item has:

In addition, this material may have abstracts, possibly automatically generated.

As stated above, News data is much richer than we currently account for. At present, we are only interested in textual stories. However, stories or research could be (or contain) charts, pictures and other non-textual content. This applies less to real-time financial news than to research. Real-time headlines tend to be delivered as quickly as possible and so usually contain little more than the text "dashed off" by the reporter. We must maintain this performance.

How is News Created?

A story is a news item, filed by journalists and made available to clients in one or more parts. All parts of a story have a common identifier, the STORY_ID. The first part of a story may be an alert. This is a brief item containing the most essential information relating to that emerging story. Sometimes several alerts for the same story are filed in quick succession. Alerts are delivered as a headline, marked as a News Alert. Consequently, the client simply receives a stream of headlines. Alerts may be followed up by a headline and the first piece of text for the story. This text and its associated headline are called a "take". Subsequent takes may also be filed containing additional text as needed. As described above, a story also has attached to it meta-data in the form of category codes. These are transmitted with alerts and headlines.

The complete news story therefore consists of any alerts, the headline, the story text and the category codes. Each alert or take of the same story also contains two time stamps: the story date and time and the take date and time. The story date and time is the time (GMT) that the first alert or take for that story was filed and is the same for all alerts and takes of that story. The take date and time is the time that particular alert or take was filed.
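The relationship between a story, its alerts and takes, and the two time stamps can be sketched as a small data structure. This is only an illustration: the class and attribute names (Part, Story, take_time, story_time) are hypothetical and do not correspond to any existing Dataworks Enterprise type.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import List


@dataclass
class Part:
    """One alert or take of a story (hypothetical structure)."""
    story_id: str
    headline: str
    take_time: datetime       # time (GMT) this particular alert or take was filed
    is_alert: bool = False


@dataclass
class Story:
    story_id: str
    parts: List[Part] = field(default_factory=list)

    def add(self, part: Part) -> None:
        # All parts of a story share the same STORY_ID.
        if part.story_id != self.story_id:
            raise ValueError("part belongs to a different story")
        self.parts.append(part)

    @property
    def story_time(self) -> datetime:
        # The story time is the filing time of the first alert or take,
        # so it is the same for every part of the story.
        return min(p.take_time for p in self.parts)


filed = datetime(2001, 5, 1, 9, 0, tzinfo=timezone.utc)
story = Story("S1")
story.add(Part("S1", "ACME PROFITS UP 10 PCT", filed, is_alert=True))
story.add(Part("S1", "Acme reports higher profits", filed + timedelta(minutes=4)))
```

Here the second part has a later take time, but the story time remains that of the initial alert.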

How is News Delivered?

News is delivered to us through two primary mechanisms: Wires and FTP push.

Newsvendors traditionally deliver news as "news wires". These began life as the well understood ticker feeds and evolved to the use of serial communications. News wires usually employ proprietary formats and may be either broadcast (the norm) or request-response. News wires are typically employed where the delivery of news is considered time critical. An example of a News Feed is the Dow Jones newsfeed.

Similarly, news is also transferred using FTP push. In this scenario, the news consumer opens a public (secured) FTP site that they make accessible to the vendor. The vendor then employs FTP (or possibly some other mechanism) to transfer files containing news stories. Again, the format of the news is either proprietary or uses one of the transfer formats described below. This type of mechanism is usually used for less time critical delivery and usually happens on some schedule (e.g. four times a day).

There are three standards used in the delivery of News content using FTP push. These standards are defined by the IPTC, an international consortium of news publishers and vendors. The standards are known as NITF, XML-NITF and NewsML.

NITF (News Industry Text Format) is an SGML-based news transfer format. It allows for the definition of headlines, categories, authors and other details of the news stories. It also allows for richer content including the provision of pictures. Most IPTC members use NITF in one form or another. Dataworks Enterprise currently ingests SGML-NITF for the German news agency, DPA. XML-NITF updates the original NITF standard, allowing for easier ingest using standard XML parsers. DPA will be moving to XML-NITF later this year.

NewsML is the newest of the standard IPTC formats. It began life as a replacement for the NITF standard to allow richer content to be transferred in standard XML formats. In its original form, it was known as XML-News. NewsML provides a complex schema allowing for the complete mark-up of all aspects of news stories. Although IPTC members have formally adopted NewsML, the standard is very new and, consequently, not used by many newsvendors at present.

News Data Formats

Textual data is delivered in a wide range of formats, some more susceptible to conversion than others. For instance, news stories might be in SGML, HTML, XML, RTF, ASCII or some other format. It is not always possible to look at the data and determine very much about the intended formatting. This becomes a significant problem when dealing with tables and other ASCII formatting.

Data vendors and distributors such as Thomson integrate these news wires with their data systems allowing end users access to the news. To this end, these vendors normalize the data into some internal proprietary format for ingestion into their systems. Similarly, Dataworks Enterprise must also be responsible for some data normalisation.

The normalization is typically applied to:

The degree to which normalization is carried out depends on the vendor or distributor. For instance, some vendors are in control of the inception and creation of news (they are primarily news vendors). Consequently, Reuters feeds have normalized story formats (text and tab-text), category codes (E for Equity, M for Money) and company codes (RICs). On the other hand, Thomson Financial PGE ingests News from a variety of sources. PGE normalizes company codes, but not category codes. News stories in PGE are delivered in RTF (Delphi variant). Although the feed has the capability to deliver other formats, the workstation cannot currently process those formats.

Non-textual data is typically limited to still pictures, which are delivered using the whole gamut of picture formats including JPEG, GIF, PNG, TIFF, etc.

News vs. Research

As an aside, it is interesting to note that News and Analyst Reports can be very similar in many respects. Both are delivered using file downloads; both types of system define symbology for meta-data including company symbols, category codes, sectors, etc.

One area where News and Research are fundamentally different is in the numbers and rates of update of stories and, consequently, the period of time that a story is held for. Research information is updated infrequently and, typically, kept indefinitely. News, on the other hand, consists of thousands of stories per day and is, consequently, deleted after a set period. Indeed, there may be contractual requirements that mean that we are not allowed to hold stories for too long.

Analyst reports might be ingested into Dataworks Enterprise as part of a centralized system similar to those created by TFI or might form part of an internal ingest and publish system for a large institutional investor. Our architecture should take into account the similarity of research and news.

User Groups

As far as consumers are concerned, we have divided the potential user groups into four basic categories:

Web Servers ("Hit and Run").
In this scenario, a web page requests a set of headlines using categories and company names and retrieves the resulting list of headlines. There is no further interaction between the client and the server.
Standard Trader Desktops (Interest-Based).
In this scenario, the user requests headlines that match a filter and expects to be delivered historical headlines (usually page by page) followed by all news stories that match the criteria.
Web-Based Users on Streaming Data Systems (Interest-Based).
This is a variant of the previous where the delivery system is a streaming module to active web content.
Bulk Delivery (Replication)
In this scenario, the News system is simply used to replicate News Databases from one site to another.

There is a considerable crossover between these groups (particularly those that are interest based). We would like to build a system where a set of basic Dataworks Enterprise software components can be used to support each of these user groups.

Although we have considered the user groups, this document does not attempt to define the software systems used to present data to the user except in the most general manner. The purpose of Dataworks Enterprise news handling is to provide data to the application, not to prescribe how that data is displayed. We know that some clients will be generating HTML whilst others will have some form of custom news display/viewer. The news architecture should be equally capable of supporting both.

3. News System

News Architecture

The news system consists of a hierarchy of news storage servers. The News Servers in the hierarchy would be the same software with differences of configuration. Each server will store both stories and headlines simplifying the configuration.

Top-level server(s), which may be located centrally or at a customer site, will keep long-term history as a reference. Top-level servers will only be able to recover history from:

An upstream News Server is used to satisfy "replication" requests from downstream servers and to back fill queries where the downstream device has insufficient headlines to satisfy the query. The downstream News Server would be used to offload request traffic from the upstream ones. Downstream devices only cache headlines that are more recent. Each level would have a maximum cache capacity (number of headlines/size of cache) for reasons of reliability.

At the lowest level, News Servers satisfy requests from clients (Web Servers and Interest-based clients). For interest-based clients, it is the responsibility of the client to match incoming new headlines with the results of a query. To facilitate this, all headlines have a timestamp as part of the news creation process.

At start up, the top-level servers will restore headlines from disk and recover connections to feeds. When a downstream server connects to an upstream one, it clears its cache and requests history from the upstream device up to the capacity of its local cache. Lower-level servers keep smaller caches but have more hits; higher-level servers have larger caches and fewer hits. Top-level news servers may be organised in a "figure of eight" configuration for recovery purposes.

Headline messages from sources are actually commands that indicate whether to add or delete a headline, or indicate corrections, new takes, expiry, etc. Expiry is ignored by the top-level server and never propagated from the source (although it may send its own expiry times to downstream devices). Drops are applied locally and passed on.

On each new/corrected headline message, each news server checks its cached headlines to work out whether the story is to be added (e.g. it is in capacity range). If the headline is to be added locally, the server does so, possibly dropping an older headline. If the news server is a top-level server, it stores the headline on disk and requests the underlying News Story, which is also written to disk. Each news server broadcasts each new headline message to all its clients to allow them to stay in sync. This is similar to what News Server currently does.
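The capacity check described above can be sketched as a bounded cache. This is a minimal illustration under stated assumptions: capacity is counted in headlines, headlines are keyed by a story identifier, and the names HeadlineCache and offer are hypothetical rather than part of the existing News Server.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Headline:
    story_id: str
    story_time: float   # seconds since the epoch, UTC
    text: str


class HeadlineCache:
    """Keeps the most recent `capacity` headlines (hypothetical sketch)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._items: Dict[str, Headline] = {}

    def offer(self, h: Headline) -> bool:
        """Return True if the headline was stored locally."""
        if h.story_id in self._items:
            self._items[h.story_id] = h      # correction or new take replaces the entry
            return True
        if len(self._items) < self.capacity:
            self._items[h.story_id] = h
            return True
        oldest = min(self._items.values(), key=lambda x: x.story_time)
        if h.story_time <= oldest.story_time:
            return False                     # outside capacity range: not cached here
        del self._items[oldest.story_id]     # drop an older headline to make room
        self._items[h.story_id] = h
        return True

    def headlines(self) -> List[Headline]:
        # Most recent first.
        return sorted(self._items.values(), key=lambda x: x.story_time, reverse=True)


cache = HeadlineCache(2)
stored = [
    cache.offer(Headline("a", 1.0, "A")),
    cache.offer(Headline("b", 2.0, "B")),
    cache.offer(Headline("c", 3.0, "C")),   # evicts "a"
    cache.offer(Headline("d", 0.5, "D")),   # too old for this cache
]
```

Whether or not a server stores the headline locally, it would still broadcast the message to its clients.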

All news servers support a common request for history. This is in the form of a query (which defaults to all headlines in a short period of time). The news server streams down a fixed number of responses and then terminates the request with a next pointer (which can be used to get the next set of responses, provided that the next request is issued prior to releasing the previous request). Note that this differs from the current system, which returns responses as an array; streaming allows the permissions system to operate on responses to queries in the same way as headline broadcasts. However, it does impose the use of a helper object for Web development (see below).

When a news server receives a request, it attempts to satisfy the request from the local repository until the amount of data locally cached from an upstream news server is reached. The News server then combines data from an upstream request (eliminating duplicates) with local data.

News clients use the News Server query language to retrieve a set of headlines they display. The client then applies the search criteria to incoming headlines filtering out those they do not need. Some specific matches would not be possible, e.g. full text search on a story.
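Client-side matching of incoming headlines might look like the following sketch. It assumes that the criteria are sets of company identifiers and topic codes, that the CO_IDS and TOPIC_CODE fields arrive as lists, and that an empty set means "no restriction". None of these representations is defined by this document.

```python
from typing import Dict, Iterable, List, Set


def matches(headline: Dict, companies: Set[str], topics: Set[str]) -> bool:
    """Apply to a live headline the same criteria that were used in the query."""
    co_ok = not companies or bool(companies & set(headline.get("CO_IDS", [])))
    topic_ok = not topics or bool(topics & set(headline.get("TOPIC_CODE", [])))
    return co_ok and topic_ok


def filter_stream(stream: Iterable[Dict], companies: Set[str], topics: Set[str]) -> List[Dict]:
    # Keep only the broadcast headlines the client is interested in.
    return [h for h in stream if matches(h, companies, topics)]


incoming = [
    {"BCAST_TEXT": "Acme profits rise", "CO_IDS": ["ACME"], "TOPIC_CODE": ["E"]},
    {"BCAST_TEXT": "Rates held", "CO_IDS": [], "TOPIC_CODE": ["M"]},
    {"BCAST_TEXT": "Acme bond issue", "CO_IDS": ["ACME"], "TOPIC_CODE": ["M"]},
]
kept = filter_stream(incoming, companies={"ACME"}, topics={"E"})
```

A criterion such as full text search cannot be reproduced this way, because the client only sees the headline record and not the story body.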

News Sub-Systems

The basic elements of the news system are:

Ingest

A standard ingest mechanism that can be employed for any current news delivery system, albeit that some richer content may not be available. This would be based on the existing ingest mechanism, so that we do not invalidate the work we have already put in.

Normalisation

Some data normalisation is required. This, typically, revolves around the story format, the category coding and the use of company identifiers.

Story Repository

A key element of the system is the story repository. This should be capable of handling very large numbers of documents (possibly millions). The only limitation to the number of stories held should be the amount of physical storage space and the ability of the system to utilize that space. Specifically, the system should have the ability to cache up to at least 390 days of history.

Indexing

An indexing system is required capable of indexing and cataloging a large online repository of news information. The indexes should be suitable for the types of query we wish to perform. Indexes should be created on:

Search and Retrieve

A query system needs to be provided that is capable of using the index system above to search and retrieve history of news headlines.

Searches should automatically filter out what is not required or not permissioned. The search results would be cursored in sets, allowing the downstream system to request the next and previous set.

Each matching result would include:

We would expect the results to either be ordered by relevance or reverse order of date/time. The search results would contain a reference to allow the downstream system or user to download (preferably streaming style) the body of the item.

Search queries should be able to specify an ID from which the search should take place to allow back fills and to ensure that no data is missed.

Search results should be delivered as an array as at present to allow Web services to continue to issue requests in blocking mode. However, this will have a significant impact on the complexity of the permission system, since it will have to be able to handle the request result arrays. In addition, an option should be placed in the permission system to allow the user to "hint" permissions to the News Server, to improve downstream traffic usage.

Streaming Updates

The news system would employ a consolidated update stream for new headlines applied to the index.

Delivery Architecture and Replication

We need to design for appropriate (and available) delivery systems. Therefore, it must be possible to broadcast as well as query point-to-point for the news and analytical data. This implies some local storage and caching built into the system and the ability to store news on the customer site.

News Requests

There are three basic types of request for news: headline requests, search requests and story requests.

Headline Requests

This is the request by which a handler or upstream News Server delivers the consolidated broadcast stream of headlines. The request has the instrument name HEADLINE. Headlines are delivered from the moment the request is issued. No history is provided on this request. Optional data is not required.

Search Requests

All search requests begin with "?". The default request is simply "?", which means "give me the most recent set of headlines". More sophisticated requests can be issued by adding additional information to the search request. The syntax of search requests is that of the search engine's query language.

Recovery Requests

This is the request by which a handler or upstream News Server delivers the complete content of history page by page. This request is configurable in the News Server for upstream requests for history. An upstream News Server employs the request HISTORY for this purpose. The downstream device stops iterating the chain when sufficient history to populate its local cache has been retrieved (or when upstream history is exhausted). A downstream News Server issues this request whenever the upstream device comes online in order to recover from failure.
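The recovery loop can be sketched as follows. The fetch callable is a stand-in for issuing a request upstream and is purely hypothetical; it is assumed to return one page of headlines together with the name of the next request in the chain, or None when upstream history is exhausted.

```python
from typing import Callable, Dict, List, Optional, Tuple

Page = Tuple[List[str], Optional[str]]   # (headlines, next request or None)


def recover(fetch: Callable[[str], Page], capacity: int, first: str = "HISTORY") -> List[str]:
    """Iterate the history chain until the local cache capacity is reached."""
    recovered: List[str] = []
    request: Optional[str] = first
    while request is not None and len(recovered) < capacity:
        headlines, request = fetch(request)
        # Never take more than the local cache can hold.
        recovered.extend(headlines[: capacity - len(recovered)])
    return recovered


# A fake upstream server holding three pages of history.
upstream: Dict[str, Page] = {
    "HISTORY": (["h1", "h2"], "page2"),
    "page2": (["h3", "h4"], "page3"),
    "page3": (["h5"], None),
}
small_cache = recover(upstream.__getitem__, capacity=3)
large_cache = recover(upstream.__getitem__, capacity=10)
```

The small cache stops part way through the second page, while the large cache stops only when the upstream history runs out.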

Story Requests

Anything that is not a headline or search request is treated as a story request. Optional data is not required.
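The way a server might distinguish between these requests can be sketched as a simple classifier. The instrument names HEADLINE and HISTORY and the "?" prefix come from the descriptions above; the function and enumeration names are hypothetical, and the recovery name is a parameter because that request is configurable.

```python
from enum import Enum


class RequestKind(Enum):
    HEADLINE = "headline"
    SEARCH = "search"
    RECOVERY = "recovery"
    STORY = "story"


def classify(name: str, recovery_name: str = "HISTORY") -> RequestKind:
    """Decide how a request should be handled from its instrument name."""
    if name == "HEADLINE":
        return RequestKind.HEADLINE
    if name.startswith("?"):
        return RequestKind.SEARCH
    if name == recovery_name:
        return RequestKind.RECOVERY
    # Anything else is treated as the name of a story.
    return RequestKind.STORY
```

For example, classify("?") yields a search request, whereas an arbitrary item name falls through to a story request.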

News Data Formats

Headlines

Headlines are delivered to a News Server as records containing at least the following fields:

NAME

May be any of the following values:

NewsAlert = 1
NewsFirstTake = 2
NewsSubsequentTake = 3
NewsCorrection = 4
NewsCorrected = 5
NewsDrop = 7
NewsExpired = 8
NewsNextLR = 9

NEWS_STORY_SRC

The name of the source providing the story

NEWS_STORY

The name of the item to request for the story

STORY_TIME

The datetime the story was created

STORY_TYPE

May be any of the following values:

Z = PERMANENT_STORY
R = TEMPORARY_STORY

ATTRIBTN

The name of the wire providing the data

BCAST_TEXT

The text of the headline

CO_IDS

The names of companies associated with the headline

PROD_CODE

A list of products to which the headline relates

TOPIC_CODE

Topic codes associated with the headline

PROC_TIME

The datetime at which the story is processed (defaults to STORY_TIME)

NEXT_LR

When Name=NewsNextLR this contains the next logical record in the sequence of headline query results

RTPERMISSIONS

The Permissions field for the headline

TAKE_TIME

The datetime of this particular take (defaults to STORY_TIME)

TAKE_SEQNO

The sequence number of the take

The action performed by downstream modules depends on the content of the NAME field. Downstream devices should handle messages in a manner such that duplicate transmission of the same headline message can be handled correctly.

NewsAlert
This indicates that there is a new News alert for the story. News Alerts typically have no story body associated with them.
NewsFirstTake
This indicates the first take on a breaking news story and is the mechanism used to add stories into the system.

NewsSubsequentTake
This indicates that the story is filed as a subsequent take on an existing story. New takes should contain the entire text of the story.

NewsCorrection
Indicates that the headline represents a correction to a previously filed story and headline. This is used to notify the user that some part of the original content of the story is missing, incorrect or otherwise invalid.

NewsCorrected
This is similar to Correction, except that only the body of the News Story has changed, not the headline.

NewsDrop
This indicates that the story is no longer valid and should be dropped from the system. Both permanent and temporary stories may be dropped.
NewsExpired
This indicates that all stories prior to the time given in the expiry should be dropped unless those stories are marked as permanent stories (see STORY_TYPE).
NewsNextLR
Indicates that the streaming of query results is complete and that, to retrieve more results, the downstream device needs to issue the query for NEXT_LR prior to dropping the current record.
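How a downstream device might act on the NAME field can be sketched as below. Several details are assumptions made for illustration only: headlines are keyed by the NEWS_STORY field, the expiry time is carried in STORY_TIME, and times are simple comparable values. Writing by key means that a duplicate transmission of the same message leaves the store unchanged.

```python
from enum import IntEnum
from typing import Dict


class Name(IntEnum):
    NewsAlert = 1
    NewsFirstTake = 2
    NewsSubsequentTake = 3
    NewsCorrection = 4
    NewsCorrected = 5
    NewsDrop = 7
    NewsExpired = 8
    NewsNextLR = 9


PERMANENT_STORY = "Z"


def apply(store: Dict[str, dict], msg: dict) -> None:
    """Apply one headline message to a local headline store."""
    name = Name(msg["NAME"])
    if name == Name.NewsDrop:
        # Both permanent and temporary stories may be dropped.
        store.pop(msg["NEWS_STORY"], None)
    elif name == Name.NewsExpired:
        cutoff = msg["STORY_TIME"]   # assumed location of the expiry time
        expired = [k for k, h in store.items()
                   if h["STORY_TIME"] < cutoff and h.get("STORY_TYPE") != PERMANENT_STORY]
        for key in expired:
            del store[key]
    elif name == Name.NewsNextLR:
        pass                         # paging marker, not a headline
    else:
        # Alerts, takes and corrections add or replace the headline.
        store[msg["NEWS_STORY"]] = msg


store: Dict[str, dict] = {}
take = {"NAME": 2, "NEWS_STORY": "a", "STORY_TIME": 10, "STORY_TYPE": "R"}
apply(store, take)
apply(store, take)                   # duplicate transmission is harmless
apply(store, {"NAME": 2, "NEWS_STORY": "b", "STORY_TIME": 5, "STORY_TYPE": "Z"})
apply(store, {"NAME": 2, "NEWS_STORY": "c", "STORY_TIME": 5, "STORY_TYPE": "R"})
apply(store, {"NAME": 8, "STORY_TIME": 8})   # expires "c" but not permanent "b"
after_expiry = sorted(store)
apply(store, {"NAME": 7, "NEWS_STORY": "b"})
after_drop = sorted(store)
```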

Stories

News stories are delivered as:

TABTEXT

The format of the text in this segment:

T - Preformatted ASCII
X - ParaPerLine ASCII
R - PlainText ASCII
H - HTMLText
F - Preformatted RTF
P - Plain RTF

SEG_TEXT

The story text. This may be separated into a sequence of text bodies linked with the NEXT_LR field.

RTPERMISSIONS

The Permissions field for the story

ATTRIBTN

The name of the wire providing this story

NEXT_LR

The name of the record containing the next part of the story or an empty field (may be an empty string, a VT_NULL or a VT_EMPTY).

The TABTEXT field is used to identify the formatting of the text segment. News Server handles its stories as HTML text. If any of the other formats is used, the News Server converts the incoming story to HTML. The conversions are, by necessity, very simple and use the following rules:

  • Preformatted ASCII: Place <pre> at the beginning of the text and </pre> at the end. This formatting is usually used to stop word wrapping on textual tables.

  • ParaPerLine ASCII: Convert each new line character in the file to <P>.

  • PlainText ASCII: Convert each sequence of 2 new line characters into <P>.

  • HTMLText: Pass the story through unchanged.

  • Preformatted RTF: RTF text that contains tabular data and is marked up with a paragraph break per line. This is converted to text and marked up as Preformatted ASCII.

  • Plain RTF: This is converted to text, with paragraph boundaries converted to newlines to create wrappable text.
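The ASCII and HTML rules above are simple enough to sketch directly. The function name to_html is hypothetical, and the two RTF formats are deliberately left out because they first require an RTF-to-text step that is not described here. No escaping of special characters is performed, since the rules do not call for it.

```python
def to_html(tabtext: str, text: str) -> str:
    """Convert a story segment to HTML according to its TABTEXT code."""
    if tabtext == "T":                       # Preformatted ASCII
        return "<pre>" + text + "</pre>"
    if tabtext == "X":                       # ParaPerLine ASCII
        return text.replace("\n", "<P>")
    if tabtext == "R":                       # PlainText ASCII
        return text.replace("\n\n", "<P>")
    if tabtext == "H":                       # HTMLText
        return text
    if tabtext in ("F", "P"):                # RTF variants
        raise NotImplementedError("RTF must first be converted to text")
    raise ValueError("unknown TABTEXT code: " + repr(tabtext))
```

For instance, a PlainText ASCII story keeps its single newlines (which HTML treats as white space) and gains a paragraph mark only where the source had a blank line.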

Query Results

Query results are delivered in the same format as headlines, but each headline is streamed to the client. On the final headline of the page, a NewsNextLR message is sent with the NEXT_LR field filled in with a next logical record. This simplifies the actions of the permission system.
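The shape of one streamed page can be sketched with a generator. The form of the NEXT_LR value ("?from=N") is invented for the example, as is the choice to send an empty NEXT_LR when no further results exist; this document does not define either.

```python
from typing import Dict, Iterator, List

NEWS_NEXT_LR = 9   # NAME value of the page terminator


def stream_page(results: List[Dict], start: int, page_size: int) -> Iterator[Dict]:
    """Stream one page of headlines followed by a NewsNextLR message."""
    page = results[start:start + page_size]
    for headline in page:
        yield headline
    next_start = start + len(page)
    more = next_start < len(results)
    # The terminator tells the client which record to request next.
    yield {"NAME": NEWS_NEXT_LR, "NEXT_LR": f"?from={next_start}" if more else ""}


results = [{"NAME": 2, "BCAST_TEXT": f"headline {i}"} for i in range(5)]
page1 = list(stream_page(results, 0, 2))
last = list(stream_page(results, 4, 2))
```

Because every headline in the page is an ordinary headline record, a permission system can treat it exactly as it treats a broadcast headline.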

4. News Implementation

Ingestion Modules

There are currently four modules responsible for basic ingestion:

PGE Handler

This module is responsible for importing all news data from PGE, including AFX, Dow Jones and ICV newswires. During start-up, this feed sends up to 36 hours of news for recovery purposes.

PRNews

This cut-down NewsML reader is used to import the PR-NewsWire. During start-up, this feed sends all current news stories for recovery purposes.

DPA Online

This is used to ingest NITF format news from DPA. During start-up, this feed sends all current news stories for recovery purposes.

Each of the ingest modules currently makes the news look like the standard News 2000 service. Where available, a recovery of old headlines is supplied.

Some additional work will be necessary to separate the recovery from the standard headline search, so that news servers can request news data separate from recovery records. This recovery request will be the only news related request (apart from HEADLINE and stories) that an ingest module will have to recognise.

In the future, we will need to create new ingest handlers.

News Server Ingest

The main bulk of the News Server ingest will remain the same with the addition of code used in the DeutchesBank category mapping system. However, the following additions will be made:

  • The ingestion module will have the ability to issue recovery requests to upstream servers.
  • The module will have the option of only storing headlines until the story is made available because of a downstream request, simplifying support for existing News 2000 subscribers.
  • The ingestion module will map incoming category codes to a standard set using the same techniques as provided by DeutchesBank, except that unknown category codes will be entered into the property database unchanged.
  • The source and alert properties may be removed.
  • News servers may import data from upstream News Servers for both content and queries (see below). Only one upstream server at any given time will be supported (Dataworks Enterprise will handle failover of multiple News Servers).
  • News servers will be organised to expire news items on a per source basis with changes required to the repository (see below).

    This work is expected to take no more than ten days to complete.

    News Server Request Module

    The main changes to the News Server will occur in the request module. The following features will be supported:

    This work is expected to take no more than ten days to complete.

    Content Indexer

    The main bulk of the content index and repository will remain the same, including the IFilter system. In the future, we may want to upgrade to a News-specific XML parser. The following changes will be made:

    The new repository will support much more resilient architecture based around RAID systems or equivalent. This work is expected to take no more than ten days to complete.

    Module/Unit Testing

    In order to test the news server, we shall have to:

    Unit/Module Testing should be complete in ten days.

    News Helper Object

    In order to assist with the upgrade of Web applications, the News Server will have an accompanying object known as the News Real Helper Object. This is an object that provides the traditional array interfaces to the news for ASP clients without having to handle the streaming headlines that are generated by a query.

    The News Real Helper Object provides methods for requesting a news query, identifying whether a next page exists and issuing the request for the next page. The object simulates blocking mode requests (used in ASP pages) and should be (mostly) plug compatible with existing Web pages using News.
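The idea behind the helper can be sketched as a wrapper that drains a streamed page into an array. The class and method names (NewsHelper, query, has_next, next_page) are hypothetical, and fake_stream merely stands in for a News Server. The sketch ignores the requirement that the next request be issued before the previous one is released.

```python
from typing import Callable, Dict, Iterable, Iterator, List

NEWS_NEXT_LR = 9   # NAME value that terminates a page of query results


class NewsHelper:
    """Collects streamed query results into arrays (hypothetical interface)."""

    def __init__(self, stream: Callable[[str], Iterable[Dict]]) -> None:
        self._stream = stream
        self._next = ""

    def query(self, request: str) -> List[Dict]:
        return self._collect(request)

    def has_next(self) -> bool:
        return bool(self._next)

    def next_page(self) -> List[Dict]:
        if not self._next:
            raise LookupError("no further pages")
        return self._collect(self._next)

    def _collect(self, request: str) -> List[Dict]:
        rows: List[Dict] = []
        self._next = ""
        for msg in self._stream(request):
            if msg["NAME"] == NEWS_NEXT_LR:
                self._next = msg.get("NEXT_LR") or ""
                break                # the page is complete
            rows.append(msg)
        return rows


def fake_stream(request: str) -> Iterator[Dict]:
    """Stand-in for a News Server: two pages of two headlines each."""
    pages = {"?": (["h1", "h2"], "?page2"), "?page2": (["h3", "h4"], "")}
    texts, next_lr = pages[request]
    for text in texts:
        yield {"NAME": 2, "BCAST_TEXT": text}
    yield {"NAME": NEWS_NEXT_LR, "NEXT_LR": next_lr}


helper = NewsHelper(fake_stream)
first = [h["BCAST_TEXT"] for h in helper.query("?")]
second = [h["BCAST_TEXT"] for h in helper.next_page()] if helper.has_next() else []
```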

    5. Issues Arising from the Specification

    The following issues have arisen from the foregoing specification.

    Timestamp Synchronisation

    Since times are applied to News stories by vendors, there is a potential for one News Vendor to constantly appear to be delivering news in a more timely fashion than another. This causes the perception by end users that one vendor is "better" than another in terms of timeliness of reporting and delivery of stories. In extreme cases, this also means that users sorting stories by time (the norm) may always only ever see one vendor at the top of the list of headlines.

    These issues also cause considerable problems for the ingestion and consolidation of News. Newsvendors can backfill news stories, i.e. it is possible to receive old news headlines after subsequent ones. Consequently, the News Server cannot update the story dates and times on reception.

    Timestamps are also essential for allowing clients to specify the boundary conditions of searches to create the effect of cursoring. Clients must have the functionality to filter input for matches to the search criteria and the ability to filter out duplicate messages.

    Time Formats

    As far as time formats are concerned, the value of the time fields should always be in the UTC time zone. This allows the indexing system to have a consistent value against which to index the stories. The preferred display value for News stories is currently DD MMM YY. However, it might be more useful to provide the ISO standard format (YYYY-MM-DDTHH:MM:SS+/-HHMM), maintaining information on the time zone of publication.
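As an illustration, the following sketch normalises a vendor time stamp to UTC for indexing and also produces the two display forms mentioned above. The function names are hypothetical, and the month abbreviation produced by the display form depends on the locale in use.

```python
from datetime import datetime, timedelta, timezone


def to_utc(ts: datetime) -> datetime:
    """Normalise a time stamp to UTC so that index values are consistent."""
    if ts.tzinfo is None:
        raise ValueError("a time stamp without a time zone is ambiguous")
    return ts.astimezone(timezone.utc)


def display_date(ts: datetime) -> str:
    # Current display form: DD MMM YY (month name is locale dependent).
    return to_utc(ts).strftime("%d %b %y")


def iso_with_offset(ts: datetime) -> str:
    # ISO-style form that keeps the offset of the place of publication.
    return ts.strftime("%Y-%m-%dT%H:%M:%S%z")


# A story published at 09:30 in a zone one hour ahead of UTC.
published = datetime(2001, 5, 1, 9, 30, 0, tzinfo=timezone(timedelta(hours=1)))
indexed = to_utc(published)
shown = iso_with_offset(published)
```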

    Duplicate Stories

    Due to the mechanisms used by vendors to construct news wires, it is frequently the case that the same story can exist on multiple wires. This is a nuisance to a user who subscribes to more than one of those wires. However, such duplicates cannot be removed at the server, as downstream subscribers who only have access to one newswire might then lose the story completely. Consequently, the duplicates have to be removed at the point of query or receipt of query. If the duplicates are removed during query at the server, then replication will no longer work correctly. Consequently, clients will need to remove the duplicates.
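Client-side removal of duplicates might be sketched as follows. How two headlines are recognised as the same story is not defined in this document; the sketch assumes, purely for illustration, that matching headline text (ignoring case and spacing) together with an identical STORY_TIME identifies a duplicate.

```python
from typing import Dict, Iterable, List, Set, Tuple


def dedupe(headlines: Iterable[Dict]) -> List[Dict]:
    """Keep the first copy of each story seen across the subscribed wires."""
    seen: Set[Tuple[str, object]] = set()
    unique: List[Dict] = []
    for h in headlines:
        # Assumed duplicate key: normalised headline text plus story time.
        key = (" ".join(h["BCAST_TEXT"].split()).lower(), h["STORY_TIME"])
        if key in seen:
            continue
        seen.add(key)
        unique.append(h)
    return unique


received = [
    {"ATTRIBTN": "WIRE_A", "BCAST_TEXT": "Acme profits rise", "STORY_TIME": 100},
    {"ATTRIBTN": "WIRE_B", "BCAST_TEXT": "ACME  profits rise", "STORY_TIME": 100},
    {"ATTRIBTN": "WIRE_B", "BCAST_TEXT": "Rates held", "STORY_TIME": 101},
]
shown = dedupe(received)
```

A subscriber to only one of the two wires still receives the story, because nothing was removed at the server.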

    Contractual Issues and Permissioning

    There are a number of problems associated with contractual issues.

    Firstly, some vendors have contracts that prohibit the storage of news stories beyond a given duration. Typically, these vendors support historical news data services of their own. Consequently, expiry has to be limited by the News feed (through the delivery of Expiry messages) to this duration.

    In addition, consumers of News should not be given access to any content that predates their contract dates. Our customers also specify news expiry periods (e.g. 90 days of News) and should not be given access to news outside this period. This is probably a permission issue.

    News Drops and Expiries

    Some vendors issue drops and reissue the story. If the News Server ignores the drop messages, then duplicate stories will exist in the news system. On the other hand, if the News Server handles drop messages, then the News archive can become no bigger than that which is provided by the vendor. Consequently, we would experience a loss in the news history that could be quite catastrophic. For instance, News2000 issues drops and expiries allowing News to exist for no more than 24 hours, which is not adequate for most of our users. Consequently, News Server handles drops but not expiries.

    Missing Updates from Stream

    When a client requests news history, they must issue a request for the headline stream prior to issuing the request for the history. Incoming headlines should be matched against the search criteria even before any history results have been received. This removes the possibility that headlines that are in transit during the request for history are missed.
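The ordering described above can be sketched as a client-side view that accepts live headlines before any history arrives. The class name HeadlineView and the use of NEWS_STORY as the key are assumptions made for the example.

```python
from typing import Callable, Dict, Iterable, List


class HeadlineView:
    """Merges the live headline stream with history pages (hypothetical sketch)."""

    def __init__(self, predicate: Callable[[Dict], bool]) -> None:
        self._predicate = predicate
        self._by_story: Dict[str, Dict] = {}

    def on_live(self, headline: Dict) -> None:
        # Live headlines are matched against the search criteria straight away,
        # even if no history results have been received yet.
        if self._predicate(headline):
            self._by_story[headline["NEWS_STORY"]] = headline

    def on_history(self, page: Iterable[Dict]) -> None:
        # History never overwrites a headline already received from the stream.
        for headline in page:
            self._by_story.setdefault(headline["NEWS_STORY"], headline)

    def headlines(self) -> List[Dict]:
        return sorted(self._by_story.values(), key=lambda h: h["STORY_TIME"], reverse=True)


view = HeadlineView(lambda h: "ACME" in h["CO_IDS"])
# 1. The HEADLINE stream is requested first, so "s3" arrives while history is in transit.
view.on_live({"NEWS_STORY": "s3", "STORY_TIME": 3, "CO_IDS": ["ACME"]})
view.on_live({"NEWS_STORY": "s9", "STORY_TIME": 4, "CO_IDS": ["OTHER"]})
# 2. The history page then arrives and happens to contain "s3" as well.
view.on_history([
    {"NEWS_STORY": "s3", "STORY_TIME": 3, "CO_IDS": ["ACME"]},
    {"NEWS_STORY": "s2", "STORY_TIME": 2, "CO_IDS": ["ACME"]},
    {"NEWS_STORY": "s1", "STORY_TIME": 1, "CO_IDS": ["ACME"]},
])
order = [h["NEWS_STORY"] for h in view.headlines()]
```

The headline that was in transit appears exactly once, and nothing published between the two requests is lost.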

    Equally, when the News Server stores a headline in its cache there may be a delay between the time at which the News Story is saved and the time at which it is indexed. Consequently, the headline should not be delivered to downstream devices until the indexing process is complete.

    A: News Real Server

    The current News system consists of two types of component, the news ingest components and the News Real Server.

    Ingestion

    News data is ingested into Dataworks Enterprise through some module that is either:

    In either case, the module simply takes the incoming data and reformats it into something that is understood by Dataworks Enterprise according to very simple rules.

    News Real Server

    The News Real Server is a PDP Source that ingests news headlines and optionally stories and stores them for use by other PDP clients. It has a variety of configurations depending on the requirements of the customer.

    News Real Servers can either talk directly to ingestion modules or take a consolidated feed from other News Real Servers. They can either store and index headlines internally (in memory) or store the stories on file and employ MS Index Server (sometimes known as Content Indexer or CI) as an indexer. The News Real source (usually called News) provides a stream of updates to the HEADLINE instrument as news headlines are received (actually when the accompanying story is ingested). It also provides syntax for querying the database. The syntax used depends on whether Index Server is in use or not. Results are delivered using an array format. Internally, News Real Servers can use any Index Server query system (i.e. version 1, 2 or 3).

    There are three basic configurations:

    No Story Repository

    This configuration is used to provide an updating cache of News Headlines with no underlying News Story repository. This is the most basic News Server configuration and is used where the underlying News headline sources do not cache the headline data. In this configuration, the Cache Stories option is switched off.

    News Real Server - Repository with Internal Indexing

    This configuration is used to provide both an updating cache of News Headlines and a News Story repository. The headline index is stored in the memory of the News Server and supports a very basic query language based on category codes and company names. The repository is a collection of HTML files marked up using HTML comments with additional data. This would be used where a repository is required, but not the full indexing capability of a Content Indexer. Note: it is possible to install this configuration with the News source not enabled allowing the News Server to act simply as a mechanism to store incoming news stories.

    News Real Server - Repository and MS Index Server

    This configuration is used to provide both an updating cache of News Headlines and a News Story repository. The headline index is stored using MS Index Server which supports a very sophisticated query language based on the full text and other properties of the News stories including category codes and company names. The repository is a collection of HTML files marked up using HTML comments with additional data. This would be used where a repository is required with the full indexing capability of the Content Indexer. Note, it is possible to install this configuration with the News source not enabled allowing the News Real Server to act simply as a mechanism to store and index incoming news stories. This configuration requires considerable additional installation and configuration of the Content Indexer.

    Index Server

    In most configurations, News Real uses an Index Server as the searching engine. Index Server is used because it is:

    Index Server V3 is much more stable than V2 and easier to install. All Windows 2000 machines have it installed by default. Index Server 3 will only run on Windows 2000 hosts. The chief problems with Index Server result from: