Dataworks Enterprise News SubSystem

Dataworks Enterprise News SubSystem is responsible for the ingest and distribution of News Data.

History and Documentation

There are two quite separate news servers, colloquially known as the "old news server" (NewsSvr.exe) and the "new news server" (NewsReal.exe). The old news server was used prior to Release 2.1.0.

With Release 2.1.0, the old news server was replaced by a completely new module (NewsReal.exe). This provides all the facilities of the old news server plus a collection of new functionality. The old news server is still distributed on the release CD and is still supported, but we recommend that users upgrade to the new news server. In a future release, the old news server will no longer be supported. Consequently, this Web Site deals mainly with issues concerning the new news server.

News Servers ingest news headlines and stories from a variety of sources and keep a consolidated repository of News information for downstream consumers. News Servers deliver a consolidated stream of headlines and provide a search facility with both meta-data (such as company and category codes) and full text search capabilities.

The potential user groups fall into four basic categories:

Web Servers ("Hit and Run").
In this scenario, a web page requests a set of headlines using categories and company names and retrieves the resulting list of headlines. There is no further interaction between the client and the server.
Standard Trader Desktops (Interest-Based).
In this scenario, the user requests headlines that match a filter and expects to be delivered historical headlines (usually page by page) followed by all news stories that match the criteria.
Web-Based Users on Streaming Data Systems (Interest-Based).
This is a variant of the previous scenario in which the delivery system is a streaming module feeding active web content.
Bulk Delivery (Replication).
In this scenario, the News system is simply used to replicate News Databases from one site to another.

There is, however, considerable crossover between these groups (particularly those that are interest based).

The news system as a whole consists of a hierarchy of news storage servers. All News Servers in the hierarchy run the same software and differ only in configuration. Each server stores both stories and headlines, which simplifies the configuration. Top-level server(s), which may be located centrally or at a customer site, keep long-term history as a reference. Top-level servers can recover history from:

  • Feeds, which are of limited use because feeds drop stories after a very short period (24-36 hours)
  • Replication from another News Server

    An upstream News Server is used to satisfy "replication" requests from downstream News Servers and to back-fill queries where the downstream device has insufficient headlines to satisfy the query. The downstream News Servers are used to offload request traffic from the upstream ones. Downstream servers cache only the more recent headlines. Each level has a maximum cache capacity (number of headlines/size of cache) for reasons of reliability.

    At start-up, the top-level servers restore headlines from disk and recover connections to feeds. When a downstream server connects to an upstream one, it clears its cache and requests history from the upstream device up to the capacity of its local cache. Lower-level servers keep smaller caches but receive more hits; higher-level servers keep larger caches and receive fewer hits.
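The connect-time behaviour described above can be illustrated with a minimal Python sketch. It is not the NewsReal implementation: the `Headline` fields, the `HeadlineCache` class and the `connect_downstream` function are hypothetical names chosen for illustration, and capacity is assumed to be counted in headlines only.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Headline:
    update_id: int    # hypothetical unique story identifier
    timestamp: float  # hypothetical arrival time, seconds since the epoch
    text: str


class HeadlineCache:
    """Bounded cache that keeps only the most recent headlines."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        # deque(maxlen=...) discards the oldest entry once the cache is full.
        self._items = deque(maxlen=capacity)

    def add(self, headline: Headline) -> None:
        self._items.append(headline)

    def clear(self) -> None:
        self._items.clear()

    def latest(self, count: int) -> list:
        """Return up to `count` of the most recent headlines, oldest first."""
        items = list(self._items)
        return items[-count:] if count > 0 else []

    def __len__(self) -> int:
        return len(self._items)


def connect_downstream(downstream: HeadlineCache, upstream: HeadlineCache) -> None:
    """On connect, clear the local cache and request history from the
    upstream cache, up to the capacity of the local cache."""
    downstream.clear()
    for headline in upstream.latest(downstream.capacity):
        downstream.add(headline)
```

In this sketch a downstream cache with a capacity of 3 that connects to an upstream cache holding ten headlines ends up with the three most recent ones, and anything it held before the connection is discarded.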

    When a news server receives a request, it attempts to satisfy the request from the local repository until the amount of data locally cached from an upstream news server is reached. The News server then combines data from an upstream request (eliminating duplicates) with local data.
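As a rough illustration of this back-fill step, the following Python sketch answers a request locally when enough headlines are cached and otherwise merges an upstream response with the local data, eliminating duplicates. The `Headline` fields, the use of an update ID as the duplicate key, and the `fetch_upstream` callable are assumptions made for the example, not part of the News Server interface.

```python
from typing import Callable, List, NamedTuple


class Headline(NamedTuple):
    update_id: int    # assumed to identify a story uniquely
    timestamp: float  # hypothetical arrival time
    text: str


def satisfy_request(
    local: List[Headline],
    fetch_upstream: Callable[[int], List[Headline]],
    wanted: int,
) -> List[Headline]:
    """Return up to `wanted` headlines, newest first."""
    newest_first = sorted(local, key=lambda h: h.timestamp, reverse=True)
    if len(newest_first) >= wanted:
        # The local repository is sufficient; upstream is not contacted.
        return newest_first[:wanted]

    # Not enough local data: ask upstream, then merge. Keying on update_id
    # removes duplicates; local copies replace upstream copies on a clash.
    merged = {h.update_id: h for h in fetch_upstream(wanted)}
    merged.update({h.update_id: h for h in local})
    combined = sorted(merged.values(), key=lambda h: h.timestamp, reverse=True)
    return combined[:wanted]
```

Passing the upstream call as a function keeps the sketch independent of any transport; a real server would issue the upstream request over its connection to the next level of the hierarchy.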

    News clients use the News Server query language to retrieve the set of headlines they display. The client then applies the same search criteria to incoming headlines, filtering out those it does not need. Some specific matches are not possible this way, e.g. full text search on a story.
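A client-side filter of this kind might look like the following Python sketch. The field names and matching rules are assumptions for illustration (the News Server query language itself is not reproduced here): a headline passes if it shares at least one requested category, at least one requested company code, and contains every requested headline word. Story bodies are not available to the filter, which is why full text search on a story cannot be done this way.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IncomingHeadline:
    text: str
    categories: frozenset = frozenset()  # hypothetical category codes
    companies: frozenset = frozenset()   # hypothetical company codes


@dataclass(frozen=True)
class ClientFilter:
    categories: frozenset = frozenset()      # empty means "any category"
    companies: frozenset = frozenset()       # empty means "any company"
    headline_words: frozenset = frozenset()  # lower-case words, all required

    def matches(self, headline: IncomingHeadline) -> bool:
        if self.categories and not (self.categories & headline.categories):
            return False
        if self.companies and not (self.companies & headline.companies):
            return False
        words = set(headline.text.lower().split())
        return self.headline_words <= words


def filter_stream(criteria: ClientFilter, incoming) -> list:
    """Keep only the incoming headlines that match the client's criteria."""
    return [h for h in incoming if criteria.matches(h)]
```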

    A key element of the system is the story repository. This is capable of handling very large numbers of documents (possibly millions). The only limits on the number of stories held are the amount of physical storage space and the ability of the system to utilize that space. Specifically, the system can cache at least 390 days of history. The News indexing system is required to be capable of indexing and cataloging a large online repository of news information. The indexes are suited to the types of query we wish to perform. Indexes are created on:

  • Metadata (category and company codes)
  • Full text of headlines and stories
  • Dates and times
  • Story UpdateIDs

    A query system is needed that can use the index system to search and retrieve the history of news headlines. Searches automatically filter out what is not required or not permissioned. The search results are cursored in sets, allowing the downstream system to request the next and previous set. Results are ordered by relevance or in reverse date/time order. The search results contain a reference that allows the downstream system or user to download the body of the item.
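The cursoring of result sets can be sketched in Python as below. The `ResultCursor` class and its method names are hypothetical and only illustrate the next/previous paging idea; they are not the NewsReal or Index Server interface.

```python
class ResultCursor:
    """Pages through an already ordered list of search results."""

    def __init__(self, results, set_size: int) -> None:
        if set_size <= 0:
            raise ValueError("set_size must be positive")
        self._results = list(results)
        self._set_size = set_size
        self._start = 0  # index of the first item in the current set

    def current(self) -> list:
        return self._results[self._start:self._start + self._set_size]

    def next(self) -> list:
        """Move to the next set; stay on the last set if already there."""
        if self._start + self._set_size < len(self._results):
            self._start += self._set_size
        return self.current()

    def previous(self) -> list:
        """Move to the previous set; stay on the first set if already there."""
        self._start = max(0, self._start - self._set_size)
        return self.current()
```

The caller is expected to order the results (by relevance or newest first) before constructing the cursor, so that each set is a contiguous slice of that ordering.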

    The query system in the NewsReal Server is the Microsoft Index Server (formerly Content Indexer).

    History

    Release 2.0.0 - Original News Server Released
    Release 2.2.0 - New News Server Released. Old News Server still distributed on CD.

    Documentation

    Dependencies

    None.

    Configuration

    The NewsReal server has a standard, commented configuration file. The main items to configure are the input sources. An example configuration is provided on the release CD.

    Sources

    The News Server advertises a source called "NewsReal" by default. The old news server called this "News" by default. The News Source is described here.

    Known Problems

    The majority of the known problems with the News Server are associated with the Index Server, particularly with version 2 of the Index Server.