News is an essential resource for investors in both institutional and retail
marketplaces. There is a common feeling among investors that while quotes and
trades are important as a reflection of the current and past state of the
markets, it is News that "makes the market". Consequently, Dataworks
Enterprise, as a platform for the publication of financial data, must have an
appropriate solution for the delivery of news information.
The Dataworks Enterprise system has a News Ingestion system known as News Real. This
is currently in use in our retail business. This review was undertaken to see
how this architecture could be adapted to better support the institutional
marketplace. This review has also afforded the opportunity to address some
additional issues that have arisen internally, such as the requirement for a
permission system.
The changes to News Real are significant enough to warrant a re-evaluation
of the use of News Content in each of our primary marketplaces (Part 2) and a
review of the existing News system and how it attempts to achieve the
functionality required (Appendix A). In the third part of this document, we
introduce revisions to the News Real system. This new system would replace the
existing retail system and add the enhanced functionality we seek to supply
to the institutional marketplace.
News items currently consist of a News Headline and a News Story. Each news item
has:
In addition, this material may have abstracts, possibly automatically generated.
As stated above, News data is much richer than we currently account for. At
present, we are only interested in textual stories. However, the stories or
research could be (or contain) charts, pictures and other non-textual content.
This does not apply so much to real-time financial news as to research.
Real-time headlines tend to be delivered as quickly as possible and so do not
usually have much more than just the text "dashed off" by the reporter. We must
maintain this performance.
A story is a news item, filed by journalists and made available to clients in
one or more parts. All parts of a story have a common identifier, the STORY_ID.
The first part of a story may be an alert. This is a brief item containing the
most essential information relating to that emerging story. Sometimes several
alerts for the same story are filed in quick succession. Alerts are delivered
as a headline, marked as a News Alert. Consequently, the client simply receives
a stream of headlines. Alerts may be followed up by a headline and the first
piece of text for the story. This text and its associated headline are called a
"take". Subsequent takes may also be filed containing additional text as
needed. As described above, a story also has attached to it meta-data in the
form of category codes. These are transmitted with alerts and headlines.
The complete news story therefore consists of any alerts, the headline, the
story text and the category codes. Each alert or take of the same story also
contains two time stamps: the story date and time and the take date and time.
The story date and time is the time (GMT) that the first alert or take for that
story was filed and is the same for all alerts and takes of that story. The
take date and time is the time that particular alert or take was filed.
News is delivered to us through two primary mechanisms: Wires and FTP push.
Newsvendors traditionally deliver news as "news wires". These began life as the
well understood ticker feeds and evolved to the use of serial communications.
News wires usually employ proprietary formats and may be either broadcast (the
norm) or request-response. News wires are typically employed where the delivery
of news is considered time critical. An example of a News Feed is the Dow Jones
newsfeed.
Similarly, news is also transferred using FTP push. In this scenario, the news
consumer opens a public (secured) FTP site that they make accessible to the
vendor. The vendor then employs FTP (or possibly some other mechanism) to
transfer files containing news stories. Again, the format of the news is either
proprietary or uses one of the transfer formats described below. This type of
mechanism is usually used for less time critical delivery and usually happens
on some schedule (e.g. four times a day).
There are three standards used in the delivery of News content using FTP push.
These standards are defined by the IPTC, an international consortium of news
publishers and vendors. The standards are known as NITF, XML-NITF and NewsML.
NITF (News Industry Text Format) is an SGML based news transfer format.
It allows for the definition of headlines, categories, authors and other
details of the news stories. It also allows for richer content including the
provision of pictures. Most IPTC members use NITF in one form or another.
Dataworks Enterprise currently ingests SGML-NITF for the German news agency,
DPA. XML-NITF updates the original NITF standard, allowing for easier ingest
using standard XML parsers. DPA will be moving to XML-NITF later this year.
NewsML is the newest of the standard IPTC formats. It began life as a
replacement for the NITF standard to allow richer content to be transferred in
standard XML formats. In its original form, it was formally known as XML-News.
NewsML provides a complex schema allowing for the complete mark-up of all
aspects of news stories. Although IPTC members have formally adopted NewsML,
the standard is very new and, consequently, not used by many newsvendors at
present.
Textual data is delivered in a wide range of formats, some more or less
susceptible to conversion. For instance, news stories might be in SGML, HTML,
XML, RTF, ASCII or some other format. It is not always possible to look at the
data and determine very much about the intended formatting. This becomes a
significant problem when dealing with tables and other ASCII formatting.
Data vendors and distributors such as Thomson integrate these news
wires with their data systems allowing end users access to the news. To this
end, these vendors normalize the data into some internal proprietary format for
ingestion into their systems. Similarly, Dataworks Enterprise must also be
responsible for some data normalisation.
The normalization is typically applied to:
The degree to which normalization is carried out depends on the vendor or
distributor. For instance, some vendors are in control of the inception and
creation of news (they are primarily news vendors). Consequently, Reuter feeds
have normalized story formats (text and tab-text), category codes (E for
Equity, M for Money) and company codes (RICs). On the
other hand, Thomson Financial PGE ingests News from a variety of sources. PGE
normalizes company codes, but not category codes. News stories in PGE are
delivered in RTF (Delphi variant). Although the feed has the capability to
deliver other formats, the workstation cannot currently process those
formats.
Non-textual data is typically limited to still pictures, which are delivered
using the whole gamut of picture formats including JPEG, GIF, PNG, TIFF,
etc.
As an aside, it is interesting to note that News and Analyst Reports can be very
similar in many respects. Both are delivered using file downloads; both types
of system define symbology for meta-data including company symbols, category
codes, sectors, etc.
One area where News and Research are fundamentally different is in the numbers
and rates of update of stories and, consequently, the period of time that a
story is held for. Research information is updated infrequently and, typically,
kept indefinitely. News, on the other hand, consists of thousands of stories
per day and is, consequently, deleted after a set period. Indeed, there may be
contractual requirements that mean that we are not allowed to hold stories for
too long.
Analyst reports might be ingested into Dataworks Enterprise as part of a
centralized system similar to those created by TFI or might form part of an
internal ingest and publish system for a large institutional investor. Our
architecture should take into account the similarity of research and news.
As far as consumers are concerned, we have divided the potential user groups
into four basic categories:
There is a considerable crossover between these groups (particularly those that
are interest based). We would like to build a system where a set of basic
Dataworks Enterprise software components can be used to support each of these
user groups.
Although we have considered the user groups, this document does not attempt to
define the software systems used to present data to the user except in the most
general manner. The purpose of Dataworks Enterprise news handling is to provide
data to the application, not to prescribe how that data is displayed. We know
that some clients will be generating HTML whilst others will have some form of
custom news display/viewer. The news architecture should be equally capable of
supporting both.
The news system consists of a hierarchy of news storage servers. The News
Servers in the hierarchy would run the same software, differing only in
configuration. Each server will store both stories and headlines, simplifying
the configuration.
Top-level server(s), which may be located centrally or at a customer site, will
keep long-term history as a reference. Top-level servers will only be able to
recover history from:
An upstream News Server is used to satisfy "replication" requests from
downstream servers and to back fill queries where the downstream device has
insufficient headlines to satisfy the query. The downstream News Server would
be used to offload request traffic from the upstream ones. Downstream devices
only cache headlines that are more recent. Each level would have a maximum
cache capacity (number of headlines/size of cache) for reasons of reliability.
At the lowest level, News Servers satisfy requests from clients (Web Servers and
Interest-based clients). For interest-based clients, it is the responsibility
of the client to match incoming new headlines with the results of a query. To
facilitate this, all headlines have a timestamp as part of the news creation
process.
At start up, the top-level servers will restore headlines from disk and recover
connections to feeds. When a downstream server connects to an upstream one, it
clears its cache and requests history from the upstream device up to the
capacity of its local cache. Lower-level servers keep smaller caches but have
more hits; higher-level servers have more cache and fewer hits. Top-level news
servers may be organised in a "figure of eight" configuration for recovery
purposes.
Headline messages from sources are actually commands that indicate whether to
add or delete a headline, or indicate corrections, new takes, expiry, etc.
Expiry is ignored by the top-level server and never propagated from the source
(although it may send its own expiry times to downstream devices). Drops are
applied locally and passed on.
On each new/corrected headline message, each news server will check the cached
headlines allowing it to work out whether the story is to be added (e.g. it is
in capacity range). If it is a headline to be added locally, it does so,
possibly dropping an older headline. If the news server is a top-level server,
it stores the headline on disk and requests the underlying News Story, which is
also written to disk. All news servers broadcast each new headline message to
all their clients to allow them to stay in sync. This is similar to what News
Server currently does.
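The per-message handling described above can be sketched roughly as follows. This is an illustration only, not the News Server implementation: it assumes a headline is identified by a story ID and ordered by its story time, and the names Headline, HeadlineCache, subscribe, apply and story_ids are hypothetical. Writing to disk and requesting the underlying story (top-level servers only) are omitted.

```python
import bisect
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass(order=True)
class Headline:
    # Hypothetical record: only story_time takes part in ordering.
    story_time: float
    story_id: str = field(compare=False)
    text: str = field(compare=False)


class HeadlineCache:
    """Capacity-bounded headline cache for one server in the hierarchy."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._items: List[Headline] = []      # kept sorted, oldest first
        self._clients: List[Callable[[Headline], None]] = []

    def subscribe(self, callback: Callable[[Headline], None]) -> None:
        self._clients.append(callback)

    def apply(self, headline: Headline) -> bool:
        """Handle a new/corrected headline; return True if cached locally."""
        # A correction (or repeated message) replaces the earlier copy.
        self._items = [h for h in self._items if h.story_id != headline.story_id]
        # "In capacity range": there is room, or it is newer than the oldest entry.
        in_range = (len(self._items) < self.capacity
                    or headline.story_time > self._items[0].story_time)
        if in_range:
            bisect.insort(self._items, headline)
            if len(self._items) > self.capacity:
                self._items.pop(0)            # drop the oldest headline
        # Every message is passed on so that clients stay in sync.
        for notify in self._clients:
            notify(headline)
        return in_range

    def story_ids(self) -> List[str]:
        """Cached story IDs, most recent first."""
        return [h.story_id for h in reversed(self._items)]
```

In this sketch a downstream server would simply use a smaller capacity than its upstream server; the broadcast happens whether or not the headline is cached locally.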
All news servers support a common request for history. This is in the form of a
query (defaults to all headlines in a short period of time). The news server
streams down a fixed number of responses and then it terminates the request
with a next pointer (which can be used to get the next set of responses
provided this request is issued prior to releasing the previous request). Note
that this differs from the current system, which returns responses as an
array; streaming allows the permissions system to operate on responses to
queries in the same way as headline broadcasts. However, this does impose the
use of a helper object for Web development (see below).
When a news server receives a request, it attempts to satisfy the request from
the local repository until the amount of data locally cached from an upstream
news server is reached. The News server then combines data from an upstream
request (eliminating duplicates) with local data.
News clients use the News Server query language to retrieve a set of headlines
to display. The client then applies the search criteria to incoming headlines,
filtering out those it does not need. Some specific matches would not be
possible, e.g. full text search on a story.
The basic elements of the news system are:
A standard ingest mechanism that can be employed for any current news delivery
system, albeit that some richer content may not be available. This would be
based on the existing ingest mechanism, so that we do not invalidate the work we
have already put in.
Some data normalisation is required. This, typically, revolves around the story
format, the category coding and the use of company identifiers.
A key element of the system is the story repository. This should be capable of
handling very large numbers of documents (possibly millions). The only
limitation to the number of stories held should be the amount of physical
storage space and the ability of the system to utilize that space.
Specifically, the system should have the ability to cache up to at least 390
days of history.
An indexing system is required that is capable of indexing and cataloging a
large online repository of news information. The indexes should be suitable for
the types of query we wish to perform. Indexes should be created on:
A query system needs to be provided that is capable of using the index system
above to search and retrieve history of news headlines.
Searches should automatically filter out what is not required or not
permissioned. The search results would be cursored in sets, allowing the
downstream system to request the next and previous set. Each matching result
would include:
We would expect the results to be ordered either by relevance or in reverse order
of date/time. The search results would contain a reference to allow the
downstream system or user to download (preferably streaming style) the body of
the item.
Search queries should be able to specify an ID from which the search should take
place to allow back fills and to ensure that no data is missed.
Search results should be delivered as an array as at present to allow Web
services to continue to issue requests in blocking mode. However, this will
have a significant impact on the complexity of the permission system, since it
will have to be able to handle the request result arrays. In addition, an
option should be placed in the permission system to allow the user to "hint"
permissions to the News Server, to improve downstream traffic usage.
The news system would employ a consolidated update stream for new headlines
applied to the index.
We need to design for appropriate (and available) delivery systems. Therefore,
it must be possible to broadcast as well as query point-to-point for the news
and analytical data. This implies some local storage and caching built into the
system and the ability to store news on the customer site.
There are three basic types of request for news: headline requests, search
requests and story requests.
A headline request is the request through which a handler or upstream News
Server delivers the consolidated broadcast stream of headlines. The request has
the instrument name HEADLINE. The headlines are delivered from the moment the
request is issued. No history is provided on this request. Optional data is not
required.
All search requests begin with "?". The default request is simply "?", which
means, "give me the most recent set of headlines". More sophisticated requests
can be issued by adding additional information to the search request. The
syntax of search requests is that provided by the search engine.
A recovery request is the request through which a handler or upstream News
Server delivers the complete content of history page by page. This request is
configurable in the News Server for upstream requests for history. An upstream
News Server employs the request HISTORY for this purpose. The downstream device
stops iterating the chain when sufficient history to populate its local cache
has been retrieved (or when upstream history is exhausted). A downstream News
Server issues this request whenever the upstream device comes online in order
to recover from failure.
Anything that is not a headline or search request is treated as a story request.
Optional data is not required.
Headlines are delivered to a News Server as records containing at least the
following fields:
NAME            May be any of the following values:
NEWS_STORY_SRC  The name of the source providing the story
NEWS_STORY      The name of the item to request for the story
STORY_TIME      The datetime the story was created
STORY_TYPE      May be any of the following values:
ATTRIBTN        The name of the wire providing the data
BCAST_TEXT      The text of the headline
CO_IDS          The names of companies associated with the headline
PROD_CODE       A list of products to which the headline relates
TOPIC_CODE      Topic codes associated with the headline
PROC_TIME       The datetime at which the story is processed (defaults
                to STORY_TIME)
NEXT_LR         When NAME=NewsNextLR this contains the next logical
                record in the sequence of headline query results
RTPERMISSIONS   The Permissions field for the headline
TAKE_TIME       The datetime of this particular take (defaults to
                STORY_TIME)
TAKE_SEQNO      The sequence number of the take
The action performed by downstream modules depends on the content of the NAME
field. Downstream devices should handle messages in such a way that
duplicate transmission of the same headline message is handled correctly.
News stories are delivered as:
TABTEXT         The format of the text in this segment:
SEG_TEXT        The story text. This may be separated into a sequence
                of text bodies linked with the NEXT_LR field.
RTPERMISSIONS   The Permissions field for the story
ATTRIBTN        The name of the wire providing this story
NEXT_LR         The name of the record containing the next part of the
story or an empty field (may be an empty string, a VT_NULL or a VT_EMPTY).
The TABTEXT field is used to identify the formatting of the text segment. News
Server handles its stories as HTML Text. If any of the other formats is used,
the News Server converts the incoming story to HTML. The conversions are, by
necessity, very simple, using the following rules:
- Preformatted ASCII: Place <pre> at the beginning of the text and </pre> at
  the end. This formatting is usually used to stop word wrapping on textual
  tables.
- ParaPerLine ASCII: Convert each new line character in the file to <P>.
- PlainText ASCII: Convert each sequence of 2 new line characters into <P>.
- HTMLText: Pass the story through unchanged.
- Preformatted RTF: RTF Text that contains tabular data and is marked up with a
  paragraph break per line. This is converted to text and marked up as
  Preformatted ASCII.
- Plain RTF: This is converted to text, with paragraph boundaries converted to
  newlines, creating wrappable text.
Query results are delivered in the same format as headlines, but each headline
is streamed to the client. On the final headline of the page, a NewsNextLR
message is sent with the NEXT_LR field filled in with a next logical record.
This simplifies the actions of the permission system.
There are currently four modules responsible for basic ingestion:
PGE Handler: This module is responsible for importing all news data from PGE,
including AFX, Dow Jones and ICV newswires. During start-up, this feed sends up
to 36 hours of news for recovery purposes.
PRNews: This cut-down NewsML reader is used to import the PR-NewsWire. During
start-up, this feed sends all current news stories for recovery purposes.
DPA Online: This is used to ingest NITF format news from DPA. During start-up,
this feed sends all current news stories for recovery purposes.
Each of the ingest modules currently makes the news look like the standard News
2000 service. Where available, a recovery of old headlines is supplied.
Some additional work will be necessary to separate the recovery from the
standard headline search, so that news servers can request news data separate
from recovery records. This recovery request will be the only news related
request (apart from HEADLINE and stories) that an ingest module will have to
recognise. In the future, we will need to create new ingest handlers.
The main bulk of the News Server ingest will remain the same with the addition
of code used in the DeutchesBank category mapping system. However, the
following additions will be made:
This work is expected to take no more than ten days to complete.
The main changes to the News Server will occur in the request module. The
following features will be supported:
This work is expected to take no more than ten days to complete.
The main bulk of the content index and repository will remain the same,
including the IFilter system. In the future, we may want to upgrade to a
News-specific XML parser. The following changes will be made:
The new repository will support a much more resilient architecture based around
RAID systems or equivalent. This work is expected to take no more than ten days
to complete.
In order to test the news server, we shall have to:
Unit/Module Testing should be complete in ten days.
In order to assist with the upgrade of Web applications, the News Server will
have an accompanying object known as the News Real Helper Object. This is an
object that provides the traditional array interfaces to the news for ASP
clients without having to handle the streaming headlines that are generated by
a query.
The News Real Helper Object provides methods for requesting a news query,
identifying whether a Next page exists and issuing the request for the next
page. The object simulates blocking mode requests (used in ASP pages) and
should be (mostly) plug compatible with existing Web pages using the News.
The following issues have arisen from the foregoing specification.
Since times are applied to News stories by vendors, there is a potential for one
News Vendor to constantly appear to be delivering news in a more timely fashion
than another. This causes the perception by end users that one vendor is
"better" than another in terms of timeliness of reporting and delivery of
stories. In extreme cases, this also means that users sorting stories by time
(the norm) may always only ever see one vendor at the top of the list of
headlines. These discrepancies also cause considerable problems for the
ingestion and consolidation of News. Newsvendors can backfill news stories,
i.e. it is possible to receive old news headlines after subsequent ones.
Consequently, the News Server cannot update the story dates and times on
reception.
Timestamps are also essential for allowing clients to specify the boundary
conditions of searches to create the effect of cursoring. Clients must have the
functionality to filter input for matches to the search criteria and the
ability to filter out duplicate messages.
As far as time formats are concerned, the value of the time fields should always
be in the UTC time zone. This allows the indexing system to have a consistent
value against which to index the stories. The preferred display value for News
stories is currently DD MMM YY; however, it might be more useful to provide the
ISO standard format (YYYY-MM-DDTHH:MM:SS+/-HHMM), maintaining information on
the time zone of publication.
Due to the mechanisms used by vendors to construct news wires, it is frequently
the case that the same story can exist on multiple wires. This is a nuisance to
a user who subscribes to more than one of these wires. However, such duplicates cannot be
removed at the server as then downstream subscribers who only have access to
one newswire may lose the story completely. Consequently, the duplicates have
to be removed at the point of query or receipt of query. If the duplicates are
removed during query at the server, then replication will no longer work
correctly. Consequently, clients will need to remove the duplicates.
There are a number of problems associated with contractual issues.
Firstly, some vendors have contracts that prohibit the storage of news stories
beyond a given duration. Typically, these vendors support historical news data
services of their own. Consequently, expiry has to be limited by the News feed
(through the delivery of Expiry messages) to this duration.
In addition, consumers of News should not be given access to any content which
exists prior to their contract dates. Similarly, our customers specify news
expiry periods (e.g. 90 days of News). They should not be given access to news
outside this period. This is probably a permission issue.
Some vendors issue drops and reissue the story. If the News Server ignores the
drop messages, then duplicate stories will exist in the news system. On the
other hand, if the News Server handles drop messages, then the News archive can
become no bigger than that which is provided by the vendor. Consequently, we
would experience a loss in the news history that could be quite catastrophic.
For instance, News2000 issues drops and expiries allowing News to exist for no
more than 24 hours, which is not adequate for most of our users. Consequently,
News Server handles drops but not expiries.
When a client requests news history, they must issue a request for the headline
stream prior to issuing the request for the history. Incoming headlines should
be matched against the search criteria even before any history results have
been received. This removes the possibility that headlines that are in transit
during the request for history are missed.
Equally, when the News Server stores a headline in its cache there may be a
delay between the time at which the News Story is saved and the time at which
it is indexed. Consequently, the headline should not be delivered to
downstream devices until the indexing process is complete.
The current News system consists of two types of component: the news ingest
components and the News Real Server.
News data is ingested into Dataworks Enterprise through some module that is
either:
In either case, the module simply takes the incoming data and reformats it into
something that is understood by Dataworks Enterprise according to very simple
rules.
The News Real Server is a PDP Source that ingests news headlines and optionally
stories and stores them for use by other PDP clients. It has a variety of
configurations depending on the requirements of the customer.
News Real Servers can either talk directly to ingestion modules or take a
consolidated feed from other News Real Servers. News Real Servers can either
store and index headlines internally (in memory) or store the stories on file
and employ MS Index Server (sometimes known as Content Indexer or CI) as an
indexer. The News Real source (usually called News) provides a stream of
updates to the HEADLINE instrument as news headlines are received (actually
when the accompanying story is ingested). It also provides syntax for querying
the database. The syntax used depends on whether Index Server is in use or not.
Results are delivered using an array format. News Real Servers internally can
use any Index Server query system (i.e. version 1, 2 or 3).
There are three basic configurations:
No Story Repository: This configuration is used to provide an updating cache of
News Headlines with no underlying News Story repository. This is the most basic
News Server configuration and is used where the underlying News headline
sources do not cache the headline data. In this configuration, Cache Stories is
switched off.
Repository with Internal Indexing: This configuration is used to provide both
an updating cache of News Headlines
and a News Story repository. The headline index is stored in the memory of the
News Server and supports a very basic query language based on category codes
and company names. The repository is a collection of HTML files marked up using
HTML comments with additional data. This would be used where a repository is
required, but not the full indexing capability of a Content Indexer. Note: it
is possible to install this configuration with the News source not enabled
allowing the News Server to act simply as a mechanism to store incoming news
stories.
Repository and MS Index Server: This configuration is used to provide both an
updating cache of News Headlines
and a News Story repository. The headline index is stored using MS Index Server
which supports a very sophisticated query language based on the full text and
other properties of the News stories including category codes and company
names. The repository is a collection of HTML files marked up using HTML
comments with additional data. This would be used where a repository is
required with the full indexing capability of the Content Indexer. Note, it is
possible to install this configuration with the News source not enabled
allowing the News Real Server to act simply as a mechanism to store and index
incoming news stories. This configuration requires considerable additional
installation and configuration of the Content Indexer.
In most configurations, News Real uses Index Server as the search engine.
Index Server is used because it is:
Index Server V3 is much more stable than V2 and easier to install. All Windows
2000 machines have it installed by default. Index Server 3 will only run on
Windows 2000 hosts.
The chief problems with Index Server result from:
News Delivery Architecture
News Content
What is News?
How is News Created?
How is News Delivered?
News Data Formats
News vs. Research
User Groups
3. News System
News Architecture
News Sub-Systems
Ingest
Normalisation
Story Repository
Indexing
Search and Retrieve
Streaming Updates
Delivery Architecture and Replication
News Requests
Headline Requests
Search Requests
Recovery Requests
Story Requests
News Data Formats
Headlines
NewsFirstTake = 2
NewsSubsequentTake = 3
NewsCorrection = 4
NewsCorrected = 5
NewsDrop = 7
NewsExpired = 8
NewsNextLR = 9
R = TEMPORARY_STORY
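The NAME values listed above, together with the requirement that duplicate transmission of the same headline message be handled correctly, might be modelled as in the following sketch. It is illustrative only: it includes just the values that appear in the list, and it assumes that a message can be identified by the combination of NEWS_STORY, NAME and TAKE_SEQNO, which the specification does not state. HeadlineCommand and DuplicateFilter are hypothetical names.

```python
from enum import IntEnum
from typing import Any, Dict, Set, Tuple


class HeadlineCommand(IntEnum):
    """Values of the NAME field, exactly as listed in the specification."""
    NewsFirstTake = 2
    NewsSubsequentTake = 3
    NewsCorrection = 4
    NewsCorrected = 5
    NewsDrop = 7
    NewsExpired = 8
    NewsNextLR = 9


class DuplicateFilter:
    """Lets each distinct headline message through once only."""

    def __init__(self) -> None:
        self._seen: Set[Tuple[str, int, int]] = set()

    def accept(self, record: Dict[str, Any]) -> bool:
        # Raises ValueError for a NAME value that is not in the list above.
        command = HeadlineCommand(int(record["NAME"]))
        # Assumed identity of a message: story item, command and take number.
        key = (str(record["NEWS_STORY"]), int(command),
               int(record.get("TAKE_SEQNO", 0)))
        if key in self._seen:
            return False                      # repeated transmission, ignore
        self._seen.add(key)
        return True
```

A downstream device could pass every incoming record through accept() before acting on its NAME value, so that a retransmitted message has no further effect.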
Stories
X - ParaPerLine ASCII
R - PlainText ASCII
H - HTMLText
F - Preformatted RTF
P - Plain RTF
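The simple story conversions described for the TABTEXT field can be sketched as follows. This is a minimal illustration, not the News Server code: the function name to_html and the TABTEXT_CODES table are hypothetical, the two RTF formats are left out because they first need an RTF-to-text step that the specification does not detail, and no code is shown for Preformatted ASCII because none appears in the list above.

```python
import re

# TABTEXT codes as listed above, mapped to format names.
TABTEXT_CODES = {
    "X": "ParaPerLine ASCII",
    "R": "PlainText ASCII",
    "H": "HTMLText",
    "F": "Preformatted RTF",
    "P": "Plain RTF",
}


def to_html(text: str, fmt: str) -> str:
    """Convert a story segment to HTML using the simple rules in the spec."""
    if fmt == "HTMLText":
        return text                               # passed through unchanged
    if fmt == "Preformatted ASCII":
        return "<pre>" + text + "</pre>"          # stops word wrapping on tables
    if fmt == "ParaPerLine ASCII":
        return text.replace("\n", "<P>")          # every new line is a paragraph
    if fmt == "PlainText ASCII":
        return re.sub(r"\n{2}", "<P>", text)      # each pair of new lines
    if fmt in ("Preformatted RTF", "Plain RTF"):
        raise NotImplementedError("RTF must be converted to text first")
    raise ValueError("unknown TABTEXT format: " + fmt)
```

For example, to_html("a\nb", TABTEXT_CODES["X"]) yields "a<P>b". The rules as written do not mention escaping characters such as "<" or "&" in ASCII stories, so the sketch does not escape them either.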
Query Results
4. News Implementation
Ingestion Modules
PGE Handler
PRNews
DPA Online
News Server Ingest
News Server Request Module
Content Indexer
Module/Unit Testing
News Helper Object
5. Issues Arising from the Specification
Timestamp Synchronisation
Time Formats
Duplicate Stories
Contractual Issues and Permissioning
News Drops and Expiries
Missing Updates from Stream
A: News Real Server
Ingestion
News Real Server
No Story Repository
News Real Server - Repository with Internal Indexing
News Real Server - Repository and MS Index Server
Index Server