What is Resolution?

Increasingly, we have a requirement to support some sort of symbol mapping or cross-reference service. One purpose of Dataworks Enterprise was to provide "a single source of data within Thomson (Thomson Financial)". This means that there should be:

  • A single API through which data can be retrieved

  • A single set of data which can be retrieved through that API

    Consequently, symbology mapping is a basic requirement of our system. Currently, we use XREF to provide naming information, but XREF only handles those symbols that are available within the GT system. It has a kludge to deal with Piranha based sources. However, it has no knowledge of TF symbologies such as those in use in First Call, ILX, etc. Equally, our customers also have their own symbologies and there are those symbologies that are used by other third parties including our competitors. It would be nice if we could equally support these too. We need a mechanism to integrate all this.

    There are three basic services of the systems under review:

    Mapping Service

    This is most like a "super XREF" with ingest from multiple modules and would simply be a means of mapping a symbol in one system to a symbol in another.

    Navigational Service

    This would be code that used "rules" to map one entity to another. Systems use either database services or, even, simple name mangling to provide this sort of service.

    Screening Service

    This is a service that provides the ability to analyse large collections of data for entities matching user-specific sets of criteria. Within Dataworks Enterprise this may be best achieved by providing a number of very flexible database population tools and the ADO handler.

    This document is specifically concerned with the mapping service, not with navigation or screening.

    Basic Requirements

    The name mapping system at its most basic level is a service that accepts a request for a source symbol, a source symbology and a destination symbology and returns the appropriate symbol in that symbology. However, in as much as we can, this whole process should be hidden from the user.

    This whole process should be implemented at the client side of the system. This allows us to:

  • Permission the use of the service or parts of the service

  • Provide localised services

    There are two key points about cross-referencing in general. Firstly, there can never be a complete solution to this problem. Any service that could be provided must be "sold" on the basis that it is a best effort, rather than being 100% accurate. The data issues associated with cross-reference data are complex and largely out of our control.

    Secondly, we have to take into account that there is every likelihood that we will need to enhance and extend the service we provide, possibly on a "per-customer basis". As stated above, most institutions have their own internal naming systems that they use to map names for entities. These are typically stored on databases and usually are part of some larger systems that are VERY private. To be able to integrate with these systems, we cannot think in terms of a single centralized naming service.

    It would be useful if some limited access to the service were provided to downstream modules. For instance, the service should be able to accept a request using a source symbology and symbol and return an entire record of potential mappings available.

    The service should be extensible so that we can bring new symbols online whenever we need to. This should not require bringing down the system to add new symbologies. The service should be able to ingest from any source or be able to be extended to do so.

    The service should be able to advertise the various symbologies it supports.

    The implementation of the service should be backwardly compatible with existing usage of Dataworks Enterprise.

    We also need to take into account the technologies that we have available on a customer site. For instance, the requirement to support broadcast-only access to customer sites also imposes some constraints on the design of the system. Any system that we create for name mapping must be distributable within Dataworks Enterprise architecture.

    There are also business and contractual issues that would need to be resolved. If we are going to provide this service, it needs to be separated from the current contractual requirements to purchase the underlying source of the data. For instance, to use XREF, customers have to purchase PGE data. If we have a symbology mapping service, we would need our customers to be licenced for all the sources that are providing mapping information, which is clearly nonsense.

    However, it should be possible to permission mappings so that they can be sold as separate data entities within the existing Dataworks Enterprise permissioning system. If mapping is not available there should be an acceptable degradation in functionality.

    Current Solutions

    There are a number of existing systems within Thomson Financial that attempt to support some features of symbology mapping services from basic to very sophisticated.

    Henry

    This system is supplied by DataStream and is the basis of the symbology information provided by PGE. Henry is a database that provides details of the relationships between entities such as quotes, stocks, companies, exchanges, etc.

    Navigator

    This is a set of databases providing name lookup information for DataStream users. Its source of information is an Access Database uploaded once per week, plus a daily run of a mainframe program (4321B). Navigator is accessed via COM or Corba using DAF. The data ultimately resides on a SQL database (Sybase, I believe).

    PIO/TKO

    PIO/TKO is a concordance database offering similar services to Henry and is based in the States. PIO is intended to be the repository for this type of information in Thomson Financial. PIO tends to have Thomson Financial data and third-party data only. Like Henry and Navigator, PIO is also accessed using COM-Corba layers on DAF.

    XREF

    XREF is a Dataworks Enterprise module that ingests PGE symbology information (actually from Henry). It was originally written to cross-reference ISINs and PGE symbols although it is now used for a lot more than this. It is not, contrary to popular opinion, extensible. XREF provides queries through exposing a Dataworks Enterprise source. These queries typically result in a response that is either a set of names of matching records or a single record containing columns for the various names it supports. We are not proposing in this document to extend or replace XREF but rather to use the information it currently can provide in a new way.

    Note, we are discussing symbology cross-reference. There are a number of systems in place whose role is to provide concordance services (e.g. PIO, Henry, etc.). These systems attempt to provide sophisticated models of the financial marketplace and expose the symbology mapping as part of that model. These systems also provide some form of navigational system. For instance, these systems can satisfy queries like "Give me all the symbols for UK Chemical Stocks". We do not need anything so complex.

    It should be pointed out that the database and ingestion mechanisms for the various symbology systems do vary. Many such systems reside on large databases. The database may be populated manually or batch updated from other sources. The databases are typically not accessible directly; instead, some form of "API" is provided to gain access to the information they hold. This API may be a function interface in a high-level language, COM or Corba. Alternatively, it might actually be a file download. Some third parties distribute the data as a regular CD, others "over the wire". Some vendors have terminal services for doing lookup (screen scraping?). It would seem to make sense therefore that we use our normalisation services within Dataworks Enterprise to gain access to these systems and expose their functionality as a source. This makes our code more distributable.

    Glossary

    Namespace

    The scope in which a source defines the names for the entities it supports. There is a notion of consistency (multiple requests for the same name will produce data relating to the same entity). Each source has one and only one namespace. Sources can be said to define the namespace for the datasets they produce (this is the basic definition of a source). Source names are used to "route" requests. Instrument names are actually names within the source namespace.

    Symbology

    Symbology is a system in which a name uniquely identifies an entity irrespective of source. A valid symbol in one symbology may not be valid in another or may refer to an entirely different entity. Examples of symbologies include SEDOL, WKN, ISIN, Thomson Financial Number, a customer's private symbology, etc.

    Some symbologies are constructed from others. For instance, the PGE symbology is mainly constructed from TICKER names and symbol codes. ISINs in the US are constructed from CUSIPs.

    Sources may support one or more symbologies (they must have at least one). If a source supports multiple symbologies, it must provide some kind of mechanism within its namespace to identify the symbology, i.e. it partitions its namespace. For instance, DataStream uses U: to delimit DSMnemonic numbers in its namespace, guaranteeing the uniqueness rule for the name space.

    For a source to support a symbology it must support requests based on that set of symbols. Although PGE propagates ISINs, it cannot be said to support them since you cannot make requests from PGE by ISIN.

    Though a source may claim to either support or carry a given symbology, it is not necessarily a complete reference for that symbol set. Very few sources have a complete reference for a symbol set other than their native one. Sources must be a complete reference for their native set of symbols.

    Symbology is used to map the name of an entity in one namespace to the identical (or similar) entity in another. It may be similar in that the entity in the destination namespace may actually be the nearest match, e.g. a security in one system might map to a company in another because the second system only supports companies. We will call this process Mapping and the component whose role is mapping is the Mapping Service.

    Parties

    Suppliers of particular symbol sets. For instance, DataStream is a party; S&P is a party for CUSIPs, etc. Note that even though two parties may state that they support a given symbology, the symbols they hold for an entity can be, and often are, different. For instance, a DS view of the ISIN for a particular security may be different from First Call's notion.

    Sources may advertise that they support particular symbologies as a result of some form of ingest from a Party. For instance, Piranha imports symbologies from Worldscope.

    Constituency

    Whether a name in a namespace is a constituent of another entity, e.g. a quote is a constituent of an index or exchange, a security is a constituent of a company, a company may be a constituent of a sector, a country of a region and so on.

    Navigation

    The mechanism of moving between related entities within or across namespaces and symbologies. Navigation may employ Symbology, Constituency or any other technique. We will call the system supporting navigation the Navigation Service. This document does not explore issues surrounding Navigation.

    Screening

    A process of searching and sorting databases to produce lists of entities in a given symbology that match a given set of criteria (i.e. a posh name for a set of database queries). For instance, screening might involve the query "Give me a list of instruments that exist in a given region and sector and which have a PE less than 5 and a market cap of greater than $3 billion". We will call the system providing this facility the Screening Service. This document does not explore issues surrounding screening.

    Resolution and Dataworks Enterprise

    The name mapping service will be implemented externally to the cache in order to simplify development and deployment of the system and to enhance the extensibility of the system. However, the cache will provide some mechanisms that employ these services. These mechanisms will be implemented in the client-side of the cache downstream from the permissioning elements but largely hidden from the user.

    Mapping Sources

    A new type of source will be introduced into Dataworks Enterprise, which is a "mapping source". These sources will identify themselves using the source DataType "Mapping" (instead of "Record", "Page" and "System" currently in use). These sources will not be visible to the user directly.

    The purpose of these sources is to provide the basic name mapping services used by Dataworks Enterprise. Each mapping source is actually a "front-end" to an existing system, be it Hawk, PIO, Henry, XREF or some customer system. Source level permissions would enable us to permission the data.

    The mapping source would have the task of:

  • advertising the mappings it can (or might be able to) perform. As with standard sources described below, this would imply a change to the EndPointData structures in the cache. All mapping sources would support at least two symbologies.

  • taking a request (implemented as a standard record request) for a symbol within a given symbology (provided by the Symbology property) and returning a mapping record.

    The mapping response is a record containing a series of fields, one for each supported symbology that has a non-empty result. The name of each field is the name of a symbology and the value is the mapping. Where a value is empty, the mapping record does not contain the field.

    For instance, for the symbol GB0004594973 using symbology ISIN, a mapping source might return:

    Field Name    Field Value
    DSCODE        900455
    ISIN          GB0004594973
    TIDM          ICI
    PGESYMBOL     ICI.L
    SEDOL         0459497

    The mapping source would be responsible for applying some rules to its local query. For instance, where the request is for a source symbology that is granular to company and the destination is a symbology that is quote based, the mapping source would look for the domicile quote of the primary security of the company. In the example above, the ISIN number refers to a security, but the source operates at a quote level and so returns the domicile quote for the security given.
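    As a minimal sketch of the mapping response described above, assume a hypothetical in-memory table standing in for a real back end such as XREF. The names ROWS, advertised_symbologies and mapping_record are illustrative only and are not part of Dataworks Enterprise; the empty CUSIP column is added purely to show an empty value being left out of the record.

```python
from typing import Dict, List, Optional

# Hypothetical table standing in for a back-end system such as XREF.
# Each row holds every symbol known for one entity; None means "no value".
ROWS: List[Dict[str, Optional[str]]] = [
    {
        "DSCODE": "900455",
        "ISIN": "GB0004594973",
        "TIDM": "ICI",
        "PGESYMBOL": "ICI.L",
        "SEDOL": "0459497",
        "CUSIP": None,  # illustrative empty value, not from the document
    },
]


def advertised_symbologies() -> List[str]:
    """Symbologies this mapping source can (or might be able to) map."""
    return sorted({name for row in ROWS for name in row})


def mapping_record(symbology: str, symbol: str) -> Optional[Dict[str, str]]:
    """Return the mapping record for a symbol, or None if it is unknown.

    Fields whose value is empty are omitted from the record entirely.
    """
    for row in ROWS:
        if row.get(symbology) == symbol:
            return {name: value for name, value in row.items() if value}
    return None


record = mapping_record("ISIN", "GB0004594973")
```

    A caller would read the destination symbology as a field of the record (for example record["PGESYMBOL"]) and treat a missing field as "no mapping available".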

    In the first instance, we would add an additional Dataworks Enterprise Source to the existing XREF handler to provide a basic mapping function ("XREFMapping"). This would allow us to support some basic symbology queries over a broadcast link for PGE customers. In the future, I would imagine us providing other mapping sources for our internal databases, creating mapping sources for file downloads and possibly providing a prototypical mapping source as a basis of mapping in customer sites.

    Within the cache, symbologies would be converted to Local Names and stored in this form. Local names are per-process reference counted names. They are passed across the cache within EndPointData structures as strings.

    Changes to Existing Sources

    The main change to sources would be adding functionality to:

  • Advertise the symbologies they support. By default, a source only supports a single symbology (effectively, its namespace is a private symbology). Existing sources would continue to work by virtue of the fact that they ignore the Symbology property. However, mapping will not be applied to sources that do not advertise the symbologies they support.

  • Query the Symbology property of each incoming request and use it to update their query to the back-end system. For instance, Datastream handlers would prefix DSMnemonic symbology with "U:". Existing sources should never receive anything other than their native symbol set. If they do receive anything else, they will ignore it anyway.

    For instance, PGE might advertise a single symbology as it only supports a single mechanism of request. However, a source such as Datastream would be modified to accept the variety of symbology requests it supports. This would be achieved by using a new form of createSource() described below. This would allow the handler to advertise the symbologies that it supports.

    During a request, handlers would check the value of the RTItem's Symbology property. If this property were empty, the handler would continue to operate as before. This provides backward compatibility. When the symbology property is filled in, the DS handler would extract the text of the symbology, convert it to a data reference and then query that reference for the symbologies it supports in order of preference. The handler then converts the symbology to its namespace (e.g. for an incoming Swift currency code, the PGE handler might add the suffix "/") and passes the query on to its existing request processor. Note that the symbology of a request would not form part of the item discriminator in the cache.
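    The handler-side check might look something like the following sketch. It assumes a hypothetical per-handler rule table (DATASTREAM_RULES) and a hypothetical function name (to_native_name); only the "U:" prefix for DSMnemonic comes from this document. Raising an error for an unadvertised symbology is also an assumption, since Resolver should never send one.

```python
from typing import Callable, Dict

# Hypothetical rule table for a Datastream-style handler: each advertised
# symbology maps to a function that rewrites the incoming symbol into a
# name that is valid in the handler's own namespace.
DATASTREAM_RULES: Dict[str, Callable[[str], str]] = {
    "DSMnemonic": lambda symbol: "U:" + symbol,  # prefix taken from the text
}


def to_native_name(symbol: str, symbology: str = "") -> str:
    """Convert a requested symbol into the handler's namespace.

    An empty symbology means the requester already used a native name,
    so the symbol is passed through unchanged (backward compatibility).
    """
    if not symbology:
        return symbol
    rule = DATASTREAM_RULES.get(symbology)
    if rule is None:
        # Assumption: Resolver only sends symbologies the source advertised,
        # so anything else is treated as an error in this sketch.
        raise ValueError("symbology not advertised: " + symbology)
    return rule(symbol)


legacy_name = to_native_name("ICI")                # no symbology supplied
mapped_name = to_native_name("ICI", "DSMnemonic")  # becomes "U:ICI"
```

    The converted name is what would then be handed to the handler's existing request processor.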

    Some handlers, like those based on Piranha, may optionally choose to extract multiple symbologies from the data reference and construct a query based on those. For instance, Piranha sources work best if they are provided with ISIN, SEDOL and CUSIP numbers.

    The amount of changes required of a particular feed handler will depend on the number of alternative symbologies it supports.

    Changes to the Cache

    Changes to the RTSource Object

    Each source in the client source list will have a new property (collection?) that represents the symbologies that the source supports. This will be set by the source at mount time. This implies a new version of the CreateSource methods in both the public and private interfaces of the cache. It also implies that these properties can be shared between the server and client side through a change to the EndpointData structure. Resolver uses this information to resolve a query for data. NOTE: Changes to these structures will have an impact on Archival (see below). We may choose to expose this through an additional method in the RTSource interface or possibly by exposing an additional interface on the RTSource object (this will affect both client and server sides of the system). The data is in the form of a collection of simple strings providing "standard" names for symbologies, e.g. ["ISIN", "CUSIP", "PGESymbol", "ILXSymbol"].

    Resolver

    The key new module in the cache is the Resolver. In the first instance, this will be completely hidden from the user. Resolver is located downstream of the permissioning module (so that the Resolution system can be permissioned) and will form part of the request chain. By default, Resolver will have no impact on the existing request/response mechanisms in the cache.

    Records (and Items) will have a new Symbology property (a simple string that might be a single symbology or a collection of symbology mappings, depending on the state of the request). The following processing is performed:

    1) When a bind is issued for any record, the Symbology property may be populated with the name of the symbology used for the request. By default, this value is empty. This allows legacy code to continue to operate as before. If the Symbology property is empty, the requester is implying that they are using the native symbology of the source, i.e. a name that is valid in the namespace of the source. In this case, Resolver is unused and the request is propagated as usual proceeding to step 6.

    2) If the symbology property has a non-zero value (non-empty string), Resolver will take control of the request. It looks up the destination source in the RTSources collection. If the source is not present, then the request operates as before proceeding to step 6.

    3) If the source is present, Resolver checks the symbologies supported by the source. If the symbology is supported, Resolver packs up the symbology with the request and sends it to the source (implies a change to the RequestInfo archive). The symbology is passed with the request as a single RTDataRef element and we proceed to step 6.

    4) If the symbology is not supported, Resolver checks the "Mapping" sources available. As stated above, "Mapping" sources advertise the mappings they can perform. Resolver looks for a single hop mapping. If a single hop mapping is found, Resolver issues a request to the mapping source for the mapping data described above. On receipt of the data or status, Resolver checks the appropriate field (which may not be present). If the field is missing, Resolver looks for a different mapping and issues a new request.

    Once a mapping has completed, the request is updated by:

  • Converting the mapping data into a RTDataRef

  • Extracting the Text representation of the mapping and setting the symbology property.

  • Reissuing the request to the original source.

    In this event we proceed to step 6.

    5) If Resolver cannot find a single hop mapping, it resorts to a two-hop mapping. Resolver never attempts more than two hops due to the potential for circular references. This involves looking for a pair of mapping sources that support the source symbology, the destination symbology and a common third symbology. Resolver makes a unique collection of these mapping pairs. For each pair in the collection, it issues a request for the mapping from the mapping source that has the source symbology and the common symbology. It checks the common symbology and, if it is empty, moves to the next pair in its unique list. If the common value is populated, it issues a request using this value against the second mapping source. If destination symbology values are returned, the Resolver constructs a RTDataRef containing the union of all mappings provided. In this manner the Resolver searches for mappings between quite disparate Dataworks Enterprise sources.

    6) The request is issued to the source.
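    The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the cache implementation: MappingSource, first_supported and resolve are hypothetical names, a source that is absent from the RTSources collection is modelled by passing None, the first usable mapping is returned rather than a union of all mappings, and returning None when no mapping exists is our own choice. The "ACME" symbology and its symbol are invented customer data.

```python
from typing import Dict, List, Optional, Tuple

Resolved = Optional[Tuple[str, str]]  # (symbol, symbology) sent to the source


class MappingSource:
    """Hypothetical stand-in for a source whose DataType is "Mapping"."""

    def __init__(self, name: str, rows: List[Dict[str, str]]) -> None:
        self.name = name
        self.rows = rows
        # The symbologies this source advertises at mount time.
        self.symbologies = {key for row in rows for key in row}

    def lookup(self, symbology: str, symbol: str) -> Dict[str, str]:
        """Return the mapping record, or an empty record if nothing matches."""
        for row in self.rows:
            if row.get(symbology) == symbol:
                return {key: value for key, value in row.items() if value}
        return {}


def first_supported(record: Dict[str, str], supported: List[str]) -> Resolved:
    """Pick the first field of the record that the destination source supports."""
    for target in supported:
        if target in record:
            return record[target], target
    return None


def resolve(symbol: str, symbology: str, supported: Optional[List[str]],
            mapping_sources: List[MappingSource]) -> Resolved:
    # Steps 1 and 2: no symbology supplied, or the source is unknown to the
    # cache (modelled as supported=None); the request passes through as is.
    if not symbology or supported is None:
        return symbol, symbology
    # Step 3: the source supports the requested symbology directly.
    if symbology in supported:
        return symbol, symbology
    candidates = [m for m in mapping_sources if symbology in m.symbologies]
    # Step 4: single hop; a missing field means "try the next mapping source".
    for source in candidates:
        hit = first_supported(source.lookup(symbology, symbol), supported)
        if hit:
            return hit
    # Step 5: two hops through a common third symbology, never more.
    for first in candidates:
        record = first.lookup(symbology, symbol)
        for second in mapping_sources:
            if second is first or not second.symbologies & set(supported):
                continue
            for common in sorted(first.symbologies & second.symbologies):
                if common == symbology or common not in record:
                    continue
                hit = first_supported(second.lookup(common, record[common]), supported)
                if hit:
                    return hit
    return None  # assumption: the caller decides how to degrade


xref = MappingSource("XREFMapping", [
    {"ISIN": "GB0004594973", "SEDOL": "0459497", "PGESYMBOL": "ICI.L"}])
customer = MappingSource("CustomerMapping", [
    {"ACME": "ACME-42", "SEDOL": "0459497"}])  # invented customer symbology
sources = [customer, xref]

native = resolve("ICI.L", "", ["PGESYMBOL"], sources)
one_hop = resolve("0459497", "SEDOL", ["PGESYMBOL"], sources)
two_hop = resolve("ACME-42", "ACME", ["PGESYMBOL"], sources)
```

    In this sketch the customer code reaches the PGE symbol only by way of the SEDOL held in common by the two mapping sources, which is the two-hop case in step 5.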

    Changes to the Archive Structure

    As part of this development there will be a change to the archive structures within the cache. This will be done in such a way that the archive will be backwardly compatible for downstream clients. The new Archive structure will introduce a length prefix for the structures in the cache (Record, Status, RequestInfo, EndpointData, Field) to allow us to extend these structures in future versions of the system without additional changes to the basic structure of the Archive itself. Length prefixes would have variable length to reduce the storage overhead, which could be large in the case of fields.

    When packing an Archive, the module will prefix its data with its length. This might involve caching the lengths of the structures for performance reasons.

    When unpacking an Archive the major structures will test the Archive version. If the version is older than that which provides prefix lengths it will read the Archive as it does at present. Otherwise, it will note the current archive position and read the length. It will then read its structure from the Archive. Once the read is complete, the code will reposition the Archive according to the length. However, this means that clients prior to the current revision of the Archive will not be able to read Archives from the first version that supports these prefixes.
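    A minimal sketch of the length-prefix idea follows. The document only says that prefixes have variable length, so the 7-bits-per-byte encoding used here is an assumption, as are the function names and the has_prefix flag standing in for the Archive version test.

```python
import io


def write_varint(out: io.BytesIO, value: int) -> None:
    """Write a non-negative integer using 7 data bits per byte (assumed format)."""
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.write(bytes([byte | 0x80]))  # high bit set: more bytes follow
        else:
            out.write(bytes([byte]))
            return


def read_varint(inp: io.BytesIO) -> int:
    """Read an integer written by write_varint."""
    shift = 0
    result = 0
    while True:
        chunk = inp.read(1)
        if not chunk:
            raise EOFError("truncated length prefix")
        result |= (chunk[0] & 0x7F) << shift
        if not chunk[0] & 0x80:
            return result
        shift += 7


def pack_structure(out: io.BytesIO, payload: bytes) -> None:
    """Prefix a structure's bytes with their length."""
    write_varint(out, len(payload))
    out.write(payload)


def unpack_structure(inp: io.BytesIO, known_size: int, has_prefix: bool = True) -> bytes:
    """Read the part of a structure this reader understands.

    Without a prefix (older archive version) the structure is read as before.
    With a prefix, the reader notes its position, reads what it knows and
    then repositions past any bytes added by a newer writer.
    """
    if not has_prefix:
        return inp.read(known_size)
    length = read_varint(inp)
    start = inp.tell()
    known = inp.read(min(known_size, length))
    inp.seek(start + length)
    return known


archive = io.BytesIO()
pack_structure(archive, b"ABCDEF")  # newer writer: two extra trailing bytes
pack_structure(archive, b"XY")
archive.seek(0)
first = unpack_structure(archive, known_size=4)   # reader only knows 4 bytes
second = unpack_structure(archive, known_size=2)
```

    Because the reader repositions by the stored length rather than by what it consumed, the second structure is still read correctly even though the first one carried bytes the reader did not understand.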

    Unresolved Issues

    Many of the mapping sources have incomplete data.

    Many of the mapping sources have incorrect data.

    For a particular symbology, values are dependent on parties. Consequently, what DS thinks is the ISIN may not be the same as what First Call thinks.

    The databases update at different times, generating inconsistencies.

    There is no mechanism in this model for generating updates to symbology.