White Paper: The wide scale deployment of Active Data Mining Solutions
Data Mining is a process
There is increased interest in a process or methodology for data mining.
It is argued that such a formalised process will widen the exploitation
of data mining as an enabling technology for solving business problems.
It will allow people with varying expertise in data mining and from
different business sectors to carry out successful data mining projects
with a high degree of consistency.
There are a number of initiatives for the development of a formal/documented
data mining process both in Europe and North America. It is reassuring
to the data mining community that the processes emerging from all
of these initiatives reveal a large degree of similarity. There
is widespread agreement on the main steps (stages) involved in such
a process and any differences relate only to the detailed tasks
within each stage. A summary of the major stages of a data mining
process is:
- Goal definition: This involves
defining the goal or objective for the data mining project. This
should be a business goal or objective which normally relates
to a business event such as arrears in mortgage repayment, customer
attrition (churn), energy consumption in a process, etc. This
stage also involves the design of how the discovered patterns
would be utilised as part of the overall business solution.
- Data selection: This is
the process of identifying the data needed for the data mining
project and the sources of this data.
- Data preparation: This involves
cleansing the data, joining/merging data sources and the derivation
of new columns (fields) in the data through aggregation, calculations
or text manipulation of existing data fields. The end result is
normally a flat table ready for the application of the data mining
itself (i.e. the discovery algorithms to generate patterns). Such
a table is normally split into two data sets; one set for pattern
discovery and one set for pattern verification.
- Data exploration: This involves
the exploration of the prepared data to get a better feel prior
to pattern discovery and also to validate the results of the data
preparation. Typically, this involves examining the statistics
(minimum, maximum, average, etc.) and the frequency distribution
of individual data fields. It also involves field versus field
graphs to understand the dependency between fields.
- Pattern Discovery: This
is the stage of applying the pattern discovery algorithm to generate
patterns. The process of pattern discovery is most effective when
applied as an exploration process assisted by the discovery algorithm.
This allows business users to interact with and to impart their
business knowledge to the discovery process. In the case of inducing
a tree, users can at any point in the tree construction examine
/ explore the data filtering to that path, examine the recommendation
of the algorithm regarding the next data field to use for the
next branch then use their business judgement to decide on the
data field for branching. The pattern discovery stage also involves
analysing the ability of the discovered patterns to predict the
propensity of the business event, and for verification against
an independent data set.
- Pattern deployment: This
stage involves the application of the discovered patterns to solve
the business goal of the data mining project. This can take many
forms:
- Patterns presentation: The description of the patterns
(or the graphical tree display) and their associated data
statistics are included in a document or presentation. This
requires the data mining tool to generate text reports and
WMF (Windows Meta File) representations of the graphical decision
tree.
- Business intelligence: The discovered patterns are
used as queries against a data base to derive business intelligence
reports. This requires the data mining tool to generate SQL
representations of the decision tree.
- Data Scoring & Labelling: The discovered patterns
are used to score and/or label each data record in the database
with the propensity and the label of the pattern it belongs
to. This can be done directly by the data mining tool or through
generation of SQL or C representation of the decision tree
- Decision Support Systems: The discovered patterns
are used to make components of a decision support system.
This can be achieved by embedding the data mining tool as
a decision making component, or a C module generated by the
data mining tool.
- Alarm monitoring: The discovered patterns are used
as 'norms' for a business process. Monitoring these patterns
will enable deviations from normal conditions to be detected
at the earliest possible time. This can be achieved by embedding
the data mining tool as a monitoring component, or through
using SQL generated by the data mining tool.
- Pattern Validity monitoring:
As a business process changes over time, the validity of patterns
discovered from historic data will deteriorate. It is therefore
important to detect these changes at the earliest possible time
by monitoring patterns with new data. Significant changes to the
patterns will point to the need to discover new patterns from
more recent data.
The wide scale deployment of data mining solutions
A repeatable data mining process will help ensure the success of a
data mining project. However, a successful data mining project also
needs developers with the following skills:
- Deep knowledge of the data and its history.
- Insight into the specific business area.
- Proficiency in the use of the data mining tool.
The above skills may be combined in one person or may require more
than one person. However, even in the largest of organizations there
is a relatively small number of such specialists/teams with the above
skills. To maximise the returns on data mining, the role of these
specialists in a data mining project should be to prepare a specific
'Data Mining Business Scenario'. Once such a scenario is prepared
it can be deployed on a much wider scale to a large user community
- inside or outside the organization. A Data Mining Business Scenario
can also be called a Data Mining Solution.
Preparing a Data Mining Business Scenario involves all the steps
of the data mining process; goal definition, data selection, data
preparation and transformation, data exploration, pattern discovery
and pattern deployment. The business scenario can be deployed to
a wide user base. As an example, consider the business scenario
of mortgage arrears in the portfolio of a financial institution:
Goal definition
Identify the profiles of mortgage accounts with a high or low propensity
to default on mortgage payments. Define default as 3 or more months
in arrears. The patterns discovered will be issued to branch managers
to help them with the processing of mortgage applications. The patterns
will also be issued to marketing managers to help them in their targeted
marketing and in the definition of new products/mortgage packages.
Finally, the patterns will be used by the Credit Manager to monitor
the changes in the mortgage portfolio over time.
Data selection
Identify the source data as the Mortgage Applications data base and
the monthly payments database. Furthermore, focus on historic mortgage
applications, for example, those made in1996 and 1997 and all payments
records from 1996 until present date.
Data Preparation
- Extract mortgage application records from 1996 and 1997.
- Extract payments records from 1996 until present date.
- Join mortgage application and payment tables.
- Derive the new fields age (from DOB), total income and Loan/property
value ratio.
Data Exploration
- Explore the frequency distribution of data fields.
- Explore the correlation between data fields.
- Plot the goal (arrears status) against other fields.
Pattern Discovery
- Induce a decision tree profiling arrears.
- Get the domain expert to validate the tree.
- Verify the patterns against test data sets.
Pattern Deployment
- Generate the patterns (tree) in WMF format. Import the graphical
WMF file into MS-Word and print out the tree as a chart. Issue
the tree print out to branch and marketing managers.
- Generate the patterns as SQL which is used to generate regular
reports for the Credit Manager on the proportion of new business
matching each of the discovered patterns.
The Deployment of Active Data Mining Solutions
The methods of deployment of patterns listed in the previous section
can be described as passive deployment. This is because the solutions
deployed can only utilise the patterns previously discovered.
In Active Mining Deployment, the users are empowered to
discover and explore new patterns within the business scenario
(solution) delivered to them. For example, in the area of mortgage
arrears described above, the business scenario can be prepared
as described, but the Credit Manager and the users in the branches
and marketing department can be given the ability to:
- Interactively develop new tree patterns in line with their
business expertise/requirements.
- Develop new tree patterns from new data as it becomes available.
- Monitor the impact of new data on existing patterns.
- Within the same business scenario, change the outcome field
and develop tree patterns for a new goal (outcome). For example,
instead of profiling the arrears propensity, the user may
be interested in the profile of mortgages with high loan amounts.
- Explore data throughout the tree patterns.
The active deployment of data mining, turns data mining into vertical
business applications for wide scale use by business people who
otherwise would not have the skills to develop a data mining process.
Technologies for the Deployment of Active Data Mining Solutions
There are a number of software technologies required in order
to realise the benefits of the active deployment of data mining.
These data mining software components allow the creation of vertical
applications with embedded active data mining for use by business
users.
- Embedded Data Transformation Engine:
This component allows the data transformation process designed
by the creators of the business scenario to be run against
newly available data without any technical intervention by
the business user.
- Embedded Pattern Discovery:
This allows the pattern discovery components to be embedded
within a vertical end user application. For example, the mining
of mortgage arrears can be embedded as part of a Customer
Relationship Management System.
- Graphical and interactive Pattern
Discovery: An essential part of active data mining
is to allow the business user to interact with the data mining
algorithm to ensure that business oriented patterns are discovered.
Extensive pattern visualisation and exploration features are
also an important aspect of active data mining.
- Scalability, Architecture and performance:
Embedding data mining as a component within a vertical application
will in no way reduce the need for the data mining component
to work within the IT infrastructure and to be capable of
handling large data volumes. The interactive nature of embedded
data mining makes it even more important to have high performance
pattern generation and exploration.
XpertRule Miner for the Active Deployment of Data Mining
XpertRule Miner is designed from the ground up to enable the active
deployment of data mining. It achieves this through the following
features:
- Graphical Support for the full
Data Mining Process XpertRule Miner provides a
graphical environment for supporting all the stages of the
data mining process. The click, drag and drop environment
allows non programmers to carry out complex data preparation,
mining and deployment processes. It is an ideal environment
for the development and testing of data mining business scenarios.
- Embedded Data Transformation Engine:
Any data transformation process developed within XpertRule
Miner can be executed by its Data Engine which is a component
that can be embedded in any business application.
- Embedded ActiveX Data Mining Component:
The ProfilerX tree induction component is delivered as an
ActiveX component which is highly graphical, interactive and
can be embedded within other business applications. The component
exposes methods and objects which enables it to be seamlessly
embedded within other applications - such as Customer Relationship
Management (CRM) Systems.
- Flexible architecture, High performance
and scalability XpertRule Miner provides one of
the most flexible deployment architecture as illustrated here:
The ActiveX tree induction client allows data mining to be
embedded within other applications or deployed over the Internet/Intranet.
The CAF (Contingency AND Frequency) servers can be deployed
on the client, middle tier or server and are scalable and highly
performant. These CAFs can exploit the high performance available
from parallel processing database servers by the firing of intelligent
query streams at the server. Alternatively, the CAF server can
cache data from any ODBC compliant database into a highly tokenised
format which is optimised for high performance mining on very
large data tables. Using this caching technique allows an ODBC
data source of millions of rows to be mined in minutes on average
specification machines (e.g. 300 MHz Pentium with 64MB of RAM).
|