- Extracting or 'mining' knowledge from large amounts of data
- Also known as Knowledge Discovery in Databases (KDD), although strictly Data Mining (DM) is only one step of KDD
- KDD consists of
a. Data Cleaning - to remove noise and unwanted data
b. Data Integration - where multiple data sources may be combined (data warehouse)
c. Data Selection - where data relevant to the analysis task are retrieved from the database
d. Data Transformation - where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregate operations
e. Data Mining - an essential process where intelligent methods are applied in order to extract data patterns
f. Pattern Evaluation - to identify the truly interesting patterns representing knowledge, based on some interestingness measures
g. Knowledge Presentation - where visualisation and knowledge representation techniques are used to present the mined knowledge to the user
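The seven KDD steps above can be sketched as a chain of small functions. This is only an illustrative sketch: the records, field names, function names and the threshold are all made up for the example, and the "mining" step is a toy ranking rather than a real mining algorithm.

```python
from collections import defaultdict

# Hypothetical raw records from two sources (all values are made up).
source_a = [
    {"branch": "north", "item": "tv", "amount": 400.0},
    {"branch": "north", "item": "tv", "amount": None},  # noisy: missing amount
    {"branch": "north", "item": "radio", "amount": 50.0},
]
source_b = [
    {"branch": "south", "item": "tv", "amount": 450.0},
    {"branch": "south", "item": "pen", "amount": 2.0},
]

def clean(records):
    """a. Data cleaning: drop records with a missing amount."""
    return [r for r in records if r["amount"] is not None]

def integrate(*sources):
    """b. Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records, items):
    """c. Data selection: keep only the items relevant to the task."""
    return [r for r in records if r["item"] in items]

def transform(records):
    """d. Data transformation: aggregate amounts per item."""
    totals = defaultdict(float)
    for r in records:
        totals[r["item"]] += r["amount"]
    return dict(totals)

def mine(totals):
    """e. Data mining (toy stand-in): rank items by total sales."""
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

def evaluate(patterns, threshold):
    """f. Pattern evaluation: keep patterns above an assumed threshold."""
    return [(item, total) for item, total in patterns if total >= threshold]

def present(patterns):
    """g. Knowledge presentation: format the results for the user."""
    return [f"{item}: {total:.2f}" for item, total in patterns]

data = integrate(clean(source_a), clean(source_b))
selected = select(data, {"tv", "radio"})
report = present(evaluate(mine(transform(selected)), threshold=100.0))
print(report)
```

Each function corresponds to one lettered step, so a step can be swapped for a real technique without touching the others.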
Thus we can also define DM as
"the process of discovering interesting knowledge from large amounts of data stored either in databases, data warehouses or other information stores."
Thus the architecture of a DM system has the following components
1. Data Store: A store of data on which data cleaning and data integration can be performed.
2. Database or Data Warehouse Server: Fetches the relevant data, based on the user's DM request.
3. Knowledge Base: Knowledge that is used to guide the search or evaluate the interestingness of a pattern. It includes
- concept hierarchies
- user beliefs
- metadata (eg. describing data from multiple heterogeneous sources)
4. DM Engine: Functional modules for tasks such as
- characterisation
- association
- classification
- cluster analysis
- evolution analysis
- deviation analysis
Why DM is helpful
The discovered knowledge can be applied to
- Decision Making
- Process control
- Information management
- Query processing
On what kinds of data can DM be done
1. Relational DB
It is a collection of tables, each assigned a unique name. Each table consists of a set of attributes (columns or fields) and usually stores a large set of tuples (records or rows).
When DM is applied to relational DBs, one can search for trends or data patterns.
Eg. “to predict the credit risk of a customer based on their income, age and previous credit info”
2. Data Warehouse (DW)
Eg. A company has many branches and you are to find out “the company's sales per item type per branch for the third quarter”
This is very difficult when the relevant data are spread over several databases, physically located at numerous sites. If the company has a DW, the task becomes easy.
A DW is a repository of information
- collected from multiple sources
- stored under a unified schema
- residing at a single site
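Once the data sit under one unified schema at a single site, the third-quarter sales question from the example reduces to a single grouped aggregation. The sketch below assumes a hypothetical fact table of (branch, item type, quarter, sales amount) tuples. The branch names and figures are invented for illustration.

```python
from collections import defaultdict

# Hypothetical warehouse fact table under one unified schema:
# (branch, item_type, quarter, sales_amount)
warehouse = [
    ("branch_a", "computer", 3, 1200.0),
    ("branch_a", "phone", 3, 300.0),
    ("branch_a", "computer", 2, 900.0),
    ("branch_b", "computer", 3, 700.0),
    ("branch_b", "computer", 3, 100.0),
]

def sales_by_item_type_and_branch(facts, quarter):
    """Total sales per (item_type, branch) for the given quarter."""
    totals = defaultdict(float)
    for branch, item_type, q, amount in facts:
        if q == quarter:
            totals[(item_type, branch)] += amount
    return dict(totals)

q3 = sales_by_item_type_and_branch(warehouse, quarter=3)
for (item_type, branch), total in sorted(q3.items()):
    print(item_type, branch, total)
```

Without a warehouse, the same loop would first have to fetch and reconcile rows from every branch database, which is the hard part the DW removes.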
How DWs are created
By
- data cleaning
- data transformation
- data integration
- data loading
- data refreshing
Other characteristics of DW
- Data in a DW are organized around major subjects - customer, item, supplier, activity
- Provides information from a historical perspective (eg. the last 5-10 years)
- Information is usually summarized
- Modeled by a multidimensional database structure, which provides a multidimensional view of data, eg. DATA CUBE
(Difference between a DW and a Data Mart - a DW collects info that spans an entire organization, whereas a data mart is a department subset of a DW)
- A DW is suitable for OLAP (On Line Analytical Processing)
What is OLAP
Operations that allow knowledge to be presented at different levels of abstraction.
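One common way to move between levels of abstraction is a roll-up, where detailed cells are summed up to a coarser level of a concept hierarchy. The sketch below is only an assumption-laden illustration: the city to country hierarchy, the cell layout and the sales figures are all hypothetical.

```python
from collections import defaultdict

# Hypothetical cube cells: (city, country, item, sales)
cells = [
    ("Chicago", "USA", "tv", 100),
    ("New York", "USA", "tv", 150),
    ("Toronto", "Canada", "tv", 80),
    ("Toronto", "Canada", "radio", 20),
]

def aggregate(cells, key):
    """Sum sales over all cells that share the same key value."""
    totals = defaultdict(int)
    for cell in cells:
        totals[key(cell)] += cell[3]
    return dict(totals)

# Detailed level: one total per city.
by_city = aggregate(cells, key=lambda c: c[0])
# Roll-up along the location hierarchy: city -> country.
by_country = aggregate(cells, key=lambda c: c[1])

print(by_city)
print(by_country)
```

Going from `by_country` back to `by_city` is the opposite move (drill-down). It needs the detailed cells, which is why a warehouse keeps them.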
3. Transactional Databases
Consist of a file where each record represents a transaction
Eg. Sales
Transaction Query: “All the items purchased by Sandy Smith”
DM Query: “Which items sold well together”
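The difference between the two kinds of query can be sketched on a handful of made-up transactions. The transaction query is a plain lookup, while the DM query has to count co-occurrences across all transactions. The record layout, item names and the second customer are assumptions for illustration only.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sales transactions.
transactions = [
    {"id": "T100", "customer": "Sandy Smith", "items": {"bread", "milk"}},
    {"id": "T200", "customer": "Sandy Smith", "items": {"bread", "butter", "milk"}},
    {"id": "T300", "customer": "Other Customer", "items": {"bread", "milk"}},
    {"id": "T400", "customer": "Other Customer", "items": {"butter"}},
]

def items_purchased_by(transactions, customer):
    """Transaction query: all items bought by one customer."""
    items = set()
    for t in transactions:
        if t["customer"] == customer:
            items |= t["items"]
    return items

def pair_counts(transactions):
    """DM query: how often each pair of items appears in the same transaction."""
    counts = Counter()
    for t in transactions:
        for pair in combinations(sorted(t["items"]), 2):
            counts[pair] += 1
    return counts

sandy_items = items_purchased_by(transactions, "Sandy Smith")
best_pair, best_count = pair_counts(transactions).most_common(1)[0]
print(sandy_items)
print(best_pair, best_count)
```

Counting every pair is fine for a toy example. On real transaction volumes a frequent-itemset algorithm would prune the candidates instead.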
Advanced DB Systems
Object-Oriented DBs
Here each entity is considered an object. Each object is associated with
Variables - that describe the object
Messages - that the object can use to communicate with other objects
Methods - where each method holds the code to implement a message
Eg. Get-photo(employee) - will retrieve and return a photo of the given employee object
- Objects that share a common set of properties can be grouped into an object class
- Each object is an instance of its class.
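The object model above maps closely onto classes in an object-oriented language. The following is a rough sketch in Python, not the interface of any particular object-oriented DB: the class name, the attribute names and the photo path are hypothetical, and `get_photo` stands in for the Get-photo(employee) example.

```python
class Employee:
    """Object class: employee objects share these variables and methods."""

    def __init__(self, name, photo_path):
        # Variables that describe the object.
        self.name = name
        self.photo_path = photo_path

    def get_photo(self):
        # Method: holds the code that implements the "get photo" message.
        return self.photo_path


# Each object is an instance of its class.
emp = Employee("A. Employee", "photos/a_employee.png")

# Sending the message to the object means calling the method.
photo = emp.get_photo()
print(photo)
```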
Object Relational Databases
Basic relational data model with the power to handle objects
Spatial databases
- maps
- VLSI Chip Design
- Medical and satellite images
DM can be used to find characteristics of houses near a specified location, say a park
Temporal and Time Series DBs
Store time-related data. Eg. Data collected regarding stock exchanges
DM Strategies: Scheduling of bank tellers according to the volume of customer traffic.
Text and Multimedia DBs
Text databases contain word descriptions for objects
Eg. To find particular words or sentences that occur together.
Multimedia DBs contain image, audio and video data. DM - similarity-based pattern matching.
Heterogeneous DBs and Legacy DBs
A legacy DB is a group of heterogeneous DBs that combines different kinds of data systems, such as relational or object-oriented … difficult to exchange info.
DM can transform the data into a more generalized conceptual level (eg. fair, good or excellent), from which info exchange can easily be performed.
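As a rough illustration of generalizing to a shared conceptual level, the sketch below maps grades stored on two different scales onto the labels fair, good and excellent. The two source databases, the names and the cut-off fractions are assumptions chosen only for the example.

```python
def generalize(score, max_score):
    """Map a numeric score onto a shared concept level (cut-offs are assumed)."""
    fraction = score / max_score
    if fraction >= 0.85:
        return "excellent"
    if fraction >= 0.6:
        return "good"
    return "fair"

# Two hypothetical heterogeneous sources using different grading scales.
db_percent = {"alice": 91, "bob": 62}    # grades out of 100
db_gpa = {"carol": 3.6, "dave": 2.0}     # grades out of 4.0

# After generalization both sources speak the same vocabulary.
shared = {name: generalize(s, 100) for name, s in db_percent.items()}
shared.update({name: generalize(s, 4.0) for name, s in db_gpa.items()})
print(shared)
```

Once every source is expressed in the same few concept levels, records can be exchanged or compared without converting between the original scales.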
The World Wide Web - the data objects are linked together. DM - understanding user access patterns, called mining path traversal patterns. Still a promising discipline.
Data Mining Functionalities
What kinds of patterns can be mined?
Concept / Class description: Characterisation and Discrimination
Data Characterisation: Summarisation of the general characteristics or features of a target class of data. Eg. Summarise the characteristics of customers who spend >$1000 a year at ABC stores
Data Discrimination: Comparison of the general features of target class data objects with general features of objects from one or a set of contrasting classes.
Eg. Compare two groups of ABC customers
- those who shop for computer products regularly vs. those who rarely shop for them.
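Both ideas can be sketched with the same small summary function: characterisation summarises one target class, and discrimination puts the summaries of two contrasting classes side by side. The customer records, the attribute names and the choice of mean age as the summarised feature are hypothetical.

```python
from statistics import mean

# Hypothetical ABC customer records.
customers = [
    {"age": 45, "spend": 1500, "buys_computers": "regularly"},
    {"age": 50, "spend": 1200, "buys_computers": "rarely"},
    {"age": 25, "spend": 300, "buys_computers": "regularly"},
    {"age": 35, "spend": 800, "buys_computers": "rarely"},
]

def characterize(records):
    """Summarise a class of records by its size and mean age."""
    return {"count": len(records), "mean_age": mean(r["age"] for r in records)}

# Characterisation: customers who spend more than $1000 a year.
big_spenders = characterize([c for c in customers if c["spend"] > 1000])

# Discrimination: compare the target class with a contrasting class.
regular = characterize([c for c in customers if c["buys_computers"] == "regularly"])
rare = characterize([c for c in customers if c["buys_computers"] == "rarely"])

print(big_spenders)
print(regular, rare)
```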
Association Analysis
Discovery of association rules showing attribute-value conditions that occur frequently together in a given set of data.
Eg. age(X, “20…29”) ^ income(X, “20K…29K”) => buys(X, “CD player”)
[support = 2%, confidence = 60%]
where X is a variable representing a customer. It indicates that, of the ABC customers under study, 2% (support) are 20 to 29 years of age with an income of 20K to 29K and have purchased a CD player at ABC. There is a 60% probability (confidence) that a customer in this age and income group will purchase a CD player.
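Support and confidence for a rule of this form can be computed by simple counting. The sketch below uses ten invented customer records, so its numbers (30% support, 60% confidence) differ from the 2% and 60% quoted for the ABC data. The field names and predicates are assumptions for illustration.

```python
def support_and_confidence(records, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent => consequent."""
    matching = [r for r in records if antecedent(r)]   # left-hand side holds
    both = [r for r in matching if consequent(r)]      # both sides hold
    support = len(both) / len(records)
    confidence = len(both) / len(matching) if matching else 0.0
    return support, confidence

# Ten hypothetical customers.
records = (
    [{"age": 25, "income": 25000, "bought_cd_player": True}] * 3
    + [{"age": 25, "income": 25000, "bought_cd_player": False}] * 2
    + [{"age": 40, "income": 50000, "bought_cd_player": True}] * 1
    + [{"age": 40, "income": 50000, "bought_cd_player": False}] * 4
)

def in_group(r):
    """age in 20...29 and income in 20K...29K"""
    return 20 <= r["age"] <= 29 and 20000 <= r["income"] <= 29000

def buys_cd_player(r):
    return r["bought_cd_player"]

support, confidence = support_and_confidence(records, in_group, buys_cd_player)
print(f"support = {support:.0%}, confidence = {confidence:.0%}")
```

Support is measured against all customers, while confidence is measured only against those who satisfy the left-hand side of the rule.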
3. Prediction: To predict missing or unavailable data values rather than class labels.