Wednesday, October 17, 2007

Lecture-1

Data Mining

- Extracting or 'mining' knowledge from large amounts of data


- Also known as Knowledge Discovery in Databases (KDD), but strictly speaking Data Mining (DM) is only one part of KDD


- KDD consists of


a. Data Cleaning- to remove noise and unwanted data


b. Data Integration - where multiple data sources may be combined, e.g. into a data warehouse


c. Data Selection- where data relevant to the analysis task are retrieved from the database


d. Data Transformation - where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregate operations


e. Data Mining- an essential process where intelligent methods are applied in order to extract data patterns.


f. Pattern Evaluation - to identify truly interesting patterns representing knowledge based on some interestingness measures.


g. Knowledge Presentation - where visualisation and knowledge representation techniques are used to present the mined knowledge to the user.
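
The seven KDD steps above can be sketched as a chain of small functions. This is only an illustrative sketch: the records, the field names (customer, age, spend) and the threshold are made-up assumptions, and the "mining" step is just an average per age band, not a real DM algorithm.

```python
from statistics import mean

# Hypothetical records from two sources; None marks a missing (noisy) value.
branch_a = [{"customer": "c1", "age": 25, "spend": 120.0},
            {"customer": "c2", "age": None, "spend": 80.0}]
branch_b = [{"customer": "c3", "age": 41, "spend": 300.0},
            {"customer": "c4", "age": 27, "spend": 100.0}]

def clean(records):
    """a. Data cleaning: drop records with missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """b. Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]

def select(records, fields):
    """c. Data selection: keep only the task-relevant attributes."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """d. Data transformation: consolidate ages into decade bands."""
    return [{"age_band": f"{r['age'] // 10 * 10}s", "spend": r["spend"]}
            for r in records]

def mine(records):
    """e. Data mining: here simply the average spend per age band."""
    bands = {}
    for r in records:
        bands.setdefault(r["age_band"], []).append(r["spend"])
    return {band: mean(values) for band, values in bands.items()}

def evaluate(patterns, min_spend):
    """f. Pattern evaluation: keep patterns above a threshold."""
    return {band: avg for band, avg in patterns.items() if avg >= min_spend}

def present(patterns):
    """g. Knowledge presentation: format the patterns for the user."""
    return [f"{band}: average spend {avg:.2f}"
            for band, avg in sorted(patterns.items())]

data = integrate(clean(branch_a), clean(branch_b))
data = select(data, ["age", "spend"])
patterns = mine(transform(data))
report = present(evaluate(patterns, min_spend=105.0))
print(report)
```

Each function corresponds to one lettered step, so a step can be swapped out (e.g. a real mining algorithm in place of mine) without touching the others.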





Thus we can also define DM as


" process of discovering interesting knowledge from large amounts of data stored either in databases, data warehouses or other information stores."





Thus there are the following components in the architecture of a DM system





1. Data Store: A store of data on which data cleaning and data integration can be performed.


2. Database or Data Warehouse Server: Fetches the relevant data, based on the user's DM request.


3. Knowledge Base: Knowledge that is used to guide the search or evaluate the interestingness of a pattern. It includes


- concept hierarchies


- user beliefs


- metadata (e.g. describing data from multiple heterogeneous sources)


4. DM Engine: Functional modules for tasks such as


- characterisation


- association


- classification


- cluster analysis


- evolution analysis


- deviation analysis





Why DM is helpful





The discovered knowledge can be applied to


- Decision Making


- Process control


- Information management


- Query processing





On what kind of data can DM be done



1. Relational DB



It is a collection of tables, each assigned a unique name. Each table consists of a set of attributes (columns or fields) and usually stores a large set of tuples (records or rows).



When DM is applied to relational DBs, one can search for trends or data patterns.



E.g. “to predict the credit risk of a customer based on their income, age and previous credit info”

2. Data Warehouse

E.g. a company has many branches and you are to find out “the company's sales per item type per branch for the third quarter”.

This is very difficult when the relevant data are spread over several databases, physically located at numerous sites. If the company has a DW, the task becomes easy.
A DW is a repository of information
- collected from multiple sources
- stored under a unified schema
- usually residing at a single site
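
A rough sketch of why the unified schema helps with the branch query. The two branch databases, their field names and the figures are all hypothetical; the point is only that once every source is mapped to one schema, "sales per item type per branch for the third quarter" becomes a single pass over one table.

```python
from collections import defaultdict

# Hypothetical branch databases that use different field names.
north_db = [{"item_type": "TV", "amount": 500, "quarter": 3},
            {"item_type": "phone", "amount": 200, "quarter": 3},
            {"item_type": "TV", "amount": 400, "quarter": 2}]
south_db = [{"type": "TV", "sale": 300, "q": 3},
            {"type": "TV", "sale": 100, "q": 3}]

def to_unified(branch, record, type_key, amount_key, quarter_key):
    """Map one source record onto the warehouse's unified schema."""
    return {"branch": branch,
            "item_type": record[type_key],
            "amount": record[amount_key],
            "quarter": record[quarter_key]}

# Load both sources into one store under one schema.
warehouse = (
    [to_unified("north", r, "item_type", "amount", "quarter") for r in north_db]
    + [to_unified("south", r, "type", "sale", "q") for r in south_db]
)

def sales_per_type_per_branch(rows, quarter):
    """Total sales for each (item type, branch) pair in the given quarter."""
    totals = defaultdict(int)
    for row in rows:
        if row["quarter"] == quarter:
            totals[(row["item_type"], row["branch"])] += row["amount"]
    return dict(totals)

q3_sales = sales_per_type_per_branch(warehouse, quarter=3)
print(q3_sales)
```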

How DWs are created
By
- data cleaning
- data transformation
- data integration
- data loading
- data refreshing

Other characteristics of DW
- Data in a DW is organized around major subjects: customer, item, supplier, activity
- Provides information from a historical perspective (e.g. the last 5-10 years)
- Information is usually summarized
- Modeled by a multidimensional database structure, which provides a multidimensional view of the data, e.g. a DATA CUBE


(Difference between a DW and a Data Mart: a DW collects info that spans an entire organization, whereas a data mart is a departmental subset of a DW)
- A DW is suitable for OLAP (On-Line Analytical Processing)
What is OLAP?

Operations that allow data to be presented at different levels of abstraction.
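
One common OLAP operation of this kind is a roll-up, which moves from a detailed level to a more abstract one. A minimal sketch, assuming a made-up month-to-quarter hierarchy and made-up sales facts:

```python
from collections import defaultdict

# Hypothetical sales facts at the detailed level: (month, item, amount).
facts = [("Jul", "TV", 100), ("Aug", "TV", 150),
         ("Sep", "phone", 50), ("Oct", "TV", 200)]

# Assumed concept hierarchy: month -> quarter.
MONTH_TO_QUARTER = {"Jul": "Q3", "Aug": "Q3", "Sep": "Q3", "Oct": "Q4"}

def roll_up(rows, hierarchy):
    """Aggregate month-level facts to the more abstract quarter level."""
    totals = defaultdict(int)
    for month, item, amount in rows:
        totals[(hierarchy[month], item)] += amount
    return dict(totals)

by_quarter = roll_up(facts, MONTH_TO_QUARTER)
print(by_quarter)
```

The same data is now viewed per quarter instead of per month; the reverse direction (drill-down) would go back to the detailed facts.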

3. Transactional Databases

Consists of a file where each record represents a transaction

e.g. Sales

Transaction Query: “All the items purchased by Sandy Smith”
DM Query: “Which items sold well together?”
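
The difference between the two kinds of query can be sketched as follows. The transactions are invented for illustration, and counting co-occurring pairs is just one simple way to approach "sold well together", not a full association-mining algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sales transactions: (customer, items bought in one basket).
transactions = [
    ("Sandy Smith", {"bread", "milk"}),
    ("Sandy Smith", {"milk", "eggs"}),
    ("Another Customer", {"bread", "milk", "eggs"}),
]

def items_purchased_by(customer):
    """Transaction query: plain retrieval of stored facts."""
    items = set()
    for name, basket in transactions:
        if name == customer:
            items |= basket
    return items

def pairs_sold_together(min_count):
    """DM query: count how often each pair of items shares a basket."""
    counts = Counter()
    for _, basket in transactions:
        counts.update(combinations(sorted(basket), 2))
    return {pair: n for pair, n in counts.items() if n >= min_count}

sandy_items = items_purchased_by("Sandy Smith")
frequent_pairs = pairs_sold_together(min_count=2)
print(sandy_items, frequent_pairs)
```

The first function only looks up what is stored; the second derives something that is not stored anywhere in the file.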

Advanced DB Systems

Object-Oriented DBs

Here each entity is considered an object. Each object is associated with

Variables - that describe the object
Messages - that the object can use to communicate with other objects
Methods - where each method holds the code to implement a message

Eg. Get-photo(employee)- will retrieve and return a photo of the given employee object

- Objects that share a common set of properties can be grouped into an object class
- Each object is an instance of its class.
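
In programming-language terms the same ideas look roughly like this. It is a sketch in Python, not the syntax of any particular object-oriented DB; the Employee class, its fields and the file path are assumptions made for illustration.

```python
class Employee:
    """Object class: groups objects that share a common set of properties."""

    def __init__(self, name, photo_path):
        # Variables that describe the object.
        self.name = name
        self.photo_path = photo_path

    def get_photo(self):
        # Method: the code that implements the "get-photo" message.
        return self.photo_path

# Each object is an instance of its class.
employee = Employee("Example Employee", "photos/example.jpg")

# Sending the message get-photo to the object runs the method above.
photo = employee.get_photo()
print(photo)
```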

Object Relational Databases

Basic relational data model with power to handle objects

Spatial databases
- maps
- VLSI Chip Design
- Medical and satellite images

DM can be used to find the characteristics of houses near a specified location, say a park.

Temporal and Time Series DBs

Store time-related data, e.g. data collected from stock exchanges.

DM example: scheduling of bank tellers according to the volume of customer traffic.

Text and Multimedia DB

Text databases contain word descriptions of objects.
E.g. to find particular words or sentences that occur together.

Multimedia databases contain image, audio and video data. DM: similarity-based pattern matching.

Heterogeneous DB and Legacy DBs

A legacy DB is a group of heterogeneous DBs that combines different kinds of data systems, such as relational or object-oriented … so it is difficult to exchange info between them.

DM can transform the data to a more generalized conceptual level (e.g. fair, good or excellent), from which info exchange can easily be performed.
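
A small sketch of this generalisation idea. Two hypothetical sources store the same kind of rating in incompatible forms (a percentage and a letter); the cut-off values and the letter mapping are assumptions, chosen only to show both sources landing on the shared fair/good/excellent level.

```python
def from_percentage(score):
    """Generalise a numeric score (assumed thresholds) to a concept level."""
    if score >= 80:
        return "excellent"
    if score >= 60:
        return "good"
    return "fair"

# Assumed mapping for a source that stores letter grades instead.
LETTER_TO_LEVEL = {"A": "excellent", "B": "good", "C": "fair"}

# Hypothetical heterogeneous sources.
db_one = {"record_1": 85, "record_2": 62}
db_two = {"record_3": "A", "record_4": "C"}

# After generalisation both sources use the same vocabulary and can be merged.
merged = {key: from_percentage(value) for key, value in db_one.items()}
merged.update({key: LETTER_TO_LEVEL[value] for key, value in db_two.items()})
print(merged)
```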

The World Wide Web: the data objects are linked together. DM: understanding user access patterns, called mining path traversal patterns; still a promising discipline.

Data Mining Functionalities
What kinds of patterns can be mined?
1. Concept / Class Description: Characterisation and Discrimination

Data Characterisation: Summarisation of the general characteristics or features of a target class of data, e.g. summarise the characteristics of customers who spend > $1000 a year at ABC stores.

Data Discrimination: Comparison of the general features of target class data objects with general features of objects from one or a set of contrasting classes.
E.g. compare two groups of ABC customers:
- those who shop for computer products regularly vs. those who rarely shop for them.
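
Both ideas can be sketched on a toy customer list. The records and the choice of summary features (count and mean age) are assumptions; the target class follows the ">$1000 a year" example, and for simplicity the contrasting class here is just everyone else.

```python
from statistics import mean

# Hypothetical ABC customers.
customers = [
    {"age": 30, "yearly_spend": 1500},
    {"age": 40, "yearly_spend": 2500},
    {"age": 22, "yearly_spend": 300},
    {"age": 26, "yearly_spend": 500},
]

def characterise(records):
    """Characterisation: summarise general features of one class."""
    return {"count": len(records),
            "mean_age": mean(r["age"] for r in records)}

# Target class: customers who spend more than $1000 a year.
target = [c for c in customers if c["yearly_spend"] > 1000]
# Contrasting class: all remaining customers.
contrast = [c for c in customers if c["yearly_spend"] <= 1000]

target_summary = characterise(target)
contrast_summary = characterise(contrast)

# Discrimination: compare the same feature across the two classes.
age_gap = target_summary["mean_age"] - contrast_summary["mean_age"]
print(target_summary, contrast_summary, age_gap)
```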

2. Association Analysis

Discovery of association rules showing attribute-value conditions that occur frequently together in a given set of data.

E.g. age(X, “20…29”) ^ income(X, “20K…29K”) => buys(X, “CD player”)
[support = 2%, confidence = 60%]

where X is a variable representing a customer. The rule indicates that, of the ABC customers under study, 2% (support) are 20 to 29 years of age with an income of 20K to 29K and have purchased a CD player at ABC. There is a 60% probability (confidence) that a customer in this age and income group will purchase a CD player.
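
How the two numbers are computed can be sketched on a tiny made-up customer list. The five records are invented, so the resulting values differ from the 2% and 60% of the example; only the calculation is the point.

```python
# Hypothetical customer records.
customers = [
    {"age": 25, "income": 22_000, "buys_cd_player": True},
    {"age": 28, "income": 27_000, "buys_cd_player": True},
    {"age": 23, "income": 21_000, "buys_cd_player": False},
    {"age": 45, "income": 60_000, "buys_cd_player": True},
    {"age": 35, "income": 25_000, "buys_cd_player": False},
]

def antecedent(c):
    """Left-hand side: age 20..29 and income 20K..29K."""
    return 20 <= c["age"] <= 29 and 20_000 <= c["income"] <= 29_000

def consequent(c):
    """Right-hand side: buys a CD player."""
    return c["buys_cd_player"]

def support_and_confidence(records, lhs, rhs):
    matches_lhs = [r for r in records if lhs(r)]
    matches_both = [r for r in matches_lhs if rhs(r)]
    # Support: share of ALL records that satisfy both sides of the rule.
    support = len(matches_both) / len(records)
    # Confidence: share of records matching the left side that also match the right.
    confidence = len(matches_both) / len(matches_lhs) if matches_lhs else 0.0
    return support, confidence

support, confidence = support_and_confidence(customers, antecedent, consequent)
print(f"support={support:.0%}, confidence={confidence:.0%}")
```

Here three customers match the age and income condition and two of them bought a CD player, so support is 2 out of 5 and confidence is 2 out of 3.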

3. Prediction: Used to predict missing or unavailable data values rather than class labels.


4. Cluster Analysis


- Objects are clustered into groups on the basis of

- maximising the intraclass similarity.

- minimising the interclass similarity.
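
As a rough illustration, a simple k-means on one-dimensional data groups values so that members of a group are close to each other and far from the other group. k-means is just one possible clustering method (the notes do not name one); the data, the starting centres and the fixed number of rounds are assumptions.

```python
def k_means_1d(values, centres, rounds=10):
    """Very small k-means for 1-D data with given starting centres."""
    for _ in range(rounds):
        clusters = [[] for _ in centres]
        for v in values:
            # Assign each value to its nearest centre (high intraclass similarity).
            nearest = min(range(len(centres)), key=lambda i: abs(v - centres[i]))
            clusters[nearest].append(v)
        # Move each centre to the mean of its cluster; keep it if the cluster is empty.
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return clusters, centres

# Hypothetical yearly spend of six customers.
spend = [10, 12, 11, 95, 100, 105]
clusters, centres = k_means_1d(spend, centres=[0.0, 50.0])
print(clusters, centres)
```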


5. Outlier Analysis
Data objects that do not comply with the general behaviour or model of the data are outliers.
- Most DM methods discard outliers
- But in some applications, such as fraud detection, the outliers are more important than the regular data
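
One simple way to flag such objects is a z-score test: values far from the mean, measured in standard deviations, are reported instead of discarded. This is only a sketch; the payment amounts and the threshold of 2 are assumptions, and real fraud detection uses far more robust methods.

```python
from statistics import mean, pstdev

def find_outliers(values, threshold=2.0):
    """Return values whose distance from the mean exceeds threshold * std dev."""
    centre = mean(values)
    spread = pstdev(values)
    if spread == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - centre) / spread > threshold]

# Hypothetical card payments; one is very unlike the rest.
payments = [20, 22, 19, 21, 20, 500]
suspicious = find_outliers(payments)
print(suspicious)
```

With very small samples the largest possible z-score is limited, so the threshold has to be chosen with the sample size in mind.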
6. Evolution Analysis
- Describes trends for objects whose behaviour changes over time
e.g. finding regularities in a particular company's stock from the stock market data.
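
A very basic sketch of trend description is a moving average, which smooths short-term fluctuations so the overall direction becomes visible. The price series and the window size are made up, and a moving average is only one of many techniques that fall under evolution analysis.

```python
def moving_average(series, window):
    """Average of each run of `window` consecutive values."""
    return [sum(series[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(series))]

# Hypothetical daily closing prices of one stock.
prices = [10, 12, 11, 13, 15, 14, 16]
smoothed = moving_average(prices, window=3)

# Compare the ends of the smoothed series to describe the trend.
trend = "rising" if smoothed[-1] > smoothed[0] else "not rising"
print(smoothed, trend)
```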
What makes a pattern interesting?
A pattern is interesting if:
- it is easily understood by humans
- it is valid on new data with some degree of certainty
- it is potentially useful
- it is novel
- it validates a hypothesis that the user seeks to confirm
Rem: An interesting pattern represents knowledge
Objective measures of Pattern interestingness:
1. Support: the percentage of transactions that satisfy the given rule.
2. Confidence: the degree of certainty of the detected association.
Subjective Measures
Based on the user's beliefs about the data.
Thus a pattern is interesting if it:
- contradicts or
- supports the user's beliefs
Can a DM system generate all of the interesting patterns?
No. But user-provided constraints and interestingness measures can be used to focus the search.
Can a DM system generate only interesting patterns?
- This is an optimisation problem in DM
- It remains a challenging issue