Priyank Goyal's Data Mining Lectures: 2007

Monday, November 5, 2007

What exactly is a DW

A DW is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process.

Subject Oriented
A DW is organized around major subjects such as customers, suppliers, products and sales. It excludes data that is not useful in the decision-making process.

Integrated
A DW is usually constructed by integrating multiple heterogeneous sources, e.g. relational dBs and flat files.

Time-Variant
Provides information from a historical perspective (e.g. the past 5-10 years).

Non-Volatile
Does not require transaction processing, recovery or concurrency control mechanisms. It requires only initial loading of data and access to data.

What is data Warehousing

It is the process of constructing and using data warehouses.

How are organizations using information from dW

- Increasing customer focus
o Analysis of customer buying patterns
- Repositioning Products and Managing Product Portfolios
o Compare the performance of sales by quarter or territory.
- Analysing operations and looking for sources of profit
- Managing customer relationships
- Making environmental corrections
- Managing the cost of corporate assets

Data warehousing is useful from the point of view of heterogeneous database integration

- to integrate data from diverse databases.

It is popular because

- It uses an update-driven rather than a query-driven approach
o Information is integrated in advance and stored in the warehouse for direct querying and analysis.
o Brings high performance to the integrated dB system
o Query processing in the DW does not interfere with processing at the local sources
o A DW can store and integrate historical information and support multidimensional queries.

Difference Between Operational dB Systems and DW

Operational dB systems: The major task is to perform online transaction and query processing. These are called On-Line Transaction Processing (OLTP) systems.

- Cover most of the day-to-day operations of an organization, e.g. purchasing, inventory etc.

DW: Serves users or knowledge workers in the role of data analysis and decision making. These are known as On-Line Analytical Processing (OLAP) systems.

Why not perform OLAP directly on the operational dB rather than having a DW?

- An operational dB is designed for day-to-day operations. DW queries are more complex.
- In an operational dB, concurrency control and recovery mechanisms are required. OLAP queries mostly need read-only access, and applying these mechanisms to them would slow down the concurrent transactions.
- Decision support systems require historical data, whereas operational dBs typically do not maintain historical data.
- Operational dBs contain only raw data such as transactions, which need to be consolidated before analysis.

However, operational databases that also support data warehousing are emerging.

Data Warehousing Architecture

Views regarding the design of a DW

Top-down View: Allows selection of the relevant information necessary for the DW.
Data Source View: Exposes the information being captured, stored and managed by operational systems.
Data Warehouse View: Includes fact tables and dimension tables. It represents the information that is stored inside the DW.
Business Query View: The perspective of data in the DW from the point of view of the end user.

Skills required in building a DW

Business Skills: Understanding how the systems store and manage their data, how to build extractors that transfer data from the operational system to the DW, and how to build refresh software that keeps the DW data up to date.
Technology Skills: How to make assessments from quantitative information.
Program Management Skills: The need to interface with many technologies, vendors and end users.

Process of DW design

See the figure

Steps in the DW design process

Choose a business process to model, e.g. orders, invoices, sales etc.
Choose the grain of the business process (grain: the fundamental level of data to be represented in the fact table; a fact table is a large central table containing the bulk of the data with no redundancy).
Choose the dimensions that will apply to each fact table record, e.g. time, item, customer, supplier.

See figure:

Choose the measures that will populate each fact table record, e.g. dollars_sold and units_sold.
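
The design steps above can be illustrated with a very small star schema. The following is only a sketch in plain Python, assuming a hypothetical sales process whose grain is one row per item per day. The dimension contents and the helper total_by_item_type are made up for illustration; only the measure names dollars_sold and units_sold come from the notes.

```python
from collections import defaultdict

# Dimension tables (hypothetical contents): key -> descriptive attributes.
time_dim = {
    1: {"day": "2007-10-01", "quarter": "Q4"},
    2: {"day": "2007-10-02", "quarter": "Q4"},
}
item_dim = {
    10: {"name": "CD player", "type": "electronics"},
    11: {"name": "notebook", "type": "stationery"},
}

# Fact table: one row per (time, item) at the chosen grain.
# Each row holds dimension keys plus the measures.
sales_fact = [
    {"time_key": 1, "item_key": 10, "dollars_sold": 200.0, "units_sold": 2},
    {"time_key": 1, "item_key": 11, "dollars_sold": 15.0, "units_sold": 5},
    {"time_key": 2, "item_key": 10, "dollars_sold": 100.0, "units_sold": 1},
]


def total_by_item_type(facts, items):
    """Sum dollars_sold per item type by joining facts to the item dimension."""
    totals = defaultdict(float)
    for row in facts:
        item_type = items[row["item_key"]]["type"]
        totals[item_type] += row["dollars_sold"]
    return dict(totals)


totals = total_by_item_type(sales_fact, item_dim)
print(totals)
```

The fact rows stay narrow (keys and measures only), while descriptive attributes live once in the dimension tables, which is what keeps the fact table free of redundancy.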

Major Issues in Data Mining

Mining Methodology and User Interaction Issues

Mining different kinds of knowledge in databases
A DM system should cover a wide spectrum of data analysis tasks
A DM system should support interactive mining of knowledge at different levels of abstraction
Background knowledge should be incorporated to guide the discovery process
Data mining query languages (DMQL) need to be developed to allow users to do ad hoc mining
Discovered knowledge should be expressed in high-level languages (HLL) or visual representations
A DM system should be able to handle noisy or incomplete data
Techniques are needed to assess the interestingness of discovered patterns

Performance Issues
DM algorithms must be efficient and scalable, i.e. their running time must remain acceptable on large databases
Parallel, distributed and incremental mining algorithms should be developed

Parallel and distributed- the data are divided into partitions that are processed in parallel, and the results are merged
Incremental- incorporates data updates without having to mine the entire data again "from scratch"

Issues related to diversity of data
Handling of relational and complex types of data
Mining knowledge from heterogeneous databases and global information systems
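
As a rough illustration of the incremental mining idea listed under the performance issues above, the sketch below keeps item frequencies up to date as new batches of transactions arrive, without rescanning earlier data. The class IncrementalItemCounts and the sample transactions are hypothetical; real incremental mining algorithms maintain far richer state than simple counts.

```python
from collections import Counter


class IncrementalItemCounts:
    """Keeps item frequencies up to date as new transactions arrive."""

    def __init__(self):
        self.counts = Counter()
        self.n_transactions = 0

    def update(self, new_transactions):
        # Only the new batch is scanned; earlier data is never re-read.
        for items in new_transactions:
            self.counts.update(set(items))
            self.n_transactions += 1

    def support(self, item):
        """Fraction of all transactions seen so far that contain the item."""
        if self.n_transactions == 0:
            return 0.0
        return self.counts[item] / self.n_transactions


miner = IncrementalItemCounts()
miner.update([["milk", "bread"], ["milk"]])                    # first load
miner.update([["bread", "butter"], ["milk", "butter"]])        # later update
print(miner.support("milk"), miner.support("bread"))
```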

classification of data mining systems

Wednesday, October 17, 2007

Lecture-1

Data Mining

- Extracting or 'mining' knowledge from large amounts of data

- Also known as Knowledge Discovery in Databases (KDD), but strictly Data Mining (DM) is one step of KDD

- KDD consists of

a. Data Cleaning - to remove noise and unwanted data

b. Data Integration - where multiple data sources may be combined, e.g. into a data warehouse

c. Data Selection - where data relevant to the analysis task are retrieved from the database

d. Data Transformation - where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregate operations

e. Data Mining - an essential process where intelligent methods are applied in order to extract data patterns

f. Pattern Evaluation - to identify the truly interesting patterns representing knowledge, based on some interestingness measures

g. Knowledge Presentation - where visualisation and knowledge representation techniques are used to present the mined knowledge to the user
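
The KDD steps above can be read as a pipeline of functions. The following is a toy sketch under stated assumptions: the two source lists, the record fields and every function name are hypothetical, and the "mining" step is just a frequency threshold standing in for a real algorithm.

```python
from collections import Counter

# Two hypothetical data sources; the second record of source A is noisy.
raw_source_a = [
    {"customer": "ann", "item": "CD player", "amount": 120.0},
    {"customer": "bob", "item": None, "amount": 30.0},
]
raw_source_b = [
    {"customer": "ann", "item": "CD", "amount": 15.0},
    {"customer": "cy", "item": "CD player", "amount": 110.0},
]


def clean(records):
    """Data cleaning: drop records with missing values."""
    return [r for r in records if all(v is not None for v in r.values())]


def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]


def select(records, min_amount):
    """Data selection: keep only records relevant to the task."""
    return [r for r in records if r["amount"] >= min_amount]


def transform(records):
    """Data transformation: aggregate to purchase counts per item."""
    return Counter(r["item"] for r in records)


def mine(item_counts, min_count):
    """Data mining (toy): items bought at least min_count times."""
    return {item for item, n in item_counts.items() if n >= min_count}


cleaned = [clean(s) for s in (raw_source_a, raw_source_b)]
data = integrate(*cleaned)
patterns = mine(transform(select(data, min_amount=100.0)), min_count=2)
print(patterns)
```

Pattern evaluation and knowledge presentation are left out here; in this sketch they would consume the patterns set.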

Thus we can also define DM as

"The process of discovering interesting knowledge from large amounts of data stored in databases, data warehouses or other information stores."

Thus the architecture of a DM system has the following components

1. Data Store: A store of data on which data cleaning and data integration can be performed.

2. Database or Data Warehouse Server: Fetches the relevant data, based on the user's DM request.

3. Knowledge Base: Knowledge that is used to guide the search or evaluate the interestingness of a pattern. It includes

- concept hierarchies

- user beliefs

- metadata (e.g. describing data from multiple heterogeneous sources)

4. DM Engine: Functional modules for tasks such as

- characterisation

- association

- classification

- cluster analysis

- evolution analysis

- deviation analysis

Why DM is helpful

The discovered knowledge can be applied to

- Decision making

- Process control

- Information management

- Query processing

On what kinds of data can DM be done

1. Relational dB

It is a collection of tables, each assigned a unique name. Each table consists of a set of attributes (columns or fields) and usually stores a large set of tuples (records or rows).

When DM is applied to relational dBs, one can search for trends or data patterns.

E.g. "Predict the credit risk of customers based on their income, age and previous credit information"

2. Data Warehouse

E.g. A company has many branches and you are to find out "the company's sales per item type per branch for the third quarter"

This is very difficult, as the relevant data are spread over several databases physically located at numerous sites. If the company has a DW, it would be easy.
A DW is a repository of information
- collected from multiple sources
- stored under a unified schema
- usually residing at a single site.

How DWs are created
By
- data cleaning
- data transformation
- data integration
- data loading
- data refreshing

Other characteristics of a DW
- Data in a DW is organized around major subjects, e.g. customer, item, supplier, activity
- Provides information from a historical perspective (e.g. the last 5-10 years)
- Information is usually summarized
- Modeled by a multidimensional database structure, which provides a multidimensional view of data, e.g. a DATA CUBE
- A DW is suitable for OLAP (On-Line Analytical Processing)

(Difference between a DW and a data mart: a DW collects information that spans an entire organization, whereas a data mart is a department-level subset of a DW)

What is OLAP

Operations that allow data to be presented at different levels of abstraction.
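
One way to picture "different levels of abstraction" is a roll-up along a dimension hierarchy. The sketch below is a minimal illustration, assuming hypothetical sales rows and a location hierarchy of city to country; the helper aggregate is not part of any OLAP product.

```python
from collections import defaultdict

# Hypothetical sales facts: (city, country, quarter, dollars_sold)
sales = [
    ("Delhi", "India", "Q3", 400.0),
    ("Mumbai", "India", "Q3", 250.0),
    ("Toronto", "Canada", "Q3", 300.0),
    ("Delhi", "India", "Q4", 100.0),
]


def aggregate(rows, key_fn):
    """Sum dollars_sold for each group produced by key_fn."""
    totals = defaultdict(float)
    for row in rows:
        totals[key_fn(row)] += row[3]
    return dict(totals)


# Detailed level: totals per (city, quarter).
by_city = aggregate(sales, lambda r: (r[0], r[2]))

# Roll-up along the location hierarchy city -> country.
by_country = aggregate(sales, lambda r: (r[1], r[2]))

print(by_city)
print(by_country)
```

Going the other way, from by_country back to by_city, corresponds to a drill-down.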

3. Transactional Databases

Consists of a file where each record represents a transaction

e.g. sales

Transaction query: "Show all the items purchased by Sandy Smith"
DM query: "Which items sold well together?"
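
A question such as "which items sold well together" can be approximated by counting how often each pair of items appears in the same transaction. This is a minimal sketch with hypothetical transaction IDs and items; it counts pairs only and is not a full frequent-itemset algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactional data: transaction ID -> items purchased.
transactions = {
    "T100": ["bread", "milk", "butter"],
    "T200": ["bread", "milk"],
    "T300": ["milk", "eggs"],
}


def pair_counts(trans):
    """Count in how many transactions each unordered item pair co-occurs."""
    counts = Counter()
    for items in trans.values():
        # sorted(set(...)) removes duplicates and fixes the pair order.
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return counts


counts = pair_counts(transactions)
best_pair, best_count = counts.most_common(1)[0]
print(best_pair, best_count)
```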

Advanced DB Systems

Object-Oriented DBs

Here each entity is considered an object. Each object is associated with

Variables- that describe the object
Messages- that the object can use to communicate with other objects
Methods- where each method holds the code to implement a message

E.g. Get-photo(employee) will retrieve and return a photo of the given employee object

- Object that share a common set of properties can be grouped into an object class
- Each object is an instance of its class.

Object Relational Databases

Basic relational data model with power to handle objects

Spatial databases
- maps
- VLSI Chip Design
- Medical and satellite images

DM can be used to find the characteristics of houses near a specified location, say a park.

Temporal and Time Series DBs

Store time-related data, e.g. data collected regarding stock exchanges.

DM application: scheduling of bank tellers according to the volume of customer traffic.

Text and Multimedia DB

Text databases contain word descriptions for objects.
E.g. find objects whose descriptions contain particular words or sentences together.

Multimedia databases contain image, audio and video data. DM: similarity-based pattern matching.

Heterogeneous DB and Legacy DBs

A legacy DB is a group of heterogeneous DBs that combines different kinds of data systems, such as relational or object-oriented DBs. It is difficult to exchange information between them.

DM can transform the data to a more generalized conceptual level (e.g. fair, good or excellent), at which information exchange can be performed more easily.
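
The generalization idea above can be sketched as mapping differently encoded values onto one shared set of concepts. Everything in this example is hypothetical: one source stores a numeric score, the other a letter grade, and the cut-offs 80 and 50 are arbitrary choices for illustration.

```python
# Hypothetical mapping from letter grades to the shared concept level.
LETTER_TO_CONCEPT = {"A": "excellent", "B": "good", "C": "fair"}


def from_numeric(score):
    """Generalize a 0-100 score to a concept (assumed cut-offs)."""
    if score >= 80:
        return "excellent"
    if score >= 50:
        return "good"
    return "fair"


def from_letter(grade):
    """Generalize a letter grade to the same concept level."""
    return LETTER_TO_CONCEPT[grade]


# Two heterogeneous sources describing the same kind of attribute.
db_numeric = {"s1": 91, "s2": 55}
db_letter = {"s3": "A", "s4": "C"}

# After generalization both sources share one vocabulary and can be merged.
merged = {key: from_numeric(value) for key, value in db_numeric.items()}
merged.update({key: from_letter(value) for key, value in db_letter.items()})
print(merged)
```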

The World Wide Web: the data objects are linked together. DM: understanding user access patterns, called mining path traversal patterns. This is still a promising discipline.

Data Mining Functionalities
What kinds of patterns can be mined?

1. Concept / Class Description: Characterisation and Discrimination

Data Characterisation: Summarisation of the general characteristics or features of a target class of data, e.g. summarise the characteristics of customers who spend more than $1000 a year at ABC stores.

Data Discrimination: Comparison of the general features of target class data objects with the general features of objects from one or a set of contrasting classes.
E.g. compare two groups of ABC customers
- those who shop for computer products regularly vs. those who rarely shop for them.

2. Association Analysis

Discovery of association rules showing attribute-value conditions that occur frequently together in a given set of data.

E.g. age(X, "20...29") ^ income(X, "20K...29K") => buys(X, "CD player")
[support = 2%, confidence = 60%]

where X is a variable representing a customer. The rule indicates that, of the ABC customers under study, 2% (support) are 20 to 29 years of age with an income of 20K to 29K and have purchased a CD player at ABC. There is a 60% probability (confidence) that a customer in this age and income group will purchase a CD player.
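
To make the two numbers concrete, the sketch below computes support and confidence for a rule of the same shape as the one above. The five customer records are invented for illustration, so the resulting values differ from the 2% and 60% quoted in the example.

```python
# Hypothetical customer records.
customers = [
    {"age": 25, "income": 25000, "buys_cd_player": True},
    {"age": 27, "income": 22000, "buys_cd_player": True},
    {"age": 23, "income": 28000, "buys_cd_player": False},
    {"age": 45, "income": 60000, "buys_cd_player": True},
    {"age": 35, "income": 40000, "buys_cd_player": False},
]


def antecedent(c):
    """Left-hand side: age 20-29 and income 20K-29K."""
    return 20 <= c["age"] <= 29 and 20000 <= c["income"] <= 29000


def consequent(c):
    """Right-hand side: the customer bought a CD player."""
    return c["buys_cd_player"]


def support_and_confidence(records, lhs, rhs):
    n_lhs = sum(1 for r in records if lhs(r))
    n_both = sum(1 for r in records if lhs(r) and rhs(r))
    # Support: share of ALL records satisfying both sides of the rule.
    support = n_both / len(records)
    # Confidence: share of records matching the left side that also match the right.
    confidence = n_both / n_lhs if n_lhs else 0.0
    return support, confidence


support, confidence = support_and_confidence(customers, antecedent, consequent)
print(f"support={support:.2f}, confidence={confidence:.2f}")
```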

3. Prediction: Predicts missing or unavailable data values rather than class labels.

4. Cluster Analysis

- Objects are clustered into groups on the basis of

- maximising the intraclass similarity.

- minimising the interclass similarity.
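
As one possible illustration of grouping by similarity, here is a tiny one-dimensional k-means sketch. k-means is used only as a familiar example; the notes do not name a specific clustering algorithm, and the points and starting centers are made up.

```python
def kmeans_1d(points, centers, iterations=10):
    """Very small k-means for 1-D data; returns (centers, clusters)."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers, clusters = kmeans_1d(points, centers=[1.0, 12.0])
print(centers, clusters)
```

Points within a cluster end up close to each other (high intraclass similarity), while the two cluster centers are far apart (low interclass similarity).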

5. Outlier Analysis

Data objects that do not comply with the general behaviour or model of the data are outliers.

- Most DM methods discard outliers as noise

- But in some applications, such as fraud detection, the outliers are more interesting than the regular data
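
A simple way to flag such objects is a z-score test, sketched below. This is one common statistical heuristic chosen for illustration, not a method named in the notes; the threshold of 2.0 and the sample amounts are assumptions.

```python
from statistics import mean, pstdev


def find_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical transaction amounts; one is unusually large.
amounts = [20, 22, 19, 21, 20, 23, 18, 500]
outliers = find_outliers(amounts)
print(outliers)
```

In a fraud setting the flagged values would be kept for inspection rather than discarded.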

6. Evolution Analysis

- Describes trends for objects whose behaviour changes over time

e.g. identifying regularities in the stock of a particular company from stock market data.

What makes a pattern interesting

A pattern is interesting if:

- It is easily understood by humans

- It is valid on new data with some degree of certainty

- It is potentially useful

- It is novel

- It validates a hypothesis that the user seeks to confirm

Remember: An interesting pattern represents knowledge

Objective measures of Pattern interestingness:

1. Support: Percentage of transactions that satisfy the given rule.

2. Confidence: Degree of certainty of the detected association.

Subjective Measures

Based on the user's beliefs about the data.

Thus a pattern is interesting if it:

- contradicts or

- supports the user's beliefs

Can a DM system generate all of the interesting patterns?

No. But user-provided constraints and interestingness measures can be used to focus the search.

Can a DM system generate only interesting patterns?

- This is an optimisation problem in DM

- It remains a challenging issue
