Data Mining Primitives, Languages and System Architecture
1. Data Mining Primitives: designed to facilitate efficient and fruitful knowledge discovery.
A data mining primitive (DMP) specifies:
a. The portion of the database relevant to the task
b. The kind of knowledge to be mined
c. Background knowledge useful in guiding the discovery process
d. Interestingness measures for pattern evaluation
e. How the discovered knowledge should be visualised
Use of DMQL (Data Mining Query Language)
It facilitates the DM system's communication with other information systems.
DM Primitives- What defines a DM Task
A DM query is defined in terms of the following primitives
1. Task-Relevant Data: the portion of the database to be investigated.
2. Kind of Knowledge to be Mined: the data mining function to be performed, e.g. characterisation, discrimination, association, clustering or evolution analysis.
3. Background Knowledge: knowledge about the domain to be mined. This includes concept hierarchies, which allow the data to be mined at multiple levels of granularity, and user beliefs about the data, which are used to evaluate patterns according to their degree of expectedness or unexpectedness.
4. Interestingness Measures: functions used to separate uninteresting patterns from knowledge, e.g. support (the percentage of task-relevant tuples for which a rule holds) and confidence (an estimate of the strength of the implication); see the sketch after this list.
5. Presentation and Visualisation of Discovered Patterns: the form in which the discovered patterns are displayed, e.g. rules, tables, charts, graphs, decision trees and cubes.
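For illustration, support and confidence for an association rule A => B can be computed directly from transaction counts. A minimal Python sketch (the transactions are toy data, not from the text):

# Toy transactions; each is a set of items (illustrative only).
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "butter"},
]

def support(itemset, db):
    # Fraction of transactions that contain every item in itemset.
    return sum(itemset <= t for t in db) / len(db)

def confidence(antecedent, consequent, db):
    # Estimate of P(consequent | antecedent).
    return support(antecedent | consequent, db) / support(antecedent, db)

print(support({"milk", "bread"}, transactions))       # 0.5
print(confidence({"milk"}, {"bread"}, transactions))  # 0.666...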
Concept Description - 2
Data Generalisation: It can be done in two ways
a. Attribute Removal
b. Attribute Generalisation
a. Attribute Removal: the rule is: if there is a large set of distinct values for an attribute of the initial working relation, but either
(i) there is no concept hierarchy defined for the attribute, or
(ii) its higher-level concepts are expressed in terms of other attributes,
then the attribute should be removed from the working relation.
b. Attribute Generalisation: if there is a large set of distinct values for an attribute in the initial working relation, and there exists a concept hierarchy on the attribute, then the concept hierarchy should be selected and applied to generalise the attribute.
3. Count and Aggregate Value Accumulation
Why needed
- To obtain a quantitative measure of generalisation.
How to Calculate
- A number called count is associated with each tuple in the initial working relation.
- Its value is initialised to 1.
- Through generalisation, groups of identical tuples arise.
- Such identical tuples are merged into one, with their counts accumulated.
e.g. suppose 52 data tuples all generalise to the same tuple T; they are merged into one tuple whose count is 52.
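A minimal Python sketch of count accumulation: identical generalised tuples are merged and their counts summed. The tuples and the one-level concept hierarchy (major -> Science) are assumed for illustration:

from collections import Counter

# Initial working relation; every tuple implicitly carries count = 1 (toy data).
rows = [("M", "Physics", "Canada"), ("F", "Chemistry", "Canada"),
        ("M", "Biology", "India"), ("F", "Physics", "Canada")]

def generalise(row):
    # Assumed hierarchy: every specific major generalises to "Science".
    gender, major, country = row
    return (gender, "Science", country)

# Identical generalised tuples are merged; their counts accumulate.
counts = Counter(generalise(r) for r in rows)
for tup, count in counts.items():
    print(tup, "count =", count)   # e.g. ('F', 'Science', 'Canada') count = 2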
Attribute Generalisation Control
It is the control of how high an attribute should be generalised.
Two Approaches to Generalisation Control
1. Attribute Generalisation Threshold Control (AGTC): it sets a generalisation threshold for each attribute.
- If the number of distinct values of an attribute is greater than the attribute threshold, further generalisation of that attribute should be performed.
2. Generalised Relation Threshold Control (GRTC)
It sets a threshold for the generalised relation.
- If the number of distinct tuples in the generalised relation is greater than the threshold, further generalisation should be performed.
Typically AGTC is applied first, then GRTC, to generalise the data.
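A minimal sketch of attribute generalisation threshold control, using an assumed toy hierarchy in which each value maps one level up (city -> country -> ANY):

# Assumed concept hierarchy (illustrative only).
hierarchy = {"Vancouver": "Canada", "Toronto": "Canada",
             "Mumbai": "India", "Delhi": "India",
             "Canada": "ANY", "India": "ANY"}

def generalise_attribute(values, threshold):
    # Keep climbing the hierarchy while the number of distinct values
    # exceeds the attribute generalisation threshold.
    while len(set(values)) > threshold:
        climbed = [hierarchy.get(v, v) for v in values]
        if climbed == values:        # no further generalisation possible
            break
        values = climbed
    return values

cities = ["Vancouver", "Toronto", "Mumbai", "Delhi", "Mumbai"]
print(generalise_attribute(cities, threshold=2))
# ['Canada', 'Canada', 'India', 'India', 'India']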
Data Mining Functionalities - Characterisation and Comparison
DM
- Descriptive DM
- Predictive DM
Concept Description
It is the most basic form of descriptive DM
- It describes a given set of task relevant data.
- In a concise and summarative manner
- presenting interesting general properties of the data
Concept Description Has
- Characterisation: it summarises and describes a collection of data called the target class.
- Comparison: it summarises and distinguishes
- one collection of data
- called the target class
- from the other collection(s) of data
- collectively called the contrasting class
Concept Characterisation
There are two approaches:
1. the data cube OLAP approach
2. Attribute Oriented Induction Approach
It can be implemented using
- relational structure
- data cube structure.
1. Data Cube (OLAP) Approach: computation is performed offline and the results are materialised in a multidimensional data cube; roll-up and drill-down on the cube then present the description at different levels of abstraction.
2. Attribute-Oriented Induction (AOI) Approach
Technique of AOI approach
a. Data Focussing:
It corresponds to the specification of the task-relevant data
e.g. the DMQL query
use Big_University_DB
mine characteristics as "Science_Students"
in relevance to name, gender, major
from student
where status in "graduate"
The table so obtained is called the Initial Working Relation.
Data Preprocessing - Part 2
Noisy Data
Noise is a random error or variance in a measured variable
Noise Smoothing Techniques
Binning
Smooths sorted data values by consulting their 'neighbourhood'
Performs local smoothing
In this example, the data for price are first sorted and then partitioned into equi-depth bins of depth 3.
Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partitioned into equi-depth bins:
Bin 1 : 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
In smoothing by bin means, each value in a bin is replaced by the bin's mean value, e.g. the mean of the values 4, 8, 15 in Bin 1 is 9, so each value in Bin 1 is replaced by 9.
Smoothing:
Bin 1: 9,9,9
Bin 2: 22,22,22
Bin 3: 29,29,29
Smoothing by bin boundaries: The min. and max values in a given bin are identified as the bin-boundaries.
Each bin value is then replaced by closest boundary value
Bin1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
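A minimal Python sketch reproducing the binning example: equi-depth bins of depth 3, smoothing by bin means and by bin boundaries:

prices = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
depth = 3
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

# Smoothing by bin means: every value is replaced by its bin's mean.
by_means = [[round(sum(b) / len(b)) for _ in b] for b in bins]

# Smoothing by bin boundaries: every value is replaced by the closer of min/max.
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

print(bins)       # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(by_means)   # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(by_bounds)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]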
Clustering
Outliers can be detected by clustering, where similar values are organised into groups, or 'clusters'; values that fall outside the clusters may be considered outliers.
Combined computer and human inspection
Here the computer identifies 'surprise' elements; a human can then sort through the patterns in the list to identify the actual garbage ones.
Regression
Data can be smoothed by fitting the data to a function, such as with regression
Inconsistent Data
To correct
- Manually, using external references
- e.g. errors made at data entry can be corrected by performing a paper trace.
- Knowledge management tools
- To find contradicting values
3.3 DATA INTEGRATION AND TRANSFORMATION
Integration
- Merging of data from multiple data sources
Transformation
- into forms appropriate for mining
3.3.1 Data Integration
- combines data from multiple sources into coherent data store
Issues to consider during Data integration
1. Schema integration
2. Redundancy
3. Detection and Resolution of data value conflicts
Details
1. Schema integration
- to match up equivalent real-world entities from multiple data sources
e.g. how can we be sure that customer_id in one database refers to the same entity as cust_no in another?
Solution: use metadata.
2. Redundancy
- An attribute is redundant if it can be derived from another attribute or table
- e.g. annual revenue
Solution: use correlation analysis
3. Detection and Resolution of Data Value Conflicts
- attribute values for the same real-world entity may differ from source to source
e.g. a weight attribute may be stored in metric units in one system and in British imperial units in another
3.3.2 Data Transformation
- Data are transformed or consolidated into forms appropriate for mining
1. Smoothing
- to remove noise from data
- binning
- clustering and
- Regression
2. Aggregation
- summary or aggregation operations are applied to the data, e.g. daily sales data are aggregated into monthly sales
3. Generalisation
- low-level data are replaced by higher-level concepts, e.g. age can be replaced by young, middle-aged or senior
4. Normalisation
- Attribute data are scaled so as to fall within a small specified range eg. -1.0 to 1.0 or 0.0 to 1.0
It is of three types:
- min-max normalisation
- Z-score normalisation
- Normalisation by decimal scaling
Min- Max Normalisation
Let min_A be the minimum value of an attribute A,
max_A the maximum value of A,
v the value of A we want to normalise,
and [m, M] the new range to map into.
Then the normalised value v' is
v' = ((v - min_A) * (M - m) / (max_A - min_A)) + m
e.g. let the minimum and maximum values of the attribute 'income' be $12,000 and $98,000, and let the new range be [0.0, 1.0]. Then $73,600 is normalised by min-max normalisation as:
((73600 - 12000) * (1.0 - 0.0) / (98000 - 12000)) + 0 = 0.716
b. Z-Score Normalisation
Let A' and s be the mean and standard deviation of an attribute A; then
v' = (v - A') / s
eg. if mean = $54000 and s.d.= $16000 with z-score normalisation, a value of $73600 will be normalised to
(73600-54000)/16000 = 1.225
Normalisation by Decimal Scaling
Normalises by moving the decimal point of the values of attribute A:
v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
e.g. if A ranges from -986 to 917, we divide each value by 1000 (j = 3); therefore -986 normalises to -0.986 and 917 to 0.917.
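A minimal Python sketch of the three normalisation methods, reproducing the figures from the examples above:

import math

def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    # Min-max normalisation into the range [new_min, new_max].
    return (v - min_a) * (new_max - new_min) / (max_a - min_a) + new_min

def z_score(v, mean, std):
    # Z-score normalisation using the attribute's mean and standard deviation.
    return (v - mean) / std

def decimal_scaling(values):
    # j is chosen so that the largest absolute normalised value is below 1
    # (assumes the maximum absolute value is not an exact power of 10).
    j = math.ceil(math.log10(max(abs(v) for v in values)))
    return [v / 10 ** j for v in values]

print(round(min_max(73600, 12000, 98000), 3))   # 0.716
print(round(z_score(73600, 54000, 16000), 3))   # 1.225
print(decimal_scaling([-986, 917]))             # [-0.986, 0.917]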
Data Reduction
- to obtain a reduced representation of the data set that is
- much smaller in volume
- yet closely maintains the integrity of the original data
Strategies for Data Reduction
1. Data Cube Aggregation
- Aggregation operations are applied to the construction of a data cube
2. Dimension Reduction
Where irrelevant or redundant dimensions are detected and removed
3. Data Compression
Encoding mechanisms are used to reduce the data set size
4. Numerosity Reduction
where data are replaced by alternative smaller data representation
5. Discretisation and Concept Hierarchy Generation
Where raw data values for attributes are replaced by ranges or higher conceptual levels.
3.4.1 Data Cube Aggregation
e.g. the data can be aggregated so that the resulting data summarise total sales per year instead of per quarter.
3.4.2 Dimensionality Reduction
- Irrelevant and redundant dimensions (attributes) are detected and removed
How to find the best attributes
Methods
1. Stepwise Forward Selection (see the sketch after this list)
- start with an empty set of attributes
- at each step, determine the best of the remaining original attributes and add it to the set
- repeat the previous step until a stopping condition is reached
2. Stepwise Backward elimination
- start with full set of attributes
- at each step, remove the worst attribute
3. Combination of Forward Selection and Backward Elimination
- at each step, the procedure selects the best attribute and removes the worst from among the remaining attributes
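A minimal sketch of stepwise forward selection. The scoring function here is a toy stand-in for whatever attribute-relevance test (e.g. information gain) would be used in practice:

def forward_selection(attributes, score, k):
    # Greedy stepwise forward selection of up to k attributes;
    # score(subset) is an assumed evaluation function (higher is better).
    selected, remaining = [], list(attributes)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda a: score(selected + [a]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy relevance scores standing in for a real attribute-evaluation measure.
relevance = {"age": 3, "income": 5, "gender": 1, "zipcode": 2}
print(forward_selection(relevance, lambda s: sum(relevance[a] for a in s), k=2))
# ['income', 'age']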
3.4.3 Data Compression
Methods of lossy data compression
1. Wavelet transforms
Discrete wavelet transform (DWT)
- a linear signal processing technique
- when applied to a data vector D, it transforms D into a numerically different vector D' of wavelet coefficients
- the two vectors are of the same length
Q. If the wavelet-transformed data are of the same length as the original, how can we reduce the data?
- the wavelet coefficient data can be truncated, retaining only the strongest coefficients
Applications:
- compression of fingerprint images
- computer vision
- analysis of time series data
- data cleaning
2. Principal component Analysis
It has four steps
1. The input data are normalised so that each attribute falls within the same range.
2. PCA computes vectors that provide a basis for the normalised input data. These vectors are called principal components.
- the input data are expressed as a linear combination of the principal components
3. The PCs are sorted in order of decreasing significance or strength
4. Since the components are sorted the size of data can be reduced by eliminating the weaker components
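A minimal NumPy sketch of PCA following the four steps above (random toy data; the number of retained components k is an assumption):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # 100 tuples, 5 attributes (toy data)

# Step 1: normalise so each attribute falls within a comparable range.
Xn = (X - X.mean(axis=0)) / X.std(axis=0)

# Steps 2-3: the principal components are the right singular vectors of Xn;
# SVD returns them already sorted by decreasing strength (singular value).
U, s, Vt = np.linalg.svd(Xn, full_matrices=False)

# Step 4: keep only the k strongest components to reduce the data.
k = 2
X_reduced = Xn @ Vt[:k].T              # projected data, shape (100, 2)
print(X_reduced.shape)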
3.4.4 Numerosity Reduction
Techniques
1. Regression and Log-Linear Models: in linear regression, the data are modelled to fit a straight line.
Log-linear models approximate discrete multidimensional probability distributions.
Histogram
- popular form of data reduction
- A histogram partitions the data for an attribute into disjoint subsets, or buckets; the height of a bucket typically reflects the average frequency of the values it represents.
Clustering
In this cluster representations of the data are used to replace the actual data
Sampling
Allows a large data set to be represented by a much smaller random sample of the data
Possible Samples
1. Simple Random Sample without Replacement- SRSWOR- drawing n of N tuples from data set D.
2. Simple Random Sample with Replacement- similar to SRSWOR- except that each time a tuple is drawn from D, it is recorded then replaced
3. Cluster Sample: tuples in D are grouped into M mutually disjoint 'clusters'; an SRS of m clusters can then be obtained, where m < M.
4. Stratified Sample: D is divided into mutually disjoint parts called strata; a stratified sample of D is generated by obtaining an SRS at each stratum.
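A minimal Python sketch of SRSWOR, SRSWR and a stratified sample (toy data; the even/odd split is an assumed stratification key):

import random

random.seed(0)
D = list(range(1, 21))                         # toy data set of N = 20 tuples

# SRSWOR: n tuples drawn without replacement (no repeats possible).
srswor = random.sample(D, 5)

# SRSWR: each drawn tuple is recorded and then replaced, so repeats are possible.
srswr = [random.choice(D) for _ in range(5)]

# Stratified sample: divide D into disjoint strata and take an SRS in each.
strata = {"even": [x for x in D if x % 2 == 0],
          "odd": [x for x in D if x % 2 == 1]}
stratified = {name: random.sample(group, 2) for name, group in strata.items()}

print(srswor, srswr, stratified)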
Discretisation and Concept Hierarchy Generation
- Used to reduce the number of values for a given continuous attribute
- By dividing the range of the attribute into intervals
- Interval labels can then be used to replace actual data values
Concept Hierarchy
Defines a discretisation of the attribute
- can be used to reduce the data
- by collecting and replacing low-level concepts (eg. age) by higher-level concepts (young, middle-aged, senior)
How to Generate Concept Hierarchies
- Binning- Smoothing by bin means or medians
Histogram Analysis
Cluster Analysis
- Each cluster can be decomposed to further subclusters- thus forming a concept hierarchy
Entropy-Based Discretisation
1. Each value T of a numeric attribute A is a candidate interval boundary that partitions the samples S into S1 (values <= T) and S2 (values > T).
2. The boundary chosen is the T that minimises the expected information requirement
I(S, T) = (|S1|/|S|) ent(S1) + (|S2|/|S|) ent(S2)
where ent() is the entropy function
ent(S1) = - Σ (i = 1 to m) p_i log2(p_i)
and p_i is the probability of class i in S1.
3. Apply step 2 recursively to each partition while
ent(S) - I(S, T) > δ
where δ is a small threshold used as the stopping criterion.
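A minimal Python sketch of one step of entropy-based discretisation: it evaluates every candidate boundary T and returns the one minimising I(S, T). The labelled values are toy data:

import math
from collections import Counter

def ent(labels):
    # Entropy of a list of class labels.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(samples):
    # samples: list of (value, class_label); returns (T, I(S, T)) with minimal I.
    best = None
    for t in sorted({v for v, _ in samples})[:-1]:
        s1 = [c for v, c in samples if v <= t]
        s2 = [c for v, c in samples if v > t]
        info = (len(s1) * ent(s1) + len(s2) * ent(s2)) / len(samples)
        if best is None or info < best[1]:
            best = (t, info)
    return best

data = [(25, "low"), (32, "low"), (47, "high"), (51, "high"), (60, "high")]
print(best_split(data))   # (32, 0.0): splitting at 32 separates the classes perfectly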
Segmentation by Natural Partitioning
- In this approach, numerical ranges are partitioned into relatively uniform, natural-looking intervals.
e.g. annual salaries broken into ranges like ($50,000, $60,000)
are more desirable than ranges like
($51,263.98, $60,872.34)
Data Preprocessing - Part 1
- Data Cleaning
- Data Transformation
- Data Integration
- Data Reduction
1. Data Cleaning
These routines attempt to
- fill the missing values
-smooth out noise while identifying outliers
- correct inconsistencies in the data
Missing Values: suppose many tuples have no recorded value for an attribute, e.g. customer income.
How to fill the missing values:
1. Ignore the tuple
- not very effective
- useful when the tuple has several missing values
2. Fill in the missing values manually
- time consuming
- not feasible when there are many missing values
3. Use a global constant to fill in the missing value
- e.g. a label such as "Unknown" or -∞
- if many values share this label, the mining program may mistake it for an interesting concept
4. Use the attribute mean to fill in the missing values
e.g. Average income of customer is $28000. Use this value
5. Use the attribute mean for all samples belonging to the same class as the given tuple
eg. if classifying customers according to credit risk, replace the missing value with the average income value for customers in the same credit risk category as that of the given tuple.
6. Use the most probable value to fill in the missing value by
- regression
- inference-based tools such as decision tree induction
Methods 3-6 bias the data. Method 6 is a popular strategy.
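A minimal Python sketch contrasting method 4 (fill with the overall attribute mean) and method 5 (fill with the mean of the tuple's own class); the records and class labels are toy data:

# Each record: (credit_risk_class, income); None marks a missing income (toy data).
records = [("low", 40000), ("low", None), ("high", 20000),
           ("high", 24000), ("high", None)]

def mean(values):
    present = [v for v in values if v is not None]
    return sum(present) / len(present)

# Method 4: fill every missing income with the overall attribute mean.
overall_mean = mean(v for _, v in records)
filled_overall = [(c, v if v is not None else overall_mean) for c, v in records]

# Method 5: fill with the mean income of the same class (same credit-risk category).
class_means = {c: mean(v for cls, v in records if cls == c) for c, _ in records}
filled_by_class = [(c, v if v is not None else class_means[c]) for c, v in records]

print(filled_overall)    # missing values become 28000.0
print(filled_by_class)   # 'low' missing -> 40000.0, 'high' missing -> 22000.0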
What exactly is a DW
A DW is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process.
Subject-Oriented
A DW is organised around major subjects such as customer, supplier, product and sales; it excludes data that are not useful in the decision-making process.
Integrated
A DW is usually constructed by integrating multiple heterogeneous sources, e.g. relational databases and flat files.
Time-Variant
Provides information from a historical perspective ( past 5-10 years)
Non –Volatile
A DW does not require transaction processing, recovery or concurrency control mechanisms. It requires only two operations: initial loading of data and access of data.
What is data Warehousing
It is the process of constructing and using data warehouses.
How are organizations using information from dW
- Increasing customer focus
o Analysis of customer buying patterns
- Repositioning Products and Managing Product Portfolios
o Compare the performance of sales by quarter or territory.
- Analysing operations and looking for sources of profit
- Managing customer relationships
- Making environmental corrections
- Managing the cost of corporate assets
DWing is useful from the Point of View of heterogeneous database integration
- to integrate data from diverse databases.
It is popular because
- Update-driven rather than query-driven approach
o Information from heterogeneous sources is integrated in advance and stored in the warehouse for direct querying and analysis.
o This brings high performance to the integrated database system.
o Query processing in the DW does not interfere with processing at local sources.
o The DW can store and integrate historical information and support complex multidimensional queries.
Difference Between Operational DB Systems and DW
ODBS: Major task: To perform online transaction and query processing- these are called on-line transaction processing (OLTP) systems.
- They cover most of the day-to-day operations of an organisation, e.g. purchasing, inventory, etc.
DW: serves users or knowledge workers in the role of data analysis and decision making. Such systems are known as On-Line Analytical Processing (OLAP) systems.
Why not perform OLAP directly on operational dB rather than having a DW
- An operational DB is designed and tuned for known, routine day-to-day tasks, whereas DW queries are often complex and ad hoc.
- An operational DB requires concurrency control and recovery mechanisms to protect its transactions; running long, complex OLAP queries on the same system would degrade the performance of these operational tasks.
- Decision support requires historical data, whereas operational databases do not typically maintain historical data.
- Operational DBs contain only raw data, such as transactions, which need to be consolidated before analysis.
However, operational databases that also support data warehousing are beginning to appear.
Data Warehousing Architecture
Views regarding design of dW
Top-down View: It allows selection of relevant information necessary for dW.
Data Source View: exposes the information being captured, stored and managed by operational systems.
Data Warehouse View: It includes fact tables and dimension tables. It represents the info that is stored inside the dW.
Business Query View: the perspective of the data in the DW from the viewpoint of the end user.
Skills required in Building a dW
Business skills: understanding how the systems store and manage their data, how to build extractors that transfer data from the operational system to the DW, and how to build refresh software that keeps the DW data reasonably up to date.
Technology skills: the ability to make assessments from quantitative information.
Program management skills: the need to interface with many technologies, vendors and end users.
Process of DW design
Steps in WH design process
Choose a business process to model, e.g. orders, invoices, sales, etc.
Choose the grain of the business process (the grain is the fundamental, atomic level of data to be represented in the fact table; a fact table is a large central table containing the bulk of the data, with no redundancy).
Choose the dimensions that will apply to each fact table record, e.g. time, item, customer, supplier.
Choose the measures that will populate each fact table record, e.g. dollars_sold and units_sold.