Data Mining Primitives, Languages and System Architecture
1. Data Mining Primitives: Designed to facilitate efficient and fruitful knowledge discovery.
A DMP includes:
a. Specification of the relevant portion of the DB
b. The kind of knowledge to be mined
c. Background knowledge useful in guiding the discovery process
d. Interestingness measures for pattern evaluation
e. How the discovered knowledge should be visualised
Use of DMQL (Data Mining Query Language)
It facilitates communication between the DM system and other information systems.
DM Primitives: What Defines a DM Task
A DM query is defined in terms of the following primitives:
1. Task-Relevant Data: This is the portion of the DB to be investigated.
2. Kind of Knowledge to be Mined: This specifies the data mining function to be performed, e.g. characterisation, discrimination, association, clustering or evolution analysis.
3. Background Knowledge: This is knowledge about the domain to be mined. It includes concept hierarchies, which allow the data to be mined at different levels of granularity. It also helps in evaluating the discovered patterns according to how expected or unexpected they are.
4. Interestingness Measures: These functions are used to separate uninteresting patterns from knowledge. They include support (the % of tuples for which the pattern holds) and confidence (the degree of certainty of the pattern).
5. Presentation and Visualisation of Discovered Patterns: The forms include rules, tables, charts, graphs, decision trees and cubes.
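The support and confidence measures mentioned in primitive 4 can be illustrated with a small sketch. It assumes the usual association-rule definitions (support is the fraction of tuples containing both sides of the rule, confidence is the fraction of tuples containing the left side that also contain the right side). The function name and the sample transactions are made up for illustration only.

```python
def support_and_confidence(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    total = len(transactions)
    # Count tuples that contain the left side, and tuples that contain both sides.
    with_antecedent = sum(1 for t in transactions if antecedent <= set(t))
    with_both = sum(1 for t in transactions if (antecedent | consequent) <= set(t))
    support = with_both / total if total else 0.0
    confidence = with_both / with_antecedent if with_antecedent else 0.0
    return support, confidence


# Hypothetical task-relevant data: four purchase tuples.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
support, confidence = support_and_confidence(transactions, {"bread"}, {"milk"})
print(f"support={support:.2f}, confidence={confidence:.2f}")
```

A pattern would be treated as interesting only if both values pass user-chosen minimum thresholds.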
Wednesday, September 16, 2009
Tuesday, April 28, 2009
Concept Description-2
Data Generalisation: It can be done in two ways
a. Attribute Removal
b. Attribute Generalisation
a. Attribute Removal: The rule is "If there is a large set of distinct values for an attribute of the initial working relation, but either
(i) there is no concept hierarchy defined for the attribute, or
(ii) its higher-level concepts are expressed in terms of other attributes,
then the attribute should be removed from the working relation."
b. Attribute Generalisation
If there is a large set of distinct values for an attribute in the initial working relation, and there exists a concept hierarchy on the attribute, then that concept hierarchy should be selected and applied to the attribute.
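The two rules above can be read as a single decision per attribute. The sketch below is only an illustration: the function name, the flags and the cut-off used to decide what counts as a "large set" of distinct values are all assumptions, not part of the notes.

```python
def generalisation_action(values, has_hierarchy, expressed_by_other_attributes=False,
                          large_set_cutoff=5):
    """Decide what to do with one attribute of the initial working relation.

    large_set_cutoff is a hypothetical stand-in for "a large set of distinct values".
    """
    distinct = len(set(values))
    if distinct <= large_set_cutoff:
        return "keep"          # not a large set: neither rule applies
    if not has_hierarchy or expressed_by_other_attributes:
        return "remove"        # attribute removal rule
    return "generalise"        # attribute generalisation rule


# Hypothetical attributes: names have no concept hierarchy, cities do.
names = [f"student_{i}" for i in range(100)]
cities = [f"city_{i % 20}" for i in range(100)]
genders = ["M", "F"] * 50

print(generalisation_action(names, has_hierarchy=False))
print(generalisation_action(cities, has_hierarchy=True))
print(generalisation_action(genders, has_hierarchy=False))
```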
3. Count and Aggregate Value Accumulation
Why needed
- To obtain a quantitative measure of generalisation.
How to Calculate
- A number called count is associated with each tuple in the initial working relation.
- Its value is initialised to 1.
- Generalisation produces groups of identical tuples.
- Such identical tuples are merged into one, with their counts accumulated.
e.g. suppose 52 data tuples are all generalised to the same tuple, say T. These are then merged to form one tuple whose count is 52.
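The count accumulation step can be sketched in a few lines. This is a minimal illustration using Python's standard `collections.Counter`; the function name and the sample tuples are made up for the example.

```python
from collections import Counter


def merge_generalised(tuples):
    """Merge identical generalised tuples and accumulate their counts."""
    counts = Counter()
    for t in tuples:
        counts[t] += 1  # every original tuple contributes its initial count of 1
    return dict(counts)


# Hypothetical data: 52 tuples generalise to the same tuple T, 3 to another one.
T = ("Science", "Canada")
generalised = [T] * 52 + [("Arts", "Foreign")] * 3
merged = merge_generalised(generalised)
print(merged[T])
```

Other aggregates, such as a sum over a numeric attribute, could be accumulated in the same loop.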
Attribute Generalisation Control
It controls how high an attribute should be generalised.
Two Approaches to Generalisation Control
1. Attribute Generalisation Threshold Control (AGTC): It sets a generalisation threshold for the attributes.
- If the number of distinct values of an attribute is greater than the attribute threshold, that attribute should be generalised further.
2. Generalised Relation Threshold Control (GRTC): It sets a threshold for the generalised relation.
- If the number of distinct tuples in the generalised relation is greater than the threshold, the relation should be generalised further.
Here AGTC is applied first and then GRTC, to generalise the data.
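A rough sketch of the two threshold checks follows. It assumes the concept hierarchy is given as a child-to-parent mapping and that generalising means climbing one level at a time. The function names, the hierarchy and the threshold values are hypothetical.

```python
def generalise_attribute(values, parent_of, attribute_threshold):
    """AGTC: climb the hierarchy until distinct values fit the attribute threshold."""
    current = list(values)
    while len(set(current)) > attribute_threshold:
        climbed = [parent_of.get(v, v) for v in current]
        if climbed == current:  # top of the hierarchy reached, cannot go higher
            break
        current = climbed
    return current


def needs_further_generalisation(relation, relation_threshold):
    """GRTC: True if the relation still has more distinct tuples than allowed."""
    return len(set(relation)) > relation_threshold


# Hypothetical concept hierarchy: city -> country -> continent.
parent_of = {
    "Vancouver": "Canada", "Toronto": "Canada",
    "Seattle": "USA", "Chicago": "USA",
    "Canada": "North America", "USA": "North America",
}
cities = ["Vancouver", "Toronto", "Seattle", "Chicago", "Vancouver"]
genders = ["M", "F", "M", "F", "M"]

countries = generalise_attribute(cities, parent_of, attribute_threshold=2)
relation = list(zip(countries, genders))
print(countries)
print(needs_further_generalisation(relation, relation_threshold=3))
```

The attribute check runs first, and the relation check then reports whether the result is still too large.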
Saturday, April 25, 2009
Data Mining Functionalities- 1
Data Mining Functionalities- Characterisation and Comparison
DM is of two types:
- Descriptive DM
- Predictive DM
Concept Description
It is the most basic form of descriptive DM.
- It describes a given set of task-relevant data
- in a concise and summarative manner,
- presenting interesting general properties of the data.
Concept Description Has
- Characterisation: It summarises and describes a collection of data called the target class.
- Comparison: It summarises and distinguishes one collection of data, called the target class, from the other collection(s) of data, collectively called the contrasting class.
Concept Characterisation
There are two approaches:
1. The data cube (OLAP) approach
2. The Attribute-Oriented Induction (AOI) approach
It can be implemented using
- relational structure
- data cube structure.
Attribute-Oriented Induction (AOI) Approach
Technique of the AOI approach
a. Data Focusing:
It corresponds to the specification of task-relevant data.
e.g.
"use" - Big University DB
"mine characteristics as" - Science Students
"in relevance to" - name, gender, major
"from" - student
"where" - status
"in" - graduate
The table so obtained is called the Initial Working Relation.
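Data focusing amounts to a selection followed by a projection. The sketch below shows this on a few made-up student records. The function name and the sample rows are assumptions for illustration, and only the attribute names (name, gender, major, status) come from the example above.

```python
def initial_working_relation(rows, relevant_attributes, condition):
    """Keep rows that satisfy the condition, projected onto the relevant attributes."""
    return [
        {attribute: row[attribute] for attribute in relevant_attributes}
        for row in rows
        if condition(row)
    ]


# Hypothetical contents of the student table.
student = [
    {"name": "A. Rao", "gender": "F", "major": "Physics", "status": "graduate"},
    {"name": "B. Singh", "gender": "M", "major": "Chemistry", "status": "undergraduate"},
    {"name": "C. Mehta", "gender": "M", "major": "Biology", "status": "graduate"},
]

iwr = initial_working_relation(
    student,
    relevant_attributes=["name", "gender", "major"],
    condition=lambda row: row["status"] == "graduate",
)
for row in iwr:
    print(row)
```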