Noisy Data
Noise is a random error or variance in a measured variable.
Noise Smoothing Techniques
Binning
Smooths sorted data by consulting its 'neighbourhood'
Performs local smoothing
In this example, the data for price are first sorted and then partitioned into equidepth bins of depth 3.
Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partitioned into equidepth bins:
Bin 1 : 4, 8, 15
Bin 2: 21, 21, 24
Bin 3: 25, 28, 34
In smoothing by bin means, each value in a bin is replaced by the bin mean, e.g. the mean of the values 4, 8, 15 in Bin 1 is 9, therefore each value in Bin 1 is replaced by 9
Smoothing:
Bin 1: 9,9,9
Bin 2: 22,22,22
Bin 3: 29,29,29
Smoothing by bin boundaries: The minimum and maximum values in a given bin are identified as the bin boundaries.
Each bin value is then replaced by the closest boundary value
Bin1: 4, 4, 15
Bin 2: 21, 21, 24
Bin 3: 25, 25, 34
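A minimal Python sketch (not part of the original notes) that reproduces both kinds of smoothing on the price data above; the bin depth of 3 is taken from the example:

# Smoothing by bin means and by bin boundaries (equidepth bins of depth 3).

def equidepth_bins(values, depth):
    """Split already-sorted values into bins of equal size (depth)."""
    return [values[i:i + depth] for i in range(0, len(values), depth)]

def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closest of the bin's min/max boundaries."""
    smoothed = []
    for b in bins:
        lo, hi = min(b), max(b)
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equidepth_bins(prices, 3)
print(smooth_by_means(bins))        # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(smooth_by_boundaries(bins))   # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]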
Clustering
Outliers can be detected by clustering. Here similar values are organised into groups, or 'clusters'; values that fall outside the clusters may be considered outliers.
Combined computer and human inspection
Here, with the help of a computer, 'surprise' elements are identified. A human can then sort through the patterns in the list to identify the actual garbage ones.
Regression
Data can be smoothed by fitting the data to a function, such as with regression
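As an illustration of smoothing by regression, the sketch below (assumed, not from the notes) fits a straight line to the price values from the binning example using NumPy and replaces each value by its fitted value; treating the position in the sorted list as the x variable is just for illustration:

import numpy as np

# Smoothing by regression: fit a straight line y = a*x + b to the data
# and replace each value by the value predicted by the line.
x = np.arange(9)
y = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34], dtype=float)

a, b = np.polyfit(x, y, deg=1)   # least-squares straight-line fit
smoothed = a * x + b             # the fitted (smoothed) values
print(np.round(smoothed, 1))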
Inconsistent Data
To correct:
- Manually through external references
- e.g. errors made at data entry can be corrected by performing a paper trace
- Knowledge engineering tools
- to find contradicting values
3.3 DATA INTEGRATION AND TRANSFORMATION
Integration
- Merging of data from multiple data sources
Transformation
- into forms appropriate for mining
3.3.1 Data Integration
- combines data from multiple sources into a coherent data store
Issues to consider during Data integration
1. Schema integration
2. Redundancy
3. Detection and Resolution of data value conflicts
Details
1. Schema integration
- to match up equivalent real-world entities from multiple data sources
e.g. How can we be sure that customer_id in one database is the same attribute as cust_no in another?
Solution: use metadata
2. Redundancy
- An attribute is redundant if it can be derived from another attribute or table
- e.g. annual revenue
Solution: use correlation analysis
3. Detection and Resolution of Data Value Conflicts
- attribute values from different sources may differ
e.g. a weight attribute may be stored in metric units in one system and in British imperial units in another
3.3.2 Data Transformation
- Data are transformed or consolidated into forms appropriate for mining
1. Smoothing
- to remove noise from data
- binning
- clustering
- regression
2. Aggregation
- summary or aggregation operations are applied to the data, e.g. daily sales data are aggregated to compute monthly sales
3. Generalisation
- low-level data are replaced by higher-level concepts, e.g. age can be replaced by young, middle-aged, or senior
4. Normalisation
- Attribute data are scaled so as to fall within a small specified range eg. -1.0 to 1.0 or 0.0 to 1.0
It is of three types:
- min-max normalisation
- z-score normalisation
- normalisation by decimal scaling
Min-Max Normalisation
Let minA = the minimum value of attribute A
maxA = the maximum value of attribute A
v = the value of attribute A we want to normalise
[m, M] = the new range we want to map to (so M - m is the width of the new range)
then the normalised value v' is
v' = ((v - minA) * (M - m) / (maxA - minA)) + m
e.g. Let the minimum and maximum values of an attribute 'income' be $12,000 and $98,000, and let the new range be [0.0, 1.0]. Then $73,600 is normalised by min-max normalisation as:
((73600 - 12000) * (1.0 - 0.0) / (98000 - 12000)) + 0.0 = 0.716
Z-Score Normalisation
Let meanA and stdA be the mean and standard deviation of attribute A; then
v' = (v - meanA) / stdA
e.g. if the mean = $54,000 and the standard deviation = $16,000, then with z-score normalisation a value of $73,600 is normalised to
(73600 - 54000) / 16000 = 1.225
Normalisation by Decimal Scaling
- normalises by moving the decimal point of the values of attribute A:
v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
e.g. if A ranges from -986 to 917, we normalise by dividing each value by 1000 (j = 3); therefore -986 normalises to -0.986
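A small sketch of all three normalisation methods, reusing the worked figures above; the function names are my own:

import math

def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Min-max normalisation to the range [new_min, new_max]."""
    return (v - min_a) * (new_max - new_min) / (max_a - min_a) + new_min

def z_score(v, mean_a, std_a):
    """Z-score normalisation."""
    return (v - mean_a) / std_a

def decimal_scaling(v, max_abs):
    """Divide by 10^j, j being the smallest integer with max(|v'|) < 1."""
    j = math.floor(math.log10(max_abs)) + 1
    return v / (10 ** j)

print(round(min_max(73600, 12000, 98000), 3))   # 0.716
print(round(z_score(73600, 54000, 16000), 3))   # 1.225
print(decimal_scaling(-986, 986))               # -0.986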
3.4 DATA REDUCTION
- to obtain a reduced representation of the data set that is
- much smaller in volume
- yet closely maintains the integrity of the original data
Strategies for Data Reduction
1. Data Cube Aggregation
- Aggregation operations are applied to the data in the construction of a data cube
2. Dimension Reduction
Where irrelevant or redundant dimensions are detected and removed
3. Data Compression
Encoding mechanisms are used to reduce the data set size
4. Numerosity Reduction
where data are replaced by alternative smaller data representation
5. Discretisation and Concept Hierarchy Generation
Where raw data values for attributes are replaced by ranges or higher conceptual levels.
3.4.1 Data Cube Aggregation
e.g. the data can be aggregated so that the resulting data summarise total sales per year instead of per quarter, as in the sketch below.
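For illustration, a tiny sketch (figures invented) that rolls quarterly sales up to yearly totals, which is the kind of aggregation a data cube stores at a higher level:

from collections import defaultdict

# Data cube aggregation: roll quarterly sales up to yearly totals
# (illustrative figures only).
quarterly_sales = [
    (2007, "Q1", 224.0), (2007, "Q2", 408.0),
    (2007, "Q3", 350.0), (2007, "Q4", 586.0),
    (2008, "Q1", 300.0), (2008, "Q2", 410.0),
]

yearly = defaultdict(float)
for year, quarter, amount in quarterly_sales:
    yearly[year] += amount

print(dict(yearly))   # {2007: 1568.0, 2008: 710.0}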
3.4.2 Dimensionality Reduction
- Irrelevant and redundant dimensions are detected and removed
How to find the best attributes?
Methods
1. Stepwise Forward Selection
- Start with an empty set of attributes
- At each step, the best of the remaining original attributes is determined and added to the set
- Repeat this step until a stopping condition is met (a sketch follows after this list)
2. Stepwise Backward elimination
- Start with the full set of attributes
- At each step, remove the worst remaining attribute
3. Combination of Backward and Forward
- At each step, the procedure selects the best attribute and removes the worst from among the remaining attributes
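A rough sketch of the forward-selection variant mentioned above. The scoring function score(subset) is an assumption of the example: in practice it could be the accuracy of a model trained on that attribute subset, or an information-gain style measure:

def forward_selection(attributes, score, max_attrs=None):
    """Greedy stepwise forward selection.

    score(subset) is assumed to return a quality measure for a set of
    attributes (higher is better). Starting from the empty set, the
    attribute that improves the score the most is added at each step,
    until no attribute improves it further.
    """
    selected, best_score = [], float("-inf")
    while attributes and (max_attrs is None or len(selected) < max_attrs):
        candidate, candidate_score = None, best_score
        for a in attributes:
            s = score(selected + [a])
            if s > candidate_score:
                candidate, candidate_score = a, s
        if candidate is None:          # no attribute improves the score
            break
        selected.append(candidate)
        attributes = [a for a in attributes if a != candidate]
        best_score = candidate_score
    return selected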
3.4.3 Data Compression
Methods of lossy data compression
1. Wavelet transforms
Discrete Wavelet Transform (DWT)
- a linear signal processing technique
- when applied to a data vector D, it transforms it into a numerically different vector D' of wavelet coefficients
- the two vectors are of the same length
Q. If the wavelet-transformed data are of the same length as the original data, how can we reduce the data?
- the wavelet coefficients can be truncated, retaining only the strongest ones (a sketch follows after the list below)
Applications of the DWT include:
- compression of fingerprint images
- computer vision
- analysis of time series data
- data cleaning
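The sketch below illustrates the truncation idea with a hand-rolled one-level Haar transform (the simplest DWT) rather than a wavelet library; the data vector is invented and assumed to have even length:

import numpy as np

def haar_dwt(x):
    """One-level Haar DWT: pairwise averages (approximation) and pairwise
    differences (detail), scaled so the transform is orthonormal.
    Assumes len(x) is even."""
    x = np.asarray(x, dtype=float)
    approx = (x[0::2] + x[1::2]) / np.sqrt(2)
    detail = (x[0::2] - x[1::2]) / np.sqrt(2)
    return np.concatenate([approx, detail])   # same length as the input

def truncate(coeffs, keep):
    """Keep only the `keep` largest-magnitude coefficients; zero the rest."""
    out = np.zeros_like(coeffs)
    idx = np.argsort(np.abs(coeffs))[-keep:]
    out[idx] = coeffs[idx]
    return out

data = np.array([2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0])
coeffs = haar_dwt(data)
print(truncate(coeffs, keep=4))   # compressed representation, same length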
2. Principal Component Analysis (PCA)
It has four steps
1. The input data are normalised so that each attribute falls within the same range.
2. PCA computes orthonormal vectors that provide a basis for the normalised input data. These vectors are called the principal components.
- the input data can be expressed as a linear combination of the principal components
3. The PCs are sorted in order of decreasing significance or strength
4. Since the components are sorted the size of data can be reduced by eliminating the weaker components
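A NumPy sketch (my own, not from the notes) that follows the four steps above: normalise the attributes, compute the principal components from the covariance matrix, sort them by strength, and keep only the k strongest; the sample matrix is illustrative:

import numpy as np

def pca_reduce(X, k):
    """Reduce X (n samples x d attributes) to its k strongest principal
    components, following the four steps described above."""
    # 1. Normalise each attribute (zero mean, unit variance).
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Principal components = eigenvectors of the covariance matrix.
    eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
    # 3. Sort the components by decreasing significance (eigenvalue).
    order = np.argsort(eigvals)[::-1]
    eigvecs = eigvecs[:, order]
    # 4. Keep only the k strongest components and project the data onto them.
    return Z @ eigvecs[:, :k]

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])
print(pca_reduce(X, k=1))   # the data re-expressed in one dimension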
3.4.4 Numerosity Reduction
Techniques
1. Regression and Log-Linear Models - In linear regression, the data are modelled to fit a straight line.
Log-linear models approximate discrete multidimensional probability distributions.
Histogram
- popular form of data reduction
- A histogram partitions the data for an attribute into disjoint subsets, or buckets; the height of each bucket typically reflects the average frequency of the values it represents
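For example, numpy.histogram can bucket the price data from the binning example into three equal-width buckets (equal-width bucketing is one common rule; the equidepth bins used earlier are another):

import numpy as np

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
counts, edges = np.histogram(prices, bins=3)   # three equal-width buckets
print(counts)   # number of values falling in each bucket
print(edges)    # the bucket boundaries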
Clustering
Here, cluster representations of the data are used to replace the actual data
Sampling
allows a large data set to be represented by a much smaller random sample
Possible Samples
1. Simple Random Sample Without Replacement (SRSWOR): drawing n of the N tuples from data set D (n < N), where every tuple has an equal chance of being drawn.
2. Simple Random Sample With Replacement (SRSWR): similar to SRSWOR, except that each time a tuple is drawn from D, it is recorded and then replaced, so it may be drawn again.
3. Cluster Sample: Tuples in D are grouped into M mutually disjoint "clusters"; then an SRS of m clusters can be obtained, where m < M.
4. Stratified Sample: D is divided into mutually disjoint parts called strata; a sample of D is generated by obtaining an SRS at each stratum.
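A quick sketch of the four sampling schemes using Python's random module; the data set, sample sizes, clustering, and stratification rule are all invented for illustration:

import random

D = list(range(1, 101))   # illustrative data set of N = 100 tuples
n = 10

# 1. SRSWOR: each drawn tuple can appear at most once in the sample.
srswor = random.sample(D, n)

# 2. SRSWR: a drawn tuple is "replaced", so it may be drawn again.
srswr = random.choices(D, k=n)

# 3. Cluster sample: group D into M clusters, then take an SRS of m clusters (m < M).
clusters = [D[i:i + 10] for i in range(0, len(D), 10)]   # M = 10 clusters
cluster_sample = [t for c in random.sample(clusters, 2) for t in c]

# 4. Stratified sample: split D into disjoint strata, then take an SRS per stratum.
strata = {"low": [t for t in D if t <= 50], "high": [t for t in D if t > 50]}
stratified = [t for s in strata.values() for t in random.sample(s, 5)]

print(len(srswor), len(srswr), len(cluster_sample), len(stratified))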
Discretisation and Concept Hierarchy Generation
- Used to reduce the number of values for a given continuous attribute
- by dividing the range of the attribute into intervals
- Interval labels can then be used to replace actual data values
Concept Hierarchy
Defines a discretisation of the attribute
- can be used to reduce the data
- by collecting and replacing low-level concepts (eg. age) by higher-level concepts (young, middle-aged, senior)
How to Generate Concept Hierarchies
- Binning- Smoothing by bin means or medians
Histogram Analysis
Cluster Analysis
- Each cluster can be decomposed to further subclusters- thus forming a concept hierarchy
Entropy-Based Discretisation
1. Given a set of samples S, each value of a numeric attribute A can be considered a potential interval boundary (threshold) T, which partitions S into subsets S1 and S2.
2. Select the boundary T that minimises the expected information requirement
I(S, T) = (|S1|/|S|) ent(S1) + (|S2|/|S|) ent(S2)
where ent() is the entropy function
ent(S1) = - sigma (i = 1 to m) pi log2(pi)
and pi is the probability of class i in S1.
3. Apply step 2 recursively to each partition while
ent(S) - I(S, T) > δ
where δ is the stopping criterion (threshold)
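A small sketch of steps 1-2: it computes ent() and I(S, T) for each candidate boundary and picks the one with the smallest I(S, T); the attribute values and class labels are invented:

import math
from collections import Counter

def ent(labels):
    """Entropy: -sum_i pi * log2(pi) over the classes present in labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info(values, labels, t):
    """Expected information requirement I(S, T) for boundary t."""
    s1 = [c for v, c in zip(values, labels) if v <= t]
    s2 = [c for v, c in zip(values, labels) if v > t]
    n = len(labels)
    return len(s1) / n * ent(s1) + len(s2) / n * ent(s2)

# Illustrative attribute values and class labels.
values = [5, 10, 15, 20, 25, 30]
labels = ["a", "a", "a", "b", "b", "b"]

# Candidate boundaries: every value of this sorted list except the maximum.
best_t = min(set(values[:-1]), key=lambda t: info(values, labels, t))
print(best_t, ent(labels) - info(values, labels, best_t))   # boundary and its information gain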
Segmentation by Natural Partitioning
- Here numerical ranges are partitioned into 'natural', intuitive intervals
e.g. annual salaries broken into ranges like ($50,000, $60,000)
are more desirable than ranges like
($51,263.98, $60,872.34)
Sunday, September 28, 2008
Data Preprocessing - Part 1
Data Preprocessing
- Data Cleaning
- Data Transformation
- Data Integration
- Data Reduction
1. Data Cleaning
These routines attempt to
- fill in the missing values
- smooth out noise while identifying outliers
- correct inconsistencies in the data
Missing Values: Suppose there are many tuples that have no recorded value for an attribute, e.g. customer income
How to fill the missing values:
1. Ignore the tuple
- not very effective
- useful when the tuple has several missing values
2. Fill in the missing values manually
- time-consuming
- not feasible with many missing values
3. Use a global constant to fill in the missing value
- e.g. a label such as 'unknown' or a value such as minus infinity
- but if there are many 'unknown' values, the mining program may mistakenly think they form an interesting pattern
4. Use the attribute mean to fill in the missing values
e.g. the average income of customers is $28,000; use this value to replace missing income values
5. Use the attribute mean for all samples belonging to the same class as the given tuple
eg. if classifying customers according to credit risk, replace the missing value with the average income value for customers in the same credit risk category as that of the given tuple.
6. Use the most probable value to fill in the missing value, determined by
- regression
- inference-based tools
Methods 3-6 bias the data. Method 6 is a popular strategy.
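A pandas sketch of strategies 3-5 (global constant, attribute mean, and class-conditional mean); the customer table and the -1 constant are invented for the example:

import pandas as pd

# Illustrative customer data with missing income values.
df = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high", "low"],
    "income":      [28000, None, 35000, None, 30000],
})

# 3. Global constant.
filled_const = df["income"].fillna(-1)

# 4. Attribute mean over all tuples.
filled_mean = df["income"].fillna(df["income"].mean())

# 5. Mean within the same class (credit_risk category) as the tuple.
class_mean = df.groupby("credit_risk")["income"].transform("mean")
filled_class_mean = df["income"].fillna(class_mean)

print(filled_class_mean.tolist())   # [28000.0, 29000.0, 35000.0, 35000.0, 30000.0]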