Sunday, September 28, 2008

Data Preprocessing - Part 1

Data Preprocessing
- Data Cleaning
- Data Transformation
- Data Integration
- Data Reduction

1. Data Cleaning
These routines attempt to
- fill in missing values
- smooth out noise while identifying outliers
- correct inconsistencies in the data

Missing Values: Suppose many tuples have no recorded value for some attribute, e.g. customer income.

How to fill the missing values:

1. Ignore the tuple
- not very effective, because the tuple's recorded values are discarded along with the missing ones
- useful when the tuple has several missing values
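
Ignoring tuples can be sketched in a few lines of Python. The records, the use of None as the missing-value marker, and the function names drop_incomplete and count_missing are assumptions made for illustration only.

```python
# Hypothetical customer records; None marks a missing value.
customers = [
    {"name": "A", "income": 30000, "age": 41},
    {"name": "B", "income": None, "age": 35},
    {"name": "C", "income": None, "age": None},
]


def count_missing(record):
    """Return how many attributes of a record have no recorded value."""
    return sum(1 for value in record.values() if value is None)


def drop_incomplete(records, max_missing=0):
    """Keep only the records with at most max_missing missing attributes."""
    return [r for r in records if count_missing(r) <= max_missing]


# Strict: drop every tuple that has any missing value.
strict = drop_incomplete(customers)

# Lenient: drop only tuples with several (here, more than one) missing values.
lenient = drop_incomplete(customers, max_missing=1)
```

The strict version keeps one record out of three, which shows why this method is wasteful; the lenient version only discards the tuple that had little usable data anyway.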

2. Fill in the missing values manually
- time-consuming
- not feasible when there are many missing values

3. Use a global constant to fill in the missing value
- e.g. "unknown" or the infinity symbol
- if many values are replaced by "unknown", the mining program may mistake them for an interesting pattern, since they all share the same value
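
A minimal Python sketch of the global-constant method and its pitfall follows. The occupation values and the "unknown" sentinel are made-up sample data, not taken from any real dataset.

```python
from collections import Counter

UNKNOWN = "unknown"  # assumed sentinel used as the global constant

# Hypothetical attribute values; None marks a missing value.
occupations = ["teacher", None, "engineer", None, None]

# Replace every missing value with the same constant.
filled = [UNKNOWN if value is None else value for value in occupations]

# A frequency count now treats the sentinel as an ordinary category.
counts = Counter(filled)
most_common_value, frequency = counts.most_common(1)[0]
```

Here the sentinel becomes the most frequent "occupation", which is exactly the kind of artificial pattern a mining program could pick up.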

4. Use the attribute mean to fill in the missing values

e.g. if the average customer income is $28,000, use this value to replace each missing income.
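
As a rough illustration in Python, the mean is computed over the recorded values only and then substituted for each gap. The income figures are invented so that the mean comes out to 28000, and fill_with_mean is a hypothetical helper name.

```python
from statistics import mean

# Hypothetical incomes; None marks a missing value.
incomes = [20000, None, 36000, 28000, None]


def fill_with_mean(values):
    """Replace each missing value with the mean of the recorded values."""
    observed = [v for v in values if v is not None]
    attribute_mean = mean(observed)
    return [attribute_mean if v is None else v for v in values]


filled_incomes = fill_with_mean(incomes)
```

Note that every gap receives the same value, so the spread of the attribute shrinks after filling.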

5. Use the attribute mean for all samples belonging to the same class as the given tuple

e.g. if classifying customers according to credit risk, replace the missing value with the average income of the customers in the same credit risk category as the given tuple.
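
The class-wise variant might look like the following Python sketch. The "risk" and "income" field names, the sample records, and the function fill_with_class_mean are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical customers; None marks a missing income.
customers = [
    {"risk": "low", "income": 50000},
    {"risk": "low", "income": 54000},
    {"risk": "low", "income": None},
    {"risk": "high", "income": 18000},
    {"risk": "high", "income": 22000},
    {"risk": "high", "income": None},
]


def fill_with_class_mean(records, class_key, value_key):
    """Fill each missing value with the mean of its own class."""
    # Collect the recorded values per class.
    groups = defaultdict(list)
    for record in records:
        if record[value_key] is not None:
            groups[record[class_key]].append(record[value_key])
    class_means = {label: mean(values) for label, values in groups.items()}

    filled = []
    for record in records:
        new_record = dict(record)  # leave the input untouched
        if new_record[value_key] is None:
            # Stays None if the class has no recorded values at all.
            new_record[value_key] = class_means.get(new_record[class_key])
        filled.append(new_record)
    return filled


filled_customers = fill_with_class_mean(customers, "risk", "income")
```

With this sample data the low-risk gap is filled with 52000 and the high-risk gap with 20000, rather than both receiving the overall mean of 36000.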

6. Use the most probable value to fill in the missing value, determined by
- regression
- inference-based tools
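
For the regression option, one possible sketch is a simple least-squares line fitted on the complete tuples and used to predict the missing ones. The choice of years employed as the predictor, the sample figures, and the function names are all assumptions; a real dataset would rarely be this perfectly linear.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical (years_employed, income) pairs; None marks a missing income.
records = [(1, 22000), (2, 24000), (3, 26000), (4, 28000), (5, None)]

# Fit only on the tuples whose income is recorded.
complete = [(x, y) for x, y in records if y is not None]
slope, intercept = fit_line([x for x, _ in complete], [y for _, y in complete])

# Predict the missing incomes from the fitted line.
filled_records = [
    (x, slope * x + intercept if y is None else y) for x, y in records
]
```

Unlike the mean-based methods, the filled value here depends on the tuple's other attribute, so different tuples can receive different estimates.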

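Methods 3-6 bias the data, since the filled-in value may not be the true one. Method 6 is nevertheless a popular strategy, because it uses the most information from the other attributes to predict the missing value.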
