What is data pre-processing?
Data pre-processing is an important step in the data mining process. It covers any processing performed on raw data to prepare it for a further processing procedure, transforming the data into a format that can be processed more easily and effectively for the user's purpose.
Importance of data pre-processing.
Real-world data is usually incomplete (it may contain missing values), noisy (it may contain errors introduced during transmission, or dirty data), and inconsistent (it may contain duplicate or unexpected values, which lead to inconsistency). Data preprocessing is a proven method of solving such problems.
No quality data, no quality mining results! If analysis is performed on low-quality data, the results obtained will also be of low quality, which is undesirable in the decision-making process. For a quality result, this dirty data must first be cleaned, and converting dirty data into quality data is the job of data pre-processing techniques.
Major Tasks in data pre-processing.
- Data Cleaning.
- Data Integration.
- Data Transformation.
- Data Reduction.
Data cleaning (or data cleansing) is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. Data cleaning techniques attempt to fill in missing values, smooth out noise while identifying outliers, and correct inconsistencies in the data.
Tasks in data cleaning:
- Fill in missing values
- Identify outliers and smooth noisy data
- Correct inconsistent data
1. Fill in missing values:
- Ignore the tuple.
- Fill in the missing value manually.
- Use a global constant to fill in the missing value.
- Use the most probable value.
- Use the attribute mean or median for all samples belonging to the same class as the given tuple.
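As a minimal sketch of the mean-fill strategy above, the missing entries of a numeric attribute can be replaced by the mean of the known values (the toy data and the use of `None` as the missing-value marker are illustrative assumptions):

```python
# Toy attribute with missing entries marked as None (assumption).
values = [12.0, None, 15.0, 11.0, None, 14.0]

# Strategy: fill missing entries with the attribute mean.
known = [v for v in values if v is not None]
mean = sum(known) / len(known)
filled = [v if v is not None else mean for v in values]
print(filled)  # missing entries replaced by 13.0
```

The same loop structure works for the other strategies: substituting a global constant, the median, or a per-class mean simply changes how the replacement value is computed.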
2. Identify outliers and smooth noisy data:
- Outlier analysis.
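A simple sketch of outlier analysis and smoothing, assuming the common interquartile-range (IQR) rule for outliers and equal-frequency binning with bin means for smoothing; the sample data and bin size are illustrative:

```python
# Toy sorted attribute with one extreme value (assumption).
data = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 80])

def quartile(xs, q):
    # Linear-interpolation quantile; assumes xs is sorted.
    idx = q * (len(xs) - 1)
    lo, hi = int(idx), min(int(idx) + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (idx - lo)

# Outlier analysis: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = quartile(data, 0.25), quartile(data, 0.75)
iqr = q3 - q1
outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
print(outliers)  # [80]

# Smoothing by bin means: equal-frequency bins of size 4,
# each value replaced by the mean of its bin.
bins = [data[i:i + 4] for i in range(0, len(data), 4)]
smoothed = [round(sum(b) / len(b), 1) for b in bins for _ in b]
print(smoothed)
```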
Data Integration is the process of combining data from multiple sources to provide a single unified view over all of them. Data integration can be physical or virtual.
Tasks in data integration:
- Data Integration - combines data from multiple sources into a single data store.
- Schema integration - integrates metadata from different sources.
- Entity identification problem - identifies real-world entities across multiple data sources.
- Detecting and resolving data value conflicts - for the same real-world entity, attribute values from different sources may differ.
- Handling redundancy in data integration.
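The tasks above can be sketched on two toy sources; the source records, the attribute-name mapping, and the "trust source A on conflict" rule are all illustrative assumptions:

```python
# Two sources describing the same entity under different schemas (assumption).
source_a = {1: {"name": "Alice", "city": "Oslo"}}
source_b = {1: {"full_name": "Alice", "city": "OSLO"}}

# Schema integration: map source B's attribute names onto source A's.
schema_map = {"full_name": "name", "city": "city"}

integrated = {}
for key in set(source_a) | set(source_b):
    # Entity identification: records sharing a key are the same entity.
    record = {}
    for attr, value in source_b.get(key, {}).items():
        record[schema_map.get(attr, attr)] = value
    # Conflict resolution: values from source A take precedence (assumption).
    record.update(source_a.get(key, {}))
    integrated[key] = record
print(integrated)  # {1: {'name': 'Alice', 'city': 'Oslo'}}
```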
Data transformation is the process of converting data from one format or structure into another. In this preprocessing step, the data are transformed or consolidated so that the resulting mining process is more efficient and the patterns found are easier to understand.
Data Transformation Strategies:
- Smoothing - works to remove noise from the data.
- Attribute construction (feature construction) - new attributes are constructed and added from the given set of attributes to help the mining process.
- Aggregation - summary or aggregation operations are applied to the data.
- Normalization - the attribute data are scaled so as to fall within a smaller range.
- Discretization - the raw values of a numeric attribute are replaced by interval labels.
- Concept hierarchy generation for nominal data - attributes such as street can be generalized to higher-level concepts such as city or country.
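Two of the strategies above, normalization and discretization, can be sketched on a toy age attribute; the values, the [0, 1] target range (min-max normalization), and the interval labels are illustrative assumptions:

```python
# Toy numeric attribute (assumption).
ages = [18, 25, 40, 60, 72]

# Normalization: min-max scaling into the range [0, 1].
lo, hi = min(ages), max(ages)
normalized = [round((a - lo) / (hi - lo), 2) for a in ages]
print(normalized)  # [0.0, 0.13, 0.41, 0.78, 1.0]

# Discretization: replace raw values with interval labels (bins are assumptions).
def label(age):
    if age < 30:
        return "young"
    elif age < 60:
        return "middle-aged"
    return "senior"

labels = [label(a) for a in ages]
print(labels)  # ['young', 'young', 'middle-aged', 'senior', 'senior']
```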
A database or data warehouse may store terabytes of data, and performing complex analysis on such a voluminous data set can take a very long time when run against the complete data. Data reduction is therefore used to obtain a reduced representation of the data set that is much smaller in volume yet produces the same analytical results. Data reduction is the transformation of numerical or alphabetical digital information derived empirically or experimentally into a corrected, ordered, and simplified form.
Data reduction Strategies:
- Data Compression
- Dimensionality reduction
- Discretization and concept hierarchy generation
- Numerosity reduction
- Data cube aggregation.
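Data cube aggregation, the last strategy above, can be sketched by rolling quarterly sales rows up to one row per year, yielding a much smaller data set that still answers yearly-level queries; the sales figures are illustrative:

```python
# Toy fact rows: (year, quarter, sales) -- figures are assumptions.
quarterly = [
    (2022, "Q1", 224), (2022, "Q2", 408), (2022, "Q3", 350), (2022, "Q4", 586),
    (2023, "Q1", 310), (2023, "Q2", 402), (2023, "Q3", 390), (2023, "Q4", 620),
]

# Roll up: aggregate quarterly sales to yearly totals.
yearly = {}
for year, _quarter, sales in quarterly:
    yearly[year] = yearly.get(year, 0) + sales
print(yearly)  # {2022: 1568, 2023: 1722}
```

Eight rows shrink to two, which is the essence of data reduction: the reduced representation is far smaller but gives the same answer to any query at the yearly level.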