Clean the Data Using a Predefined Specification
Once you’ve identified the problems in your dataset, develop a cleaning routine. A routine built on a predefined specification helps you produce more reliable, consistent, and accurate data.
A Typical Cleaning Routine
- Identify invalid data. Use your data quality standards and your key requirements to identify all invalid or inaccurate data.
- Investigate the reasons for the bad data. Having this understanding will assist you in taking the necessary actions to correct the data.
- Determine how the dirty data should be cleaned. Whenever possible, invalid data should be corrected so it can be used for your project.
- Perform accuracy tests to ensure the data were properly cleaned. An accuracy test is a direct comparison of the collected data with the actual event or object.
For example, you may want to compare the written run report with the electronic version that was recently entered into the database.
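The routine above can be sketched in code. The following is a minimal illustration in Python, not a definitive implementation: the field names (`run_id`, `patient_age`, `dispatch_time`), the validity rules in `SPEC`, the sample records, and the helper functions are all hypothetical and would be replaced by your own specification. Investigating why the data are bad (the second step) remains a manual task; here its outcome is represented by a hand-built `corrections` table.

```python
from datetime import datetime


def is_time(value):
    """Return True if value is a string in 24-hour HH:MM format."""
    try:
        datetime.strptime(value, "%H:%M")
        return True
    except (TypeError, ValueError):
        return False


# Hypothetical specification: each field maps to a rule that returns True for valid values.
SPEC = {
    "run_id": lambda v: isinstance(v, str) and v.strip() != "",
    "patient_age": lambda v: isinstance(v, int) and 0 <= v <= 120,
    "dispatch_time": is_time,
}


def find_invalid(records, spec):
    """Identify invalid data: list (row index, field, value) for every failed rule."""
    problems = []
    for i, rec in enumerate(records):
        for field, is_valid in spec.items():
            if not is_valid(rec.get(field)):
                problems.append((i, field, rec.get(field)))
    return problems


def clean(records, corrections):
    """Apply corrections keyed by (row index, field); the original records are not modified."""
    cleaned = [dict(rec) for rec in records]
    for (i, field), value in corrections.items():
        cleaned[i][field] = value
    return cleaned


def accuracy_test(cleaned, source_reports, key="run_id"):
    """Compare cleaned records with the source reports; list every mismatch found."""
    source_by_key = {rec[key]: rec for rec in source_reports}
    mismatches = []
    for rec in cleaned:
        source = source_by_key.get(rec[key], {})
        for field, value in rec.items():
            if source.get(field) != value:
                mismatches.append((rec[key], field, value, source.get(field)))
    return mismatches


# Made-up sample data: the database entries and the written run reports they came from.
records = [
    {"run_id": "R001", "patient_age": 34, "dispatch_time": "14:05"},
    {"run_id": "R002", "patient_age": 340, "dispatch_time": "25:61"},
]
written_reports = [
    {"run_id": "R001", "patient_age": 34, "dispatch_time": "14:05"},
    {"run_id": "R002", "patient_age": 34, "dispatch_time": "15:01"},
]

problems = find_invalid(records, SPEC)

# Corrections decided after investigating each problem (here, by checking the written report).
corrections = {(1, "patient_age"): 34, (1, "dispatch_time"): "15:01"}
cleaned = clean(records, corrections)

remaining_problems = find_invalid(cleaned, SPEC)
mismatches = accuracy_test(cleaned, written_reports)
```

Keeping the rules in one `SPEC` table means the same checks can be rerun after cleaning, and the accuracy test stays separate from validation because a value can satisfy the specification and still disagree with the written report.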
These steps may seem time-consuming, but they are worth every minute!