• Zev Isert

The Basics of Data Cleaning

What is dirty data?

In the data science industry, the saying “garbage in, garbage out” is ubiquitous: if you feed a model dirty data, its results will not be accurate. Dirty data is data that doesn’t accurately reflect the true underlying system or process because it contains inconsistencies or errors.

There are many reasons why your business may experience issues with dirty data. One of these is system integration. For example, if your business acquires another company with its own independent data systems, it will be difficult to combine the different data types and systems into one consistent, organized pipeline.
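To make the integration problem concrete, here is a minimal sketch of mapping records from two systems onto one shared shape. The system names, field names, and date formats are all hypothetical, chosen only for illustration; a real integration would have many more fields and edge cases.

```python
from datetime import datetime

# Hypothetical field mappings: each source system's field name -> shared name.
FIELD_MAPS = {
    "system_a": {"cust_id": "customer_id", "signup": "signup_date"},
    "system_b": {"CustomerID": "customer_id", "SignupDate": "signup_date"},
}

# Hypothetical date formats used by each source system.
DATE_FORMATS = {"system_a": "%Y-%m-%d", "system_b": "%m/%d/%Y"}


def to_shared_schema(record: dict, source: str) -> dict:
    """Rename fields and normalize types so both systems produce the same shape."""
    mapping = FIELD_MAPS[source]
    # Keep only the fields we know how to map, under their shared names.
    unified = {mapping[key]: value for key, value in record.items() if key in mapping}
    # One system stores ids as integers, the other as strings: pick one type.
    unified["customer_id"] = str(unified["customer_id"])
    # Parse each system's date string into a real date object.
    unified["signup_date"] = datetime.strptime(
        unified["signup_date"], DATE_FORMATS[source]
    ).date()
    return unified


a = to_shared_schema({"cust_id": 17, "signup": "2021-03-05"}, "system_a")
b = to_shared_schema({"CustomerID": "17", "SignupDate": "03/05/2021"}, "system_b")
```

After the mapping, `a` and `b` compare equal even though the two source records looked nothing alike.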

Another common source of dirty data is user error. Many companies collect data through user input (online forms, surveys, etc.). If a user enters their information incorrectly, or creates a second profile after forgetting their old password, the result is inaccurate data and duplicate records in your system.
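The duplicate-profile case can be sketched in a few lines. This example assumes each profile has an `id` and an `email` field (hypothetical names) and that two profiles with the same email, ignoring case and surrounding whitespace, belong to the same person. That assumption will not hold for every business.

```python
def normalize_email(email: str) -> str:
    """Strip surrounding whitespace and lowercase the address."""
    return email.strip().lower()


def find_duplicate_profiles(profiles: list) -> dict:
    """Group profile ids by normalized email; keep only groups with more than one id."""
    groups = {}
    for profile in profiles:
        key = normalize_email(profile["email"])
        groups.setdefault(key, []).append(profile["id"])
    return {email: ids for email, ids in groups.items() if len(ids) > 1}


profiles = [
    {"id": 1, "email": "Ana@Example.com"},
    {"id": 2, "email": "ana@example.com "},
    {"id": 3, "email": "ben@example.com"},
]
duplicates = find_duplicate_profiles(profiles)
```

Here profiles 1 and 2 are flagged as likely duplicates. Deciding which profile to keep, or how to merge them, is a separate business decision.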

These are only two of the most common causes; there are many other reasons why you may be experiencing problems with dirty data.

What is data cleaning?

Data cleaning is the process of ensuring that the data used in your AI models is clean and accurate, so that the models produce valuable results. It is all about maximizing the value of the data your organization already treats as an asset. So what makes data “dirty”?

  • Duplicates

  • Unwanted outliers

  • Missing data

  • Inconsistencies

  • Incorrect formatting

  • Typos

  • Incorrect capitalization

  • Irrelevant data

  • etc.

To get clean data, you must deal with all of these issues, whether by filtering and removing data or by correcting and augmenting the data you already have. The cleaning process you use depends on how your data arrives. If you have a static dataset, a one-time clean is sufficient. For batch and streaming data, however, you need an automated process that cleans the data in motion.
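As a rough illustration, the sketch below applies a few of the checks listed above (formatting and capitalization, missing data, outliers, duplicates) one record at a time. Because it is written as a generator, the same function can run once over a static list or continuously over a stream. The `name` and `age` fields, the 0 to 120 age range, and the choice to drop bad records rather than repair them are all assumptions made for this example.

```python
from typing import Iterable, Iterator, Optional


def clean_record(record: dict) -> Optional[dict]:
    """Return a cleaned copy of the record, or None if it should be dropped."""
    name = (record.get("name") or "").strip()
    age = record.get("age")
    if not name or age is None:
        return None  # missing data
    try:
        age = int(age)
    except (TypeError, ValueError):
        return None  # incorrect formatting
    if not 0 <= age <= 120:
        return None  # unwanted outlier (assumed plausible range)
    # Collapse repeated whitespace and fix capitalization.
    return {"name": " ".join(name.split()).title(), "age": age}


def clean_stream(records: Iterable[dict]) -> Iterator[dict]:
    """Yield cleaned, de-duplicated records as they arrive."""
    seen = set()  # grows without bound; a long-running stream would need a cap
    for record in records:
        cleaned = clean_record(record)
        if cleaned is None:
            continue
        key = (cleaned["name"], cleaned["age"])
        if key in seen:
            continue  # duplicate
        seen.add(key)
        yield cleaned


raw = [
    {"name": "  alice  SMITH ", "age": "34"},
    {"name": "Alice Smith", "age": 34},
    {"name": "Bob Jones", "age": None},
    {"name": "Carol Wu", "age": 430},
    {"name": "dave lee", "age": "forty"},
]
cleaned = list(clean_stream(raw))
```

Of the five raw records, only one survives: the first is normalized, the second is its duplicate, and the rest are dropped for missing, out-of-range, or badly formatted ages. In practice you may prefer to correct or flag such records instead of discarding them.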