Descriptive statistics is a powerful tool for gaining insight into the overall characteristics of a data set. The first thing we’ll do is load data from different sources and then organize it a bit.
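
A minimal sketch of what that loading step can look like with pandas; the file names (sales.csv, sales.xlsx, sales.json) are hypothetical placeholders, not files from this article:

```python
import pandas as pd

# Load the same kind of data from three hypothetical sources.
df_csv = pd.read_csv("sales.csv")        # plain-text, comma-separated
df_excel = pd.read_excel("sales.xlsx")   # Excel workbook (needs openpyxl)
df_json = pd.read_json("sales.json")     # JSON records

# Organize it a bit: normalize the column names and take a first look.
df = df_csv.rename(columns=str.lower)
print(df.head())
```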


Summary of the previous section

Clean data is a key component in the development of data-driven solutions: it boosts productivity, reduces friction, and leads to higher-quality outcomes.

Duplicated observations

Duplicate records are removed only when they are the product of an error in generating the data set; legitimate repeated observations are kept.
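
For fully repeated rows, pandas offers drop_duplicates(); a minimal sketch with an invented frame:

```python
import pandas as pd

# Hypothetical example: the second row is an accidental exact duplicate.
df = pd.DataFrame({"id": [1, 1, 2], "value": [10, 10, 20]})

# drop_duplicates() keeps the first occurrence of each repeated row.
df = df.drop_duplicates()
print(df)
```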

Irrelevant observations

Non-informative variables can slow down processing; sometimes less is more, so it is better to exclude redundant and non-informative variables.
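
One simple heuristic, sketched below with invented columns, is to drop variables that are constant (and therefore carry no information) along with known identifier columns:

```python
import pandas as pd

df = pd.DataFrame({
    "price": [100, 120, 90],
    "currency": ["USD", "USD", "USD"],   # constant: carries no information
    "internal_id": ["a1", "b2", "c3"],   # identifier: irrelevant for analysis
})

# Drop constant columns plus any columns known to be non-informative.
constant_cols = [c for c in df.columns if df[c].nunique() <= 1]
df = df.drop(columns=constant_cols + ["internal_id"])
print(df)
```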

Structural issues

Check the structure of the records: homogenize their format, verify that values fall within plausible minimum and maximum ranges, enforce consistency in text fields, and confirm that each variable has the correct type.
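
A small sketch of those checks on an invented frame: normalize the text fields, fix a variable's type, and inspect the value range:

```python
import pandas as pd

df = pd.DataFrame({
    "city": [" Madrid", "madrid ", "MADRID"],  # inconsistent text
    "age": ["34", "29", "41"],                 # numbers stored as strings
})

# Homogenize the text: strip whitespace and normalize the case.
df["city"] = df["city"].str.strip().str.lower()

# Fix the variable type: age should be numeric, not a string.
df["age"] = pd.to_numeric(df["age"])

# Check the minimum and maximum values for plausibility.
print(df["age"].min(), df["age"].max())
```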

Undesirable outliers

Extreme values that could distort subsequent model fitting are identified and then either removed or imputed.
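
One common rule (not the only one) flags points that fall outside 1.5 times the interquartile range; a sketch with invented values:

```python
import pandas as pd

df = pd.DataFrame({"value": [10, 12, 11, 13, 300]})  # 300 is an extreme value

# Flag points outside 1.5 * IQR.
q1, q3 = df["value"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = df["value"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

df_removed = df[mask]    # eliminate the outliers...
df_imputed = df.copy()   # ...or impute them, e.g. with the median
df_imputed.loc[~mask, "value"] = df.loc[mask, "value"].median()
print(df_removed, df_imputed, sep="\n\n")
```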

Missing data

If we are working with a time series, missing values can be especially inconvenient, so it is advisable to impute them.
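
For an ordered series, linear interpolation is one simple imputation choice; a sketch on an invented daily series:

```python
import pandas as pd

# Hypothetical daily series with two missing values.
idx = pd.date_range("2023-01-01", periods=6, freq="D")
s = pd.Series([1.0, None, 3.0, None, 5.0, 6.0], index=idx)

# Linear interpolation fills each gap from the neighboring points;
# forward fill (s.ffill()) is another common option.
print(s.interpolate())
```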

Verify and QA

Do the descriptive statistics after the cleaning process indicate that the data is consistent and of good quality?
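
A quick way to answer that question in pandas, sketched here on an invented frame, is to re-check the types, the remaining missing values, and the summary statistics:

```python
import pandas as pd

df = pd.DataFrame({"age": [34, 29, 41], "income": [1200.0, 950.0, 1800.0]})

df.info()                 # column types and non-null counts
print(df.isna().sum())    # remaining missing values per column
print(df.describe())      # count, mean, std, min/max, and quartiles
```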