Data is at the core of our business at ConWX, and we know that the quality of input data is reflected in the accuracy of our forecasts. That is why cleaning data is the first step to ensure we have the best input data for training our models. Data cleaning helps us diagnose issues such as outliers, missing values, and noisy data, which all affect data quality.
Some estimate that data scientists spend 80% of their time cleaning and manipulating data, and less than 20% analysing it. In our experience, the ratio is not that dire, but truth be told, data cleaning is a big part of our work at ConWX.
Given how much time goes into cleaning data, we have put together a few guidelines on how to make data cleaning as smooth and easy as possible.
Advice on data cleaning from our data scientists
Use the tool that makes sense. It’s essential to have a wide range of tools available, as no single tool fixes everything. Whether it’s Python or Excel, each tool has pros and cons for the task at hand. Before deciding on a tool, ask yourself: How fast does it need to be done? Can the logic for cleaning the data be easily implemented in the tool? Is it a recurring task? Which tool are you most comfortable with? You will likely end up using different tools for different steps of the cleaning.
Correct data if you have enough information. Use all available features to make the most out of the dataset. Say you have a power production time series for a wind park where the maximum production changes over time. If you also have the turbine availability and potential curtailment, you can use that to scale the power to 100% availability, and use the scaled data to train your models.
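As a rough illustration of the wind park example, here is a minimal Python sketch. It rests on assumptions that are ours for the sake of the example: that production scales roughly linearly with the available share of turbines, that curtailed intervals are dropped because the uncurtailed production cannot be recovered by scaling alone, and that very low availability is too unreliable to scale. The function name, the parameter names, and the 0.2 cut-off are hypothetical.

```python
def scale_to_full_availability(power_mw, availability, curtailed,
                               min_availability=0.2):
    """Scale measured power to what it might have been at 100% availability.

    power_mw:      measured production per time step
    availability:  share of turbines available per time step (0.0 to 1.0)
    curtailed:     True where production was curtailed
    Returns a list with None where no trustworthy value can be derived.
    """
    scaled = []
    for power, avail, is_curtailed in zip(power_mw, availability, curtailed):
        if is_curtailed or avail < min_availability:
            # Assumption: these points cannot be corrected, so mark them missing.
            scaled.append(None)
        else:
            # Assumption: production is proportional to the available share.
            scaled.append(power / avail)
    return scaled


power = [10.0, 5.0, 4.0, 1.0]
availability = [1.0, 0.5, 0.8, 0.1]
curtailed = [False, False, True, False]

scaled_power = scale_to_full_availability(power, availability, curtailed)
print(scaled_power)
```

In this toy series the second point is doubled because only half the park was available, while the curtailed point and the point with 10% availability are marked as missing rather than corrected.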
Less is more. Sometimes you are better off eliminating data that deviates from the standard or simply looks odd. Having said that, be sure not to eliminate too much noise from the data, as this might end up misrepresenting the true nature of your data.
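One possible way to keep this balance is to flag odd-looking points with a robust rule and then check how much of the data you are about to remove. The sketch below uses a modified z-score based on the median absolute deviation. The technique, the threshold of 3.5, and the function names are our choices for illustration and not a fixed recipe.

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Return a list of booleans, True where a value looks like an outlier."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # No spread around the median: this rule cannot single out any point.
        return [False] * len(values)
    # 0.6745 makes the score comparable to a standard z-score for normal data.
    return [abs(0.6745 * (v - med) / mad) > threshold for v in values]


def removed_share(flags):
    """Share of points that would be removed; worth checking before deleting."""
    return sum(flags) / len(flags) if flags else 0.0


readings = [10, 11, 9, 10, 12, 95, 10]
flags = flag_outliers(readings)
cleaned = [v for v, is_outlier in zip(readings, flags) if not is_outlier]

print(cleaned)
print(removed_share(flags))
```

Here only the reading of 95 is flagged and the ordinary variation between 9 and 12 is kept. If the removed share turns out to be large, that is a hint that the rule is eating into real signal rather than noise.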
Communicate with the source. Do not be afraid to contact the source of the data and ask for more information. It can save you from making wrong assumptions or simply discarding good data.
Good luck cleaning your data!