Data is at the core of our business at ConWX, and we know that the quality of input data is reflected in the accuracy of our forecasts. That is why cleaning data is the first step to ensure we have the best input data for training our models. It helps us to diagnose issues such as outliers, missing values, and noisy data, which all affect data quality.
Some estimate that data scientists spend 80% of their time cleaning and manipulating data, and less than 20% analysing it. In our experience, the ratio is not that dire, but truth be told, data cleaning is a big part of our work at ConWX.
Given how much time goes into cleaning data, we have put together a few guidelines on how to make data cleaning as smooth and easy as possible.
Advice on data cleaning from our data scientists
Use the tool that makes sense. It’s essential to have a wide range of tools available, as no single tool fixes everything. Whether it’s Python or Excel, each tool has pros and cons for the task at hand. Before deciding on a tool, ask yourself: How fast does it need to be done? Can the logical pattern for cleaning the data be easily implemented in the tool? Is it a recurring task? Which tool are you most comfortable with? You will likely end up using different tools for different steps of the cleaning.
Correct data if you have enough information. Use all available features to make the most out of the dataset. Say you have a power production time series for a wind park where the maximum production changes over time. If you also have the turbine availability and potential curtailment, you can use that to scale the power to 100% availability, and use the scaled data to train your models.
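As a rough illustration of the scaling idea, here is a minimal Python sketch. It assumes availability is given as a fraction between 0 and 1 and curtailment as a simple yes/no flag per sample. The function name, the `min_availability` cut-off, and the sample values are all hypothetical and chosen for illustration only; this is not a description of our production pipeline.

```python
from typing import Optional


def scale_to_full_availability(
    power_mw: float,
    availability: float,
    curtailed: bool = False,
    min_availability: float = 0.2,  # hypothetical cut-off, tune for your data
) -> Optional[float]:
    """Return power scaled to 100% availability, or None if the sample is unusable."""
    # A curtailed sample does not show what the wind could have produced,
    # so this sketch discards it instead of trying to correct it.
    if curtailed:
        return None
    # Dividing by a very small availability amplifies measurement noise,
    # so samples below the cut-off are discarded as well.
    if availability < min_availability:
        return None
    return power_mw / availability


# Illustrative samples: (power in MW, availability fraction, curtailed?)
samples = [
    (40.0, 1.0, False),
    (30.0, 0.75, False),
    (10.0, 0.8, True),
    (1.0, 0.05, False),
]
scaled = [scale_to_full_availability(p, a, c) for p, a, c in samples]
print(scaled)
```

In this sketch the second sample is scaled from 30 MW up to 40 MW, while the curtailed sample and the sample with very low availability are marked as unusable rather than corrected.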
Less is more. Sometimes you are better off eliminating data that deviates from the standard or simply looks odd. Having said that, be sure not to eliminate too much noise from the data, as this might end up misrepresenting the true nature of your data.
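One way to keep that balance is to flag suspicious values first and decide afterwards what to drop. The sketch below uses a modified z-score based on the median absolute deviation (MAD). That is one common robust technique we picked for illustration, not the only option; the function name, the threshold of 3.5, and the readings are assumptions.

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Return a list of booleans, True where a value looks like an outlier."""
    med = median(values)
    # Median absolute deviation: a spread measure that a few extreme
    # values cannot inflate the way they inflate a standard deviation.
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # At least half of the values equal the median; flag nothing.
        return [False] * len(values)
    # 0.6745 rescales the MAD so the score is roughly comparable to a
    # z-score when the data is normally distributed.
    return [abs(0.6745 * (v - med) / mad) > threshold for v in values]


# Illustrative readings with one obviously odd value
readings = [10.1, 9.8, 10.3, 10.0, 55.0, 9.9]
flags = flag_outliers(readings)
kept = [v for v, is_outlier in zip(readings, flags) if not is_outlier]
print(kept)
```

Flagging instead of deleting directly makes it easy to review what would be removed, and to raise the threshold if too much of the natural variation is being thrown away.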
Communicate with the source. Do not be afraid to contact the source of the data and ask for more information. It can save you from making wrong assumptions or simply discarding good data.