What is data wrangling?

Data wrangling is the process of preparing data for analysis or computation. It includes transforming data into other formats and derived statistics; cleaning data to account for missing observations, awkward data types, or outright errors; and arranging data into workable, categorized data sets.
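As a rough illustration of those three activities, here is a minimal sketch using only the Python standard library. The records, the field names (region, temp_c), and the plausibility range are hypothetical and chosen purely for the example; a real project would set its own rules for what counts as missing or erroneous.

```python
from statistics import mean

# Hypothetical raw records: numbers stored as text, a missing value,
# inconsistent category labels, and one impossible reading.
raw = [
    {"region": " north ", "temp_c": "21.5"},
    {"region": "North", "temp_c": ""},
    {"region": "SOUTH", "temp_c": "19.0"},
    {"region": "south", "temp_c": "999"},
]


def clean(records, low=-90.0, high=60.0):
    """Return records with normalized labels and float temperatures.

    Missing or out-of-range temperatures become None (an assumed rule).
    """
    cleaned = []
    for rec in records:
        region = rec["region"].strip().lower()  # normalize category labels
        text = rec["temp_c"].strip()
        value = float(text) if text else None   # fix the type, keep gaps explicit
        if value is not None and not (low <= value <= high):
            value = None                        # treat implausible readings as errors
        cleaned.append({"region": region, "temp_c": value})
    return cleaned


def mean_by_region(records):
    """Group the usable temperatures by region and average them."""
    groups = {}
    for rec in records:
        if rec["temp_c"] is not None:
            groups.setdefault(rec["region"], []).append(rec["temp_c"])
    return {region: mean(values) for region, values in groups.items()}


cleaned = clean(raw)
summary = mean_by_region(cleaned)
print(summary)  # {'north': 21.5, 'south': 19.0}
```

Here the cleaning step handles the type conversion, the missing observation, and the error; the grouping step arranges what is left into categories and derives a statistic from each.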

When should data be cleaned?

Anyone working with data knows it needs to be cleaned many times along the data pipeline. After data is collected, it must be wrangled before it is stored. When it is taken from storage, it has to be wrangled before it can be fed into analytical processes. When data plans change, stored data needs to be wrangled to accommodate changes in selected variables, categorization, and record ranges. Sometimes data needs to be tidied, and that requires wrangling too. When models are constructed, data requirements change and the data needs to be wrangled again. Data needs to be wrangled when models are revisited and rerun, because questions change and inputs change. Presenting your results also involves wrangling the data. A recent survey of data scientists found that they spend roughly 80% of their time preparing data: 60% wrangling it and 20% collecting it.

What kind of applications should be used to clean data?

Most people are familiar with Microsoft Excel and its open-source competitors. Excel is a great example of a data wrangling tool because it is very visual: you can easily see what you are working with when you wrangle data in a spreadsheet. Spreadsheet applications, however, have many limitations. In Excel, for example, a worksheet is limited to 1,048,576 rows and 16,384 columns. This may not seem restrictive, but in the context of even trade data, not to mention big data, it is. Microsoft's documentation of Excel specifications and limits gives a more complete list. Spreadsheets also rely on minimal macro technologies for scripting and are therefore not very efficient data wrangling workhorses.
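To make the size limits concrete, the following sketch counts the rows and columns of a CSV stream and compares them with Excel's per-worksheet limits (1,048,576 rows and 16,384 columns). The function name fits_in_excel and the sample inputs are hypothetical; the check reads one row at a time, so it can be run on a file that is too large to open in a spreadsheet.

```python
import csv
import io

EXCEL_MAX_ROWS = 1_048_576  # rows per worksheet
EXCEL_MAX_COLS = 16_384     # columns per worksheet


def fits_in_excel(lines):
    """Stream CSV rows and report (fits, row_count, widest_row)."""
    n_rows = 0
    n_cols = 0
    for row in csv.reader(lines):
        n_rows += 1
        n_cols = max(n_cols, len(row))
    fits = n_rows <= EXCEL_MAX_ROWS and n_cols <= EXCEL_MAX_COLS
    return fits, n_rows, n_cols


# A tiny in-memory CSV: a header plus two records.
small = io.StringIO("id,value\n1,10\n2,20\n")
print(fits_in_excel(small))  # (True, 3, 2)

# A generated stream that is one row over the worksheet limit.
too_long = (f"{i},x\n" for i in range(EXCEL_MAX_ROWS + 1))
print(fits_in_excel(too_long))  # (False, 1048577, 2)
```

For a file on disk, the same function could be called with an open file handle, for example one created with open(path, newline="").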

Programming languages, and the data-focused packages built in them, are better for wrangling data. In data science, two main programming languages are used to wrangle data: Python and S (through its R implementation). While the two languages have many differences, they share a popularity among data scientists that translates directly into large numbers of workable packages. These packages can be used in conjunction with hardware adjustments to facilitate working with big data sets, meaning data sets with more than a billion records. There are also many software packages that can be used in data science; Wikipedia lists a few of them.
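One reason scripted tools scale better than spreadsheets is that they can process a data set in pieces instead of loading it all at once. The sketch below shows the idea with the Python standard library only; the helper names (batches, total_by_key), the column names, and the tiny batch size are assumptions made for illustration. Data-focused packages offer comparable chunked reading, for example the chunksize option of read_csv in pandas.

```python
import csv
import io
from collections import Counter
from itertools import islice


def batches(iterable, size):
    """Yield lists of at most `size` items so only one batch is held in memory."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def total_by_key(lines, key, value, batch_size=2):
    """Sum the `value` column per `key` over a CSV stream, one batch at a time."""
    totals = Counter()
    reader = csv.DictReader(lines)
    for batch in batches(reader, batch_size):
        for row in batch:
            totals[row[key]] += float(row[value])
    return dict(totals)


# Hypothetical trade records; a real input would be a large file on disk.
sample = io.StringIO(
    "country,exports\n"
    "A,10\n"
    "B,5\n"
    "A,2.5\n"
    "B,1\n"
    "C,4\n"
)
totals = total_by_key(sample, "country", "exports")
print(totals)  # {'A': 12.5, 'B': 6.0, 'C': 4.0}
```

Because the running totals are the only state kept between batches, memory use depends on the number of distinct keys rather than on the number of records.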