Introduction
A frequent error in Data Science projects is thinking that they start with analysis. In fact, when data analysts are asked where they spend most of their time, the answer is almost always the same: about 80% of it goes to Data Wrangling.
Data in its natural form (raw data) usually contains recording errors that make accurate analysis impossible. Because it is captured by different systems and people, it is common to end up with a file in which the same value is expressed in several ways (for example, a date may be recorded as June 28 or as 28/06), along with blank records and, of course, misspellings.
Before this data can be analyzed, all of these records have to be pre-processed: the data must be cleaned, unified, consolidated, and normalized so that valuable information can be extracted from it. Data Wrangling is precisely this process of preparing data so that it can be leveraged.
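To make this concrete, here is a minimal sketch (assuming the lubridate package and made-up date strings) of how the same date, recorded in the different formats mentioned above, can be unified into a single representation:

```r
library(lubridate)

# The same date recorded three different ways (illustrative values only)
raw_dates <- c("June 28, 2023", "28/06/2023", "2023-06-28")

# parse_date_time() tries each supplied order until one matches,
# so heterogeneous formats collapse into a single date-time representation
clean_dates <- parse_date_time(raw_dates, orders = c("mdy", "dmy", "ymd"))
clean_dates  # all three strings now refer to the same date
```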
In the following chapters, we will cover several common steps of the Data Wrangling process, such as importing data into R from files, converting data to tidy format, string processing, HTML processing, date and time formatting, and text mining.
In this section, we will master the essential skills for getting data into R and reshaping it for analysis. We will start by learning how to import data from diverse sources, including CSV files, Excel spreadsheets, and web pages. Once loaded, we will explore how to transform data between “wide” and “tidy” (long) formats using pivot_longer() and pivot_wider(), ensuring our data is structured correctly for visualization and modeling. We will also cover how to combine multiple datasets using the powerful family of join functions (left_join(), inner_join(), etc.). Finally, we will delve into specialized processing techniques, including web scraping, string manipulation with regular expressions, date conversion with lubridate, and the fundamentals of text mining to extract insights from unstructured text.
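As a brief preview of the reshaping and joining verbs named above, the sketch below uses small made-up tibbles (the tables and column names are illustrative only, not datasets used later) to move a table between wide and tidy form and then attach a second table with left_join():

```r
library(tidyr)
library(dplyr)

# A small "wide" table: one column per year (illustrative data)
sales_wide <- tibble::tibble(
  region = c("North", "South"),
  `2022` = c(100, 150),
  `2023` = c(120, 160)
)

# Wide -> tidy (long): one row per region-year observation
sales_long <- sales_wide |>
  pivot_longer(cols = c(`2022`, `2023`),
               names_to = "year", values_to = "sales")

# Tidy -> wide again, reversing the previous step
sales_wide_again <- sales_long |>
  pivot_wider(names_from = year, values_from = sales)

# Combining datasets: attach a manager to each region with a left join
managers <- tibble::tibble(
  region  = c("North", "South"),
  manager = c("Ana", "Luis")
)
sales_long |>
  left_join(managers, by = "region")
```

In practice, importing would precede these steps, for example with readr::read_csv() for a CSV file or readxl::read_excel() for a spreadsheet; both are covered in detail in what follows.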