How online communities and #tidytuesday took my R skills to the next level

Recently, I’ve been making a more concerted effort to learn R. One of the major hurdles I’ve had to overcome was moving from following tutorials to doing more independent work. This can be quite difficult, particularly if it isn’t immediately obvious how R or another programming language is relevant or beneficial to your work. As a student whose courses have involved very little coding, and who has had to learn it almost exclusively through self-study, I’ve found this to be especially true.

Once I came across the #tidytuesday hashtag on Twitter and the community that participates in it, doing visualisations without directly following a tutorial became much easier. Having a helpful community and a pre-selected dataset to work with each week helped me establish a routine, another key element of building a coding practice. Beyond the routine itself, being able to see experienced R users’ graphs and code, and to reach out to them for tips, has been invaluable in breaking through plateaus I had previously been stuck on.

To date, I’ve only done a couple of visualisations, and relatively simple ones at that. While functional and readable, they’re not especially attractive or polished. Still, having some legible pieces of independent work to show as the beginning of a portfolio is satisfying as well as a confidence-booster. It’s an indication that I’m heading in the right direction and that the effort I’ve put into practicing R is yielding results.

There are, of course, many online communities for whatever programming language you’re interested in learning. With social media and online communities becoming ever more important for professional development and networking, finding a like-minded community such as the one around #tidytuesday or #rstats on Twitter may be the next step in moving beyond tutorials and starting independent coding tasks and projects.

For those interested, the code for my graph of the Kansas City Chiefs’ total attendance can be found on my GitHub profile.
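
For context, here is a minimal sketch of the kind of code involved, not my exact script. It assumes the #tidytuesday NFL attendance dataset from the week of 2020-02-04, and the URL and column names (team_name, year, weekly_attendance) used here are assumptions that may differ from the actual file.

    # Minimal sketch: Kansas City Chiefs' total attendance per season.
    # The dataset URL and column names are assumptions based on the
    # 2020-02-04 #tidytuesday dataset.
    library(tidyverse)

    attendance <- read_csv(
      "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/attendance.csv"
    )

    attendance %>%
      filter(team_name == "Chiefs") %>%
      group_by(year) %>%
      summarise(total_attendance = sum(weekly_attendance, na.rm = TRUE)) %>%
      ggplot(aes(x = year, y = total_attendance)) +
      geom_col() +
      labs(
        title = "Kansas City Chiefs: total attendance by season",
        x = "Season",
        y = "Total attendance"
      )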

The Importance of Data Cleaning and Preparation

For the first of my ‘portfolio’ posts, I am going to discuss one of the major stumbling blocks that I, and many others starting out in fields such as business intelligence and data science, have come across. Data cleaning and preparation is among the most important parts of the lifecycle of any business intelligence or data science project. It is commonly estimated that around 80% of the work in these fields relates to data cleaning and preparation in some way. Unfortunately, despite its obvious importance, it’s often overlooked in university programs, online courses and learning materials in general.

Often, introductory courses focus on the more exciting parts of business intelligence and data science, such as data visualization and machine learning. To an extent, this is understandable: these topics are useful ‘hooks’ to get beginners started on interesting and engaging tasks. However, without learning how to clean and prepare data, it isn’t feasible to thoroughly understand and work through all the stages of a project. Insufficient data cleaning and preparation will also compromise the final results. As the saying goes, ‘Garbage In, Garbage Out’.

In this section of the article, I will go through some general principles and best practices for data cleaning and preparation. There are, of course, many more techniques and advanced concepts in this area, but they are beyond the scope of this article. I intend this to be a starting point for people who, like myself, are new to data-related fields and want an idea of how to clean and prepare data.

When you obtain a dataset, the first thing to do is an exploratory analysis: get a feel for the data it contains. One of the first things to check is that the entries are valid. For example, do fields that require a number actually contain numerical entries? Entries should also make sense within the dataset, which requires a bit of domain knowledge about its subject. For example, in a dataset of wages, do the amounts make sense? If the average value in the dataset is, say, $100,000, and there is an entry of $1,000,000, there’s a good chance it is an incorrect entry, although this depends entirely on the context of the dataset.
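
To make this a little more concrete, here is a rough sketch of some first-pass checks in R. The data frame wages and its salary column are made-up names used purely for illustration.

    # First-pass exploratory checks on a hypothetical 'wages' data frame
    library(dplyr)

    glimpse(wages)    # are the column types what you expect? Is salary numeric?
    summary(wages)    # ranges, quartiles and NA counts at a glance

    # Flag suspiciously large salaries, e.g. more than three standard
    # deviations above the mean, for manual review
    wages %>%
      mutate(z = (salary - mean(salary, na.rm = TRUE)) / sd(salary, na.rm = TRUE)) %>%
      filter(abs(z) > 3)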

Duplicate and null entries are also a priority to check for at this stage. Particularly in larger datasets, such entries are likely to arise at some point, and they are easily overlooked because they are not always obvious at an initial glance.
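
Again assuming the hypothetical wages data frame from above, checks for duplicates and missing values might look something like this:

    # Count fully duplicated rows and keep only distinct ones
    sum(duplicated(wages))
    wages_deduped <- dplyr::distinct(wages)

    # Count missing values per column, then inspect rows with a missing salary
    colSums(is.na(wages))
    wages[is.na(wages$salary), ]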

Wikipedia provides a useful summary of the dimensions of data quality. They are as follows:

  • Validity (Do measures conform to defined rules or constraints?)
  • Accuracy (Do measures conform to a standardized value?)
  • Completeness (Are all the required measures known?)
  • Consistency (Are the recorded measures the same across the dataset?)
  • Uniformity (Does the dataset use the same units of measurement?)
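
As a rough illustration of how some of these dimensions can translate into simple checks, here is a short sketch, again using made-up column names (salary, employee_id, currency) on the hypothetical wages data frame:

    # Illustrative checks mapped to a few of the quality dimensions above
    library(dplyr)

    wages %>%
      summarise(
        invalid_salaries = sum(is.na(salary) | salary <= 0),  # validity
        missing_ids      = sum(is.na(employee_id)),           # completeness
        n_currencies     = n_distinct(currency)               # uniformity
      )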


An awareness of these dimensions of data quality, and some preliminary work to ensure they are adhered to during cleaning and preparation, can vastly improve the final results of a project, as well as save a lot of time by avoiding confusion and errors in its later stages. It can take some time to become accustomed to and can be tedious at times, but establishing good data cleaning and preparation practices is one of the most valuable things any beginner in business intelligence or data science can do.