Terms like “Artificial Intelligence”, “Big Data”, “Data Analytics”, and “Machine Learning” are thrown around constantly these days, but what do they actually mean? Perhaps more importantly, why should you care?

This series dives into the world of data analytics and explains these topics to help you understand why data matters.

Let's get started.

According to a 2018 Forbes article, 2.5 quintillion bytes of data are created every single day, and 90% of the world’s data was created in the preceding two years. The problem is that data does not necessarily translate to knowledge. Without context, data and information can be impossible to analyze and comprehend. Before we get too far, it’s important to define the differences between data, information, and knowledge.

  • Data – facts and statistics collected for reference 
  • Information – data that has been processed into a form that is meaningful to the recipient
  • Knowledge – what has been understood and evaluated from the information

The translation from data to knowledge is vitally important. We use data to diagnose diseases, land planes, and make decisions. If data is not presented in a readable, comprehensible way, we cannot translate it into information or knowledge. We need a way to take data that is hard to explain with words or numbers alone and present it so that it can be translated into information and knowledge accurately (we will discuss some methods for doing this later).

We, as humans, do data analytics all the time. Whenever we compare prices of different kinds of spaghetti sauce at the store or when we shop at multiple stores to find the best value for our money, we are doing data analytics. 

Typically, when people think of data analytics or statistics, they think of measures like the mean (or average), median, and mode. While these statistics can be useful, sometimes we need to find patterns, trends, and relationships between variables. For this, we use data visualizations. There are numerous types of graphs and charts that can show how variables impact each other, and we will dive into these in a future article.
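To make the mean, median, and mode concrete, here is a minimal sketch in Python using only its standard library. The prices are hypothetical numbers made up for illustration (imagine six jars of spaghetti sauce on a shelf), not real data.

```python
from statistics import mean, median, mode

# Hypothetical prices in dollars for six jars of spaghetti sauce (made-up data).
prices = [2.50, 3.00, 3.00, 3.50, 4.00, 8.00]

average_price = mean(prices)   # sum of the values divided by how many there are
middle_price = median(prices)  # middle value once the prices are sorted
common_price = mode(prices)    # value that appears most often

print(f"mean:   {average_price:.2f}")
print(f"median: {middle_price:.2f}")
print(f"mode:   {common_price:.2f}")
```

In this made-up example, the single expensive jar pulls the mean above the median, which is one reason a lone summary statistic can hide what is really going on in the data.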

We can also train computers to recognize these patterns for us. When a computer learns to process data from examples rather than from explicit instructions, it is called machine learning. Machine learning models make predictions about new data by learning from existing data, in effect making an educated guess. For example, I could train a machine learning model with this data on passengers of the Titanic:

Passenger ID   Sex      Age         Survived
1              Male     22          0
2              Female   38          1
3              Female   26          1
4              Female   35          1
5              Male     35          0
6              Male     (missing)   0
7              Male     54          0
8              Male     2           0
9              Female   27          1
10             Female   14          1

If I created a model from this data set, it would most likely conclude that males did not survive and females did. So if I introduced a new row of data like this:

Passenger ID   Sex      Age         Survived
11             Male     27          ?

The machine learning model would most likely tell me that this person did not survive the Titanic, based on their sex. This is a very small data set, so it’s not ideal for training machine learning models, but on a larger scale this approach can be extremely helpful.
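The idea of "learning from existing data" can be sketched in a few lines of Python. This is not a real machine learning library or the model the series will build. It is a hand-written, hypothetical rule that counts the outcomes seen for each sex in the ten rows above and predicts the most common one. The function names train and predict are my own, and I am assuming that passenger 6 has a missing age and passenger 8 is 2 years old.

```python
from collections import Counter, defaultdict

# The ten training rows as (sex, age, survived). Age is None where it is missing.
passengers = [
    ("Male", 22, 0), ("Female", 38, 1), ("Female", 26, 1), ("Female", 35, 1),
    ("Male", 35, 0), ("Male", None, 0), ("Male", 54, 0), ("Male", 2, 0),
    ("Female", 27, 1), ("Female", 14, 1),
]

def train(rows):
    """Learn the most common outcome for each sex from the training rows."""
    outcomes = defaultdict(Counter)
    for sex, _age, survived in rows:
        outcomes[sex][survived] += 1
    # Keep only the most frequent outcome per sex.
    return {sex: counts.most_common(1)[0][0] for sex, counts in outcomes.items()}

def predict(model, sex):
    """Look up the learned outcome for a new passenger's sex."""
    return model[sex]

model = train(passengers)
prediction = predict(model, "Male")  # passenger 11: Male, age 27

print(model)
print("Passenger 11 predicted Survived =", prediction)
```

Note that this sketch ignores age entirely. With only ten rows, sex alone separates the survivors from the non-survivors, which is exactly why such a small data set can lead a model to an overly simple conclusion.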

There are many tools available for transforming, visualizing, cleaning, and classifying data. Some of these include R and RStudio, MATLAB, and Python. Throughout this series, I will be using R to walk through some examples, as it is one of the more common tools used among data scientists.

We will also cover the basics of data and data types, data visualizations, cleaning data, clustering and classifying data, and data regression.

Part 2: What is Data will be available next Thursday, February 13. In the meantime, follow our LinkedIn page for more great content!

Resources:

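Bernard Marr. (2018). How much data do we create every day? The mind-blowing stats everyone should read. Forbes. https://www.forbes.com/sites/bernardmarr/2018/05/21/how-much-data-do-we-create-every-day-the-mind-blowing-stats-everyone-should-read/#2fe71b3860ba

Chaim Zins. (2007). Conceptual approaches for defining data, information, and knowledge. https://doi.org/10.1002/asi.20508

Kaggle. Titanic: Machine Learning from Disaster (data). https://www.kaggle.com/c/titanic/data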