What is Data Science? A mystery or not?
What exactly is data science? Is it a buzzword? A fancy way of saying advanced statistics and analytics? Or is it a mystery? Well, let’s find out.
This article will cover:
- What data science is,
- What the data science life cycle is,
- And what tools data scientists use to do their work.
Data science combines several fields of study: mathematics, statistics, computer science, and information science. In data science, we use scientific methods, processes, algorithms, and computer systems to extract knowledge from structured and unstructured data. We then use that knowledge to drive business decisions, solve complex problems in business and in life, and build machines that mimic human intelligence.
These machines can see and recognize the things human beings recognize with their eyes. They can learn new skills just like a human being, do jobs that humans used to do, and, in many cases, outperform human beings in skills and knowledge.
To be really effective, data scientists need a good background in statistics, discrete mathematics, computer programming, and algorithms, along with good communication skills. It also helps to have domain knowledge of the industry you are working in.
According to an article by UC Berkeley, there are five stages to the data science process.
Stage 1 is Capturing the Data.
This stage covers data acquisition, data entry, signal reception, and data extraction.
Stage 2 is Maintaining the Data.
Under data maintenance, you have data warehousing, data cleansing, data staging, data processing, and data architecture.
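As a rough illustration of what data cleansing can look like, the Python sketch below tidies a handful of made-up records: it trims whitespace, normalizes capitalization, drops rows with a missing age, and removes duplicates. The field names, the sample records, and the clean_records helper are all invented for this example and are not taken from any real data set or library.

```python
# Hypothetical raw records, as they might arrive from a data-entry form.
raw_records = [
    {"name": "  Alice ", "age": "34"},
    {"name": "BOB", "age": ""},        # missing age
    {"name": "alice", "age": "34"},    # duplicate of the first row
    {"name": "Carol", "age": "29"},
]


def clean_records(records):
    """Return records with tidy names, integer ages, and no duplicates."""
    cleaned = []
    seen = set()
    for record in records:
        name = record["name"].strip().title()
        age_text = record["age"].strip()
        # Skip rows where a required field is missing or is not a whole number.
        if not name or not age_text.isdigit():
            continue
        key = (name, int(age_text))
        # Skip rows that duplicate one we have already kept.
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"name": name, "age": int(age_text)})
    return cleaned


cleaned_records = clean_records(raw_records)
print(cleaned_records)
```

Real projects usually rely on a dedicated data library for this kind of work, but the idea is the same: decide what a valid record looks like, then filter and normalize until every row fits.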
Stage 3 is Processing the Data.
Processing the data involves data mining, clustering/classification, data modeling, and data summarization.
Stage 4 is Analyzing the Data.
Analyzing the data mostly consists of exploratory and confirmatory analysis, predictive analysis, regression, text mining, qualitative analysis, and any other technique you need to understand your data.
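To make regression a little more concrete, here is a minimal Python sketch of simple linear regression using ordinary least squares. The numbers are made up (a hypothetical advertising spend and the sales that followed), and the fit_line helper is written just for this example.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept using ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y vary together, and how x varies on its own.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data: advertising spend and the resulting sales.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.0, 5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_line(ad_spend, sales)

# Use the fitted line to predict sales for a spend we have not seen yet.
predicted_sales = slope * 6.0 + intercept
print(slope, intercept, predicted_sales)
```

Once the line is fitted, the same slope and intercept can be used for predictive analysis, which is why regression shows up so often at this stage.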
Stage 5 is Communication.
The last stage of the data science life cycle is communicating the knowledge gained from your data science process. This often means data reporting, data visualization, business intelligence, solving business problems, and, most importantly, making decisions based on the data you are working with.
Now that we have discussed what data science is and the process of doing data science work, let’s explore some of the tools data scientists use to do their work.
R is a programming language used for statistics and data analysis. It is used mainly in academia, though it is also used in industry. It is designed primarily for data analysis and visualization.
Python is a jack-of-all-trades programming language, while R is used mostly for statistics and data analysis. Think of R as the specialist in data analysis and Python as the generalist that can do it all, from building websites to data science to building robots.
Apache Hadoop is a collection of open-source software utilities that use a network of many computers to solve data and computation problems using the MapReduce programming model.
MapReduce is a programming model for processing and generating massive data sets in parallel. A map step first filters and sorts the data, and a reduce step then performs a summary operation on the result.
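The MapReduce idea is easier to see with a small example. The Python sketch below counts words in two made-up sentences using a map step, a grouping step, and a reduce step. It runs on a single machine and does not use Hadoop or any real MapReduce API; the function names are chosen just for this illustration. In a real cluster, the framework would spread the map and reduce work across many computers.

```python
from collections import defaultdict


def map_phase(documents):
    """Emit a (word, 1) pair for every word in every document."""
    for document in documents:
        for word in document.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Summarize each group; here the summary is a simple sum."""
    return {key: sum(values) for key, values in grouped.items()}


# Made-up input documents.
documents = ["big data is big", "data science is fun"]

word_counts = reduce_phase(shuffle_phase(map_phase(documents)))
print(word_counts)
```

Because each document can be mapped independently and each word can be reduced independently, the work splits naturally across machines, which is what makes the model useful for very large data sets.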
Apache Spark is an open-source computing framework for processing data across clusters of computers.
Structured Query Language (SQL) is a language used to store, query, and manipulate data in a relational database.
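Here is a small sketch of SQL in action, run from Python with the built-in sqlite3 module and a temporary in-memory database. The employees table, its columns, and the rows in it are made up for this example.

```python
import sqlite3

# An in-memory database that disappears when the connection closes.
connection = sqlite3.connect(":memory:")

# Store data: create a table and insert a few made-up rows.
connection.execute(
    "CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER)"
)
connection.executemany(
    "INSERT INTO employees (name, department, salary) VALUES (?, ?, ?)",
    [
        ("Alice", "Analytics", 95000),
        ("Bob", "Analytics", 105000),
        ("Carol", "Engineering", 120000),
    ],
)

# Manipulate data: give one employee a raise.
connection.execute(
    "UPDATE employees SET salary = salary + 5000 WHERE name = ?", ("Alice",)
)

# Query data: average salary per department.
rows = connection.execute(
    "SELECT department, AVG(salary) FROM employees "
    "GROUP BY department ORDER BY department"
).fetchall()
print(rows)

connection.close()
```

The same CREATE, INSERT, UPDATE, and SELECT statements carry over, with small differences, to other relational databases.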
Not Only SQL (NoSQL) is a family of database systems that serve a similar purpose to SQL databases but store and manipulate data in a non-relational form, such as documents or key-value pairs.
Cloud computing relies on massive data centers built by giant technology companies. These companies rent out their data centers and computing power to other companies and individuals over the internet. This way, average people like you and me, alongside companies like Netflix, can process large data sets without spending the money, time, and resources to build our own data centers. Using cloud computing is typically much faster and cheaper than building traditional data centers, which are very expensive to build.
D3 is a JavaScript library for data visualization that works with HTML and CSS.
Tableau is interactive data visualization software.
Jupyter Notebooks and Google Colab do much the same thing. They are computational environments where you can write your code, comment on it, execute it, and add regular text and math formulas as part of your code notebook. Think of it as a regular notebook for writing whatever you want to write, but one where you can also write and execute code. There are many other computational notebooks and notebook-like tools out there, such as Atom. Personally, I like Google Colab more than all the other notebooks I have tried.
Git is software that developers use to keep track of the changes in their code, and GitHub is a website for hosting Git projects. You can use them to create different versions of your code. Multiple people working in a group can use them to keep track of each other’s code as well as pull and edit somebody else’s code. Version control is essential in every software development cycle.
So far, we have discussed what data science is, the life cycle of a data science project, and what tools data scientists use. More important than figuring out what data science is, though, is finding out IF there is a demand for people with data science skills.
And the answer to that question is YES, YES, and YES. There is a market demand for people with data science skills.
Let’s take a quick look at the numbers based on data provided by Forbes Magazine and Glassdoor.
According to Forbes Magazine, there will be a 28% increase in demand for data scientists by 2020.
According to Glassdoor, there are currently 6,510 job openings for data scientists, and the median salary is $108,000.
Data science has also been named the #1 best job in America for the last four years, as of 2019.
Now that you know there is a demand for data science jobs, the question remains: what types of careers are available with data science training?
In your own opinion, what is data science? Let me know what you think in the comment section below.