Introduction to Data Science

Challenges deep-dive
Why the Hype Around
Data Science?
● The demand for data scientists will soar by 28% by 2023
● Data scientist roles have grown over 650% since 2012, but
only 35,000 people in the US currently have data science skills,
while hundreds of companies are hiring for those roles.
● Software engineering is a common starting point for
professionals in the top five fastest-growing jobs today.
● Data Science gives you career flexibility

What is Machine
Learning?
Machine learning teaches computers to do what comes naturally to
humans and animals: learn from experience. Machine learning
algorithms use computational methods to “learn” information directly
from data without relying on a predetermined equation as a model.
The algorithms adaptively improve their performance as the number
of samples available for learning increases.

A Definition
A computer program is said to learn from experience E with
respect to some task T and some performance measure P if its
performance on T, as measured by P, improves with experience E.
-Tom Mitchell

A Small Question
Suppose we feed a learning algorithm a lot of historical weather
data, and have it learn to predict weather. In this setting, what are
T, P, and E?

More Data,
More Questions,
Better Answers

Real World
Applications
With the rise in big data, machine learning has become particularly
important for solving problems in areas like these:
● Image processing and computer vision, for face recognition,
motion detection, and object detection
● Computational biology, for tumor detection, drug discovery, and
DNA sequencing
● Energy production, for price and load forecasting
● Automotive, aerospace, and manufacturing, for predictive
maintenance
● Natural language processing

How Machine
Learning Works
Machine learning uses two types of techniques:
● Supervised learning, which trains a model on known input and
output data so that it can predict future outputs
● Unsupervised learning, which finds hidden patterns or intrinsic
structures in input data.

Supervised
Learning
The aim of supervised machine learning is to build a model that
makes predictions based on evidence in the presence of
uncertainty. A supervised learning algorithm takes a known set of
input data and known responses to the data (output) and trains a
model to generate reasonable predictions for the response to new
data.

Classification - predict discrete responses
Classification models classify input data into categories, for
example, whether an email is genuine or spam, or whether a tumor
is cancerous or benign.
Regression - predict continuous responses
Regression models predict continuous quantities, for example,
changes in temperature or fluctuations in power demand. Typical
applications include electricity load forecasting and algorithmic
trading.
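
To make the two kinds of supervised learning concrete, here is a minimal sketch in plain Python. The data, the feature meanings, and the function names (nearest_neighbor_predict, fit_line) are invented for illustration. A 1-nearest-neighbor rule stands in for classification and a least-squares line for regression; real projects would normally use a library.

```python
# Classification: 1-nearest-neighbor, returns a discrete label.
def nearest_neighbor_predict(train_points, train_labels, query):
    """Return the label of the training point closest to `query`."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    best_index = min(range(len(train_points)),
                     key=lambda i: squared_distance(train_points[i], query))
    return train_labels[best_index]


# Regression: least-squares line y = slope * x + intercept, returns a number.
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical emails described by two made-up numeric features.
emails = [(0.9, 0.8), (0.8, 0.9), (0.1, 0.2), (0.2, 0.1)]
labels = ["spam", "spam", "genuine", "genuine"]
predicted_label = nearest_neighbor_predict(emails, labels, (0.85, 0.7))

# Hypothetical temperature (input) and power demand (known response).
temperatures = [10.0, 15.0, 20.0, 25.0]
power_demand = [50.0, 60.0, 70.0, 80.0]
slope, intercept = fit_line(temperatures, power_demand)
predicted_demand = slope * 30.0 + intercept
```

In both cases the model is built from known inputs paired with known responses and then asked about an input it has not seen.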

Unsupervised
Learning
Unsupervised learning finds hidden patterns or intrinsic structures in
data. It is used to draw inferences from datasets consisting of input
data without labeled responses.

Clustering is the most common unsupervised learning technique. It
is used for exploratory data analysis to find hidden patterns or
groupings in data. Applications for clustering include gene sequence
analysis, market research, and object recognition.
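
As one possible illustration of clustering, the sketch below runs a basic k-means loop on a handful of invented 2-D points. The slides do not name a specific clustering algorithm, so k-means is an assumption here, as are the function name k_means and the hand-picked starting centroids.

```python
def k_means(points, centroids, iterations=10):
    """Plain k-means on 2-D points, starting from the given centroids."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2
                         for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append(
                    (sum(p[0] for p in cluster) / len(cluster),
                     sum(p[1] for p in cluster) / len(cluster)))
            else:
                new_centroids.append(old)  # leave an empty cluster's centroid alone
        centroids = new_centroids
    return centroids, clusters


# Invented, unlabeled points that form two visible groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters = k_means(points, centroids=[(1.0, 1.0), (9.0, 8.0)])
```

No labels are supplied; the grouping comes only from how close the points are to one another.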

Knowledge Test
Which of the following would you apply supervised learning to?
1. Given genetic (DNA) data from a person, predict the odds of him/her developing
diabetes over the next 10 years.
2. Given a large dataset of medical records from patients suffering from heart
disease, try to learn whether there might be different clusters of such patients for
which we might tailor separate treatments.
3. Given data on how 1000 medical patients respond to an experimental drug (such
as effectiveness of the treatment, side effects, etc.), discover whether there are
different categories or "types" of patients in terms of how they respond to the
drug, and if so what these categories are.
4. Have a computer examine an audio clip of a piece of music, and classify whether
or not there are vocals (i.e., a human voice singing) in that audio clip, or if it is a
clip of only musical instruments (and no vocals).

Knowledge Test
Which of the following questions can be answered using a
classification algorithm?
1. How does the exchange rate depend on the GDP?
2. Does a document contain the handwritten letter S?
3. How can I group supermarket products using purchase
frequency?

Knowledge Test
1. Suppose you are working on weather prediction, and you
would like to predict whether or not it will be raining at 5pm
tomorrow. You want to use a learning algorithm for this. Would
you treat this as a classification or a regression problem?
2. Suppose you are working on stock market prediction. You
would like to predict whether or not a certain company will
declare bankruptcy within the next 7 days (by training on data
of similar companies that had previously been at risk of
bankruptcy). Would you treat this as a classification or a
regression problem?

How Do You
Decide Which
Algorithm
to Use?

Choosing the right algorithm can seem overwhelming.
There are dozens of supervised and unsupervised machine
learning algorithms, and each takes a different approach to
learning.

There is no best method or one size fits all. Finding the right
algorithm is partly trial and error.
But algorithm selection also depends on the size and type of data
you’re working with, the insights you want to get from the data, and
how those insights will be used.

When should we use
Machine Learning?
Consider using machine learning when you have a complex task or
problem involving a large amount of data and lots of variables, but
no existing formula or equation.

Knowledge Test
Have a look at the statements below and identify the one which
is not a machine learning problem
1. Given a viewer's shopping habits, recommend a product to
purchase the next time she visits your website.
2. Given the symptoms of a patient, identify her illness.
3. Predict the USD/EUR exchange rate for February 2023.
4. Compute the mean wage of 10 employees for your company.

Knowledge Test
Which of the following statements uses a machine learning
model?
1. Determine whether an incoming email is spam or not
2. Obtain the name of last year's FIFA Ballon d’Or champion
3. Automatically tag your new Facebook photos
4. Select the student with the highest grade on a statistics course

There is NO
Straight Line
With machine learning there’s rarely a straight line from start to
finish. You’ll find yourself constantly iterating and trying different
ideas and approaches.

Machine Learning
Challenges
● Data comes in all shapes and sizes
● Preprocessing your data might require specialized knowledge
and tools
● It takes time to find the best model to fit the data.

Questions to Ask
Before Starting
Every machine learning workflow begins with three questions:
● What kind of data are you working with?
● What insights do you want to get from it?
● How and where will those insights be applied?
Your answers to these questions help you decide whether to use
supervised or unsupervised learning.

Data Science -
Five Questions
There are only five questions that data science answers:
● Is this A or B?
● Is this weird?
● How much – or – How many?
● How is this organized?
● What should I do next?

Step 1 -
Load the Data
We store the labeled data sets in a text file. A flat file format such as
text or CSV is easy to work with and makes it straightforward to
import data.
Machine learning algorithms aren’t smart enough to tell the
difference between noise and valuable information. Before using the
data for training, we need to make sure it’s clean and complete.
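
A minimal sketch of loading labeled data from CSV with Python's standard csv module. The column names and values are invented, and an in-memory string stands in for a file on disk. Blank fields are kept as None so they can be dealt with during preprocessing.

```python
import csv
import io

# Stand-in for a CSV file; the columns and values are made up.
raw_text = """temperature,humidity,label
21.5,0.40,dry
19.0,0.85,rain
,0.90,rain
22.1,0.35,dry
"""


def load_rows(handle):
    """Read CSV rows into dicts, converting numbers and marking blanks as None."""
    rows = []
    for record in csv.DictReader(handle):
        rows.append({
            "temperature": float(record["temperature"]) if record["temperature"] else None,
            "humidity": float(record["humidity"]) if record["humidity"] else None,
            "label": record["label"],
        })
    return rows


rows = load_rows(io.StringIO(raw_text))
# Rows with every field present; incomplete rows need attention before training.
complete_rows = [r for r in rows if None not in r.values()]
```

With a real file, the same function can be given a handle from open(path, newline="").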

Step 2 -
Preprocess the Data
To preprocess the data we do the following:
● Look for outliers–data points that lie outside the rest of the data
● Check for missing values
● Divide the data into two sets
○ We save part of the data for testing (the test set) and use
the rest (the training set) to build models. This is referred
to as holdout, and is a useful cross-validation technique.
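
The three preprocessing steps can be sketched as follows. The readings are invented, and the outlier rule (distance from the median measured in median absolute deviations) is just one reasonable choice, not something the slides prescribe. The helper names find_outliers and holdout_split are hypothetical.

```python
import random
import statistics


def find_outliers(values, scale=5.0):
    """Flag values further than `scale` median absolute deviations from the median."""
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    return [v for v in values if abs(v - median) > scale * mad]


def holdout_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and set aside a fraction as the test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]  # (training set, test set)


# Invented sensor readings; None marks a missing value.
readings = [10.0, 11.0, 9.5, 10.5, None, 10.2, 55.0, 9.8]

missing_count = sum(r is None for r in readings)       # check for missing values
present = [r for r in readings if r is not None]
outliers = find_outliers(present)                       # look for outliers
clean = [v for v in present if v not in outliers]
train_set, test_set = holdout_split(clean)              # divide into two sets
```

Whether to drop, correct, or keep an outlier depends on the data; dropping it here only keeps the example short.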

Step 3 -
Derive Features
Deriving features (also known as feature engineering or feature
extraction) turns raw data into information that a machine learning
algorithm can use.
Use feature selection to:
• Improve the accuracy of a machine learning algorithm
• Boost model performance for high-dimensional data sets
• Improve model interpretability
• Prevent overfitting
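
As a small, hedged illustration of deriving features, the sketch below turns one day of raw temperature readings into three summary numbers a learning algorithm could use. The readings and the choice of features (mean, spread, standard deviation) are assumptions made for this example.

```python
import statistics


def derive_features(readings):
    """Summarize a raw series of readings as a few numeric features."""
    return {
        "mean": statistics.fmean(readings),
        "spread": max(readings) - min(readings),
        "std_dev": statistics.pstdev(readings),
    }


# Invented raw readings for a single day.
raw_day = [12.0, 14.0, 18.0, 20.0, 16.0]
features = derive_features(raw_day)
```

Feature selection would then decide which of these derived values are worth keeping.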

Step 4 -
Build and Train Model
● The predefined algorithms and the training data are used for
building and training the model.
● The held-out test data is used to evaluate the trained model.
Step 5 -
Improve the Model
Improving a model can take two different directions: make the
model simpler or add complexity.
Simplify - reduce the number of features
Add Complexity - make it more fine-tuned

Simplify
Popular feature reduction techniques include:
● Correlation matrix – shows the relationship between
variables, so that variables (or features) that are not highly
correlated with the response can be removed.
● Principal component analysis (PCA) - eliminates redundancy
by finding a combination of features that captures key
distinctions between the original features and brings out strong
patterns in the dataset.
● Sequential feature reduction – reduces features iteratively on
the model until there is no improvement in performance.
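
The correlation-matrix idea can be sketched with a hand-written Pearson correlation. The feature names and numbers are invented. How the correlations are read (a near-duplicate pair is redundant, a feature weakly related to the response is a candidate for removal) is an assumption stated for this example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Invented features; temperature_f is temperature_c in other units.
features = {
    "temperature_c": [10.0, 15.0, 20.0, 25.0, 30.0],
    "temperature_f": [50.0, 59.0, 68.0, 77.0, 86.0],
    "day_of_month": [3.0, 1.0, 4.0, 1.0, 5.0],
}
response = [52.0, 61.0, 69.0, 81.0, 90.0]  # invented power demand

names = list(features)
# Feature-to-feature correlations: values near 1 or -1 suggest redundancy.
correlation_matrix = {a: {b: pearson(features[a], features[b]) for b in names}
                      for a in names}
# Feature-to-response correlations: values near 0 suggest little linear signal.
response_correlation = {name: pearson(features[name], response) for name in names}
```

Correlation only captures linear relationships, so a low value is a hint to investigate rather than proof that a feature is useless.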

Add Complexity
● Use model combination – merge multiple simpler models into
a larger model that is better able to represent the trends in the
data than any of the simpler models could on their own.
● Add more data sources

TO DO
● Getting Started
● Familiarize yourself with the
maths and algorithms
● Select the Infrastructure or
Tool
● Create your profile and
participate in competitions

Christy Abraham Joy
Email - christyabrahamjoy@gmail.com
Mob - +91 94000 95273
Feel Free to Contact!
