A brief introduction to data science that explains the core concepts and algorithms: machine learning, supervised and unsupervised learning, clustering, statistics, data preprocessing, and real-world applications.
It is part of a Data Science Corner campaign in which I will discuss the fundamentals of data science, AI/ML, and statistics.
5. Challenges deep-dive
Why the Hype Around Data Science?
● The demand for data scientists will soar by 28% by 2023
● Data scientist roles have grown over 650% since 2012, but
currently, 35,000 people in the US have data science skills,
while hundreds of companies are hiring for those roles.
● Software engineering is a common starting point for
professionals in the top five fastest-growing jobs today.
● Data Science gives you career flexibility
What is Machine Learning?
Machine learning teaches computers to do what comes naturally to
humans and animals: learn from experience. Machine learning
algorithms use computational methods to “learn” information directly
from data without relying on a predetermined equation as a model.
The algorithms adaptively improve their performance as the number
of samples available for learning increases.
A Definition
A computer program is said to learn from experience E with
respect to some task T and some performance measure P if its
performance on T, as measured by P, improves with experience E.
-Tom Mitchell
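To make the definition concrete, here is a minimal Python sketch. The toy task, the evenly spaced examples, and the 1-nearest-neighbor learner are all illustrative assumptions, not part of Mitchell's definition. It measures P (accuracy) on a task T as the experience E grows.

```python
# Task T: label a number x in [0, 1) as 1 if x >= 0.5, else 0.
# Experience E: n labeled examples.
# Performance P: accuracy on a fixed set of test points.

def true_label(x):
    return 1 if x >= 0.5 else 0

def make_experience(n):
    """n evenly spaced labeled examples (an illustrative assumption)."""
    return [(i / n, true_label(i / n)) for i in range(n)]

def predict(experience, x):
    """1-nearest-neighbor: copy the label of the closest known example."""
    _, nearest_label = min(experience, key=lambda ex: abs(ex[0] - x))
    return nearest_label

def accuracy(experience, test_points):
    hits = sum(predict(experience, x) == true_label(x) for x in test_points)
    return hits / len(test_points)

test_points = [(j + 0.5) / 1000 for j in range(1000)]
accuracies = [accuracy(make_experience(n), test_points) for n in (4, 8, 16)]
print(accuracies)
```

On this toy setup the accuracy rises as n goes from 4 to 16, which is exactly what "improves with experience E" means.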
A Small Question
Suppose we feed a learning algorithm a lot of historical weather
data, and have it learn to predict weather. In this setting, what are
T, P, and E?
Real-World Applications
With the rise in big data, machine learning has become particularly
important for solving problems in areas like these:
● Image processing and computer vision, for face recognition,
motion detection, and object detection
● Computational biology, for tumor detection, drug discovery, and
DNA sequencing
● Energy production, for price and load forecasting
● Automotive, aerospace, and manufacturing, for predictive
maintenance
● Natural language processing
How Machine Learning Works
Machine learning uses two types of techniques:
● Supervised learning, which trains a model on known input and
output data so that it can predict future outputs
● Unsupervised learning, which finds hidden patterns or intrinsic
structures in input data.
Supervised Learning
The aim of supervised machine learning is to build a model that
makes predictions based on evidence in the presence of
uncertainty. A supervised learning algorithm takes a known set of
input data and known responses to the data (output) and trains a
model to generate reasonable predictions for the response to new
data.
17. Classification - predict discrete responses
Classification models classify input data into categories, for
example, whether an email is genuine or spam, or whether a tumor
is cancerous or benign.
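As a rough illustration of classification, the sketch below trains a nearest-centroid classifier in plain Python. The two email features and all of the data are hypothetical, and nearest-centroid is just one simple classifier chosen for brevity.

```python
def centroid(points):
    """Mean of a list of equal-length feature vectors."""
    return [sum(col) / len(points) for col in zip(*points)]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def fit(examples):
    """examples: list of (features, label). Returns one centroid per label."""
    by_label = {}
    for features, label in examples:
        by_label.setdefault(label, []).append(features)
    return {label: centroid(points) for label, points in by_label.items()}

def classify(model, features):
    """Predict the label whose centroid is closest to the features."""
    return min(model, key=lambda label: squared_distance(model[label], features))

# Hypothetical features per email: [number of links, number of exclamation marks]
training = [
    ([0, 0], "genuine"), ([1, 0], "genuine"), ([0, 1], "genuine"),
    ([7, 5], "spam"), ([9, 4], "spam"), ([8, 6], "spam"),
]
model = fit(training)
print(classify(model, [6, 5]))  # a new email with many links
print(classify(model, [1, 1]))  # a new email with few links
```

The output is always one of a fixed set of categories, which is what makes this a classification problem.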
Regression - predict continuous responses
Regression models predict continuous quantities, for example,
changes in temperature or fluctuations in power demand. Typical
applications include electricity load forecasting and algorithmic trading.
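A minimal regression sketch, assuming a straight-line relationship and using ordinary least squares. The temperature and demand figures are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: outdoor temperature (deg C) vs. electricity demand (MW)
temperatures = [20, 24, 28, 32, 36]
demand = [50, 58, 66, 74, 82]

slope, intercept = fit_line(temperatures, demand)
predicted = slope * 30 + intercept  # demand forecast for a 30 deg C day
print(slope, intercept, predicted)
```

Unlike a classifier, the model returns a number on a continuous scale rather than a category.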
19. Clustering is the most common unsupervised learning technique. It
is used for exploratory data analysis to find hidden patterns or
groupings in data. Applications for clustering include gene sequence
analysis, market research, and object recognition.
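One way to picture clustering is k-means. The sketch below is a one-dimensional version in plain Python. The customer spend values and the starting centers are hypothetical, and k-means is only one of many clustering algorithms.

```python
def kmeans_1d(values, centers, iterations=10):
    """Lloyd's algorithm in one dimension, starting from the given centers."""
    centers = list(centers)
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical data: monthly spend of nine customers (no labels given)
spend = [10, 12, 11, 50, 52, 49, 90, 95, 92]
centers, clusters = kmeans_1d(spend, centers=[0, 40, 100])
print(clusters)
```

Note that no labels are supplied: the groups come from the structure of the data alone.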
20. Knowledge Test
Which of the following would you apply supervised learning to?
1. Given genetic (DNA) data from a person, predict the odds of him/her developing
diabetes over the next 10 years.
2. Given a large dataset of medical records from patients suffering from heart
disease, try to learn whether there might be different clusters of such patients for
which we might tailor separate treatments.
3. Given data on how 1000 medical patients respond to an experimental drug (such
as effectiveness of the treatment, side effects, etc.), discover whether there are
different categories or "types" of patients in terms of how they respond to the
drug, and if so what these categories are.
4. Have a computer examine an audio clip of a piece of music, and classify whether
or not there are vocals (i.e., a human voice singing) in that audio clip, or if it is a
clip of only musical instruments (and no vocals).
21. Knowledge Test
Which of the following questions can be answered using a
classification algorithm?
1. How does the exchange rate depend on the GDP?
2. Does a document contain the handwritten letter S?
3. How can I group supermarket products using purchase
frequency?
22. Knowledge Test
1. Suppose you are working on weather prediction, and you
would like to predict whether or not it will be raining at 5pm
tomorrow. You want to use a learning algorithm for this. Would
you treat this as a classification or a regression problem?
2. Suppose you are working on stock market prediction. You
would like to predict whether or not a certain company will
declare bankruptcy within the next 7 days (by training on data
of similar companies that had previously been at risk of
bankruptcy). Would you treat this as a classification or a
regression problem?
24. Choosing the right algorithm can seem overwhelming
There are dozens of supervised and unsupervised machine
learning algorithms, and each takes a different approach to
learning.
25. There is no single best method and no one-size-fits-all algorithm.
Finding the right algorithm is partly just trial and error.
But algorithm selection also depends on the size and type of data
you’re working with, the insights you want to get from the data, and
how those insights will be used.
When Should We Use Machine Learning?
Consider using machine learning when you have a complex task or
problem involving a large amount of data and lots of variables, but
no existing formula or equation.
33. Knowledge Test
Have a look at the statements below and identify the one which
is not a machine learning problem
1. Given a viewer's shopping habits, recommend a product to
purchase the next time she visits your website.
2. Given the symptoms of a patient, identify her illness.
3. Predict the USD/EUR exchange rate for February 2023.
4. Compute the mean wage of 10 employees for your company.
34. Knowledge Test
Which of the following statements uses a machine learning
model?
1. Determine whether an incoming email is spam or not
2. Obtain the name of last year's FIFA Ballon d’Or champion
3. Automatically tag your new Facebook photos
4. Select the student with the highest grade on a statistics course
There is NO Straight Line
With machine learning there’s rarely a straight line from start to
finish. You’ll find yourself constantly iterating and trying different
ideas and approaches.
Machine Learning Challenges
● Data comes in all shapes and sizes
● Preprocessing your data might require specialized knowledge
and tools
● It takes time to find the best model to fit the data.
Questions to Ask Before Starting
Every machine learning workflow begins with three questions:
● What kind of data are you working with?
● What insights do you want to get from it?
● How and where will those insights be applied?
Your answers to these questions help you decide whether to use
supervised or unsupervised learning.
Data Science - Five Questions
There are only five questions that data science answers:
● Is this A or B?
● Is this weird?
● How much – or – How many?
● How is this organized?
● What should I do next?
Step 1 - Load the Data
We store the labeled data sets in a text file. A flat file format such as
text or CSV is easy to work with and makes it straightforward to
import data.
Machine learning algorithms aren’t smart enough to tell the
difference between noise and valuable information. Before using the
data for training, we need to make sure it’s clean and complete.
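Loading a labeled CSV file might look like the sketch below. It uses Python's standard csv module. The column names and values are hypothetical, and an in-memory string stands in for the file so the example runs on its own.

```python
import csv
import io

# Stand-in for a text file on disk; with a real file, open(path, newline="")
# would replace io.StringIO. Column names and values are hypothetical.
raw = io.StringIO(
    "temperature,humidity,label\n"
    "21.5,0.40,dry\n"
    "19.0,,rain\n"
    "23.1,0.35,dry\n"
)

rows = list(csv.DictReader(raw))

# A first completeness check: which rows have an empty field?
incomplete = [r for r in rows if any(value == "" for value in r.values())]
print(len(rows), "rows loaded,", len(incomplete), "with missing fields")
```

A quick count like this shows early on whether the data is complete enough to train on.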
Step 2 - Preprocess the Data
To preprocess the data we do the following:
● Look for outliers: data points that lie far outside the rest of the data
● Check for missing values
● Divide the data into two sets
○ We save part of the data for testing (the test set) and use
the rest (the training set) to build models. This is referred
to as holdout, and is a useful cross-validation technique.
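The three preprocessing steps above can be sketched in a few lines of Python. The readings are made up, and both the "2 standard deviations" outlier rule and the 20% test share are assumptions chosen for illustration, not fixed rules.

```python
import random
import statistics

# Hypothetical sensor readings; None marks a missing value.
readings = [10, 11, None, 9, 10, 12, 11, 10, 95, None, 10, 11]

# Check for missing values.
present = [r for r in readings if r is not None]
n_missing = len(readings) - len(present)

# Flag outliers: points more than 2 standard deviations from the mean.
mean = statistics.mean(present)
spread = statistics.pstdev(present)
outliers = [r for r in present if abs(r - mean) > 2 * spread]
clean = [r for r in present if r not in outliers]

# Holdout: shuffle, then keep 20% aside as the test set.
rng = random.Random(0)  # fixed seed so the split is repeatable
shuffled = clean[:]
rng.shuffle(shuffled)
n_test = max(1, len(shuffled) // 5)
test_set, training_set = shuffled[:n_test], shuffled[n_test:]

print(n_missing, outliers, len(training_set), len(test_set))
```

Whether an outlier should be dropped or kept depends on the problem, so treat the removal here as one possible choice.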
Step 3 - Derive Features
Deriving features (also known as feature engineering or feature
extraction) turns raw data into information that a machine learning
algorithm can use.
Use feature selection to:
● Improve the accuracy of a machine learning algorithm
● Boost model performance for high-dimensional data sets
● Improve model interpretability
● Prevent overfitting
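A small sketch of both ideas, under invented assumptions: raw signal windows are turned into summary features, and a very simple selection rule then drops any feature that never varies. The accelerometer data and the feature names are hypothetical.

```python
import statistics

# Hypothetical raw data: three accelerometer windows of four samples each.
windows = [
    [0.1, 0.2, 0.1, 0.2],
    [1.0, 3.0, 0.5, 2.5],
    [0.0, 0.1, 0.0, 0.1],
]

def derive_features(window):
    """Turn one raw window into a small, fixed set of summary features."""
    return {
        "mean": statistics.mean(window),
        "range": max(window) - min(window),
        "n_samples": len(window),  # constant here, so it carries no signal
    }

table = [derive_features(w) for w in windows]

# A very simple feature selection rule: drop features that never vary.
selected = [name for name in table[0]
            if len({row[name] for row in table}) > 1]
print(selected)
```

Real feature selection methods are more involved, but the goal is the same: keep only the columns that help the model.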
Step 4 - Build and Train Model
● A predefined algorithm and the training data are used to build
and train the model.
● The test data is used to evaluate the trained model.
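As a minimal sketch of the usual split of roles, the example below fits a one-threshold rule on a training set and then measures its accuracy on a separate test set. The data and the threshold learner are hypothetical and chosen for brevity.

```python
def train_threshold(training_data):
    """Pick the cut-off that classifies the most training examples correctly."""
    candidates = sorted(x for x, _ in training_data)

    def n_correct(t):
        return sum((x >= t) == label for x, label in training_data)

    return max(candidates, key=n_correct)

def evaluate(threshold, data):
    """Accuracy of the rule 'predict True when x >= threshold'."""
    return sum((x >= threshold) == label for x, label in data) / len(data)

# Hypothetical labeled data: (hours of excess vibration, machine failed?)
training_data = [(1, False), (2, False), (3, False),
                 (6, True), (7, True), (9, True)]
test_data = [(0, False), (4, False), (5, True), (8, True)]

threshold = train_threshold(training_data)     # fit on the training set only
test_accuracy = evaluate(threshold, test_data)  # score on unseen data
print(threshold, evaluate(threshold, training_data), test_accuracy)
```

Here the rule is perfect on the training set but not on the test set, which is why the model is scored on data it has not seen.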
Step 5 - Improve the Model
Improving a model can take two different directions: make the
model simpler or add complexity.
Simplify - reduce the number of features
Add Complexity - make it more fine-tuned
48. Simplify
Popular feature reduction techniques include:
● Correlation matrix – shows the relationship between
variables, so that variables (or features) that are not highly
correlated with the response can be removed.
● Principal component analysis (PCA) - eliminates redundancy
by finding a combination of features that captures key
distinctions between the original features and brings out strong
patterns in the dataset.
● Sequential feature reduction – reduces features iteratively on
the model until there is no improvement in performance
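The correlation-matrix idea can be sketched with a hand-written Pearson correlation. The sketch reads "not highly correlated" as "weakly correlated with the response", which is an assumption, as are the column names, the data, and the 0.5 cut-off.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical columns; weight_kg is the response we want to predict.
columns = {
    "height_cm": [150, 160, 170, 180, 190],
    "id_last_digit": [3, 9, 1, 7, 4],
    "weight_kg": [50, 58, 66, 75, 83],
}
names = list(columns)

# Full correlation matrix as a nested dict: matrix[a][b].
matrix = {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Candidates for removal: features weakly correlated with the response.
weak = [n for n in names
        if n != "weight_kg" and abs(matrix[n]["weight_kg"]) < 0.5]
print(weak)
```

Correlation only captures linear relationships, so a feature flagged this way deserves a second look before it is dropped.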
49. Add Complexity
● Use model combination – merge multiple simpler models into
a larger model that is better able to represent the trends in the
data than any of the simpler models could on their own.
● Add more data sources
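Model combination can be as simple as a majority vote. In the sketch below, three deliberately weak, hypothetical spam rules are merged into one predictor. Real ensembles usually combine trained models, so take this only as an illustration of the idea.

```python
from collections import Counter

# Three simple, hypothetical spam rules. Each looks at a single signal.
def many_links(email):
    return email["links"] > 3

def shouting(email):
    return email["exclamations"] > 2

def unknown_sender(email):
    return not email["sender_known"]

def majority_vote(models, email):
    """Combine the models: predict spam when most of them say spam."""
    votes = Counter(model(email) for model in models)
    return votes[True] > votes[False]

models = [many_links, shouting, unknown_sender]
email = {"links": 5, "exclamations": 0, "sender_known": False}
print(majority_vote(models, email))
```

The combined model can be right even when one of its members is wrong, as long as the majority agrees on the correct answer.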
50. TO DO
● Get started
● Familiarize yourself with maths and algorithms
● Select the infrastructure or tool
● Create your profile and participate in competitions