What is Data Science?

  • Data Science is an interdisciplinary field that uses scientific methods, algorithms, and systems to extract knowledge and insights from structured and unstructured data.
  • It combines statistics, programming, domain knowledge, and machine learning to solve real-world problems.

The Data Science Workflow

Problem Definition
     ↓
Data Collection
     ↓
Data Cleaning & Wrangling
     ↓
Exploratory Data Analysis (EDA)
     ↓
Feature Engineering
     ↓
Model Building & Training
     ↓
Model Evaluation
     ↓
Deployment & Monitoring
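The steps above can be sketched end-to-end with scikit-learn. This is a minimal illustration on synthetic data (the columns, seed, and model choice are all arbitrary), not a template for a real project:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Data Collection (here: synthetic)
rng = np.random.default_rng(42)
df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
df["y"] = (df["x1"] + df["x2"] > 0).astype(int)

# Data Cleaning: drop missing rows (none here, but always check)
df = df.dropna()

X = df[["x1", "x2"]].values
y = df["y"].values

# Model Building & Training (scaler fitted on train only)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(X_train)
model = LogisticRegression().fit(scaler.transform(X_train), y_train)

# Model Evaluation
acc = accuracy_score(y_test, model.predict(scaler.transform(X_test)))
print(f"Accuracy: {acc:.2f}")
```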

History

  • How: Emerged from statistics and computer science in the early 2000s. The term “Data Scientist” was popularized by DJ Patil and Jeff Hammerbacher around 2008.
  • Who: Key figures include John Tukey (EDA), Leo Breiman (Random Forests), Geoffrey Hinton (Deep Learning), Yann LeCun, Yoshua Bengio.
  • Why: Explosion of digital data (Big Data) required new tools and methods beyond traditional statistics to extract value at scale.
  • Timeline:
    • 1960s — Statistical computing begins
    • 1977 — John Tukey coins “Exploratory Data Analysis”
    • 1990s — Data mining and knowledge discovery emerge
    • 2001 — William Cleveland proposes “Data Science” as a discipline
    • 2008 — “Data Scientist” role defined at Facebook/LinkedIn
    • 2012 — Harvard Business Review calls it “The Sexiest Job of the 21st Century”
    • 2015+ — Deep learning revolution; TensorFlow, PyTorch released
    • 2020+ — AutoML, MLOps, LLMs become mainstream

Introduction

Advantages

  • Turns raw data into actionable insights and business value
  • Powers AI/ML applications across every industry
  • High demand and well-compensated career path
  • Open-source ecosystem (Python, R) lowers barrier to entry
  • Applicable to virtually every domain: healthcare, finance, gaming, science

Disadvantages

  • Requires strong math background (linear algebra, calculus, statistics)
  • Data quality issues consume most of the work (80% cleaning, 20% modeling)
  • Models can be black boxes — hard to interpret and explain
  • Privacy and ethical concerns with personal data
  • Computationally expensive for large-scale training

Data Roles

Data Analyst     → Describes what happened (reports, dashboards)
Data Scientist   → Predicts what will happen (models, ML)
Data Engineer    → Builds pipelines to move/store data
ML Engineer      → Deploys and scales ML models in production
AI Researcher    → Advances the theory and algorithms

Mathematics & Statistics Foundations

Linear Algebra

  • Core to understanding ML algorithms and neural networks.
import numpy as np
 
# Vectors
v = np.array([1, 2, 3])
 
# Matrix
A = np.array([[1, 2], [3, 4]])
 
# Dot product
print(np.dot(v, v))        # 14
 
# Matrix multiplication
B = np.array([[5, 6], [7, 8]])
print(A @ B)               # [[19 22], [43 50]]
 
# Transpose
print(A.T)                 # [[1 3], [2 4]]
 
# Inverse
print(np.linalg.inv(A))
 
# Eigenvalues & Eigenvectors (used in PCA)
vals, vecs = np.linalg.eig(A)
  • Key concepts: vectors, matrices, dot product, matrix multiplication, transpose, inverse, eigenvalues, eigenvectors, SVD
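SVD from the key-concepts list, as a quick NumPy check on the same matrix used above:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]], dtype=float)

# Singular Value Decomposition: A = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(A)
print(S)                         # singular values, largest first

# Reconstruct A from its factors
A_rec = U @ np.diag(S) @ Vt
print(np.allclose(A, A_rec))     # True
```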

Statistics Essentials

import numpy as np
from scipy import stats
 
data = [2, 4, 4, 4, 5, 5, 7, 9]
 
# Central tendency
print(np.mean(data))       # 5.0
print(np.median(data))     # 4.5
print(stats.mode(data))    # ModeResult(mode=4, count=3)
 
# Spread
print(np.var(data))        # variance: 4.0
print(np.std(data))        # standard deviation: 2.0
 
# Percentiles
print(np.percentile(data, 25))   # Q1
print(np.percentile(data, 75))   # Q3
 
# Correlation
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(np.corrcoef(x, y)[0, 1])   # Pearson r
  • Key concepts: mean, median, mode, variance, std dev, distributions, hypothesis testing, p-value, confidence intervals, correlation
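Hypothesis testing, p-values, and confidence intervals from the list above, sketched with a one-sample t-test on simulated data (the population mean of 5.3 and the seed are made up):

```python
import numpy as np
from scipy import stats

# Simulated sample; in practice this is your measured data
rng = np.random.default_rng(0)
sample = rng.normal(loc=5.3, scale=1.0, size=100)

# One-sample t-test: H0 says the true mean is 5
t_stat, p_value = stats.ttest_1samp(sample, popmean=5)
print(f"t={t_stat:.2f}, p={p_value:.4f}")   # small p → reject H0

# 95% confidence interval for the sample mean
ci = stats.t.interval(0.95, df=len(sample) - 1,
                      loc=np.mean(sample), scale=stats.sem(sample))
print(ci)
```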

Probability

from scipy.stats import norm, binom, poisson
 
# Normal distribution
mu, sigma = 0, 1
print(norm.pdf(0, mu, sigma))     # PDF at x=0: 0.3989
print(norm.cdf(1.96, mu, sigma))  # CDF: ~0.975 (95% CI)
 
# Binomial distribution
# P(X=3) for n=10 trials, p=0.5
print(binom.pmf(3, 10, 0.5))      # 0.117
 
# Poisson distribution
print(poisson.pmf(2, mu=3))        # P(X=2) with lambda=3
  • Key concepts: probability rules, Bayes’ theorem, distributions (normal, binomial, Poisson), expected value, law of large numbers
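Bayes' theorem from the list above, worked through with hypothetical numbers for a medical screening test (1% prevalence, 95% sensitivity, 90% specificity):

```python
# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_disease = 0.01
p_pos_given_disease = 0.95        # sensitivity
p_pos_given_healthy = 0.10        # 1 - specificity (false positive rate)

# Total probability of a positive test (law of total probability)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Posterior: probability of disease given a positive test
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(f"{p_disease_given_pos:.3f}")   # ≈ 0.088 — under 9%, despite the "95% accurate" test
```

The low posterior is the classic base-rate result: with a rare condition, most positives are false positives.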

Python for Data Science

  • See Python for full language reference. This section covers the data science-specific stack.

NumPy — Numerical Computing

import numpy as np
 
# Create arrays
a = np.array([1, 2, 3, 4, 5])
zeros = np.zeros((3, 4))          # 3x4 matrix of zeros
ones = np.ones((2, 3))
rng = np.arange(0, 10, 2)         # [0 2 4 6 8]
linspace = np.linspace(0, 1, 5)   # [0. 0.25 0.5 0.75 1.]
rand = np.random.rand(3, 3)       # random 3x3 matrix
 
# Array operations (vectorized — no loops needed)
b = np.array([10, 20, 30, 40, 50])
print(a + b)       # [11 22 33 44 55]
print(a * 2)       # [2 4 6 8 10]
print(a ** 2)      # [1 4 9 16 25]
print(np.sqrt(a))  # ≈ [1. 1.41 1.73 2. 2.24]
 
# Indexing & slicing
m = np.array([[1,2,3],[4,5,6],[7,8,9]])
print(m[1, 2])     # 6
print(m[:, 1])     # [2 5 8] — column 1
print(m[0, :])     # [1 2 3] — row 0
 
# Boolean indexing
print(a[a > 3])    # [4 5]
 
# Reshape
print(a.reshape(5, 1))
print(m.flatten())  # [1 2 3 4 5 6 7 8 9]
 
# Aggregations
print(m.sum())         # 45
print(m.sum(axis=0))   # column sums: [12 15 18]
print(m.sum(axis=1))   # row sums: [6 15 24]
print(m.mean())        # 5.0
print(m.max())         # 9
print(m.argmax())      # 8 (flat index)

Pandas — Data Manipulation

import pandas as pd
 
# Create DataFrame
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Diana"],
    "age": [25, 30, 35, 28],
    "score": [88.5, 92.0, 78.3, 95.1],
    "city": ["NYC", "LA", "NYC", "Chicago"]
})
 
# Basic inspection
print(df.head())          # first 5 rows
print(df.tail(2))         # last 2 rows
print(df.shape)           # (4, 4)
print(df.dtypes)          # column data types
print(df.describe())      # statistical summary
df.info()                 # memory + null info (prints directly; returns None)
 
# Selecting
print(df["name"])                    # Series
print(df[["name", "score"]])         # DataFrame
print(df.loc[0])                     # row by label
print(df.iloc[1:3])                  # rows by position
print(df.loc[df["age"] > 28])        # filter rows
 
# Adding / modifying columns
df["grade"] = df["score"].apply(lambda x: "A" if x >= 90 else "B")
df["age_group"] = pd.cut(df["age"], bins=[0,25,35,100], labels=["young","mid","senior"])
 
# Aggregation
print(df.groupby("city")["score"].mean())
print(df["score"].mean())
print(df["city"].value_counts())
 
# Sorting
df.sort_values("score", ascending=False, inplace=True)
 
# Handling missing values
df2 = df.copy()
df2.loc[0, "score"] = None
print(df2.isnull().sum())            # count nulls per column
df2["score"] = df2["score"].fillna(df2["score"].mean())  # fill with mean (inplace on a column is deprecated)
df2.dropna(inplace=True)             # drop rows with any null
 
# Merging / joining
df_a = pd.DataFrame({"id": [1,2,3], "val": ["a","b","c"]})
df_b = pd.DataFrame({"id": [2,3,4], "info": ["x","y","z"]})
merged = pd.merge(df_a, df_b, on="id", how="inner")
 
# Read/write files
df.to_csv("output.csv", index=False)
df_loaded = pd.read_csv("output.csv")
df.to_json("output.json")
df_excel = pd.read_excel("data.xlsx")

Data Visualization

Matplotlib — Core Plotting

import matplotlib.pyplot as plt
import numpy as np
 
x = np.linspace(0, 10, 100)
 
# Line plot
plt.figure(figsize=(8, 4))
plt.plot(x, np.sin(x), label="sin(x)", color="blue")
plt.plot(x, np.cos(x), label="cos(x)", color="red", linestyle="--")
plt.title("Trig Functions")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
 
# Subplots
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].hist(np.random.randn(1000), bins=30, color="steelblue")
axes[0].set_title("Histogram")
axes[1].scatter(np.random.rand(50), np.random.rand(50), alpha=0.6)
axes[1].set_title("Scatter Plot")
plt.show()
 
# Bar chart
categories = ["A", "B", "C", "D"]
values = [23, 45, 12, 67]
plt.bar(categories, values, color="coral")
plt.show()

Seaborn — Statistical Visualization

import seaborn as sns
import pandas as pd
 
# Load built-in dataset
tips = sns.load_dataset("tips")
 
# Distribution plot
sns.histplot(tips["total_bill"], kde=True)
plt.show()
 
# Box plot
sns.boxplot(x="day", y="total_bill", data=tips)
plt.show()
 
# Scatter with regression line
sns.regplot(x="total_bill", y="tip", data=tips)
plt.show()
 
# Heatmap (correlation matrix)
corr = tips.select_dtypes(include="number").corr()
sns.heatmap(corr, annot=True, cmap="coolwarm", fmt=".2f")
plt.show()
 
# Pair plot — all pairwise relationships
sns.pairplot(tips, hue="sex")
plt.show()

Plotly — Interactive Visualization

import plotly.express as px
 
df = px.data.gapminder()
 
# Interactive scatter
fig = px.scatter(df[df.year == 2007],
                 x="gdpPercap", y="lifeExp",
                 size="pop", color="continent",
                 hover_name="country",
                 log_x=True, title="GDP vs Life Expectancy")
fig.show()
 
# Animated chart over time
fig = px.scatter(df, x="gdpPercap", y="lifeExp",
                 animation_frame="year", size="pop",
                 color="continent", log_x=True)
fig.show()

Exploratory Data Analysis (EDA)

  • EDA is the process of analyzing datasets to summarize their main characteristics before modeling.

EDA Checklist

1. Load & inspect data (shape, dtypes, head)
2. Check for missing values
3. Check for duplicates
4. Understand distributions (histograms, boxplots)
5. Check for outliers (IQR method, z-score)
6. Analyze correlations (heatmap)
7. Explore categorical variables (value_counts)
8. Identify target variable distribution
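The checklist maps almost one-to-one onto pandas calls. A minimal sketch on a toy DataFrame (the values are made up to show one null and one duplicate):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, 30, np.nan, 30],
                   "city": ["NYC", "LA", "NYC", "LA"]})

print(df.shape)                   # 1. inspect: (4, 2)
print(df.dtypes)
print(df.isnull().sum())          # 2. missing values: age has 1
print(df.duplicated().sum())      # 3. duplicates: row 3 repeats row 1
print(df.describe())              # 4. distributions (numeric summary)
print(df["city"].value_counts())  # 7. categorical variables
```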

Outlier Detection

import numpy as np
import pandas as pd
 
data = pd.Series([10, 12, 11, 13, 200, 9, 11, 12])
 
# IQR method
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
outliers = data[(data < lower) | (data > upper)]
print(outliers)   # 4    200
 
# Z-score method (caution: in small samples the outlier inflates the
# std, so a threshold of 3 can miss it — here the IQR method is safer)
from scipy import stats
z_scores = np.abs(stats.zscore(data))
outliers_z = data[z_scores > 3]

Feature Engineering

  • Feature engineering transforms raw data into features that better represent the underlying problem to the model.

Encoding Categorical Variables

import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
 
df = pd.DataFrame({"color": ["red", "blue", "green", "red"]})
 
# Label Encoding (ordinal — use for tree models)
le = LabelEncoder()
df["color_label"] = le.fit_transform(df["color"])
# red=2, blue=0, green=1
 
# One-Hot Encoding (use for linear models / neural nets)
df_ohe = pd.get_dummies(df["color"], prefix="color", dtype=int)  # dtype=int → 0/1 instead of bool
print(df_ohe)
#    color_blue  color_green  color_red
# 0           0            0          1
# 1           1            0          0
# 2           0            1          0
# 3           0            0          1

Scaling & Normalization

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
import numpy as np
 
X = np.array([[1, 100], [2, 200], [3, 300], [4, 400]])
 
# StandardScaler — zero mean, unit variance (z-score)
scaler = StandardScaler()
X_std = scaler.fit_transform(X)
 
# MinMaxScaler — scales to [0, 1]
mm = MinMaxScaler()
X_mm = mm.fit_transform(X)
 
# RobustScaler — uses median/IQR, robust to outliers
rb = RobustScaler()
X_rb = rb.fit_transform(X)
 
# IMPORTANT: fit on train, transform on test
# scaler.fit(X_train)
# X_train_scaled = scaler.transform(X_train)
# X_test_scaled = scaler.transform(X_test)  # NOT fit_transform!

Feature Selection

import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.ensemble import RandomForestClassifier
 
# Select top K features by statistical test
selector = SelectKBest(f_classif, k=5)
X_new = selector.fit_transform(X, y)
 
# Recursive Feature Elimination
rfe = RFE(RandomForestClassifier(), n_features_to_select=5)
rfe.fit(X, y)
print(rfe.support_)   # boolean mask of selected features
 
# Feature importance from tree models
rf = RandomForestClassifier()
rf.fit(X, y)
importances = pd.Series(rf.feature_importances_, index=feature_names)  # feature_names = your column names
importances.sort_values(ascending=False).plot(kind="bar")

Machine Learning

ML Types Overview

Supervised Learning    → labeled data → predict output
  ├── Classification   → discrete output (spam/not spam)
  └── Regression       → continuous output (house price)

Unsupervised Learning  → no labels → find structure
  ├── Clustering       → group similar data
  └── Dimensionality Reduction → compress features

Reinforcement Learning → agent learns via rewards/penalties

Train/Test Split & Cross-Validation

from sklearn.model_selection import train_test_split, cross_val_score, KFold
 
# Basic split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
 
# K-Fold Cross Validation (more reliable evaluation)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kf, scoring="accuracy")
print(f"CV Accuracy: {scores.mean():.3f} ± {scores.std():.3f}")

Regression Algorithms

from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
 
# Linear Regression
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
 
print(f"MSE:  {mean_squared_error(y_test, y_pred):.4f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.4f}")
print(f"R²:   {r2_score(y_test, y_pred):.4f}")
 
# Ridge (L2 regularization — penalizes large weights)
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
 
# Lasso (L1 regularization — drives some weights to zero)
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
 
# ElasticNet (L1 + L2 combined)
en = ElasticNet(alpha=0.1, l1_ratio=0.5)
en.fit(X_train, y_train)

Classification Algorithms

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
 
# Logistic Regression
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)
y_pred = log_reg.predict(X_test)
 
# Evaluation
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
 
# Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
 
# Gradient Boosting (XGBoost-style)
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1)
gb.fit(X_train, y_train)
 
# SVM
svm = SVC(kernel="rbf", C=1.0, gamma="scale")
svm.fit(X_train, y_train)
 
# KNN
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

Clustering (Unsupervised)

from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.metrics import silhouette_score
 
# K-Means
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = kmeans.fit_predict(X)
print(f"Silhouette Score: {silhouette_score(X, labels):.3f}")
print(f"Cluster Centers:\n{kmeans.cluster_centers_}")
 
# Elbow method — find optimal k
inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X)
    inertias.append(km.inertia_)
# Plot inertias vs k and look for the "elbow"
 
# DBSCAN — density-based, finds arbitrary shapes, handles noise
db = DBSCAN(eps=0.5, min_samples=5)
labels_db = db.fit_predict(X)
# label -1 = noise points

Dimensionality Reduction

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
 
# PCA — linear dimensionality reduction
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
 
# Keep 95% of variance
pca_95 = PCA(n_components=0.95)
X_reduced = pca_95.fit_transform(X)
print(f"Components needed: {pca_95.n_components_}")
 
# t-SNE — non-linear, great for visualization (not for production)
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
X_tsne = tsne.fit_transform(X)

Hyperparameter Tuning

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
 
# Grid Search — exhaustive search
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5]
}
grid = GridSearchCV(RandomForestClassifier(), param_grid, cv=5, scoring="accuracy", n_jobs=-1)
grid.fit(X_train, y_train)
print(f"Best params: {grid.best_params_}")
print(f"Best score:  {grid.best_score_:.4f}")
 
# Randomized Search — faster for large spaces
from scipy.stats import randint
param_dist = {"n_estimators": randint(50, 500), "max_depth": randint(3, 20)}
rand = RandomizedSearchCV(RandomForestClassifier(), param_dist, n_iter=20, cv=5)
rand.fit(X_train, y_train)

Model Evaluation & Metrics

Classification Metrics

Accuracy    = (TP + TN) / Total          — overall correctness
Precision   = TP / (TP + FP)             — of predicted positives, how many are correct
Recall      = TP / (TP + FN)             — of actual positives, how many did we catch
F1 Score    = 2 * (Precision * Recall) / (Precision + Recall)
ROC-AUC     — area under ROC curve (1.0 = perfect, 0.5 = random)

Use Precision when false positives are costly (spam filter)
Use Recall when false negatives are costly (cancer detection)
Use F1 when you need balance between both
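The formulas above can be checked with hypothetical confusion-matrix counts:

```python
# Hypothetical counts: TP=40, FP=10, FN=20, TN=30
TP, FP, FN, TN = 40, 10, 20, 30

accuracy  = (TP + TN) / (TP + FP + FN + TN)           # 70/100 = 0.70
precision = TP / (TP + FP)                            # 40/50  = 0.80
recall    = TP / (TP + FN)                            # 40/60  ≈ 0.67
f1 = 2 * precision * recall / (precision + recall)    # ≈ 0.73
print(accuracy, precision, round(recall, 2), round(f1, 2))
```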
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, roc_curve)
 
print(f"Accuracy:  {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred):.4f}")
print(f"Recall:    {recall_score(y_test, y_pred):.4f}")
print(f"F1:        {f1_score(y_test, y_pred):.4f}")
 
# ROC-AUC (needs probability scores)
y_proba = model.predict_proba(X_test)[:, 1]
print(f"ROC-AUC:   {roc_auc_score(y_test, y_proba):.4f}")

Regression Metrics

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np
 
print(f"MAE:  {mean_absolute_error(y_test, y_pred):.4f}")
print(f"MSE:  {mean_squared_error(y_test, y_pred):.4f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.4f}")
print(f"R²:   {r2_score(y_test, y_pred):.4f}")
# R² = 1.0 is perfect, 0.0 = predicts mean, negative = worse than mean

Bias-Variance Tradeoff

High Bias (Underfitting)   → model too simple, misses patterns
                             → train error high, test error high

High Variance (Overfitting) → model too complex, memorizes noise
                             → train error low, test error high

Goal: find the sweet spot (low bias + low variance)

Fix Underfitting:
  - More complex model
  - More features
  - Less regularization

Fix Overfitting:
  - More training data
  - Regularization (L1/L2, dropout)
  - Simpler model
  - Cross-validation
  - Early stopping
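The underfitting/overfitting pattern above can be reproduced with polynomial regression on a noisy sine curve (the degrees, noise level, and seed are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Noisy sine curve
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(80, 1))
y = np.sin(3 * X).ravel() + rng.normal(scale=0.2, size=80)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

mse = {}
for degree in [1, 4, 15]:        # too simple / reasonable / too complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    mse[degree] = (mean_squared_error(y_tr, model.predict(X_tr)),
                   mean_squared_error(y_te, model.predict(X_te)))
    print(f"degree={degree:2d}  train MSE={mse[degree][0]:.3f}  test MSE={mse[degree][1]:.3f}")
# Expected pattern: degree 1 underfits (both errors high); higher degrees
# drive train error down while test error eventually stops improving
```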

Deep Learning

Neural Network Basics

Input Layer → Hidden Layers → Output Layer

Each neuron: output = activation(weights · inputs + bias)

Common Activations:
  ReLU     → max(0, x)           — hidden layers (most common)
  Sigmoid  → 1/(1+e^-x)          — binary output [0,1]
  Softmax  → e^xi / Σe^xj        — multi-class output
  Tanh     → (e^x - e^-x)/(e^x + e^-x)  — [-1, 1]

Training:
  Forward pass → compute predictions
  Loss function → measure error
  Backpropagation → compute gradients
  Optimizer → update weights
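The per-neuron formula above, written out as a single NumPy forward pass (layer sizes and weights are random placeholders, not a trained network):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(0)
x = rng.normal(size=3)                  # 3 input features

# Hidden layer: 3 inputs → 4 neurons
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
h = relu(W1 @ x + b1)                   # activation(weights · inputs + bias)

# Output layer: 4 → 1, sigmoid squashes to a probability
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)
y_hat = sigmoid(W2 @ h + b2)
print(y_hat)                            # single value in (0, 1)
```

Training would then compute a loss on `y_hat` and push gradients back through these same operations.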

TensorFlow / Keras

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
 
# Build a simple feedforward network
model = keras.Sequential([
    keras.Input(shape=(X_train.shape[1],)),   # preferred over input_shape= in modern Keras
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(1, activation="sigmoid")   # binary classification
])
 
model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=["accuracy"]
)
 
model.summary()
 
# Train
history = model.fit(
    X_train, y_train,
    epochs=50,
    batch_size=32,
    validation_split=0.2,
    callbacks=[keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)]
)
 
# Evaluate
loss, acc = model.evaluate(X_test, y_test)
print(f"Test Accuracy: {acc:.4f}")
 
# Save & load
model.save("model.keras")
loaded = keras.models.load_model("model.keras")

PyTorch

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
 
# Define model
class Net(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 1)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.3)
        self.sigmoid = nn.Sigmoid()
 
    def forward(self, x):
        x = self.dropout(self.relu(self.fc1(x)))
        x = self.dropout(self.relu(self.fc2(x)))
        return self.sigmoid(self.fc3(x))
 
# Setup
device = "cuda" if torch.cuda.is_available() else "cpu"
model = Net(X_train.shape[1]).to(device)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.BCELoss()
 
# DataLoader
X_t = torch.FloatTensor(X_train).to(device)
y_t = torch.FloatTensor(y_train).unsqueeze(1).to(device)
loader = DataLoader(TensorDataset(X_t, y_t), batch_size=32, shuffle=True)
 
# Training loop
model.train()
for epoch in range(50):
    for X_batch, y_batch in loader:
        optimizer.zero_grad()
        preds = model(X_batch)
        loss = criterion(preds, y_batch)
        loss.backward()
        optimizer.step()
 
# Save
torch.save(model.state_dict(), "model.pth")

CNN — Convolutional Neural Networks (Image Data)

from tensorflow.keras import layers, models
 
# CNN for image classification (e.g., 32x32 RGB images, 10 classes)
cnn = models.Sequential([
    layers.Input(shape=(32, 32, 3)),
    layers.Conv2D(32, (3,3), activation="relu"),
    layers.MaxPooling2D((2,2)),
    layers.Conv2D(64, (3,3), activation="relu"),
    layers.MaxPooling2D((2,2)),
    layers.Conv2D(64, (3,3), activation="relu"),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax")
])
 
cnn.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

Transfer Learning

from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras import layers, models
 
# Load pretrained base (ImageNet weights)
base = MobileNetV2(input_shape=(224,224,3), include_top=False, weights="imagenet")
base.trainable = False  # freeze base layers
 
# Add custom head
model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation="relu"),
    layers.Dense(num_classes, activation="softmax")
])
 
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# Fine-tune: unfreeze top layers of base after initial training

Natural Language Processing (NLP)

Text Preprocessing

import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
 
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")
 
text = "Data Science is Amazing! It's changing the world in 2024."
 
# Lowercase
text = text.lower()
 
# Remove punctuation & numbers
text = re.sub(r"[^a-z\s]", "", text)
 
# Tokenize
tokens = word_tokenize(text)
# ['data', 'science', 'is', 'amazing', 'its', 'changing', 'the', 'world', 'in']
 
# Remove stopwords
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stop_words]
# ['data', 'science', 'amazing', 'changing', 'world']
 
# Stemming (crude — cuts suffix)
stemmer = PorterStemmer()
stemmed = [stemmer.stem(t) for t in tokens]
 
# Lemmatization (better — uses vocabulary)
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(t) for t in tokens]

TF-IDF Vectorization

from sklearn.feature_extraction.text import TfidfVectorizer
 
corpus = [
    "data science is great",
    "machine learning is part of data science",
    "deep learning is a subset of machine learning"
]
 
tfidf = TfidfVectorizer(max_features=20, stop_words="english")
X = tfidf.fit_transform(corpus)
 
print(tfidf.get_feature_names_out())
print(X.toarray())

Transformers & Hugging Face

from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
 
# Sentiment analysis (zero-shot, no training needed)
classifier = pipeline("sentiment-analysis")
result = classifier("Data Science is an amazing field!")
print(result)  # [{'label': 'POSITIVE', 'score': 0.9998}]
 
# Text generation
generator = pipeline("text-generation", model="gpt2")
output = generator("Data science helps us", max_length=50)
 
# Named Entity Recognition
ner = pipeline("ner", grouped_entities=True)
entities = ner("Guido van Rossum created Python at CWI in Amsterdam.")
 
# Fine-tune a BERT model
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

Big Data Tools

Apache Spark (PySpark)

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, avg, count
 
spark = SparkSession.builder.appName("DataScience").getOrCreate()
 
# Read data
df = spark.read.csv("data.csv", header=True, inferSchema=True)
 
# DataFrame operations (similar to pandas but distributed)
df.printSchema()
df.show(5)
df.describe().show()
 
# Filter & aggregate
result = (df.filter(col("age") > 25)
            .groupBy("city")
            .agg(avg("salary").alias("avg_salary"),
                 count("*").alias("count"))
            .orderBy("avg_salary", ascending=False))
result.show()
 
# ML with Spark MLlib
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import VectorAssembler
 
assembler = VectorAssembler(inputCols=["age", "salary"], outputCol="features")
df_ml = assembler.transform(df)
rf = RandomForestClassifier(featuresCol="features", labelCol="label")
model = rf.fit(df_ml)

MLOps — Production Machine Learning

  • MLOps applies DevOps principles to machine learning: automating, monitoring, and maintaining ML models in production.

ML Pipeline

Data Ingestion → Preprocessing → Training → Evaluation → Deployment → Monitoring
            ↑_______________________________________________|
                            (retraining loop)

Scikit-learn Pipelines

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
 
# Bundle preprocessing + model into one object
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier(n_estimators=100))
])
 
# Fit and predict — preprocessing is applied automatically
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
 
# Cross-validate the whole pipeline (no data leakage)
scores = cross_val_score(pipe, X, y, cv=5)
print(f"CV: {scores.mean():.3f} ± {scores.std():.3f}")
 
# Save pipeline
import joblib
joblib.dump(pipe, "pipeline.pkl")
loaded_pipe = joblib.load("pipeline.pkl")

Model Tracking with MLflow

import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
 
mlflow.set_experiment("my-experiment")
 
with mlflow.start_run():
    # Log parameters
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 5)
 
    # Train
    rf = RandomForestClassifier(n_estimators=100, max_depth=5)
    rf.fit(X_train, y_train)
 
    # Log metrics
    acc = accuracy_score(y_test, rf.predict(X_test))
    mlflow.log_metric("accuracy", acc)
 
    # Log model
    mlflow.sklearn.log_model(rf, "model")
 
# View UI: mlflow ui  (then open http://localhost:5000)

Model Serving with FastAPI

from fastapi import FastAPI
from pydantic import BaseModel
import joblib
import numpy as np
 
app = FastAPI()
model = joblib.load("pipeline.pkl")
 
class InputData(BaseModel):
    features: list[float]
 
@app.post("/predict")
def predict(data: InputData):
    X = np.array(data.features).reshape(1, -1)
    prediction = model.predict(X)[0]
    probability = model.predict_proba(X)[0].tolist()
    return {"prediction": int(prediction), "probability": probability}
 
# Run: uvicorn main:app --reload
# Test: POST http://localhost:8000/predict
#        {"features": [1.2, 3.4, 5.6, 7.8]}

Advanced Topics

Time Series Analysis

import pandas as pd
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.seasonal import seasonal_decompose
 
# Create time series
dates = pd.date_range("2020-01-01", periods=100, freq="D")
ts = pd.Series(np.random.randn(100).cumsum(), index=dates)
 
# Decompose into trend + seasonal + residual
decomp = seasonal_decompose(ts, model="additive", period=7)
decomp.plot()
 
# ARIMA model
model = ARIMA(ts, order=(1, 1, 1))  # (p, d, q)
result = model.fit()
forecast = result.forecast(steps=10)
print(forecast)
 
# Rolling statistics
ts.rolling(window=7).mean().plot(label="7-day MA")
ts.rolling(window=30).std().plot(label="30-day Std")

Anomaly Detection

from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
 
# Isolation Forest — tree-based anomaly detection
iso = IsolationForest(contamination=0.05, random_state=42)
labels = iso.fit_predict(X)
# -1 = anomaly, 1 = normal
anomalies = X[labels == -1]
 
# Local Outlier Factor
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
labels_lof = lof.fit_predict(X)

Recommendation Systems

# Collaborative Filtering with Surprise library
from surprise import SVD, Dataset, Reader, accuracy
from surprise.model_selection import train_test_split
 
# Load data (userId, itemId, rating)
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df[["userId", "itemId", "rating"]], reader)
 
trainset, testset = train_test_split(data, test_size=0.2)
 
# SVD (Matrix Factorization)
svd = SVD(n_factors=50, n_epochs=20)
svd.fit(trainset)
 
predictions = svd.test(testset)
print(f"RMSE: {accuracy.rmse(predictions):.4f}")
 
# Predict for a specific user-item pair
pred = svd.predict(uid="user_1", iid="item_42")
print(f"Predicted rating: {pred.est:.2f}")

AutoML

# Auto-sklearn — automated ML pipeline selection
# pip install auto-sklearn
import autosklearn.classification
 
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=120,   # 2 minutes
    per_run_time_limit=30
)
automl.fit(X_train, y_train)
print(automl.leaderboard())
 
# TPOT — genetic algorithm-based AutoML
# pip install tpot
from tpot import TPOTClassifier
tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2)
tpot.fit(X_train, y_train)
tpot.export("best_pipeline.py")

Tools & Libraries Reference

Core Stack

NumPy        — numerical arrays & math
Pandas       — data manipulation & analysis
Matplotlib   — static plotting
Seaborn      — statistical visualization
Plotly       — interactive visualization
Scikit-learn — classical ML algorithms

Deep Learning

TensorFlow / Keras  — Google's DL framework
PyTorch             — Meta's DL framework (research-friendly)
Hugging Face        — NLP transformers & pretrained models

Big Data & Engineering

Apache Spark (PySpark)  — distributed data processing
Hadoop                  — distributed storage (HDFS)
Kafka                   — real-time data streaming
Airflow                 — workflow orchestration
dbt                     — data transformation in SQL

MLOps

MLflow      — experiment tracking & model registry
DVC         — data version control
FastAPI     — model serving API
Docker      — containerize ML apps (see [[Docker]])
Kubernetes  — scale ML services (see [[Kubernetes]])

Notebooks & IDEs

Jupyter Notebook / JupyterLab  — interactive notebooks
Google Colab                   — free GPU notebooks
VS Code + Python extension     — full IDE experience
Kaggle Notebooks               — competition environment

Learning Roadmap

Beginner Path

1. Python basics          → [[Python]]
2. NumPy & Pandas         → arrays, DataFrames
3. Matplotlib & Seaborn   → basic visualization
4. Statistics basics      → mean, std, distributions
5. First ML model         → scikit-learn LinearRegression
6. Kaggle Titanic         → first competition

Intermediate Path

1. Full EDA workflow
2. Feature engineering
3. Classification & regression algorithms
4. Model evaluation & cross-validation
5. Hyperparameter tuning
6. SQL for data (see [[MySQL]] / [[PostgreSQL]])
7. Kaggle competitions (top 25%)

Advanced Path

1. Deep learning (TensorFlow / PyTorch)
2. NLP & Transformers (Hugging Face)
3. Computer Vision (CNNs, YOLO)
4. Time series & anomaly detection
5. Big Data (PySpark)
6. MLOps (MLflow, FastAPI, Docker)
7. Research papers & custom architectures

Further Learning

GitHub & Websites

Master playlists on YouTube 📺 (free)