Data Science is an interdisciplinary field that uses scientific methods, algorithms, and systems to extract knowledge and insights from structured and unstructured data.
It combines statistics, programming, domain knowledge, and machine learning to solve real-world problems.
The Data Science Workflow
Problem Definition
↓
Data Collection
↓
Data Cleaning & Wrangling
↓
Exploratory Data Analysis (EDA)
↓
Feature Engineering
↓
Model Building & Training
↓
Model Evaluation
↓
Deployment & Monitoring
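The stages above can be sketched end to end in a few lines; the dataset (scikit-learn's bundled iris data) and model choice here are illustrative, not prescriptive.

```python
# Minimal end-to-end sketch of the workflow: collect -> split -> preprocess
# -> train -> evaluate. Deployment/monitoring are out of scope here.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Data collection
X, y = load_iris(return_X_y=True)

# Hold out data for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Preprocessing: scale features (fit on train only, to avoid leakage)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Model building & training
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Model evaluation
acc = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {acc:.3f}")
```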
How: Emerged from statistics and computer science in the early 2000s. The term “Data Scientist” was popularized by DJ Patil and Jeff Hammerbacher around 2008.
Who: Key figures include John Tukey (EDA), Leo Breiman (Random Forests), Geoffrey Hinton (Deep Learning), Yann LeCun, Yoshua Bengio.
Why: Explosion of digital data (Big Data) required new tools and methods beyond traditional statistics to extract value at scale.
Timeline:
1960s — Statistical computing begins
1977 — John Tukey coins “Exploratory Data Analysis”
1990s — Data mining and knowledge discovery emerge
2001 — William Cleveland proposes “Data Science” as a discipline
2008 — “Data Scientist” role defined at Facebook/LinkedIn
2012 — Harvard Business Review calls it “The Sexiest Job of the 21st Century”
2015+ — Deep learning revolution; TensorFlow, PyTorch released
2020+ — AutoML, MLOps, LLMs become mainstream
Advantages
Turns raw data into actionable insights and business value
Powers AI/ML applications across every industry
High demand and well-compensated career path
Open-source ecosystem (Python, R) lowers barrier to entry
Applicable to virtually every domain: healthcare, finance, gaming, science
Disadvantages
Requires strong math background (linear algebra, calculus, statistics)
Data quality issues consume most of the work (80% cleaning, 20% modeling)
Models can be black boxes — hard to interpret and explain
Privacy and ethical concerns with personal data
Computationally expensive for large-scale training
Data Science vs Related Fields
Data Analyst → Describes what happened (reports, dashboards)
Data Scientist → Predicts what will happen (models, ML)
Data Engineer → Builds pipelines to move/store data
ML Engineer → Deploys and scales ML models in production
AI Researcher → Advances the theory and algorithms
Mathematics & Statistics Foundations
Linear Algebra
Core to understanding ML algorithms and neural networks.
Model Evaluation Metrics
Classification Metrics
Accuracy = (TP + TN) / Total — overall correctness
Precision = TP / (TP + FP) — of predicted positives, how many are correct
Recall = TP / (TP + FN) — of actual positives, how many did we catch
F1 Score = 2 * (Precision * Recall) / (Precision + Recall)
ROC-AUC — area under ROC curve (1.0 = perfect, 0.5 = random)
Use Precision when false positives are costly (spam filter)
Use Recall when false negatives are costly (cancer detection)
Use F1 when you need balance between both
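The four formulas above, computed directly from confusion-matrix counts (the TP/FP/FN/TN values are made up for illustration):

```python
# Classification metrics from raw confusion-matrix counts
TP, FP, FN, TN = 80, 10, 20, 90  # illustrative counts

accuracy = (TP + TN) / (TP + TN + FP + FN)          # overall correctness
precision = TP / (TP + FP)                          # of predicted positives, correct fraction
recall = TP / (TP + FN)                             # of actual positives, caught fraction
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"Accuracy:  {accuracy:.3f}")   # 0.850
print(f"Precision: {precision:.3f}")  # 0.889
print(f"Recall:    {recall:.3f}")     # 0.800
print(f"F1:        {f1:.3f}")         # 0.842
```

Note how precision ignores FN and recall ignores FP, which is why a single number rarely tells the whole story.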
Regression Metrics

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

print(f"MAE: {mean_absolute_error(y_test, y_pred):.4f}")
print(f"MSE: {mean_squared_error(y_test, y_pred):.4f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.4f}")
print(f"R²: {r2_score(y_test, y_pred):.4f}")
# R² = 1.0 is perfect, 0.0 = predicts the mean, negative = worse than the mean
```
Bias-Variance Tradeoff
High Bias (Underfitting) → model too simple, misses patterns
→ train error high, test error high
High Variance (Overfitting) → model too complex, memorizes noise
→ train error low, test error high
Goal: find the sweet spot (low bias + low variance)
Fix Underfitting:
- More complex model
- More features
- Less regularization
Fix Overfitting:
- More training data
- Regularization (L1/L2, dropout)
- Simpler model
- Cross-validation
- Early stopping
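Both failure modes can be seen by comparing a depth-1 tree with an unlimited-depth tree on noisy synthetic data; the dataset parameters below are illustrative.

```python
# Underfitting vs overfitting in one comparison: a depth-limited tree
# (high bias) vs an unlimited tree (high variance) on data with label noise.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)  # flip_y adds label noise
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

shallow = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_tr, y_tr)
deep = DecisionTreeClassifier(max_depth=None, random_state=0).fit(X_tr, y_tr)

print(f"shallow: train={shallow.score(X_tr, y_tr):.2f}  test={shallow.score(X_te, y_te):.2f}")
print(f"deep:    train={deep.score(X_tr, y_tr):.2f}  test={deep.score(X_te, y_te):.2f}")
# Typical pattern: the shallow tree is mediocre on both sets (underfitting),
# while the unlimited tree is near-perfect on train but drops on test (overfitting).
```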
Transfer Learning (Keras)

```python
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras import layers, models

# Load pretrained base (ImageNet weights)
base = MobileNetV2(input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False  # freeze base layers

# Add custom head (num_classes = number of target classes in your task)
model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation="relu"),
    layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# Fine-tune: unfreeze top layers of base after initial training
```
Natural Language Processing (NLP)
Text Preprocessing
```python
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

text = "Data Science is Amazing! It's changing the world in 2024."

# Lowercase
text = text.lower()

# Remove punctuation & numbers
text = re.sub(r"[^a-z\s]", "", text)

# Tokenize
tokens = word_tokenize(text)
# ['data', 'science', 'is', 'amazing', 'its', 'changing', 'the', 'world', 'in']

# Remove stopwords
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stop_words]
# ['data', 'science', 'amazing', 'changing', 'world']

# Stemming (crude — cuts suffixes)
stemmer = PorterStemmer()
stemmed = [stemmer.stem(t) for t in tokens]

# Lemmatization (better — uses a vocabulary)
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(t) for t in tokens]
```
TF-IDF Vectorization
```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "data science is great",
    "machine learning is part of data science",
    "deep learning is a subset of machine learning",
]

tfidf = TfidfVectorizer(max_features=20, stop_words="english")
X = tfidf.fit_transform(corpus)

print(tfidf.get_feature_names_out())
print(X.toarray())
```
Transformers & Hugging Face
```python
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification

# Sentiment analysis (pretrained model, no training needed)
classifier = pipeline("sentiment-analysis")
result = classifier("Data Science is an amazing field!")
print(result)  # [{'label': 'POSITIVE', 'score': 0.9998}]

# Text generation
generator = pipeline("text-generation", model="gpt2")
output = generator("Data science helps us", max_length=50)

# Named Entity Recognition
ner = pipeline("ner", grouped_entities=True)
entities = ner("Guido van Rossum created Python at CWI in Amsterdam.")

# Fine-tune a BERT model
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
```
MLOps applies DevOps principles to machine learning: automating, monitoring, and maintaining ML models in production.
ML Pipeline
Data Ingestion → Preprocessing → Training → Evaluation → Deployment → Monitoring
↑_______________________________________________|
(retraining loop)
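The retraining loop above can be caricatured in a few lines of plain Python; the "model", the drift, and the threshold below are all simulated for illustration.

```python
# Toy monitoring/retraining loop: retrain when the live accuracy of a
# deployed "model" (here just a majority-class label) drops below a threshold.
import random

random.seed(0)

def train(data):
    # "Model" = the majority label seen in the training data
    return max(set(data), key=data.count)

def accuracy(model, batch):
    # The model predicts its stored label for every example
    return sum(label == model for label in batch) / len(batch)

model = train([1, 1, 1, 0])  # deployed model predicts 1
THRESHOLD = 0.7
retrained = 0

for step in range(10):
    # Simulated drift: after step 5, the data shifts toward label 0
    p_one = 0.9 if step < 5 else 0.1
    batch = [1 if random.random() < p_one else 0 for _ in range(100)]
    if accuracy(model, batch) < THRESHOLD:  # monitoring detects degradation
        model = train(batch)                # retraining loop kicks in
        retrained += 1

print(f"retrained {retrained} time(s)")
```

Real systems replace the accuracy check with drift detectors and delayed ground-truth labels, but the control flow is the same.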
Scikit-learn Pipelines
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Bundle preprocessing + model into one object
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier(n_estimators=100)),
])

# Fit and predict — preprocessing is applied automatically
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)

# Cross-validate the whole pipeline (no data leakage)
scores = cross_val_score(pipe, X, y, cv=5)
print(f"CV: {scores.mean():.3f} ± {scores.std():.3f}")

# Save pipeline
import joblib
joblib.dump(pipe, "pipeline.pkl")
loaded_pipe = joblib.load("pipeline.pkl")
```
Big Data Tools
Apache Spark (PySpark) — distributed data processing
Hadoop — distributed storage (HDFS)
Kafka — real-time data streaming
Airflow — workflow orchestration
dbt — data transformation in SQL
MLOps
MLflow — experiment tracking & model registry
DVC — data version control
FastAPI — model serving API
Docker — containerize ML apps (see [[Docker]])
Kubernetes — scale ML services (see [[Kubernetes]])
Notebooks & IDEs
Jupyter Notebook / JupyterLab — interactive notebooks
Google Colab — free GPU notebooks
VS Code + Python extension — full IDE experience
Kaggle Notebooks — competition environment
Learning Roadmap
Beginner Path
1. Python basics → [[Python]]
2. NumPy & Pandas → arrays, DataFrames
3. Matplotlib & Seaborn → basic visualization
4. Statistics basics → mean, std, distributions
5. First ML model → scikit-learn LinearRegression
6. Kaggle Titanic → first competition
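Step 5 above in miniature: a first LinearRegression fit on synthetic data (the slope-3, intercept-7 relationship is made up for illustration).

```python
# First ML model: fit a line to noisy synthetic data and check that the
# learned coefficients recover the true slope and intercept.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 1))                  # one feature
y = 3 * X.ravel() + 7 + rng.normal(0, 0.5, size=100)   # true slope 3, intercept 7, small noise

model = LinearRegression().fit(X, y)
print(f"slope ≈ {model.coef_[0]:.2f}, intercept ≈ {model.intercept_:.2f}")
# Should recover values close to 3 and 7
```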
Intermediate Path
1. Full EDA workflow
2. Feature engineering
3. Classification & regression algorithms
4. Model evaluation & cross-validation
5. Hyperparameter tuning
6. SQL for data (see [[MySQL]] / [[PostgreSQL]])
7. Kaggle competitions (top 25%)
Advanced Path
1. Deep learning (TensorFlow / PyTorch)
2. NLP & Transformers (Hugging Face)
3. Computer Vision (CNNs, YOLO)
4. Time series & anomaly detection
5. Big Data (PySpark)
6. MLOps (MLflow, FastAPI, Docker)
7. Research papers & custom architectures