Data Mining and Big Data Analytics Overview
1. Data Cleaning and Processing
1.1 Data Cleaning
Process of detecting, correcting, and removing errors, inconsistencies, and inaccuracies in data to improve its quality for analysis.
Common Issues
- Missing values – caused by data entry errors or incomplete collection
- Noise – random error or variance in a dataset
- Inconsistencies – conflicting data (e.g., different formats, duplicates)
- Outliers – extreme values that deviate from others
Handling Missing Data
- Ignore the record (if few and random)
- Fill manually (if possible)
- Global constant (e.g., “Unknown”)
- Attribute mean/median/mode
- Attribute mean for class (conditional mean)
- Predictive models (e.g., regression, k-NN, ML-based imputation)
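As a rough illustration, here is a minimal imputation sketch in Python (assuming pandas is installed); the column names and values are hypothetical.

```python
# Minimal missing-value imputation sketch; "age" and "city" are made-up columns.
import pandas as pd

df = pd.DataFrame({"age": [25, None, 40, 35], "city": ["NY", "LA", None, "NY"]})

df["age"] = df["age"].fillna(df["age"].mean())   # attribute mean for numeric data
df["city"] = df["city"].fillna("Unknown")        # global constant for categorical data
print(df)
```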
Noise Handling
- Binning – sort values and smooth via bin means/medians
- Regression – fit curve and smooth data
- Clustering – detect and remove outliers
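A small sketch of smoothing by bin means, assuming pandas is available; the values and the choice of three equal-frequency bins are illustrative.

```python
# Equal-frequency binning, then replace each value with its bin mean.
import pandas as pd

values = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])
bins = pd.qcut(values, q=3)                        # three equal-frequency bins
smoothed = values.groupby(bins).transform("mean")  # smooth via bin means
print(smoothed.tolist())
```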
Inconsistency Resolution
- Standardization – uniform formats (e.g., date, units)
- Deduplication – remove duplicate records
- Cross-validation – use external references for correction
1.2 Data Processing
Transforming raw data into a suitable format for mining or analysis.
Steps
1. Data Integration – combining data from multiple sources
   - Handle schema conflicts, redundancy, entity identification
2. Data Transformation – convert data into suitable forms
   - Normalization: scale values (min-max, z-score, decimal scaling)
   - Aggregation: combine data (e.g., sum, avg)
   - Generalization: replace detailed data with higher-level concepts
   - Feature construction: create new attributes from existing ones
3. Data Reduction – reduce data volume but retain integrity
   - Dimensionality reduction (PCA, LDA)
   - Numerosity reduction (sampling, clustering)
   - Data compression
4. Discretization – convert continuous data into categorical intervals
   - Equal-width or equal-frequency binning (see the sketch after this list)
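The sketch referenced above, using only numpy; the sample values and the choice of three bins are illustrative.

```python
# Min-max and z-score normalization plus equal-width discretization.
import numpy as np

x = np.array([10.0, 20.0, 30.0, 50.0, 90.0])

min_max = (x - x.min()) / (x.max() - x.min())   # scales values into [0, 1]
z_score = (x - x.mean()) / x.std()              # zero mean, unit variance

# Equal-width discretization into 3 intervals
edges = np.linspace(x.min(), x.max(), 4)        # 3 bins need 4 edges
labels = np.digitize(x, edges[1:-1])            # bin index 0..2 for each value
print(min_max, z_score, labels)
```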
Objective
Improve data quality, consistency, and efficiency of mining algorithms.
2. Association and Correlation Analysis
2.1 Association Analysis
Process of discovering interesting relationships or patterns (associations) among items in large datasets.
Used mainly in market basket analysis (e.g., {Milk → Bread}).
Basic Concepts
- Itemset: Set of items (e.g., {Milk, Bread})
- Support (s): Frequency of itemset occurrence
  Support(A→B) = P(A ∪ B), i.e., the fraction of transactions containing both A and B
- Confidence (c): Strength of implication
  Confidence(A→B) = P(B | A) = Support(A ∪ B) / Support(A)
- Lift: Measure of dependence between A and B (departure from independence)
  Lift(A→B) = P(A ∪ B) / (P(A) × P(B))
  - Lift > 1: Positive correlation (A and B occur together more often than expected)
  - Lift = 1: A and B are independent
  - Lift < 1: Negative correlation
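To make the three measures concrete, here is a small hand computation in Python; the toy transaction list is invented for illustration.

```python
# Support, confidence, and lift for the rule {Milk} -> {Bread}.
transactions = [
    {"Milk", "Bread"}, {"Milk"}, {"Bread", "Butter"},
    {"Milk", "Bread", "Butter"}, {"Bread"},
]
n = len(transactions)

p_a = sum("Milk" in t for t in transactions) / n               # P(A)  = 0.6
p_b = sum("Bread" in t for t in transactions) / n              # P(B)  = 0.8
p_ab = sum({"Milk", "Bread"} <= t for t in transactions) / n   # P(A,B) = 0.4

support = p_ab                 # 0.4
confidence = p_ab / p_a        # 0.4 / 0.6 ≈ 0.67
lift = p_ab / (p_a * p_b)      # 0.4 / 0.48 ≈ 0.83 (slight negative correlation)
print(support, confidence, lift)
```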
Steps in Association Rule Mining
- Find Frequent Itemsets (using support threshold)
- Generate Association Rules (using confidence threshold)
- Prune Rules based on interestingness measures (lift, conviction, leverage)
Algorithms
- Apriori Algorithm: Uses iterative candidate generation and pruning
- FP-Growth: Builds a compact tree (FP-tree) to avoid candidate generation
- Eclat: Uses depth-first search with transaction ID sets
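A compact sketch of the Apriori idea in plain Python: frequent 1-itemsets are found first, and candidate 2-itemsets are built only from them (the Apriori property). The transactions and the 0.4 support threshold are illustrative.

```python
# Two levels of Apriori-style frequent itemset mining on toy data.
from itertools import combinations

transactions = [
    {"Milk", "Bread"}, {"Milk", "Butter"}, {"Bread", "Butter"},
    {"Milk", "Bread", "Butter"}, {"Bread"},
]
min_support = 0.4
n = len(transactions)

def support(itemset):
    return sum(itemset <= t for t in transactions) / n

items = {i for t in transactions for i in t}
L1 = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]

# Candidates at level 2 come only from frequent 1-itemsets, then get pruned.
candidates = [a | b for a, b in combinations(L1, 2)]
L2 = [c for c in candidates if support(c) >= min_support]
print(L2)
```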
2.2 Correlation Analysis
Determines how strongly pairs of attributes are related.
Measures
- Pearson Correlation Coefficient (r):
  r = cov(X, Y) / (σ_X × σ_Y)
  - r = 1 → Perfect positive
  - r = -1 → Perfect negative
  - r = 0 → No linear correlation
- Spearman Rank Correlation: Non-parametric; uses ranks instead of values
- Chi-Square Test: Tests independence between categorical variables
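A quick check of the Pearson formula with numpy; the x/y values are made up, and computing r from the definition should match numpy's built-in corrcoef.

```python
# Pearson r from its definition, cross-checked against np.corrcoef.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

cov = np.mean((x - x.mean()) * (y - y.mean()))  # covariance of X and Y
r = cov / (x.std() * y.std())                   # r = cov(X,Y) / (sigma_X * sigma_Y)
print(r, np.corrcoef(x, y)[0, 1])               # the two values agree
```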
Difference Between Association and Correlation
| Aspect | Association | Correlation |
|---|---|---|
| Purpose | Find co-occurrence patterns | Measure strength & direction of linear relationship |
| Data Type | Categorical/transactional | Numerical (or ranked) |
| Output | Rules (A→B) | Coefficient (r) |
| Dependency | Asymmetric (A→B) | Symmetric (X↔Y) |
Objective
- Association: Discover hidden relationships between items.
- Correlation: Quantify strength and direction of relationships between variables.
3. Clustering Algorithms and Cluster Analysis
3.1 Cluster Analysis
Unsupervised learning technique used to group similar data objects into clusters such that:
- Intra-cluster similarity is high
- Inter-cluster similarity is low
Applications
Market segmentation, image analysis, anomaly detection, document classification, etc.
Types of Clustering
- Partitioning Methods – divide data into k clusters
  - Examples: k-Means, k-Medoids
- Hierarchical Methods – build a hierarchy of clusters
  - Examples: Agglomerative, Divisive
- Density-Based Methods – clusters are dense regions separated by sparse areas
  - Examples: DBSCAN, OPTICS
- Grid-Based Methods – quantize the data space into finite cells
  - Examples: STING, CLIQUE
- Model-Based Methods – assume data is generated from a model
  - Examples: EM Algorithm, Gaussian Mixture Models (GMM)
3.2 Major Clustering Algorithms
1. k-Means Clustering
Idea: Partition data into k clusters by minimizing within-cluster variance.
Steps:
- Choose k initial centroids
- Assign each point to nearest centroid
- Recompute centroids as mean of assigned points
- Repeat until centroids stabilize
Pros: Simple, fast
Cons: Sensitive to initial points, assumes spherical clusters, needs k
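A bare-bones numpy version of these four steps on synthetic 2-D data with k = 2; in practice one would typically use a library implementation such as sklearn.cluster.KMeans.

```python
# Minimal k-means sketch (no empty-cluster handling; fine for this toy data).
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])
k = 2
centroids = X[rng.choice(len(X), k, replace=False)]  # step 1: initial centroids

for _ in range(100):
    # step 2: assign each point to its nearest centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # step 3: recompute centroids as the mean of assigned points
    new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    if np.allclose(new_centroids, centroids):        # step 4: stop when stable
        break
    centroids = new_centroids
print(centroids)
```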
2. k-Medoids (PAM)
- Similar to k-Means but uses actual data points (medoids) instead of means, reducing sensitivity to outliers.
- Pros: Robust to noise/outliers
- Cons: Higher computational cost
3. Hierarchical Clustering
Builds hierarchy of clusters without preset k.
- Agglomerative (bottom-up): Start with singletons → merge closest clusters
- Divisive (top-down): Start with one cluster → split recursively
Linkage Methods:
- Single (min distance)
- Complete (max distance)
- Average (mean distance)
Produces: Dendrogram (tree structure)
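A short SciPy sketch (assuming scipy is installed): linkage builds the agglomerative merge history that a dendrogram visualizes, and fcluster cuts the tree into a chosen number of clusters. The points and the choice of average linkage are illustrative.

```python
# Agglomerative clustering via SciPy's linkage matrix.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 1], [1.2, 1.1], [5, 5], [5.1, 4.9], [9, 9]])
Z = linkage(X, method="average")                 # average-linkage merge history
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
print(labels)
```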
4. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
Idea: Clusters = dense regions separated by low-density areas
Parameters:
- ε (epsilon): radius of neighborhood
- MinPts: minimum points to form dense region
Pros: Detects arbitrary-shaped clusters, handles noise
Cons: Sensitive to ε and MinPts selection
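A minimal scikit-learn example (assuming it is installed); eps and min_samples correspond to ε and MinPts above, and the toy points are arranged so one of them ends up as noise (label -1).

```python
# DBSCAN on toy 2-D points: two dense groups plus one isolated point.
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 1], [1.1, 1], [1, 1.2], [8, 8], [8.1, 8], [20, 20]])
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)  # e.g., [0 0 0 1 1 -1]: two clusters and one noise point
```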
5. Gaussian Mixture Model (GMM)
- Idea: Assumes data generated from mixture of Gaussian distributions
- Uses: Expectation-Maximization (EM) algorithm to estimate parameters
- Pros: Soft clustering (probabilistic)
- Cons: Requires assumption of Gaussian shape
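A small scikit-learn sketch of soft clustering (library assumed installed): fitting runs EM internally, and predict_proba exposes the probabilistic memberships. The data is synthetic.

```python
# Gaussian mixture with 2 components; EM estimates means and covariances.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.predict_proba(X[:3]).round(3))  # soft assignments for three points
```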
3.3 Cluster Evaluation
- Internal Metrics: Measure compactness/separation
  - Silhouette Coefficient, Davies–Bouldin Index
- External Metrics: Compare with known labels
  - Rand Index, Purity, F-measure
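For instance, an internal evaluation with the silhouette coefficient via scikit-learn (assumed installed) on synthetic, well-separated blobs:

```python
# Silhouette coefficient: near +1 means compact, well-separated clusters.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(6, 1, (30, 2))])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))  # close to 1 for these separated blobs
```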
Objective
Identify natural groupings in data for pattern discovery, data summarization, and insight extraction.
4. Classification
Definition
Supervised learning technique that assigns data items into predefined categories (classes) based on input features.
Used for predictive modeling (e.g., spam detection, disease diagnosis).
Basic Process
- Data Preparation: Clean, transform, and split data into training and testing sets
- Model Training: Learn mapping function from input features → class label
- Model Testing: Evaluate on unseen data
- Prediction: Assign class labels to new instances
Key Concepts
- Training Set: Data used to build the model
- Testing Set: Data used to validate the model
- Confusion Matrix: Shows actual vs predicted classes
- Accuracy: (TP + TN) / (TP + TN + FP + FN)
- Precision, Recall, F1-score: Performance metrics for imbalanced datasets
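Plugging made-up confusion-matrix counts into these formulas:

```python
# Metrics from raw confusion-matrix counts (numbers are invented).
TP, TN, FP, FN = 40, 45, 5, 10

accuracy = (TP + TN) / (TP + TN + FP + FN)          # 85/100 = 0.85
precision = TP / (TP + FP)                          # 40/45 ≈ 0.89
recall = TP / (TP + FN)                             # 40/50 = 0.80
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean ≈ 0.84
print(accuracy, precision, recall, f1)
```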
4.1 Major Classification Algorithms
1. Decision Tree
- Idea: Recursively splits data based on attribute values to maximize class purity.
- Splitting Measures: Information Gain, Gain Ratio, Gini Index
- Pros: Easy interpretation
- Cons: Prone to overfitting
2. Naïve Bayes
- Idea: Applies Bayes’ theorem assuming independent features.
  P(C|X) = [P(X|C) × P(C)] / P(X)
- Pros: Fast, works well for text/data classification
- Cons: Independence assumption often unrealistic
3. k-Nearest Neighbors (k-NN)
- Idea: Assigns class based on majority vote of k nearest neighbors.
- Pros: Simple, non-parametric
- Cons: Slow for large datasets, sensitive to noise
4. Support Vector Machine (SVM)
- Idea: Finds optimal hyperplane that maximizes margin between classes.
- Pros: Effective in high-dimensional space
- Cons: Poor performance on noisy or overlapping data
5. Neural Networks
- Idea: Multiple interconnected layers learn complex nonlinear decision boundaries.
- Pros: High accuracy for complex data (images, speech)
- Cons: Requires large data and computation
6. Random Forest
- Idea: Ensemble of multiple decision trees; prediction via majority voting.
- Pros: Reduces overfitting, robust
- Cons: Less interpretable
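A short scikit-learn sketch (assumed installed) training two of these classifiers on the bundled Iris dataset; the 70/30 split and hyperparameters are illustrative, not prescriptive.

```python
# Decision tree vs. random forest on the same train/test split.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

for model in (DecisionTreeClassifier(random_state=0),
              RandomForestClassifier(n_estimators=100, random_state=0)):
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))  # test accuracy
```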
Model Evaluation Metrics
- Accuracy – overall correctness
- Precision & Recall – useful for class imbalance
- F1-Score – harmonic mean of precision and recall
- ROC Curve & AUC – trade-off between TPR and FPR
Objective
Build a predictive model that accurately assigns unseen data to correct predefined categories.
5. Introduction to Big Data
Definition
Big Data refers to extremely large and complex datasets that traditional data processing tools cannot handle efficiently in terms of storage, processing, and analysis.
It involves data generated from multiple sources like social media, sensors, IoT devices, logs, and transactions.
Characteristics (5 V’s of Big Data)
- Volume – Huge amount of data (terabytes to petabytes)
- Velocity – Rapid rate of data generation and processing
- Variety – Different data types: structured, semi-structured, unstructured
- Veracity – Data uncertainty, inconsistency, and reliability issues
- Value – Extracting meaningful insights and business value
5.1 Types of Big Data
- Structured: Organized data in fixed schema (e.g., SQL tables)
- Semi-Structured: Partial organization (e.g., JSON, XML)
- Unstructured: No predefined format (e.g., text, images, videos)
5.2 Sources of Big Data
- Social media platforms (Twitter, Facebook)
- IoT sensors and smart devices
- E-commerce and transaction systems
- Scientific research instruments
- Server logs, clickstreams, mobile apps
5.3 Big Data Architecture Components
- Data Sources – Origin of raw data
- Data Ingestion – Collecting and importing data (e.g., Kafka, Flume)
- Storage Layer – Distributed storage (e.g., HDFS, NoSQL databases)
- Processing Layer – Batch/stream processing (e.g., Hadoop MapReduce, Spark)
- Analytics Layer – Machine learning, data mining, visualization
- Visualization Layer – Tools like Tableau, Power BI for insights
5.4 Technologies for Big Data
- Storage: HDFS, HBase, Cassandra
- Processing: Hadoop, Spark, Flink
- Querying: Hive, Pig, Impala
- Streaming: Kafka, Storm
- Machine Learning: MLlib, Mahout
Challenges
- Data quality and cleaning
- Scalability and performance
- Security and privacy
- Data integration and interoperability
- Real-time analytics requirements
Objective
Efficiently store, process, and analyze massive, heterogeneous, and fast-changing datasets to derive actionable insights and support informed decisions.
6. Introduction to Big Data Applications
Definition
Big Data Applications use large-scale data analytics techniques to extract meaningful insights, patterns, and predictions from massive datasets across various domains.
6.1 Key Areas of Application
1. Business and Marketing
- Customer behavior analysis and segmentation
- Personalized recommendations (Amazon, Netflix)
- Dynamic pricing and targeted advertising
- Supply chain and inventory optimization
2. Finance and Banking
- Fraud detection using real-time analytics
- Credit scoring and risk management
- Algorithmic and high-frequency trading
- Customer churn and sentiment analysis
3. Healthcare
- Predictive diagnosis and patient monitoring
- Drug discovery and genomics research
- Hospital resource optimization
- Disease outbreak prediction (epidemiology)
4. Government and Public Sector
- Smart cities and infrastructure management
- Crime prediction and surveillance analytics
- Policy planning and social welfare optimization
- Disaster management and response systems
5. Telecommunication
- Network traffic analysis and optimization
- Churn prediction and customer profiling
- Fault detection and predictive maintenance
6. Energy and Utilities
- Smart grid monitoring and load forecasting
- Predictive maintenance of power equipment
- Optimization of renewable energy generation
7. Education
- Adaptive learning and personalized content
- Student performance prediction and dropout analysis
- Academic research using large-scale data
8. Transportation and Logistics
- Route and traffic optimization (Google Maps, Uber)
- Fleet management and predictive maintenance
- Autonomous vehicle analytics
9. Social Media and Web
- Sentiment analysis and trend prediction
- User behavior modeling
- Influencer and community detection
Benefits
- Enhanced decision-making
- Real-time insights and automation
- Cost reduction through optimization
- Innovation and competitive advantage
Objective
Leverage massive and diverse datasets to enable data-driven decision-making, improve efficiency, and create intelligent, predictive, and adaptive systems across industries.
7. Introduction to Big Data Applications using Machine Learning
Definition
Integration of Machine Learning (ML) with Big Data enables automatic pattern discovery, prediction, and decision-making from massive and complex datasets that cannot be processed by traditional algorithms.
7.1 Relationship Between Big Data and Machine Learning
- Big Data provides large-scale, high-dimensional, and diverse datasets.
- Machine Learning provides algorithms that learn from this data to predict or classify outcomes.
- Together they form the foundation of AI-driven analytics and automation.
Key Components
- Data Collection and Storage: Using Hadoop, Spark, HDFS, NoSQL
- Data Preprocessing: Cleaning, feature selection, and dimensionality reduction
- Model Training: Using ML algorithms (supervised, unsupervised, reinforcement)
- Model Evaluation: Performance metrics and validation
- Deployment and Real-Time Prediction: Integration with applications and streaming data
7.2 Common Machine Learning Techniques in Big Data
- Supervised Learning: Classification, regression (e.g., fraud detection, sentiment analysis)
- Unsupervised Learning: Clustering, association (e.g., market segmentation, anomaly detection)
- Reinforcement Learning: Decision optimization in dynamic environments
- Deep Learning: Handling unstructured data like images, audio, text (CNNs, RNNs)
7.3 Big Data Platforms Supporting ML
- Apache Spark MLlib – scalable machine learning library
- Hadoop Mahout – distributed ML algorithms
- TensorFlow on Big Data – large-scale deep learning
- H2O.ai, Databricks, AWS SageMaker – integrated ML and Big Data environments
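As a rough sketch of what "ML on Big Data" looks like in code, here is a minimal Spark MLlib pipeline (assumes a working PySpark installation); the feature columns and toy rows are invented.

```python
# Assemble feature columns into a vector, then fit a logistic regression.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()
df = spark.createDataFrame(
    [(0.0, 1.0, 0), (1.0, 3.0, 1), (0.5, 1.2, 0), (1.4, 3.3, 1)],
    ["f1", "f2", "label"],
)
features = VectorAssembler(inputCols=["f1", "f2"], outputCol="features").transform(df)
model = LogisticRegression(featuresCol="features", labelCol="label").fit(features)
print(model.coefficients)
```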
7.4 Major Applications
1. Predictive Analytics
- Forecasting demand, stock prices, or failures using regression and neural networks.
2. Fraud Detection
- Analyzing transaction patterns using anomaly detection and classification models.
3. Recommendation Systems
- Using collaborative filtering and deep learning to recommend products or media.
4. Healthcare Analytics
- Disease prediction, medical image classification, and personalized treatment models.
5. Real-Time Analytics
- Streaming ML for IoT data, social media trends, and financial markets using Spark Streaming or Flink.
6. Natural Language Processing
- Text mining, sentiment analysis, and chatbot training on large textual datasets.
Benefits
- Scalability and automation in data-driven decisions
- Real-time insights from streaming and batch data
- Improved accuracy with continuous learning models
Objective
Combine Big Data infrastructure with Machine Learning intelligence to build scalable, adaptive, and predictive systems capable of transforming massive datasets into actionable knowledge.
8. Introduction to Analytics Engines such as Spark and Hadoop MapReduce
Definition
Analytics engines are distributed computing frameworks designed to store, process, and analyze massive datasets efficiently across clusters of machines.
They provide the backbone for Big Data analytics by enabling scalable, parallel, and fault-tolerant computation.
8.1 Hadoop MapReduce
Overview
A programming model and processing engine for large-scale data analysis in the Hadoop ecosystem.
Processes data in batch mode using the Map and Reduce functions over distributed nodes.
Architecture
- HDFS (Hadoop Distributed File System): Storage layer that splits data into blocks and distributes them across nodes.
- YARN (Yet Another Resource Negotiator): Resource management and job scheduling layer.
- MapReduce Engine: Computation layer for parallel data processing.
Working
- Map Phase: Input data is split and processed into key-value pairs.
- Shuffle and Sort: Intermediate data grouped by keys.
- Reduce Phase: Aggregates or summarizes values with the same key.
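The classic word-count example, simulated on a single machine to mirror the three phases (a real job would be distributed across a Hadoop cluster; the documents are invented):

```python
# Single-machine simulation of Map -> Shuffle & Sort -> Reduce.
from collections import defaultdict

docs = ["big data tools", "big data analytics"]

# Map phase: emit (word, 1) key-value pairs
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle and sort: group intermediate values by key
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce phase: aggregate values with the same key
counts = {word: sum(values) for word, values in grouped.items()}
print(counts)  # {'big': 2, 'data': 2, 'tools': 1, 'analytics': 1}
```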
Pros
- Scalable, reliable, fault-tolerant
- Handles huge datasets on commodity hardware
Cons
- High latency (batch-only)
- Inefficient for iterative or real-time tasks
8.2 Apache Spark
Overview
Fast, in-memory, distributed data processing engine for batch and stream analytics.
Provides APIs in Scala, Python, Java, R.
Core Components
- Spark Core: Task scheduling, memory management, fault recovery
- Spark SQL: Querying structured data using SQL/DataFrames
- Spark Streaming: Real-time stream processing
- MLlib: Machine learning library for scalable algorithms
- GraphX: Graph computation and analytics
Advantages over MapReduce
- In-memory computation (much faster)
- Supports iterative algorithms and streaming
- Unified engine for batch, streaming, and ML workloads
Execution Model
- Data represented as RDDs (Resilient Distributed Datasets)
- Supports lazy evaluation and fault tolerance through lineage
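The same word count expressed on Spark RDDs (assumes PySpark is installed); note that the first three operations are lazy transformations and only collect() triggers execution.

```python
# Word count on RDDs: transformations are lazy until an action runs.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-sketch").getOrCreate()
lines = spark.sparkContext.parallelize(["big data tools", "big data analytics"])

counts = (lines.flatMap(lambda line: line.split())  # transformation (lazy)
               .map(lambda word: (word, 1))         # transformation (lazy)
               .reduceByKey(lambda a, b: a + b))    # transformation (lazy)
print(counts.collect())                             # action: triggers execution
```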
8.3 Other Analytics Engines
Apache Flink
- Stream-first engine for real-time, low-latency analytics
- Handles both batch and event-driven processing
Apache Storm
- Real-time stream processing system
- Suitable for continuous event analytics and monitoring
Presto / Trino
- Distributed SQL query engine for interactive analytics on large datasets
Apache Hive
- SQL-like querying over Hadoop, converts queries to MapReduce or Spark jobs
Comparison
| Feature | Hadoop MapReduce | Apache Spark | Apache Flink |
|---|---|---|---|
| Processing Type | Batch | Batch + Stream | Stream-first |
| Speed | Slow (disk-based) | Fast (in-memory) | Real-time |
| Ease of Use | Complex | Simple APIs | Moderate |
| ML Support | External (Mahout) | Built-in (MLlib) | Limited |
Objective
Enable large-scale, parallel, and fault-tolerant analytics on massive datasets by providing scalable computation frameworks that power modern Big Data ecosystems.