Software Engineering Overview

1. Introduction to Analytics Engines (Spark, Hadoop MapReduce, etc.)

Definition

Analytics engines are distributed computing frameworks designed to store, process, and analyze massive datasets efficiently across clusters of machines.
They provide the backbone for Big Data analytics by enabling scalable, parallel, and fault-tolerant computation.

1.1 Hadoop MapReduce

Overview

A programming model and processing engine for large-scale data analysis in the Hadoop ecosystem.
Processes data in batch mode using the Map and Reduce functions over distributed nodes.

Architecture

HDFS (Hadoop Distributed File System): Storage layer that splits data into blocks and distributes them across nodes.
YARN (Yet Another Resource Negotiator): Resource management and job scheduling layer.
MapReduce Engine: Computation layer for parallel data processing.
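
The block splitting done by the storage layer can be illustrated with simple arithmetic. This is a minimal sketch, not HDFS code: the function name hdfs_block_plan is hypothetical, and the 128 MB block size and replication factor of 3 are assumed values (common defaults, but configurable per cluster).

```python
import math


def hdfs_block_plan(file_size_mb: int, block_size_mb: int = 128, replication: int = 3):
    """Return (logical blocks, total stored replicas) for one file.

    block_size_mb and replication are assumed values for illustration.
    """
    # The last block may be only partly filled, so round up.
    blocks = math.ceil(file_size_mb / block_size_mb)
    return blocks, blocks * replication


blocks, replicas = hdfs_block_plan(1024)
print(blocks, replicas)  # 8 24
```

Under these assumptions a 1 GB file becomes 8 blocks, and the cluster stores 24 block replicas spread across nodes, which is what lets the file survive the loss of a single machine.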

Working

Map Phase: Input data is split and processed into key-value pairs.
Shuffle and Sort: Intermediate data grouped by keys.
Reduce Phase: Aggregates or summarizes values with the same key.
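
The three phases above can be sketched in a single process with a word-count example. This is a plain Python illustration of the programming model, not the Hadoop API; the function names are hypothetical, and a real job would run the map and reduce functions in parallel on many nodes.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(line: str) -> Iterator[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_and_sort(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle and sort: group all values by key, with keys in sorted order."""
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: aggregate all values that share the same key."""
    return key, sum(values)


def word_count(lines: list[str]) -> dict[str, int]:
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle_and_sort(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


counts = word_count(["big data big cluster", "data node"])
print(counts)  # {'big': 2, 'cluster': 1, 'data': 2, 'node': 1}
```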

Pros

Scalable, reliable, fault-tolerant
Handles huge datasets on commodity hardware

Cons

High latency (batch-only)
Inefficient for iterative or real-time tasks

1.2 Apache Spark

Overview

Fast, in-memory, distributed data processing engine for batch and stream analytics.
Provides APIs in Scala, Python, Java, R.

Core Components

Spark Core: Task scheduling, memory management, fault recovery
Spark SQL: Querying structured data using SQL/DataFrames
Spark Streaming: Near-real-time stream processing (data is processed in small micro-batches)
MLlib: Machine learning library for scalable algorithms
GraphX: Graph computation and analytics

Advantages over MapReduce

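In-memory computation: intermediate results can stay in memory instead of being written to disk between stages, which is typically much faster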
Supports iterative algorithms and streaming
Unified engine for batch, streaming, and ML workloads

Execution Model

Data represented as RDDs (Resilient Distributed Datasets)
Supports lazy evaluation and fault tolerance through lineage
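
Lazy evaluation and lineage can be shown with a toy class. This is a sketch in plain Python, not the Spark API: LazyDataset is a hypothetical stand-in for an RDD that records transformations and only runs them when an action (collect) is called. Replaying the recorded lineage from the source data is the same idea Spark uses to rebuild a lost partition.

```python
from typing import Any, Callable


class LazyDataset:
    """Toy stand-in for an RDD: source data plus a lineage of recorded steps."""

    def __init__(self, source: list[Any], lineage: tuple = ()):
        self._source = source
        self._lineage = lineage  # recorded (kind, function) transformations

    def map(self, fn: Callable[[Any], Any]) -> "LazyDataset":
        # Transformation: only recorded, nothing is computed yet.
        return LazyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, fn: Callable[[Any], bool]) -> "LazyDataset":
        # Transformation: only recorded, nothing is computed yet.
        return LazyDataset(self._source, self._lineage + (("filter", fn),))

    def collect(self) -> list[Any]:
        # Action: replay the whole lineage over the source data.
        data = list(self._source)
        for kind, fn in self._lineage:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data


calls = []


def square(x):
    calls.append(x)  # lets us observe when the work actually happens
    return x * x


squares = LazyDataset([1, 2, 3, 4]).map(square).filter(lambda x: x % 2 == 0)
print(len(calls))  # 0: transformations were recorded, not executed
result = squares.collect()
print(result)      # [4, 16]
```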

1.3 Other Analytics Engines

Apache Flink

Stream-first engine for real-time, low-latency analytics
Handles both batch and event-driven processing

Apache Storm

Real-time stream processing system
Suitable for continuous event analytics and monitoring

Presto / Trino

Distributed SQL query engine for interactive analytics on large datasets

Apache Hive

SQL-like querying (HiveQL) over Hadoop; compiles queries into MapReduce, Tez, or Spark jobs

Comparison

Feature	Hadoop MapReduce	Apache Spark	Apache Flink
Processing Type	Batch	Batch + Stream	Stream-first
Speed	Slower (disk-based between stages)	Fast (in-memory)	Fast (low-latency streaming)
Ease of Use	Complex	Simple APIs	Moderate
ML Support	External (Mahout)	Built-in (MLlib)	Limited

Objective

Enable large-scale, parallel, and fault-tolerant analytics on massive datasets by providing scalable computation frameworks that power modern Big Data ecosystems.

2. Software Requirement Gathering

Definition:
Process of collecting and defining what the software must do and the constraints under which it must operate.

Objectives:

Understand user needs and expectations
Define functional and non-functional requirements clearly
Avoid ambiguity, inconsistency, and incompleteness

Types of Requirements:

Functional Requirements: Describe what the system should do (features, operations, inputs/outputs).
Non-Functional Requirements: Define system attributes like performance, reliability, scalability, and security.
Domain Requirements: Specific to the industry or environment in which software will operate.

Requirement Gathering Techniques:

Interviews: One-to-one discussions with stakeholders.
Questionnaires/Surveys: Useful for large user groups.
Brainstorming: Collective idea generation from stakeholders and experts.
Workshops: Interactive sessions to resolve conflicts and finalize priorities.
Observation: Studying users’ actual work environment.
Document Analysis: Reviewing existing system documents and reports.
Prototyping: Building a working model to clarify unclear requirements.

Outputs of Requirement Gathering:

SRS (Software Requirements Specification): Formal document containing all requirements.
Use Cases / User Stories: Describe user interactions with the system.
Requirement Traceability Matrix (RTM): Maps each requirement to its corresponding design and test cases.
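
A traceability matrix can be kept as a simple mapping and checked for gaps. The sketch below assumes hypothetical requirement, design, and test-case identifiers; real projects usually hold this data in a requirements or test-management tool.

```python
# Hypothetical requirement, design, and test identifiers for illustration.
rtm = {
    "REQ-1": {"design": ["HLD-Login"], "tests": ["TC-01", "TC-02"]},
    "REQ-2": {"design": ["HLD-Report"], "tests": []},
    "REQ-3": {"design": [], "tests": ["TC-07"]},
}


def untraced(matrix: dict, link: str) -> list[str]:
    """Return requirement IDs that have no entry for the given link type."""
    return sorted(req for req, links in matrix.items() if not links[link])


missing_tests = untraced(rtm, "tests")
missing_design = untraced(rtm, "design")
print(missing_tests)   # ['REQ-2']
print(missing_design)  # ['REQ-3']
```

A requirement with no linked test case cannot be verified, so a check like this supports the "Verifiable" quality of good requirements.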

Key Qualities of Good Requirements:

Complete, Consistent, Unambiguous, Verifiable, and Feasible

3. System Design Principles

Definition:
System design defines the architecture, components, interfaces, and data for a system to satisfy specified requirements. Design principles ensure software is scalable, maintainable, and efficient.

Major Design Principles:

Modularity:
System is divided into smaller independent modules for easier development and maintenance.
Abstraction:
Hides implementation details; focuses on what a module does, not how.
Encapsulation:
Bundles data and related operations; protects data from unauthorized access.
Separation of Concerns:
Each module handles a specific functionality or responsibility.
Cohesion:
Measures how strongly related the functions within a module are.
- High Cohesion is desirable.
Coupling:
Measures interdependence among modules.
- Low Coupling is desirable.
Information Hiding:
Internal details of modules are hidden; only necessary interfaces are exposed.
Reusability:
Design components to be reusable across different systems.
Scalability:
System should handle increased workload or data volume without unacceptable performance degradation, ideally by adding resources rather than redesigning.
Flexibility & Maintainability:
Design should allow easy updates, enhancements, and error corrections.
Fault Tolerance:
System should continue to operate correctly even in presence of faults.
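
Several of these principles (abstraction, encapsulation, high cohesion, low coupling) can be seen together in a small example. This is a minimal sketch with hypothetical class names (MessageSender, InMemorySender, OrderService), not a prescribed design.

```python
from typing import Protocol


class MessageSender(Protocol):
    """Abstraction: callers depend on what a sender does, not how it does it."""

    def send(self, recipient: str, text: str) -> None: ...


class InMemorySender:
    """One concrete module; how messages are stored is an internal detail."""

    def __init__(self) -> None:
        self._outbox: list[tuple[str, str]] = []  # encapsulated state

    def send(self, recipient: str, text: str) -> None:
        self._outbox.append((recipient, text))

    def sent_count(self) -> int:
        return len(self._outbox)


class OrderService:
    """High cohesion: only order logic. Low coupling: the sender is injected."""

    def __init__(self, sender: MessageSender) -> None:
        self._sender = sender

    def place_order(self, customer: str, item: str) -> str:
        confirmation = f"Order confirmed: {item}"
        self._sender.send(customer, confirmation)
        return confirmation


sender = InMemorySender()
service = OrderService(sender)
message = service.place_order("alice", "keyboard")
print(message, sender.sent_count())  # Order confirmed: keyboard 1
```

Because OrderService only knows the MessageSender interface, the in-memory sender could be swapped for an email or queue-based one without changing order logic.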

Design Levels:

High-Level Design (HLD): Defines system architecture, modules, and their relationships.
Low-Level Design (LLD): Specifies logic of individual components, algorithms, and data structures.

4. Software Testing and Quality Management

Definition:
Software Testing is the process of evaluating software to verify that it meets specified requirements and to find defects. Testing can show that defects are present but cannot prove that software is defect-free.
Quality Management ensures that software development and maintenance processes produce high-quality products.

Objectives:

Detect defects early
Ensure software reliability, performance, and correctness
Validate that system meets user expectations

Levels of Testing:

Unit Testing: Tests individual components or modules.
Integration Testing: Tests interactions between integrated modules.
System Testing: Tests the complete system as a whole.
Acceptance Testing: Conducted by users to verify the system meets business needs.
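
Unit testing, the lowest level above, can be sketched with Python's standard unittest module. The function apply_discount is a hypothetical unit under test, chosen only for illustration.

```python
import unittest


def apply_discount(price: float, percent: float) -> float:
    """Hypothetical unit under test: price after a percentage discount."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)


class ApplyDiscountTest(unittest.TestCase):
    def test_typical_discount(self):
        self.assertEqual(apply_discount(200.0, 25), 150.0)

    def test_zero_discount_keeps_price(self):
        self.assertEqual(apply_discount(80.0, 0), 80.0)

    def test_invalid_percent_is_rejected(self):
        with self.assertRaises(ValueError):
            apply_discount(80.0, 120)


# Run the test case programmatically and keep the result object.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ApplyDiscountTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())
```

Each test checks one behavior of one function in isolation; integration and system tests would instead exercise several modules together.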

Types of Testing:

Functional Testing: Verifies functionalities as per requirements.
Non-Functional Testing: Checks performance, usability, reliability, security, etc.
Regression Testing: Ensures new changes don’t affect existing functionality.
Smoke Testing: Initial quick testing to check basic stability.
Alpha & Beta Testing: Pre-release testing by internal users and real users respectively.

Software Quality Management (SQM):
Ensures that software processes and products conform to defined quality standards.

Components of SQM:

Quality Assurance (QA): Process-oriented; ensures standards and procedures are followed.
Quality Control (QC): Product-oriented; identifies defects in the final product.
Quality Planning: Defines quality goals, metrics, and required activities.

Quality Standards and Models:

ISO 9001: Focuses on quality management systems.
CMMI (Capability Maturity Model Integration): Measures process maturity levels.
Six Sigma: Reduces defects through statistical analysis.

Key Metrics:

Defect density
Mean Time to Failure (MTTF)
Test coverage
Customer satisfaction index
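
The first three metrics have simple, commonly used formulas: defect density = defects / size (here in KLOC), MTTF = total operating time / number of failures, and test coverage = covered items / total items. The sketch below uses hypothetical function names and made-up sample figures.

```python
def defect_density(defects: int, size_kloc: float) -> float:
    """Defects per thousand lines of code (KLOC)."""
    return defects / size_kloc


def mean_time_to_failure(uptimes_hours: list[float]) -> float:
    """Average operating time before a failure, over the observed runs."""
    return sum(uptimes_hours) / len(uptimes_hours)


def coverage_percent(covered: int, total: int) -> float:
    """Share of items (e.g. statements) exercised by tests, in percent."""
    return 100.0 * covered / total


# Sample figures are invented for illustration only.
density = defect_density(30, 12.0)
mttf = mean_time_to_failure([100.0, 140.0, 120.0])
coverage = coverage_percent(850, 1000)
print(density, mttf, coverage)  # 2.5 120.0 85.0
```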

5. Software Project Management

Definition:
Application of knowledge, skills, tools, and techniques to plan, execute, and control software projects to meet requirements within time, cost, and quality constraints.

Objectives:

Deliver software on schedule and within budget
Ensure desired quality and performance
Manage risks, resources, and communication effectively

Main Activities:

Project Planning:
- Define scope, objectives, and deliverables
- Estimate effort, cost, and schedule
- Identify risks and prepare mitigation plans
Project Scheduling:
- Create work breakdown structure (WBS)
- Use tools like Gantt Charts, PERT, and CPM for task sequencing and time estimation
Resource Management:
- Allocate human, financial, and technical resources efficiently
Risk Management:
- Identify, analyze, and control potential project risks
Project Execution and Monitoring:
- Track progress using metrics like effort variance, cost variance, and schedule variance
- Apply Earned Value Analysis (EVA)
Project Closure:
- Final deliverables, documentation, evaluation, and lessons learned
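
The variance metrics and Earned Value Analysis used during monitoring rest on a few standard formulas: CV = EV - AC, SV = EV - PV, CPI = EV / AC, and SPI = EV / PV. The sketch below is one way to compute them; the class name EarnedValue and the sample project figures are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class EarnedValue:
    bac: float              # budget at completion
    planned_percent: float  # work scheduled to be done by now, in percent
    actual_percent: float   # work actually completed by now, in percent
    actual_cost: float      # money spent so far (AC)

    @property
    def pv(self) -> float:
        """Planned value: budgeted cost of the work scheduled."""
        return self.bac * self.planned_percent / 100

    @property
    def ev(self) -> float:
        """Earned value: budgeted cost of the work actually performed."""
        return self.bac * self.actual_percent / 100

    @property
    def cost_variance(self) -> float:
        return self.ev - self.actual_cost

    @property
    def schedule_variance(self) -> float:
        return self.ev - self.pv

    @property
    def cpi(self) -> float:
        return self.ev / self.actual_cost

    @property
    def spi(self) -> float:
        return self.ev / self.pv


# Invented sample project: half the work was planned, 40% is done.
status = EarnedValue(bac=100_000, planned_percent=50,
                     actual_percent=40, actual_cost=50_000)
print(status.cost_variance, status.schedule_variance)  # -10000.0 -10000.0
print(status.cpi, status.spi)                          # 0.8 0.8
```

Negative variances and indices below 1.0 indicate a project that is over budget and behind schedule.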

Project Estimation Techniques:

Algorithmic Models: COCOMO (Constructive Cost Model)
Expert Judgment: Based on experience
Delphi Technique: Consensus-based estimation among experts
Function Point Analysis (FPA): Based on system functionality
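
As an example of an algorithmic model, Basic COCOMO estimates effort as a × KLOC^b person-months and duration as c × effort^d months. The sketch below uses the published Basic COCOMO coefficients for the three project classes; the function name and the 32 KLOC sample size are assumptions, and real estimates would need calibration to local project data.

```python
# Basic COCOMO coefficients (a, b, c, d) per project class.
COEFFICIENTS = {
    "organic": (2.4, 1.05, 2.5, 0.38),
    "semi-detached": (3.0, 1.12, 2.5, 0.35),
    "embedded": (3.6, 1.20, 2.5, 0.32),
}


def basic_cocomo(kloc: float, mode: str = "organic") -> tuple[float, float]:
    """Return (effort in person-months, duration in months)."""
    a, b, c, d = COEFFICIENTS[mode]
    effort = a * kloc ** b
    duration = c * effort ** d
    return effort, duration


effort, duration = basic_cocomo(32, "organic")
print(f"{effort:.0f} person-months over {duration:.0f} months")
```

Dividing effort by duration gives a rough average team size for the project.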

Project Management Tools:
Microsoft Project, Jira, Trello, Asana

Software Project Manager Responsibilities:

Planning and scheduling
Team coordination
Progress tracking and reporting
Ensuring quality and stakeholder satisfaction