Tuesday, February 10, 2026

A Beginner’s Guide to Data Analysis: From NumPy to Statistics


Data analysis is the foundation of data science, machine learning, and AI. Before building models, we must understand data, clean it, analyze patterns, and draw conclusions.

This blog walks you through the entire data analysis pipeline, step by step, using simple language and practical intuition.

1. NumPy Arrays and Functions

What is NumPy?

NumPy (Numerical Python) is a core Python library used for numerical computing. It provides support for fast, efficient arrays and mathematical operations.

NumPy Arrays

A NumPy array is similar to a Python list, but:

  • It stores data of the same type

  • It is faster and more memory-efficient for numerical data

  • It supports vectorized operations

import numpy as np
arr = np.array([1, 2, 3, 4, 5])

Common NumPy Functions

  • np.mean(arr) → Average

  • np.sum(arr) → Sum

  • np.min(arr) / np.max(arr)

  • np.reshape()

  • np.arange(), np.linspace()

Example:

np.mean(arr) # 3.0
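To make the function list above concrete, here is a small sketch that exercises each one. The variable names (doubled, evens, points, grid) are my own and chosen only for illustration.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])

# Vectorized operation: applies to every element without a Python loop
doubled = arr * 2                               # array([ 2,  4,  6,  8, 10])

# Aggregations
total = np.sum(arr)                             # 15
smallest, largest = np.min(arr), np.max(arr)    # 1 and 5

# Creating arrays
evens = np.arange(0, 10, 2)                     # 0, 2, 4, 6, 8 (stop value is excluded)
points = np.linspace(0.0, 1.0, 5)               # 0.0, 0.25, 0.5, 0.75, 1.0 (stop value is included)

# Reshaping: 6 values rearranged into 2 rows and 3 columns
grid = np.arange(6).reshape(2, 3)               # [[0, 1, 2], [3, 4, 5]]
```

Note the difference between the two creation functions: np.arange takes a step size, while np.linspace takes the number of points you want.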

Why NumPy matters:
Many data science libraries, including Pandas and Scikit-learn, are built on NumPy, and others such as TensorFlow accept and return NumPy arrays.


2. Pandas Series and DataFrames

What is Pandas?

Pandas is a Python library used for data manipulation and analysis. It works with labeled data and tabular structures.

Pandas Series

A Series is a one-dimensional labeled array.

import pandas as pd
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

Think of it as:

A column of data with labels


Pandas DataFrame

A DataFrame is a 2D table (rows and columns), similar to an Excel sheet or SQL table.

data = {
    "Name": ["Alice", "Bob"],
    "Age": [25, 30],
}
df = pd.DataFrame(data)

Why DataFrames are powerful:

  • Handle missing values

  • Filter and transform data easily

  • Read from and write to CSV files, Excel files, and SQL databases


3. Common Pandas Functions

Viewing Data

  • df.head() → First 5 rows

  • df.tail() → Last 5 rows

  • df.info() → Structure and data types

  • df.describe() → Statistical summary

Selecting Data

df["Age"]    # Column
df.loc[0]    # Row by index label
df.iloc[0]   # Row by integer position

Filtering

df[df["Age"] > 25]
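The difference between label-based and position-based selection is easier to see when the index is not 0, 1, 2. The sketch below uses a made-up three-row table with a string index (the names and ages are illustrative only).

```python
import pandas as pd

# Hypothetical data; the string index makes label vs. position visible
df = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Carol"], "Age": [25, 30, 35]},
    index=["x", "y", "z"],
)

ages = df["Age"]             # one column, returned as a Series
row_by_label = df.loc["y"]   # row whose index label is "y" (Bob)
row_by_pos = df.iloc[0]      # first row by position (Alice)

# Boolean filtering: keep rows where the condition is True
older = df[df["Age"] > 25]   # Bob and Carol

# Combine conditions with & (and) or | (or), each wrapped in parentheses
around_thirty = df[(df["Age"] > 25) & (df["Age"] < 35)]   # Bob only
```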

Modifying Data

  • df.rename()

  • df.drop()

  • df.sort_values()

  • df.apply()
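Here is a minimal sketch of the four methods listed above. The table, including the throwaway "Temp" column, is invented for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Carol"],
    "Age": [30, 25, 35],
    "Temp": [0, 0, 0],   # hypothetical column we want to get rid of
})

# Each call returns a new object; df itself is left unchanged
renamed = df.rename(columns={"Name": "FullName"})
trimmed = df.drop(columns=["Temp"])
by_age = df.sort_values("Age", ascending=False)   # oldest first

# apply() runs a function on every value of the column
age_group = df["Age"].apply(lambda age: "30+" if age >= 30 else "under 30")
```

Because these methods return new objects by default, assign the result back (for example df = df.drop(columns=["Temp"])) if you want to keep the change.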


4. Saving and Loading Datasets Using Pandas

Loading Data

df = pd.read_csv("data.csv")
df = pd.read_excel("data.xlsx")

Saving Data

df.to_csv("output.csv", index=False)
df.to_excel("output.xlsx", index=False)

Why this matters:
You can move data easily between Python, Excel, databases, and ML pipelines.
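A quick way to convince yourself that nothing is lost is a save-then-load round trip. This sketch writes to a temporary folder so it does not leave files behind; the file name and data are placeholders.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "output.csv")
    # index=False avoids writing the row labels as an extra column
    df.to_csv(path, index=False)
    loaded = pd.read_csv(path)

# loaded now holds the same columns and values as df
print(loaded)
```

If you leave out index=False, the reloaded table gains an extra unnamed column holding the old row labels.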


5. Data Visualization

(Data tells stories better when visualized)

Matplotlib

The base plotting library in Python.

import matplotlib.pyplot as plt
plt.plot([1, 2, 3], [4, 5, 6])
plt.show()

Best for:

  • Simple plots

  • Full customization


Seaborn

Built on Matplotlib, with more attractive defaults and a focus on statistical plots.

import seaborn as sns
sns.histplot(df["Age"])

Best for:

  • Distribution plots

  • Heatmaps

  • Relationship plots


Plotly

An interactive visualization library.

import plotly.express as px
# Assumes df has both "Age" and "Salary" columns
fig = px.scatter(df, x="Age", y="Salary")
fig.show()

Best for:

  • Dashboards

  • Hover, zoom, interactive charts


6. Introduction to Inferential Statistics

What is Inferential Statistics?

It helps us:

Draw conclusions about a population using a sample

Examples:

  • Predict election results from a survey

  • Test if a new medicine works

  • Decide if a marketing change increased sales


7. Fundamentals of Probability Distributions

A probability distribution describes how likely each possible value of a random variable is.

Common Distributions

Normal Distribution

  • Bell-shaped curve

  • Mean = Median = Mode

  • Often used to model heights, exam marks, and measurement errors

Uniform Distribution

  • All outcomes equally likely

Binomial Distribution

  • Success/failure experiments

  • Example: coin tosses

Understanding distributions helps in:

  • Hypothesis testing

  • Model assumptions

  • Risk analysis
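One way to get a feel for these three distributions is to draw random samples from each and look at their averages. The sketch below uses NumPy's random generator; the parameters (a mean height of 170, 10 coin tosses, and so on) are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=42)   # fixed seed so the run is repeatable

# Normal: bell curve centred on loc, with spread given by scale
heights = rng.normal(loc=170, scale=10, size=10_000)

# Uniform: every value between low and high is equally likely
uniform = rng.uniform(low=0, high=1, size=10_000)

# Binomial: number of heads in 10 fair coin tosses, repeated 10,000 times
heads = rng.binomial(n=10, p=0.5, size=10_000)

print(heights.mean(), heights.std())   # close to 170 and 10
print(uniform.mean())                  # close to 0.5
print(heads.mean())                    # close to 10 * 0.5 = 5
```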


8. The Central Limit Theorem (CLT)

Simple Explanation

Even if the original data is not normal:

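The distribution of sample means approaches a normal distribution as the sample size increases, provided the observations are independent and the population has finite variance.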

Why CLT is important:

  • Makes statistical tests possible

  • Justifies normal-based confidence intervals around the sample mean

  • Foundation of inferential statistics
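The theorem can be checked with a small simulation. As an assumed example, the sketch below draws many samples from an exponential distribution (which is strongly skewed, not bell-shaped) and looks at how the sample means behave. The sample size of 50 and the 5,000 repetitions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Skewed population: exponential with mean 1 and standard deviation 1
sample_size = 50
num_samples = 5_000

# Each row is one sample; take the mean of each row
samples = rng.exponential(scale=1.0, size=(num_samples, sample_size))
sample_means = samples.mean(axis=1)

print(sample_means.mean())   # close to the population mean, 1.0
print(sample_means.std())    # close to 1 / sqrt(50), about 0.141

# In a normal distribution about 95% of values lie within
# 1.96 standard deviations of the mean
within = np.abs(sample_means - 1.0) < 1.96 / np.sqrt(sample_size)
print(within.mean())         # close to 0.95
```

Plotting a histogram of sample_means would show the familiar bell shape, even though a histogram of the raw data is heavily right-skewed.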


9. Hypothesis Testing

What is Hypothesis Testing?

A formal method to decide:

Is an observed result real or due to chance?

Key Terms

  • Null Hypothesis (H₀): No effect

  • Alternative Hypothesis (H₁): There is an effect

  • p-value: Probability of observing a result at least as extreme as the one in the data, assuming H₀ is true

Example:

  • H₀: New website has no impact on sales

  • H₁: New website increases sales

If the p-value is below the chosen significance level (conventionally 0.05), we reject H₀.
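To show the idea in code, here is a sketch of the website example using a permutation test. This is one simple way to get a p-value with only NumPy, not the only one (a t-test is another common choice), and the sales figures are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical daily sales (invented numbers)
old_site = np.array([100, 102, 98, 101, 99, 100, 103, 97, 100, 101])
new_site = np.array([108, 110, 107, 111, 109, 112, 108, 110, 109, 111])

observed = new_site.mean() - old_site.mean()

# Under H0 the labels "old" and "new" are interchangeable, so shuffle them
combined = np.concatenate([old_site, new_site])
n_old = len(old_site)
num_shuffles = 10_000
count = 0
for _ in range(num_shuffles):
    shuffled = rng.permutation(combined)
    diff = shuffled[n_old:].mean() - shuffled[:n_old].mean()
    if diff >= observed:   # one-sided: H1 says the new site increases sales
        count += 1

# Add-one correction so the p-value is never reported as exactly 0
p_value = (count + 1) / (num_shuffles + 1)
reject_h0 = p_value < 0.05
print(p_value, reject_h0)
```

The p-value here is the share of random label shuffles that produce a difference at least as large as the one observed, which matches the definition in the key terms above.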


10. Univariate Analysis

Analysis of one variable.

Examples

  • Mean salary

  • Age distribution

  • Frequency of categories

Common Techniques

  • Histogram

  • Boxplot

  • Descriptive statistics

Purpose:

  • Understand distribution

  • Identify outliers

  • Detect skewness
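A short sketch of univariate analysis with Pandas follows. The salary figures (in thousands) and city names are invented, with one deliberately large salary to show its effect.

```python
import pandas as pd

# Hypothetical salaries in thousands, with one very large value
salary = pd.Series([30, 32, 35, 31, 33, 34, 36, 200], name="salary_k")

summary = salary.describe()   # count, mean, std, min, quartiles, max
median = salary.median()      # 33.5, far below the mean of 53.875
skewness = salary.skew()      # positive value means a long right tail

# Frequency of categories
city = pd.Series(["Pune", "Delhi", "Pune", "Mumbai", "Pune"])
counts = city.value_counts()  # Pune appears 3 times, the others once
```

A mean well above the median, as here, is a quick hint that the data is right-skewed or contains a large outlier.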


11. Bivariate Analysis

Analysis of two variables together.

Examples

  • Age vs Salary

  • Study hours vs marks

Techniques

  • Scatter plots

  • Correlation

  • Grouped bar charts

Purpose:

  • Find relationships

  • Detect trends

  • Understand dependencies
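As a sketch of bivariate analysis, the example below measures the correlation between study hours and marks and compares average marks across two groups. All values and the "group" column are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5, 6],
    "marks": [35, 45, 50, 62, 70, 78],
    "group": ["A", "A", "A", "B", "B", "B"],
})

# Pearson correlation: ranges from -1 to 1, near 1 means a strong positive linear relationship
r = df["hours"].corr(df["marks"])

# Numeric variable compared across categories
mean_marks = df.groupby("group")["marks"].mean()   # group B averages 70.0

print(r)
print(mean_marks)
```

Keep in mind that a high correlation shows that two variables move together, not that one causes the other.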


12. Missing Value Treatment

Why Missing Values Occur

  • Data entry errors

  • Survey non-responses

  • Sensor failures

Common Techniques

  • Remove rows/columns

  • Fill with mean/median/mode

  • Forward fill / backward fill

# Fill missing values in numeric columns with each column's mean
df.fillna(df.mean(numeric_only=True))

Choosing the right method depends on:

  • Amount of missing data

  • Importance of the column
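The techniques listed above can be sketched on a tiny table. The data is invented, with one missing age and one missing city.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [25, np.nan, 35, 40],
    "City": ["Pune", "Delhi", None, "Pune"],
})

# Step 1: count missing values per column (1 in each here)
missing_per_column = df.isna().sum()

# Option A: drop rows with any missing value (keeps the first and last rows)
dropped = df.dropna()

# Option B: fill numeric columns with the median, text columns with the mode
filled = df.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].median())       # median of 25, 35, 40 is 35
filled["City"] = filled["City"].fillna(filled["City"].mode()[0])   # most frequent city is "Pune"

# Option C: forward fill copies the previous row's value downward
forward = df.ffill()
```

Forward fill makes most sense for ordered data such as time series, where the previous reading is a reasonable guess for the missing one.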


13. Outlier Treatment

What Are Outliers?

Extreme values that differ significantly from others.

Example:

  • Salary = ₹10,00,000 when most earn ₹30,000

Detection Methods

  • Boxplots

  • Z-score

  • IQR (Interquartile Range)

Treatment Options

  • Remove outliers

  • Cap values

  • Transform data (log scale)

Outliers can:

  • Skew averages

  • Mislead models

  • Hide true patterns
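Here is a minimal sketch of detection and treatment, based on the salary example above. The 1.5 × IQR rule is a common convention rather than a fixed law, and the numbers are illustrative.

```python
import numpy as np
import pandas as pd

salary = pd.Series([30_000, 32_000, 28_000, 31_000, 29_000, 33_000, 1_000_000])

# IQR rule: flag values more than 1.5 * IQR beyond the quartiles
q1, q3 = salary.quantile(0.25), salary.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (salary < lower) | (salary > upper)   # only the last value is flagged

# Z-score: distance from the mean in standard deviations.
# The outlier inflates the standard deviation itself, so in a sample
# this small its z-score stays below the usual cut-off of 3.
z = (salary - salary.mean()) / salary.std()

# Treatment options
removed = salary[~is_outlier]                    # drop the outlier
capped = salary.clip(lower=lower, upper=upper)   # cap at the IQR fences
logged = np.log(salary)                          # compress large values
```

This is one reason the IQR method is often preferred for small datasets: quartiles are barely affected by a single extreme value, while the mean and standard deviation are.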


Final Thoughts

This blog covered the entire data analysis journey:

  • From NumPy and Pandas

  • To visualization

  • To statistics and data cleaning

If you master these concepts, you will have a strong foundation for data science, machine learning, and AI.
