When working with machine learning, one of the first challenges you’ll encounter is how to handle categorical data. Algorithms love numbers, but your dataset may have values like "Red", "Blue", "Green", or "Low", "Medium", "High".
How do we turn these into something a model can understand?
This is where LabelEncoder and OneHotEncoder come in. Both are popular encoding techniques, but they solve different problems. Let’s dive in.
🔹 What is LabelEncoder?
LabelEncoder assigns a unique integer to each unique category in a column. In scikit-learn, the integers follow the sorted (alphabetical) order of the category names.
Example:
Suppose you have a column Color = [Red, Blue, Green].
LabelEncoder will map them like this:
Red → 2
Blue → 0
Green → 1
So your data becomes: [2, 0, 1].
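The mapping above can be reproduced with a few lines. This is a minimal sketch assuming scikit-learn is installed; the variable names are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

colors = ["Red", "Blue", "Green"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(colors)  # learn the categories, then map them to integers

# classes_ holds the categories in sorted order; a category's index is its code
print(encoder.classes_.tolist())  # ['Blue', 'Green', 'Red']
print(encoded.tolist())           # [2, 0, 1]

# inverse_transform maps the integers back to the original labels
decoded = encoder.inverse_transform(encoded)
print(decoded.tolist())           # ['Red', 'Blue', 'Green']
```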
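👉 Good for: Categories with a natural order (e.g., "Low" < "Medium" < "High"), as long as the integers follow that order. LabelEncoder numbers categories alphabetically (High → 0, Low → 1, Medium → 2) and is intended by scikit-learn for target labels, so for ordered input features an encoder with an explicit order, such as OrdinalEncoder, is the safer choice.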
👉 Problem: If you use it on unordered categories like colors, the model may think Red > Green > Blue, which doesn’t make sense.
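To see why the order of the integers matters, here is a small sketch comparing LabelEncoder's alphabetical codes with scikit-learn's OrdinalEncoder, which accepts an explicit order through its `categories` parameter. The sample values are made up for illustration.

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

levels = ["Low", "Medium", "High", "Low"]

# LabelEncoder sorts alphabetically: High=0, Low=1, Medium=2 (not the real order)
alphabetical = LabelEncoder().fit_transform(levels)
print(alphabetical.tolist())     # [1, 2, 0, 1]

# OrdinalEncoder takes the intended order explicitly
ordinal = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
# It expects 2-D input (rows x features), so each value becomes its own row
ordered = ordinal.fit_transform([[v] for v in levels])
print(ordered.ravel().tolist())  # [0.0, 1.0, 2.0, 0.0]
```

With the explicit order, Low < Medium < High holds for the encoded values as well.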
🔹 What is OneHotEncoder?
OneHotEncoder creates one binary column per category.
Instead of a single integer code, each row gets a 1 in the column for its category and a 0 in every other column.
Example:
For Color = [Red, Blue, Green], OneHotEncoder creates:
Color    Color_Blue  Color_Green  Color_Red
Red      0           0            1
Blue     1           0            0
Green    0           1            0
👉 Good for: Unordered categories (nominal data), like colors, city names, product types.
👉 Problem: It increases the number of columns (known as dimensionality). If you have 1,000 unique cities, you’ll end up with 1,000 new columns!
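The column growth is easy to check. The sketch below uses 1,000 made-up city names (hypothetical data) and relies on OneHotEncoder's default behavior of returning a SciPy sparse matrix, which stores only the non-zero entries.

```python
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: 1,000 distinct city names, one per row
cities = [[f"City_{i}"] for i in range(1000)]

encoder = OneHotEncoder()  # returns a SciPy sparse matrix by default
encoded = encoder.fit_transform(cities)

print(encoded.shape)  # (1000, 1000): one column per unique city
print(encoded.nnz)    # 1000: a single non-zero entry per row is stored
```

The sparse format keeps memory use manageable, but the model still sees 1,000 features.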
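🔹 LabelEncoder vs OneHotEncoder: Key Differences
Aspect          | LabelEncoder                         | OneHotEncoder
Output          | One integer column                   | One binary (0/1) column per category
Implied order   | Yes (the integers can be compared)   | No
Best suited to  | Ordered categories                   | Unordered (nominal) categories
Main drawback   | False order on unordered categories  | Many columns when there are many categories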
🔹 Example in Python
Here’s a quick demo with scikit-learn:
Output:
✅ Label Encoding
   Color  Color_Label
0    Red            2
1   Blue            0
2  Green            1
3    Red            2
✅ One Hot Encoding
   Color_Blue  Color_Green  Color_Red
0         0.0          0.0        1.0
1         1.0          0.0        0.0
2         0.0          1.0        0.0
3         0.0          0.0        1.0
🔹 When Should You Use Each?
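Use integer encoding when your categories have a natural order, and make sure the integers follow it:
Example: Education Level = [Primary < Secondary < College < Postgraduate]. LabelEncoder would number these alphabetically (College → 0, Postgraduate → 1, Primary → 2, Secondary → 3), so state the order explicitly, for example with OrdinalEncoder.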
Use OneHotEncoder when your categories are just names with no order:
Example: Color = [Red, Blue, Green]
Example: City = [Delhi, Mumbai, Chennai]
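When a dataset mixes both kinds of feature, the two approaches can be combined. The sketch below is one possible arrangement using scikit-learn's ColumnTransformer; the DataFrame and its values are hypothetical, and `sparse_threshold=0` is set only to get a dense array that is easy to print.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical data for illustration
df = pd.DataFrame({
    "Education": ["Primary", "College", "Secondary", "Postgraduate"],
    "City": ["Delhi", "Mumbai", "Chennai", "Delhi"],
})

education_order = ["Primary", "Secondary", "College", "Postgraduate"]

preprocessor = ColumnTransformer(
    transformers=[
        # Ordered feature: integer codes that follow the stated order
        ("education", OrdinalEncoder(categories=[education_order]), ["Education"]),
        # Unordered feature: one binary column per city
        ("city", OneHotEncoder(), ["City"]),
    ],
    sparse_threshold=0,  # always return a dense NumPy array
)

# Columns: Education code, then City_Chennai, City_Delhi, City_Mumbai
encoded = preprocessor.fit_transform(df)
print(encoded)
```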
🔹 The Takeaway
LabelEncoder is simple and compact, but it can imply a false order if you apply it to unordered categories.
OneHotEncoder avoids false order but can create lots of new columns.
The best choice depends on the type of categorical data you have.
If you’re building models with categorical features, mastering these two encoders is essential. They’re small steps in preprocessing, but they make a big difference in how well your model learns!
👉 Next Step: In real projects, you might also hear about Target Encoding or Embedding Layers for handling large categorical variables. These are more advanced techniques, but LabelEncoder and OneHotEncoder are the perfect place to start.