Lightnews — Scholar-powered news

Daniel

@indent4.bsky.social

Follow me for insights on real-world ML🚀

If this thread helped, drop a like or repost

February 21, 2025 at 5:31 PM

Daniel

@indent4.bsky.social

Why use the CPO Pattern?

✅ Scalability – Add new models & components easily.
✅ Maintainability – Elements are decoupled, so you can change, test, and debug individual parts without breaking the entire system.
✅ Reusability – Reuse components across different ML projects.

February 21, 2025 at 5:31 PM

Daniel

@indent4.bsky.social

3️⃣ Orchestrators 🤖
The orchestrator manages pipeline execution.
Because it decides which components run, we can easily switch them (e.g., swap XGBoost for RandomForest).
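A minimal sketch of the idea, using only the standard library. The `Orchestrator` class, its `run` method, and the two stand-in trainers are hypothetical names for illustration; real trainers would wrap libraries such as XGBoost or RandomForest.

```python
class Orchestrator:
    """Picks a trainer by name and runs it on the data."""

    def __init__(self, trainers):
        # Registry mapping a config name to a trainer callable (assumed design).
        self.trainers = trainers

    def run(self, data, model_name):
        if model_name not in self.trainers:
            raise ValueError(f"Unknown model: {model_name}")
        return self.trainers[model_name](data)


# Stand-ins for real model wrappers; both share the same call signature.
def train_mean_model(values):
    return {"model": "mean", "prediction": sum(values) / len(values)}


def train_median_model(values):
    ordered = sorted(values)
    # Middle element (upper middle for even lengths).
    return {"model": "median", "prediction": ordered[len(ordered) // 2]}


orchestrator = Orchestrator({"mean": train_mean_model, "median": train_median_model})
data = [1.0, 2.0, 9.0]

# Swapping the model is a one-word change in the call.
first = orchestrator.run(data, "mean")
second = orchestrator.run(data, "median")
```

Because every trainer exposes the same interface, the rest of the pipeline does not need to change when the model does.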

February 21, 2025 at 5:31 PM

Daniel

@indent4.bsky.social

2️⃣ Pipelines 🛤️
A TrainingPipeline ties together components into a single workflow.
Pipelines ensure a clear flow of data, from raw input to final evaluation.
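One possible shape for such a pipeline, sketched with plain functions as components. Only the name `TrainingPipeline` comes from the post; the `steps` list, the `run` method, and the toy step functions are assumptions.

```python
class TrainingPipeline:
    """Runs components in order, passing each output to the next step."""

    def __init__(self, steps):
        # `steps` is a list of (name, callable) pairs; this interface is assumed.
        self.steps = steps

    def run(self, data=None):
        for _name, step in self.steps:
            data = step(data)
        return data


def import_data(_):
    # Stand-in for a real data import component.
    return [1.0, 2.0, 3.0, 4.0]


def preprocess(values):
    # Center the values around zero.
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def train(values):
    # Stand-in "model": just summarizes the centered data.
    return {"n_samples": len(values), "spread": max(values) - min(values)}


pipeline = TrainingPipeline([
    ("import", import_data),
    ("preprocess", preprocess),
    ("train", train),
])
result = pipeline.run()
```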

February 21, 2025 at 5:31 PM

Daniel

@indent4.bsky.social

1️⃣ Components 📦
Each key function (data import, preprocessing, training, evaluation) is wrapped in its own class.
This makes each function reusable, testable, and modular.

Example: A DataImporter class to load data!
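A minimal sketch of what such a component could look like, assuming CSV input. The class name `DataImporter` comes from the post; the `load` method and the in-memory sample are hypothetical.

```python
import csv
import io


class DataImporter:
    """Component that only knows how to load rows from a CSV source."""

    def __init__(self, source):
        # `source` is any text file-like object (assumed interface).
        self.source = source

    def load(self):
        # Return the data as a list of dicts, one per CSV row.
        return list(csv.DictReader(self.source))


# An in-memory file stands in for a real CSV on disk.
sample = io.StringIO("age,income\n31,52000\n45,61000\n")
rows = DataImporter(sample).load()
```

Because the class does one thing, it can be tested with a small in-memory file like the one above and reused in any project that reads CSV data.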

February 21, 2025 at 5:31 PM

Daniel

@indent4.bsky.social

DataFramed, Super Data Science, The Data Scientist Show, Ken's Nearest Neighbours

February 19, 2025 at 3:34 PM

Daniel

@indent4.bsky.social

Follow me for insights on real-world ML🚀

If this thread helped, drop a like or repost 🔁

February 19, 2025 at 3:32 PM

Daniel

@indent4.bsky.social

EDA isn’t just a formality—it’s the difference between a strong ML model and garbage results. 🚀

✅ Understand your data
✅ Spot issues before modeling
✅ Find strong predictors

These steps serve as a starting point. Analyse further depending on the needs of your project.

February 19, 2025 at 3:32 PM

Daniel

@indent4.bsky.social

💡 Why?

👉 Detect multicollinearity (too many correlated features = model confusion)
👉Uncover hidden patterns
👉Reduce dimensionality if needed (PCA can help)

📊 Example: If two features are 99% correlated, you probably don’t need both in your model.
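A sketch of one common way to act on this: scan the upper triangle of the absolute correlation matrix and drop the later column of each highly correlated pair. The column names, toy values, and the 0.99 threshold are assumptions for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],  # same information as height_cm
    "weight_kg": [70, 55, 80, 62, 90],
})

corr = df.corr().abs()

# Keep only the upper triangle so each pair is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop the second column of any pair above the (hypothetical) threshold.
to_drop = [col for col in upper.columns if (upper[col] > 0.99).any()]
reduced = df.drop(columns=to_drop)
```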

February 19, 2025 at 3:32 PM

Daniel

@indent4.bsky.social

4️⃣ Multivariate Analysis 🔄 (Relationships Between Features)

Time to go deeper and check how features interact with each other.

🔥 Correlation heatmaps (sns.heatmap(df.corr(numeric_only=True), annot=True))
🔥 Pairplots (sns.pairplot(df))
🔥 PCA (if needed) (PCA().fit_transform(df))
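A hedged sketch of the PCA step using scikit-learn. Standardizing first is an added assumption (PCA is sensitive to feature scale), and the toy columns are invented for illustration.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [2.1, 3.9, 6.2, 8.1, 9.8, 12.2],  # roughly 2 * x1
    "x3": [5, 3, 6, 2, 7, 1],
})

# Put all features on the same scale before PCA.
scaled = StandardScaler().fit_transform(df)

pca = PCA(n_components=2)
components = pca.fit_transform(scaled)

# Share of total variance captured by each principal component.
explained = pca.explained_variance_ratio_
```

If the first component captures most of the variance, as it tends to when features are strongly correlated, fewer dimensions may be enough for modeling.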

February 19, 2025 at 3:32 PM

Daniel

@indent4.bsky.social

💡 Why?

👉 Find strong predictors 📈
👉 Identify non-linear relationships (these may call for a non-linear model)
👉 Detect leakage (some features might be too correlated with the target!)
📊 Example: A high correlation might mean a strong predictor… or data leakage. Always check!
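A simple first-pass check, sketched under the assumption that a near-perfect correlation with the target deserves a manual look. The column names, toy values, and 0.95 threshold are hypothetical; a flagged feature is only a suspect, not proof of leakage.

```python
import pandas as pd

df = pd.DataFrame({
    "feature_a": [1, 3, 2, 5, 4, 6],
    "feature_b": [2, 1, 4, 3, 6, 5],
    "target_copy": [10, 20, 30, 40, 50, 61],  # almost a copy of the target
    "target": [10, 20, 30, 40, 50, 60],
})

threshold = 0.95  # hypothetical cut-off

# Absolute correlation of every feature with the target.
corr_with_target = df.corr()["target"].drop("target").abs()

# Features to inspect by hand: strong predictor or leakage?
suspects = corr_with_target[corr_with_target > threshold].index.tolist()
```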

February 19, 2025 at 3:32 PM

Daniel

@indent4.bsky.social

3️⃣ Bivariate Analysis 📊

See how features relate to the target variable.

📈 Numerical:
Correlation heatmaps (df.corr(numeric_only=True))
Scatter plots

📊 Categorical:
Box plots (sns.boxplot(x='category', y='target', data=df))
Grouped means (df.groupby('category')['target'].mean())
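The numeric parts of these checks can be sketched without plotting. The toy housing-style columns below are invented for illustration; `numeric_only=True` assumes pandas 1.5 or newer.

```python
import pandas as pd

df = pd.DataFrame({
    "size": [50, 60, 80, 100, 120, 150],
    "category": ["flat", "flat", "house", "flat", "house", "house"],
    "target": [100, 120, 170, 190, 250, 300],
})

# Numerical features: correlation of each one with the target.
corr_with_target = df.corr(numeric_only=True)["target"].drop("target")

# Categorical features: mean target per category.
group_means = df.groupby("category")["target"].mean()
```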

February 19, 2025 at 3:32 PM

Daniel

@indent4.bsky.social

💡 Why?

👉 Detect skewness (maybe you need log transformations)
👉 Spot outliers
👉 Identify imbalanced categories

📊 Example: If a feature is heavily skewed, your linear models might struggle—fix it early!
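A small sketch of measuring skewness and applying a log transformation. The toy values are invented, and `np.log1p` (log of 1 + x) is one common choice that also handles zeros; it assumes non-negative data.

```python
import numpy as np
import pandas as pd

feature = pd.Series([1, 2, 2, 3, 3, 4, 50, 400], name="feature")

# Positive skew means a long right tail.
skew_before = feature.skew()

# log1p compresses large values far more than small ones.
transformed = np.log1p(feature)
skew_after = transformed.skew()
```

On this toy series the transformed feature is still right-skewed, but noticeably less so than the original.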

February 19, 2025 at 3:32 PM

Daniel

@indent4.bsky.social

2️⃣ Univariate Analysis 🔍

Let’s get to know each feature individually.

🔍 What to check?

📊 Numerical: Histograms, box plots (df['feature'].hist())

📊 Categorical: Value counts, bar plots (df['feature'].value_counts())

📊 Outliers: Box plots (sns.boxplot(x=df['feature']))
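The same checks can also be done numerically. The sketch below uses the 1.5 × IQR rule, which is the default whisker rule behind box plots in matplotlib and seaborn; the column names and values are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "income": [40, 42, 45, 47, 50, 300],
    "segment": ["a", "a", "b", "a", "c", "a"],
})

# Categorical: how often does each value occur?
counts = df["segment"].value_counts()

# Numerical: flag points outside 1.5 * IQR from the quartiles.
q1 = df["income"].quantile(0.25)
q3 = df["income"].quantile(0.75)
iqr = q3 - q1
is_outlier = (df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)
outliers = df[is_outlier]
```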

February 19, 2025 at 3:32 PM

Daniel

@indent4.bsky.social

💡 Why? This helps spot early issues (wrong dtypes, missing data, duplicates, incorrect values) before they ruin your model.

📊 Example: Checking for missing values can reveal if you need imputation or feature removal.
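A sketch of how that decision might look in pandas. The 50% drop threshold and the median imputation are assumptions chosen for illustration, not rules from the thread.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [31.0, np.nan, 45.0, 52.0],
    "notes": [np.nan, np.nan, np.nan, "ok"],
})

# Share of missing values per column.
missing_share = df.isnull().mean()

# Feature removal: drop columns that are mostly empty (hypothetical 50% cut-off).
to_drop = missing_share[missing_share > 0.5].index.tolist()
cleaned = df.drop(columns=to_drop)

# Imputation: fill the remaining gaps in `age` with the column median.
cleaned["age"] = cleaned["age"].fillna(cleaned["age"].median())
```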

February 19, 2025 at 3:32 PM

Daniel

@indent4.bsky.social

1️⃣ Understanding Data Structure & Metadata 🏗️

Before diving in, know your data.

🔍 What to check?
✅ Dataset shape (df.shape)
✅ Data types (df.dtypes)
✅ Descriptive statistics (df.describe())
✅ Missing values (df.isnull().sum())
✅ Unique values & duplicates
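The checklist above can be sketched on a toy DataFrame as follows. The data is invented; `nunique` and `duplicated` are one way to cover the last item, which the post does not spell out.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [31, 45, 45, np.nan],
    "city": ["Oslo", "Rome", "Rome", "Oslo"],
})

shape = df.shape                              # (rows, columns)
dtypes = df.dtypes                            # data type per column
summary = df.describe()                       # descriptive statistics (numeric columns)
missing = df.isnull().sum()                   # missing values per column
unique_counts = df.nunique()                  # distinct non-null values per column
duplicate_rows = int(df.duplicated().sum())   # fully repeated rows
```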

February 19, 2025 at 3:32 PM

Daniel

@indent4.bsky.social

Should you separate your SQL servers into different resource groups? This depends on your specific use case. If you have a small number of SQL databases and they are tightly integrated with your app, then keeping them in the same resource group as your app might be the way to go.

February 10, 2025 at 12:07 AM

Daniel

@indent4.bsky.social

It's recommended to keep Data Lakes and Blob Storage in a different resource group than your application or project resource group. This limits potential security issues and eases resource and cost management.

February 10, 2025 at 12:07 AM

Daniel

@indent4.bsky.social

It's a good idea to keep data associated with different projects in dedicated Blob storage containers.

February 10, 2025 at 12:06 AM

Daniel

@indent4.bsky.social

Create dedicated resource groups for each large project or application.

February 10, 2025 at 12:06 AM
