Rahul Jain
rahulj51.bsky.social
Software, Data and Analytics Engineering. Principal Engineer at Bobsled.

Dividing my time between Berlin and Bangalore.

Formerly led Data and Eng at Beat Mobility, Omio and Thoughtworks.
April 26, 2025 at 2:05 PM
Cursor having an identity crisis.
April 26, 2025 at 2:04 PM
It takes about two weeks of using these fancy table formats to know that what's documented is just the tip of the iceberg. Most of the operational knowledge is undocumented and buried in GitHub issues and Slack threads.
February 1, 2025 at 2:24 AM
I will make my types so so safe this year. You just watch.
January 11, 2025 at 2:25 AM
Spent two days fiddling with all kinds of Spark settings to solve a network timeout issue. It finally worked, but this is my biggest crib about Spark. Most of the time, the only way to fix stuff is trial and error. /1
January 9, 2025 at 2:18 PM
Can someone explain why the Spark DataFrame API doesn't have the concept of create-if-not-exists but the SQL API does?
January 7, 2025 at 1:39 AM
Your org may have a set of 20 well-written values, but even well-meaning techies usually care about only two traits in their colleagues:
- Technical skills
- Being nice to each other
January 2, 2025 at 3:59 AM
Attended a big fat Indian wedding and went berserk with the food.
January 1, 2025 at 8:57 AM
I didn't know that Iceberg also creates hive style partitioned folders. This is surprising. I thought the whole idea was to manage partitions at metadata level.
December 31, 2024 at 10:02 AM
The worst resumé writing advice is that it should be contained within 1-2 pages.

Add more pages. Tell your story. But tell it well.
December 23, 2024 at 9:14 AM
2 years of TypeScript made me a better Python programmer.
December 18, 2024 at 3:16 PM
A SQL command I'd love to see is

```
ALTER TABLE <table2> MERGE SCHEMA <table1>
WHEN MATCHED CHANGE DATATYPE
WHEN NOT MATCHED BY SOURCE DROP COLUMN
WHEN NOT MATCHED BY TARGET ADD COLUMN
```
December 11, 2024 at 12:48 PM
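The wished-for `MERGE SCHEMA` command in the post above does not exist as far as the post says, but its three branches can be approximated by diffing two column-to-type mappings. The following is only a minimal sketch: the function name `merge_schema_statements`, the dict-based schema representation, and the generic `ALTER TABLE` syntax it emits are all assumptions for illustration, and real engines differ in how (and whether) they support type changes and column drops.

```python
from typing import Dict, List


def merge_schema_statements(
    target_table: str,
    source: Dict[str, str],
    target: Dict[str, str],
) -> List[str]:
    """Return ALTER TABLE statements that would make `target` match `source`.

    `source` and `target` map column name -> data type (hypothetical
    representation). The emitted SQL is generic and not tied to any engine.
    """
    statements: List[str] = []

    for column, source_type in source.items():
        if column not in target:
            # WHEN NOT MATCHED BY TARGET ADD COLUMN
            statements.append(
                f"ALTER TABLE {target_table} ADD COLUMN {column} {source_type}"
            )
        elif target[column] != source_type:
            # WHEN MATCHED CHANGE DATATYPE
            statements.append(
                f"ALTER TABLE {target_table} ALTER COLUMN {column} TYPE {source_type}"
            )

    for column in target:
        if column not in source:
            # WHEN NOT MATCHED BY SOURCE DROP COLUMN
            statements.append(f"ALTER TABLE {target_table} DROP COLUMN {column}")

    return statements


if __name__ == "__main__":
    table1 = {"id": "BIGINT", "name": "STRING"}   # source schema
    table2 = {"id": "INT", "legacy": "STRING"}    # target schema
    for stmt in merge_schema_statements("table2", table1, table2):
        print(stmt)
```

Generating statements instead of executing them keeps the sketch engine-agnostic and lets the plan be reviewed before anything destructive, such as a column drop, is run.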
What Python type checker is everyone using - mypy or pyright?
December 4, 2024 at 12:15 PM
Indian grey mornings have a post-apocalyptic feel to them. A brownish, heavy, noisy grey as opposed to the silent steel grey of Berlin.
November 30, 2024 at 5:01 AM
Data lake architectures are uncannily similar to farm life simulators.
November 29, 2024 at 4:37 AM
There is something about certain coding practices that is deeply satisfying to a programmer at a dopamine release level - which explains why they go OCD about these. Functional programming, TDD, refactoring and type design - all have this quality. It's like an itch you can't stop scratching.
November 28, 2024 at 2:09 AM
The majority of data orgs in the world are still woefully unaware of the latest in data tech. They are still sending each other CSVs over FTP and have never heard of Airflow.
November 23, 2024 at 2:25 AM
Depending on who you talk to, there are two versions of data democracy vis-a-vis technology:

1. My data, my choice. I want to be able to use my choice of tool/tech for my data.

2. Inclusivity. Everyone should have easy access to data. Therefore, I'll choose generic tools that anyone can use easily.
November 21, 2024 at 7:23 AM
A good tip for data engineering architecture is to treat a "Table" as a logical entity and not tie it with the way it is materialized or mapped to other subsystems. A table can be anything - a set of files, a view, the result of a query. Its definition and physical manifestation should be decoupled.
November 18, 2024 at 3:00 AM
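One way to picture the "table as a logical entity" tip above is to keep the table's definition (name and columns) separate from a swappable description of how it is physically realized. This is a rough sketch under stated assumptions: `LogicalTable`, `Materialization`, `FileSet`, and `QueryResult` are hypothetical names invented for illustration, not part of any catalog or library.

```python
from dataclasses import dataclass
from typing import Dict, Protocol, Tuple


class Materialization(Protocol):
    """Anything that can physically back a table (hypothetical interface)."""

    def describe(self) -> str: ...


@dataclass(frozen=True)
class FileSet:
    """A table backed by a set of files."""

    paths: Tuple[str, ...]

    def describe(self) -> str:
        return f"{len(self.paths)} file(s)"


@dataclass(frozen=True)
class QueryResult:
    """A table backed by the result of a query (or a view)."""

    sql: str

    def describe(self) -> str:
        return f"result of: {self.sql}"


@dataclass
class LogicalTable:
    """The logical definition; the materialization is just an attribute."""

    name: str
    columns: Dict[str, str]
    materialization: Materialization

    def rematerialize(self, new: Materialization) -> "LogicalTable":
        # Same logical table, different physical manifestation.
        return LogicalTable(self.name, dict(self.columns), new)


if __name__ == "__main__":
    orders = LogicalTable(
        "orders",
        {"id": "BIGINT", "amount": "DOUBLE"},
        FileSet(("part-0.parquet", "part-1.parquet")),
    )
    as_view = orders.rematerialize(QueryResult("SELECT id, amount FROM raw_orders"))
    print(orders.materialization.describe())
    print(as_view.materialization.describe())
```

Downstream code that depends only on `name` and `columns` is unaffected when the backing changes from files to a query.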
Does anyone know if BigQuery uses table metadata to speed up MIN/MAX of a partition column? It looks like it always does a column scan.
November 7, 2024 at 5:12 PM
Data pipelines are difficult to generalize because a lot depends on the specific characteristics of a table - there is no one size fits all. Cloud warehouses offer high level abstractions which work but cost $$$. Lakes offer hundreds of levers to pull but at the cost of generalizability.
November 4, 2024 at 7:17 PM
The poor support for multi-table transactions in the DE world is somewhat limiting. We have just gotten used to not having it and replacing it with data integrity checks and manual rollbacks. But with table formats and catalogs, this shouldn't be too difficult to implement (or so I think).
November 2, 2024 at 9:59 AM
What's a good way to visually organize a flow chart?
November 1, 2024 at 7:13 PM
Happy Diwali
October 31, 2024 at 6:03 AM
It's distracting to review code when it deviates from the original intent and applies arbitrary refactors because of the "leave the campsite cleaner" practice that SEs follow.

Clean your code by all means but don't sneak it in with critical changes.
October 29, 2024 at 6:24 AM