jon esperanza (@jonathanesperanza.com)
data and ml @creditkarma
jonathanesperanza.com
a big idea i like here:

declarative infra to meet the needs of an individual use case, in this case a feature.
December 7, 2024 at 11:42 PM
Chronon supports the following feature computations:
- groupBy -- aggregation compute
- join -- join compute
- stagingQuery -- arbitrary compute expressed as Spark SQL

time-based aggregation and windowing are first-class concepts, along with SQL primitives.
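to make this concrete, a rough sketch of a windowed groupBy declaration in Chronon's Python API (shape follows the project's quickstart; the import paths, table names, and columns here are illustrative assumptions, not a verified example):

```python
# sketch: a 7-day windowed SUM of purchase_price per user, declared once.
# Chronon generates the batch/streaming pipelines and serving artifacts
# from this single definition -- names are illustrative.
from ai.chronon.api.ttypes import EventSource, Source
from ai.chronon.group_by import Aggregation, GroupBy, Operation, TimeUnit, Window
from ai.chronon.query import Query, select

purchases = Source(
    events=EventSource(
        table="data.purchases",                  # raw event table in the warehouse
        query=Query(
            selects=select("user_id", "purchase_price"),
            time_column="ts",                    # event time drives the windowing
        ),
    )
)

purchase_features = GroupBy(
    sources=[purchases],
    keys=["user_id"],                            # feature is keyed by user
    online=True,                                 # also serve this feature online
    aggregations=[
        Aggregation(
            input_column="purchase_price",
            operation=Operation.SUM,
            windows=[Window(length=7, timeUnit=TimeUnit.DAYS)],
        ),
    ],
)
```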
December 7, 2024 at 10:13 PM
> users declare their feature once and Chronon generates all infra needed to continuously turn raw data into features for both training and serving

seems Chronon orchestrates pipelines for declared features

infra: Kafka, Spark, Hive, Airflow, and a customizable key-value store
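a rough sketch of how a single definition fans out to training data (same caveats on exact API names; `purchase_features` is the groupBy sketched in the note above, and `data.checkouts` is an illustrative left-side table):

```python
# sketch: a Join enriches left-side events with the declared features,
# point-in-time correct.  Chronon backfills this for training data and
# serves the same features online.  Names are illustrative.
from ai.chronon.api.ttypes import EventSource, Source
from ai.chronon.join import Join, JoinPart
from ai.chronon.query import Query, select

checkouts = Source(
    events=EventSource(
        table="data.checkouts",                  # rows we want features for
        query=Query(selects=select("user_id"), time_column="ts"),
    )
)

checkout_training_set = Join(
    left=checkouts,
    right_parts=[JoinPart(group_by=purchase_features)],
)
```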
December 7, 2024 at 10:10 PM
takeaways:
- frontend events are crucial for real-time activity needs; when combined with backend data, they enable complete, intelligent decision making
- in-memory processing minimizes cost and latency
November 23, 2024 at 5:22 AM
addressing challenges:
- use a streaming process with in-memory state ➡️ low latency and no storage costs
- manage Flink state in the RocksDB state backend
- use Flink high-availability mode with persisted checkpoints in S3 (rough config sketch below)
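a minimal sketch of that setup with PyFlink (assuming a recent Flink release; the exact setter names and bucket path are assumptions, and HA mode itself is cluster-level config rather than job code):

```python
# sketch: RocksDB state backend + periodic checkpoints to S3 so large
# per-user session state survives job failure.  Bucket path is illustrative.
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.state_backend import EmbeddedRocksDBStateBackend

env = StreamExecutionEnvironment.get_execution_environment()

# keep the large keyed state (an hour of activity per user) off the JVM heap
env.set_state_backend(EmbeddedRocksDBStateBackend())

# checkpoint every 60s; on failure the job restores from the last checkpoint
env.enable_checkpointing(60_000)
env.get_checkpoint_config().set_checkpoint_storage_dir("s3://my-bucket/flink/checkpoints")

# high availability (JobManager failover) is configured at the cluster level,
# e.g. high-availability + high-availability.storageDir pointing at S3
```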
November 23, 2024 at 5:15 AM
challenges faced:
- batch process for sessionization not feasible due to cost and latency
- Flink needs to keep an hour's worth of activity for each user as its state, which adds up to hundreds of GB
- handling Flink job failure and recovery
November 23, 2024 at 5:15 AM
Apache BookKeeper:
- ack quorum (AQ): the minimum number of bookies that must ack an entry before the write is considered durable
- write quorum (WQ): entries are written sequentially to a ledger ➡️ each entry is replicated to a write-quorum-sized subset of bookies
- guarantee: once an entry meets AQ ➡️ BookKeeper keeps replicating it until the full WQ holds it
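a toy model (not the real BookKeeper client API) to make the ensemble ≥ WQ ≥ AQ relationship concrete:

```python
import random

# toy illustration of BookKeeper-style quorums -- not the real client API.
# ensemble: bookies a ledger stripes entries across; write quorum (WQ): bookies
# each entry is written to; ack quorum (AQ): acks needed before the client
# treats the write as durable.  BookKeeper requires ensemble >= WQ >= AQ.
ENSEMBLE, WRITE_QUORUM, ACK_QUORUM = 5, 3, 2

def send(entry: bytes, bookie: str) -> bool:
    """Pretend to replicate an entry to one bookie; False models a slow/down bookie."""
    return random.random() > 0.1

def write_entry(entry: bytes, bookies: list[str]) -> bool:
    targets = random.sample(bookies, WRITE_QUORUM)   # each entry goes to WQ bookies
    acks = sum(send(entry, b) for b in targets)
    # acknowledged to the client once AQ bookies ack; BookKeeper keeps working
    # in the background until the full write quorum holds the entry
    return acks >= ACK_QUORUM

bookies = [f"bookie-{i}" for i in range(ENSEMBLE)]
print(write_entry(b"entry-0", bookies))
```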
November 21, 2024 at 4:40 AM
the main challenge with this solution is maintaining consistency across multiple nodes:
- latency delays
- concurrency issues
- synchronization challenges
the author points out that Apache BookKeeper is able to solve this, so i'll take a look at that
November 21, 2024 at 4:10 AM
the premise of atproto is very clear:
authentic decentralization of data for real-world applications that must scale
November 18, 2024 at 4:01 AM
takeaways from atproto specs:
- the unified data model and Personal Data Server (PDS) allow me to plug my own "trusted agent" into the Bluesky network (self-hosting)
- Lexicons are the building blocks of atproto, driving its APIs and schemas (minimal sketch below)
- cryptography is used to sign commits to data repos
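a minimal Lexicon document sketched as a Python dict (the NSID and fields are hypothetical; real Lexicons are JSON files like app.bsky.feed.post, and this only mirrors their general shape):

```python
# sketch of a record-type Lexicon: the schema an atproto PDS validates
# records against and that generated APIs are derived from.  Hypothetical NSID.
post_lexicon = {
    "lexicon": 1,
    "id": "com.example.blog.entry",          # NSID: reverse-DNS name for the schema
    "defs": {
        "main": {
            "type": "record",
            "key": "tid",                     # records keyed by timestamp identifiers
            "record": {
                "type": "object",
                "required": ["text", "createdAt"],
                "properties": {
                    "text": {"type": "string", "maxLength": 3000},
                    "createdAt": {"type": "string", "format": "datetime"},
                },
            },
        }
    },
}
```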
November 18, 2024 at 3:56 AM