jon esperanza (@jonathanesperanza.com)
data and ml @creditkarma
jonathanesperanza.com
a big idea i like here:

declarative infra to meet the needs of an individual use case, in this case a feature.
December 7, 2024 at 11:42 PM
Chronon supports the following feature computations:
- groupBy -- aggregation compute
- join -- join compute
- stagingQuery -- arbitrary compute expressed as Spark SQL

time-based aggregation and windowing are first-class concepts, along with SQL primitives.
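to make this concrete, a rough sketch of a windowed groupBy declaration in Chronon's Python API (shape follows the project's quickstart; the import paths, table names, and columns here are illustrative assumptions, not a verified example):

```python
# sketch: a 7-day windowed SUM of purchase_price per user, declared once.
# Chronon generates the batch/streaming pipelines and serving artifacts
# from this single definition -- names are illustrative.
from ai.chronon.api.ttypes import EventSource, Source
from ai.chronon.group_by import Aggregation, GroupBy, Operation, TimeUnit, Window
from ai.chronon.query import Query, select

purchases = Source(
    events=EventSource(
        table="data.purchases",                  # raw event table in the warehouse
        query=Query(
            selects=select("user_id", "purchase_price"),
            time_column="ts",                    # event time drives the windowing
        ),
    )
)

purchase_features = GroupBy(
    sources=[purchases],
    keys=["user_id"],                            # feature is keyed by user
    online=True,                                 # also serve this feature online
    aggregations=[
        Aggregation(
            input_column="purchase_price",
            operation=Operation.SUM,
            windows=[Window(length=7, timeUnit=TimeUnit.DAYS)],
        ),
    ],
)
```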
December 7, 2024 at 10:13 PM
> users declare their feature once and Chronon generates all infra needed to continuously turn raw data into features for both training and serving

seems Chronon orchestrates pipelines for declared features

infra: Kafka, Spark, Hive, Airflow, and a customizable key-value store
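a rough sketch of how a single definition fans out to training data (same caveats on exact API names; `purchase_features` is the groupBy sketched in the note above, and `data.checkouts` is an illustrative left-side table):

```python
# sketch: a Join enriches left-side events with the declared features,
# point-in-time correct.  Chronon backfills this for training data and
# serves the same features online.  Names are illustrative.
from ai.chronon.api.ttypes import EventSource, Source
from ai.chronon.join import Join, JoinPart
from ai.chronon.query import Query, select

checkouts = Source(
    events=EventSource(
        table="data.checkouts",                  # rows we want features for
        query=Query(selects=select("user_id"), time_column="ts"),
    )
)

checkout_training_set = Join(
    left=checkouts,
    right_parts=[JoinPart(group_by=purchase_features)],
)
```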
December 7, 2024 at 10:10 PM
takeaways:
- frontend events are crucial for real-time activity needs; when combined with backend data, they enable complete, intelligent decision making
- in-memory processing minimizes cost and latency
November 23, 2024 at 5:22 AM
addressing challenges:
- use a streaming process with in-memory state ➡️ low latency and no storage costs
- manage Flink state in the RocksDB state backend
- use Flink high-availability mode with persisted checkpoints in S3 (rough config sketch below)
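a minimal sketch of that setup with PyFlink (assuming a recent Flink release; the exact setter names and bucket path are assumptions, and HA mode itself is cluster-level config rather than job code):

```python
# sketch: RocksDB state backend + periodic checkpoints to S3 so large
# per-user session state survives job failure.  Bucket path is illustrative.
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.state_backend import EmbeddedRocksDBStateBackend

env = StreamExecutionEnvironment.get_execution_environment()

# keep the large keyed state (an hour of activity per user) off the JVM heap
env.set_state_backend(EmbeddedRocksDBStateBackend())

# checkpoint every 60s; on failure the job restores from the last checkpoint
env.enable_checkpointing(60_000)
env.get_checkpoint_config().set_checkpoint_storage_dir("s3://my-bucket/flink/checkpoints")

# high availability (JobManager failover) is configured at the cluster level,
# e.g. high-availability + high-availability.storageDir pointing at S3
```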
November 23, 2024 at 5:15 AM
challenges faced:
- batch process for sessionization not feasible due to cost and latency
- Flink needs to keep an hour's worth of activity for each user as its state, which adds up to hundreds of GB
- handling Flink job failure and recovery
November 23, 2024 at 5:15 AM
Apache BookKeeper:
- ack quorum (AQ): the minimum number of bookies that must ack an entry before the write is considered durable
- write quorum (WQ): entries are written sequentially to a ledger ➡️ each entry is replicated to a write-quorum-sized subset of bookies
- guarantee: once an entry meets AQ ➡️ BookKeeper keeps replicating it until the full WQ holds it
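a toy model (not the real BookKeeper client API) to make the ensemble ≥ WQ ≥ AQ relationship concrete:

```python
import random

# toy illustration of BookKeeper-style quorums -- not the real client API.
# ensemble: bookies a ledger stripes entries across; write quorum (WQ): bookies
# each entry is written to; ack quorum (AQ): acks needed before the client
# treats the write as durable.  BookKeeper requires ensemble >= WQ >= AQ.
ENSEMBLE, WRITE_QUORUM, ACK_QUORUM = 5, 3, 2

def send(entry: bytes, bookie: str) -> bool:
    """Pretend to replicate an entry to one bookie; False models a slow/down bookie."""
    return random.random() > 0.1

def write_entry(entry: bytes, bookies: list[str]) -> bool:
    targets = random.sample(bookies, WRITE_QUORUM)   # each entry goes to WQ bookies
    acks = sum(send(entry, b) for b in targets)
    # acknowledged to the client once AQ bookies ack; BookKeeper keeps working
    # in the background until the full write quorum holds the entry
    return acks >= ACK_QUORUM

bookies = [f"bookie-{i}" for i in range(ENSEMBLE)]
print(write_entry(b"entry-0", bookies))
```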
November 21, 2024 at 4:40 AM
the main challenge with this solution is maintaining consistency across multiple nodes:
- latency delays
- concurrency issues
- synchronization challenges
the author points out that Apache BookKeeper is able to solve this, so i'll take a look at that
November 21, 2024 at 4:10 AM
the premise of atproto is very clear:
authentic decentralization of data for real-world applications that must scale
November 18, 2024 at 4:01 AM
takeaways from atproto specs:
- the unified data model and Personal Data Server (PDS) allow me to plug my own "trusted agent" into the Bluesky network (self-hosting)
- Lexicons are the building blocks of atproto, driving its APIs and schemas (minimal sketch below)
- cryptography is used to sign commits to data repos
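a minimal Lexicon document sketched as a Python dict (the NSID and fields are hypothetical; real Lexicons are JSON files like app.bsky.feed.post, and this only mirrors their general shape):

```python
# sketch of a record-type Lexicon: the schema an atproto PDS validates
# records against and that generated APIs are derived from.  Hypothetical NSID.
post_lexicon = {
    "lexicon": 1,
    "id": "com.example.blog.entry",          # NSID: reverse-DNS name for the schema
    "defs": {
        "main": {
            "type": "record",
            "key": "tid",                     # records keyed by timestamp identifiers
            "record": {
                "type": "object",
                "required": ["text", "createdAt"],
                "properties": {
                    "text": {"type": "string", "maxLength": 3000},
                    "createdAt": {"type": "string", "format": "datetime"},
                },
            },
        }
    },
}
```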
November 18, 2024 at 3:56 AM