I personally wouldn't use R for this. I would use Python to filter down to what you need and then import into R. For truly enormous CSVs you can use pyspark, or you can read them in line-by-line and only output rows to a new CSV if they pass the filters.
I would personally go with reading the file in line by line, since PySpark can be finicky. It's probably slower, but you only need to do it once, right?
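The line-by-line approach above can be sketched with the standard-library `csv` module: stream the huge file and write only rows that pass a filter, so memory use stays flat. The function name, file names, and the "amount" column in the usage example are all illustrative, not from the original posts.

```python
import csv

def filter_csv(src, dst, keep):
    """Stream src row by row and write only rows where keep(row) is True."""
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        reader = csv.DictReader(fin)
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if keep(row):  # only rows passing the filter reach the new CSV
                writer.writerow(row)
```

Usage might look like `filter_csv("huge.csv", "subset.csv", lambda r: float(r["amount"]) > 100)`; the output file can then be loaded into R without touching the rest of the data.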
www.youtube.com/watch?v=2p2S...
💰 £500.00 - £550.00
GB
⏰ CONTRACTOR
⏰ Contractor
🔗 http://jbs.ink/nMc2lVyj4p6L
#jobalert #jobsearch #python #spark #devops #sql #kafka #design
"Every bank uses Customer 360 to keep customer records in a unified form, and it can also be used for fraud detection.
What is Customer 360?
Customer 360 is a…
#ai #ml #news
aws.amazon.com/about-aws/wh...
Now the damn JVM running behind pyspark is starting to act up on me, and I have no fucking idea how to configure this crap.
- Have you used Spark? What is an RDD? When would you use PySpark, and when Pandas?
- There are rendering differences between Python and PySpark notebooks.
- You can use "temp" storage allocated to the notebook for many crazy ideas, even for rendering videos 😂 (if you then store it into Lakehouse)
Learn how to fix data skew in Apache Spark using the salting technique for improved performance and balanced partitions in Scala and PySpark.
#hackernews #news
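The salting idea behind the linked article can be sketched without a cluster: append a random suffix to a skewed key so one hot key becomes several distinct keys that hash to different partitions, then aggregate in two stages (per salted key, then again after stripping the salt). This plain-Python sketch only illustrates the key transformation; the function names and `N_SALTS` value are made up for the example.

```python
import random
from collections import Counter

N_SALTS = 4  # number of sub-keys each key is split into (illustrative)

def salt(key):
    # A hot key like "GB" becomes "GB#0".."GB#3", so its rows spread
    # across several partitions instead of piling up on one.
    return f"{key}#{random.randrange(N_SALTS)}"

def salted_count(keys):
    # Stage 1: partial counts per salted key (the parallel step in Spark).
    partial = Counter(salt(k) for k in keys)
    # Stage 2: strip the salt and merge partials into the true counts.
    final = Counter()
    for salted_key, n in partial.items():
        final[salted_key.rsplit("#", 1)[0]] += n
    return final
```

The same idea applies to skewed joins in Spark, with the extra step of replicating the small side once per salt value so every salted key still finds its match.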
Job title: Databricks Architect - L1 Company: Wipro Job description: information, visit us at www.wipro.com. Databricks Architect · Should have a minimum of 10+ years of experience... · Must have skills - Databricks, Delta Lake, PySpark or Scala Spark, Unity Catalog · Good…
#jobalert #jobsearch
🔗 http://jbs.ink/BeKCntGU06qc
#dataengineer #spark #kafka #design