Lightnews — Scholar-powered news

Aécio Santos

@aeciosan.bsky.social

Research Engineer at New York University. Interested in dataset search & discovery, sketching, data management, NLP, and information retrieval.

Aécio Santos

@aeciosan.bsky.social

We also introduce GDC-SM, a new benchmark for evaluating schema matching algorithms on real-world biomedical data. Labeling this dataset was a huge collaborative effort between our group and colleagues at the NYU School of Medicine. The benchmark is available on Zenodo: zenodo.org/records/1496...

GDC-SM: The GDC Schema Matching Benchmark

GDC-SM is a schema matching evaluation benchmark based on a real data harmonization scenario that is common in biomedical research: pooling datasets from multiple studies to increase the number of pat...

zenodo.org

August 7, 2025 at 6:03 AM

Aécio Santos

@aeciosan.bsky.social

@sigmod2025.bsky.social #SIGMOD #SIGMOD2025

June 21, 2025 at 5:00 PM

Aécio Santos

@aeciosan.bsky.social

We also introduce Harmonia, our proof-of-concept prototype that implements this vision. It orchestrates specialized data integration algorithms and works with the user to create reproducible pipelines, boosting schema matching F1-score from 0.78 to 1.00 in our preliminary evaluation! #AI #LLMAgents

June 21, 2025 at 4:34 PM

Aécio Santos

@aeciosan.bsky.social

Data harmonization is a major bottleneck in many scientific fields. In our new paper, we present a vision for using LLM-based agents to streamline this slow, manual process of reconciling mismatched schemas and terms.

June 21, 2025 at 4:34 PM

Aécio Santos

@aeciosan.bsky.social

Maybe you will find this one-page proof for priority sampling, with applications to distinct elements and inner product sketches, convenient to cover in class: epubs.siam.org/doi/abs/10.1... (full disclosure: I'm a co-author, and the proofs are mainly due to Daliri and Musco)

2024 Symposium on Simplicity in Algorithms (SOSA) | Simple Analysis of Priority Sampling

Abstract We prove a tight upper bound on the variance of the priority sampling method (aka sequential Poisson sampling). Our proof is significantly shorter and simpler than the original proof given by...

epubs.siam.org

March 13, 2025 at 3:14 PM
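
For readers unfamiliar with the method mentioned in the post above, here is a minimal sketch of priority sampling in its standard textbook form. It is not taken from the linked paper; the function name `priority_sample` and its parameters are illustrative assumptions. Each item with positive weight w gets a priority w / u for an independent uniform u in (0, 1], the k highest-priority items are kept, and each kept item is assigned the estimate max(w, tau), where tau is the (k+1)-th highest priority. The sum of the estimates is an unbiased estimate of the total weight.

```python
import random


def priority_sample(weights, k, rng=None):
    """Return {index: estimated weight} for a priority sample of size k.

    Hypothetical helper for illustration; assumes all weights are positive.
    """
    rng = rng or random.Random()
    priorities = []
    for i, w in enumerate(weights):
        u = 1.0 - rng.random()  # uniform in (0, 1], avoids division by zero
        priorities.append((w / u, i))
    # Highest priority first.
    priorities.sort(reverse=True)
    # Threshold: the (k+1)-th highest priority, or 0 if everything fits.
    tau = priorities[k][0] if len(priorities) > k else 0.0
    # Each sampled item is estimated as max(weight, threshold).
    return {i: max(weights[i], tau) for _, i in priorities[:k]}


if __name__ == "__main__":
    weights = [1, 2, 3, 4, 5, 10, 20, 50]
    sample = priority_sample(weights, k=3, rng=random.Random(0))
    print("sampled items:", sample)
    print("estimated total:", sum(sample.values()), "true total:", sum(weights))
```

A single estimate can be far from the true total; the unbiasedness only shows up when averaging over many independent samples.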

Aécio Santos

@aeciosan.bsky.social

Models are sensitive to minor changes in format, e.g., simply repeating the column name multiple times led to improvements in zero-shot settings. However, fine-tuning seems to decrease the performance differences.

March 6, 2025 at 1:24 PM
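
To make the idea of "repeating the column name" concrete, here is a small sketch of a column serializer. The text format, the function name `serialize_column`, and the parameters `name_repeats` and `max_values` are all hypothetical and chosen for illustration; this is not the serialization used in the paper linked in this thread.

```python
def serialize_column(name, values, name_repeats=1, max_values=5):
    """Turn a table column into a single text string for a language model.

    Hypothetical format for illustration only: the column name is repeated
    `name_repeats` times, followed by up to `max_values` sample values.
    """
    header = " ".join([name] * name_repeats)
    sample = ", ".join(str(v) for v in values[:max_values])
    return f"column: {header} | values: {sample}"


if __name__ == "__main__":
    # Same column, two serialization variants that could be compared.
    print(serialize_column("age_at_diagnosis", [31, 45, 62]))
    print(serialize_column("age_at_diagnosis", [31, 45, 62], name_repeats=3))
```

Comparing model outputs across such variants is one way to measure the format sensitivity the post describes.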

Aécio Santos

@aeciosan.bsky.social

Very interesting! We also experimented with different column serialization formats and found somewhat similar results in a schema matching task (arxiv.org/abs/2412.08194).

Magneto: Combining Small and Large Language Models for Schema Matching

Recent advances in language models opened new opportunities to address complex schema matching tasks. Schema matching approaches have been proposed that demonstrate the usefulness of language models, ...

arxiv.org

March 6, 2025 at 1:23 PM

Aécio Santos

@aeciosan.bsky.social

A recent paper discusses this: db.cs.cmu.edu/papers/2024/... The main reason for the success of graph DBs is the limitation of SQL for querying graphs, although relational DBs seem to be catching up since the addition of property graph queries in the latest SQL:2023 standard.

db.cs.cmu.edu

December 17, 2024 at 1:51 AM
