Aécio Santos
aeciosan.bsky.social
Research Engineer at New York University. Interested in dataset search & discovery, sketching, data management, NLP, and information retrieval.
We also introduce GDC-SM, a new benchmark for evaluating schema matching algorithms on real-world biomedical data. Labeling this dataset was a huge collaborative effort between our group and colleagues at the NYU School of Medicine. The benchmark is available on Zenodo: zenodo.org/records/1496...
GDC-SM: The GDC Schema Matching Benchmark
GDC-SM is a schema matching evaluation benchmark based on a real data harmonization scenario that is common in biomedical research: pooling datasets from multiple studies to increase the number of pat...
zenodo.org
August 7, 2025 at 6:03 AM
June 21, 2025 at 5:00 PM
We also introduce Harmonia, our proof-of-concept prototype that implements this vision. It orchestrates specialized data integration algorithms and works with the user to create reproducible pipelines, boosting schema matching F1-score from 0.78 to 1.00 in our preliminary evaluation! #AI #LLMAgents
June 21, 2025 at 4:34 PM
Data harmonization is a major bottleneck in many scientific fields. In our new paper, we present a vision for using LLM-based agents to streamline this slow, manual process of reconciling mismatched schemas and terms.
June 21, 2025 at 4:34 PM
Maybe you will find this one-page proof for priority sampling, with applications to distinct elements and inner product sketches, convenient to cover in class: epubs.siam.org/doi/abs/10.1... (full disclosure: I'm a co-author, and the proofs are mainly due to Daliri and Musco)
2024 Symposium on Simplicity in Algorithms (SOSA) | Simple Analysis of Priority Sampling
Abstract We prove a tight upper bound on the variance of the priority sampling method (aka sequential Poisson sampling). Our proof is significantly shorter and simpler than the original proof given by...
epubs.siam.org
March 13, 2025 at 3:14 PM
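For readers who have not seen priority sampling, here is a minimal Python sketch of the textbook scheme, not code from the linked paper. It assumes strictly positive weights; the function names `priority_sample` and `estimate_sum` are made up for illustration. Each item gets priority w/u with u uniform in (0, 1], the k highest-priority items are kept, and each kept item is assigned the estimate max(w, tau), where tau is the (k+1)-th highest priority.

```python
import random


def priority_sample(weights, k, rng=None):
    """Return {index: estimated_weight} for a priority sample of at most k items.

    Assumes every weight is strictly positive (illustrative sketch only).
    """
    rng = rng or random.Random()
    priorities = []
    for i, w in enumerate(weights):
        u = 1.0 - rng.random()  # random() is in [0, 1), so u is in (0, 1]
        priorities.append((w / u, i))
    # Highest priorities first.
    priorities.sort(reverse=True)
    # tau is the (k+1)-th largest priority, or 0 when there are at most k items.
    tau = priorities[k][0] if len(priorities) > k else 0.0
    # Each sampled item is reported with the estimate max(w_i, tau).
    return {i: max(weights[i], tau) for _, i in priorities[:k]}


def estimate_sum(sample):
    """Estimate the total weight of all items from the sample alone."""
    return sum(sample.values())
```

When k is at least the number of items, tau is 0 and the estimates equal the true weights. For smaller k, the sum estimate is unbiased over the random draws, which is the quantity whose variance the paper bounds.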
Models are sensitive to minor changes in format, e.g., simply repeating the column name multiple times led to improvements in zero-shot settings. However, fine-tuning seems to reduce these performance differences.
March 6, 2025 at 1:24 PM
Very interesting! We also experimented with different column serialization formats and found somewhat similar results in a schema matching task (arxiv.org/abs/2412.08194).
Magneto: Combining Small and Large Language Models for Schema Matching
Recent advances in language models opened new opportunities to address complex schema matching tasks. Schema matching approaches have been proposed that demonstrate the usefulness of language models, ...
arxiv.org
March 6, 2025 at 1:23 PM
A recent paper discusses this: db.cs.cmu.edu/papers/2024/... The main reason for graph DBs' success is the limitation of SQL for querying graphs, although relational DBs seem to be catching up since the addition of property graphs in the latest SQL:2023 standard.
db.cs.cmu.edu
December 17, 2024 at 1:51 AM