Julien Guinot
@juj-guinot.bsky.social
PhD student @Queen Mary / UMG, Music and AI

Human-centric representation learning for cool music AI applications :^)
11/11: Future work includes scaling to more variant concepts, and moving from obvious concepts like key and tempo to higher-level notions that could be interesting for retrieval. More to come!
December 30, 2024 at 5:30 PM
10/11: Not only that, LOEV++ lets users control which attribute matters most for retrieval! Searching in the time-variant space yields better results for tempo-based retrieval. Users can search for similar songs by specifying what kind of similarity they want.
December 30, 2024 at 5:29 PM
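As a rough sketch of what this attribute-controlled retrieval can look like (the function, the slice layout, and the dimensions are illustrative assumptions, not the paper's code):

```python
import torch
import torch.nn.functional as F

def retrieve(query, catalogue, subspace, k=10):
    """Rank catalogue items by cosine similarity within one subspace only,
    e.g. the time-variant slice when the user asks for tempo similarity."""
    q = F.normalize(query[subspace], dim=0)        # query slice, unit norm
    c = F.normalize(catalogue[:, subspace], dim=1) # catalogue slices, unit norm
    return (c @ q).topk(k).indices                 # k most similar songs

# With a 384-d embedding split into three 128-d slices, tempo-aware
# neighbours come from the time-variant slice:
# retrieve(query_emb, catalogue_embs, slice(256, 384))
```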
9/11: To take it one step further, we propose LOEV++. Instead of splitting the network at the projection heads, we split it earlier, creating individual latent spaces that hold the augmentation information in a disentangled way. We show that this further improves performance.
December 30, 2024 at 5:28 PM
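One plausible reading of that earlier split, sketched in PyTorch; the equal three-way partition, the dimensions, and the single-layer heads are assumptions for illustration:

```python
import torch.nn as nn

class LOEVPlusPlus(nn.Module):
    """The encoder output is partitioned into three disentangled subspaces
    (invariant / pitch-variant / time-variant), each with its own head."""
    def __init__(self, encoder, enc_dim=384, proj_dim=128):
        super().__init__()
        self.encoder = encoder
        self.heads = nn.ModuleList(
            [nn.Linear(enc_dim // 3, proj_dim) for _ in range(3)]
        )

    def forward(self, x):
        h = self.encoder(x)            # (batch, enc_dim)
        subspaces = h.chunk(3, dim=1)  # invariant, pitch, time slices
        return [head(z) for head, z in zip(self.heads, subspaces)]
```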
8/11: We do this by simply tracking which augmentations are applied and modifying the contrastive targets accordingly. We show through downstream probing that this forces the encoder *not* to discard key and tempo information, while keeping potent representations for general tasks.
December 30, 2024 at 5:28 PM
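A hedged sketch of that target modification, in the NT-Xent style shown after post 2/11 below; the mask-based formulation is one reading of "modifying the targets", not necessarily the paper's exact construction:

```python
import torch
import torch.nn.functional as F

def masked_nt_xent(z1, z2, pair_is_positive, temperature=0.1):
    """NT-Xent where each pair's label depends on the augmentations applied.

    pair_is_positive: (B,) bool, False when the tracked augmentation (e.g.
    pitch shifting, for the pitch-variant head) demotes the two views of a
    sample from positives to negatives.
    """
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)
    sim = z @ z.t() / temperature
    sim.fill_diagonal_(float("-inf"))  # a view never matches itself
    n = z1.size(0)
    counterpart = torch.cat([torch.arange(n, 2 * n), torch.arange(n)])
    keep = pair_is_positive.repeat(2)  # same flag for both views of a sample
    # Demoted pairs get no positive term; they remain in every other row's
    # denominator, i.e. they now act purely as negatives.
    return F.cross_entropy(sim[keep], counterpart[keep])
```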
7/11: Our approach is simple: we keep an all-invariant projection head, but build two more projection heads, pitch-variant and time-variant. Each head has its own contrastive objective: in the pitch-variant head, views that have been augmented with pitch shifting are treated as *negatives*.
December 30, 2024 at 5:27 PM
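A minimal PyTorch sketch of the three-head setup; the layer sizes and two-layer MLP heads are assumptions, only the head structure follows the post:

```python
import torch.nn as nn

class LOEVHeads(nn.Module):
    """Shared encoder feeding three projection heads, one per objective."""
    def __init__(self, encoder, enc_dim=512, proj_dim=128):
        super().__init__()
        self.encoder = encoder

        def make_head():
            return nn.Sequential(
                nn.Linear(enc_dim, enc_dim), nn.ReLU(), nn.Linear(enc_dim, proj_dim)
            )

        self.invariant = make_head()      # all augmentations -> positives
        self.pitch_variant = make_head()  # pitch-shifted views -> negatives
        self.time_variant = make_head()   # time-stretched views -> negatives

    def forward(self, x):
        h = self.encoder(x)
        return self.invariant(h), self.pitch_variant(h), self.time_variant(h)
```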
6/11: So there is a tradeoff, coming from the applied augmentations, between general and task-specific performance. LOEV aims to fix this. We focus on two augmentations, time stretching (TS) and pitch shifting (PS), which are explicitly tied to the musical notions of tempo and key.
December 30, 2024 at 5:26 PM
5/11: In music, this can be catastrophic. Take a song in the key of A major: apply a pitch-shifting augmentation to one view, and the two views end up in different keys! A contrastive model will still map them to the same spot in the latent space. This can cause the key space to collapse completely.
December 30, 2024 at 5:25 PM
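To make the failure mode concrete, here is a small illustration with torchaudio's PitchShift (the random waveform is just a stand-in for a real A-major excerpt):

```python
import torch
import torchaudio.transforms as T

sample_rate = 22050
waveform = torch.randn(1, sample_rate * 4)  # stand-in for a 4 s excerpt in A major

# Shift up 3 semitones: A major becomes C major.
pitch_shift = T.PitchShift(sample_rate, n_steps=3)
shifted = pitch_shift(waveform)

# A plain contrastive objective treats (waveform, shifted) as a positive pair,
# so the encoder is pushed to embed both identically, i.e. to throw away
# exactly the key information that distinguishes them.
```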
4/11: It has been shown that stronger augmentations generally lead to better performance on downstream tasks. But what happens when a downstream task needs representations to be *variant* to a certain transformation?
December 30, 2024 at 5:23 PM
3/11: In doing so, contrastive models effectively learn invariances: by learning to map augmented data points to the same spot in the latent space, they learn to be *invariant* to the augmentations.
December 30, 2024 at 5:23 PM
2/11: Unimodal contrastive learning uses augmentations to produce different views of samples. The model then learns to pull views of the same sample together in the latent space and to repel views from different samples. This lets models internalize semantic information without supervision.
December 30, 2024 at 5:22 PM
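For the mechanics, here is a minimal sketch of this kind of objective in the SimCLR/NT-Xent style; the temperature and names are illustrative, not from the paper:

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.1):
    """SimCLR-style loss: each view's positive is its counterpart from the
    other augmentation; every other view in the batch is a negative."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)  # (2B, d), unit norm
    sim = z @ z.t() / temperature                       # pairwise similarities
    sim.fill_diagonal_(float("-inf"))                   # a view never matches itself
    n = z1.size(0)
    # Positive of view i (first half) is view i+n (second half), and vice versa.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)])
    return F.cross_entropy(sim, targets)
```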
1/11: In this work, we propose a simple way to mitigate the loss of information caused by learned invariances in contrastive learning for music. This information loss can be catastrophic for downstream tasks, and LOEV is a very cheap lunch that fixes it!
December 30, 2024 at 5:22 PM