Introduction
Welcome to the demo website of the paper “Byte Pair Encoding for Symbolic Music” (EMNLP 2023).
Byte Pair Encoding (BPE) is a data compression technique. It iteratively replaces the most recurrent pair of successive bytes\footnote{For symbolic music in our case, the basic “tokens” (e.g. describing note and time attributes) can be seen as the base characters (bytes) that will be merged.} in a corpus with a newly created one. For instance, in the character sequence aabaabaacaa, the sub-sequence aa occurs four times and is the most recurrent one. Learning and applying BPE on this sequence would replace aa with a new symbol, e.g. d, resulting in the compressed sequence dbdbdcd. The latter can be reduced again by replacing the db subsequence, giving eedcd. In practice, BPE is learned on a corpus until the vocabulary reaches a target size.
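The character example above can be sketched in a few lines of plain Python. This is a toy illustration, not the paper's implementation; here the target vocabulary size counts base symbols plus merge symbols:

```python
from collections import Counter

def learn_bpe(text, target_vocab_size):
    """Toy BPE: merge the most recurrent pair of symbols until the
    vocabulary (base symbols + merge symbols) reaches the target size."""
    seq = list(text)
    base_vocab = set(seq)
    merges = []
    fresh = iter("defghijklmnopqrstuvwxyz")  # fresh symbols for merged pairs
    while len(base_vocab) + len(merges) < target_vocab_size:
        pairs = Counter(zip(seq, seq[1:]))  # count successive pairs
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:  # nothing recurrent left to merge
            break
        sym = next(fresh)
        merges.append(((a, b), sym))
        out, i = [], 0
        while i < len(seq):  # replace non-overlapping occurrences of (a, b)
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
                out.append(sym)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return merges, "".join(seq)

merges, compressed = learn_bpe("aabaabaacaa", target_vocab_size=5)
# merges == [(("a", "a"), "d"), (("d", "b"), "e")], compressed == "eedcd"
```

With a target vocabulary of 5 (three base characters plus two merges), this reproduces exactly the aa → d and db → e merges of the example.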
BPE is nowadays widely used in NLP, as it makes it possible to encode rare words and to segment unknown or compound words as sequences of sub-word units. Other token aggregation, or vocabulary building, techniques exist; the two other most common ones, Unigram and WordPiece, rely on operations similar to BPE's. For natural language, the bytes are the distinct characters composing the text. The application of BPE to symbolic music, however, had not yet been studied.
In this paper, the “bytes” of symbolic music can be seen as its distinct note and time attributes. In this context, BPE can represent a note, or even a succession of notes, that recurs frequently in the dataset as a single token. For instance, a note that would be tokenized as the succession of tokens Pitch_D3, Velocity_60, Duration_2.0 could be replaced by a single new one. Rare notes (and attributes) can still be tokenized as non-BPE tokens. The same logic applies to time tokens, which can also be aggregated with note tokens.
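The same merge operation can be sketched on music tokens. This is a toy illustration with a hypothetical token sequence: the TimeShift tokens and the "+" naming of merged tokens are illustrative choices, not the paper's implementation:

```python
from collections import Counter

# Hypothetical token sequence for two identical notes, using the attribute
# vocabulary described above (the TimeShift tokens are illustrative).
tokens = ["TimeShift_1.0", "Pitch_D3", "Velocity_60", "Duration_2.0",
          "TimeShift_1.0", "Pitch_D3", "Velocity_60", "Duration_2.0"]

def merge_most_frequent(seq):
    """One BPE step: fuse the most recurrent pair of successive tokens."""
    (a, b), _ = Counter(zip(seq, seq[1:])).most_common(1)[0]
    merged, out, i = a + "+" + b, [], 0
    while i < len(seq):  # replace non-overlapping occurrences of (a, b)
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
            out.append(merged)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

# Three merge steps fold each whole note, with its time token, into one token.
seq = tokens
for _ in range(3):
    seq = merge_most_frequent(seq)
# seq == ["TimeShift_1.0+Pitch_D3+Velocity_60+Duration_2.0"] * 2
```

After three merges, the eight-token sequence is reduced to two tokens, each representing a complete note together with its time shift.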
In this paper, we show that BPE can address two main concerns about how symbolic music was previously tokenized:
Applied to symbolic music, BPE drastically reduces the sequence length while creating new tokens that can represent whole notes, or even sequences of notes. Each token thus carries more information, which greatly improves the model's efficiency: the quality of generation increases, and inference is up to three times faster.
BPE is fully implemented within MidiTok, allowing you to easily benefit from this method on top of most existing tokenizations.
We recap the main results from the paper.
Model weights are not shared during the reviewing process, as we cannot release them without breaking anonymity. We will make them publicly available upon acceptance, along with the training files. We also plan to publish the weights of the best-performing models on the Hugging Face Hub.
BPE improves the quality of generation while increasing inference speed. Generated examples are shown in the last sections.
One of the main limitations of previous work on symbolic music modeling is the suboptimal usage of the embedding space of LMs. Most of it relies on models whose embeddings have 512 to 1024 dimensions, for vocabularies of fewer than 500 tokens. Using that many dimensions to represent so few points is clearly suboptimal: the same models, when applied to natural language, commonly learn up to 50k embeddings in the same 512 to 1024 dimensions. BPE makes better use of the embedding space: it occupies more of it, with better-distributed embeddings (higher isotropy).
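Some quick arithmetic makes the mismatch concrete; the 512 dimensions and the 500 / 50k vocabulary sizes are the bounds cited above:

```python
dim = 512            # embedding dimension (lower bound cited above)
music_vocab = 500    # typical symbolic music vocabulary size
nlp_vocab = 50_000   # typical NLP vocabulary size

music_params = music_vocab * dim  # 256,000 embedding parameters
nlp_params = nlp_vocab * dim      # 25,600,000 embedding parameters

# The music model scatters 100x fewer points in the same 512-dim space.
ratio = nlp_vocab / music_vocab
```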
We display below the 2D and 3D UMAP projections of the embedding matrices of all the models. For each window, the seven representations correspond to: no BPE, BPE 1k, BPE 5k, BPE 10k, BPE 20k, PVm and PVDm.
The pages correspond to Pretrained + TSD, Pretrained + REMI, Generator + TSD and Generator + REMI.
Below are listenable generated samples, cherry-picked so that all tracks show enough diversity. You can download all the samples used for the human evaluations as MIDI files. In these files, the tracks are not shuffled and correspond, in order, to: no BPE, BPE 1k, BPE 5k, BPE 10k, BPE 20k, PVm, PVDm, CP Word and Octuple (the two latter are only present for REMI files). The same goes for the audio samples below.
Recall: the tracks correspond, in order, to: no BPE, BPE 1k, BPE 5k, BPE 10k, BPE 20k, PVm, PVDm, CP Word and Octuple.
(ACL URL/DOI/pages will be added once the proceedings are published)
@inproceedings{bpe-symbolic-music,
title = "Byte Pair Encoding for Symbolic Music",
author = "Fradet, Nathan and
Gutowski, Nicolas and
Chhel, Fabien and
Briot, Jean-Pierre",
booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
month = dec,
year = "2023",
address = "Singapore",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/2301.11975",
}