Byte Pair Encoding for Symbolic Music

Contents

Introduction

Welcome to the demo website of the paper “Byte Pair Encoding for Symbolic Music” (EMNLP 2023).

[Paper] [Github]

BPE

Byte Pair Encoding (BPE) is a data compression technique. It iteratively replaces the most recurrent pair of successive bytes in a corpus with a newly created one (for symbolic music in our case, the basic “tokens”, e.g. describing note and time attributes, can be seen as the base characters, or bytes, that will be merged). For instance, in the character sequence aabaabaacaa, the sub-sequence aa occurs four times and is the most recurrent one. Learning and applying BPE on this sequence would replace aa with a new symbol, e.g. d, resulting in the compressed sequence dbdbdcd. The latter can be reduced again by replacing the db subsequence, giving eedcd. In practice, BPE is learned on a corpus until the vocabulary reaches a target size.
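The toy example above can be reproduced in a few lines of Python. This is an illustrative sketch of the idea only (the `learn_bpe` helper and its fresh-symbol alphabet are our own, not the paper's implementation):

```python
from collections import Counter

def learn_bpe(seq: str, num_merges: int, new_symbols: str = "defgh") -> str:
    """Toy BPE: repeatedly fuse the most recurrent adjacent pair of
    symbols into a fresh single symbol."""
    for new_sym in new_symbols[:num_merges]:
        pairs = Counter(zip(seq, seq[1:]))  # count adjacent pairs
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]  # ties: first appearance wins
        if count < 2:
            break  # nothing recurrent left to merge
        # str.replace scans left to right, matching BPE's greedy merging
        seq = seq.replace(a + b, new_sym)
    return seq

print(learn_bpe("aabaabaacaa", 1))  # -> dbdbdcd
print(learn_bpe("aabaabaacaa", 2))  # -> eedcd
```

One merge replaces each aa with d; a second merge then fuses the recurrent db pair, reproducing the eedcd sequence from the example.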

BPE is nowadays widely used in NLP, as it allows rare words to be encoded and unknown or compound words to be segmented into sequences of sub-word units. Other token aggregation, or vocabulary building, techniques exist; the two most commonly used alternatives are Unigram and WordPiece, whose operations share similarities with BPE.

For natural language, the bytes are the distinct characters composing the text. The application of BPE to symbolic music had however not yet been studied. For symbolic music in this paper, the “bytes” can be seen as the distinct note and time attributes. In this context, BPE can represent a note, or even a succession of notes, that is very recurrent in the dataset as a single token. For instance, a note that would be tokenized as the succession of tokens Pitch_D3, Velocity_60, Duration_2.0 could be replaced by a single new one. Rare notes (and attributes) can still be tokenized as non-BPE tokens. The same logic applies to time tokens, which can also be merged with note tokens.
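The same merging can be sketched on sequences of music tokens. The token names follow the example above; the `|`-joined merged-token notation and the helper function are our own illustrative assumptions, not MidiTok's actual implementation:

```python
from collections import Counter

# Hypothetical base token sequence for two identical notes separated in time
# (token names follow the paper's examples).
tokens = [
    "Pitch_D3", "Velocity_60", "Duration_2.0",
    "TimeShift_1.0",
    "Pitch_D3", "Velocity_60", "Duration_2.0",
]

def merge_most_frequent_pair(seq):
    """One BPE step on a token sequence: fuse the most recurrent
    adjacent pair into a single new token."""
    pair = Counter(zip(seq, seq[1:])).most_common(1)[0][0]
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + "|" + seq[i + 1])  # the new fused token
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged

seq = tokens
for _ in range(2):  # two merges fuse Pitch+Velocity+Duration into one token
    seq = merge_most_frequent_pair(seq)
print(seq)  # 7 base tokens reduced to 3
```

After two merges, each note is represented by a single token covering its pitch, velocity and duration, shrinking the sequence from 7 tokens to 3.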

In this paper, we show that BPE can address two main concerns about how symbolic music was previously tokenized:

  1. The fairly long sequence length that results from using one token per note attribute (e.g. pitch, duration) and time event. Long sequences are problematic, as the time and space complexity of Transformer models grows quadratically with the input sequence length.
  2. The poor usage of the model’s embedding space. Language models first project tokens into a learned embedding space, in which the embeddings (continuous representations of the tokens) are learned to represent their semantic information. This is an essential feature of such models, as it allows them to capture the meaning of the tokens and the data. In symbolic music, the tokens usually only represent note attribute values or time values, which carry little information beyond their absolute value. Moreover, vocabularies often range between 200 and 500 tokens, which are then represented in 512 to 1024 dimensions. In such conditions, the embedding space is underused and the potential of the model is poorly exploited.

When applied to symbolic music, BPE drastically reduces the sequence length while creating new tokens that can represent whole notes, or even successions of notes. The model’s efficiency is thus greatly improved, as each token carries more information. BPE greatly improves the quality of generation, while speeding up inference by up to three times.

BPE is fully implemented within MidiTok, allowing you to easily benefit from this method on top of most existing tokenizations.

Main results

We recap the main results from the paper.

Model weights

Model weights are not currently shared, as we cannot release them during the reviewing process without breaking anonymity. We will make them publicly available upon acceptance, along with the training files. We also plan to publish the weights of the best-performing models on the Hugging Face hub.

Generation

BPE improves the quality of the generation while increasing the inference speed. Generated examples are shown in the last sections.

Generation quality
Metrics of generated results. TSE results are all scaled by 1e-3 for better readability. Hum. stands for human; “-” means not concerned (i.e. 0).

Inference speed
Inference speeds on a V100 GPU, and proportion of the vocabulary sampled during generation. For tok/sec, the results account for the basic tokens of note attributes and time. Tok/sec for Octuple is not shown, as the equivalent number of base tokens cannot be clearly computed.

Classification

Classification accuracy
Average accuracy of classification models.

Embedding space

One of the main limitations of previous work on symbolic music modeling is the suboptimal usage of the embedding space of LMs. Most of them use models with embeddings of 512 to 1024 dimensions for vocabularies of fewer than 500 tokens. One can easily see the suboptimality of using so many dimensions to represent so few points. The same models, when used with natural language, typically learn up to 50k embeddings in 512 to 1024 dimensions.

BPE makes better use of the embedding space, occupying more of it with better-distributed embeddings (isotropy).
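As a rough illustration of what “better distributed” means, the sketch below computes a simple partition-function isotropy proxy: the ratio min/max of Z(c) = Σ_i exp(w_i · c) over random unit directions c. This is a simplified stand-in for the eigenvector-based and IsoScore measures reported in the paper (the function and the synthetic data are our own assumptions), but it shows the same trend: well-spread embeddings score closer to 1, while embeddings sharing a strong common direction score lower.

```python
import math
import random

def partition_isotropy(embeddings, n_dirs=200, seed=0):
    """Isotropy proxy: min/max ratio of the partition function
    Z(c) = sum_i exp(w_i . c) over random unit directions c.
    Closer to 1 means more isotropic (better spread) embeddings."""
    rng = random.Random(seed)
    dim = len(embeddings[0])
    zs = []
    for _ in range(n_dirs):
        # draw a random unit direction c
        c = [rng.gauss(0, 1) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in c))
        c = [x / norm for x in c]
        z = sum(math.exp(sum(w_j * c_j for w_j, c_j in zip(w, c)))
                for w in embeddings)
        zs.append(z)
    return min(zs) / max(zs)

rng = random.Random(1)
dim, n = 16, 300
# isotropic, zero-centered embeddings
centered = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
# anisotropic embeddings: same points shifted along a common direction
shifted = [[x + (3.0 if j == 0 else 0.0) for j, x in enumerate(w)]
           for w in centered]
print(partition_isotropy(centered))  # higher: more isotropic
print(partition_isotropy(shifted))   # lower: embeddings share a direction
```

This proxy is only meant to build intuition; the paper's quantitative results rely on the proper isotropy and intrinsic dimension estimators.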

Isotropy
IsoScore, and intrinsic dimension (ID) estimations. Gen. corresponds to the causal generative models, Pt. to the pretrained bidirectional models.

We display below the 2D and 3D UMAP projections of the embedding matrices of all the models. For each window, the seven representations correspond to: no BPE, BPE 1k, BPE 5k, BPE 10k, BPE 20k, PVm and PVDm.

Generator (GPT2) + TSD

    [pdf]
    [pdf]

Generator (GPT2) + REMI

    [pdf]
    [pdf]

Pretrained (BERT) + TSD

    [pdf]
    [pdf]

Pretrained (BERT) + REMI

    [pdf]
    [pdf]

Singular values

The pages correspond to Pretrained + TSD, Pretrained + REMI, Generator + TSD and Generator + REMI.

    [pdf]

Generated examples

We provide listenable generated samples below, cherry-picked so that all tracks show enough diversity. You can download all the samples used for the human evaluations as MIDI files. In these files, the tracks are not shuffled and correspond, in order, to: no BPE, BPE 1k, BPE 5k, BPE 10k, BPE 20k, PVm, PVDm, CP Word and Octuple (the two latter are only present for REMI files). The same goes for the audio samples below.

TSD

File 27

File 41

File 52

File 55

REMI

Recall: the tracks correspond, in order, to: no BPE, BPE 1k, BPE 5k, BPE 10k, BPE 20k, PVm, PVDm, CP Word and Octuple.

File 12

File 33

File 47

File 54

Citation

(ACL url/doi/pages will be added once the proceedings are published)

@inproceedings{bpe-symbolic-music,
    title = "Byte Pair Encoding for Symbolic Music",
    author = "Fradet, Nathan  and
      Gutowski, Nicolas  and
      Chhel, Fabien  and
      Briot, Jean-Pierre",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/2301.11975",
}