Modern deep learning approaches usually transform inputs into a modality-specific form (for example, decoding an image file into an RGB tensor before passing it to a vision model). Researchers from Apple have developed a new approach that bypasses this step and trains transformer-based models directly on raw file bytes, enabling a single architecture to operate on multiple input modalities. The presented ByteFormer model achieves strong image classification performance and has applications in privacy-preserving inference, since the model can consume obfuscated input representations rather than decoded images.
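To make the core idea concrete, here is a minimal sketch of a byte-level transformer classifier: the 256 possible byte values are embedded like tokens, the long byte sequence is shortened with a strided convolution before attention, and a standard transformer encoder plus pooling head produces class logits. This is an illustrative simplification under assumed hyperparameters (embedding size, kernel/stride, layer counts), not the authors' ByteFormer implementation; the file name `example.tiff` is hypothetical.

```python
import torch
import torch.nn as nn

class ByteClassifier(nn.Module):
    """Toy byte-level transformer classifier (illustration only,
    not the authors' ByteFormer implementation)."""

    def __init__(self, num_classes: int, d_model: int = 192,
                 n_layers: int = 4, n_heads: int = 3, max_len: int = 4096):
        super().__init__()
        # Bytes take values 0..255; index 256 is reserved for padding.
        self.byte_embed = nn.Embedding(257, d_model, padding_idx=256)
        # Strided 1D convolution shortens the byte sequence before attention
        # (kernel size and stride here are assumptions, not paper values).
        self.downsample = nn.Conv1d(d_model, d_model, kernel_size=8, stride=4)
        self.pos_embed = nn.Parameter(torch.zeros(1, max_len, d_model))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model,
            batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, byte_ids: torch.Tensor) -> torch.Tensor:
        # byte_ids: (batch, seq_len) integer tensor of raw file bytes.
        x = self.byte_embed(byte_ids)                            # (B, L, D)
        x = self.downsample(x.transpose(1, 2)).transpose(1, 2)   # (B, L', D)
        x = x + self.pos_embed[:, : x.size(1)]
        x = self.encoder(x)
        return self.head(x.mean(dim=1))                          # pool, classify


# Usage: classify a file directly from its raw bytes,
# with no modality-specific decoding (e.g. no image decoding step).
raw = open("example.tiff", "rb").read()[:4096]      # hypothetical input file
ids = torch.tensor(list(raw), dtype=torch.long).unsqueeze(0)
model = ByteClassifier(num_classes=1000)
logits = model(ids)                                  # shape: (1, 1000)
```

Because the model never sees a decoded image, the same pipeline could in principle ingest bytes from other file types, which is what makes the byte-level formulation modality-agnostic.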
Link to paper: https://arxiv.org/pdf/2306.00238.pdf