Notable References
- A Framework for Contrastive and Generative Learning of Audio Representations
- Deep Residual Learning for Image Recognition
- Audio Set: An Ontology and Human-Labeled Dataset for Audio Events
- ProGen: Language Modeling for Protein Generation
- Language Models Are Few-Shot Learners
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- Music Transformer
- VideoBERT: A Joint Model for Video and Language Representation Learning
- Video Action Transformer Network
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
- Image Transformer
- Jukebox: A Generative Model for Music
- Neural Style Transfer for Audio Spectrograms
- Neuralogram: A Deep Neural Network Based Representation for Audio Signals
- Neural Discrete Representation Learning
- wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
- vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations
- Conditional End-to-End Audio Transforms
- Audio-Linguistic Embeddings for Spoken Sentences
- Attention Is All You Need
- FSD50K: An Open Dataset of Human-Labeled Sound Events
- SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python
- OpenNMT: Open-Source Toolkit for Neural Machine Translation
- Removing Noise from Music Using Local Trigonometric Bases and Wavelet Packets
- Frequency Estimation from Waveforms Using Multi-Layered Neural Networks
- ImageNet: A Large-Scale Hierarchical Image Database
- Language Through a Prism: A Spectral Approach for Multiscale Language Representations
- TensorFlow: A System for Large-Scale Machine Learning
- Adam: A Method for Stochastic Optimization
- Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
- Generating Long Sequences with Sparse Transformers