huggingface
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
Lines changed: 15 additions & 3 deletions
diff --git a/docs/source/about_dataset_features.mdx b/docs/source/about_dataset_features.mdx
Lines changed: 8 additions & 11 deletions
diff --git a/docs/source/audio_dataset.mdx b/docs/source/audio_dataset.mdx
Lines changed: 3 additions & 9 deletions
diff --git a/docs/source/audio_load.mdx b/docs/source/audio_load.mdx
Lines changed: 2 additions & 6 deletions
diff --git a/docs/source/audio_process.mdx b/docs/source/audio_process.mdx
Lines changed: 30 additions & 21 deletions
diff --git a/docs/source/create_dataset.mdx b/docs/source/create_dataset.mdx
Lines changed: 6 additions & 6 deletions
diff --git a/docs/source/installation.md b/docs/source/installation.md
Lines changed: 1 addition & 13 deletions
@@ -44,15 +44,17 @@ jobs:
       - uses: actions/checkout@v4
         with:
           fetch-depth: 0
+      - name: Setup FFmpeg
+        if: ${{ matrix.os == 'ubuntu-latest' }}
+        run: |
+          sudo apt update
+          sudo apt install -y ffmpeg 
       - name: Set up Python 3.9
         uses: actions/setup-python@v5
         with:
           python-version: "3.9"
       - name: Upgrade pip
         run: python -m pip install --upgrade pip
-      - name: Pin setuptools-scm
-        if: ${{ matrix.os == 'ubuntu-latest' }}
-        run: echo "installing pinned version of setuptools-scm to fix seqeval installation on 3.7" && pip install "setuptools-scm==6.4.2"
       - name: Install uv
         run: pip install --upgrade uv
       - name: Install dependencies
@@ -80,6 +82,11 @@ jobs:
       - uses: actions/checkout@v4
         with:
           fetch-depth: 0
+      - name: Setup FFmpeg
+        if: ${{ matrix.os == 'ubuntu-latest' }}
+        run: |
+          sudo apt update
+          sudo apt install -y ffmpeg 
       - name: Set up Python 3.11
         uses: actions/setup-python@v5
         with:
@@ -107,6 +114,11 @@ jobs:
       - uses: actions/checkout@v4
         with:
           fetch-depth: 0
+      - name: Setup FFmpeg
+        if: ${{ matrix.os == 'ubuntu-latest' }}
+        run: |
+          sudo apt update
+          sudo apt install -y ffmpeg 
       - name: Set up Python 3.11
         uses: actions/setup-python@v5
         with:
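
The `Setup FFmpeg` steps added in this workflow install FFmpeg on the Ubuntu runners, and the docs changes later in this diff state that audio decoding is done via ffmpeg. A minimal sketch, using only the Python standard library, of how a test or script might check for the `ffmpeg` executable before trying to decode audio. `ffmpeg_available` and `ffmpeg_version_line` are hypothetical helper names, not part of 🤗 Datasets, and finding the binary on `PATH` is only a proxy for the FFmpeg libraries being usable:

```python
import shutil
import subprocess
from typing import Optional


def ffmpeg_available() -> bool:
    """Return True if an `ffmpeg` executable is found on PATH."""
    return shutil.which("ffmpeg") is not None


def ffmpeg_version_line() -> Optional[str]:
    """Return the first line printed by `ffmpeg -version`, or None if unavailable."""
    path = shutil.which("ffmpeg")
    if path is None:
        return None
    try:
        result = subprocess.run(
            [path, "-version"], capture_output=True, text=True, check=False
        )
    except OSError:
        # The file exists but could not be executed.
        return None
    lines = result.stdout.splitlines()
    return lines[0] if lines else None


if __name__ == "__main__":
    if ffmpeg_available():
        print("ffmpeg found:", ffmpeg_version_line())
    else:
        print("ffmpeg not found; on Ubuntu, try: sudo apt install -y ffmpeg")
```

A check like this could be used to skip audio tests with a clear message on machines where FFmpeg is missing, rather than failing inside the decoder.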
 
@@ -53,7 +53,7 @@ See the [flatten](./process#flatten) section to learn how you can extract the ne
 
 </Tip>
 
-The array feature type is useful for creating arrays of various sizes. You can create arrays with two dimensions using [`Array2D`], and even arrays with five dimensions using [`Array5D`]. 
+The array feature type is useful for creating arrays of various sizes. You can create arrays with two dimensions using [`Array2D`], and even arrays with five dimensions using [`Array5D`].
 
 ```py
 >>> features = Features({'a': Array2D(shape=(1, 3), dtype='int32')})
@@ -69,9 +69,9 @@ The array type also allows the first dimension of the array to be dynamic. This
 
 Audio datasets have a column with type [`Audio`], which contains three important fields:
 
-* `array`: the decoded audio data represented as a 1-dimensional array.
-* `path`: the path to the downloaded audio file.
-* `sampling_rate`: the sampling rate of the audio data.
+- `array`: the decoded audio data represented as a 1-dimensional array.
+- `path`: the path to the downloaded audio file.
+- `sampling_rate`: the sampling rate of the audio data.
 
 When you load an audio dataset and call the audio column, the [`Audio`] feature automatically decodes and resamples the audio file:
 
@@ -80,10 +80,7 @@ When you load an audio dataset and call the audio column, the [`Audio`] feature
 
 >>> dataset = load_dataset("PolyAI/minds14", "en-US", split="train")
 >>> dataset[0]["audio"]
-{'array': array([ 0.        ,  0.00024414, -0.00024414, ..., -0.00024414,
-         0.        ,  0.        ], dtype=float32),
- 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
- 'sampling_rate': 8000}
+<datasets.features._torchcodec.AudioDecoder object at 0x11642b6a0>
 ```
 
 <Tip warning={true}>
@@ -92,7 +89,7 @@ Index into an audio dataset using the row index first and then the `audio` colum
 
 </Tip>
 
-With `decode=False`, the [`Audio`] type simply gives you the path or the bytes of the audio file, without decoding it into an `array`, 
+With `decode=False`, the [`Audio`] type simply gives you the path or the bytes of the audio file, without decoding it into a torchcodec `AudioDecoder` object:
 
 ```py
 >>> dataset = load_dataset("PolyAI/minds14", "en-US", split="train").cast_column("audio", Audio(decode=False))
@@ -126,7 +123,7 @@ Index into an image dataset using the row index first and then the `image` colum
 
 </Tip>
 
-With `decode=False`, the [`Image`] type simply gives you the path or the bytes of the image file, without decoding it into an `PIL.Image`, 
+With `decode=False`, the [`Image`] type simply gives you the path or the bytes of the image file, without decoding it into a `PIL.Image`:
 
 ```py
 >>> dataset = load_dataset("AI-Lab-Makerere/beans", split="train").cast_column("image", Image(decode=False))
@@ -146,4 +143,4 @@ You can also define a dataset of images from numpy arrays:
 And in this case the numpy arrays are encoded into PNG (or TIFF if the pixels values precision is important).
 
 For multi-channels arrays like RGB or RGBA, only uint8 is supported. If you use a larger precision, you get a warning and the array is downcasted to uint8.
-For gray-scale images you can use the integer or float precision you want as long as it is compatible with `Pillow`. A warning is shown if your image integer or float precision is too high, and in this case the array is downcated: an int64 array is downcasted to int32, and a float64 array is downcasted to float32. 
+For gray-scale images you can use the integer or float precision you want as long as it is compatible with `Pillow`. A warning is shown if your image integer or float precision is too high, and in this case the array is downcast: an int64 array is downcast to int32, and a float64 array is downcast to float32.
@@ -10,10 +10,9 @@ dataset = load_dataset("<username>/my_dataset")
 
 There are several methods for creating and sharing an audio dataset:
 
-* Create an audio dataset from local files in python with [`Dataset.push_to_hub`]. This is an easy way that requires only a few steps in python.
-
-* Create an audio dataset repository with the `AudioFolder` builder. This is a no-code solution for quickly creating an audio dataset with several thousand audio files.
+- Create an audio dataset from local files in python with [`Dataset.push_to_hub`]. This is an easy way that requires only a few steps in python.
 
+- Create an audio dataset repository with the `AudioFolder` builder. This is a no-code solution for quickly creating an audio dataset with several thousand audio files.
 
 <Tip>
 
@@ -28,10 +27,7 @@ You can load your own dataset using the paths to your audio files. Use the [`~Da
 ```py
 >>> audio_dataset = Dataset.from_dict({"audio": ["path/to/audio_1", "path/to/audio_2", ..., "path/to/audio_n"]}).cast_column("audio", Audio())
 >>> audio_dataset[0]["audio"]
-{'array': array([ 0.        ,  0.00024414, -0.00024414, ..., -0.00024414,
-         0.        ,  0.        ], dtype=float32),
- 'path': 'path/to/audio_1',
- 'sampling_rate': 16000}
+<datasets.features._torchcodec.AudioDecoder object at 0x11642b6a0>
 ```
 
 Then upload the dataset to the Hugging Face Hub using [`Dataset.push_to_hub`]:
@@ -51,7 +47,6 @@ my_dataset/
 
 ## AudioFolder
 
-
 The `AudioFolder` is a dataset builder designed to quickly load an audio dataset with several thousand audio files without requiring you to write any code.
 
 <Tip>
@@ -101,7 +96,6 @@ If all audio files are contained in a single directory or if they are not on the
 
 </Tip>
 
-
 If there is additional information you'd like to include about your dataset, like text captions or bounding boxes, add it as a `metadata.csv` file in your folder. This lets you quickly create datasets for different computer vision tasks like text captioning or object detection. You can also use a JSONL file `metadata.jsonl` or a Parquet file `metadata.parquet`.
 
 ```
 
@@ -8,18 +8,14 @@ Audio decoding is based on the [`soundfile`](https://github.com/bastibe/python-s
 To work with audio datasets, you need to have the `audio` dependencies installed.
 Check out the [installation](./installation#audio) guide to learn how to install it.
 
-
 ## Local files
 
 You can load your own dataset using the paths to your audio files. Use the [`~Dataset.cast_column`] function to take a column of audio file paths, and cast it to the [`Audio`] feature:
 
 ```py
 >>> audio_dataset = Dataset.from_dict({"audio": ["path/to/audio_1", "path/to/audio_2", ..., "path/to/audio_n"]}).cast_column("audio", Audio())
 >>> audio_dataset[0]["audio"]
-{'array': array([ 0.        ,  0.00024414, -0.00024414, ..., -0.00024414,
-         0.        ,  0.        ], dtype=float32),
- 'path': 'path/to/audio_1',
- 'sampling_rate': 16000}
+<datasets.features._torchcodec.AudioDecoder object at 0x11642b6a0>
 ```
 
 ## AudioFolder
@@ -99,7 +95,7 @@ For a guide on how to load any type of dataset, take a look at the <a class="und
 
 ## Audio decoding
 
-By default, audio files are decoded sequentially as NumPy arrays when you iterate on a dataset.
+By default, audio files are decoded sequentially as torchcodec [`AudioDecoder`](https://docs.pytorch.org/torchcodec/stable/generated/torchcodec.decoders.AudioDecoder.html#torchcodec.decoders.AudioDecoder) objects when you iterate on a dataset.
 However it is possible to speed up the dataset significantly using multithreaded decoding:
 
 ```python
 
@@ -7,7 +7,6 @@ This guide shows specific methods for processing audio datasets. Learn how to:
 
 For a guide on how to process any type of dataset, take a look at the <a class="underline decoration-sky-400 decoration-2 font-semibold" href="./process">general process guide</a>.
 
-
 ## Cast
 
 The [`~Dataset.cast_column`] function is used to cast a column to another feature to be decoded. When you use this function with the [`Audio`] feature, you can resample the sampling rate:
@@ -22,16 +21,26 @@ The [`~Dataset.cast_column`] function is used to cast a column to another featur
 Audio files are decoded and resampled on-the-fly, so the next time you access an example, the audio file is resampled to 16kHz:
 
 ```py
->>> dataset[0]["audio"]
-{'array': array([ 2.3443763e-05,  2.1729663e-04,  2.2145823e-04, ...,
-         3.8356509e-05, -7.3497440e-06, -2.1754686e-05], dtype=float32),
- 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
- 'sampling_rate': 16000}
+>>> audio = dataset[0]["audio"]
+>>> audio
+<datasets.features._torchcodec.AudioDecoder object at 0x11642b6a0>
+>>> samples = audio.get_all_samples()
+>>> samples.data
+tensor([[ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  2.3447e-06,
+         -1.9127e-04, -5.3330e-05]])
+>>> samples.sample_rate
+16000
 ```
 
 <div class="flex justify-center">
-    <img class="block dark:hidden" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/resample.gif"/>
-    <img class="hidden dark:block" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/resample-dark.gif"/>
+  <img
+    class="block dark:hidden"
+    src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/resample.gif"
+  />
+  <img
+    class="hidden dark:block"
+    src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/resample-dark.gif"
+  />
 </div>
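
The decoder object shown in the example above replaces the old dictionary with `array`, `path`, and `sampling_rate` keys. A minimal sketch, assuming code that has to accept both shapes during a migration: `to_samples_and_rate` is a hypothetical helper (not part of 🤗 Datasets) that duck-types on `get_all_samples()`, and `FakeDecoder` is a stand-in used only so the example runs without `torchcodec` installed:

```python
from types import SimpleNamespace


def to_samples_and_rate(audio):
    """Return (samples, sampling_rate) for either audio representation.

    New style: an object exposing get_all_samples(), whose result has
    .data and .sample_rate, as in the example above.
    Old style: a dict with "array" and "sampling_rate" keys.
    """
    if hasattr(audio, "get_all_samples"):
        samples = audio.get_all_samples()
        return samples.data, samples.sample_rate
    return audio["array"], audio["sampling_rate"]


class FakeDecoder:
    """Stand-in for a decoder object; NOT the real AudioDecoder."""

    def get_all_samples(self):
        # Mimics the (channels, samples) layout seen in the tensor output above.
        return SimpleNamespace(data=[[0.0, 0.1, -0.1]], sample_rate=16000)


new_style = to_samples_and_rate(FakeDecoder())
old_style = to_samples_and_rate({"array": [0.0, 0.1, -0.1], "sampling_rate": 8000})
print(new_style)
print(old_style)
```

Note that the two shapes differ: the old `array` was 1-dimensional, while `samples.data` in the example above has a leading channel dimension, so downstream code may need to index or squeeze it.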
 
 ## Map
@@ -40,30 +49,30 @@ The [`~Dataset.map`] function helps preprocess your entire dataset at once. Depe
 
 - For pretrained speech recognition models, load a feature extractor and tokenizer and combine them in a `processor`:
 
-    ```py
-    >>> from transformers import AutoTokenizer, AutoFeatureExtractor, AutoProcessor
+  ```py
+  >>> from transformers import AutoFeatureExtractor, Wav2Vec2CTCTokenizer, Wav2Vec2Processor
 
-    >>> model_checkpoint = "facebook/wav2vec2-large-xlsr-53"
-    # after defining a vocab.json file you can instantiate a tokenizer object:
-    >>> tokenizer = AutoTokenizer("./vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")
-    >>> feature_extractor = AutoFeatureExtractor.from_pretrained(model_checkpoint)
-    >>> processor = AutoProcessor.from_pretrained(feature_extractor=feature_extractor, tokenizer=tokenizer)
-    ```
+  >>> model_checkpoint = "facebook/wav2vec2-large-xlsr-53"
+  # after defining a vocab.json file you can instantiate a tokenizer object:
+  >>> tokenizer = Wav2Vec2CTCTokenizer("./vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")
+  >>> feature_extractor = AutoFeatureExtractor.from_pretrained(model_checkpoint)
+  >>> processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)
+  ```
 
 - For fine-tuned speech recognition models, you only need to load a `processor`:
 
-    ```py
-    >>> from transformers import AutoProcessor
+  ```py
+  >>> from transformers import AutoProcessor
 
-    >>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")
-    ```
+  >>> processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")
+  ```
 
 When you use [`~Dataset.map`] with your preprocessing function, include the `audio` column to ensure you're actually resampling the audio data:
 
 ```py
 >>> def prepare_dataset(batch):
 ...     audio = batch["audio"]
-...     batch["input_values"] = processor(audio["array"], sampling_rate=audio["sampling_rate"]).input_values[0]
+...     batch["input_values"] = processor(audio.get_all_samples().data, sampling_rate=audio["sampling_rate"]).input_values[0]
 ...     batch["input_length"] = len(batch["input_values"])
 ...     with processor.as_target_processor():
 ...         batch["labels"] = processor(batch["sentence"]).input_ids
 
@@ -4,8 +4,8 @@ Sometimes, you may need to create a dataset if you're working with your own data
 
 In this tutorial, you'll learn how to use 🤗 Datasets low-code methods for creating all types of datasets:
 
-* Folder-based builders for quickly creating an image or audio dataset
-* `from_` methods for creating datasets from local files
+- Folder-based builders for quickly creating an image or audio dataset
+- `from_` methods for creating datasets from local files
 
 ## File-based builders
 
@@ -24,10 +24,10 @@ To get the list of supported formats and code examples, follow this guide [here]
 
 There are two folder-based builders, [`ImageFolder`] and [`AudioFolder`]. These are low-code methods for quickly creating an image or speech and audio dataset with several thousand examples. They are great for rapidly prototyping computer vision and speech models before scaling to a larger dataset. Folder-based builders takes your data and automatically generates the dataset's features, splits, and labels. Under the hood:
 
-* [`ImageFolder`] uses the [`~datasets.Image`] feature to decode an image file. Many image extension formats are supported, such as jpg and png, but other formats are also supported. You can check the complete [list](https://github.com/huggingface/datasets/blob/b5672a956d5de864e6f5550e493527d962d6ae55/src/datasets/packaged_modules/imagefolder/imagefolder.py#L39) of supported image extensions.
-* [`AudioFolder`] uses the [`~datasets.Audio`] feature to decode an audio file. Audio extensions such as wav and mp3 are supported, and you can check the complete [list](https://github.com/huggingface/datasets/blob/b5672a956d5de864e6f5550e493527d962d6ae55/src/datasets/packaged_modules/audiofolder/audiofolder.py#L39) of supported audio extensions.
+- [`ImageFolder`] uses the [`~datasets.Image`] feature to decode an image file. Many image extension formats are supported, such as jpg and png, but other formats are also supported. You can check the complete [list](https://github.com/huggingface/datasets/blob/b5672a956d5de864e6f5550e493527d962d6ae55/src/datasets/packaged_modules/imagefolder/imagefolder.py#L39) of supported image extensions.
+- [`AudioFolder`] uses the [`~datasets.Audio`] feature to decode an audio file. Extensions such as wav, mp3, and even mp4 are supported, and you can check the complete [list](https://ffmpeg.org/ffmpeg-formats.html) of supported audio extensions. Decoding is done via ffmpeg.
 
-The dataset splits are generated from the repository structure, and the label names are automatically inferred from the directory name. 
+The dataset splits are generated from the repository structure, and the label names are automatically inferred from the directory name.
 
 For example, if your image dataset (it is the same for an audio dataset) is stored like this:
 
@@ -44,7 +44,7 @@ pokemon/test/water/wartortle.png
 Then this is how the folder-based builder generates an example:
 
 <div class="flex justify-center">
-    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/folder-based-builder.png"/>
+  <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/folder-based-builder.png" />
 </div>
 
 Create the image dataset by specifying `imagefolder` in [`load_dataset`]:
 
@@ -30,7 +30,7 @@ You should install 🤗 Datasets in a [virtual environment](https://docs.python.
    ```bash
    # Activate the virtual environment
    source .env/bin/activate
-   
+
    # Deactivate the virtual environment
    source .env/bin/deactivate
    ```
@@ -65,18 +65,6 @@ To work with audio datasets, you need to install the [`Audio`] feature as an ext
 pip install datasets[audio]
 ```
 
-<Tip warning={true}>
-
-To decode mp3 files, you need to have at least version 1.1.0 of the `libsndfile` system library. Usually, it's bundled with the python [`soundfile`](https://github.com/bastibe/python-soundfile) package, which is installed as an extra audio dependency for 🤗 Datasets.
-For Linux, the required version of `libsndfile` is bundled with `soundfile` starting from version 0.12.0. You can run the following command to determine which version of `libsndfile` is being used by `soundfile`:
-
-```bash
-python -c "import soundfile; print(soundfile.__libsndfile_version__)"
-```
-
-</Tip>
-
-
 ## Vision
 
 To work with image datasets, you need to install the [`Image`] feature as an extra dependency: