A Python toolkit for generating video captions, built on the Lance database format and using the Gemini API for automatic captioning.
We've added a script for batch-processing image datasets! It includes resolution pre-scaling and alignment for image pairs. If only one path is given, it only resizes the images, and you can set maximum values for the longest and shortest edges to control the scaling. If two paths are given, it processes image pairs, which is useful for preparing paired training data for some editing models.
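A minimal sketch of the single-path resizing step, assuming Pillow; the edge-limit names (`max_long_edge`, `max_short_edge`) are illustrative and not necessarily the script's actual options:

```python
from pathlib import Path
from PIL import Image

def prescale(path: Path, max_long_edge: int = 2048, max_short_edge: int = 1024) -> None:
    # Pick the strictest scale factor so both edge limits hold; never upscale.
    img = Image.open(path)
    w, h = img.size
    scale = min(1.0, max_long_edge / max(w, h), max_short_edge / min(w, h))
    if scale < 1.0:
        img = img.resize((round(w * scale), round(h * scale)), Image.LANCZOS)
        img.save(path)

for p in Path("./datasets").rglob("*.jpg"):
    prescale(p)
```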
A new watermark detection script has been added. It initially supports two watermark detection models and can quickly sort the images in a dataset into watermarked/unwatermarked categories. It generates two folders and separates the data through symbolic links, so if needed you can copy the corresponding folder to move the data without deleting anything, and no additional disk space is used. (Creating symbolic links requires elevated permissions, so you must run PowerShell as administrator.)
Finally, it generates a JSON report listing the watermark detection results for every image in the original path, including the detection scores and verdicts. The watermark threshold can be changed in the script to adjust the detection results accordingly.
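The flow can be sketched as follows, with a placeholder `detect(path) -> float` standing in for the actual watermark models and illustrative folder/key names:

```python
import json
import os
from pathlib import Path

THRESHOLD = 0.5  # editable threshold, as in the real script

def classify(src_dir: Path, detect) -> None:
    report = {}
    for sub in ("watermarked", "unwatermarked"):
        (src_dir / sub).mkdir(exist_ok=True)
    for img in sorted(src_dir.glob("*.jpg")):
        score = detect(img)  # placeholder for the watermark model
        label = "watermarked" if score >= THRESHOLD else "unwatermarked"
        link = src_dir / label / img.name
        if not link.exists():
            os.symlink(img.resolve(), link)  # requires admin PowerShell on Windows
        report[img.name] = {"score": score, "result": label}
    (src_dir / "watermark_report.json").write_text(json.dumps(report, indent=2))
```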
We officially support the tags-highlight captions feature! It is currently enabled for the pixtral model, and we are considering adding it to other models such as Gemini in the future.
What is tags highlight?
Non-state-of-the-art VLMs are known to be somewhat inaccurate, so we first run wdtagger to generate tag annotations and then feed those tags to the VLM as assistance, which improves accuracy.
The tags are color-categorized, which also makes it easy to spot-check annotation quality (e.g., purple for character names and copyright, red for clothing, brown for body features, light yellow for actions, etc.).
The resulting annotation quality is comparable to some closed-source models!
Additionally, we have added check parameters: you can use the parent folder name as the character name, and you can check each caption's tags-highlight rate. In general, a good caption should have a highlight rate above 35%. You can also specify a different highlight rate to change the default standard.
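The highlight rate can be understood as the share of wdtagger tags that actually appear in the final caption. A rough sketch (the repo's exact matching rules may differ):

```python
def highlight_rate(tags: list[str], caption: str) -> float:
    # Count the tagger tags that the caption mentions (underscores -> spaces).
    text = caption.lower()
    hits = [t for t in tags if t.replace("_", " ").lower() in text]
    return len(hits) / max(len(tags), 1)

tags = ["1girl", "long_hair", "school_uniform"]
caption = "A girl with long hair wearing a school uniform stands outside."
print(f"{highlight_rate(tags, caption):.0%}")  # above ~35% is considered a good caption
```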
How to use it? First run 3、tagger.ps1 to generate tags for your image dataset,
then run 4、run.ps1 with a pixtral API key.
We support Gemini image captioning and rating, including gemini-2.5-flash-preview-04-17. However, in our testing the flash version performs poorly at captioning and image review, so we recommend the pro version.
Well, we forgot to release version 2.2, so we went straight to version 2.3!
Version 2.3 adds the GLM-4V model for video captions.
Version 2.2 added TensorRT acceleration for the local ONNX WDtagger model.
In our testing, tagging 10,000 samples takes about 30 minutes with standard CUDA,
while with TensorRT it completes in just 15 to 20 minutes.
However, the first run takes longer because the TensorRT engine has to be compiled.
If TensorRT fails, it automatically falls back to CUDA, so there is nothing to worry about.
If it reports that TensorRT libraries are missing, some components may not be installed;
please install version 10.7.x manually from here.
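The automatic fallback comes from onnxruntime's provider priority list: it tries each execution provider in order and moves to the next if one is unavailable. A minimal sketch (the model filename is illustrative):

```python
import onnxruntime as ort

providers = [
    # First run compiles a TensorRT engine; caching avoids recompiling later.
    ("TensorrtExecutionProvider", {"trt_engine_cache_enable": True}),
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
]
session = ort.InferenceSession("wd14-tagger.onnx", providers=providers)
print(session.get_providers())  # shows which provider was actually selected
```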
Added support for Gemini 2.5 Pro Exp. Videos are now cut into 600-second segments by default.
Now we support video segmentation! A new video segmentation module detects key timestamps based on scene changes and then outputs the corresponding images and video clips!
It also exports an HTML page for reference; the effect is very noticeable!
We have additionally added subtitle alignment algorithms that snap Gemini's timestamped subtitles to the millisecond level using the detected scene-change frames (some errors remain, but the results are much better).
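The detector names in the config below match PySceneDetect's, so scene detection presumably looks something like this minimal sketch:

```python
from scenedetect import AdaptiveDetector, detect

# Each entry of scene_list is a (start, end) pair of frame timecodes.
scene_list = detect("input.mp4", AdaptiveDetector(adaptive_threshold=3.0, min_scene_len=15))
for start, end in scene_list:
    print(f"{start.get_timecode()} -> {end.get_timecode()}")
```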
Finally, we added the image output feature of the latest gemini-2.0-flash-exp model!
You can customize the task: add the task name in config.toml, and the corresponding images will be processed automatically (and then labeled).
Some simple task descriptions are provided below; we welcome the community to keep optimizing these task prompts and to contribute!
See qinglong-captions/config/config.toml, lines 239 to 249 (commit 12b7750).
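Reading a custom task prompt could look roughly like this; the `prompts` table and task key are hypothetical, so check config/config.toml for the real layout:

```python
import tomllib  # Python 3.11+

with open("config/config.toml", "rb") as f:
    config = tomllib.load(f)

# Hypothetical lookup; the actual table layout is in config/config.toml.
prompt = config["prompts"]["my_task"]
```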
Now with Mistral OCR functionality! It utilizes Mistral's advanced OCR capabilities to extract text from videos and images.
This feature is particularly useful when processing media files containing subtitles, signs, or other text elements, enhancing the accuracy and completeness of captions.
The OCR functionality is integrated into the existing workflow and can be used without additional configuration.
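With the mistralai SDK, an OCR call looks roughly like the following sketch; the base64 data-URL form is one of the documented document types, but treat the details as an assumption rather than this repo's exact code:

```python
import base64
import os
from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

with open("frame.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

response = client.ocr.process(
    model="mistral-ocr-latest",
    document={"type": "image_url", "image_url": f"data:image/jpeg;base64,{b64}"},
)
print(response.pages[0].markdown)  # extracted text, returned as markdown
```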
Now added WDtagger! Even if you cannot use a GPU, you can still label with the CPU.
It is multi-threaded with various optimizations, so it processes large-scale data quickly.
Inference runs through ONNX for model acceleration.
Code referenced from @kohya-ss: https://github.com/sdbds/sd-scripts/blob/main/finetune/tag_images_by_wd14_tagger.py
Version 2.0 adds dual-caption functionality: feed in wdtagger's tags, and it outputs natural language.
Now we support the qwen-VL series of video caption models!
- qwen-vl-max-latest
- qwen2.5-vl-72b-instruct
- qwen2.5-vl-7b-instruct
- qwen2.5-vl-3b-instruct
qwen2.5-vl accepts videos from 2 seconds up to 10 minutes, while qwen-vl-max-latest is limited to 1 minute. These models are not good at capturing timestamps, so we recommend captioning segmented video clips and adjusting the prompts.
The video upload feature requires an application to the provider; please submit the application here.
We are considering adding local model inference in the future, e.g., qwen2.5-vl-7b-instruct.
Additionally, logs now use streaming inference, so you can watch the model's output in real time before the complete result is displayed.
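For the API-served models, streaming typically looks like this OpenAI-compatible sketch (DashScope's compatible-mode endpoint for the qwen-VL series is shown as an assumed example):

```python
from openai import OpenAI

client = OpenAI(
    api_key="your-dashscope-key",
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
)
stream = client.chat.completions.create(
    model="qwen-vl-max-latest",
    messages=[{"role": "user", "content": "Describe this scene."}],
    stream=True,
)
for chunk in stream:
    # Each chunk carries a small delta of the full response.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```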
The Google Gemini SDK has been updated, and the new version supports the new Gemini 2.0 models!
The new SDK is more powerful and, most importantly, supports verifying uploaded videos.
If you want to tag the same video repeatedly, you no longer need to upload it again: the video name and file size/hash are verified automatically.
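With the new google-genai SDK, that check can be sketched as comparing the local file against the files already on the server before uploading. Here is a simplified version matching on name and size (field names follow the SDK's File type, but treat this as an assumption, not the repo's exact logic):

```python
from pathlib import Path
from google import genai

client = genai.Client(api_key="your-gemini-key")

def find_uploaded(video: Path):
    # Reuse an existing upload whose display name and size match the local file.
    size = video.stat().st_size
    for f in client.files.list():
        if f.display_name == video.name and f.size_bytes == size:
            return f
    return None

video = Path("clip.mp4")
uploaded = find_uploaded(video) or client.files.upload(
    file=video, config={"display_name": video.name}
)
```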
At the same time, the millisecond-level alignment function has been updated: after the subtitles of a segmented long video are merged, the timeline is automatically aligned to milliseconds, which is very neat!
- Automatic video/audio/image description using Google's Gemini API, or image-only captioning with pixtral-large 124B
- Export captions in SRT format
- Support for multiple video formats
- Batch processing with progress tracking
- Maintains original directory structure
- Configurable through TOML files
- Lance database integration for efficient data management
- Import videos into Lance database format
- Preserve original directory structure
- Support for both single directory and paired directory structures
- Extract videos and captions from Lance datasets
- Maintains original file structure
- Exports captions as SRT files in the same directory as source videos
- Auto Clip with SRT timestamps
- Automatic video scene description using Gemini API or Pixtral API
- Batch processing support
- SRT format output with timestamps
- Robust error handling and retry mechanisms
- Progress tracking for batch operations
- API prompt configuration management
- Customizable batch processing parameters
- Default schema includes file paths and metadata
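As a mental model for the default schema, a Lance dataset row carrying a file path plus metadata might be written like this pylance sketch (the column names are illustrative, not the toolkit's actual schema):

```python
import lance
import pyarrow as pa

# Illustrative columns; the real schema is defined inside the toolkit.
table = pa.table({
    "file_path": ["videos/clip_001.mp4"],
    "mime_type": ["video/mp4"],
    "caption": [""],
})
lance.write_dataset(table, "datasets/captions.lance", mode="overwrite")
print(lance.dataset("datasets/captions.lance").schema)
```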
Give PowerShell unrestricted script access so the venv can work:
- Open an administrator powershell window
- Type Set-ExecutionPolicy Unrestricted and answer A
- Close admin powershell window
Run the following PowerShell script:
./1、install-uv-qinglong.ps1
- First install PowerShell:
sudo sh ./0、install pwsh.sh
- Then run the installation script using PowerShell:
sudo pwsh ./1、install-uv-qinglong.ps1
Use sudo pwsh if you are on Linux.
Windows users need to install the TensorRT libraries manually from here. TensorRT speeds up WD14Tagger (it does not affect the API part). We now use version 10.9.
video example: https://files.catbox.moe/8fudnf.mp4
Use the PowerShell script to import your videos:
./lanceImport.ps1
Use the PowerShell script to export data from Lance format:
./lanceExport.ps1
Use the PowerShell script to generate captions for your videos:
./run.ps1
Note: You'll need to configure your Gemini API key in run.ps1
before using the auto-captioning feature.
A Pixtral API key is optional for image captioning.
Now we support step-1.5v-mini as an optional video captioner.
Now we support the qwen-VL series as optional video captioners.
Now we support Mistral OCR for optional PDF and image OCR.
Now we support the GLM series as optional video captioners.
$dataset_path = "./datasets"
$gemini_api_key = ""
$gemini_model_path = "gemini-2.0-pro-exp-02-05"
$pixtral_api_key = ""
$pixtral_model_path = "pixtral-large-2411"
$step_api_key = ""
$step_model_path = "step-1.5v-mini"
$qwenVL_api_key = ""
$qwenVL_model_path = "qwen-vl-max-latest" # qwen2.5-vl-72b-instruct<10mins qwen-vl-max-latest <1min
$glm_api_key = ""
$glm_model_path = "GLM-4V-Plus-0111"
$dir_name = $true
$mode = "long"
$not_clip_with_caption = $false # Not clip with caption | 不根据caption裁剪
$wait_time = 1
$max_retries = 100
$segment_time = 600
$ocr = $false
$document_image = $true
$scene_detector = "AdaptiveDetector" # from ["ContentDetector","AdaptiveDetector","HashDetector","HistogramDetector","ThresholdDetector"]
$scene_threshold = 0.0 # default value ["ContentDetector": 27.0, "AdaptiveDetector": 3.0, "HashDetector": 0.395, "HistogramDetector": 0.05, "ThresholdDetector": 12]
$scene_min_len = 15
$scene_luma_only = $false