Tool HTTP request, returned application/octet-stream, file type not recognized #15262

Closed
5 tasks done
vanche1212 opened this issue Mar 8, 2025 · 4 comments · Fixed by #15329
Comments

@vanche1212

Self Checks

  • This is only for bug report, if you would like to ask a question, please head to Discussions.
  • I have searched for existing issues, including closed ones.
  • I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
  • [FOR CHINESE USERS] Please be sure to submit the issue in English, otherwise it will be closed. Thank you! :)
  • Please do not modify this template :) and fill in all the required fields.

Dify version

1.0.0

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

curl --location --request POST 'http://10.27.24.233:8081/api/v1/convert/pdf/word' \
--form 'fileInput=@"/Users/soya/Desktop/1219926043.pdf"' \
--form 'outputFormat="docx"'

The above is the curl request. In Postman, the returned response headers are as follows:
Content-Type | application/octet-stream
Content-Disposition | form-data; name="attachment"; filename="1219926043.docx"
Content-Length | 116257
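
For reference, the intended filename is carried in the `filename` parameter of the Content-Disposition header shown above. The following is only a minimal sketch of reading it with the Python standard library; the helper name `filename_from_content_disposition` is made up for illustration and is not part of Dify.

```python
import os
from email.message import Message
from typing import Optional


def filename_from_content_disposition(value: str) -> Optional[str]:
    """Return the `filename` parameter of a Content-Disposition value, or None."""
    msg = Message()
    msg["Content-Disposition"] = value
    # get_filename() reads the `filename` parameter and removes the quotes.
    return msg.get_filename()


# Header value exactly as returned by the conversion service in this report.
header = 'form-data; name="attachment"; filename="1219926043.docx"'
name = filename_from_content_disposition(header)
extension = os.path.splitext(name)[1] if name else ""
print(name, extension)
```

With the header above, the extension derived this way is `.docx` rather than `.bin`.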

The workflow is as follows:

[Image: workflow screenshot]

The debug trace information is as follows:
{
"status_code": 200,
"body": "",
"headers": {
"date": "Sat, 08 Mar 2025 04:42:33 GMT",
"content-type": "application/octet-stream",
"content-disposition": "form-data; name=\"attachment\"; filename=\"1219926043.docx\"",
"content-length": "116257",
"cache-status": "ac5c6a2e8fde;detail=mismatch",
"via": "1.1 ac5c6a2e8fde (squid/6.10)",
"connection": "keep-alive"
},
"files": [
{
"dify_model_identity": "dify__file",
"id": null,
"tenant_id": "5185fd0a-047c-4abf-b27f-f91ed93f6c0f",
"type": "custom",
"transfer_method": "tool_file",
"remote_url": null,
"related_id": "71aef328-2b8a-4808-b49d-5dd1079f900f",
"filename": "2ff8edd9bb0d4557a2c68d2a02e98271.bin",
"extension": ".bin",
"mime_type": "application/octet-stream",
"size": 116257,
"url": "http://10.27.24.230//files/tools/71aef328-2b8a-4808-b49d-5dd1079f900f.bin?timestamp=1741408957&nonce=d0a6a819f20afe7291b99d35a98c1082&sign=T9Qu-AmUwlsvOJmomm6CrrxJaZCmQEkAl1Vc2Fwp6ng="
}
]
}

✔️ Expected Behavior

{
"status_code": 200,
"body": "",
"headers": {
"date": "Sat, 08 Mar 2025 04:42:33 GMT",
"content-type": "application/octet-stream",
"content-disposition": "form-data; name=\"attachment\"; filename=\"1219926043.docx\"",
"content-length": "116257",
"cache-status": "ac5c6a2e8fde;detail=mismatch",
"via": "1.1 ac5c6a2e8fde (squid/6.10)",
"connection": "keep-alive"
},
"files": [
{
"dify_model_identity": "dify__file",
"id": null,
"tenant_id": "5185fd0a-047c-4abf-b27f-f91ed93f6c0f",
"type": "custom",
"transfer_method": "tool_file",
"remote_url": null,
"related_id": "71aef328-2b8a-4808-b49d-5dd1079f900f",
"filename": "2ff8edd9bb0d4557a2c68d2a02e98271.docx",
"extension": ".docx",
"mime_type": "application/octet-stream",
"size": 116257,
"url": "http://10.27.24.230//files/tools/71aef328-2b8a-4808-b49d-5dd1079f900f.bin?timestamp=1741408957&nonce=d0a6a819f20afe7291b99d35a98c1082&sign=T9Qu-AmUwlsvOJmomm6CrrxJaZCmQEkAl1Vc2Fwp6ng="
}
]
}

❌ Actual Behavior

{
"status_code": 200,
"body": "",
"headers": {
"date": "Sat, 08 Mar 2025 04:42:33 GMT",
"content-type": "application/octet-stream",
"content-disposition": "form-data; name=\"attachment\"; filename=\"1219926043.docx\"",
"content-length": "116257",
"cache-status": "ac5c6a2e8fde;detail=mismatch",
"via": "1.1 ac5c6a2e8fde (squid/6.10)",
"connection": "keep-alive"
},
"files": [
{
"dify_model_identity": "dify__file",
"id": null,
"tenant_id": "5185fd0a-047c-4abf-b27f-f91ed93f6c0f",
"type": "custom",
"transfer_method": "tool_file",
"remote_url": null,
"related_id": "71aef328-2b8a-4808-b49d-5dd1079f900f",
"filename": "2ff8edd9bb0d4557a2c68d2a02e98271.bin",
"extension": ".bin",
"mime_type": "application/octet-stream",
"size": 116257,
"url": "http://10.27.24.230//files/tools/71aef328-2b8a-4808-b49d-5dd1079f900f.bin?timestamp=1741408957&nonce=d0a6a819f20afe7291b99d35a98c1082&sign=T9Qu-AmUwlsvOJmomm6CrrxJaZCmQEkAl1Vc2Fwp6ng="
}
]
}

@vanche1212
Author

How do I get the HTTP request tool to recognize the file extension as .docx?

dosubot bot commented Mar 8, 2025

To configure the tool HTTP request to recognize the file extension as .docx instead of .bin when the content type is application/octet-stream, you can implement a detection process that checks the Content-Disposition header for the filename and adjusts the file extension accordingly. Here's a Python snippet that demonstrates this approach:

import os
import re
import mimetypes
import urllib.parse
from uuid import uuid4

import magic  # python-magic; needs the libmagic system library

def guess_file_info_from_response(response):
    url = str(response.url)
    # Try to extract filename from URL
    parsed_url = urllib.parse.urlparse(url)
    url_path = parsed_url.path
    filename = os.path.basename(url_path)

    # If filename couldn't be extracted, use Content-Disposition header
    if not filename:
        content_disposition = response.headers.get("Content-Disposition")
        if content_disposition:
            # Stop at a quote or ';' so the closing quote is not captured
            filename_match = re.search(r'filename="?([^";]+)"?', content_disposition)
            if filename_match:
                filename = filename_match.group(1)

    # If still no filename, generate a unique one
    if not filename:
        unique_name = str(uuid4())
        filename = f"{unique_name}"

    # Guess MIME type from filename first, then URL
    mimetype, _ = mimetypes.guess_type(filename)
    if mimetype is None:
        mimetype, _ = mimetypes.guess_type(url)
    if mimetype is None:
        # If guessing fails, use Content-Type from response headers
        mimetype = response.headers.get("Content-Type", "application/octet-stream")

    # Use python-magic to guess MIME type if it is still the generic type
    if mimetype == "application/octet-stream":
        try:
            mimetype = magic.from_buffer(response.content[:1024], mime=True)
        except magic.MagicException:
            pass

    extension = os.path.splitext(filename)[1]

    # Ensure filename has an extension
    if not extension:
        extension = mimetypes.guess_extension(mimetype) or ".bin"
        filename = f"{filename}{extension}"

    # Handle special cases for extensions
    if extension == ".bin" and mimetype == "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
        extension = ".docx"
        filename = f"{os.path.splitext(filename)[0]}.docx"

    return {
        "filename": filename,
        "extension": extension,
        "mimetype": mimetype,
        "size": int(response.headers.get("Content-Length", -1)),
    }

This code uses the python-magic library to perform content-based MIME type detection and adjusts the file extension to .docx if necessary [1].
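
One caveat for the request in this issue: if `response.url` is the request URL, its path ends in `/word`, so the URL-first lookup in the snippet above would yield `word` and never consult the Content-Disposition header. The following is only a sketch of the opposite ordering, under the assumption that the header should win when it is present; `pick_filename` is a hypothetical helper, not an existing Dify function.

```python
import os
import urllib.parse
from email.message import Message
from typing import Mapping, Optional


def pick_filename(url: str, headers: Mapping[str, str]) -> Optional[str]:
    """Hypothetical helper: prefer the Content-Disposition filename, else the URL basename."""
    disposition = headers.get("Content-Disposition") or headers.get("content-disposition")
    if disposition:
        msg = Message()
        msg["Content-Disposition"] = disposition
        name = msg.get_filename()
        if name:
            # basename() drops any directory part a header might carry.
            return os.path.basename(name)
    # Fall back to the last path segment of the URL, if there is one.
    path = urllib.parse.urlparse(url).path
    return os.path.basename(path) or None


result = pick_filename(
    "http://10.27.24.233:8081/api/v1/convert/pdf/word",
    {"content-disposition": 'form-data; name="attachment"; filename="1219926043.docx"'},
)
print(result)
```

The design choice here is simply that a filename stated by the server is usually more specific than the last URL segment; whether that holds for every endpoint is an assumption.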

@vanche1212
Author

How to configure the detection process?

dosubot bot commented Mar 8, 2025

To configure the detection process to ensure the tool HTTP request recognizes the file extension as .docx instead of .bin when the content type is application/octet-stream, you can follow these steps:

  1. Extract the Filename: First, try to extract the filename from the URL or the Content-Disposition header.
  2. Determine MIME Type: Use mimetypes.guess_type() to determine the MIME type based on the filename and URL.
  3. Content-Based Detection: If the MIME type is still unknown or generic (application/octet-stream), use the python-magic library to perform content-based MIME type detection.
  4. Adjust File Extension: Ensure the filename has the correct extension. Specifically, if the MIME type indicates a .docx file, adjust the extension accordingly.

Here's a Python snippet demonstrating this approach:

import os
import re
import mimetypes
import urllib.parse
from uuid import uuid4

import magic  # python-magic; needs the libmagic system library

def guess_file_info_from_response(response):
    # Step 1: extract the filename from the URL, then from Content-Disposition
    url = str(response.url)
    parsed_url = urllib.parse.urlparse(url)
    url_path = parsed_url.path
    filename = os.path.basename(url_path)

    if not filename:
        content_disposition = response.headers.get("Content-Disposition")
        if content_disposition:
            # Stop at a quote or ';' so the closing quote is not captured
            filename_match = re.search(r'filename="?([^";]+)"?', content_disposition)
            if filename_match:
                filename = filename_match.group(1)

    if not filename:
        unique_name = str(uuid4())
        filename = f"{unique_name}"

    # Step 2: guess the MIME type from the filename, the URL, then Content-Type
    mimetype, _ = mimetypes.guess_type(filename)
    if mimetype is None:
        mimetype, _ = mimetypes.guess_type(url)
    if mimetype is None:
        mimetype = response.headers.get("Content-Type", "application/octet-stream")

    # Step 3: content-based detection when the type is still generic
    if mimetype == "application/octet-stream":
        try:
            mimetype = magic.from_buffer(response.content[:1024], mime=True)
        except magic.MagicException:
            pass

    # Step 4: make sure the filename carries a suitable extension
    extension = os.path.splitext(filename)[1]

    if not extension:
        extension = mimetypes.guess_extension(mimetype) or ".bin"
        filename = f"{filename}{extension}"

    if extension == ".bin" and mimetype == "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
        extension = ".docx"
        filename = f"{os.path.splitext(filename)[0]}.docx"

    return {
        "filename": filename,
        "extension": extension,
        "mimetype": mimetype,
        "size": int(response.headers.get("Content-Length", -1)),
    }

This code ensures that if the MIME type is detected as application/vnd.openxmlformats-officedocument.wordprocessingml.document, the file extension is set to .docx [1].
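
Content sniffing on a short prefix may not always return the Word MIME type, because a .docx file is a ZIP container and can be reported with a generic type. As a fallback, here is a rough standard-library sketch that checks for the `word/document.xml` member a .docx package normally contains. The function name `looks_like_docx` is hypothetical, and the check is a heuristic that needs the full response body rather than the first 1024 bytes.

```python
import io
import zipfile


def looks_like_docx(data: bytes) -> bool:
    """Heuristic: True if data is a ZIP archive containing word/document.xml."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            return "word/document.xml" in archive.namelist()
    except zipfile.BadZipFile:
        # Not a ZIP archive at all, so it cannot be a .docx package.
        return False


# Build a tiny stand-in archive in memory to exercise the check.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", "<w:document/>")
sample = buffer.getvalue()
print(looks_like_docx(sample), looks_like_docx(b"plain bytes"))
```

If the check passes, the extension could be set to `.docx` in the same place where the snippet above handles the `.bin` special case.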
