
.docx parsing not working with mammoth using turbopack #72863


Closed
1 task done
Abil-Shrestha opened this issue Nov 15, 2024 · 7 comments
Labels: bug, locked, Turbopack

Comments

@Abil-Shrestha

Verify canary release

  • I verified that the issue exists in the latest Next.js canary release

Provide environment information

Operating System:
  Platform: darwin
  Arch: arm64
  Version: Darwin Kernel Version 24.1.0: Thu Oct 10 21:02:45 PDT 2024; root:xnu-11215.41.3~2/RELEASE_ARM64_T8112
  Available memory (MB): 16384
  Available CPU cores: 8
Binaries:
  Node: 21.7.2
  npm: 10.5.0
  Yarn: N/A
  pnpm: 8.15.4
Relevant Packages:
  next: 15.0.4-canary.12 // Latest available version is detected (15.0.4-canary.12).
  eslint-config-next: N/A
  react: 19.0.0-rc-69d4b800-20241021
  react-dom: 19.0.0-rc-69d4b800-20241021
  typescript: 5.6.3
Next.js Config:
  output: N/A

Which example does this report relate to?

with-turbopack

What browser are you using? (if relevant)

Brave

How are you deploying your application? (if relevant)

No response

Describe the Bug

Parsing a .docx file with the mammoth library works when I use webpack, but when I run with --turbo (Turbopack) it fails with this error:
Error processing document: Error: Bug : uncompressed data size mismatch
I've attached a screenshot of the error.

[Screenshot of the error, 2024-11-15 at 2:30:39 PM]

Expected Behavior

Expected to parse the text from .docx file using the mammoth library.

To Reproduce

/api/process-docx

import { NextRequest, NextResponse } from 'next/server';
import { DocxLoader } from "@langchain/community/document_loaders/fs/docx";

// App Router route handlers take the runtime from a named export rather than `config`.
export const runtime = 'nodejs';

async function extractTextFromDocx(file: File): Promise<string> {
  const buffer = await file.arrayBuffer();
  const blob = new Blob([buffer]);

  const loader = new DocxLoader(blob);
  const docs = await loader.load();
  return docs.map(doc => doc.pageContent).join('\n');
}

export async function POST(req: NextRequest) {
  try {
    const formData = await req.formData();
    const file = formData.get('file') as File | null;

    if (!file) {
      return NextResponse.json({ error: 'Missing file' }, { status: 400 });
    }

    if (file.type !== 'application/vnd.openxmlformats-officedocument.wordprocessingml.document') {
      return NextResponse.json({ error: 'Unsupported file type. Only DOCX files are accepted.' }, { status: 400 });
    }

    // Extract text from DOCX file
    const text = await extractTextFromDocx(file);

    return NextResponse.json({ 
      success: true, 
      extractedText: text,
    });

  } catch (error) {
    console.error('Error processing DOCX:', error);
    return NextResponse.json({ error: 'Failed to process DOCX' }, { status: 500 });
  }
}
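
The handler above trusts `file.type`, which is supplied by the client. Since a .docx file is a ZIP container, one way to rule out a malformed upload before suspecting the bundler is to check the leading ZIP signature bytes. The following is a minimal sketch, not part of the original reproduction; `looksLikeZip` is a hypothetical helper name.

```typescript
// Hypothetical helper, not from the original report.
// A .docx file is a ZIP archive, so a non-empty one starts with the
// local file header signature "PK\x03\x04" (0x50 0x4b 0x03 0x04).
const ZIP_SIGNATURE: number[] = [0x50, 0x4b, 0x03, 0x04];

function looksLikeZip(bytes: Uint8Array): boolean {
  // Too short to contain the four signature bytes.
  if (bytes.length < ZIP_SIGNATURE.length) return false;
  // Compare the first four bytes against the signature.
  return ZIP_SIGNATURE.every((value, index) => bytes[index] === value);
}

// Assumed usage inside the route handler, before the loader runs:
//   const bytes = new Uint8Array(await file.arrayBuffer());
//   if (!looksLikeZip(bytes)) { /* respond with status 400 */ }
```

If this check passes and the "uncompressed data size mismatch" error still appears only under Turbopack, the upload itself is unlikely to be the cause.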

page.tsx:

'use client'

import { useState, useCallback } from 'react'
import { useDropzone } from 'react-dropzone'
import { Alert, AlertDescription, AlertTitle } from "@/components/ui/alert"
import { AlertCircle } from 'lucide-react'
import { Card, CardContent } from "@/components/ui/card"

export default function Home() {
  const [parsedText, setParsedText] = useState<string | null>(null)
  const [error, setError] = useState<string | null>(null)
  const [isLoading, setIsLoading] = useState(false)

  const onDrop = useCallback(async (acceptedFiles: File[]) => {
    const file = acceptedFiles[0]
    if (file.type !== 'application/vnd.openxmlformats-officedocument.wordprocessingml.document') {
      setError('Please upload a .docx file')
      return
    }

    setIsLoading(true)
    setError(null)

    const formData = new FormData()
    formData.append('file', file)

    try {
      const response = await fetch('/api/process-docx', {
        method: 'POST',
        body: formData,
      })

      if (!response.ok) {
        throw new Error('Failed to process document')
      }

      const result = await response.json()
      setParsedText(result.extractedText)
    } catch (err) {
      setError(`An error occurred while processing the document: ${err instanceof Error ? err.message : String(err)}`)
    } finally {
      setIsLoading(false)
    }
  }, [])

  const { getRootProps, getInputProps, isDragActive } = useDropzone({ onDrop })

  return (
    <div className="container mx-auto p-4">
      <h1 className="text-3xl font-bold mb-4">Upload DOCX File</h1>
      <Card>
        <CardContent>
          <div {...getRootProps()} className="border-2 border-dashed border-gray-300 rounded-lg p-8 text-center cursor-pointer">
            <input {...getInputProps()} />
            {isDragActive ? (
              <p>Drop the DOCX file here ...</p>
            ) : (
              <p>Drag 'n' drop a DOCX file here, or click to select a file</p>
            )}
          </div>
        </CardContent>
      </Card>
      {isLoading && <p className="mt-4">Processing document...</p>}
      {parsedText && (
        <div className="mt-4">
          <h2 className="text-xl font-semibold mb-2">Extracted Text:</h2>
          <pre className="bg-gray-100 p-4 rounded-lg whitespace-pre-wrap">{parsedText}</pre>
        </div>
      )}
      {error && (
        <Alert variant="destructive" className="mt-4">
          <AlertCircle className="h-4 w-4" />
          <AlertTitle>Error</AlertTitle>
          <AlertDescription>{error}</AlertDescription>
        </Alert>
      )}
    </div>
  )
}
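
In the `catch` block of the page above, the caught value is not guaranteed to be an `Error`, and under strict TypeScript settings it is typed `unknown`. A minimal sketch of narrowing it before building the message follows; `toErrorMessage` is a hypothetical helper, not part of the original reproduction.

```typescript
// Hypothetical helper, not from the original report.
// Narrows a caught value of unknown type to a displayable message.
function toErrorMessage(err: unknown): string {
  if (err instanceof Error) return err.message; // thrown Error objects
  if (typeof err === 'string') return err;      // thrown plain strings
  return 'Unknown error';                       // anything else
}

// Assumed usage in the component's catch block:
//   setError(`An error occurred while processing the document: ${toErrorMessage(err)}`)
```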
@Abil-Shrestha Abil-Shrestha added the examples Issue was opened via the examples template. label Nov 15, 2024
@samcx samcx added please add a complete reproduction Please add a complete reproduction. and removed examples Issue was opened via the examples template. labels Nov 15, 2024
Contributor

We cannot recreate the issue with the provided information. Please add a reproduction in order for us to be able to investigate.

Why was this issue marked with the please add a complete reproduction label?

To be able to investigate, we need access to a reproduction to identify what triggered the issue. We prefer a link to a public GitHub repository (template for App Router, template for Pages Router), but you can also use these templates: CodeSandbox: App Router or CodeSandbox: Pages Router.

To make sure the issue is resolved as quickly as possible, please make sure that the reproduction is as minimal as possible. This means that you should remove unnecessary code, files, and dependencies that do not contribute to the issue. Ensure your reproduction does not depend on secrets, 3rd party registries, private dependencies, or any other data that cannot be made public. Avoid a reproduction including a whole monorepo (unless relevant to the issue). The easier it is to reproduce the issue, the quicker we can help.

Please test your reproduction against the latest version of Next.js (next@canary) to make sure your issue has not already been fixed.

If you cannot create a clean reproduction, another way you can help the maintainers is to pinpoint the canary version of next that introduced the issue. Check out our releases, and try to find the first canary release that introduced the issue. This will help us narrow down the scope of the issue, and possibly point to the PR/code change that introduced it. You can install a specific version of next by running npm install next@<version>.

I added a link, why was it still marked?

Ensure the link is pointing to a codebase that is accessible (e.g. not a private repository). "example.com", "n/a", "will add later", etc. are not acceptable links -- we need to see a public codebase. See the above section for accepted links.

What happens if I don't provide a sufficient minimal reproduction?

Issues with the please add a complete reproduction label that receive no meaningful activity (e.g. new comments with a reproduction link) are automatically closed and locked after 30 days.

If your issue has not been resolved in that time and it has been closed/locked, please open a new issue with the required reproduction.

I did not open this issue, but it is relevant to me, what can I do to help?

Anyone experiencing the same issue is welcome to provide a minimal reproduction following the above steps. Furthermore, you can upvote the issue using the 👍 reaction on the topmost comment (please do not comment "I have the same issue" without reproduction steps). Then, we can sort issues by votes to prioritize.

I think my reproduction is good enough, why aren't you looking into it quicker?

We look into every Next.js issue and constantly monitor open issues for new comments.

However, sometimes we might miss one or two due to the popularity/high traffic of the repository. We apologize, and kindly ask you to refrain from tagging core maintainers, as that will usually not result in increased priority.

Upvoting issues to show your interest will help us prioritize and address them as quickly as possible. That said, every issue is important to us, and if an issue gets closed by accident, we encourage you to open a new one linking to the old issue and we will look into it.


@samcx samcx closed this as completed Nov 15, 2024
@digiorgiu
Contributor

[Screenshot, 2024-11-18 at 15:19:36]

Having the same exact issue

@mischnic mischnic reopened this Nov 19, 2024
@mischnic mischnic added Turbopack Related to Turbopack with Next.js. and removed please add a complete reproduction Please add a complete reproduction. labels Nov 19, 2024
@mischnic
Contributor

I took your reproduction and tested it with a random docx, but it worked fine.
So either there is something wrong with that reproduction, or my docx file didn't trigger this.

@mischnic mischnic added the bug Issue was opened via the bug report template. label Nov 19, 2024
@Abil-Shrestha
Author

Thank you for taking the time to look at it. I have created a CodeSandbox that mimics the same behaviour. Here's the link: https://codesandbox.io/p/devbox/zp44fd

After some digging around, I found that the error is related to JSZip: Stuk/jszip#777 (comment)
Turning off turbo resolves this issue, as it has for many others: mwilliamson/mammoth.js#419 (comment)

It'd be nice to have turbo work with it, because the speed improvement is insane.

@mischnic
Contributor

mischnic commented Nov 25, 2024

@Abil-Shrestha Thank you. Looks like that was already fixed with #72608, so in 15.0.4-canary.18 or newer (run pnpm add next@canary to get that).
Could you test with that?
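
For anyone checking whether an installed version is at or past 15.0.4-canary.18, here is a minimal sketch of the comparison. `parseNextVersion` and `isAtOrAfterFixedCanary` are hypothetical helpers, and the sketch assumes version strings shaped like `15.0.4` or `15.0.4-canary.12`. It only applies semver-style ordering: whether a particular stable patch release actually contains the change from #72608 is an assumption to verify separately, since backported patch releases need not include every canary change.

```typescript
// Hypothetical helpers, not from the thread.
interface ParsedVersion {
  major: number;
  minor: number;
  patch: number;
  canary: number | null; // null for a stable release
}

// Parses "X.Y.Z" or "X.Y.Z-canary.N"; returns null for anything else.
function parseNextVersion(version: string): ParsedVersion | null {
  const match = /^(\d+)\.(\d+)\.(\d+)(?:-canary\.(\d+))?$/.exec(version);
  if (!match) return null;
  return {
    major: Number(match[1]),
    minor: Number(match[2]),
    patch: Number(match[3]),
    canary: match[4] === undefined ? null : Number(match[4]),
  };
}

// True when `version` sorts at or after 15.0.4-canary.18 in semver order.
function isAtOrAfterFixedCanary(version: string): boolean {
  const parsed = parseNextVersion(version);
  if (!parsed) return false;
  const core = [parsed.major, parsed.minor, parsed.patch];
  const fixedCore = [15, 0, 4];
  for (let i = 0; i < fixedCore.length; i++) {
    if (core[i] !== fixedCore[i]) return core[i] > fixedCore[i];
  }
  // Same 15.0.4 core: semver sorts a stable release after every canary
  // (assumption: ordering only, not proof that the fix was included).
  return parsed.canary === null || parsed.canary >= 18;
}
```

By this ordering, the version in the original report (15.0.4-canary.12) predates the canary named above.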

@digiorgiu
Contributor

I just checked using next@canary and it is working now! Thank you @mischnic

Contributor

This closed issue has been automatically locked because it had no new activity for 2 weeks. If you are running into a similar issue, please create a new issue with the steps to reproduce. Thank you.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 11, 2024