XML extractor gets triggered on HTML page #293

Open
@NGTmeaty

Description

URL.GetMIMEType(), as checked in IsXML, appears to return text/xml; charset=utf-8 for http://laborculture.org. Based on the response headers and the content, this is incorrect: the page is HTML, so the XML extractor should not be triggered for it.

time=2025-05-21T18:39:48.614-04:00 level=INFO msg="url archived" worker_id=0 component=archiver.archive url=http://laborculture.org/ seed_id=03b36 item_id=03b36 depth=0 hops=0 status=200
time=2025-05-21T18:39:48.614-04:00 level=ERROR msg="unable to extract assets" component=postprocessor.extractAssets err="xml: encoding \"ISO-8859-1\" declared but Decoder.CharsetReader is nil" item=03b36
time=2025-05-21T18:39:48.614-04:00 level=ERROR msg="unable to extract assets" component=postprocessor.postprocess.postprocessItem err="xml: encoding \"ISO-8859-1\" declared but Decoder.CharsetReader is nil" item_id=03b36

Metadata

Assignees

No one assigned

    Labels

    P3 (normal priority), bug (Something isn't working)
