Turn off DTD validation for HTML 4.01 loose.dtd?
Hi all,
I'm processing a set of hOCR files that were generated with the following <!DOCTYPE>:
Code: Select all
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
Saxon-EE has issues with the DTD, apparently because it is an SGML DTD (yes? I'm not sure).
I can strip the <!DOCTYPE> line from the HTML, but I was curious why turning off the 'DTD validation of the source' option (Options > Preferences > XSLT-FO-XQUERY > XSLT > Saxon > Saxon-HE/PE/EE) doesn't seem to have an effect.
Is this just how it is (Saxon recognizes the document as SGML and looks for the DTD), or is there a work-around that doesn't involve a preprocessing step?
Thanks!
PS I'm converting these hOCR files to XML and the error I'm getting is:
Code: Select all
System ID: http://www.w3.org/TR/html4/loose.dtd
Severity: fatal
Description: The declaration for the entity "HTML.Version" must end with '>'.
Start location: 31:3
Re: Turn off DTD validation for HTML 4.01 loose.dtd?
Hi,
Yes, that DTD seems to be of the SGML flavor.
The 'DTD validation of the source' setting (disabled by default) refers strictly to an optional validation of the source XML. It does not affect XML parsing (building the XML model), which always makes use of the DTD specified in the DOCTYPE, since for XML the DTD is an integral part of the model. The Saxon transformation is then applied to that XML model.
So this doesn't actually have to do with Saxon, but with the XML parser (Xerces) that Oxygen configures and uses. Oxygen does not provide a setting for completely bypassing the DOCTYPE during XML parsing.
One possible workaround is to create an XML catalog that resolves the PUBLIC ID and/or SYSTEM ID to a dummy DTD (or even the real XHTML 1.0 Transitional DTD).
catalog.xml
Code: Select all
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
<public publicId="-//W3C//DTD HTML 4.01 Transitional//EN" uri="dummy.dtd"/>
<system systemId="http://www.w3.org/TR/html4/loose.dtd" uri="dummy.dtd"/>
</catalog>
dummy.dtd
Code: Select all
<!ELEMENT html ANY>
Place these two files (catalog and DTD) in the same folder and configure the XML catalog in Options > Preferences, XML / XML Catalog.
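If you would rather generate the two files than create them by hand, something along these lines should do it. This is only a sketch in Python (not something Oxygen provides); the function name write_dummy_catalog and the target folder are made up for illustration, and the file contents are simply the catalog and DTD shown above.

```python
from pathlib import Path

# Contents taken from the catalog.xml and dummy.dtd listings in this thread.
CATALOG_XML = """<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <public publicId="-//W3C//DTD HTML 4.01 Transitional//EN" uri="dummy.dtd"/>
  <system systemId="http://www.w3.org/TR/html4/loose.dtd" uri="dummy.dtd"/>
</catalog>
"""

DUMMY_DTD = "<!ELEMENT html ANY>\n"


def write_dummy_catalog(folder):
    """Write catalog.xml and dummy.dtd side by side in `folder` (hypothetical helper)."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    catalog = folder / "catalog.xml"
    dtd = folder / "dummy.dtd"
    # The catalog refers to dummy.dtd with a relative URI, so both files
    # have to end up in the same folder.
    catalog.write_text(CATALOG_XML, encoding="utf-8")
    dtd.write_text(DUMMY_DTD, encoding="utf-8")
    return catalog, dtd
```

For example, write_dummy_catalog("hocr-catalog") creates both files in a folder named hocr-catalog; the resulting catalog.xml still has to be added to the XML catalog list in the preferences by hand.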
You will still get validation errors because of the limited dummy.dtd, but this will allow you to use the HTML as the input of a transformation.
Note that if the HTML is not XML well-formed, this won't help with anything and you're better off importing the HTML with File > Import > HTML File....
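To find out up front which hOCR files are XML well-formed, a quick check outside Oxygen can help. The following is a minimal sketch in Python, assuming the standard xml.etree.ElementTree module; the function name is_well_formed is just for illustration. Python's bundled parser does not fetch the external DTD, so the DOCTYPE line by itself does not make the check fail, which is different from the Xerces behavior described above. Entity references that are only declared in the DTD (for example &nbsp;) will be reported as errors by this check.

```python
import xml.etree.ElementTree as ET


def is_well_formed(path):
    """Return (True, None) if the file parses as XML, else (False, error message)."""
    try:
        # ElementTree uses expat, which does not load the external DTD
        # named in the DOCTYPE declaration.
        ET.parse(path)
    except ET.ParseError as err:
        return False, str(err)
    return True, None
```

Files for which this returns False (unclosed <br> or <img> tags are typical causes) are the ones that will need File > Import > HTML File... or some other cleanup before they can be used as transformation input.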
Regards,
Adrian
Adrian Buza
<oXygen/> XML Editor, Schema Editor and XSLT Editor/Debugger
http://www.oxygenxml.com