DOCTYPE not allowed in content

DanL
Posts: 4
Joined: Wed May 08, 2024 12:40 am

Post by DanL »

Hey, thanks for looking!

I have some lovely MIL-STD-40051 xml that is DTD 6.5 compliant.

I usually work in Arbortext but I'm trying Oxygen again.

I have 200+ individual xml documents, and a "wrapper" file that references them all as entities for publication.

XML from the wrapper:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE production PUBLIC "-//USA-DOD//DTD -1/2C TM Assembly REV C 6.5 20200930//EN" "../40051c_6_5/40051c_6_5.dtd" [
<!ENTITY g00001 SYSTEM "../xmlfiles/g00001.xml">
<!ENTITY g00003 SYSTEM "../xmlfiles/g00003.xml">
<!ENTITY g00006 SYSTEM "../xmlfiles/g00006.xml">
<!ENTITY o00001 SYSTEM "../xmlfiles/o00001.xml">
<!ENTITY o00002 SYSTEM "../xmlfiles/o00002.xml">
<!ENTITY o00003 SYSTEM "../xmlfiles/o00003.xml">
<!ENTITY o00004 SYSTEM "../xmlfiles/o00004.xml">
<!ENTITY o00005 SYSTEM "../xmlfiles/o00005.xml">
<!ENTITY o00006 SYSTEM "../xmlfiles/o00006.xml">

I'll spare you the full list; all the separate documents are declared.

In Arbortext, each document is imported by reference:

<!-- gim -->
<!-- <!ELEMENT gim (titlepg, ((ginfowp, (bdar-geninfowp | (descwp+, thrywp*) | dmwr_introwp)) | (softginfowp, softsumwp, softeffectwp*, softdiffversionwp*) | (genmaint_ginfowp, descwp) | (pm-ginfowp) | (pms-ginfowp)))> -->
<gim chap-toc="no" chngno="0" revno="0"><titlepg maintlvl="operator">
<name>BIG GREEN TRUCK</name>
</titlepg>
<!--  Intro and theory of operation  g00001 -->&g00001;
<!--  Equipment Description and Data  g00006 -->&g00006;
<!--  Theory of Operation  g00003 -->&g00003;
</gim>
<!-- opim -->
<!-- <!ELEMENT opim (titlepg, ((ctrlindwp+, opusualwp+, opunuwp+, emergencywp*, stowagewp*, eqploadwp*) | dmwr_operationalreqwp*))> -->
<opim chap-toc="no" chngno="0" revno="0"><titlepg maintlvl="operator">
<name>BIG GREEN TRUCK</name>
</titlepg>
<!-- Chapter 2 - Operator Instructions -->
<!-- DESCRIPTION AND USE OF OPERATOR CONTROLS AND INDICATORS per 40051-->
<!--  instrument panel  -->&o00001;
<!--  aux panel  -->&o00059;
<!--  center console  -->&o00091;
<!--  steering col  -->&o00092;
<!--  floor  -->&o00093;
<!--  seat  -->&o00095;
<!--  door  -->&o00094;

When I try to validate in Oxygen Developer, it stops at the first entity reference, &g00001;,
reporting a fatal error from Xerces:

A DOCTYPE is not allowed in content.

The error is coming from the doctype in the first file being referenced as an entity.

And it's legit, every one of those ~150 XML documents has a DOCTYPE declaration.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE ginfowp PUBLIC "-//USA-DOD//DTD -1/2C TM Assembly REV C 6.5 20200930//EN"  "../40051c_6_5/40051c_6_5.dtd">
<ginfowp wpno="g00001">


Arbortext does not report any of this as an error and will happily generate a PDF from the "wrapper" document using the XSL style sheets provided with the DTD. I've been working with AT since version 5. I keep trying to cut over to Oxygen, but the learning curve is kind of steep and I keep getting frustrated.

There is probably some simple solution, but I cannot for the life of me find it.


The other issue I am seeing is that when Oxygen validates individual documents, it flags references to other documents as validation errors.

A generic link to another document <xref wpid="m00001"> should go to XML document m00001 when published, but does not go anywhere when considering the file by itself, and that is OK.

I did figure out how to ignore that validation error (I think).

Any tips on how to get Oxygen to check that would be swell, too.

Thanks in advance for any help.

Dan
Radu
Posts: 9179
Joined: Fri Jul 09, 2004 5:18 pm

Re: DOCTYPE not allowed in content,

Post by Radu »

Hello Dan,

> Arbortext does not report any of this as an error and will happily generate a PDF from the "wrapper" document using the XSL style sheets provided with the DTD. I've been working with AT since version 5. I keep trying to cut over to Oxygen, but the learning curve is kind of steep and I keep getting frustrated.
> There is probably some simple solution, but I cannot for the life of me find it.

Oxygen uses the Apache Xerces parser to parse and validate XML documents. The XSLT processors bundled with Oxygen use the same parser.
According to the XML specification:
https://www.w3.org/TR/xml/#intern-replacement
an external entity reference must be expanded to its exact content in the XML document.
I understand from your description that Arbortext skips the DOCTYPE declaration in the referenced file when it expands the reference. I can see why this is useful, but it is not correct according to the XML specification; it is probably something only Arbortext does.
This particular problem, entity references to files that contain their own DOCTYPE declaration, is also described on the page below, and the workaround given there is to use xi:include elements instead of entity references:
https://www.oxygenxml.com/doc/versions/ ... ities.html
Other than that, I consider Oxygen's behavior correct according to the XML specification.
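
To make the rule concrete, here is a minimal sketch of what a file is allowed to look like when it is pulled in through an entity reference such as &g00001;. It may start with a text declaration, but it cannot carry its own DOCTYPE. The element content is a placeholder, not the actual work package, and a file stripped down like this can no longer be validated on its own against the DTD:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- g00001.xml used as an external parsed entity:
     the text declaration above is allowed, a DOCTYPE declaration is not. -->
<ginfowp wpno="g00001">
  <!-- placeholder for the real work package content -->
</ginfowp>
```

That trade-off (losing standalone validation of each module) is the main reason xi:include is usually the more comfortable route here.
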
> The other issue I am seeing is that when Oxygen validates individual documents, it flags references to other documents as a validation error.
> A generic link to another document <xref wpid="m00001"> should go to XML document m00001 when published, but does not go anywhere when considering the file by itself, and that is OK.

So the module file on its own is invalid: it has an IDREF pointing to a missing ID. It is only valid in the context of a larger XML file that includes the other files. Oxygen's Main Files support may help with this: https://www.oxygenxml.com/doc/ug-editor ... iting.html

Regards,
Radu
Radu Coravu
<oXygen/> XML Editor
http://www.oxygenxml.com

Post by DanL »

I am entirely willing to believe that Oxygen is parsing correctly and that this is an error.
I am also willing to believe that Arbortext ignoring it is a kludge for the 40051 user base.
The color difference in the link in the first referenced article was subtle and I missed it the first time through. Second time around, I caught it.

<xi:include href="a.xml" xpointer="a1"
        xmlns:xi="http://www.w3.org/2001/XInclude"/>
https://www.w3.org/TR/xinclude-11/

Arbortext/WC uses something similar when used with the Windchill CMS, which is probably doing better parsing.

I will check what we are doing in that environment, and give it a try here. I'll follow up with results.

Thanks so much for the pointer!
Dan

Post by Radu »

Hi Dan,
Right. At least with the way the XML parser bundled with Oxygen works, xi:include elements are the better fit in this case.
Regards,
Radu

Post by DanL »

Yay, progress, sort of.

Swapped out the entity references for xi:include.

<!-- <!ELEMENT gim (titlepg, ((ginfowp, (bdar-geninfowp | (descwp+, thrywp*) | dmwr_introwp)) | (softginfowp, softsumwp, softeffectwp*, softdiffversionwp*) | (genmaint_ginfowp, descwp) | (pm-ginfowp) | (pms-ginfowp)))> -->
<gim chap-toc="no" chngno="0" revno="0">
<titlepg maintlvl="operator"><name>BIG GREEN TRUCK</name></titlepg>
<xi:include href="./xmlfiles/g00001.xml"  xmlns:xi="http://www.w3.org/2001/XInclude"/>
<xi:include href="./xmlfiles/g00006.xml"  xmlns:xi="http://www.w3.org/2001/XInclude"/>
<xi:include href="./xmlfiles/g00003.xml"  xmlns:xi="http://www.w3.org/2001/XInclude"/>
</gim>

That solved my initial "DOCTYPE is not allowed in content" error. YAY.

New problem:

Attribute "xml:base" is not allowed to appear in element "ctrlindwp".
According to the specification for XInclude, processors must add an xml:base attribute to elements included from locations with a different base URI.
https://xerces.apache.org/xerces2-j/faq ... html#faq-3

If I understand that FAQ entry, I could fix this in the xml schema (XSD?) if I had a schema, but I don't, there is just a DTD.

Apparently there is also a way to turn this off in Xerces? But this is all new territory and I cannot figure it out.

I moved the wrapper file into the same directory as all the other XML files, same thing:

<gim chap-toc="no" chngno="0" revno="0">
<titlepg maintlvl="operator"><name>BIG GREEN TRUCK</name></titlepg>
<xi:include href="./g00001.xml"  xmlns:xi="http://www.w3.org/2001/XInclude"/>
<xi:include href="./g00006.xml"  xmlns:xi="http://www.w3.org/2001/XInclude"/>
<xi:include href="./g00003.xml"  xmlns:xi="http://www.w3.org/2001/XInclude"/>
</gim>

I am also getting this message about every graphic:

ENTITY "G00006_01" is not unparsed.

Graphics are all listed in the DOCTYPE declaration in each file:

<!ENTITY G00006_01 SYSTEM "../Graphics/G-Introductory/G00006_01.svg" NDATA SVG>
and then referenced in the document:

<figure><title>Title text, yadda yadda.</title><graphic boardno="G00006_01"></graphic></figure>

Again, any help would be appreciated. I am reasonably clever but I lack the background knowledge.

Thanks,
Dan

Post by Radu »

Hello Dan,

So:
> Attribute "xml:base" is not allowed to appear in element "ctrlindwp".
> According to the specification for XInclude, processors must add an xml:base attribute to elements included from locations with a different base URI.
> If I understand that FAQ entry, I could fix this in the xml schema (XSD?) if I had a schema, but I don't, there is just a DTD.
> Apparently there is also a way to turn this off in Xerces? But this is all new territory and I cannot figure it out.

When the main XML document containing the xi:include elements is validated or processed with XSLT, the includes are expanded in place. During expansion, the XML processor adds an "xml:base" attribute to each top-level element that came from an xi:included file, so that relative references inside the included content can still be resolved correctly. To avoid the validation error, the DTD would therefore need to declare "xml:base" as an allowed attribute on all elements, or at least on the elements that are usually the root of an xi:included file. In your case, in the ATTLIST definition of the element "ginfowp" you would need to add something like:

xml:base CDATA #IMPLIED

The alternative is to go to the Oxygen Preferences->"XML / XML Parser" page and disable the "Base URI fixup" checkbox. Oxygen should then no longer generate those hidden xml:base attributes when validating or processing XML files containing xi:includes. The downside is that if a module XML file contains a relative reference to some other location, that reference will appear in the processed master XML document exactly as it was written, without the xml:base that would have indicated which folder it should be resolved against.
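
As a minimal sketch of the first option, assuming you would rather not edit the DTD file itself: XML allows more than one ATTLIST declaration for the same element, so the extra attribute could be declared in the internal subset of the wrapper's DOCTYPE instead. The two element names below are simply the ones mentioned in this thread; every element that is the root of an xi:included file would need its own line. This is untested against the 40051 DTD:

```xml
<!DOCTYPE production PUBLIC "-//USA-DOD//DTD -1/2C TM Assembly REV C 6.5 20200930//EN" "../40051c_6_5/40051c_6_5.dtd" [
<!-- Extra attribute declarations, merged with the ATTLISTs the DTD already defines. -->
<!ATTLIST ginfowp   xml:base CDATA #IMPLIED>
<!ATTLIST ctrlindwp xml:base CDATA #IMPLIED>
]>
```

Keeping these declarations in the wrapper leaves the delivered DTD untouched, which may matter if the DTD is supposed to stay exactly as published.
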
> I am also getting this message about every graphic:
> ENTITY "G00006_01" is not unparsed.
> Graphics are all listed in the DOCTYPE declaration in each file:
> <!ENTITY G00006_01 SYSTEM "../Graphics/G-Introductory/G00006_01.svg" NDATA SVG>
> and then referenced in the document:
> <figure><title>Title text, yadda yadda.</title><graphic boardno="G00006_01"></graphic></figure>

I managed to reproduce this situation on my side.
According to the xi:include specs:
https://www.w3.org/TR/xinclude/#unparsed-entities
> Any unparsed entity information item appearing in the references property of an attribute on the included items or any descendant thereof is added to the unparsed entities property of the result infoset's document information item, if it is not a duplicate of an existing member. Duplicates do not appear in the result infoset.
> Unparsed entity items with the same name, system identifier, public identifier, declaration base URI, notation name, and notation are considered to be duplicate. An application may also be able to detect that unparsed entities are duplicate through other means. For instance, the URI resulting from combining the system identifier and the declaration base URI is the same.

So I interpret this to mean that if the XML module defines an unparsed entity, when it gets included in the master XML document it should enrich the master XML document's DOCTYPE by declaring also this unparsed entity there.
But the Apache Xerces XML parser that we are using does not seem to follow the specification in this regard.
I added an internal issue so we can analyze the problem further and possibly patch the parser to behave closer to the specification. The issue ID is pasted below for future reference:

EXM-54482 Unparsed entities are not added to larger infoset when resolving xi:includes
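
To make that interpretation concrete, this is roughly what the wrapper's DOCTYPE would look like if the module's unparsed entity were carried over by hand. It is a sketch only: it assumes the SVG notation is declared by the 40051 DTD, and whether the parser then accepts the boardno reference is untested. The system identifier is resolved relative to the wrapper file, so the path shown here, copied from the module, may need adjusting:

```xml
<!DOCTYPE production PUBLIC "-//USA-DOD//DTD -1/2C TM Assembly REV C 6.5 20200930//EN" "../40051c_6_5/40051c_6_5.dtd" [
<!-- Graphic entity repeated from the module's own DOCTYPE so that it is
     declared in the including document as well. -->
<!ENTITY G00006_01 SYSTEM "../Graphics/G-Introductory/G00006_01.svg" NDATA SVG>
]>
```
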

To help us set a priority for this internal issue: is your overall goal to migrate your editing and publishing from Arbortext to Oxygen? And roughly how many people on your side would use Oxygen if that became feasible?

Regards,
Radu