this proposal is based on the one in a previous comment, simplified a bit in the core and then extended in the hope of maximising interoperability.
the change is to remove the distinction between "mark elements" and "other elements" as a requirement for producers (which is presumably what Word and Calligra Words already do), and make the distinction optional for consumers (because 1. for documents produced according to the simplified algorithm, the distinction does not make a difference, so allowing it should be harmless; 2. existing ODF 1.2 documents written by OOo/LO/AOO rely on this distinction).
a note about nested paragraphs, in case you were wondering: if text elements are nested, the inner one always occurs inside some other element that doesn't allow character content, so the inner one is completely removed by step 2 of the algorithm; thus the algorithm does not mix content of nested paragraphs.
i have a prototype patch to adapt the LO ODF export to this for all ODF versions (and also fix the text:meta-field bug that i mentioned in a previous comment), and it appears to work nicely on the whitespace.odt test document; hope this can ship with LO 5.4.
Collapsing white space characters inside a paragraph element is defined by the following algorithm:
1) Descendant <text:ruby> elements are replaced with their <text:ruby-base> child elements.
2) Descendant elements of the paragraph element which are not <text:s>, <text:tab> or <text:line-break> elements and for which the OpenDocument schema does not permit <text:s>, <text:tab> and <text:line-break> as child elements are removed from the paragraph element.
3) Descendant elements of the paragraph element for which the OpenDocument schema permits <text:s>, <text:tab> and <text:line-break> as child elements are replaced by their character data and <text:s>, <text:tab> and <text:line-break> element children.
4) Original ODF 1.2 step 1: U+0009, U+000D and U+000A are replaced with U+0020.
5) Original ODF 1.2 step 3: leading U+0020 characters are removed.
6) Original ODF 1.2 step 4: sequences of U+0020 characters are replaced with a single U+0020.
7) The remaining <text:s>, <text:tab> and <text:line-break> elements are interpreted as the [UNICODE] white space characters they represent.
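to make the steps concrete, here is a minimal Python sketch of how a consumer might implement them. it is only an illustration under assumptions, not the LO patch: the names `flatten`, `collapse` and `PERMITS_WS` are hypothetical, `PERMITS_WS` is a small hand-picked subset standing in for a real schema lookup, `<text:line-break>` is assumed to map to U+000A, and step 5 follows the shorthand above literally (leading U+0020 only).

```python
import re
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

def q(name):
    """Qualified ElementTree tag name in the ODF text namespace."""
    return f"{{{TEXT_NS}}}{name}"

# Assumption: a hand-picked subset of the elements for which the schema
# permits <text:s>, <text:tab> and <text:line-break> children.
PERMITS_WS = {q(n) for n in ("p", "h", "span", "a", "meta", "meta-field", "ruby-base")}

def flatten(elem, out):
    """Steps 1-3: collect character data (str) and white space elements
    (("ws", chars) tuples) in document order."""
    if elem.text:
        out.append(elem.text)
    for child in elem:
        if child.tag == q("ruby"):
            # Step 1: keep only the <text:ruby-base> child.
            base = child.find(q("ruby-base"))
            if base is not None:
                flatten(base, out)
        elif child.tag == q("s"):
            out.append(("ws", " " * int(child.get(q("c"), "1"))))
        elif child.tag == q("tab"):
            out.append(("ws", "\t"))
        elif child.tag == q("line-break"):
            out.append(("ws", "\n"))  # assumption: line break modelled as U+000A
        elif child.tag in PERMITS_WS:
            flatten(child, out)  # step 3
        # Step 2: any other element is dropped together with its content.
        if child.tail:
            out.append(child.tail)

def collapse(paragraph):
    tokens = []
    flatten(paragraph, tokens)
    # Join adjacent character data so that runs of spaces on both sides of
    # a removed element are collapsed together.
    merged = []
    for tok in tokens:
        if isinstance(tok, str) and merged and isinstance(merged[-1], str):
            merged[-1] += tok
        else:
            merged.append(tok)
    result = []
    for i, tok in enumerate(merged):
        if isinstance(tok, str):
            tok = re.sub("[\t\r\n]", " ", tok)  # step 4
            if i == 0:
                tok = tok.lstrip(" ")           # step 5 (leading only)
            tok = re.sub(" +", " ", tok)        # step 6
            result.append(tok)
        else:
            result.append(tok[1])               # step 7
    return "".join(result)

sample = (
    f'<text:p xmlns:text="{TEXT_NS}">  a \n <text:span>b </text:span>'
    '<text:note><text:note-body><text:p>x</text:p></text:note-body></text:note>'
    ' c<text:s text:c="2"/> d<text:tab/>e</text:p>'
)
print(repr(collapse(ET.fromstring(sample))))  # prints 'a b c   d\te'
```

in the sample, the nested paragraph inside the note disappears with the note in step 2, so its "x" never reaches the result, and the spaces on either side of the note collapse into one.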
OpenDocument producers shall produce paragraph elements that, when consumed according to this algorithm, result in the expected amount of white space.
OpenDocument consumers shall either process white space such that the result is equivalent to the result of the given algorithm, or implement a variation that increases interoperability with popular OpenDocument 1.2 producers. The variation replaces step 2 of the algorithm with steps 2a and 2b:
2a) Descendant elements of the paragraph element that are mark elements (
<text:change> 5.5.7.4
<text:change-end> 5.5.7.3
<text:change-start> 5.5.7.2
<text:bookmark> 6.2.1.2
<text:bookmark-end> 6.2.1.4
<text:bookmark-start> 6.2.1.3
<text:reference-mark> 6.2.2.2
<text:reference-mark-end> 6.2.2.4
<text:reference-mark-start> 6.2.2.3
<text:toc-mark> 8.1.4
<text:toc-mark-end> 8.1.3
<text:toc-mark-start> 8.1.2
<text:user-index-mark> 8.1.7
<text:user-index-mark-end> 8.1.6
<text:user-index-mark-start> 8.1.5
<text:alphabetical-index-mark> 8.1.10
<text:alphabetical-index-mark-end> 8.1.9
<text:alphabetical-index-mark-start> 8.1.8
) are removed from the paragraph element.
2b) Descendant elements of the paragraph element which are not <text:s>, <text:tab> or <text:line-break> elements and for which the OpenDocument schema does not permit <text:s>, <text:tab> and <text:line-break> as child elements are replaced with a hypothetical <text:s text:c="0"/> element.
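a small Python sketch of how the variation changes the outcome. it works on a hypothetical token model instead of real XML: the paragraph is assumed to be already flattened by steps 1 and 3, and the token kinds "text", "ws", "mark" and "other" as well as the function name `collapse_variation` are made up for this illustration. the zero-width `<text:s text:c="0"/>` of step 2b is modelled as a white space token with no characters, and step 5 again removes leading U+0020 only.

```python
import re

def collapse_variation(tokens):
    """tokens: list of (kind, value) pairs, where kind is one of
    "text"  - character data,
    "ws"    - a <text:s>, <text:tab> or <text:line-break>, value = its characters,
    "mark"  - one of the mark elements listed in step 2a,
    "other" - any other element that does not permit white space children."""
    kept = []
    for kind, value in tokens:
        if kind == "mark":
            continue                  # step 2a: removed without a trace
        if kind == "other":
            kind, value = "ws", ""    # step 2b: zero-width separator
        kept.append((kind, value))
    # Join adjacent character data; separators keep the runs apart.
    parts = []
    for kind, value in kept:
        if kind == "text" and parts and parts[-1][0] == "text":
            parts[-1] = ("text", parts[-1][1] + value)
        else:
            parts.append((kind, value))
    result = []
    for i, (kind, value) in enumerate(parts):
        if kind == "text":
            value = re.sub("[\t\r\n]", " ", value)  # step 4
            if i == 0:
                value = value.lstrip(" ")           # step 5 (leading only)
            value = re.sub(" +", " ", value)        # step 6
        result.append(value)                        # step 7 for "ws" tokens
    return "".join(result)

tokens = [
    ("text", "a "), ("mark", None), ("text", " b "),
    ("other", None), ("text", " c"),
]
print(repr(collapse_variation(tokens)))  # prints 'a b  c'
```

the spaces around the mark collapse into one, while the spaces around the other element are kept apart by the separator and both survive; under the unmodified step 2 the same input would give 'a b c'.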