[OFFICE-2102] Member Proposal: Input field normalization - OASIS Technical Committees Issue Tracker

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: ODF 1.1, ODF 1.2
Fix Version/s: ODF 1.3
Component/s: Fields, Part 3 (Schema) [1.2: 1]
Labels:
None

Proposal:

I.

3.18 White Space Processing and EOL Handling

in the Note, in "their element children. 6.1.2", replace "element children" with "descendant elements".

II.

6.1.2 White Space Characters

replace:

"* in their descendant elements, if the OpenDocument schema permits the inclusion of character data for the element itself and all its ancestor elements up to the paragraph element."

with:

"* in their descendant elements, if the OpenDocument schema permits <text:s> [6.1.3], <text:tab> [6.1.4] and <text:line-break> [6.1.5] as element content."

replace the entire algorithm with:

<quote>
Collapsing white space characters inside a paragraph element is defined by the following algorithm:

1) Descendant <text:ruby> elements are replaced with their <text:ruby-base> child elements.

2) Descendant elements of the paragraph element which are not <text:s>, <text:tab> or <text:line-break> elements and for which the OpenDocument schema does not permit <text:s>, <text:tab> and <text:line-break> as child elements are removed from the paragraph element.

3) Descendant elements of the paragraph element for which the OpenDocument schema permits <text:s>, <text:tab> and <text:line-break> as child elements are replaced by their character data and <text:s>, <text:tab> and <text:line-break> element children.

4) original ODF 1.2 step 1) (U+0009 U+000D U+000A -> U+0020 replacement)

5) original ODF 1.2 step 3) remove leading U+0020

6) original ODF 1.2 step 4) replace many U+0020 with one

7) The remaining <text:s>, <text:tab> and <text:line-break> elements are interpreted as the [UNICODE] white space characters they represent.

OpenDocument producers shall produce paragraph elements that, when consumed according to this algorithm, result in the expected amount of white space.

OpenDocument consumers shall either process white space such that the result is equivalent to the result of the given algorithm, or implement a variation that increases interoperability with popular OpenDocument 1.2 producers. The variation replaces step 2 of the algorithm with steps 2a and 2b:

2a) Descendant elements of the paragraph element that are mark elements (
<text:change> 5.5.7.4
<text:change-end> 5.5.7.3
<text:change-start> 5.5.7.2
<text:bookmark> 6.2.1.2
<text:bookmark-end> 6.2.1.4
<text:bookmark-start> 6.2.1.3
<text:reference-mark> 6.2.2.2
<text:reference-mark-end> 6.2.2.4
<text:reference-mark-start> 6.2.2.3
<text:toc-mark> 8.1.4
<text:toc-mark-end> 8.1.3
<text:toc-mark-start> 8.1.2
<text:user-index-mark> 8.1.7
<text:user-index-mark-end> 8.1.6
<text:user-index-mark-start> 8.1.5
<text:alphabetical-index-mark> 8.1.10
<text:alphabetical-index-mark-end> 8.1.9
<text:alphabetical-index-mark-start> 8.1.8
) are removed from the paragraph element.

2b) Descendant elements of the paragraph element which are not <text:s>, <text:tab> or <text:line-break> elements and for which the OpenDocument schema does not permit <text:s>, <text:tab> and <text:line-break> as child elements are replaced with a hypothetical <text:s text:c="0"/> element.

</quote>

III.

6.1.2 White Space Characters

following the algorithm, add a note that generic pretty-printing is not reliable:

"Note: XML formatting software that does not implement the ODF whitespace rules might introduce or remove spaces."

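The effect the note warns about can be reproduced with a generic pretty-printer. The sketch below is illustrative only: `collapsed_text` is a hypothetical helper that applies a simplified collapse (character data concatenated, U+0009/U+000D/U+000A replaced by U+0020, leading and trailing spaces stripped, runs reduced to one space), which is sufficient for a paragraph containing only character data and `<text:span>`. Python's `xml.dom.minidom` stands in for "XML formatting software that does not implement the ODF whitespace rules".

```python
import re
import xml.dom.minidom
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def collapsed_text(xml_source: str) -> str:
    """Simplified white space collapse for a paragraph with only text and spans."""
    root = ET.fromstring(xml_source)
    text = "".join(root.itertext())           # concatenate character data
    text = re.sub("[\t\r\n]", " ", text)      # TAB, CR, LF -> SPACE
    text = text.strip(" ")                    # drop leading/trailing spaces
    return re.sub(" {2,}", " ", text)         # many spaces -> one


original = f'<text:p xmlns:text="{TEXT_NS}">ab<text:span>cd</text:span></text:p>'

# Generic pretty-printing adds newlines and indentation around the child nodes.
pretty = xml.dom.minidom.parseString(original).toprettyxml(indent="  ")

before = collapsed_text(original)
after = collapsed_text(pretty)
print(repr(before))  # 'abcd'
print(repr(after))   # 'ab cd'
```

The indentation inserted between `ab` and the span survives collapsing as a single space, so the paragraph text changes from `abcd` to `ab cd`.
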
Resolution:

[see proposal]

Description

http://wiki.oasis-open.org/office/InputFields

Attachments

Activity

14 older comments

Michael Stahl (Inactive) added a comment - 11/Apr/17 3:18 PM - edited

this proposal is based on the one in a previous comment, simplified a bit in the core and then extended with the hope to maximise interoperability.

the change is to remove the distinction between "mark elements" and "other elements" as a requirement for producers (which is presumably what Word and Calligra Words already do), and to make the distinction optional for consumers, for two reasons: 1. for documents produced according to the simplified algorithm, the distinction makes no difference, so allowing it should be harmless; 2. existing ODF 1.2 documents written by OOo/LO/AOO rely on this distinction.

a note about nested paragraphs, in case you were wondering: if paragraph elements are nested, the inner one always occurs inside some other element that doesn't allow character content, so it is completely removed by step 2 of the algorithm; thus the algorithm does not mix the content of nested paragraphs.

i have a prototype patch to adapt the LO ODF export to this for all ODF versions (and also fix the text:meta-field bug that i mentioned in a previous comment), and it appears to work nicely on the whitespace.odt test document; hope this can ship with LO 5.4.

Collapsing white space characters inside a paragraph element is defined by the following algorithm:

1) Descendant <text:ruby> elements are replaced with their <text:ruby-base> child elements.

2) Descendant elements of the paragraph element which are not <text:s>, <text:tab> or <text:line-break> elements and for which the OpenDocument schema does not permit <text:s>, <text:tab> and <text:line-break> as child elements are removed from the paragraph element.

3) Descendant elements of the paragraph element for which the OpenDocument schema permits <text:s>, <text:tab> and <text:line-break> as child elements are replaced by their character data and <text:s>, <text:tab> and <text:line-break> element children.

4) original ODF 1.2 step 1) (U+0009 U+000D U+000A -> U+0020 replacement)

5) original ODF 1.2 step 3) remove leading U+0020

6) original ODF 1.2 step 4) replace many U+0020 with one

7) The remaining <text:s>, <text:tab> and <text:line-break> elements are interpreted as the [UNICODE] white space characters they represent.

OpenDocument producers shall produce paragraph elements that, when consumed according to this algorithm, result in the expected amount of white space.

OpenDocument consumers shall either process white space such that the result is equivalent to the result of the given algorithm, or implement a variation that increases interoperability with popular OpenDocument 1.2 producers. The variation replaces step 2 of the algorithm with steps 2a and 2b:

2a) Descendant elements of the paragraph element that are mark elements (
<text:change> 5.5.7.4
<text:change-end> 5.5.7.3
<text:change-start> 5.5.7.2
<text:bookmark> 6.2.1.2
<text:bookmark-end> 6.2.1.4
<text:bookmark-start> 6.2.1.3
<text:reference-mark> 6.2.2.2
<text:reference-mark-end> 6.2.2.4
<text:reference-mark-start> 6.2.2.3
<text:toc-mark> 8.1.4
<text:toc-mark-end> 8.1.3
<text:toc-mark-start> 8.1.2
<text:user-index-mark> 8.1.7
<text:user-index-mark-end> 8.1.6
<text:user-index-mark-start> 8.1.5
<text:alphabetical-index-mark> 8.1.10
<text:alphabetical-index-mark-end> 8.1.9
<text:alphabetical-index-mark-start> 8.1.8
) are removed from the paragraph element.

2b) Descendant elements of the paragraph element which are not <text:s>, <text:tab> or <text:line-break> elements and for which the OpenDocument schema does not permit <text:s>, <text:tab> and <text:line-break> as child elements are replaced with a hypothetical <text:s text:c="0"/> element.
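
A minimal Python sketch of the algorithm and its 2a/2b variation follows. It is illustrative and not part of the proposal text. Several things in it are assumptions: `INLINE_CONTAINERS` is a small, incomplete stand-in for "elements for which the schema permits `<text:s>`, `<text:tab>` and `<text:line-break>` as children"; white space elements are treated as barriers that interrupt runs of U+0020 in steps 5 and 6; trailing spaces are stripped along with leading ones, on the assumption that the original ODF 1.2 step does both; and a line break is rendered as `"\n"`. The function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
DRAW_NS = "urn:oasis:names:tc:opendocument:xmlns:drawing:1.0"
T = "{" + TEXT_NS + "}"

WS_ELEMENTS = {T + "s", T + "tab", T + "line-break"}
# ASSUMPTION: illustrative, incomplete subset of the elements for which the
# schema permits text:s, text:tab and text:line-break as children.
INLINE_CONTAINERS = {T + n for n in ("span", "a", "meta", "meta-field", "ruby-base")}
# The 18 mark elements listed in step 2a.
MARKS = {
    T + base + suffix
    for base in ("change", "bookmark", "reference-mark", "toc-mark",
                 "user-index-mark", "alphabetical-index-mark")
    for suffix in ("", "-start", "-end")
}


def flatten(elem, variation, out):
    """Steps 1-3: append character data (str) and white space elements to out."""
    if elem.text:
        out.append(elem.text)
    for child in elem:
        if child.tag == T + "ruby":                      # step 1
            base = child.find(T + "ruby-base")
            if base is not None:
                flatten(base, variation, out)
        elif child.tag in WS_ELEMENTS:                   # kept until step 7
            out.append(child)
        elif child.tag in INLINE_CONTAINERS:             # step 3
            flatten(child, variation, out)
        elif variation and child.tag not in MARKS:       # step 2b
            out.append(ET.Element(T + "s", {T + "c": "0"}))
        # otherwise the element is removed (step 2, or step 2a for marks)
        if child.tail:
            out.append(child.tail)


def render(elem):
    """Step 7: the characters a white space element represents."""
    if elem.tag == T + "s":
        return " " * int(elem.get(T + "c", "1"))
    # ASSUMPTION: a line break is rendered as a newline in this plain-text sketch.
    return "\t" if elem.tag == T + "tab" else "\n"


def collapse_paragraph(p, variation=False):
    items = []
    flatten(p, variation, items)
    runs = []  # adjacent character data merged; elements act as barriers
    for item in items:
        if isinstance(item, str) and runs and isinstance(runs[-1], str):
            runs[-1] += item
        else:
            runs.append(item)
    is_text = lambda r: isinstance(r, str)
    runs = [re.sub("[\t\r\n]", " ", r) if is_text(r) else r for r in runs]  # step 4
    if runs and is_text(runs[0]):                                          # step 5
        runs[0] = runs[0].lstrip(" ")
    if runs and is_text(runs[-1]):
        runs[-1] = runs[-1].rstrip(" ")
    runs = [re.sub(" {2,}", " ", r) if is_text(r) else r for r in runs]    # step 6
    return "".join(r if is_text(r) else render(r) for r in runs)           # step 7


doc = (
    f'<text:p xmlns:text="{TEXT_NS}" xmlns:draw="{DRAW_NS}">one '
    '<text:bookmark text:name="b"/> two '
    '<draw:frame><draw:text-box><text:p>inner</text:p></draw:text-box></draw:frame>'
    ' three<text:s text:c="2"/>four</text:p>'
)
paragraph = ET.fromstring(doc)
print(repr(collapse_paragraph(paragraph)))                  # 'one two three  four'
print(repr(collapse_paragraph(paragraph, variation=True)))  # 'one two  three  four'
```

In both modes the bookmark disappears and the spaces around it collapse into one. The frame, together with its nested paragraph, is removed outright by step 2, so the spaces around it also collapse. Under step 2b it leaves a zero-width `<text:s text:c="0"/>` behind, which keeps the two spaces apart and therefore both survive.
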

Michael Stahl (Inactive) added a comment - 23/May/17 7:55 PM

proposal was accepted in TC call 2017-04-24

Patrick Durusau added a comment - 31/May/18 2:01 AM

Applied in OpenDocument-v1.3-wd08-part3-documents.odt

Michael Stahl [X] (Inactive) added a comment - 25/Jun/19 4:44 PM

Editors: there is one change in the wd15 draft that i don't understand:

"increases interoperability with popular OpenDocument 1.3 producers."

in the proposal, this read OpenDocument 1.2, not 1.3, and that is intentional: this is for compatibility with existing documents - there aren't yet ODF 1.3 producers and they should produce the ODF 1.3 documents according to the new algorithm anyway.

Furthermore, there is a spurious empty paragraph between step 4) and step 5).

also, in step 2a) there are still line breaks after each item; it doesn't bother me that much, but elsewhere such lists are comma-separated without line breaks, so maybe do it here too for the sake of consistency?

Patrick Durusau added a comment - 05/Aug/19 7:21 PM

Applied OpenDocument-v1.3-wd16-part3-documents.odt

People

Assignee:

Patrick Durusau

Reporter:

Robert Weir (Inactive)

Watchers:

6

Dates

Created:

14/Oct/09 8:44 PM

Updated:

28/Jan/20 7:34 PM

Resolved:

23/May/17 7:55 PM