We need to rethink how Paratext handles NBSP (non-breaking spaces) in light of some recently discovered problems.
We currently handle nbsps as follows:
Existing non-breaking spaces are preserved and new ones can be inserted
"~" (tilde) can also be used to represent a non-breaking space, and is shown as an nbsp in Preview mode, as well as in Print Draft and publishing
Unfortunately, this approach has several problems:
In Paratext, it is very difficult to tell whether a non-breaking space is present or not; they are visually identical
The underlying editor technology (MSHTML), over which we have little control, sometimes inserts nbsps into the text without being asked to do so. This is fairly standard in HTML editors, apparently, and is done to preserve spaces at the beginnings of lines. I can see no way to stop it from doing so.
It is very difficult for users to be consistent in their use of non-breaking spaces if there is no way to see them
The USFM standard already defines "~" as the official way to handle non-breaking spaces
As a result of the first two problems, non-breaking spaces may be silently inserted, and we have no easy way of seeing them.
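To illustrate the visibility problem, here is a minimal sketch of how stray non-breaking spaces could be located in a piece of text. This is not part of Paratext; the function name `find_nbsp` and the sample text are assumptions made up for this example.

```python
NBSP = "\u00a0"  # U+00A0 NO-BREAK SPACE

def find_nbsp(text):
    """Return 1-based (line, column) positions of every U+00A0 in text."""
    hits = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ch == NBSP:
                hits.append((line_no, col))
    return hits

# Hypothetical sample: looks like ordinary spacing, but hides one NBSP.
sample = "\\v 1 In the\u00a0beginning\n\\v 2 And the earth"
print(find_nbsp(sample))
```

On screen the sample looks like it contains only ordinary spaces, which is exactly the difficulty described above.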
There are a number of possible solutions.
A) Use tilde on disk
Change all non-breaking spaces that are read in to tilde. USFM on disk would contain "~" instead of U+00A0
Keep Preview mode the same to allow a preview with the tildes converted to non-breaking spaces
Replace all non-breaking spaces with spaces when converting back from the displayed text (this would mean that the editor would not have to worry about whether a space or a non-breaking space was inserted)
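The three steps of Option A could be sketched roughly as follows. This is only an illustration of the proposed conversions, not Paratext code; the function names `load_from_disk`, `to_preview` and `from_editor` are hypothetical.

```python
NBSP = "\u00a0"  # U+00A0 NO-BREAK SPACE
TILDE = "~"

def load_from_disk(usfm):
    """On read: every NBSP becomes a visible tilde."""
    return usfm.replace(NBSP, TILDE)

def to_preview(text):
    """Preview mode: tildes are shown as real non-breaking spaces."""
    return text.replace(TILDE, NBSP)

def from_editor(displayed):
    """Converting back from the displayed text: any NBSP still present
    is treated as an ordinary space, so it does not matter whether the
    editor control inserted a space or an NBSP."""
    return displayed.replace(NBSP, " ")

# Hypothetical text with one NBSP and one tilde on disk.
loaded = load_from_disk("Mr.\u00a0Smith~Jr.")

# Simulate the editor silently adding an NBSP at the start of the line.
saved = from_editor(NBSP + loaded)
```

In this sketch the only non-breaking spaces that survive editing are the ones a user can see as "~"; anything the editor adds on its own is folded back to a plain space.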
B) Use NBSP on disk
Immediately before display, convert any NBSP to tilde.
Preview mode works as it does now: both non-breaking spaces and tildes are displayed as nbsp
When saving to disk, all tildes are converted back to NBSP.
Option B is problematic:
Any projects that use "~" for any other purpose will have them converted to non-breaking spaces.
Many different parts of Paratext will need to be modified, as the displayed text does not match the actual text on disk
Projects that validly use "~" for non-breaking space will have them removed and replaced with U+00A0
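The first and third problems can be seen in a small round-trip sketch. The function names `to_display` and `to_disk` and the sample text are assumptions for illustration only; they just apply the Option B conversions described above.

```python
NBSP = "\u00a0"  # U+00A0 NO-BREAK SPACE
TILDE = "~"

def to_display(disk_text):
    """Option B, before display: NBSP becomes tilde."""
    return disk_text.replace(NBSP, TILDE)

def to_disk(display_text):
    """Option B, on save: every tilde becomes NBSP."""
    return display_text.replace(TILDE, NBSP)

# Hypothetical project text where "~" is used for something other
# than a non-breaking space, next to a genuine NBSP.
original = "about ~5\u00a0kg"
round_tripped = to_disk(to_display(original))

# The tilde has been replaced by U+00A0, so the text on disk changed
# even though the user edited nothing.
print(round_tripped == original)
```

Once both characters have been mapped to "~" for display, there is no way to tell on save which tildes were originally tildes.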
Option A is less painful, but:
Users will have to learn to use ~ instead of U+00A0 when inserting
Note that with Option A:
Since the change is made when reading in the text, the conversion of non-breaking spaces to tildes will not trigger a save notification and will not interfere with merging
Projects with existing tildes will be unaffected, except in publishing and preview mode (which is broken anyway for projects that are not using Unicode)
Precision is what this work is about and some texts want/need the precision of holding strings together. Typically it is more with resources that we need to really worry about non-breaking spaces. It is a non-issue for most projects.
Invisibility does not help for precision.
Therefore, I would absolutely vote for option A.
In the past we made NBSP visible. I think we need to return to that practice. When it is important for it to be converted, at that point convert it, but make it a ~ in the file.
I do not object to converting NBSP to a visible character.
I do object to losing the NBSPs (ZWJs, etc.) that are entered by the teams to get the text to display correctly.
If NBSPs are converted to tilde in the USFM file, I can handle that. It would not have to be a tilde, if we could agree on another suitable character.
So, either choice appears fine to me for non-roman typesetting.
In looking at the encodings for NBSP, the best "visible" option (which follows an accepted standard, as well as the USFM standard) seems to be the tilde. So I would suggest that we use the tilde. (see attachment NBSP-1 for the listing of NBSP encodings)
Joan brings up a good point in considering other "invisible" encodings. I do not work with any languages that need ZWJ, ZWNJ – but do we need to do the same thing with other "invisible" items? Does it help to make them visible on HDD as HTML does?
In looking at some charts, there are various invisible characters that could be considered, such as the right-to-left and left-to-right markers. (see attachment NBSP-2)
Many of the other "invisible" characters must remain invisible. They are more of a type of control character, resulting in a direction change in the text presentation, or a forced joining or separating of two shapes in a cursive script (ZWJ, ZWNJ). You never want to see anything, just the result of their presence (they can make the characters around them look different).