Discussion:
sh(1) read: add LINE_MAX safeguard and "-n" option
t***@kergis.com
2024-09-24 10:56:49 UTC
The intrinsic built-in read was specified in earlier POSIX to
process text files, which are supposed to end with a newline. This has
been relaxed to take usage and implementations into account, so that a
file without a final newline can be treated as input.

Problems arise when read is used, for example in pkgsrc utilities,
to get the first "line" of files to decide whether they are interpreted
files that need some portability adjustment. Exec'ing sed(1)
for every file costs too much, so a built-in is preferred, but
when a file does not contain a newline, read reads the entire file,
byte by byte, and it takes ages.

The present patch does two things:

1) Set, by default, the maximum of bytes read, in every case, as being
LINE_MAX (the maximum number of bytes in a line in a text file);

2) Implement the '-n' option, which sets the maximum number of bytes
to read explicitly and thus also allows the LINE_MAX value to be
bypassed deliberately.

It is a compromise between usage, historical meaning, practical use
and safety.

If limiting the number of bytes read by default to a limit associated
with text files (LINE_MAX) were considered contrary to the present
POSIX spec (which has "relaxed" the requirement for an end-of-line, not
stated that read has to handle any file whatsoever), another maximum
value, associated with the size of a file, could be used instead of
LINE_MAX in the present patch; this would not improve the time in
practice unless the maximum is set to a more reasonable value.

BTW: the usage displayed when a variable name was not given didn't
show the "[-d delim]" option.

diff --git a/bin/sh/miscbltin.c b/bin/sh/miscbltin.c
index c4f963d0d86a..2248b7830835 100644
--- a/bin/sh/miscbltin.c
+++ b/bin/sh/miscbltin.c
@@ -54,6 +54,7 @@ __RCSID("$NetBSD: miscbltin.c,v 1.54 2023/10/05 20:33:31 kre Exp $");
#include <stdlib.h>
#include <ctype.h>
#include <errno.h>
+#include <limits.h> /* LINE_MAX, if defined */

#include "shell.h"
#include "options.h"
@@ -67,6 +68,9 @@ __RCSID("$NetBSD: miscbltin.c,v 1.54 2023/10/05 20:33:31 kre Exp $");

#undef rflag

+#ifndef LINE_MAX
+# define LINE_MAX 2048 /* peak value of _POSIX2_LINE_MAX */
+#endif


/*
@@ -75,6 +79,11 @@ __RCSID("$NetBSD: miscbltin.c,v 1.54 2023/10/05 20:33:31 kre Exp $");
*
* This uses unbuffered input, which may be avoidable in some cases.
*
+ * For safety and efficiency (specially when called on a not text file),
+ * the maximum number of bytes read is LINE_MAX. The '-n' option can
+ * explicitely bypass it (since it is not a POSIX required option,
+ * POSIX text file limits have not to apply).
+ *
* Note that if IFS=' :' then read x y should work so that:
* 'a b' x='a', y='b'
* ' a b ' x='a', y='b'
@@ -101,6 +110,8 @@ readcmd(int argc, char **argv)
int i;
int is_ifs;
int saveall = 0;
+ int linemax;
+ int nread;
ptrdiff_t wordlen = 0;
char *newifs = NULL;
struct stackmark mk;
@@ -108,11 +119,18 @@ readcmd(int argc, char **argv)
end = '\n'; /* record delimiter */
rflag = 0;
prompt = NULL;
- while ((i = nextopt("d:p:r")) != '\0') {
+ linemax = LINE_MAX;
+
+ while ((i = nextopt("d:n:p:r")) != '\0') {
switch (i) {
case 'd':
end = *optionarg; /* even if '\0' */
break;
+ case 'n':
+ linemax = atoi(optionarg);
+ if (linemax <= 0)
+ linemax = LINE_MAX;
+ break;
case 'p':
prompt = optionarg;
break;
@@ -124,7 +142,7 @@ readcmd(int argc, char **argv)

if (*(ap = argptr) == NULL)
error("variable name required\n"
- "Usage: read [-r] [-p prompt] var...");
+ "Usage: read [-r] [-d delim] [-n count] [-p prompt] var...");

if (prompt && isatty(0)) {
out2str(prompt);
@@ -138,16 +156,20 @@ readcmd(int argc, char **argv)
status = 0;
startword = 2;
STARTSTACKSTR(p);
+ nread = 0;
for (;;) {
if (read(0, &c, 1) != 1) {
status = 1;
break;
}
- if (c == '\\' && c != end && !rflag) {
+ if (++nread > linemax) /* same as end */
+ break;
+ if (c == '\\' && c != end && !rflag && nread != linemax ) {
if (read(0, &c, 1) != 1) {
status = 1;
break;
}
+ ++nread;
if (c != '\n') /* \ \n is always just removed */
goto wdch;
continue;
diff --git a/bin/sh/sh.1 b/bin/sh/sh.1
index a819e9d72188..1938dd9478d3 100644
--- a/bin/sh/sh.1
+++ b/bin/sh/sh.1
@@ -3663,7 +3663,7 @@ the program will use
and the built-in uses a separately cached value.
.\"
.Pp
-.It Ic read Oo Fl d Ar delim Oc Oo Fl p Ar prompt Oc Oo Fl r Oc Ar variable Op Ar ...
+.It Ic read Oo Fl d Ar delim Oc Oo Fl n Ar count Oc Oo Fl p Ar prompt Oc Oo Fl r Oc Ar variable Op Ar ...
The
.Ar prompt
is printed on standard error if the
@@ -3674,8 +3674,14 @@ first character of
.Ar delim
if the
.Fl d
-option was given, or a newline character otherwise,
-is read from the standard input.
+option was given, or a newline character otherwise, or after reading
+at maximum
+.Ar count
+bytes if the
+.Fl n
+option was given, or the system defined at compile time
+.Dv LINE_MAX
+otherwise, is read from the standard input.
The ending delimiter is deleted from the
record which is then split as described in the field splitting section of the
.Sx Word Expansions
@@ -3697,6 +3703,16 @@ built-in will indicate success unless EOF, or a read error,
is encountered on input, in
which case failure is returned.
.Pp
+In what follows, the processing is only done if the maximum of bytes
+to read has not been reached. This maximum always exists, whether set
+by the user with the
+.Fl n
+option, or defaulting to a system defined at compile time
+.Dv LINE_MAX
+value. This indicates that not more than this amount of bytes will be
+read, and does not indicate the amount of bytes that will actually be
+read, nor the number of chars that will be present in the variables.
+.Pp
By default, unless the
.Fl r
option is specified, the backslash
--
Thierry Laronde <tlaronde +AT+ kergis +dot+ com>
http://www.kergis.com/
http://kertex.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89 250D 52B1 AE95 6006 F40C

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Robert Elz
2024-09-24 11:54:35 UTC
Date: Tue, 24 Sep 2024 12:56:49 +0200
From: <***@kergis.com>
Message-ID: <***@kergis.com>

| The present patch does two things:
|
| 1) Set, by default, the maximum of bytes read, in every case, as being
| LINE_MAX (the maximum number of bytes in a line in a text file);

I am not really in favour of that part, while allowed by the standard,
imposing unnecessary limits, just because they are permitted, is not
really ideal. Apart from that, the "line" read by read (without -r)
can actually be several (or many) text file lines, if each is ended by
a \ (line continuation).

| 2) Implement the '-n' option, which sets the maximum number of bytes
| to read explicitly and thus also allows the LINE_MAX value to be
| bypassed deliberately.

Martin suggested that as well. Your implementation isn't correct
as it is (if the limit is reached, the next character will be discarded,
that's not allowed ... also easy to fix) but before doing anything I
want to check what other shells which implement the option actually
count (particularly wrt \ sequences, but also the word splitting).
There is no point being needlessly different if that is possible to
avoid.
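
The discarded-character problem comes from the order of operations: the posted patch reads a byte and only then compares the counter with the limit. A minimal sketch in C of the difference, assuming simplified loops without delimiter or backslash handling; both function names are hypothetical and neither is the sh(1) code:

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Same order of operations as the posted patch: read a byte, then
 * test the counter.  When the limit is exceeded, the byte just read
 * has already been consumed from fd and is dropped.
 * Returns the number of bytes stored in buf.
 */
size_t
limit_checked_after_read(int fd, char *buf, size_t max)
{
	size_t nread = 0, stored = 0;
	char c;

	assert(buf != NULL);
	for (;;) {
		if (read(fd, &c, 1) != 1)
			break;
		if (++nread > max)	/* too late: c is already consumed */
			break;
		buf[stored++] = c;
	}
	return stored;
}

/*
 * One possible fix: test the counter first, so that no byte beyond
 * the limit is ever taken from fd.
 */
size_t
limit_checked_before_read(int fd, char *buf, size_t max)
{
	size_t stored = 0;
	char c;

	assert(buf != NULL);
	while (stored < max && read(fd, &c, 1) == 1)
		buf[stored++] = c;
	return stored;
}
```

With "abcdef" waiting on a pipe and a limit of 3, both variants store "abc", but a following read(2) on the same descriptor returns 'e' after the first variant and 'd' after the second.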

| BTW: the usage displayed when a variable name was not given didn't
| show the "[-d delim]" option.

Sometimes things just get forgotten... ;=)

kre

Greg Troxel
2024-09-24 13:09:29 UTC
Post by Robert Elz
Date: Tue, 24 Sep 2024 12:56:49 +0200
|
| 1) Set, by default, the maximum of bytes read, in every case, as being
| LINE_MAX (the maximum number of bytes in a line in a text file);
I am not really in favour of that part, while allowed by the standard,
imposing unnecessary limits, just because they are permitted, is not
really ideal. Apart from that, the "line" read by read (without -r)
can actually be several (or many) text file lines, if each is ended by
a \ (line continuation).
Sure, but the problem is that if you have a file which is e.g one line
(single \n at end) that is 10 MB, read from it is unreasonable, and it's
difficult to deal with this in portable code.

If there were a limit which was well under 1 MB, but well over anything
reasonably in a bona fide text file, it would finesse the issue.

Perhaps 32 * LINE_MAX.

t***@kergis.com
2024-09-24 14:16:17 UTC
Post by Greg Troxel
Post by Robert Elz
Date: Tue, 24 Sep 2024 12:56:49 +0200
|
| 1) Set, by default, the maximum of bytes read, in every case, as being
| LINE_MAX (the maximum number of bytes in a line in a text file);
I am not really in favour of that part, while allowed by the standard,
imposing unnecessary limits, just because they are permitted, is not
really ideal. Apart from that, the "line" read by read (without -r)
can actually be several (or many) text file lines, if each is ended by
a \ (line continuation).
Sure, but the problem is that if you have a file which is e.g one line
(single \n at end) that is 10 MB, read from it is unreasonable, and it's
difficult to deal with this in portable code.
If there were a limit which was well under 1 MB, but well over anything
reasonably in a bona fide text file, it would finesse the issue.
Perhaps 32 * LINE_MAX.
POSIX issue 8 has added the "-d delim", that is a delimiter of a
"line" and this makes things more complex, since the continuation is
the escaping of the delimiter.

My solution was too simple.

We have to distinguish between the maximal length of a "line"
(linemax) and the maximum number of bytes to read (the "-n" option):
recordmax.

If the delimiter is the newline, the maximal length of each "line" is
that of a text line, that is LINE_MAX; if the delimiter is something
else, the maximum is ULONG_MAX.
If this amount is reached without reaching the delimiter (escaped or
not), the reading stops. When changing line (after a continuation
line), the counter is reset to zero, allowing another "line" to be
absorbed.

What is set by "-n" is the maximum count of bytes composing the record
(recordmax), which may be a concatenation of "lines". Discarded bytes
are not counted (a backslash and delimiter pair is not part of the
data, since an "escaped line" is presentation), and an escaped sequence
that is interpreted (not raw) counts as only 1, the escaped sequence
being replaced by the character.

If the maximum is not set it defaults to ULONG_MAX.

Slightly more complex than what I made, but still reasonably simple.

Robert Elz
2024-09-24 15:42:34 UTC
Date: Tue, 24 Sep 2024 09:09:29 -0400
From: Greg Troxel <***@lexort.com>
Message-ID: <***@s1.lexort.com>

| Sure, but the problem is that if you have a file which is e.g one line
| (single \n at end) that is 10 MB, read from it is unreasonable, and it's
| difficult to deal with this in portable code.

Yes. That's just a limitation of what portable code (using sh's read anyway)
can provide, if it weren't for the cost of running non-builtin processes, then
there would be other easy ways to handle this.

| If there were a limit which was well under 1 MB, but well over anything
| reasonably in a bona fide text file, it would finesse the issue.

That's what the -n option is supposed to achieve, though it is not portable
(and cannot be, it seems, as different shells have different definitions
of what it actually does -- even ignoring zsh, where it is a different
thing entirely). It's reasonable to add; once we can work out what it
should really mean, that can happen.

Then on any system with our (once updated) shell, or bash, or mksh, or ksh93
which all have -n options similar enough for your purpose, you can implement
whatever limit you like, without it affecting other uses of read in sh.
On other systems just fall back to sed/dd/... and accept the cost (that
script probably isn't often used on such systems I'd expect.)

Part of the reason that things are costly now, is that read is required
(normally anyway) to read 1 byte at a time, so if you're doing a read
which consumes a MB, that's a million (plus) system calls... That's not
cheap! That's kind of inherent in the definition of read and how it is
required to leave the state of the fd it reads from.

kre

ps: in that script (the one in question) the read calls (or at least the
ones related to this) should certainly be using read's -r option.



t***@kergis.com
2024-09-25 13:12:07 UTC
Post by Robert Elz
Date: Tue, 24 Sep 2024 12:56:49 +0200
|
| 1) Set, by default, the maximum of bytes read, in every case, as being
| LINE_MAX (the maximum number of bytes in a line in a text file);
I am not really in favour of that part, while allowed by the standard,
imposing unnecessary limits, just because they are permitted, is not
really ideal. Apart from that, the "line" read by read (without -r)
can actually be several (or many) text file lines, if each is ended by
a \ (line continuation).
FWIW, you will find below a proposed solution:

In a nutshell, there are 2 counters: one for each "line" (in the
stream read, the sequence of bytes terminated by the end byte) and a
second counter counting all the bytes read, except the continuation
line sequence (when not in raw mode) that is considered to be ignored
(it's the same data, whether it is one long line or split with
continuation lines).

If '-n' does not specify it, the maximum number of bytes (except
continuation line sequence) is "unlimited" (macro READ_UNLIMITED
defined to be ULONG_MAX).

If the delimiter is not specified, it is the default '\n', so the
input stream is supposed to be a POSIX text file. The maximum length
of a "line" is LINE_MAX. When called like:

read myvar <myfile

no more than LINE_MAX bytes of a non-text file will be read, hence the
"naive" way to read the first "line" to look for a shebang works as
expected.

One can perfectly well have more than this with continuation lines: the
counter applies only to each line read; it is reset to zero when
switching to the "next" continuation line.
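
The two counters described above can be sketched as follows. This is an illustration on an in-memory buffer rather than on fd 0, with IFS splitting left out; scan_record and its parameter names are hypothetical, and the counting rules are the ones stated in this message (a continuation sequence resets the per-line counter and is not counted, any other escaped pair counts as 2 bytes):

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define READ_UNLIMITED ULONG_MAX	/* 'unlimited' length, as in the message */

/*
 * Illustrative only: scans in[0..len-1] instead of reading fd 0.
 * out must have room for len bytes.  Returns the number of bytes kept.
 */
size_t
scan_record(const char *in, size_t len, char end, int raw,
    unsigned long linemax, unsigned long readmax, char *out)
{
	unsigned long nread_inline = 0;	/* bytes in the current "line" */
	unsigned long nread = 0;	/* bytes counted against readmax */
	size_t i = 0, kept = 0;

	assert(in != NULL && out != NULL);
	while (i < len) {
		if (nread >= readmax || nread_inline >= linemax)
			break;		/* limit reached: stop before consuming */
		char c = in[i++];
		if (c == '\\' && !raw) {
			if (i == len)
				break;	/* input ends right after the backslash */
			c = in[i++];
			if (c == end) {	/* continuation: dropped, not counted */
				nread_inline = 0;
				continue;
			}
			nread_inline += 2;
			nread += 2;
			out[kept++] = c;	/* escaped byte kept literally */
			continue;
		}
		++nread_inline;
		++nread;
		if (c == end)
			break;		/* unescaped delimiter ends the record */
		out[kept++] = c;
	}
	return kept;
}
```

For the input "ab", backslash, newline, "cd", newline with a per-line limit of 3 and no record limit, the sketch keeps "abcd": the continuation resets the line counter before the limit is hit. With a record limit of 3 and no per-line limit it keeps "abc".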

If '-d' is used, even with '\n', linemax is reset to READ_UNLIMITED,
allowing to bypass the limit for a "text" file not POSIX compliant.

I have dropped trying to count (for '-n') only bytes effectively put
in the "record" for 2 reasons:

1) The handling of IFS, even more so if IFS is a name of one of the
variables to be assigned, is not easy to get right (we end processing
before perhaps removing characters written);

2) When the user defines the maximum to read, he can hardly know what
processing will be done with the data i.e. the number of bytes
effectively written in the variables. So this number is effectively
the number of bytes to read, minus the continuation line sequences
that could not be here and should not count.

It seems to me that this addresses the usage, adds a reasonable
safeguard, for a very small price in code.

Best,

T. Laronde

diff --git a/bin/sh/miscbltin.c b/bin/sh/miscbltin.c
index c4f963d0d86a..b0cba4fab15e 100644
--- a/bin/sh/miscbltin.c
+++ b/bin/sh/miscbltin.c
@@ -54,6 +54,7 @@ __RCSID("$NetBSD: miscbltin.c,v 1.54 2023/10/05 20:33:31 kre Exp $");
#include <stdlib.h>
#include <ctype.h>
#include <errno.h>
+#include <limits.h> /* LINE_MAX, if defined */

#include "shell.h"
#include "options.h"
@@ -67,6 +68,9 @@ __RCSID("$NetBSD: miscbltin.c,v 1.54 2023/10/05 20:33:31 kre Exp $");

#undef rflag

+#ifndef LINE_MAX
+# define LINE_MAX 2048 /* peak value of _POSIX2_LINE_MAX */
+#endif


/*
@@ -75,6 +79,14 @@ __RCSID("$NetBSD: miscbltin.c,v 1.54 2023/10/05 20:33:31 kre Exp $");
*
* This uses unbuffered input, which may be avoidable in some cases.
*
+ * For safety and efficiency, when called with the default end "line"
+ * delimiter ('\n'), the maximum length of a line is set to LINE_MAX.
+ * When specifying explicitely something (including '\n'), user
+ * must know what he's doing since this safeguard doesn't exist
+ * (the limit is ULONG_MAX considered unlimited here).
+ * When specifying the maximum number to read, this is the number of
+ * bytes except the escaping sequence '\\' + end.
+ *
* Note that if IFS=' :' then read x y should work so that:
* 'a b' x='a', y='b'
* ' a b ' x='a', y='b'
@@ -86,6 +98,7 @@ __RCSID("$NetBSD: miscbltin.c,v 1.54 2023/10/05 20:33:31 kre Exp $");
* ':b c:' x='', y='b c:'
*/

+#define READ_UNLIMITED ULONG_MAX /* our 'unlimited' length */
int
readcmd(int argc, char **argv)
{
@@ -101,17 +114,28 @@ readcmd(int argc, char **argv)
int i;
int is_ifs;
int saveall = 0;
+ unsigned long nread_inline; /* #bytes read in "line" */
+ unsigned long linemax; /* nread_inline <= linemax */
+ unsigned long nread; /* #bytes read, not counting escaped end */
+ unsigned long readmax; /* nread <= readmax */
ptrdiff_t wordlen = 0;
char *newifs = NULL;
struct stackmark mk;

- end = '\n'; /* record delimiter */
+ end = '\n'; /* line delimiter */
rflag = 0;
prompt = NULL;
- while ((i = nextopt("d:p:r")) != '\0') {
+ linemax = LINE_MAX; /* default */
+ readmax = READ_UNLIMITED;
+
+ while ((i = nextopt("d:n:p:r")) != '\0') {
switch (i) {
case 'd':
end = *optionarg; /* even if '\0' */
+ linemax = READ_UNLIMITED;
+ break;
+ case 'n':
+ readmax = strtoul(optionarg, (char **)NULL, 10);
break;
case 'p':
prompt = optionarg;
@@ -122,9 +146,12 @@ readcmd(int argc, char **argv)
}
}

+ if (end == '\\')
+ rflag = 1; /* there is no escape with escape... */
+
if (*(ap = argptr) == NULL)
error("variable name required\n"
- "Usage: read [-r] [-p prompt] var...");
+ "Usage: read [-r] [-d delim] [-n count] [-p prompt] var...");

if (prompt && isatty(0)) {
out2str(prompt);
@@ -138,19 +165,30 @@ readcmd(int argc, char **argv)
status = 0;
startword = 2;
STARTSTACKSTR(p);
+ nread = nread_inline = 0;
for (;;) {
+ if (nread >= readmax || nread_inline >= linemax)
+ break;
if (read(0, &c, 1) != 1) {
status = 1;
break;
}
- if (c == '\\' && c != end && !rflag) {
+ if (c == '\\' && !rflag) {
if (read(0, &c, 1) != 1) {
status = 1;
break;
}
- if (c != '\n') /* \ \n is always just removed */
+ if (c != end) { /* not continuation line */
+ nread_inline += 2;
+ nread += 2;
goto wdch;
- continue;
+ } else {
+ nread_inline = 0; /* new line */
+ continue;
+ }
+ } else {
+ ++nread_inline;
+ ++nread;
}
if (c == end)
break;

Robert Elz
2024-09-25 14:01:12 UTC
Date: Wed, 25 Sep 2024 15:12:07 +0200
From: ***@kergis.com
Message-ID: <***@kergis.com>

| FWIW, you will find below a proposed solution:

Thanks, but there is no need, I will make it work eventually.

I like that you got rid of atoi() from the previous version, but
sh has a function number() which not only gets an int (just an int,
but that's enough for any rational use here) but also does all the
syntax checking and error handling for it; that is what should be used.
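
To make the concern concrete: atoi() silently accepts input such as "12x", and an unchecked strtoul() accepts a leading minus sign. A minimal sketch of strict parsing in standard C follows; parse_count is a hypothetical helper written for this illustration and is not the shell's number() function, whose exact behaviour is not shown in this thread:

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse a strictly decimal, positive count.
 * Returns 0 and stores the value in *out on success, -1 on any
 * syntax or range error.
 */
int
parse_count(const char *s, int *out)
{
	char *endp;
	long v;

	assert(s != NULL && out != NULL);
	if (!isdigit((unsigned char)*s))	/* rejects "", "-5", " 7", "+7" */
		return -1;
	errno = 0;
	v = strtol(s, &endp, 10);
	if (errno == ERANGE || *endp != '\0')	/* overflow or trailing junk */
		return -1;
	if (v <= 0 || v > INT_MAX)		/* zero or too large for an int */
		return -1;
	*out = (int)v;
	return 0;
}
```

A caller would report a usage error when parse_count() returns -1, rather than falling back to a default as the first patch did for non-positive values.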

| If the delimiter is not specified, it is the default '\n', so the
| input stream is supposed to be a POSIX text file. The maximum length
| of a "line" is LINE_MAX.

That's not exactly what it is intended to be - LINE_MAX is the limit
applications are supposed to keep lines under, as that's what applications
are required to support (as a minimum) - there's nothing that says that
the application can't support longer lines - even unlimited length ones
(within available memory if the whole line needs to be stored), and
not imposing unnecessary limits is generally a much better result.

| When called like:
| read myvar <myfile

I know you're trying to find a way to make that portable usage run
much faster (at least with our shell, we won't be improving any others)
but making changes like this to meet the specific needs of one application
isn't what we ought to be doing. (I haven't really looked at the script
but my guess would be that there are other changes that could be made which
would improve performance more than what this one could do.)

| If '-d' is used, even with '\n', linemax is reset to READ_UNLIMITED,

Why would there be a difference between '\n' as the default delimiter
and '\n' explicitly set? You're requiring applications that might work
just fine right now to change to meet your new restriction - it should be
the other way around, applications that need the new feature (limited
length reads) should be changed to request it.

| I have dropped trying to count (for '-n') only bytes effectively put
| in the "record" for 2 reasons:

That was probably always the wrong approach; if read is supposed to
read a max of N bytes, it should read no more than that, whatever is
done with them afterwards.

But to be compatible with the other shells that have -n (in this fashion)
a read from a device should complete as soon as N bytes are available,
not wait for more to appear to allow the read to succeed - that means
that the terminal has to have canonical processing turned off, and that
means that we need to consider what happens when the ERASE (etc) chars
are entered.

Thanks for the effort, but please stop now.

kre


t***@kergis.com
2024-09-24 12:27:48 UTC
Post by Robert Elz
Date: Tue, 24 Sep 2024 12:56:49 +0200
|
| 1) Set, by default, the maximum of bytes read, in every case, as being
| LINE_MAX (the maximum number of bytes in a line in a text file);
I am not really in favour of that part, while allowed by the standard,
imposing unnecessary limits, just because they are permitted, is not
really ideal. Apart from that, the "line" read by read (without -r)
can actually be several (or many) text file lines, if each is ended by
a \ (line continuation).
Good point.

This can be solved by resetting nread to 0 when an actual end-of-line
is reached and escaped. In this case the condition (nread == linemax)
has to be suppressed and the case handled in the corresponding block.
Post by Robert Elz
| 2) Implement the '-n' option, which sets the maximum number of bytes
| to read explicitly and thus also allows the LINE_MAX value to be
| bypassed deliberately.
Martin suggested that as well. Your implementation isn't correct
as it is (if the limit is reached, the next character will be discarded,
that's not allowed ... also easy to fix) but before doing anything I
want to check what other shells which implement the option actually
count (particularly wrt \ sequences, but also the word splitting).
There is no point being needlessly different if that is possible to
avoid.
I gave a quick look to the bash(1) man page, that has two differing
options: -n (max number read) and -N (read exactly this number).

I have not looked at the '-N' case in detail (it seems to me overly
complex to get right for whatever a user might want regarding bytes
read vs "chars" actually ending up in the variables).

For '-n', if I understand correctly, this is the number of bytes read,
without consideration of "char"s and, in this sense, escaping sequences.

For me, the least surprising behaviour is to treat reaching the size
limit as if the next char were an EOF, except when a newline is escaped.

This is why I use "bytes" for the count, to treat it differently from
"char" that may be an interpretation of a sequence of bytes.

Robert Elz
2024-09-24 16:13:05 UTC
Date: Tue, 24 Sep 2024 14:27:48 +0200
From: ***@kergis.com
Message-ID: <***@kergis.com>

| This can be solved by resetting nread to 0 when an actual end-of-line
| is reached and escaped.

I think it better to just not have a limit in the normal case, it
serves no purpose, except for this one rather exotic use of the
read builtin (which really is meant for reading text files - the
"variation" was to allow it to read \0 delimited "records" as output
from find -print0 and similar.). It cannot really read binary blobs,
no matter what is done, as sh variables cannot contain \0 characters
(ever). That doesn't matter for the present purpose, but nor does
much else here.

| I gave a quick look to the bash(1) man page, that has two differing
| options: -n (max number read) and -N (read exactly this number).
|
| I have not looked at the '-N' case in detail (it seems to me overly
| complex to get right for whatever a user might want regarding bytes
| read vs "chars" actually ending up in the variables).
|
| For '-n', if I understand correctly, this is the number of bytes read,
| without consideration of "char"s and, in this sense, escaping sequences.

The bash manual says "characters" in both cases, but I'm not sure that
it really means that, and certainly for us the difference is moot, as
sh really wants 1 byte == 1 character, almost always (it can process
UTF-8 and similar because it mostly doesn't need to interpret the
strings as characters, just as byte strings).

| This is why I use "bytes" for the count, to treat it differently from
| "char" that may be an interpretation of a sequence of bytes.

Yes, that part isn't the issue - the issue is that if "read" reads N
bytes (characters) [0..N-1] (and after processing assigns them to variables)
then another following read must start at the very next byte [N], read isn't
allowed to simply discard anything not explicitly specified -- that is it can
remove \ chars if -r isn't given, and always removes the delimiter char,
if found, but it cannot actually read 128 bytes, and then just process
100 of them, as there's no way to put back the other 28 (particularly
when reading from a pipe). That's why it reads 1 byte at a time, and
never reads the next unless it is needed.

The other versions (ignoring zsh where -n means something totally unrelated)
all put the terminal into raw mode (or the equivalent) when -n is specified,
so as soon as n characters have been read the read can stop - otherwise the
terminal driver won't return anything until the user enters a \n. While
the 1 byte at a time read scheme avoids reading more than N of the bytes
entered, leaving the rest for later, if one does "read -n 1 var" and the
read doesn't return after 1 byte is typed (which it does in the other shells),
people will be unhappy.
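
A minimal sketch of what "raw mode (or the equivalent)" can look like with the POSIX termios interface. The function names are hypothetical, this is not taken from any of the shells mentioned, and it only turns off canonical processing; whether other flags (echo, signals) should change is left open here:

```c
#include <assert.h>
#include <termios.h>
#include <unistd.h>

/*
 * Sketch: switch the terminal on fd to non-canonical input so that
 * read(2) returns as soon as one byte is available.  The previous
 * settings are stored in *saved for a later restore.
 * Returns 0 on success, -1 if fd is not a terminal or on error.
 */
int
tty_enter_bytewise(int fd, struct termios *saved)
{
	struct termios t;

	assert(saved != NULL);
	if (tcgetattr(fd, saved) == -1)	/* fails (ENOTTY) on files and pipes */
		return -1;
	t = *saved;
	t.c_lflag &= ~(tcflag_t)ICANON;	/* no line assembly, no ERASE/KILL editing */
	t.c_cc[VMIN] = 1;		/* return once 1 byte is available */
	t.c_cc[VTIME] = 0;		/* no inter-byte timer */
	return tcsetattr(fd, TCSANOW, &t);
}

/* Restore the settings saved by tty_enter_bytewise(). */
int
tty_restore(int fd, const struct termios *saved)
{
	assert(saved != NULL);
	return tcsetattr(fd, TCSANOW, saved);
}
```

Because ERASE and KILL are no longer handled by the terminal driver in this mode, a read built-in using it would have to decide for itself what those characters mean, and must restore the saved settings on every exit path.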

I am looking at how to make something reasonable work. It won't happen
within a day or two however.

kre



Robert Elz
2024-09-25 17:02:08 UTC
Date: Wed, 25 Sep 2024 21:01:12 +0700
From: Robert Elz <***@munnari.OZ.AU>
Message-ID: <***@jacaranda.noi.kre.to>

This isn't avoiding adding a -n that works, but a possible simpler change
that might help, is much easier to install properly, should break nothing,
and I think will probably avoid the problem that is being observed.

That is, currently the read builtin simply ignores \0 bytes in the input,
except if \0 is the delimiter character. We could change that, and make
\0 an error - POSIX specifies that the input shall not contain \0 chars
unless -d has been used to make \0 the delimiter character.

If that change got made, I suspect that the read would terminate quite
quickly on non-text files, and text files will end at the first \n.
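
A rough sketch of that idea, assuming a simplified loop that only counts bytes and ignores backslashes, IFS and variable assignment; skim_line and READ_NUL_SEEN are hypothetical names, and EOF is reported as plain success here, unlike the real built-in:

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

#define READ_NUL_SEEN (-2)	/* hypothetical error code for this sketch */

/*
 * Consume one line from fd, one byte at a time.  *consumed receives the
 * number of bytes taken from fd, including the byte that stopped the
 * scan.  Returns 0 at a newline or at EOF, -1 on a read error, and
 * READ_NUL_SEEN as soon as a \0 byte shows up in the input.
 */
int
skim_line(int fd, size_t *consumed)
{
	char c;

	assert(consumed != NULL);
	*consumed = 0;
	for (;;) {
		ssize_t r = read(fd, &c, 1);
		if (r < 0)
			return -1;
		if (r == 0)
			return 0;	/* EOF: unterminated last line */
		++*consumed;
		if (c == '\0')
			return READ_NUL_SEEN;	/* invalid input: give up early */
		if (c == '\n')
			return 0;
	}
}
```

On an ELF binary, whose header typically contains a \0 within the first few bytes, such a loop stops almost immediately; on a long single-line file made only of printable bytes it still reads to the end.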

Would that be a useful first step at least?

If I do that, I'd probably add an option to retain the current behaviour,
just in case there's something (inappropriately) relying upon it.

kre


t***@kergis.com
2024-09-25 18:03:56 UTC
Post by Robert Elz
That is, currently the read builtin simply ignores \0 bytes in the input,
except if \0 is the delimiter character. We could change that, and make
\0 an error - POSIX specifies that the input shall not contain \0 chars
unless -d has been used to make \0 the delimiter character.
Just to be sure about the nul byte: a nul byte generates an error (non
interactive: terminates the process; interactive: stops processing,
displays an error and waits for more).

But a nul byte given as an escaped sequence does not generate an error
and is discarded (present implementation OK, since choice up to the
implementation) conforming to (issue 8, taken from 2.2.4
Dollar-Single-Quotes, and applying consistency):

---8<---
If a \xXX or \ddd escape sequence yields a byte whose value is 0, it
is unspecified whether that null byte is included in the result or if
that byte and any following regular characters and escape sequences up
to the terminating unescaped single-quote are evaluated and discarded.
--->8---

Is that it?

t***@kergis.com
2024-09-25 17:20:31 UTC
Post by Robert Elz
Date: Wed, 25 Sep 2024 21:01:12 +0700
This isn't avoiding adding a -n that works, but a possible simpler change
that might help, is much easier to install properly, should break nothing,
and I think will probably avoid the problem that is being observed.
That is, currently the read builtin simply ignores \0 bytes in the input,
except if \0 is the delimiter character. We could change that, and make
\0 an error - POSIX specifies that the input shall not contain \0 chars
unless -d has been used to make \0 the delimiter character.
If that change got made, I suspect that the read would terminate quite
quickly on non-text files, and text files will end at the first \n.
Would that be a useful first step at least?
If I do that, I'd probably add an option to retain the current behaviour,
just in case there's something (inappropriately) relying upon it.
AFAIK, this won't solve the problem at hand, because numerous
files with bytes in the range 0x20--0x7E are produced by stripping extra
white space and putting everything on one "line", to speed up the parsing
or to make them more difficult for a human being to read---or because
they are generated by software and are not intended for humans,
and the software simply adds instructions, one after the other,
without inserting newlines.

So '\0' is another question---there is no harm in solving it to be consistent
with what NetBSD wants to be consistent with---but this will not help
with typical html files or javascript files that may well not
have any newline while every byte is in the 0x20--0x7E range.

Even without changing at present the way read reads, the easy solution
is to add the '-n' option, without dealing with terminal settings.
Then the decision has to be made whether or not to count the escaping line
sequence (I'm for not counting it, the rationale being that a continuation
line is simply a formatting of data entry, whether to conform to
LINE_MAX or to whatever line length, but does not count as data).

But this is just one opinion, and it is up to you.
--
Thierry Laronde <tlaronde +AT+ kergis +dot+ com>
http://www.kergis.com/
http://kertex.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89 250D 52B1 AE95 6006 F40C

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Robert Elz
2024-09-25 22:25:42 UTC
Date: Wed, 25 Sep 2024 20:03:56 +0200
From: ***@kergis.com
Message-ID: <ZvRQjAg2qzt-***@kergis.com>

| But a nul byte given as an escaped sequence does not generate an error
| and is discarded (present implementation OK, since choice up to the
| implementation) conforming to (issue 8, taken from 2.2.4
| Dollar-Single-Quotes, and applying consistency):

$'' has nothing to do with it, that is describing chars that will become
part of the script, not input to a utility, the relevant text is in read ...

If the -d delim option is not specified, or if it is specified and
delim is not the null string, the standard input shall contain zero
or more bytes (which need not form valid characters) and shall
not contain any null bytes.

When we have invalid input we can treat it as an error, or we can
do something different. Currently we just ignore them. They can't
be included.

But if (from your earlier message) stopping at \0 isn't going to help,
then there's no point changing things. I had assumed that the files
being checked were generally either #!/whatever files, or binaries.
If there are javascript, html, and all kinds of other stuff included,
then the \0 check might indeed be not all that useful - but as I said,
that wasn't intended as an alternative to -n, just as something that
could reasonably be done which would allow "read var < file" to work
better than it does now, without needing to be modified.

kre


t***@kergis.com
2024-09-24 15:30:42 UTC
Post by Robert Elz
Date: Tue, 24 Sep 2024 12:56:49 +0200
|
| 1) Set, by default, the maximum of bytes read, in every case, as being
| LINE_MAX (the maximum number of bytes in a line in a text file);
I am not really in favour of that part, while allowed by the standard,
imposing unnecessary limits, just because they are permitted, is not
really ideal. Apart from that, the "line" read by read (without -r)
can actually be several (or many) text file lines, if each is ended by
a \ (line continuation).
In the present code, a delimiter can not be "any character": because
if the delimiter is the backslash, what shall be the behavior?

Since backslash is defined to be the escape character, it can not be
used as a delimiter, unless specifying that all backslashes have to
be escaped to be accepted as end delimiter... But how to specify a
continuation line then?

Shouldn't we rule out backslash as a delimiter (for -d) (and POSIX has
to be amended). Or, if backslash is a delimiter, does this imply
"raw", i.e. no escaping club?

Furthermore the continuation test on:

	if (c != '\n')	/* \ \n is always just removed */
		goto wdch;

seems wrong. Shouldn't it be?:

	if (c != end)
		goto wdch;

to accept whatever continuation line with whatever delimiter?
--
Thierry Laronde <tlaronde +AT+ kergis +dot+ com>
http://www.kergis.com/
http://kertex.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89 250D 52B1 AE95 6006 F40C

Robert Elz
2024-09-24 17:18:28 UTC
Date: Tue, 24 Sep 2024 17:30:42 +0200
From: ***@kergis.com
Message-ID: <***@kergis.com>

| In the present code, a delimiter can not be "any character": because
| if the delimiter is the backslash, what shall be the behavior?

That isn't even a half an issue ... the read reads characters until
the delimiter is encountered, then discards that, and processes the
line/record/whatever you want to call it, as directed - which includes
(assuming -r wasn't given) looking for (and eventually removing) escape
chars ... in this case it is unlikely to find any! (But that's OK).

| Since backslash is defined to be the escape character, it can not be
| used as a delimiter, unless specifying that all backslashes have to
| be escaped to be accepted as end delimiter... But how to specify a
| continuation line then?

It isn't possible. Actually using \ as the delimiter (without
-r anyway) makes little sense at all, but that doesn't mean it
needs to be prohibited.

| Furthermore the continuation test on:
[...]
| seems wrong. Shouldn't it be?:
[...]

Yes, probably - use of -d without -r is kind of rare I suspect (in
fact actually using read at all without -r is not all that common, or
not in correct code).

I will fix it along with other things - thanks.

kre


t***@kergis.com
2024-09-24 18:22:45 UTC
Post by Robert Elz
It isn't possible. Actually using \ as the delimiter (without
-r anyway) makes little sense at all, but that doesn't mean it
needs to be prohibited.
Then, this is another thing that has to be corrected in POSIX, issue
8:

---8<---
If the -r option is not specified, <backslash> shall act as an escape
character. An unescaped <backslash> shall preserve the literal value
of a following <backslash> and shall prevent a following byte (if any)
from being used to split fields, with the exception of either
<newline> or the logical line delimiter specified with the -d delim
option (if it is used and delim is not <newline>); it is unspecified
which. If this excepted character follows the <backslash>, the read
--->8---

And this escape business is simply not parsable with a backslash as
a delimiter.

I suggest that our code explicitly (for readability) set:

	if (end == '\\')
		rflag = 1;	/* no escaping if the delimiter is the escape char */

This will help a casual reader and seems, IMHO, easier to grasp
when reading than (c == '\\' && c != end), which does indeed exclude
end == '\\'. Too smart at least for me ;-)

What, once more, I can't parse in the POSIX specification is whether it
shall be interpreted as "in non-raw mode, the sequence backslash plus
delimiter is a continuation line", or as "in non-raw mode, any escaped
delimiter is a continuation line, as well as the escaped newline".

For me, the "either newline or other" has to be interpreted as xor,
but am I right? In that case it covers the whole range "newline
and not newline", so why not simply state that the escaped line
delimiter, when not in raw mode, is a continuation line (having stated
once and for all that a backslash as delimiter implies raw mode)?

And they should start by stating that the input is a sequence of lines,
considered as a sequence of bytes ending by the first appearance of a
delimiter byte that is the newline by default but that can be set to any
byte with the -d option.

That a record can span multiple lines if there are continuation
lines, that is, when not in raw mode, lines whose end delimiter is escaped.

And that read reads one record, discarding continuation lines and
replacing escaped sequences (when not in raw mode), and then splitting
the record according to the following rules.
--
Thierry Laronde <tlaronde +AT+ kergis +dot+ com>
http://www.kergis.com/
http://kertex.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89 250D 52B1 AE95 6006 F40C

Robert Elz
2024-09-27 11:52:31 UTC
Date: Fri, 27 Sep 2024 12:25:49 +0200
From: ***@kergis.com
Message-ID: <ZvaILZGn-***@kergis.com>


| I have an algebraic mind: I always think of rule. A line, sometime
| ago, was considered a sequence of bytes ending by the first appearance
| of '\n'. If a "line" is defined more generally as a sequence of bytes
| ending by the first appearance of whatever byte delimiter,

But it isn't - what a line is is defined, and it isn't that.
The delimiter is just what terminates the read, just as the
byte count given to -n does. That might take a fractional line
or many lines to achieve (given various combinations of -d and -n).

| But could you state it clearly (not \`a la POSIX :-^)
| in the man page?

That would be my hope. But writing English was never one of my
better achievements, as some of these e-mails should reveal.

| Other corner case: when specifying a limit (-n) that is "end reading at the
| first appearance of either eof, not escaped delimiter or that amount
| of bytes read", what do you do when the last byte read (reaching the
| count) is '\\'?

Stop anyway. In general, every time it can occur, a stray ending \ just
generates unspecified behaviour. In general I'd expect that using -n
would normally mean -r as well, so the whole question is irrelevant, but
for now, all that happens is that \ is read (no more, that would go beyond
the limit) and having nothing to escape, is removed along with all the
other \ chars that don't have any useful purpose (when -r is not given).

| Or do you allow the stray backslash in the last
| variable, convert it to the sequence "\\", or remove it?

For now at least, the last (the first two would be essentially
the same thing, as if that final \ was actually followed by another
and the -n limit were one byte bigger). I think the only other
reasonable approach to take would be to make it be an error, but
I don't think that's warranted here.

There will be, after all, no way to ever know it happened (in the
script): without -r, \ chars (except the escaped one, \\) are all
removed anyway, as is IFS whitespace, etc. There's no immediate way
to detect how much of each of those actually happened (with or
without -n).
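
A minimal sketch of that behaviour, as a model only and not the sh code:
the function name read_limited() is hypothetical, the input is assumed to
be in memory, -r is assumed absent, and the delimiter is the default
newline. Every input byte counts toward the limit, and a \ that lands
exactly on the limit is simply dropped.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model (not the sh source) of "read -n max" without -r,
 * with the default '\n' delimiter.  Every input byte counts toward
 * max, including bytes that are later removed.  Returns the number of
 * input bytes consumed; the processed record is stored in out, which
 * must have room for max + 1 bytes.
 */
size_t
read_limited(const char *in, size_t len, size_t max, char *out)
{
	size_t i = 0, n = 0;

	assert(in != NULL && out != NULL);
	while (i < len && i < max) {
		char c = in[i++];

		if (c == '\\') {
			if (i == len || i == max)
				break;		/* stray \ at the limit: removed */
			c = in[i++];
			if (c != '\n')		/* \ \n is dropped, but counted */
				out[n++] = c;
			continue;
		}
		if (c == '\n')
			break;			/* delimiter consumed, not stored */
		out[n++] = c;
	}
	out[n] = '\0';
	return i;
}
```

With the input abc\def and a limit of 4, exactly four bytes are consumed
and the record is abc. The script sees the same result as it would for
plain abc with a limit of 3, which is why the removal cannot be detected.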

[On -z]
| IMHO, the reverse:

That's my general preference as well, but it is a change to current
behaviour, so I will wait upon others' opinions before making that
happen (it is after all, one minor "!" operator addition, so not
exactly something that is going to take hours of work).

| Would it make sense to add a '-Z' option that translates a nul byte
| into the sequence '\000' with the specification that such a sequence
| is a constant one and is never interpreted, except by printf?

No, I don't think so. I doubt there's any immediate need for that,
and even in printf, what happens when that appears is unspecified (and
for use with %b which would be where it ought to be used, if anywhere -
not in the format string, which would mean allowing that to come from
arbitrary external input, which is almost never a good idea, though not
quite as bad in printf(1) as in printf(3)) it would need to be \0000 anyway,
to meet the ancient stupid System III definition of how to write an
octal constant for its echo program.

kre


Robert Elz
2024-09-27 16:26:08 UTC
Date: Fri, 27 Sep 2024 18:09:03 +0200
From: ***@kergis.com
Message-ID: <ZvbYn3rsyEPJmI-***@kergis.com>


| Just a note on typos:
| > mences. If more lines of data are requred in
| ^^^^^^
| required

Thanks ... I would have run a spell checker before committing!
(I still will).

| > If the -r option not was given, and the two character sequence `\'
| ^^^^^^^ ^
| was not s

Ugh ... for the first of those, I know what happened ... the "not" was
initially missing, I detected that, and inserted it! Obviously not in
the correct location.

But for the 's', no, English is just weird for things like that.

kre


t***@kergis.com
2024-09-27 18:25:35 UTC
Post by Mouse
Post by Robert Elz
Post by t***@kergis.com
Post by Robert Elz
If the -r option not was given, and the two character sequence `\'
^^^^^^^ ^
was not s
But for the 's', no, English is just weird for things like that.
Yeah. When you're using a count noun with a number as a compound
adjective, it uses the singular form of the noun, even if the
number-plus-noun as a compound noun would use the plural form. "Two
characters", but "two-character sequence" (written, above, without the
dash; I'm including the dash because I think it's clearer that way).
Similarly, "three cars", but "a three-car garage"; "this keyboard has
94 keys" but "this is a 94-key keyboard".
I have no idea why, except "history".
Perhaps because of the spoken language: "two-cars garage" could be
confused with "two cars' garage" (the garage for these two cars) to be
contrasted with a garage able to accommodate two cars?

I will finally learn english some day ;-)
--
Thierry Laronde <tlaronde +AT+ kergis +dot+ com>
http://www.kergis.com/
http://kertex.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89 250D 52B1 AE95 6006 F40C

Mouse
2024-09-27 18:04:59 UTC
Post by Robert Elz
Post by t***@kergis.com
Post by Robert Elz
If the -r option not was given, and the two character sequence `\'
^^^^^^^ ^
was not s
But for the 's', no, English is just weird for things like that.
Yeah. When you're using a count noun with a number as a compound
adjective, it uses the singular form of the noun, even if the
number-plus-noun as a compound noun would use the plural form. "Two
characters", but "two-character sequence" (written, above, without the
dash; I'm including the dash because I think it's clearer that way).
Similarly, "three cars", but "a three-car garage"; "this keyboard has
94 keys" but "this is a 94-key keyboard".

I have no idea why, except "history".

/~\ The ASCII Mouse
\ / Ribbon Campaign
X Against HTML ***@rodents-montreal.org
/ \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B

t***@kergis.com
2024-09-27 16:09:03 UTC
Post by Robert Elz
Date: Fri, 27 Sep 2024 15:04:18 +0200
| If I understand correctly your view, the explanation could be
| something around this (I mean for the idea; for the way it is
This what I came up with (no -N option has been implemented, I don't
see the point at the minute - that can be revisited later if someone
can demonstrate a meaningful use for it).
Quite an improvement compared to what I have read elsewhere!
Post by Robert Elz
-p prompt If the standard input is a terminal, then prompt
is written to standard error before the read com-
mences. If more lines of data are requred in
^^^^^^
required
Post by Robert Elz
[...]
If the -r option not was given, and the two character sequence `\'
^^^^^^^ ^
was not s

Best,
--
Thierry Laronde <tlaronde +AT+ kergis +dot+ com>
http://www.kergis.com/
http://kertex.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89 250D 52B1 AE95 6006 F40C

Edgar Fuß
2024-09-29 15:35:34 UTC
As I understand, the performance problem of the read built-in utility
originates from its need to read one byte at a time in order not to
swallow input it doesn't process.

Would it make sense to add an "exclusive" option (call it "-x" for now)
to read, where "read -x" essentially means "I promise to do all processing
on this open file with read -x or not to complain if input gets lost"?

This may be infeasible due to not being able to guarantee that any
"read -x" really uses the built-in version.

The point, of course, is to allow the shell to buffer the input.
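
The trade-off can be sketched with a pipe. This is a hypothetical
demonstration, not sh code, and the function name
leftover_after_first_line() is made up for the example. It consumes the
first of two lines either byte by byte or with one large read(2), then
counts what is left for a later reader.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical demonstration (not sh code).  Put two lines into a pipe,
 * consume the first line either one byte at a time or with a single
 * large read(2), and return how many bytes are still left in the pipe
 * for a later reader.  Returns -1 if the pipe cannot be set up.
 */
long
leftover_after_first_line(int buffered)
{
	static const char data[] = "first\nsecond\n";
	char buf[64], c;
	int fd[2];
	long left = 0;
	ssize_t n;

	if (pipe(fd) == -1)
		return -1;
	if (write(fd[1], data, sizeof data - 1) != (ssize_t)(sizeof data - 1)) {
		close(fd[0]);
		close(fd[1]);
		return -1;
	}
	close(fd[1]);			/* so the reader will see EOF */

	if (buffered) {
		/* One big read: takes "first\n" and whatever follows it. */
		n = read(fd[0], buf, sizeof buf);
		assert(n > 0);
	} else {
		/* Byte at a time: stop right after the newline. */
		while ((n = read(fd[0], &c, 1)) == 1 && c != '\n')
			continue;
	}

	/* Count what a subsequent utility would still be able to read. */
	while ((n = read(fd[0], buf, sizeof buf)) > 0)
		left += n;
	close(fd[0]);
	return left;
}
```

Read byte by byte, the 7 bytes of "second\n" remain available. With the
single large read nothing remains, which is the lost input that a caller
of such an option would have to accept.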

Robert Elz
2024-09-27 00:03:15 UTC
Date: Tue, 24 Sep 2024 17:30:42 +0200
From: ***@kergis.com
Message-ID: <***@kergis.com>

| Furthermore the continuation test on:
|
| if (c != '\n') /* \ \n is always just removed */
| goto wdch;
|
| seems wrong. Shouldn't it be?:
|
| if (c != end)
| goto wdch;

Actually no, what is there now is what is intended.

The idea is that the input might need to be divided into many lines
to meet the requirement that it be a text file, which means a max
line length (as you're aware), and that max length is from the first
char in the line to the next \n char (read's delimiter char has
nothing to do with that use of \n). To allow that, while not restricting
the length of a record, the sequence \ \n is allowed to indicate
continuation lines, regardless of what the delimiter is, and is simply
removed from the input stream (just as in cpp and sh - and more).

Other than that usage, a \ also escapes the following char, avoids
it being anything special (not a field (word) separator, not the
delimiter, and of course, as \\ not the escape char either).

If the delimiter was \n (the default, or -d $'\n') then the end of line
continuation removal causes it to vanish before the code checks if the
delimiter has appeared. If the delimiter is something else, we don't want
it to vanish, there is no point in that -- say we use "-d :", why would
we then ever write \: in the input if that pair of chars is simply
deleted? Makes no sense. What we would want is the escaped : there
to be a regular char, not deleted, and not the delimiter either.

So the test above is checking for when we have a \ before some
character other than \n - in which case the goto adds the following
character to the current word (which makes it into just an ordinary
char, not special in any way, with the preceding \ removed). But
if it is \n after the \ we don't do that, so just continue (next
line not shown above) which goes back to read more input, simply
discarding the \ \n sequence, which is what we want to happen whether
\n is the delimiter or not.
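
The rules described above can be modelled in a few lines. This is only a
sketch under assumptions, not the miscbltin.c code: unescape_record() is
a hypothetical name, the input is taken from memory, and field splitting
is left out entirely.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model of the escape handling described above (not the
 * miscbltin.c code).  Copies one record from in[0..len) to out, which
 * must have room for len + 1 bytes, and returns the record length.
 *   \ \n   -> both bytes dropped (line continuation)
 *   \ c    -> c kept as plain data, even if c == delim
 *   delim  -> ends the record (unescaped only)
 */
size_t
unescape_record(const char *in, size_t len, char delim, char *out)
{
	size_t i, n = 0;

	assert(in != NULL && out != NULL);
	for (i = 0; i < len; i++) {
		char c = in[i];

		if (c == '\\' && c != delim) {
			if (++i == len)		/* stray trailing \ : dropped */
				break;
			c = in[i];
			if (c != '\n')		/* \ \n is always just removed */
				out[n++] = c;
			continue;
		}
		if (c == delim)
			break;
		out[n++] = c;
	}
	out[n] = '\0';
	return n;
}
```

With ':' as the delimiter, the input a\:b:c gives the record a:b, so the
escaped delimiter survives as data, while x, \, newline, y, :, z gives xy,
because the \ \n pair disappears whatever the delimiter is.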

This is specifically allowed by posix in the spec of the read command,
though you have to read the almost indecipherable sentence about a
million times, and already knowing what it is trying to say, to
understand it (and even then I think what it is saying has an error,
but it is so hard to decipher I'm not sure).

Apart from that:

I think I have -n implemented as intended (by me anyway) now. But
now I need to also update the manual ... I started trying to fit it
into the text in the form the description of the read builtin
currently exists, but that got ridiculously messy, so I am going to
discard the whole current description and do it again in the more
conventional form, with the options listed as a list, rather than just
worked into the description in narrative form. That's going to take
another day or so.

I have also added -z (currently, for not very important backward compat
with the current impl) to issue an error if a \0 is encountered in the
input (other than as the record delimiter). Inverting the
sense of that option probably makes more sense (-z to allow \0
chars, and error without that option). Either way this is very
very simple and cheap to implement, as the code has to check for
the \0 chars anyway. (The error would cause the read to terminate
with exit status 2, as does any other error).
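
A minimal sketch of the two policies follows. It is a model, not the sh
source: copy_checking_nul() and its allow_nul parameter are hypothetical
and stand for whichever sense the option finally takes.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model of the two \0 policies (not the sh source).
 * Copies in[0..len) to out, which must have room for len + 1 bytes.
 * With allow_nul set, \0 bytes are silently dropped (the current
 * behaviour); otherwise the first \0 is an error.  Returns 0 on
 * success, or 2 to mirror the exit status mentioned above.
 */
int
copy_checking_nul(const char *in, size_t len, int allow_nul, char *out)
{
	size_t i, n = 0;

	assert(in != NULL && out != NULL);
	for (i = 0; i < len; i++) {
		if (in[i] == '\0') {
			if (!allow_nul)
				return 2;	/* \0 in input: report an error */
			continue;		/* \0 ignored, as at present */
		}
		out[n++] = in[i];
	}
	out[n] = '\0';
	return 0;
}
```

Either way the loop already inspects every byte, so the check costs one
extra comparison per byte.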

Or that option could just go away again. Opinions please? (everyone)

kre

t***@kergis.com
2024-09-27 10:25:49 UTC
Post by Robert Elz
Date: Tue, 24 Sep 2024 17:30:42 +0200
|
| if (c != '\n') /* \ \n is always just removed */
| goto wdch;
|
|
| if (c != end)
| goto wdch;
Actually no, what is there now is what is intended.
The idea is that the input might need to be divided into many lines
to meet the requirement that it be a text file, which means a max
line length (as you're aware), and that max length is from the first
char in the line to the next \n char (read's delimiter char has
nothing to do with that use of \n). To allow that, while not restricting
the length of a record, the sequence \ \n is allowed to indicate
continuation lines, regardless of what the delimiter is, and is simply
removed from the input stream (just as in cpp and sh - and more).
Other than that usage, a \ also escapes the following char, avoids
it being anything special (not a field (word) separator, not the
delimiter, and of course, as \\ not the escape char either).
If the delimiter was \n (the default, or -d $'\n') then the end of line
continuation removal causes it to vanish before the code checks if the
delimiter has appeared, if the delimiter is something else, we don't want
it to vanish, there is no point in that -- say we use "-d :", why would
we then ever write \: in the input if those pair of chars are simply
deleted? Makes no sense. What we would want is the escaped : there
to be a regular char, not deleted, and not the delimiter either.
I have an algebraic mind: I always think in terms of rules. A line, some
time ago, was considered a sequence of bytes ending with the first appearance
of '\n'. If a "line" is defined more generally as a sequence of bytes
ending with the first appearance of whatever delimiter byte, then a "continuation
line" is the escaped delimiter. And if the delimiter is not '\n',
'\\''\n' yields a '\n'.

But this is all fuzzy because read was intended for text files,
meaning essentially with lines defined against '\n' and all the rest
has been added, if not at random, by usage (ignoring that it can't be
a general binary read because it can't handle the nul byte).

So, it's obviously up to you. But could you state it clearly (not \`a la POSIX :-^)
in the man page?
Post by Robert Elz
[about option '-n']
Another corner case: when specifying a limit (-n), meaning "end reading at
the first appearance of either EOF, an unescaped delimiter, or that number
of bytes read", what do you do when the last byte read (reaching the
count) is '\\'? Do you absorb the following byte in every case, even if
"read -n num" then leads to reading "num + 1" bytes (which is not what
the user required)? Or do you allow the stray backslash in the last
variable, convert it to the sequence "\\", or remove it?
Post by Robert Elz
[...]
I have also added -z (currently, for not very important backward compat
with the current impl) to issue an error if a \0 is encountered in the
input (other than as the record delimiter). Inverting the
sense of that option probably makes more sense (-z to allow \0
chars, and error without that option). Either way this is very
very simple and cheap to implement, as the code has to check for
the \0 chars anyway. (The error would cause the read to terminate
with exit status 2, as does any other error).
Or that option could just go away again. Opinions please? (everyone)
IMHO, the reverse: since the nul byte is ignored (at the current time),
the user is not getting what he wants, perhaps without even knowing it. So
signal the problem (erroring is better) and force the user to explicitly
set -z, meaning "I'm aware I'm not getting all".

Would it make sense to add a '-Z' option that translates a nul byte
into the sequence '\000' with the specification that such a sequence
is a constant one and is never interpreted, except by printf?
--
Thierry Laronde <tlaronde +AT+ kergis +dot+ com>
http://www.kergis.com/
http://kertex.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89 250D 52B1 AE95 6006 F40C

Robert Elz
2024-09-27 15:47:39 UTC
Date: Fri, 27 Sep 2024 15:04:18 +0200
From: ***@kergis.com
Message-ID: <***@kergis.com>

| If I understand correctly your view, the explanation could be
| something around this (I mean for the idea; for the way it is
| expressed in some kind of english...):

This is what I came up with (no -N option has been implemented, I don't
see the point at the minute - that can be revisited later if someone
can demonstrate a meaningful use for it).

In the description of the -z option, either just the brackets, or the
brackets and all contained text, will end up being deleted, depending
upon which way the option ends up working.

I did add the -b option (turns out to be easy, and actually helpful to
avoid the tty needing to be put into raw mode, losing erase/kill
processing in most cases).

I also added the PS2 output (required by POSIX) when obtaining a
continuation line from stdin as a terminal, which we never bothered
with before.

Comments appreciated (other than about it being just ascii, with no
extra formatting visible - the actual man page doesn't have that
limitation). I am not particularly happy with the wording for -n.

The final paragraph is about (just slightly modified) all that remains
from the existing man page (sh(1)) description of read.

kre

read [-brz] [-d delim] [-n max] [-p prompt] variable [...]

The read command reads a record from its standard input (by
default one line) splits that record as if by field splitting,
and assigns the results to the named variable arguments, as
detailed below.

The options are as follows:

-b Do buffered reads, rather than reading one byte
at a time. Use of this option might result in
reading more bytes from standard input than the
read utility actually processes, causing some
data from standard input to be unavailable to any
subsequent utility that expects to obtain them.

-d delim End the read when the first byte of delim is
obtained from standard input. Specifying "" as
delim causes the nul character (`\0') to be the
end delimiter. The default is <newline> (`\n').

-n max read will read no more than max bytes from stan-
dard input. The default is unlimited. If the
end delim has not been encountered within max
bytes, read will act as if one immediately fol-
lowed the max'th byte, without attempting to
obtain it. However, even if the -r option is not
given and the final byte actually read were the
escape character (not itself escaped), no more
bytes will be read, and that escape character
would simply be removed as described below.

-p prompt If the standard input is a terminal, then prompt
is written to standard error before the read com-
mences. If more lines of data are requred in
that case, the normal PS2 prompt is written as
each subsequent line is to be obtained.

-r Reduced processing of the input. No escape
characters are recognised, and line continuation
is not performed. See below.

-z If a nul character (`\0') is found in the input,
other than when acting as the delimiter, an error
is [normally] generated. [This option disables
that error, the nul is simply ignored.]

If the read is from a terminal device, and the -p option was
given, prompt is printed on standard error. Then a record, termi-
nated by the first character of delim if the -d option was given,
or a <newline> (`\n') character otherwise, but no longer than max
bytes if the -n option was given, is read from the standard input.
If the -b option is not given, no data from standard input beyond
the end delimiter, or the max bytes that may be read, are
obtained.

If the -r option not was given, and the two character sequence `\'
`\n' is encountered, those two characters are simply deleted, and
provided that max bytes have not yet been obtained, and the end
delimiter has yet to be encountered, more input is obtained, with
the first character of the following line placed in the input
where the deleted `\' had been. This allows logical lines longer
than the maximum line length permitted for text files to be pro-
cessed. The two removed characters are still counted for the pur-
poses of the max input limit.

If the -r flag was not given, the <backslash> character (`\')
is then treated as an escape character; the character
following it is always treated as a normal, insignificant, data
character, and is never treated as the end delimiter nor as an IFS
character for field splitting.

After field splitting has completed, but before data has been
assigned to any variables, all escape characters are removed.
Note that the two character sequence `\' `\' can be used to enter
the escape character as data, the first acts as the escape charac-
ter, the second becomes just a normal data character.

The ending delimiter, if encountered, and not escaped, is deleted
from the record which is then split as described in the field
splitting section of the Word Expansions section above. The
pieces are assigned to the variables in order. If there are more
pieces than variables, the remaining pieces (along with the char-
acters in IFS that separated them) are all assigned to the last
variable. If there are more variables than pieces, the remaining
variables are assigned the null string. The read built-in utility
will indicate success unless EOF, or a read error, is encountered
on input, or there is a usage error (unknown option, etc) in which
case failure is returned.
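
The assignment rule in that last paragraph can be sketched as follows.
This is a simplified model, not the sh implementation: split_fields() is
a hypothetical name, IFS is assumed to contain only space and tab, and
escape handling is ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical, simplified model of the assignment rule (not the sh
 * code).  Splits rec in place on blanks and stores pointers to the
 * pieces in vars[0..nvars).  The last variable receives everything
 * that is left, separators included, minus trailing blanks; variables
 * with no piece left point at an empty string.
 */
void
split_fields(char *rec, char **vars, size_t nvars)
{
	char *p = rec;
	size_t i, n;

	assert(rec != NULL && vars != NULL && nvars > 0);
	for (i = 0; i < nvars; i++) {
		p += strspn(p, " \t");		/* skip leading blanks */
		vars[i] = p;
		if (i == nvars - 1) {
			/* Last variable: keep the rest, trim trailing blanks. */
			n = strlen(p);
			while (n > 0 && (p[n - 1] == ' ' || p[n - 1] == '\t'))
				n--;
			p[n] = '\0';
		} else {
			p += strcspn(p, " \t");	/* find the end of this piece */
			if (*p != '\0')
				*p++ = '\0';
		}
	}
}
```

With two variables and the record "one two  three four ", the first gets
one and the second gets "two  three four", inner blanks preserved. With
three variables and the record "a b", the third is the empty string.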



Paul Goyette
2024-09-27 20:51:08 UTC
Post by t***@kergis.com
I will finally learn english some day ;-)
Wait until we native English speakers can learn at least some
of it! :-)


+---------------------+--------------------------+----------------------+
| Paul Goyette (.sig) | PGP Key fingerprint: | E-mail addresses: |
| (Retired) | 1B11 1849 721C 56C8 F63A | ***@whooppee.com |
| Software Developer | 6E2E 05FD 15CE 9F2D 5102 | ***@netbsd.org |
| & Network Engineer | | ***@gmail.com |
+---------------------+--------------------------+----------------------+

Jason Thorpe
2024-09-28 01:23:53 UTC
Post by Paul Goyette
Post by t***@kergis.com
I will finally learn english some day ;-)
Wait until we native English speakers can learn at least some
of it! :-)
Me talk pretty one day!

-- thorpej

