sh(1) read: add LINE_MAX safeguard and "-n" option

Discussion:

t***@kergis.com

2024-09-24 10:56:49 UTC

The intrinsic built-in read was specified in earlier versions of POSIX
as operating on text files, which are supposed to end with a newline.
The specification has since been relaxed, to take usage and existing
implementations into account, so that input without a trailing newline
is also accepted.

A problem arises when read is used, for example in pkgsrc utilities,
to get the first "line" of files in order to decide whether they are
interpreted files for which some portability adjustment has to be done.
Exec'ing sed(1) for every file costs too much, so a built-in is
preferred, but when a file does not contain a newline, read reads the
entire file, byte by byte, and it takes ages.
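
To give an idea of the cost, here is a minimal C sketch (not code from
sh(1)) of an unbuffered reader that, like the built-in, asks for one
byte per read(2) call until it sees a newline or EOF. The helper name
count_reads_for_line is hypothetical and exists only for this
illustration; it simply counts the system calls made.

```c
#include <assert.h>
#include <unistd.h>

/*
 * Hypothetical helper, for illustration only: consume one "line" from
 * fd with one read(2) call per byte, and return the number of read(2)
 * calls that were needed.
 */
static long
count_reads_for_line(int fd)
{
	long calls = 0;
	char c;

	assert(fd >= 0);
	for (;;) {
		calls++;
		if (read(fd, &c, 1) != 1)
			break;		/* EOF or error ends the "line" */
		if (c == '\n')
			break;		/* normal end of line */
	}
	return calls;
}
```

On input starting with "#!/bin/sh" followed by a newline this makes
10 calls; on N bytes containing no newline it makes N + 1 calls (the
last one returning EOF), so the cost grows with the size of the file.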

The present patch does two things:

1) Set, by default, the maximum number of bytes read, in every case, to
LINE_MAX (the maximum number of bytes in a line of a text file);

2) Implement the '-n' option, which sets the maximum number of bytes to
read explicitly, and thus also allows the LINE_MAX value to be bypassed
deliberately.
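
The two points can be sketched in stand-alone C as follows. This is an
illustration under assumptions, not the patched readcmd(): the helper
names parse_limit and read_record are hypothetical, there is no
backslash or IFS handling, and the limit is checked before each read(2),
which is a choice of this sketch rather than something taken from the
patch. The atoi() handling and the LINE_MAX fallback follow the patch.

```c
#include <assert.h>
#include <limits.h>	/* LINE_MAX, if defined */
#include <stdlib.h>
#include <unistd.h>

#ifndef LINE_MAX
# define LINE_MAX 2048	/* same fallback value as in the patch */
#endif

/* Hypothetical helper: turn the argument of -n into a byte limit. */
static size_t
parse_limit(const char *arg)
{
	int n = atoi(arg);

	/* A zero, negative or non-numeric count falls back to LINE_MAX. */
	return n > 0 ? (size_t)n : (size_t)LINE_MAX;
}

/*
 * Hypothetical helper: read one byte at a time from fd into buf
 * (which must hold at least limit bytes) until delim is seen, limit
 * bytes have been stored, or EOF/error occurs.  The delimiter is not
 * stored.  *status is set to 1 on EOF or error and to 0 otherwise.
 * Returns the number of bytes stored.
 */
static size_t
read_record(int fd, char *buf, size_t limit, char delim, int *status)
{
	size_t n = 0;
	char c;

	assert(buf != NULL && status != NULL);
	*status = 0;
	while (n < limit) {
		if (read(fd, &c, 1) != 1) {
			*status = 1;	/* EOF or read error */
			break;
		}
		if (c == delim)
			break;		/* end of record */
		buf[n++] = c;
	}
	return n;
}
```

With a limit of 3 and the input "abcdefgh", read_record() stores "abc"
and stops with status 0; the remaining bytes are left for the next
reader.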

It is a compromise between usage, historical meaning, practical use
and safety.

If limiting the number of bytes read by default to a limit associated
with text files (LINE_MAX) were considered contrary to the present
POSIX spec (which has "relaxed" the requirement of an end-of-line, not
stated that read has to handle any file whatsoever), another maximum
value, associated with the size of a file, could be used instead of
LINE_MAX (in the present patch; this would not improve the time in
practice unless the maximum is set to a more reasonable value).

BTW: the usage message displayed when no variable name is given did
not show the "[-d delim]" option.

diff --git a/bin/sh/miscbltin.c b/bin/sh/miscbltin.c
index c4f963d0d86a..2248b7830835 100644
--- a/bin/sh/miscbltin.c
+++ b/bin/sh/miscbltin.c
@@ -54,6 +54,7 @@ __RCSID("$NetBSD: miscbltin.c,v 1.54 2023/10/05 20:33:31 kre Exp $");
#include <stdlib.h>
#include <ctype.h>
#include <errno.h>
+#include <limits.h> /* LINE_MAX, if defined */

#include "shell.h"
#include "options.h"
@@ -67,6 +68,9 @@ __RCSID("$NetBSD: miscbltin.c,v 1.54 2023/10/05 20:33:31 kre Exp $");

#undef rflag

+#ifndef LINE_MAX
+# define LINE_MAX 2048 /* peak value of _POSIX2_LINE_MAX */
+#endif

/*
@@ -75,6 +79,11 @@ __RCSID("$NetBSD: miscbltin.c,v 1.54 2023/10/05 20:33:31 kre Exp $");
*
* This uses unbuffered input, which may be avoidable in some cases.
*
+ * For safety and efficiency (especially when called on a non-text file),
+ * the maximum number of bytes read is LINE_MAX. The '-n' option can
+ * explicitly bypass it (since it is not a POSIX required option,
+ * POSIX text file limits need not apply).
+ *
* Note that if IFS=' :' then read x y should work so that:
* 'a b' x='a', y='b'
* ' a b ' x='a', y='b'
@@ -101,6 +110,8 @@ readcmd(int argc, char **argv)
int i;
int is_ifs;
int saveall = 0;
+ int linemax;
+ int nread;
ptrdiff_t wordlen = 0;
char *newifs = NULL;
struct stackmark mk;
@@ -108,11 +119,18 @@ readcmd(int argc, char **argv)
end = '\n'; /* record delimiter */
rflag = 0;
prompt = NULL;
- while ((i = nextopt("d:p:r")) != '\0') {
+ linemax = LINE_MAX;
+
+ while ((i = nextopt("d:n:p:r")) != '\0') {
switch (i) {
case 'd':
end = *optionarg; /* even if '\0' */
break;
+ case 'n':
+ linemax = atoi(optionarg);
+ if (linemax <= 0)
+ linemax = LINE_MAX;
+ break;
case 'p':
prompt = optionarg;
break;
@@ -124,7 +142,7 @@ readcmd(int argc, char **argv)

if (*(ap = argptr) == NULL)
error("variable name required\n"
- "Usage: read [-r] [-p prompt] var...");
+ "Usage: read [-r] [-d delim] [-n count] [-p prompt] var...");

if (prompt && isatty(0)) {
out2str(prompt);
@@ -138,16 +156,20 @@ readcmd(int argc, char **argv)
status = 0;
startword = 2;
STARTSTACKSTR(p);
+ nread = 0;
for (;;) {
if (read(0, &c, 1) != 1) {
status = 1;
break;
}
- if (c == '\\' && c != end && !rflag) {
+ if (++nread > linemax) /* same as end */
+ break;
+ if (c == '\\' && c != end && !rflag && nread != linemax ) {
if (read(0, &c, 1) != 1) {
status = 1;
break;
}
+ ++nread;
if (c != '\n') /* \ \n is always just removed */
goto wdch;
continue;
diff --git a/bin/sh/sh.1 b/bin/sh/sh.1
index a819e9d72188..1938dd9478d3 100644
--- a/bin/sh/sh.1
+++ b/bin/sh/sh.1
@@ -3663,7 +3663,7 @@ the program will use
and the built-in uses a separately cached value.
.\"
.Pp
-.It Ic read Oo Fl d Ar delim Oc Oo Fl p Ar prompt Oc Oo Fl r Oc Ar variable Op Ar ...
+.It Ic read Oo Fl d Ar delim Oc Oo Fl n Ar count Oc Oo Fl p Ar prompt Oc Oo Fl r Oc Ar variable Op Ar ...
The
.Ar prompt
is printed on standard error if the
@@ -3674,8 +3674,14 @@ first character of
.Ar delim
if the
.Fl d
-option was given, or a newline character otherwise,
-is read from the standard input.
+option was given, or a newline character otherwise, or after reading
+at maximum
+.Ar count
+bytes if the
+.Fl n
+option was given, or the system defined at compile time
+.Dv LINE_MAX
+otherwise, is read from the standard input.
The ending delimiter is deleted from the
record which is then split as described in the field splitting section of the
.Sx Word Expansions
@@ -3697,6 +3703,16 @@ built-in will indicate success unless EOF, or a read error,
is encountered on input, in
which case failure is returned.
.Pp
+In what follows, the processing is only done if the maximum of bytes
+to read has not been reached. This maximum always exists, whether set
+by the user with the
+.Fl n
+option, or defaulting to a system defined at compile time
+.Dv LINE_MAX
+value. This indicates that not more than this amount of bytes will be
+read, and does not indicate the amount of bytes that will actually be
+read, nor the number of chars that will be present in the variables.
+.Pp
By default, unless the
.Fl r
option is specified, the backslash

--
Thierry Laronde <tlaronde +AT+ kergis +dot+ com>
http://www.kergis.com/
http://kertex.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89 250D 52B1 AE95 6006 F40C

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Robert Elz

2024-09-24 11:54:35 UTC