Skip to content

feat(read): implement read builtin with IFS field-splitting#211

Draft
matt-dz wants to merge 4 commits intomainfrom
claude/gifted-dewdney-13b845
Draft

feat(read): implement read builtin with IFS field-splitting#211
matt-dz wants to merge 4 commits intomainfrom
claude/gifted-dewdney-13b845

Conversation

@matt-dz
Copy link
Copy Markdown
Collaborator

@matt-dz matt-dz commented May 1, 2026

What does this PR do?

Adds the read shell builtin to the restricted shell. read consumes a delimited chunk from stdin and assigns each IFS-split field to a shell variable (defaulting to REPLY when no name is given).

This is the first builtin in the codebase that mutates parent-shell state, so the change extends the builtin contract with two minimal capabilities on builtins.CallContext:

  • SetVar(name, value) error — routes through the runner's existing setVarErr, so MaxVarBytes (1 MiB) and MaxTotalVarsBytes (1 MiB) are enforced for free.
  • GetVar(name) (string, bool) — needed so read can consult IFS for field-splitting.

These two together form the minimal complete contract for any future state-aware builtin (export, unset, shift, …).

Motivation

read is a fundamental POSIX builtin. The canonical idiom while IFS= read -r line; do …; done < file is widely used in shell scripts. Until now the shell had no way to express it; read was explicitly blocked at dispatch time. Lifting that block required granting builtins a way to mutate parent-shell variables, which was a deliberate architectural decision discussed during implementation rather than a one-line capability add.

Surface

Supported flags (six + --help):

Flag Behaviour
-r Raw mode (no backslash interpretation)
-p PROMPT Print PROMPT to stderr; only when stdin is a TTY (matches bash)
-d DELIM Single-rune delimiter; -d '' means NUL
-n NCHARS At most N characters (UTF-8 aware), capped at 1 MiB
-N NBYTES Exactly N bytes, ignoring delimiters, capped at 1 MiB
-t SECONDS Read deadline via *os.File.SetReadDeadline; context.WithTimeout fallback. Returns exit 142 on timeout (bash convention: 128 + SIGALRM)

Excluded from v1 with rationale in SHELL_FEATURES.md: -a (needs a parallel SetArrayVar contract addition; can follow later), -s/-u/-e/-i (interactive-terminal features that don't apply in a non-interactive shell).

Memory & DoS hardening

  • -n and -N are rejected at flag-parse time when N > 1 MiB or N < 0. The 1 MiB anchor matches the existing MaxLineBytes cap in cat / head and the runner's MaxVarBytes/MaxTotalVarsBytes.
  • readInput enforces a separate runtime byte ceiling (also 1 MiB) so a UTF-8-amplified -n cannot exceed the buffer cap.
  • Bytes are read one at a time without buffering so subsequent read calls on the same stdin observe unconsumed bytes (a bufio.Reader would swallow bytes past the delimiter — common bug in naive implementations).
  • Defenses layered: flag-time rejection → runtime byte cap → MaxVarBytes on the assigned value → MaxTotalVarsBytes on the aggregate variable store.

IFS field-splitting

Implements POSIX read semantics:

  • Whitespace IFS chars (space, tab, newline) coalesce; non-whitespace IFS chars introduce a separate (possibly empty) field per occurrence.
  • Leading and trailing IFS-whitespace is stripped from the input.
  • When the input has more fields than names, the last variable absorbs the remainder verbatim with only trailing IFS-whitespace stripped.
  • When fewer fields than names, missing variables are set to empty.

Testing

16 new YAML scenarios under tests/scenarios/cmd/read/:

  • basic/ (7) — single var, default REPLY, multi-var, last-absorbs, fewer-fields, redirect-from-file, EOF-returns-1.
  • flags/ (6) — raw vs non-raw, line continuation, delimiter, -n, -N.
  • ifs/ (3) — default IFS (space/tab), custom IFS (:), empty IFS (no stripping).
  • errors/ (4) — -n and -N cap rejection, invalid timeout, invalid identifier.
  • help/ (1) — --help exits 0 with usage.

The non-bash-comparison scenarios all pass locally. Bash byte-for-byte comparison (RSHELL_BASH_TEST=1) was not run locally — Docker daemon was not running. Worth verifying in CI.

Three pre-existing help scenarios were updated for the bumped builtin count (28 → 29). The obsolete shell/blocked_commands/.../read.yaml scenario was deleted.

Symbol allowlists in analysis/symbols_builtins.go updated: 10 new symbols added to the global ceiling (context.CancelFunc, context.DeadlineExceeded, fmt.Fprint, os.ErrDeadlineExceeded, os.File, os.ModeCharDevice, time.Now, unicode/utf8.{DecodeLastRuneInString,FullRune,RuneSelf}) plus a new "read": per-command sublist.

Reviewer notes

  • The two-field expansion of CallContext is the architecturally significant change. Future state-aware builtins should reuse SetVar/GetVar rather than reaching past them.
  • Non-raw backslash handling does require multi-rune lookahead (\<delim> is a line continuation, \<X> is just X). The readInput function handles this with a single helper closure for rune reads.
  • -p is silent in this codebase because stdin is always a pipe in scripted contexts. The TTY check is there for parity in case stdin is ever wired to a terminal by a calling embedder.
  • Per AGENTS.md, this PR is opened as a draft; please do not add the verified/allowed_symbols label automatically — that is reserved for human approval after reviewing the symbol-allowlist changes.

Checklist

  • Tests added (16 YAML scenarios)
  • Documentation updated (SHELL_FEATURES.md)
  • Symbol allowlists updated
  • make fmt clean, go vet clean, full test suite passes locally
  • Bash comparison verified locally (skipped — Docker not running)

🤖 Generated with Claude Code

matt-dz and others added 2 commits May 1, 2026 13:28
Adds the `read` shell builtin, which consumes a delimited chunk from
stdin and assigns the IFS-split fields to one or more shell variables
(defaulting to REPLY when no name is given). This is the first builtin
in the codebase that mutates parent-shell state, so the change extends
the builtin contract with a minimal pair of capabilities.

Contract change (builtins.CallContext):
  - SetVar(name, value) error: assign to a shell variable; routes
    through the runner's setVarErr so MaxVarBytes and
    MaxTotalVarsBytes are enforced.
  - GetVar(name) (string, bool): read a shell variable (used to
    consult IFS for field-splitting).

Wired at both CallContext construction sites in interp/runner_exec.go.

Builtin (builtins/read/read.go) supports:
  -r              raw mode (no backslash interpretation)
  -p PROMPT       print to stderr (only when stdin is a TTY, matches bash)
  -d DELIM        single-rune line delimiter; -d '' uses NUL
  -n NCHARS       read at most N characters (UTF-8 aware), capped at 1 MiB
  -N NBYTES       read exactly N bytes ignoring delimiters, capped at 1 MiB
  -t SECONDS      timeout via *os.File.SetReadDeadline (or context.WithTimeout
                  fallback); returns 142 on timeout to match bash
  --help          usage to stdout

Excluded from v1: -a (would need SetArrayVar), -s/-u/-e/-i (interactive
features that don't apply to a non-interactive shell). Documented in
SHELL_FEATURES.md.

Field-splitting follows POSIX read semantics: whitespace IFS chars
coalesce, non-whitespace IFS chars introduce separate fields, leading
and trailing IFS-whitespace is stripped, and the last variable absorbs
the remainder when there are more fields than names.

Bytes are read one at a time without buffering so subsequent reads from
the same stdin observe unconsumed bytes (a buffered reader would
swallow them past the delimiter).

Tests: 16 new scenarios under tests/scenarios/cmd/read/ covering
basic, flags, ifs, errors, and help. Three help scenarios bumped from
28 to 29 builtins; the obsolete read-is-blocked scenario removed.
Symbol allowlists updated for the new package's imports.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Builtins must not call time.Now directly; the canonical pattern is
callCtx.Now (script-start reference, captured once per Run for stable
multi-builtin reasoning) for things like find -mtime / ls -l.

For read -t SECONDS we want a true monotonic "from now" deadline, not a
script-start reference (a 5s timeout 10s into the script must not fire
immediately). callCtx.Now is therefore wrong here. The right answer is
to read the deadline already computed by context.WithTimeout — that
call is on the symbol allowlist, the stdlib does the time.Now lookup
internally, and ctx.Deadline() lets us project that same instant onto
*os.File.SetReadDeadline so kernel and context deadlines stay in sync.

Removes time.Now from both the global builtin allowlist and the read
per-command sublist.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@matt-dz
Copy link
Copy Markdown
Collaborator Author

matt-dz commented May 1, 2026

@codex conduct a comprehensive security and code review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cbf514b698

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread builtins/read/read.go Outdated
Comment thread builtins/read/read.go Outdated
Comment thread builtins/read/read.go Outdated
Comment thread builtins/read/read.go
matt-dz and others added 2 commits May 1, 2026 14:13
Four bash-compatibility fixes flagged by codex:

1. Preserve unsplit input for REPLY and -N
   When no NAME is given, bash assigns the raw line to REPLY without
   IFS trimming or splitting. With -N, bash assigns the whole read to
   the first NAME and leaves remaining NAMEs empty (no IFS splitting).
   Both paths now skip splitIFS entirely.

2. Count -N input as characters, not raw bytes
   Bash's -N is character-based, not byte-based: read -N 1 over "éx"
   stores "é" (2 bytes), not the first byte of the UTF-8 sequence.
   Dropped the dedicated byte-mode branch and unified -N with the
   default rune+escape path; -N differs only in that it ignores the
   delimiter rune. Backslash handling in -N now matches bash too
   (escapes processed unless -r).

3. Permit values exactly at the read byte cap
   The cap check fired before the -n / delimiter checks, so a value of
   exactly MaxReadBytes was rejected on the next iteration even though
   the flag validation and MaxVarBytes both allow that size. The cap
   now fires only when an append would push past it, so a 1 MiB
   value/line followed by EOF or the delimiter returns successfully.

4. Handle a single trailing IFS separator
   `printf 'a:b:\n' | { IFS=: read a b; }` should give b="b" because
   the trailing colon represents an empty trailing field that bash
   silently drops. splitIFS now strips a lone trailing non-ws IFS
   character from the last absorbed field when it is the only such
   character in that field. Verified across the spectrum: a:b: → b,
   a:b:: → b::, a::b: → :b: (only one strip when there is exactly
   one trailing IFS char in the absorbed remainder).

New scenarios:
  basic/reply_preserves_padding.yaml
  flags/N_counts_chars_utf8.yaml
  flags/N_applies_escape.yaml
  flags/N_skips_field_split.yaml
  ifs/trailing_single_ifs_stripped.yaml

All 22 read scenarios pass; full suite green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The bash comparison test runs in debian:bookworm-slim's default C/POSIX
locale, where bash's `read -N` counts bytes (so `read -N 1` over `éx`
returns the first byte 0xC3 — an invalid UTF-8 fragment). rshell always
counts UTF-8 characters regardless of locale, so the test assertion
`[é]` matched bash on macOS (UTF-8 locale) but not in CI.

Setting LC_ALL=C.UTF-8 for the bash side puts bash in UTF-8 mode,
making the comparison meaningful: both shells now treat -N as
character-based on UTF-8 input. Verified locally with
`LC_ALL=C bash` (returns `[\xc3]`) vs `LC_ALL=C.UTF-8 bash` (returns
`[é]`).

The intentional divergence in C-locale environments is documented in
the scenario comment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@matt-dz
Copy link
Copy Markdown
Collaborator Author

matt-dz commented May 1, 2026

@codex conduct a comprehensive security and code review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 388464302b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread builtins/read/read.go
Comment on lines +180 to +183
if err != nil {
if errors.Is(err, context.DeadlineExceeded) || errors.Is(err, os.ErrDeadlineExceeded) {
// Bash convention for read -t timeout: 128 + SIGALRM (14) = 142.
return builtins.Result{Code: 142}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve partial input on read timeouts

When read -t times out after receiving some bytes but before the delimiter, this branch returns before the assignment logic runs, so the partially read value is lost. Bash still assigns the partial line in this scenario (for example, { printf abc; sleep 0.2; printf '\n'; } | { read -t 0.05 x; ...; } leaves x=abc with status 142), so timeout handling should continue through the normal assignment path when line is non-empty.

Useful? React with 👍 / 👎.

Comment thread builtins/read/read.go
Comment on lines +123 to +125
dur := time.Duration(secs * float64(time.Second))
var cancel context.CancelFunc
readCtx, cancel = context.WithTimeout(ctx, dur)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Handle zero-second read timeouts as a poll

For read -t 0, Bash help specifies that read returns immediately without reading data and succeeds only if input is available, but context.WithTimeout(ctx, 0) creates an already-expired context so readInput returns DeadlineExceeded unconditionally. This makes even printf x | { read -t 0 x; echo $?; } report 142 instead of the expected successful poll.

Useful? React with 👍 / 👎.

Comment thread builtins/read/read.go
Comment on lines +236 to +240
info, err := f.Stat()
if err != nil {
return false
}
return info.Mode()&os.ModeCharDevice != 0
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P3 Badge Detect actual terminals for read prompts

This treats any character device as a TTY, so redirected stdin from non-terminal devices such as /dev/null will still emit the -p prompt. Bash suppresses prompts for non-interactive redirected input (read -p PROMPT x </dev/null produces no prompt), so this should use an actual terminal check such as isatty/term.IsTerminal rather than os.ModeCharDevice alone.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant