Minishell

C Shell Parsing Linux

Introduction

The Minishell project is a comprehensive reimplementation of a core component of the Unix environment: the command-line interpreter (shell). Replicating many of the fundamental features of Bash shell, this project serves as a deep dive into system programming, demanding meticulous attention to process management, effective I/O redirection, and complex command line parsing.

It was completed during the 42 common core, in collaboration with jrandet.

At its core, a shell operates in a continuous execution loop, defined by three main phases:

1. Waiting: The shell displays a prompt and awaits user command input.
2. Receiving: The command-line input is captured as a raw string.
3. Processing: The input is then analyzed through a precise, multi-stage pipeline—from parsing to final execution—demonstrating a practical understanding of how operating systems manage user commands and processes.

Key Challenges and Features

During this project, we were asked to:

Recode essential Built-in Commands (echo, cd, pwd, env, export, unset, exit).
Handle complex quotes and variable expansion (or why "$USER" and '$USER' behave differently).
Implement I/O redirections and piping to control data flow between commands.
Handle recursive wildcard expansion (*) for pattern matching in file names.
Handle Signals (Ctrl+C, Ctrl+D and Ctrl+\).

Command Line Parsing

The initial and most critical phase of the shell is transforming the raw input string into a structured, executable format.

Initial Tokenization and Testing Blueprint

Before implementing the parser, I needed a complete understanding of how Bash tokenizes complex command lines. I created the visualization below as a rigorous testing blueprint in Bash. This systematic analysis of distinct command scenarios served as the crucial pre-development stage.

Tokenization and Structural Mapping

Diagram showing various commands broken down into structured components: bin, arg, input redir, file descriptors, and pipe.

Visual representation of different command-line inputs are decomposition, done with Obsidian's canva plug-in.

Data Structure Choice

Instead of a complex Abstract Syntax Tree (AST), we chose an array of command structures. This array-based approach significantly simplifies pipeline execution, as each element directly maps to a stage in the command chain (e.g., cmd1 | cmd2), making process forking and piping straightforward.

The Central Command Structure (`t_cmd`)

The entire pipeline is built around the primary command structure, which encapsulates all necessary information—binaries to execute a single command.

		typedef struct s_cmd
		{
			char		*command;	   // Binary name (e.g., "ls" or "echo")
			char		*path;		  // Full binary path (e.g., "/bin/ls")
			int		 type;		   // Command type Enum (USER, BUILTIN, or INVALID)
			char		**args;		 // Argument list (argv equivalent)
			int		 arg_amount;	 // Number of arguments in the command line
			t_redir	 *redir;		 // I/O redirection (path and type)
			int		 redir_amount;   // Total number of redirections
			int		 heredoc_amount; // Number of heredoc redirections
			pid_t	   pid;			// Process ID, used during execution
			t_pipefd	*pipe_in;	   // int fdes[2] for input pipe redirection
			t_pipefd	*pipe_out;	  // int fdes[2] for output pipe redirection
			bool		is_directory;   // Check if command is a directory (error handling)
			int		 error_access;   // Error number if any access issue occurs
		} t_cmd;

Parsing Examples in Action

The array-of-structures model proves its efficiency when handling complex command lines that involve multiple pipes (|) and redirections (<, >).

Example 1: Long Pipeline

ls -l | grep "config" | tee configs.txt | wc -l

Command 1: ls
Path: /usr/bin/ls
Args: ["ls", "-l"]
Total Args: 2 | Total Redirections: 0

Command 2: grep
Path: /usr/bin/grep
Args: ["grep", "config"]
Total Args: 2 | Total Redirections: 0

Command 3: tee
Path: /usr/bin/tee
Args: ["tee", "configs.txt"]
Total Args: 2 | Total Redirections: 0

Command 4: wc
Path: /usr/bin/wc
Args: ["wc", "-l"]
Total Args: 2 | Total Redirections: 0

Each command is an individual t_cmd structure. The shell's execution automatically handles the necessary pipes to connect their input/output streams.

Example 2: Redirection & Pipes

< infile.txt grep a1 | wc -w > outfile.txt

Command 1: grep
Path: /usr/bin/grep
Args: ["grep", "a1"]
Total Args: 2
Redir[0]: infile.txt (IN REDIR)

Command 2: wc
Path: /usr/bin/wc
Args: ["wc", "-w"]
Total Args: 2
Redir[0]: outfile.txt (OUT REDIR)

The initial command reads from a file, the output pipes to the second command, and the second command's final output is redirected to a new file.

Quote and Variable Expansion

One of the subtle complexities of shell programming is mastering the rules for single quotes ('), double quotes ("), and variable expansion ($).

Our core challenge was ensuring the shell behaves exactly like Bash:

Double Quotes (" "): Allow environment variables (like $PATH or $USER) to be replaced (expanded) with their actual values.
Single Quotes (' '): Treat everything inside literally. No expansion is performed. The dollar sign ($) is treated as a regular character.

Minishell vs. Bash Comparison

Comparison of Minishell and Bash output for a command containing single and double quotes.

A side-by-side comparison of a complex echo command executed in Minishell Versus Bash.

The Execution Phase: Fork, Pipe, and Execve

This phase is responsible for correctly running user-defined binaries and built-in commands, managing I/O redirection, and handling pipelines.

Process Management

To run any external command (a non-built-in like /bin/ls), the shell uses a two-step approach:

fork(): The parent shell process calls fork() to create an exact copy of itself (the child process) in order to create a new process for the command.
execve(): Inside the child process, execve() is called. This function replaces the running program, hence the necessity of fork().

// char *cmd.path = "/bin/ls"
// char **cmd.args = {"ls", "-l", NULL}
// char **env = {"PATH=...", "USER=...", NULL}

execve(cmd.path, cmd.args, env);

The parent shell then waits for the child process to terminate, retrieves its exit status that will be expandable using "$?".

Handling Pipelines with pipe()

Pipelines (cmd1 | cmd2) introduce a layer of complexity known as Inter-Process Communication (IPC). The shell uses the pipe() system call to create a connection between the standard output (stdout) of the first command and the standard input (stdin) of the second.

For a sequence of commands, the process involves:

Before execve() is called in the child process, the file descriptors for stdin and/or stdout are redirected using dup2().
For the first command in the pipe, stdout is pointed to the write-end of the pipe.
For the second command, stdin is pointed to the read-end of the pipe.
The child process executes its command, and any output automatically flows through the pipe to the next process's input, completing the chain.

This meticulous orchestration of fork(), pipe(), and file descriptor manipulation is crucial for the shell's ability to run complex chained commands efficiently.