
add KillError.NoSuchProcess #16816

Closed
wants to merge 1 commit

Conversation

drakbar commented Aug 14, 2023

WHAT

Adds NoSuchProcess to the KillError error set in std.os.kill, and returns that error when errno is set to ESRCH.

WHY

Returning this error from os.kill is useful for determining if a process id is valid.

# https://pubs.opengroup.org/onlinepubs/9699919799/
If sig is 0 (the null signal), error checking is performed but no signal is actually sent.
The null signal can be used to check the validity of pid.

This facilitates a nice set of features in binaries that are meant to be executed as a daemon or background process. You can cache the daemon's pid to some type of storage at start up, and implement conditional branching with something such as cli arguments. A new process wants to conditionally execute depending on the daemon's current state (i.e. start, stop, restart, status, ...). The process can then reach out to the storage location, pull up the pid of a potentially running daemon process, and check with the operating system whether that pid is still in use. Sending signals is an asynchronous operation, so super tight coordination is not reliable; however, this makes for a relatively simple low-frequency heartbeat mechanism that can come in really handy.

HOW

const std = @import("std");
const os = std.os;
const debug = std.debug;

const s_fmt = "{s:<18} : {s}\n";
const d_fmt = "{s:<18} : {d}\n";
const p_fmt = "pid({d}) exists : {any}\n";

// https://pubs.opengroup.org/onlinepubs/9699919799/
// If sig is 0 (the null signal), error checking is performed but no signal is actually sent. 
// The null signal can be used to check the validity of pid. 
fn pokeProcess(pid: os.pid_t) bool {
    if (os.kill(pid, 0)) |_| {
        return true;
    } else |err| {
        // this just demonstrates that the error makes it back; a real
        // implementation should match on os.KillError.NoSuchProcess to
        // verify that this error came from ESRCH
        debug.print(s_fmt, .{ "error type", @errorName(err) });
        return false;
    }
}

test "os_kill_no_process" {
    const pid = os.linux.getpid();
    debug.print("\n", .{});
    debug.print(p_fmt, .{ pid, pokeProcess(pid) });

    if (os.fork()) |c_pid| {
        if (os.linux.getpid() == pid) {
            debug.print(d_fmt, .{ "c_pid", c_pid });
            os.exit(0); // parent process early exit
        } else {
            debug.print(d_fmt, .{ "pid", pid });
        }
    } else |err| {
        debug.print("{s}\n", .{@errorName(err)});
        os.exit(1);
    }
    // because of the asynchronous nature of signals,
    // wait for the os to fully clean up the parent process
    std.time.sleep(10000); // nanoseconds
    debug.print(d_fmt, .{ "killing pid", pid });

    debug.print(p_fmt, .{ pid, pokeProcess(pid) });
    os.exit(0);
}

rohlem (Contributor) commented Aug 14, 2023

The process can then reach out to the storage location and pull up a pid of a potentially running daemon process and check with the operating system if that pid is still in use.

Correct me if I'm wrong, but if the parent process cannot reserve that process id from being re-used by a completely unrelated process, this mechanism is unreliable: If the daemon process has quit and an unrelated process has started up, the OS may have reassigned the same process id. This would make the parent process think that the daemon is still running, when it's actually now a completely different program.

I imagine that is the reason that ESRCH was classified as "always a race condition" - because there's no (platform-agnostic) reliable way of circumventing this shortcoming.

drakbar (Author) commented Aug 14, 2023

Correct me if I'm wrong, but if the parent process cannot reserve that process id from being re-used by a completely unrelated process, this mechanism is unreliable: If the daemon process has quit and an unrelated process has started up, the OS may have reassigned the same process id. This would make the parent process think that the daemon is still running, when it's actually now a completely different program.

I am not here to correct anyone or debate the efficacy of a technique that programmer B thinks is unreliable and programmer A should not use. But to address your criticisms: the maximum pid my system can assign can be found at /proc/sys/kernel/pid_max. (I don't know where that would be on Windows or Apple, nor do I care.) On my Linux system that is 2²², i.e. 4,194,304. Once the max has been reached it wraps around and starts again from the bottom. The max is configurable to any positive integer less than 2²². So sure, in your scenario you can make that happen. You can lower the number of pids available and then generate large contention for that limited resource. You can by all means create the conditions for that hypothetical, but then you have used so many foot guns that whether or not that server hosting pictures of your cat 🐈‍⬛ on the internet is running has been superseded by other things in the priority queue.

If the programmer is that paranoid, or if the service is really that critical, then yeah, don't use that method; however, there are additional steps that could be taken to verify that PID X is a running process of the TYPE daemon, right 🤔?

Regardless of my toy example of how I would like to use this side effect and why I wrote this two-line fix, the real issue, and I thought it was obvious, is how trivial it is to hit the unreachable block. Like here we have an exposed interface (os.kill) which mirrors the C interface. So I can shove any value in for pid and sig, all while the code police 👮 sit over on the sideline wagging their finger at me saying "HEY! that's unreliable". And as the shouted words fade with a doppler-like effect, I descend into the existential depths of my mind and wonder quietly, as if it were a whisper 😨, "can I ever know at the moment of execution if a process id is valid?". A moment of dread overcomes me as the stark realization has me questioning everything; descending deeper and deeper into nihilism 😰. And then, as I teeter at the edge of the endless void looking into the infinite precipice, I feel my material form deconstructing and being pulled into the vast nothingness. And then from an infinitesimal point there is a 💥 flash! A surge of electrical discharge ⚡ explodes from the base of my neck, rocketing upward 🚀, then slamming into my occipital bone 💨💀. Crashing down like a tsunami 🌊 and flooding out through the rest of my cranium 🧠. With a disheveled look I stutter out through grimacing teeth 😬, "EUR..E......KA!", as a wave of ecstasy overwhelms me in a cogito ergo sum moment. It is quickly followed by a frantic need to capture that revelation. Typing intensifies. My fingers fly across the keys as fast as they possibly can go, hunched over and staring at them as they go currrrchunk, as if examining them would somehow improve the accuracy of the code being written. And then I look up at the screen, sweat streaming down my face, and behold the singular case needed to throw at my earlier proposition: os.kill(os.linux.getpid(), 9)

Sorry, I have been working on creative writing and so I was capturing a moment of inspiration while the ole brain juices were flowing. Anyways, I hope the wall of text above illustrates my point of reductio ad absurdum. The real issue here is not how the user's implementation stitches the features together. It's the fact that the std library wraps C code that is capable of handling the underlying errors and notifying the user safely. The standard library intercepts this graceful solution and redirects the instruction pointer to a label that eventually causes a panic, and now NO ONE can see the cute picture of Mr. Cheese sitting in the box that I just left on the floor next to the trash can. The issue is that we are providing to an end user a wrapper around code, and that wrapper causes the code to perform worse than the original implementation. We are a waiter in a restaurant taking a dump in the customer's food on the way out from the kitchen, and then telling them it's their fault for ordering the poo poo platter. As far as platform-agnosticism goes, perhaps there should be an implementation of kill down inside the linux namespace? I am just having issues with why the interface is wide open if we intend for users to not use every feature that comes with said interface, and insist on cutting their seat-belts as we take them on that roller coaster ride.

TL;DR:

if (os.kill(my_pid, my_sig)) |_| {
    // ... do my cool programmer stuff in here
} else |err| {
    // this block is unreachable and an unavoidable program crash occurs
    // if there is no process running with the pid of `my_pid`
    debug.print(s_fmt, .{ "error type", @errorName(err) });
}

meanwhile...

#include <stdio.h>
#include <errno.h>
#include <signal.h>

int main(void) {
    pid_t my_pid = 123456; /* placeholder: some pid that may not exist */
    int my_sig = 0;        /* the null signal */
    int rc = kill(my_pid, my_sig);
    printf("return code %d\n", rc);

    if (rc == -1) {
        if (errno == ESRCH) {
            printf("No such process [ESRCH %d]\n", errno);
            // https://www.youtube.com/watch?v=OynKbfEGV_A
        }
    } else {
        // ... do my cool programmer stuff in here
    }
}

rohlem (Contributor) commented Aug 14, 2023

I am not here to correct anyone

Please do tell me if I'm wrong about something though, I like learning things.

To the more general point, I imagine we could add an error.DetectedRaceCondition, configurable via an @import("root").std_options entry, so that a program could decide between unreachable and an error in cases such as this one.
There's been more discussions on the use of unreachable in several issues; some of them here: #6389 , #6925 , #10776 , #15664 , #15607 .
Just FYI, std is currently planned to be and remain opinionated about how to write programs. If your opinions differ (as mine sometimes do) you might have to copy it into / create your own library/package to maintain independently. (That said, I for one don't mind hearing those differing opinions and reading walls of text.)

matu3ba (Contributor) commented Aug 15, 2023

This facilitates a nice set of features in binaries that are meant to be executed as a daemon or background process. You can cache the daemon's pid to some type of storage at start up, and implement conditional branching with something such as cli arguments.

That is what pid 1 is for and why pid 1 has (daemonized) supervision capabilities. What is your specific use case, say for portability, to not use pid 1?
My guess is that the kernel would effectively need a supervision system or always-unique process ids, and I don't know of an API to get those in a race-free manner from the kernel, meaning without procfs.

Correct me if I'm wrong, but if the parent process cannot reserve that process id from being re-used by a completely unrelated process, this mechanism is unreliable: If the daemon process has quit and an unrelated process has started up, the OS may have reassigned the same process id. This would make the parent process think that the daemon is still running, when it's actually now a completely different program.

At least on Linux the proper way to do it is to set the uids beforehand https://lwn.net/Articles/754980/:

For a process to have permission to send a signal, it must either be privileged (under Linux: have the CAP_KILL capability in the user namespace of the target process), or the real or effective user ID of the sending process must equal the real or saved set-user-ID of the target process.
[..]
It turns out that we can. If we create a new reaper_uid, and set its <euid, ruid, suid> to <target_uid, reaper_uid, X> (where X can be anything as long as it's not target_uid), then:

    The reaper process can kill the target process, since its effective UID is equal to the target process's real UID
    But the target process can't kill the reaper, since its real and effective UIDs are different than the real and saved UIDs of the reaper process.

So the following code will safely kill all processes of target_uid in a race-free way:

    setresuid(reaper_uid, target_uid, reaper_uid);
    kill(-1, 9);

Note that this reaper_uid must have no other running processes when we call kill(), or they will be killed as well. In practice this means either setting aside a single reaper_uid (and using a lock to make sure only one reaper process runs at a time) or having a separate reaper_uid per target_uid.

Proof-of-concept code for both the rogue process and the reaper process can be found in [this GitHub repository](https://github.com/gwd/runner-reaper).

The [setresuid() system call](http://man7.org/linux/man-pages/man2/setresuid.2.html) is implemented by both Linux and FreeBSD. It is not currently implemented by NetBSD, but implementing it seems like a pretty straightforward exercise (and certainly a lot simpler than implementing seccomp()). NetBSD does implement RLIMIT_NPROC, which should also be helpful at preventing our process from executing fork().

So one needs semi-awful hacks to reliably kill (rogue) processes.

drakbar (Author) commented Aug 15, 2023

So I am wrong. std.os.linux does have a kill function, and it works just like the C version. I am not sure how I didn't catch it the first time around; my LSP didn't pick it up for some reason.

@matu3ba I have used a variation of that technique in containers where restarting the container is not preferable to just attempting to manage a process life cycle. The article you linked is indeed very interesting, clearly their need for a reaper comes at the problem from a security perspective, my needs are not that serious. However, I did feel inspired by that article. Thanks for sharing!

Just FYI, std is currently planned to be and remain opinionated about how to write programs. If your opinions differ (as mine sometimes do) you might have to copy it into / create your own library/package to maintain independently. (That said, I for one don't mind hearing those differing opinions and reading walls of text.)

Oh believe me, I am fully aware of that option and quite frequently take that approach. However, that is typically done with a third-party library more so than a std lib. There is nothing wrong with being opinionated, everyone has them, and they all stink, er... wait, is that something else? Anyways, while it's completely fine to be opinionated, std's opinion is in a way bound by its responsibility. If the goal is to proliferate the language and to gain adoption by an ever-growing group of developers, the constraints of hitting that type of target will most likely force it into a more generic / traditional position that facilitates that type of success. It will need to be easy to use and reliable. I am not saying that it currently has a systemic issue of doing the things that I pointed out in my post, but if the std library becomes synonymous with usurpation of control flow directly into forced runtime crashes, then it's my opinion that users will have a degraded opinion of the product as a whole.

I am gonna close this PR as std.os.linux meets my needs; however, I would like to ask, as this is like my first week in Zig: is there somewhere that provides a good breakdown of how the build system works? I would like to create a more complex project that creates / brings in outside libraries. Also, what's going on with std.fs?

Like, you need a Dir object to open a File? While that makes some sense, I would think that an open / close method would be scoped to a File. Also the creation of the Dir struct seems kinda cumbersome, and which reference do I close when I call close on one?

var d  = fs.Dir{ .fd = undefined };
var fp = try d.open("/path/to/file", .{ .access_sub_paths = true });
// which instance is the one that is open here?
// and which do I need to close, both?
// I assume that defer close is safe even if you call close prior to the stack popping

if (fp.createFile("name of file", flags)) |file| {
    //...
}

// what happens when I want to create a file in another directory?
// if I want to reuse the handle, do I need to close the directory first and then make another call to open?

// I suppose I could create a whole new handle and just disregard the old one
// but what is the idiomatic approach? 

rohlem (Contributor) commented Aug 15, 2023

I would like to ask as this is like my first week in zig.

Feel free to also re-ask in one of the communities (e.g. the Zig Programming Language Discord has a dedicated help forum); usage questions are generally avoided on the issue tracker (and PRs, I assume) to not overwhelm the primary maintainers with non-progress work.

Is there somewhere that provides a good break down of how the build system works?

There's an outdated section on ziglearn that mostly goes over the output of zig init-exe. There's also a zig.news article detailing some internals of the still-WIP package manager. My approach for finding something specific is usually directly reading through the code of std.Build. You might also learn something by looking at repositories of Andrew himself (EDIT: fixed link).

Regarding your code snippet:

There is std.fs.openFileAbsolute (for now - #16736). But generally we expect most programs to use relative paths, which are relative to a Dir.
A good starting point for Dir-s is usually std.fs.cwd() - but careful: the Dir returned from there is a special sentinel on most systems and MUST NOT be close()-d. Very ugly footgun, not sure whether it's documented yet - IMO the cwd should be its own special type due to that.

var d = fs.Dir{ .fd = undefined };
var fp = try d.open("/path/to/file", .{ .access_sub_paths = true });
// which instance is the one that is open here?

As a general rule of thumb: if you directly initialize an aggregate, you know that initialization did nothing special, and so close/deinit will usually not be necessary - unless you know what it will do and want it to do that. (E.g. if you assigned a handle to it that you WANT to close using that close/deinit method.)
ESPECIALLY if you initialize it with undefined field values (and btw, var d: fs.Dir = undefined; works too), which a method cannot check for (undefined is allowed to be any value at runtime), calling close/deinit is usually a bad idea.

Also note that you can always call methods as functions on the struct: instead of initializing an undefined fs.Dir instance, you can call fs.Dir.open with first argument undefined. (Although generally you should probably look around an API for functions that don't require undefined arguments on invocation for your use case - unless you know what you're doing already.)

Another hint: There's currently no open function in std.fs as far as I can see, but assuming your code means either openFile or openDir, note how the first argument of type Dir is passed by value and not by pointer *Dir.
That means the object cannot be modified by the function call, making it impossible for self to change any of its fields (which you previously set to undefined).
That in turn makes it highly unlikely for self to now represent an opened handle that needs to be closed after the call to the function.
(Of course, this is not directly visible from your callsite, because d is var here - if you defined it const you could guarantee no information will be stored there.)
(Also I'm not defending the API or anything, these are just hints meant to be helpful / clear up semantics.)

// what happens when I want to create a file in a another directory?
// if I want to reused the handle, do I need to close the directory first and then another call to open?

I think opening new handles is the way to go, but not sure I fully understand the scenario.
There is also Dir.makePath and Dir.makeOpenPath - maybe that suffices for your use case and you don't need the directory handle to stick around.
(And since createFileAbsolute just calls cwd().createFile, createFile in turn must also support subdirectory and absolute paths, so you could also makePath followed by createFile, though that's a bit less secure because the dir could technically be deleted in between if you didn't have it open.)

Of course you can reuse the variables for handles:

var current_dir = openStartDir();
defer current_dir.close(); //again don't close cwd() though (but explicitly opened `"."` should be okay afaik)
{
  var swap = try openNextDir(current_dir);
  current_dir.close();
  current_dir = swap;
}

drakbar closed this Aug 16, 2023