Spring20Cs361sLab5
This lab is designed to teach you a bit about programming language sandboxing.
We have uploaded a sandbox.tgz file to Canvas. Please download it and extract it somewhere in the Virtual Machine that you used for lab 4. Inside is a pypy installation that was compiled for sandboxing. The archive has no top-level directory, so you should extract it inside an empty folder. For example, you could create a folder ~/sandbox under your home directory and extract the tar file within it. If you used this folder structure, you would get something like this:
/home/name/sandbox/pypy
/home/name/sandbox/lib-python
/home/name/sandbox/lib-pypy
/home/name/sandbox/py
/home/name/sandbox/rpython
PLEASE NOTE: pypy by default no longer supports sandboxing. I personally updated pypy code in order to make this work. Do not install from source, or install the Ubuntu pypy modules.
Once unzipped, you should be able to immediately test out the interactive shell in sandbox mode. You need to change directories to the pypy/sandbox
dir. Then:
/home/name/sandbox/pypy/sandbox/$ python3 pypy_interact.py pypy3-c-sandbox
<some intro lines>
>>>>
You've started a special Python shell that is in sandbox mode. In this mode, even the filesystem is virtualized. For example, import os and try doing some file operations:
>>>> import os
>>>> os.listdir('/')
['bin', 'tmp', 'dev']
This file system is almost completely made up. Nothing matches what is on the real filesystem. What is going on here?
Well, a lot of things. The real heart of this system is pypy3-c-sandbox. This is a binary Python interpreter, similar to the normal python3. But during its compilation, it was gutted so that all syscalls are intercepted. A syscall is a call to the operating system and is required for just about any potentially "dangerous" operation, including filesystem accesses.
The pypy3-c-sandbox
interpreter, upon receiving a syscall, sends the data out to another, controlling, program. Where is that happening?
You will notice that pypy3-c-sandbox
is not being invoked directly. Rather, it is invoked as an argument of pypy_interact.py
. This python script is the controller for the sandbox. It launches pypy3-c-sandbox
and communicates with it using stdin/stdout. When pypy3-c-sandbox
encounters a syscall, it will send the call and the arguments to pypy_interact.py
. This secondary program is a policy and policy enforcement script. It decides if a syscall will be allowed AND how the syscall will behave! The default, out-of-the-box implementation does not allow opening any file for writing, and only files within /tmp
for reading. It does not even allow for changing directories! (Try, for example, os.chdir
, or try opening a file for writing).
Every single day, you trust an awful lot of code. Almost all major websites have code running in the browser in one form or another. JavaScript is everywhere.
That's incredibly dangerous. Just think about how much of a security risk it would be if navigating to a website could cause your hard disk to be deleted. We trust that browsers keep the JavaScript we download sandboxed. That means the downloaded program can't do any real damage to your computer.
One of the first languages to really incorporate this idea was Java. Java Applets were designed to be safe to run anywhere. The Java Sandbox was meant to make them innocuous, no matter what code they contained.
Unfortunately, the Java Sandbox failed. It has been deprecated and you shouldn't trust it.
But the ideas have survived and continue to be employed in JavaScript and other languages.
The goal of this lab is for you to learn how to "sandbox" python so that even if you downloaded an "evil" script from a classmate, it couldn't do any serious harm to your computer.
You've already run the sandbox in interactive mode. Now it's time to execute a script. One such file, test1.py, is already included. Let's try to run it.
/home/name/sandbox/pypy/sandbox/$ python3 pypy_interact.py pypy3-c-sandbox test1.py
What happened? You should have received some form of file-not-found error. Why? The file is literally right there... in the same directory. Why can't it find it?
Remember, as soon as the interpreter starts, ALL SYS CALLS ARE INTERCEPTED. Literally every one, including the syscalls for loading a file to execute! So, if the interpreter goes looking for a file test1.py, it has to find it in the virtual file system! Sadly, test1.py is not anywhere to be found in this virtualized tree.
How do we put it there? For our purposes, all files for our use need to be in the /tmp
directory. This directory is not the real /tmp
on your system. If you do a directory listing of both of those directories, you'll see that the virtual temp directory is empty. There's nothing in it, regardless of what's in your real /tmp
.
The pypy_interact.py
script allows you to specify a real path as the virtual /tmp
path. For convenience, we'll just make the current path the /tmp
path. Let's quickly do this in interactive mode again:
/home/name/sandbox/pypy/sandbox/$ python3 pypy_interact.py --tmp=. pypy3-c-sandbox
<introductory lines>
>>>> import os
>>>> os.listdir("/tmp")
This time, you should find it thinks the contents of /tmp are the same as /home/name/sandbox/pypy/sandbox. This is because we told the virtual file system that its /tmp directory contents come from the real path on the real system.
Now we can execute test1.py.
/home/name/sandbox/pypy/sandbox/$ python3 pypy_interact.py --tmp=. pypy3-c-sandbox test1.py
Why does it work now?! Because when pypy3-c-sandbox
starts up, the current working directory is /tmp
. The virtual /tmp
path is the same as the real path for /home/name/sandbox/pypy/sandbox
. Because test1.py
is in this path, when pypy3-c-sandbox
goes looking for it, it can find it. Now it can execute the script.
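To make the mapping concrete, here is a toy model of how a virtual path under /tmp could be translated to a real path. This is only a sketch: the function name virtual_to_real and its parameters are made up for illustration, and the real controller also maps /bin and /dev, which this model ignores.

```python
import posixpath

VIRTUAL_TMP = "/tmp"  # where the sandbox believes its files live


def virtual_to_real(vpath, tmp_real, virtual_cwd=VIRTUAL_TMP):
    """Translate a virtual path to a real one, or return None if unmapped.

    Hypothetical helper: it only models the /tmp mapping set by --tmp.
    """
    # Relative paths are resolved against the virtual working directory,
    # and ".." components are collapsed before any comparison is made.
    absolute = posixpath.normpath(posixpath.join(virtual_cwd, vpath))
    if absolute == VIRTUAL_TMP:
        return tmp_real
    if absolute.startswith(VIRTUAL_TMP + "/"):
        return posixpath.join(tmp_real, absolute[len(VIRTUAL_TMP) + 1:])
    return None  # outside /tmp: nothing on the real disk backs it here


print(virtual_to_real("test1.py", "/home/name/sandbox/pypy/sandbox"))
print(virtual_to_real("/tmp/../etc/passwd", "/home/name/sandbox/pypy/sandbox"))
```

The first call resolves test1.py relative to the virtual working directory /tmp, which is why the script is found once --tmp=. is given. The second call escapes /tmp after normalization, so the model refuses to map it.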
Why do we even have a virtual file system? To prevent an "evil" script from being able to call os.system("rm -rf /")
WARNING: Although the sandboxed interpreter can do a heck of a lot of "real" Python, it does have some significant limitations. First of all, it won't run ANY library that has a non-Python-based implementation. Some Python libraries aren't implemented in Python; they have a thin Python shell that wraps a binary shared object (.so) file. The reason many libraries are implemented this way is speed; the binary code can execute far faster than regular Python. But the sandbox can't have it. If it allowed this, the sandbox could be bypassed by an evil .so-based library. So don't get too aggressive with trying to run "regular" Python scripts.
In this lab, you are going to create a new policy/controller for the sandbox that will permit three additional operations. These operations are:
- Change directory
- Create files for writing
- Permit sockets for a "same origin"
The first assignment is super easy, but it will permit us to get our feet wet. And it will help us simply learn how to create a controller file at all.
Currently, the controller we're using is called pypy_interact.py
. As a reminder, the sandbox interpreter pypy3-c-sandbox
is just like the python3
interpreter, except that ALL SYS CALLS are redirected to the controlling program. So, pypy_interact.py
is determining how to handle all of those calls. Let's see how this is done.
If you look in the pypy_interact.py
code, you will see this:
class PyPySandboxedProc(VirtualizedSandboxedProc):
This is the class that handles all of the input from pypy3-c-sandbox. There are a few really critical pieces in this class itself, plus a bunch of handlers in the parent class VirtualizedSandboxedProc. Here are the things to look for in this file:
Dir({
    'bin': Dir({
        'pypy3-c': RealFile(self.executable),#, mode=0111),
        'lib-python': RealDir(os.path.join(libroot, 'lib-python'),
                              exclude=exclude),
        'lib_pypy': RealDir(os.path.join(libroot, 'lib_pypy'),
                            exclude=exclude),
        }),
    'tmp': tmpdirnode,
    'dev': Dir({
        'urandom': RealFile("/dev/urandom")})
    })
This is an object returned by a function build_virtual_root
. This function is what is creating a virtual file system! I won't tell you all of how it works; some of this you should explore! But this is where the magic starts.
(PS, do you see urandom
? See how it is mapped to RealFile("/dev/urandom")
? This is a mapping from a "file" in the virtual world to a file in the real world. I added this into the virtual file system because Python's random isn't working. You have to access /dev/urandom
in order to get random numbers. If, in python, you open /dev/urandom
as a file and call read, you will get random numbers. This is an example of how you "plumb" the virtual file system and virtual sandbox to provide the functionality that you want).
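As a small illustration of that last point, the following sketch reads a few bytes from /dev/urandom and turns them into an integer. The helper name random_int is made up for this example; it assumes /dev/urandom is reachable, which inside the sandbox depends on the mapping shown above.

```python
def random_int(num_bytes=8, device="/dev/urandom"):
    """Return a non-negative random integer built from num_bytes bytes."""
    # Open the device like an ordinary binary file and read from it.
    with open(device, "rb") as source:
        data = source.read(num_bytes)
    # Interpret the raw bytes as one big-endian unsigned integer.
    return int.from_bytes(data, "big")


print(random_int(4))  # some value in the range 0 .. 2**32 - 1
```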
One other key element is:
virtual_cwd = '/tmp'
This is the current working directory within the virtual filesystem when the sandbox starts.
Now, onto even meatier handling stuff. How do we handle incoming syscalls? The sandbox system dispatches them to handler functions within the sandbox class. These are currently in the parent classes. Let's dig those out. VirtualizedSandboxedProc is defined in rpython/translator/sandbox/sandlib.py. Other classes in the hierarchy live there as well, including SimpleIOSandboxedProc and SandboxedProc. The key thing to look for in these are the do_ll_os__ functions. The "ll" stands for "low level". So these are "do low-level os" functions. Each is mapped to a different syscall or syscall-related Python function. Some are obvious and some aren't. For the most part, you'll only need to experiment with some obvious ones.
For example, take a look at this function:
def do_ll_os__ll_os_getcwd(self):
    return self.virtual_cwd
This function is the handler for python's os.getcwd()
. Remember virtual_cwd
above?
And this leads us to our first part of the assignment. The current implementation has no support for os.chdir
(change directory). If you try it in the sandbox shell, you'll get a runtime error. You need to add this.
How?
Well, first, to keep things clean, let's create a new pypy_interact.py
. Let's call it lab5_interact.py
. This represents our controller for the sandbox for this lab. You don't need to start from scratch, literally just copy it: cp pypy_interact.py lab5_interact.py
.
Next, let's create a local sandlib.py file. In this file, we'll create our own parent class for our new controller. It will inherit from the VirtualizedSandboxedProc that the current controller uses. In short, we'll create a new class in the hierarchy in which we can stick all of our modifications.
So, in your local sandlib.py file, create the following class:
from rpython.translator.sandbox.sandlib import VirtualizedSandboxedProc

class Lab5SandboxedProc(VirtualizedSandboxedProc):
    pass
Yes, it's an empty class that does nothing, but let's test it out first. Inside lab5_interact.py, import Lab5SandboxedProc from your local sandlib.py and change the proc class to inherit from this class instead:
class PyPySandboxedProc(Lab5SandboxedProc):
Now, test this by running the sandbox with the new controller:
python3 lab5_interact.py pypy3-c-sandbox
It should work exactly like the old one. Now, let's change our version to support changing directories. The first thing we need to do is add a "handler". All handlers in the sandbox are automatically inferred so it's simply a matter of knowing what the incoming syscall looks like. If you look in the original sandlib (rpython/translator/sandbox/sandlib.py
), you will find this method:
def handle_message(self, fnname, *args):
    # fnname had better be convertable to utf-8...
    fnname = fnname.decode()
    if '__' in fnname:
        raise ValueError("unsafe fnname")
    try:
        handler = getattr(self, 'do_' + fnname.replace('.', '__'))
    except AttributeError:
        raise RuntimeError("no handler for this function")
    resulttype = getattr(handler, 'resulttype', None)
    return handler(*args), resulttype
This is where the incoming data from the sandbox is processed. The method automatically looks up a Python handler by prefixing the incoming function name with do_ and replacing each '.' with '__'. So, if you didn't know what the handler was for chdir, for example, you could add a print statement to this method, run the interpreter, and call chdir. You're welcome to try this, but be warned, this will print out the syscall name of EVERY SYSCALL being called. So be prepared for a lot of output.
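The name lookup can be tried in isolation with a toy class. This is a simplified sketch and not the real sandlib code: ToyController is a made-up name, the incoming name "ll_os.ll_os_getcwd" is inferred from the handler name by reversing the replacement, and the result type handling is left out.

```python
class ToyController:
    """Stand-in that mimics only the handler lookup of handle_message."""

    def do_ll_os__ll_os_getcwd(self):
        return "/tmp"

    def handle_message(self, fnname, *args):
        if "__" in fnname:
            raise ValueError("unsafe fnname")
        # "ll_os.ll_os_getcwd" becomes "do_ll_os__ll_os_getcwd"
        handler_name = "do_" + fnname.replace(".", "__")
        handler = getattr(self, handler_name, None)
        if handler is None:
            raise RuntimeError("no handler for " + handler_name)
        return handler(*args)


controller = ToyController()
print(controller.handle_message("ll_os.ll_os_getcwd"))  # /tmp
```

Because the lookup is by name, adding support for a new call only requires defining a method with the matching do_ name in the controller class.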
I'll tell you, however, that the handler you want is do_ll_os__ll_os_chdir(). I'll even give you the code for a "basic" version that partially works. In your Lab5SandboxedProc class, start with the following:
def do_ll_os__ll_os_chdir(self, path):
    self.virtual_cwd = path.decode()
Remember virtual_cwd? This is the internal state that controls our current working directory within the virtual filesystem. This handler simply changes that to whatever string is passed in by the user (it comes in as bytes, so we have to decode it to a string).
This handler does work, but only minimally. Try this:
python3 lab5_interact.py pypy3-c-sandbox
>>>> import os
>>>> os.chdir("/")
>>>> os.listdir(".")
['bin', 'tmp', 'dev']
Not bad, not bad. But it has a couple of problems. First of all, relative paths don't work at all. Nor is there any error checking. That's what you need to figure out. Fortunately, there is already a function that does a lot of this: translate_path(). This function is used for opening files. See if you can adapt it to give you an absolute path (when a relative path is given) and tell you whether the path exists before you set virtual_cwd.
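The two missing pieces, resolving relative paths and checking that the target exists, can be sketched independently of the sandbox. The example below does not use translate_path() or the real vfs classes; TOY_ROOT and resolve_directory are hypothetical stand-ins meant only to show the logic.

```python
import errno
import posixpath

# Hypothetical stand-in for the virtual tree: dicts are directories,
# None marks a regular file.
TOY_ROOT = {"bin": {}, "tmp": {"sub": {}, "test1.py": None}, "dev": {}}


def resolve_directory(virtual_cwd, path, root=TOY_ROOT):
    """Return the absolute virtual path of a directory, or raise OSError."""
    # join() leaves absolute paths alone and anchors relative ones at the
    # current directory; normpath() then collapses "." and "..".
    absolute = posixpath.normpath(posixpath.join(virtual_cwd, path))
    absolute = "/" + absolute.lstrip("/")  # normpath may keep a leading "//"
    node = root
    for part in absolute.split("/"):
        if not part:
            continue
        if not isinstance(node, dict) or part not in node:
            raise OSError(errno.ENOENT, "no such file or directory", absolute)
        node = node[part]
    if not isinstance(node, dict):
        raise OSError(errno.ENOTDIR, "not a directory", absolute)
    return absolute


print(resolve_directory("/tmp", ".."))   # /
print(resolve_directory("/", "tmp/sub"))  # /tmp/sub
```

A chdir handler built along these lines would assign the returned path to virtual_cwd only when no error is raised.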
Speaking of open
, now we'll talk about the three handlers that will help us with both of the next parts of the lab. Open, read, and write:
def do_ll_os__ll_os_open(self, vpathname, flags, mode):
def do_ll_os__ll_os_read(self, fd, size):
def do_ll_os__ll_os_write(self, fd, data):
Currently, the system you have only permits reading. Well, writing is permitted to fd's 1 and 2. Can you figure out why?
For part 2 of this lab, you need to permit writing to the tmp (virtual) directory only. This only needs to work when the tmp
directory in the virtual fs is tied to a real directory in the real fs. You need to be able to write to existing files and create non-existent files when you write. You do NOT need to support creating sub directories within tmp.
You will need to look at the file rpython/translator/sandbox/vfs.py
to see how the system creates a class for handling readable files. You will need to create your own class for a writeable file.
The hardest part will be that the incoming flags
parameter can be hard to figure out. For the purposes of this lab, you can accept the following pseudo code as how to convert from flags to an open mode for the file:
if os.O_RDONLY in flags:
    return "rb"
elif os.O_WRONLY in flags and os.O_TRUNC in flags:
    return "wb"
elif os.O_WRONLY in flags:
    return "ab"
elif os.O_RDWR in flags and os.O_TRUNC in flags:
    return "wb+"
elif os.O_RDWR in flags:
    return "ab+"
return "rb"
Don't worry about opening in text mode. Just assume everything is in binary. Please note, this is PSEUDO CODE. You can't actually use "in flags". flags is a bitfield, so you'll have to figure out how to check it with bitwise operations.
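One possible way to turn the pseudo code into real bit checks is sketched below. The function name flags_to_mode is made up. One subtlety worth knowing: on typical systems os.O_RDONLY is 0, so it cannot be tested with a plain "&". Instead, the access bits are masked out and compared as a whole.

```python
import os

# Mask covering the access-mode bits (read-only, write-only, read-write).
ACCESS_MASK = os.O_RDONLY | os.O_WRONLY | os.O_RDWR


def flags_to_mode(flags):
    """Map os.open()-style flags to a binary mode string for open()."""
    access = flags & ACCESS_MASK
    truncate = bool(flags & os.O_TRUNC)
    if access == os.O_WRONLY:
        return "wb" if truncate else "ab"
    if access == os.O_RDWR:
        return "wb+" if truncate else "ab+"
    return "rb"  # read-only, and the fallback from the pseudo code


print(flags_to_mode(os.O_WRONLY | os.O_CREAT | os.O_TRUNC))  # wb
print(flags_to_mode(os.O_RDONLY))                            # rb
```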
Again, to be clear, in the existing vfs.py
file, you'll see a RealFile
class. In your local directory, either in the local sandlib.py
or in a local vfs.py
file, you should create your own class (perhaps ReadWriteFile
) that implements writing as well as reading.
Make sure that a sandbox user can only open a file for writing within the temp directory.
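One way to double-check this on the real side of the mapping is to compare resolved paths, so that ".." components or symlinks cannot lead outside the directory backing /tmp. The sketch below is an illustration only; is_inside is a hypothetical helper and is not part of the provided sandbox code.

```python
import os
import tempfile


def is_inside(real_dir, candidate):
    """True if candidate resolves to a location strictly inside real_dir."""
    # realpath() collapses ".." and follows symlinks on both sides.
    base = os.path.realpath(real_dir)
    target = os.path.realpath(candidate)
    return target != base and os.path.commonpath([base, target]) == base


with tempfile.TemporaryDirectory() as tmp_real:
    print(is_inside(tmp_real, os.path.join(tmp_real, "notes.txt")))         # True
    print(is_inside(tmp_real, os.path.join(tmp_real, "..", "escape.txt")))  # False
```

A check like this would run before the file is opened for writing, and only for paths that came from the virtual /tmp directory.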
Now, for the final assignment, sockets. If you try to open sockets the normal way, it won't work. We're going to expose the socket interface in the virtual system as a file!
If you look in the original sandlib file (under rpython/translator/sandbox) you'll see most of the code is already written for you in the unused VirtualizedSocketProc. The system treats calls to open a file prefixed with tcp:// as a socket call! It opens the file, returns a file descriptor, and then intercepts the read and write calls. If the descriptor is a socket, it sends to or receives from the socket, rather than a file.
This code (not written by me) isn't completely functional. But all that should be required is converting from bytes to strings on the input. It is more-or-less correct regardless and you can use it as a guide.
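To illustrate the bytes-to-string issue, here is a sketch of how a tcp:// name might be decoded and split. It assumes the name looks like tcp://host:port, which the existing code should be checked against; parse_tcp_name is a made-up helper and not a function from sandlib.

```python
TCP_PREFIX = "tcp://"


def parse_tcp_name(name):
    """Return (host, port) for a tcp:// name, or None for ordinary paths."""
    if isinstance(name, bytes):
        name = name.decode()  # the sandbox hands paths over as bytes
    if not name.startswith(TCP_PREFIX):
        return None
    # Split on the last ":" so the port is whatever follows it.
    host, separator, port = name[len(TCP_PREFIX):].rpartition(":")
    if not separator or not host or not port.isdigit():
        raise ValueError("expected tcp://host:port, got %r" % name)
    return host, int(port)


print(parse_tcp_name(b"tcp://example.com:80"))  # ('example.com', 80)
print(parse_tcp_name(b"/tmp/test1.py"))         # None
```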
The final requirement is to only permit outbound socket connections to "same origin". As these files aren't actually going to be downloaded, we will express the "origin" as a command line parameter.
Within your lab5_interact.py, please add a command line parameter --origin. You will see how command line options are handled in main(). It should not be hard to add a new one. You will also need to add a parameter to the Lab5SandboxedProc class that can be used to pass this into the instance. In the open function, you should not permit outbound connections to anywhere except as prefixed by this origin.
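The shape of this requirement can be sketched as follows. The example assumes getopt-style option parsing, which may differ from what main() actually does, and both function names are invented for illustration. It also assumes the origin is given as host:port. A plain prefix match on a bare host name would also accept longer host names that merely start with it, so a real implementation may want a stricter comparison.

```python
from getopt import getopt


def parse_origin(argv):
    """Extract a hypothetical --origin option; return (origin, other_args)."""
    options, arguments = getopt(argv, "", ["tmp=", "origin="])
    origin = None
    for option, value in options:
        if option == "--origin":
            origin = value
    return origin, arguments


def connection_allowed(name, origin):
    """Allow a tcp:// name only if it is prefixed by the configured origin."""
    if origin is None:
        return False  # no origin configured: refuse all socket opens
    return name.startswith("tcp://" + origin)


origin, rest = parse_origin(["--origin=example.com:80", "pypy3-c-sandbox"])
print(connection_allowed("tcp://example.com:80", origin))  # True
print(connection_allowed("tcp://evil.org:80", origin))     # False
```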
This lab is due on the last day of classes, May 8th 2020 at midnight (23:59). You may work in pairs.
As with other labs, please submit via github and tag your submission.