python_code_loading_and_proj_struct.txt

Code loading: how to 'find' and run code in a location other than the main script being run.

Given a statement like 'import numpy', Python first looks for a 'numpy' key in the dictionary sys.modules. If the key exists, the module is not loaded or executed again: Python simply binds the name 'numpy' in the importing namespace to the existing module object. In other words, a module that has already been imported is not imported a second time.

sys.modules is a standard Python dictionary. Its keys are strings giving module names; these are the actual module names and not any aliases we give them (e.g. 'import numpy as np' creates a 'numpy' entry in sys.modules, not an 'np' entry). The corresponding values are the module objects. Because sys.modules is an ordinary dictionary, it can be modified at run-time, and this affects the behaviour of import statements from that point on. For example, if we delete the 'numpy' entry from sys.modules, then the next 'import numpy' will import numpy afresh, because there was no 'numpy' entry at the time of the import statement. However, this creates a new entry in sys.modules holding a new 'numpy' module object, distinct from the earlier one; in particular, references to the old module object may still exist elsewhere in the program. If you need to re-import a module (perhaps because you are editing it while testing it in an interpreter session), use importlib.reload() instead: it is designed for this purpose and re-executes the module's code in the existing module object, rather than being a 'hacky' solution like manually deleting entries from sys.modules.
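The behaviour described above can be checked in a short session. This is a minimal sketch; it uses the small standard-library module 'colorsys' purely as a stand-in for 'numpy', and the alias and variable names are made up for the illustration.

```python
import importlib
import sys

import colorsys                  # a small pure-Python standard-library module
import colorsys as color_alias   # the alias only binds a local name

# sys.modules is keyed by the real module name, never by the alias.
real_name_registered = "colorsys" in sys.modules
alias_registered = "color_alias" in sys.modules

# Both local names refer to the single module object held in sys.modules.
same_object = color_alias is colorsys is sys.modules["colorsys"]

# importlib.reload() re-executes the module's code inside the existing
# module object, so references held elsewhere stay valid.
reloaded = importlib.reload(colorsys)
reload_keeps_identity = reloaded is colorsys

# Deleting the entry and importing again builds a brand-new module object;
# the old object lives on for as long as something still references it.
old_module = sys.modules.pop("colorsys")
import colorsys as fresh_module
fresh_is_distinct = fresh_module is not old_module

print(real_name_registered, alias_registered, same_object)
print(reload_keeps_identity, fresh_is_distinct)
```

The last step is shown only to illustrate the pitfall; for real re-importing, importlib.reload() remains the better tool.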

Suppose instead that 'numpy' is not in sys.modules when we run 'import numpy'. Python then has to go and look for code to load. It searches for a module named 'numpy' in a specific list of directories, in a specific order; this list is called the module search path (MSP). The MSP can be accessed and modified from within a Python session via sys.path, which is an ordinary Python list. When we import a module, Python looks in the first directory in sys.path, then the second, then the third, and so on until it finds the module, at which point it stops searching and runs it; if no directory contains the module, a ModuleNotFoundError is raised.

Because the MSP is just an ordinary list, we can modify it at run-time. Modifying the MSP affects all imports from that point forth.
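To see the 'first match wins' rule in action, here is a small sketch. It creates two temporary directories that each contain a module with the same (made-up) name 'search_order_demo', puts both at the front of sys.path, and checks which one gets imported. The sys.path modification is for demonstration only and is undone afterwards. The second part assumes that no module called 'module_name_assumed_not_to_exist' is installed.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as first_dir, tempfile.TemporaryDirectory() as second_dir:
    # Two different files that share one module name.
    Path(first_dir, "search_order_demo.py").write_text("ORIGIN = 'first'\n")
    Path(second_dir, "search_order_demo.py").write_text("ORIGIN = 'second'\n")

    # Demonstration only: place both directories at the front of the MSP.
    sys.path[:0] = [first_dir, second_dir]
    importlib.invalidate_caches()
    try:
        import search_order_demo
        found_origin = search_order_demo.ORIGIN
    finally:
        # Undo the run-time changes so later imports are unaffected.
        sys.path.remove(first_dir)
        sys.path.remove(second_dir)
        sys.modules.pop("search_order_demo", None)

# A name that cannot be found anywhere on the MSP raises ModuleNotFoundError.
try:
    import module_name_assumed_not_to_exist
except ModuleNotFoundError as error:
    missing_name = error.name

print(found_origin, missing_name)
```

The copy in the directory that appears earlier in sys.path is the one that is loaded; the later copy is never looked at.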

When trying to enable Python to find a particular module, modifying the MSP at run-time is a tempting, 'easy' solution. However, it is not ideal and should generally be avoided. Instead, one should have an understanding of how the MSP is initialized and use this to structure code and configure the run-time environment appropriately (i.e. in such a way that any code which is desired to be loaded can be found and loaded). Messing with Python's import mechanism at run-time - be it through modifying sys.modules or sys.path or via other means - should in general be avoided.

The MSP is initialized as follows, listed from the back of the MSP to the front (so item 3 is searched first and item 1 last):
1. An installation-dependent list of directories configured at the time Python is installed.
2. The list of directories contained in the PYTHONPATH environment variable, if it is set.
3. The directory of the input script, or simply '' for the current working directory within a Python REPL session.
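The ordering above can be observed by launching a child interpreter with a known PYTHONPATH and asking it to report its sys.path. This is a sketch under a few assumptions: the directory and script names are invented, and PYTHONSAFEPATH is removed from the child's environment because, on Python versions that support it, that variable stops the script directory from being added.

```python
import json
import os
import subprocess
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    script_dir = Path(tmp, "scripts")
    extra_dir = Path(tmp, "extra")
    script_dir.mkdir()
    extra_dir.mkdir()

    # A tiny script that reports its own sys.path as JSON.
    script = script_dir / "show_path.py"
    script.write_text("import json, sys\nprint(json.dumps(sys.path))\n")

    # Child environment: PYTHONPATH holds exactly one extra directory.
    env = dict(os.environ, PYTHONPATH=str(extra_dir))
    env.pop("PYTHONSAFEPATH", None)

    result = subprocess.run(
        [sys.executable, str(script)],
        env=env, capture_output=True, text=True, check=True,
    )

    # Resolve symlinks so that temporary-directory aliases compare equal.
    child_path = [Path(entry).resolve() for entry in json.loads(result.stdout)]
    script_index = child_path.index(script_dir.resolve())
    extra_index = child_path.index(extra_dir.resolve())

print("script directory at index", script_index)
print("PYTHONPATH entry at index", extra_index)
```

The script directory should come before the PYTHONPATH entry, and both should come before the installation-dependent directories such as site-packages.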

A notable example of a directory in the first category is the site-packages directory of the particular Python interpreter. This is where (most) modules/packages installed via pip go, so this is where 'numpy' would be.

Probably the best way to enable the correct modules to be found is to modify the PYTHONPATH environment variable in the shell or shell script that launches the Python interpreter process.

What actually happens when a module is found via the MSP and is 'imported'? If the module is a .py file, say calculate.py, Python creates a new module object, adds it to sys.modules under the key 'calculate', and then runs the code in the file from top to bottom. The sys.modules entry ensures that any future import of the same module does not run the file again but simply yields a reference to the same module object. All of the global names defined in calculate.py become attributes of this module object and can be accessed via the '.' notation, e.g. calculate.sum refers to the 'sum' object defined globally in calculate.py. Finally, a reference to the module object is created in the importing file; which name is used depends on how the import was written. For example, 'import calculate' makes the name 'calculate' in the importing module a reference to the module object, while 'import calculate as c' makes the name 'c' a reference to it. If the imported module is actually a directory with an __init__.py file (a package), the process is exactly the same, except that it is the __init__.py file that is executed. Even something like 'from calculate import sum' causes the entire calculate.py file to be run from top to bottom and 'calculate' to be added to sys.modules; the only difference is that no name referring to the 'calculate' module object is created in the importing file. Instead, a name 'sum' is created in the importing module's namespace, which refers to sys.modules['calculate'].sum.
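The sketch below acts this out with a throwaway calculate.py and a main.py placed in the same temporary directory, then runs main.py in a child interpreter and captures what it prints. The file contents are invented for the illustration (calculate.py prints a line when it is executed so that we can count executions), and PYTHONSAFEPATH is dropped from the child's environment so that the script directory is on sys.path as usual.

```python
import os
import subprocess
import sys
import tempfile
import textwrap
from pathlib import Path

CALCULATE_SOURCE = textwrap.dedent("""\
    print("running calculate.py")

    def sum(a, b):  # named as in the text; shadows the built-in inside this module
        return a + b
""")

MAIN_SOURCE = textwrap.dedent("""\
    import sys

    import calculate           # first import: calculate.py is executed
    import calculate as c      # already in sys.modules: not executed again
    from calculate import sum  # binds only the name 'sum' in this module

    print(c is calculate is sys.modules["calculate"])
    print(sum is sys.modules["calculate"].sum)
    print(sum(2, 3))
""")

with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "calculate.py").write_text(CALCULATE_SOURCE)
    main_script = Path(tmp, "main.py")
    main_script.write_text(MAIN_SOURCE)

    env = {key: value for key, value in os.environ.items() if key != "PYTHONSAFEPATH"}
    result = subprocess.run(
        [sys.executable, str(main_script)],
        env=env, capture_output=True, text=True, check=True,
    )
    output_lines = result.stdout.splitlines()

print(output_lines)  # ['running calculate.py', 'True', 'True', '5']
```

The line 'running calculate.py' appears once even though main.py contains three import statements that mention calculate.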

Note that sys.modules is 'shared' among all of the modules/files in a single Python interpreter session. That is, sys.modules refers to the same thing in every part of the program. This ensures that if one script has 'import pandas' and another also has 'import pandas' then pandas is not imported twice; it is imported once when the first import is made, and the second import just leads to a reference to the first module object.
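A small sketch of this sharing, using made-up module names: module_a.py and module_b.py both import shared_dep.py, and main.py imports both of them. shared_dep.py prints a line when it is executed so that the number of executions is visible. As before, the files live in a temporary directory and main.py is run in a child interpreter with PYTHONSAFEPATH removed from its environment.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path

FILES = {
    "shared_dep.py": 'print("shared_dep executed")\n',
    "module_a.py": "import shared_dep\n",
    "module_b.py": "import shared_dep\n",
    "main.py": (
        "import module_a\n"
        "import module_b\n"
        "print(module_a.shared_dep is module_b.shared_dep)\n"
    ),
}

with tempfile.TemporaryDirectory() as tmp:
    for name, source in FILES.items():
        Path(tmp, name).write_text(source)

    env = {key: value for key, value in os.environ.items() if key != "PYTHONSAFEPATH"}
    result = subprocess.run(
        [sys.executable, str(Path(tmp, "main.py"))],
        env=env, capture_output=True, text=True, check=True,
    )
    shared_output = result.stdout.splitlines()

print(shared_output)  # ['shared_dep executed', 'True']
```

shared_dep.py runs once, when module_a imports it; module_b's import just picks up the existing entry in sys.modules.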

A word on PyCharm:
- PyCharm's editor locates modules depending on the 'Project Structure' within 'Preferences'. There, one can add 'Content Roots' and 'Sources Roots'. By default, the directory of the project is a Content Root. These roots determine what modules the editor 'thinks' are import-able (i.e. can be found).
- When it comes to running code within PyCharm, we do so using run configurations. What modules are import-able depends on the run configuration and is very much distinct from the editor's idea of what's import-able; it is quite possible for the editor to complain that it can't find certain modules while the run configuration is such that the code runs and finds all of them.
- By default, the run configuration of a given script has that script's directory as its working directory and adds all Content and Sources Roots to PYTHONPATH, so that the run configuration's idea of what's 'import-able' agrees with the editor's. This can be modified, of course.
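When the editor and a run configuration disagree, it can help to print what the running process actually sees. The helper below is a hypothetical diagnostic (its name and the dictionary keys are invented here); run it once from the IDE and once from a terminal and compare the output.

```python
import os
import sys


def describe_import_environment():
    """Collect the settings that decide which modules are import-able."""
    return {
        "working_directory": os.getcwd(),
        "PYTHONPATH": os.environ.get("PYTHONPATH", ""),
        "sys_path": list(sys.path),
    }


if __name__ == "__main__":
    for key, value in describe_import_environment().items():
        print(f"{key}: {value}")
```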

How to structure Python projects?
The structure of a Python project - how modules are organised and imported from one another - should reflect how the scripts in the project are going to be run (i.e. Which scripts are going to be run? With PYTHONPATH (and other environment variables) set to what?). But here's a fool-proof recipe:
- Have a top-level project folder within which everything lives (excluding third-party packages, of course).
- Run-time setup: Run scripts from the top-level project folder, making sure to add this directory to PYTHONPATH before running anything. Alternatively, only ever run scripts that sit directly in the top-level project folder, since by default the directory of the script being run is placed at the front of sys.path.
- Within the modules in the project, use absolute imports from the top-level project folder.
- Write filepaths in the modules relative to the top-level project folder. Running scripts from the top-level project folder ensures that the current working directory throughout program execution is the top-level project folder, so filepaths should be written accordingly. Use paths relative to the top-level folder rather than absolute paths, because the code should not depend on where in the filesystem the project lives.
- A final word: If using an IDE like PyCharm, make sure that the run configurations reflect this set-up. Specifically, one will have to override the default working directories, forcing them to be the top-level project folder (by default, they are the directories which the scripts are in). The only Content Root should be the top-level project folder, there should be no Sources Roots, and Content Roots should be added to PYTHONPATH in the run configurations (true by default).
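The recipe can be tried end to end with a throwaway project. The sketch below builds one in a temporary directory and runs a script from the project root with PYTHONPATH set to that root. The layout and all names in it (pkg, helpers.py, scripts/run.py, data/greeting.txt) are invented for the illustration.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path

PROJECT_FILES = {
    "data/greeting.txt": "hello\n",
    "pkg/__init__.py": "",
    "pkg/helpers.py": (
        "from pathlib import Path\n"
        "\n"
        "def load_greeting():\n"
        "    # Relative path: resolved against the working directory (the project root).\n"
        "    return Path('data/greeting.txt').read_text().strip()\n"
    ),
    "scripts/run.py": (
        "from pkg.helpers import load_greeting  # absolute import from the project root\n"
        "print(load_greeting())\n"
    ),
}

with tempfile.TemporaryDirectory() as tmp:
    project_root = Path(tmp, "project")
    for relative_name, content in PROJECT_FILES.items():
        target = project_root / relative_name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content)

    # Run-time setup from the recipe: working directory and PYTHONPATH
    # both point at the top-level project folder.
    env = dict(os.environ, PYTHONPATH=str(project_root))
    result = subprocess.run(
        [sys.executable, str(Path("scripts", "run.py"))],
        cwd=project_root, env=env, capture_output=True, text=True, check=True,
    )
    script_output = result.stdout.strip()

print(script_output)  # hello
```

Here scripts/run.py is not in the top-level folder, so the import of 'pkg' only works because the project root is on PYTHONPATH, and the data file is only found because the working directory is the project root.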