-
Notifications
You must be signed in to change notification settings - Fork 0
Query Builder
Query Builder is implemented with Python, Html and JavaScript as a Jupyter notebook. This section describes how the tool can be utilized from the AEGIS platform.
Query Builder is, by default, pre-installed and available in every newly created project. In case, however, that the Query Builder Jupyter notebook cannot be found in a specific project or the project is outdated, the user can download it and then, upload the QueryBuilder_v1.ipynb file into Jupyter from the corresponding menu, as seen in figure below.
Once the notebook is present, there are three ways to access it.
The notebook can be opened from the AEGIS Jupyter Tools page by clicking on the Query Builder tile.
Visit the Datasets page and then open "Jupyter" dataset which is available in all projects. The user should browse for the Query Builder notebook, right click on it and then select the "Open Jupyter Notebook" context menu option.
This approach makes again use of the context menu. Navigate to the Datasets Page and then open a dataset. Then, by right-clicking on any CSV file, the user is given the ability to open that file directly into the Query Builder, as it can be seen in the screenshot above.
Once the notebook has been loaded, the user sees the following:
This is the initial state. All code cells are hidden using the hide code plugin and all output has been cleared. If the user is not interested in reviewing or altering the code (it is not required in the standard workflow) he/she may proceed with hiding the cell toolbars. In order to do that, select "View" from the menu, then "Cell Toolbar" and click on the "None" option, as shown below.
The user should now see the following screen:
In order to initialize the notebook user must run the first cell. Make sure that the first cell is selected and either click on the "Run" button from the top Jupyter menu, or press Ctrl + Enter
.
User should see a button appearing as shown below.
Click on the "Initialise QB" button and wait for the execution to stop. When this is done, user should see a screen like the one below, where all cell prompts have been replaced by numbers and the file selection UI has appeared.
Note that the output regarding the Spark application will only appear when the interpreter is started. This means that if the "Initialise QB" button is pressed again, the message will disappear, but the tool will be still functional.
User can see a list of all "Available Datasets" in the current project. The list can also be refreshed. Select the desired dataset and click "Open dataset", this will populate the "Available Files" list.
Once the user selected a file and presses the "Open file" button, the file is loaded (as a PySpark Dataframe) and becomes available for further processing/querying.
Dataframes are essentially data tables. At any moment, Query Builder has two dataframes: the temporary dataframe (TempDF) which holds the result of the last performed action, i.e. the contents of a file or the result of applying a filter/query on the previous contents of TempDF and the master dataframe (MasterDF) which is explicitly updated by pressing a button that moves the TempDF contents to the MasterDF.
When a file is loaded, a form appears with several options for simple data filtering and processing.
Each selected action will open a new form with the necessary parameters. Once the user fills the required information and presses OK a new filter will be added in the queue to be applied when the user wants. As an example, if the user chooses the "Rename Column" filter he/she will then need to select which column to rename and which should be the new name.
Once the user presses "OK", the filter will be added in the queue, under the "Selected Filters" label
The user may add more than one filter before executing them. They will be added in the same list:
When the user decides to apply them, he/she should click on the "Apply Filters" button. When the execution is finished, all successfully applied filters will change color:
The user may continue adding new filters after that point. The list will show which filters have been already applied and which are pending:
By pressing the "Refresh temp" button, the user will get the first 40 lines of the TempDF. This is useful both when the user first loads a file but also to review the result of applying some filters.
A "Refresh Master" button is also available to preview the contents of the master dataset. Keep in mind that the two dataframes may hold completely different data. The user may also join the two dataframes through the corresponding option in the filters form. The result will be kept in the TempDF.
The user can use the "Update Query Output" button which will reveal a ready to copy and use python snippet which corresponds to the processes that have been applied on the tempDF.
By clicking the "Save master to CSV" button, the Save configuration form will appear where the user can select where to store the masterDF's data.
The user may load, process and combine data following the previous steps in any order, until he/she is satisfied with the end-result. At any time, the user may press the "Initialise QB" button to clear the screen and start over. When the user has completed his/her work with the Query Builder, he/she may either directly exit the Notebook or clear all the output first so that he/she can have a fresh start without any leftovers the next time. This can be achieved from Jupyter's top menu. Choose Cell -> All Output -> Clear.
Not applicable in the current version, since Query Builder is actually a Jupyter notebook