Author: Jorge Henriquez ([email protected])
Last updated: Mon Jun 29 2020
This proposal will add the ability to filter git repositories for certain file extensions.
Since testing the wordJS project will rely on testing various file formats, it is efficient to reuse test files that have already been written by other members of the open source community.
This tool will make it easier to do so.
Add a cross-platform utility to extract files that match a certain extension from git repositories.
The alternative is to do this manually, which would be inefficient and would take a tedious number of human hours.
Using rooster, we can clone many repositories at a time and get all the desired content.
Since this utility aims to be cross-platform, UNIX programs like bash and find are ruled out.
Instead we will use Golang, as it is easy to develop with and can ship binaries that the end user can run directly without any other dependencies.
Rooster will be executed in the following format:
rooster [git repo URL] [-ext ext1,ext2,ext3] (-out dirname)
where the [] brackets represent required parameters and the () brackets represent optional parameters.
Running rooster -repo https://github.com/foo/bar -ext ".doc,docx,.txt" will clone https://github.com/foo/bar and extract all files ending in the doc, docx, and txt extensions. Two things of note:
- The leading . is optional.
- Rooster will be strict and will NOT infer other file extensions from a given one (i.e. using .txt will only grab .txt files and NOT .text files).
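The two rules above could be implemented roughly as follows. This is a minimal Go sketch, not the final implementation: the helper names parseExts and wanted are hypothetical, and trimming whitespace around each entry is an assumption the proposal does not specify.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// parseExts (hypothetical helper) turns the comma-separated -ext value into a
// set. Entries are stored without the leading dot, so ".doc" and "doc" are
// equivalent. Whitespace trimming and skipping empty entries are assumptions.
func parseExts(arg string) map[string]struct{} {
	set := make(map[string]struct{})
	for _, e := range strings.Split(arg, ",") {
		e = strings.TrimPrefix(strings.TrimSpace(e), ".")
		if e != "" {
			set[e] = struct{}{}
		}
	}
	return set
}

// wanted (hypothetical helper) reports whether name's extension is exactly one
// of the requested extensions. No similar extensions are inferred.
func wanted(set map[string]struct{}, name string) bool {
	ext := strings.TrimPrefix(filepath.Ext(name), ".")
	_, ok := set[ext]
	return ok
}

func main() {
	set := parseExts(".doc,docx,.txt")
	for _, name := range []string{"a.doc", "b.docx", "notes.txt", "notes.text"} {
		fmt.Println(name, wanted(set, name))
	}
}
```

Storing extensions without the dot keeps the lookup a single exact map access, which is what makes the matching strict.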
Since the -out flag was omitted, rooster will create a directory named rooster_output, and once the program is done it will contain the following file structure:
rooster_output
├── doc
├── docx
└── txt
where all the docx files will be in the docx directory, all the doc files will be in the doc directory, and so on.
If the -out flag is used, then rooster_output is replaced by the argument passed to -out.
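The command-line handling described in this section might look like the following Go sketch using the standard flag package. It assumes the -repo form shown in the example invocation; the options struct and the parseArgs function are hypothetical names, and main uses a fixed argument list purely for demonstration where a real tool would pass os.Args[1:].

```go
package main

import (
	"errors"
	"flag"
	"fmt"
)

// options holds the parsed command-line values (hypothetical type).
type options struct {
	repo string // URL of the git repository to clone
	exts string // raw comma-separated extension list
	out  string // output directory
}

// parseArgs (hypothetical helper) parses rooster's flags. -repo and -ext are
// required; -out is optional and defaults to rooster_output.
func parseArgs(args []string) (options, error) {
	var o options
	fs := flag.NewFlagSet("rooster", flag.ContinueOnError)
	fs.StringVar(&o.repo, "repo", "", "URL of the git repository to clone (required)")
	fs.StringVar(&o.exts, "ext", "", "comma-separated file extensions (required)")
	fs.StringVar(&o.out, "out", "rooster_output", "output directory (optional)")
	if err := fs.Parse(args); err != nil {
		return o, err
	}
	if o.repo == "" || o.exts == "" {
		return o, errors.New("both -repo and -ext are required")
	}
	return o, nil
}

func main() {
	// Fixed arguments for demonstration only.
	o, err := parseArgs([]string{"-repo", "https://github.com/foo/bar", "-ext", ".doc,docx,.txt"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("repo=%s ext=%s out=%s\n", o.repo, o.exts, o.out)
}
```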
It is necessary to clone the entire git repo as there is no way to know where the files may be located within the repository.
To clone the repo we have two options:
1. Make a child process that directly invokes the git clone command.
2. Use libgit2 to clone the repo directly within the same process.
Going with option 2 is indeed the "appropriate" way to do this. But since this utility is rather small, the overhead of making a child process (option 1) isn't too taxing on the system. Option 1 also has the benefit of using whatever credentials the user has stored with git, for free and without extra implementation.
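A minimal sketch of the child-process approach in Go, using os/exec, could look like this. The function name cloneCommand and the destination directory are hypothetical; the sketch only builds the command so that it can be inspected, and calling cmd.Run() would perform the actual clone.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// cloneCommand (hypothetical helper) builds the git clone child process
// without running it. The standard streams are wired to the parent's so that
// git can show progress and, if needed, prompt using the user's own
// credential setup.
func cloneCommand(repoURL, dest string) *exec.Cmd {
	cmd := exec.Command("git", "clone", repoURL, dest)
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd
}

func main() {
	cmd := cloneCommand("https://github.com/foo/bar", "bar")
	fmt.Println("would run:", cmd.Args)
	// To actually clone:
	//   if err := cmd.Run(); err != nil { /* report the failure */ }
}
```

Because the child process is the user's own git binary, any configured credential helpers or SSH keys apply without extra code in rooster.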
- Initialize a set S.
- Initialize a map M that maps a file extension to an array of file paths.
- Add all given file extensions to the set S.
- Walk the given repo recursively:
  - Assign F to the current file.
  - If F's file extension is in the set S, then append F.path to M[F.ext].
  - Continue to the next file.
This algorithm then leaves M complete; we can then write M's representation back to the disk.