Overall questions: What does code look like? What are the basic statistics of code projects across a huge number of repos?
RQ1: What does a code project look like?
- Total LOC: linecount
- Number of files: filecount
- Number of classes: TODO
- Number of methods: TODO
- Number of different languages: filecount & linecount
- Number of GitHub stars: out-of-scope for VFP (API hits are expensive)
- Number of contributors: TODO
- Number of commits: TODO
- Repo age: TODO
- All of these should be aggregated by language (as designated by the GitHub repo?)
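A minimal sketch of how the file, line, and per-extension counts above could be gathered for one checked-out repo. It is independent of the linecount and filecount tools named in the list, whose interfaces are not described here. The function name `project_stats` is hypothetical, and using the file extension as a stand-in for language is an assumption (the notes leave open whether the GitHub-designated language should be used instead).

```python
import os
from collections import defaultdict


def project_stats(root):
    """Walk `root` and return (total_files, total_lines, per_extension).

    Assumption: file extension approximates language; files with no
    extension are grouped under "(none)".
    """
    per_ext = defaultdict(lambda: {"files": 0, "lines": 0})
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip version-control metadata so it does not inflate the counts.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            ext = os.path.splitext(name)[1].lower() or "(none)"
            path = os.path.join(dirpath, name)
            try:
                # Binary mode avoids decoding errors on non-UTF-8 files.
                with open(path, "rb") as fh:
                    lines = sum(1 for _ in fh)
            except OSError:
                continue  # unreadable file: leave it out of both counts
            per_ext[ext]["files"] += 1
            per_ext[ext]["lines"] += lines
    total_files = sum(v["files"] for v in per_ext.values())
    total_lines = sum(v["lines"] for v in per_ext.values())
    return total_files, total_lines, dict(per_ext)
```

The number of keys in the returned per-extension mapping gives a rough count of different languages, under the same extension-as-language assumption.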
RQ1.5/Maybe: What is the file structure of a repo?
- Number of directories: TODO
- Number of different file extensions: filecount
- Layout of file structure: TODO needs clarification
- Max directory depth: TODO
- All of these should be aggregated by language (as designated by the GitHub repo?)
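A rough sketch of the directory-level measurements above for one checked-out repo. The name `structure_stats` is hypothetical, and two conventions here are assumptions: the repo root has depth 0, and files without an extension count as one extra "extension".

```python
import os


def structure_stats(root):
    """Return directory count, max directory depth, and distinct extensions."""
    root = os.path.abspath(root)
    n_dirs = 0
    max_depth = 0
    extensions = set()
    for dirpath, dirnames, filenames in os.walk(root):
        # Ignore version-control metadata.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        rel = os.path.relpath(dirpath, root)
        # Assumption: the root itself is depth 0 and is not counted as a directory.
        depth = 0 if rel == "." else rel.count(os.sep) + 1
        if depth > 0:
            n_dirs += 1
        max_depth = max(max_depth, depth)
        for name in filenames:
            # splitext returns "" for names with no extension.
            extensions.add(os.path.splitext(name)[1].lower())
    return {
        "directories": n_dirs,
        "max_depth": max_depth,
        "extensions": len(extensions),
    }
```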
RQ2: What is the visual shape of code?
- Length of files: linecount
- Length of classes: TODO
- Length of functions: TODO
- Width of functions: TODO
- Heatmaps showing the shape?
- All of these should be aggregated by language (as designated by the GitHub repo?)
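One possible reading of "shape", sketched under assumptions: the length of a file is its number of lines, the width of a line is its length after tab expansion with trailing whitespace removed, and a heatmap is a grid counting how many files have a non-whitespace character at each (line, column) position. The names `line_widths` and `shape_grid`, the tab size of 4, and the grid limits are all hypothetical choices. Class and function boundaries would need a per-language parser and are not attempted here.

```python
def line_widths(text, tabsize=4):
    """Width of each line: length after tab expansion, trailing space stripped."""
    return [len(line.expandtabs(tabsize).rstrip()) for line in text.splitlines()]


def shape_grid(texts, max_rows=50, max_cols=100, tabsize=4):
    """grid[r][c] = number of texts with a non-space character at line r, column c.

    Lines beyond max_rows and columns beyond max_cols are ignored.
    """
    grid = [[0] * max_cols for _ in range(max_rows)]
    for text in texts:
        for r, line in enumerate(text.splitlines()[:max_rows]):
            expanded = line.expandtabs(tabsize)
            for c, ch in enumerate(expanded[:max_cols]):
                if not ch.isspace():
                    grid[r][c] += 1
    return grid
```

The grid can be handed to any plotting library as a 2D array; `len(line_widths(text))` gives the file length in lines.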
RQ3: What is in a line of code?
- Comments: TODO
- Stats on frequency and associations between token types
- Heatmaps showing different types of tokens?
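As an illustration of token-type frequencies and associations, here is a sketch that only handles Python source, using the standard library `tokenize` module. Other languages would need their own lexers. Treating "association" as adjacent token-type pairs is an assumption, and `token_type_stats` is a hypothetical name.

```python
import io
import tokenize
from collections import Counter


def token_type_stats(source):
    """Return (type_counts, adjacent_pair_counts) for Python source text.

    Raises tokenize.TokenError or SyntaxError on source that cannot be tokenized.
    """
    readline = io.StringIO(source).readline
    # tok_name maps numeric token types to names such as NAME, OP, COMMENT.
    types = [tokenize.tok_name[tok.type]
             for tok in tokenize.generate_tokens(readline)]
    counts = Counter(types)
    # Assumption: "association" is approximated by adjacent token-type pairs.
    pairs = Counter(zip(types, types[1:]))
    return counts, pairs
```

The `COMMENT` entry of the first counter gives a per-file comment count, which covers the Comments item above for Python files only.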
RQ4: What is the correlation between all of these results and various project factors?
- Relationships between RQ1-3 results and...
  - Number of stars
  - LOC
  - Number of contributors
  - Repo age
  - Number of commits
  - Time since last commit
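One way the relationships above could be measured is a rank correlation between a per-repo metric and a project factor. The sketch below uses Spearman's coefficient, chosen here as an assumption because factors like LOC and commit counts are likely to be heavily skewed; the notes do not name a specific statistic. The function names are hypothetical, and the code is pure Python to stay self-contained.

```python
def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank of the tied group
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation; returns None if either input has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return None
    return cov / (vx * vy) ** 0.5


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))
```

Each input is one value per repo, in the same repo order, for example max directory depth in `xs` and number of commits in `ys`.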