How to identify technologies ?

Nicolas Delsaux edited this page Sep 12, 2024 · 16 revisions

In order to provide good diagrams, we need to be able to automatically identify technologies in model elements for which a package manager (Maven, NPM, pip, etc.) is used. How can we do that ?
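As a starting point, the raw candidates can be read straight from a package manager manifest. The sketch below assumes an NPM `package.json`; the function name `list_npm_dependencies` and the sample manifest are hypothetical and only illustrate the idea, they are not part of aadarchi.

```python
import json
from typing import Dict


def list_npm_dependencies(package_json_text: str) -> Dict[str, str]:
    """Return {dependency name: version range} declared in a package.json document."""
    manifest = json.loads(package_json_text)
    dependencies: Dict[str, str] = {}
    # "dependencies" and "devDependencies" are the standard NPM manifest sections.
    for section in ("dependencies", "devDependencies"):
        dependencies.update(manifest.get(section, {}))
    return dependencies


# Hypothetical manifest, for illustration only.
SAMPLE_MANIFEST = """
{
  "name": "sample-app",
  "dependencies": {"react": "^18.2.0", "left-pad": "1.3.0"},
  "devDependencies": {"typescript": "^5.4.0"}
}
"""

if __name__ == "__main__":
    for name, version in sorted(list_npm_dependencies(SAMPLE_MANIFEST).items()):
        print(name, version)
```

This only lists every declared dependency. Deciding which of them deserve to be called a "technology" is the question the experiments below try to answer.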

Experiment 1 : Gathering opinions (interviews, forums)

What do we want to validate ?

We want to know, by asking developers directly (or through well-known websites), how they choose their technologies

Is there any preexisting solution or research ?

Polls like

Provide a good overview of the state of the technology landscape, but say nothing about the decision processes

Does this solution or research completely answer our needs ?

This first experiment should provide us with research directions : what are good ways for developers to choose their technologies ? These research directions will in turn allow us to perform experiments.

How will we validate this hypothesis ?

We have no hypothesis here

When is the experiment a success ?

If developers give us some queryable information sources.

What was the experimental process ?

To get ideas and opinions, we asked questions of the company's developers, as well as on various forums (Reddit, Discord, Stackoverflow (failed), développez.com (still awaiting validation)). Below is a summary of what was said.

In general, the devs we interviewed didn't have a clear opinion, but had some interesting ideas. It was recommended that we visit a number of sites and forums to see what technologies were listed.

Stackoverflow tech tags
Stacks from stackshare.io
Techempower
Github/google Advanced Search Reddit
ChatGPT
GitLab Auto DevOps and Red Hat OpenShift auto-detection of technologies.

As for the devs interviewed directly, the most common response was that, for an architecture doc, the names of the language(s) and frameworks are enough. This is relevant, of course, but not precise enough.

Forum message template

Hello everyone !

I'm working on a Maven project called aadarchi. It's a Maven archetype allowing you to easily create your agile architecture documentation using a mix of C4, Asciidoc and PlantUML.

So far it's been mostly focused on Java projects, and we're currently trying to make it work for JS, or maybe even Python. But there's a catch ! In order to provide good diagrams, we need to be able to automatically identify technologies in model elements for which a package manager (Maven, NPM, pip...) is used. But how ? How can we decide what is really a "technology", which deserves to be detected in a project (for example in package.json files) and used in its architecture documentation ? Of course, we already thought about "just the language and framework", but we need some advice...

So : What are your thoughts ?

Thank you !

--> QUESTION "OFF TOPIC", immediately closed.

What are the significant results ?

We got a list of websites to analyze, but no truly conclusive answer.

We suppose the subject relies heavily on developer culture and would require more analysis, both of information sources and of cultural dynamics (which is way off-topic for a pure technology research subject).

Experiment 2 : Scraping ?

What do we want to validate ?

A company dev suggested using a scraping script (Python Scrapy) to get the list of technologies from relevant sites. For him, it was worth trying on https://stackshare.io/ or maybe https://techdetector.de/welcome

Is there any preexisting solution or research ?

There is a StackShare dataset available at Coresignal
A Scrapy-based StackShare scraper is also available

Does this solution or research completely answer our needs ?

It could provide some complementary data

How will we validate this hypothesis ?

There is no hypothesis, only a search for data

When is the experiment a success ?

Considering we don't know what we're looking for, this experiment has been postponed

What are the results ?

Experiment 3 : How do GitLab Auto DevOps and others auto-detect technologies ?

What do we want to validate ?

While asking devs on Slack, I got this answer : "You might find some answers by looking at the way Gitlab Auto DevOps or Red Hat OpenShift "auto detects" the underlying language and technologies of a project." ~S.R

https://about.gitlab.com/stages-devops-lifecycle/auto-devops/

Is there any preexisting solution or research ?

Does this solution or research completely answer our needs ?

How will we validate this hypothesis ?

By reading the available code to understand how they analyze the applications to deploy

What are the results ?

Considering we have no knowledge of the GitLab platform, this experiment has been postponed.
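Although this experiment was postponed, one simple form of auto-detection can be sketched without any platform knowledge: looking for well-known build files at the root of a project. The code below is only an illustration of that idea, not GitLab's or OpenShift's actual logic; the `MARKER_FILES` table and both function names are assumptions made for this sketch.

```python
from pathlib import Path
from typing import Iterable, List

# Hypothetical mapping from a well-known build file to its package manager.
MARKER_FILES = {
    "pom.xml": "Maven",
    "package.json": "NPM",
    "requirements.txt": "pip",
}


def detect_package_managers(file_names: Iterable[str]) -> List[str]:
    """Return the sorted package managers whose marker file appears in file_names."""
    found = {MARKER_FILES[name] for name in file_names if name in MARKER_FILES}
    return sorted(found)


def detect_in_directory(project_root: str) -> List[str]:
    """Apply the detection to the files located directly under project_root."""
    names = (path.name for path in Path(project_root).iterdir() if path.is_file())
    return detect_package_managers(names)
```

This only tells which package manager is in use, which is the precondition stated at the top of this page. It says nothing yet about which dependencies matter.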

Experiment 4 :

In the end, we decided that scraping the sites seemed like a good solution, in order to recover as many technologies as possible on the most objective criteria possible. But we decided to do it in a more complete way, and on two different sites : mvnrepository and Stackoverflow. We used 2 tools, Scrapy (Python) for mvnrepository and a REST API for Stackoverflow, filtering technologies according to their popularity.
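To illustrate the Stackoverflow side, here is a minimal sketch that assumes the public Stack Exchange API `tags` endpoint is the REST API in question, which this page does not confirm. The function names and the `minimum_count` threshold are hypothetical.

```python
import gzip
import json
import urllib.parse
import urllib.request
from typing import List, Tuple

API_ROOT = "https://api.stackexchange.com/2.3/tags"


def build_tags_url(page_size: int = 100) -> str:
    """Build a URL listing Stack Overflow tags, most used first."""
    query = urllib.parse.urlencode({
        "site": "stackoverflow",
        "sort": "popular",
        "order": "desc",
        "pagesize": page_size,
    })
    return f"{API_ROOT}?{query}"


def popular_tags(payload: dict, minimum_count: int) -> List[Tuple[str, int]]:
    """Keep (tag name, question count) pairs with at least minimum_count questions."""
    return [
        (item["name"], item["count"])
        for item in payload.get("items", [])
        if item["count"] >= minimum_count
    ]


def fetch_payload(url: str) -> dict:
    """Download and decode one API response (the API compresses its responses)."""
    with urllib.request.urlopen(url) as response:
        body = response.read()
        if response.headers.get("Content-Encoding") == "gzip":
            body = gzip.decompress(body)
    return json.loads(body)
```

A call such as `popular_tags(fetch_payload(build_tags_url()), minimum_count=10000)` would then return the tags above the chosen popularity threshold. The threshold value is arbitrary and would need tuning.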

What do we want to validate ?

That there is a correlation between the number of downloads of a technology and the number of questions asked about it on Stackoverflow

Is there any preexisting solution or research ?

Stackoverflow provides a BigQuery dataset containing all their data up to 2023

Does this solution or research completely answers our needs ?

It should allow us to validate a hypothesis : there is a correlation between the Stackoverflow activity and the download count

How will we validate this hypothesis ?

Retrieve the question count per month
Retrieve the download count per month
Validate the correlation
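The correlation check could be sketched as follows, assuming both monthly series are already aligned month by month. Pearson's coefficient is used here as one possible measure; this page does not state which one was intended, and the function name is hypothetical.

```python
from math import sqrt
from typing import Sequence


def pearson_correlation(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Return Pearson's correlation coefficient between two aligned series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("both series need the same length, with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of the products of deviations from the mean (unnormalized covariance).
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / (spread_x * spread_y)
```

With `xs` holding the monthly question counts and `ys` the monthly download counts of one technology, a value close to 1 means the two series rise and fall together, while a value close to 0 means no linear relationship.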

When is the experiment a success ?

If we observe a significant correlation between both series

Technical limits

mvnrepository doesn't provide download count information for Java artifacts
Stackoverflow dataset is only available in BigQuery
mvnrepository and npmjs information is formatted differently

What are the results ?

Considering these figures are the best-formatted ones we can get, we started another initiative at Zenika to analyze them in more depth.
We also built an automated download count extractor for the best-known artifacts of all languages at aadarchi-technology-detector. This project provides automated up-to-the-month download figures for npmjs and Python dependencies (we're still searching for a way to get download figures for Java artifacts)
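For illustration, download figures for an npmjs package can be read from the public npm downloads API. The sketch below is not taken from aadarchi-technology-detector, whose implementation may differ; the function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

NPM_DOWNLOADS_ROOT = "https://api.npmjs.org/downloads/point"


def npm_downloads_url(package: str, period: str = "last-month") -> str:
    """Build the URL giving the total download count of a package over a period."""
    # "@" and "/" are kept as-is so that scoped names like "@scope/name" stay readable.
    return f"{NPM_DOWNLOADS_ROOT}/{period}/{urllib.parse.quote(package, safe='@/')}"


def parse_download_count(payload_text: str) -> int:
    """Extract the "downloads" field from an API response body."""
    return int(json.loads(payload_text)["downloads"])


def fetch_npm_download_count(package: str) -> int:
    """Query the API over the network and return last month's download count."""
    with urllib.request.urlopen(npm_downloads_url(package)) as response:
        return parse_download_count(response.read().decode("utf-8"))
```

Calling `fetch_npm_download_count("react")` performs one HTTP request. Repeating it month after month would give a series comparable with the Stackoverflow question counts.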

Experiment template

Use this template to describe a new experiment

Experiment name

What do we want to validate ?

Is there any preexisting solution or research ?

Does this solution or research completely answer our needs ?

How will we validate this hypothesis ?

When is the experiment a success ?

What are the results ?