Syncing new affiliations

Make sure that src/cncf-config/email-map has no duplicate emails that differ only in letter case: cd src, ./ cncf-config/email-map.
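The check described above can be sketched in a few lines of Ruby. This is only an illustration, not the repository's own script: the helper name `case_duplicate_emails` is hypothetical, and it assumes each non-comment line of the email-map starts with the email followed by whitespace and the company name.

```ruby
# Group email-map entries whose emails differ only by letter case.
# Assumption: every non-comment line starts with the email, then whitespace,
# then the company name. Lines starting with '#' are treated as comments.
def case_duplicate_emails(lines)
  groups = Hash.new { |hash, key| hash[key] = [] }
  lines.each do |line|
    line = line.strip
    next if line.empty? || line.start_with?('#')
    email = line.split(/\s+/, 2).first
    variants = groups[email.downcase]
    variants << email unless variants.include?(email)
  end
  # Keep only emails that appear with more than one spelling.
  groups.select { |_, variants| variants.size > 1 }
end

# Made-up sample data for illustration only.
sample = [
  '# comment',
  'Jane.Doe@example.com Example Inc',
  'jane.doe@example.com Example Inc',
  'bob@example.org Other Corp'
]
case_duplicate_emails(sample).each do |key, variants|
  puts "#{key}: #{variants.join(', ')}"
end
```

To check a real file, pass `File.readlines('cncf-config/email-map')` instead of the sample array.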

  1. If you generated new email-map using ./, then: mv email-map cncf-config/email-map
  2. To generate the git.log file and make sure it includes all orgs used by devstats, run cncf/devstats's GHA2DB_PROJECTS_OVERRIDE="+cncf,+opencontainers,+istio,+spinnaker,+knative,+linux,+zephyr" PG_PASS=... GHA2DB_EXTERNAL_INFO=1 GHA2DB_PROCESS_REPOS=1 ./get_repos and then the final command line that it generates. Make it uniq.
  3. Update repos.txt to contain all repositories returned by the above command. Update to include data from CNCF, CDF and LF.
  4. To run cncf/gitdm on a generated git.log file, run: cd src/; ~/dev/alt/gitdm/src/ -i git.log -r "^vendor/|/vendor/|^Godeps/" -R -n -b ./ -t -z -d -D -A -U -u -o all.txt -x all.csv -a all_affs.csv > all.out. The new approach is ./mtp, but it doesn't have a way (yet) to deal with the same emails mapped to different user names from different per-thread buckets.
  5. To generate human readable text affiliation files: first run: ./ then: SKIP_COMPANIES="(Unknown)" ./
  6. If updating via or (step 6) - run too. If you need LF actors, run: AWS_PROFILE=... KUBECONFIG=... ./ prior to running ./
  7. Consider ./ or ./ (if you run this, then copy result json somewhere and get 0-committers from previous version to save GH API points). Sometimes you should just run ./ without cache.
  8. Recommended: 2> errors.txt will refetch repos metadata and commits since last fetched and get users data from github_users.json so you can save a lot of API points. You can prepend with NCPUS=N to override autodetecting number of CPU cores available.
  9. To copy source type from previous JSON version do ./
  10. Run ./ to fix typical company name spelling errors, lower/upper case differences, etc. Update company-names-mapping before running this (with new typos/correlations data from the last 3 steps).
  11. To update (enhance) github_users.json with new affiliations ./ If you run ghusers you may need to update skip_github_logins.txt with new broken GitHub logins found. This is optional if you already have an enhanced json. You can prepend with NCPUS=N to override autodetecting number of CPU cores available.
  12. To merge with previous JSON use: ./
  13. To merge multiple GitHub logins data (for example propagate known affiliation to unknown or not found on the same GitHub login) run: ./
  14. Because this can find new affiliations you can now use ./ to import back from github_users.json and then ./ cncf-config/email-map and restart from step 4. This uses company-names-mapping file to import from GitHub company field.
  15. Run ./ and examine its output correlations.txt to try to normalize company names and remove common suffixes like Ltd., Corp. and downcase/upcase differences.
  16. Run ./check_spell to find fuzzy-match and spelling errors (it uses Levenshtein distance to find bugs).
  17. Run ./ and examine its output JSONs - those GitHub profiles have some useful data directly available - this will save you some manual research work.
  18. ALWAYS before any commit to GitHub run: ./ to remove any forbidden affiliations, please also see
  19. You can use ./ to clear all affiliations on a generated github_users.json.
  20. To make json unique, call ./unique_json.rb github_users.json. To sort JSON by commits, login, email use: ./sort_json.rb github_users.json.
  21. You should run genderize/geousers (if needed) before the next step.
  22. You can create smaller final json for cncf/devstats using ./ github_users.json; ./check_source.rb github_users.json; ./ github_users.json stripped.json; cp stripped.json ~/dev/go/src/
  23. To generate final unknowns.csv manual research task file run: ./gen_aff_task.rb unknowns.txt. You can also generate all actors ./gen_aff_task.rb alldevs.txt. You can prepend with ONLY_GH=1 to skip entries without GitHub. You can prepend with ONLY_EMP=1 to skip entries with any affiliation already set.
  24. To manually edit all affiliations related files: edit cncf-config/email-map all.txt all.csv all_affs.csv github_users.json stripped.json ../developers_affiliations.txt ../company_developers.txt affiliations.csv
  25. To add all possible entries from github_users.json to cncf-config/email-map use This is optional.
  26. Finally copy github_users.json to github_users.old. You can check if JSON fields are correct via ./ github_users.json, ./ stripped.json small.
  27. If any file displays error with 'Invalid UTF-8' encoding, scrub it using Ruby tool: ./scrub.rb filename.
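Steps 15 and 16 above rely on normalizing company names and on Levenshtein distance to spot near-duplicates. The sketch below shows one way those two ideas fit together. It is not the code of the repository's tools: the helper names, the suffix list and the distance threshold are all assumptions made for illustration.

```ruby
# Lowercase a company name and drop one trailing legal suffix.
# The suffix list is hypothetical; the real tools may use a different one.
def normalize_company(name)
  suffixes = /[,\s]+(ltd|corp|inc|llc|gmbh)\.?\z/i
  name.strip.sub(suffixes, '').downcase
end

# Classic dynamic-programming Levenshtein distance, keeping one row at a time.
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      # deletion, insertion, substitution
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Return pairs of distinct normalized names within max_distance edits.
def similar_companies(names, max_distance = 1)
  normalized = names.map { |n| normalize_company(n) }.uniq
  normalized.combination(2).select { |x, y| levenshtein(x, y) <= max_distance }
end

# Made-up names for illustration only.
similar_companies(['Google Inc.', 'google', 'Gogle', 'Red Hat, Inc']).each do |x, y|
  puts "possible typo: #{x} / #{y}"
end
```

Names that collapse to the same normalized form are candidates for company-names-mapping, while the reported pairs are candidates for manual review.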

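Step 27 mentions scrubbing files that fail with 'Invalid UTF-8'. A minimal sketch of that idea, using Ruby's built-in String#scrub, is shown here. The helper name `scrub_text` and the '?' replacement character are assumptions; the repository's scrub.rb may behave differently.

```ruby
# Replace invalid UTF-8 byte sequences so the text can be parsed safely.
# The replacement character is an arbitrary choice for this sketch.
def scrub_text(raw, replacement = '?')
  raw.dup.force_encoding(Encoding::UTF_8).scrub(replacement)
end

broken = "Jos\xE9 Example".b # contains a Latin-1 byte that is not valid UTF-8
clean = scrub_text(broken)
puts clean
puts clean.valid_encoding?
```

To clean a whole file, read it with `File.binread`, pass the result through the helper, and write the output back.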
Example command generated by cncf/devstats/get_repos:

  • ./ /root/devstats_repos/jenkins-x/* /root/devstats_repos/jenkinsci/* /root/devstats_repos/spinnaker/* /root/devstats_repos/tektoncd/* /root/devstats_repos/Azure/* /root/devstats_repos/BuoyantIO/* /root/devstats_repos/GoogleCloudPlatform/* /root/devstats_repos/OpenObservability/* /root/devstats_repos/RichiH/* /root/devstats_repos/Virtual-Kubelet/* /root/devstats_repos/alibaba/* /root/devstats_repos/apcera/* /root/devstats_repos/appc/* /root/devstats_repos/brigadecore/* /root/devstats_repos/buildpack/* /root/devstats_repos/cdfoundation/* /root/devstats_repos/cloudevents/* /root/devstats_repos/cncf/* /root/devstats_repos/containerd/* /root/devstats_repos/containernetworking/* /root/devstats_repos/coredns/* /root/devstats_repos/coreos/* /root/devstats_repos/cortexproject/* /root/devstats_repos/crosscloudci/* /root/devstats_repos/datawire/* /root/devstats_repos/docker/* /root/devstats_repos/dragonflyoss/* /root/devstats_repos/draios/* /root/devstats_repos/envoyproxy/* /root/devstats_repos/etcd-io/* /root/devstats_repos/falcosecurity/* /root/devstats_repos/fluent/* /root/devstats_repos/goharbor/* /root/devstats_repos/grpc/* /root/devstats_repos/helm/* /root/devstats_repos/istio/* /root/devstats_repos/jaegertracing/* /root/devstats_repos/knative/* /root/devstats_repos/kubeedge/* /root/devstats_repos/kubernetes/* /root/devstats_repos/kubernetes-client/* /root/devstats_repos/kubernetes-csi/* /root/devstats_repos/kubernetes-graveyard/* /root/devstats_repos/kubernetes-helm/* /root/devstats_repos/kubernetes-incubator/* /root/devstats_repos/kubernetes-incubator-retired/* /root/devstats_repos/kubernetes-retired/* /root/devstats_repos/kubernetes-security/* /root/devstats_repos/kubernetes-sig-testing/* /root/devstats_repos/kubernetes-sigs/* /root/devstats_repos/linkerd/* /root/devstats_repos/lyft/* /root/devstats_repos/miekg/* /root/devstats_repos/nats-io/* /root/devstats_repos/open-policy-agent/* /root/devstats_repos/opencontainers/* /root/devstats_repos/openeventing/* 
/root/devstats_repos/opentracing/* /root/devstats_repos/pingcap/* /root/devstats_repos/prometheus/* /root/devstats_repos/rkt/* /root/devstats_repos/rktproject/* /root/devstats_repos/rook/* /root/devstats_repos/spiffe/* /root/devstats_repos/telepresenceio/* /root/devstats_repos/theupdateframework/* /root/devstats_repos/tikv/* /root/devstats_repos/torvalds/* /root/devstats_repos/uber/* /root/devstats_repos/virtual-kubelet/* /root/devstats_repos/vitessio/* /root/devstats_repos/vmware/* /root/devstats_repos/weaveworks/* /root/devstats_repos/youtube/* /root/devstats_repos/zephyrproject-rtos/* /root/devstats_repos/iovisor/* /root/devstats_repos/mininet/* /root/devstats_repos/open-switch/* /root/devstats_repos/opencord/* /root/devstats_repos/opennetworkinglab/* /root/devstats_repos/opensecuritycontroller/* /root/devstats_repos/p4lang/* /root/devstats_repos/tungstenfabric/*.

To sync maintainers:

  1. Open CNCF projects maintainers list
  2. Save "Name", "Company", "GitHub name" columns to a new sheet and download it as "maintainers.csv".
  3. Add "name,company,login" CSV header.
  4. Example file
  5. Run [ONLYNEW=1] ./ script. Follow its instructions.
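Step 3 above (adding the CSV header) can be done by hand or with a small helper. The following is a minimal sketch; the method name `with_header` is hypothetical and it only checks the first line of the file.

```ruby
# Prepend the "name,company,login" header unless the first line already is it.
def with_header(csv_text)
  header = 'name,company,login'
  first_line = csv_text.lines.first.to_s.strip
  return csv_text if first_line == header
  "#{header}\n#{csv_text}"
end

# Made-up row for illustration only.
puts with_header("Jane Doe,Example Inc,janedoe\n")
```

For the real file, something like `File.write('maintainers.csv', with_header(File.read('maintainers.csv')))` would apply it in place.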

Add new project (cncf or non-cncf) to get affiliation for it.

Please follow the instructions from

Geodata and gender

To add geo data (country_id, tz) and gender data (sex, sex_prob), do the following:

  • Download file from geonames server.
  • Create geonames database via: sudo -u postgres createdb geonames, sudo -u postgres psql -f geonames.sql. Table details in
  • Unzip and run PG_PASS=... ./ allCountries.tsv - this will populate the DB.
  • Create indices on columns to speed up localization: sudo -u postgres psql -f geonames_idx.sql.
  • If this is the first geousers run, create geousers_cache.json via cp empty.json geousers_cache.json.
  • To use cache it is best to have stripped.json from the previous run. See step 22.
  • Enhance github_users.json via PG_PASS=... ./ github_users.json stripped.json geousers_cache.json 2000. It will add country_id and tz fields.
  • Go to and get your API_KEY; the basic subscription ($9) allows 100,000 monthly gender lookups.
  • If this is the first genderize run, create genderize_cache.json via cp empty.json genderize_cache.json.
  • Enhance github_users.json via API_KEY=... ./ github_users.json stripped.json genderize_cache.json 2000. It will add sex and sex_prob fields.
  • You can skip API_KEY=... but only 1000 gender lookups/day are allowed then.
  • Copy enhanced json to devstats: ./ github_users.json stripped.json; cp stripped.json ~/dev/go/src/devstats/github_users.json
  • Import new json on devstats using ./import_affs tool.
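The cache files mentioned above exist to avoid repeating paid or rate-limited lookups. The sketch below illustrates a cache-first lookup with a cap on new requests. It is an assumption about the general idea, not the repository's implementation: `cached_lookups`, the lowercase cache key and the `lookup` callable (a stand-in for the remote API) are all hypothetical.

```ruby
# Look up each name, preferring the cache and spending at most `budget`
# uncached lookups. `lookup` is a hypothetical callable standing in for the
# remote API; it should return a hash with 'sex' and 'sex_prob' keys.
def cached_lookups(names, cache, budget, lookup)
  used = 0
  names.uniq.each_with_object({}) do |name, results|
    key = name.downcase # assumed cache key; the real tools may differ
    unless cache.key?(key)
      next if used >= budget # out of quota: leave this name for a later run
      cache[key] = lookup.call(name)
      used += 1
    end
    results[name] = cache[key]
  end
end

# Placeholder API that returns fixed, made-up values.
fake_api = ->(_name) { { 'sex' => 'f', 'sex_prob' => 0.5 } }
cache = {}
first = cached_lookups(%w[Ann Bob Cid], cache, 2, fake_api)
second = cached_lookups(%w[Ann Bob Cid], cache, 2, fake_api)
puts first.size
puts second.size
```

Because the cache is filled as a side effect, a second run only spends its budget on names the first run could not reach.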

Manual affiliations

  • To import manual affiliations from a google sheet save this sheet as affiliations.csv and then use ./ script.
  • Prepend with UPDATE=1 to only import those marked as changed: column changes='x'.
  • Prepend with DBG=1 to enable verbose output.
  • After finishing import add a status line to affiliations_import.txt file and update the online spreadsheet.
  • After importing new data run ./src/ (from the src's parent directory). Do this after processing all data mentioned here, not after just importing new CSV.
  • Import generated csv/burndown.csv data into
  • To calculate the CNCF/LF ratio, take the number of CNCF found in the last commit minus the number of CNCF found in some previous commit, and divide it by the same difference for all actors.
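The ratio in the last item can be written out as a small calculation. This sketch follows one reading of that sentence: the growth in CNCF entries found between two commits, divided by the growth in entries found for all actors over the same two commits. The method and argument names are hypothetical and the numbers are made up.

```ruby
# Ratio of newly found CNCF actors to newly found actors overall.
# Each argument is a count taken at one of the two commits being compared.
def cncf_ratio(cncf_last, cncf_prev, all_last, all_prev)
  all_delta = all_last - all_prev
  raise ArgumentError, 'no change in the all-actors count' if all_delta.zero?
  (cncf_last - cncf_prev).to_f / all_delta
end

puts cncf_ratio(1200, 1000, 5400, 5000) # made-up counts
```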