SYNC.md

Syncing new affiliations

Make sure that src/cncf-config/email-map does not contain duplicate emails that differ only by letter case: cd src, then ./lower_unique.sh cncf-config/email-map.
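As an illustration of what "different case duplicates" means, here is a minimal read-only check. It is a sketch, not part of the repository: the function name case_duplicates is hypothetical, and it assumes that each email-map line starts with the email as its first whitespace-separated field and that lines starting with # are comments.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of cncf/gitdm.
# Prints (lowercased) every email that appears in the file with more than one
# letter-case spelling. Assumes the email is the first field on each line.
case_duplicates() {
  awk '
    /^#/ || NF == 0 { next }   # skip comments and blank lines
    {
      key = tolower($1)
      # Count each distinct spelling of the same lowercased email once.
      if (!((key, $1) in seen)) { seen[key, $1] = 1; spellings[key]++ }
    }
    END { for (k in spellings) if (spellings[k] > 1) print k }
  ' "$1" | sort
}
```

Running case_duplicates cncf-config/email-map before and after ./lower_unique.sh should show whether any such entries remain. Repeated lines with identical spelling (for example, several affiliations for one email) are not reported.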

  1. If you generated new email-map using ./import_affs.sh, then: mv email-map cncf-config/email-map
  2. To generate the git.log file and make sure it includes all orgs used by devstats, run cncf/devstats's GHA2DB_PROJECTS_OVERRIDE="+cncf,+opencontainers,+istio,+spinnaker,+knative,+linux,+zephyr" PG_PASS=... GHA2DB_EXTERNAL_INFO=1 GHA2DB_PROCESS_REPOS=1 ./get_repos and then run the final command line it generates. Remove duplicates from that command line first.
  3. Update repos.txt to contain all repositories returned by the above command. Update all_repos.sh to include data from CNCF, CDF and LF.
  4. To run cncf/gitdm on a generated git.log file run: cd src/; ~/dev/alt/gitdm/src/cncfdm.py -i git.log -r "^vendor/|/vendor/|^Godeps/" -R -n -b ./ -t -z -d -D -A -U -u -o all.txt -x all.csv -a all_affs.csv > all.out. The new approach is ./mtp, but it doesn't (yet) have a way to deal with the same emails mapped to different user names from different per-thread buckets.
  5. To generate human readable text affiliation files: first run: ./enchance_all_affs.sh then: SKIP_COMPANIES="(Unknown)" ./gen_aff_files.sh.
  6. If updating via ghusers.sh or ghusers_cached.sh (step 7), run generate_actors.sh too. If you need LF actors, run: AWS_PROFILE=... KUBECONFIG=... ./generate_actors_lf.sh prior to running ./generate_actors.sh.
  7. Consider ./ghusers_cached.sh or ./ghusers.sh (if you run this, then copy result json somewhere and get 0-committers from previous version to save GH API points). Sometimes you should just run ./ghusers.sh without cache.
  8. Recommended: ghusers_partially_cached.sh 2> errors.txt will refetch repos metadata and commits since last fetched and get users data from github_users.json so you can save a lot of API points. You can prepend with NCPUS=N to override autodetecting number of CPU cores available.
  9. To copy the source type from the previous JSON version, run ./copy_source.sh.
  10. Run ./company_names_mapping.sh to fix common company name spelling errors, lower/upper case differences, etc. Update company-names-mapping before running this (with new typo/correlation data from the last 3 steps).
  11. To update (enhance) github_users.json with new affiliations, run ./enhance_json.sh. If you ran ghusers, you may need to update skip_github_logins.txt with any new broken GitHub logins found. This is optional if you already have an enhanced JSON. You can prepend NCPUS=N to override autodetection of the number of available CPU cores.
  12. To merge with previous JSON use: ./merge_jsons.sh.
  13. To merge multiple GitHub logins data (for example propagate known affiliation to unknown or not found on the same GitHub login) run: ./merge_github_logins.sh.
  14. Because this can find new affiliations you can now use ./import_from_github_users.sh to import back from github_users.json and then ./lower_unique.sh cncf-config/email-map and restart from step 4. This uses company-names-mapping file to import from GitHub company field.
  15. Run ./correlations.sh and examine its output correlations.txt to try to normalize company names and remove common suffixes like Ltd., Corp. and downcase/upcase differences.
  16. Run ./check_spell to find fuzzy-match/spelling errors (it uses Levenshtein distance to find likely mistakes).
  17. Run ./lookup_json.sh and examine its output JSONs - those GitHub profiles have some useful data directly available - this will save you some manual research work.
  18. ALWAYS before any commit to GitHub run: ./handle_forbidden_data.sh to remove any forbidden affiliations, please also see FORBIDDEN_DATA.md.
  19. You can use ./clear_affiliations_in_json.sh to clear all affiliations on a generated github_users.json.
  20. To make json unique, call ./unique_json.rb github_users.json. To sort JSON by commits, login, email use: ./sort_json.rb github_users.json.
  21. You should run genderize/geousers (if needed) before the next step.
  22. You can create smaller final json for cncf/devstats using ./delete_json_fields.sh github_users.json; ./check_source.rb github_users.json; ./strip_json.sh github_users.json stripped.json; cp stripped.json ~/dev/go/src/github.com/cncf/devstats/github_users.json.
  23. To generate final unknowns.csv manual research task file run: ./gen_aff_task.rb unknowns.txt. You can also generate all actors ./gen_aff_task.rb alldevs.txt. You can prepend with ONLY_GH=1 to skip entries without GitHub. You can prepend with ONLY_EMP=1 to skip entries with any affiliation already set.
  24. To manually edit all affiliations related files: edit cncf-config/email-map all.txt all.csv all_affs.csv github_users.json stripped.json ../developers_affiliations.txt ../company_developers.txt affiliations.csv
  25. To add all possible entries from github_users.json to cncf-config/email-map use: github_users_to_map.sh. This is optional.
  26. Finally copy github_users.json to github_users.old. You can check if the JSON fields are correct via ./check_json_fields.sh github_users.json, ./check_json_fields.sh stripped.json small.
  27. If any file displays error with 'Invalid UTF-8' encoding, scrub it using Ruby tool: ./scrub.rb filename.
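
To illustrate the ideas behind steps 15 and 16 (normalizing company names and using Levenshtein distance to spot near-duplicates), here is a small sketch. It is not the code of ./correlations.sh or ./check_spell: the function names normalize_company and levenshtein are hypothetical, and the suffix list (Ltd, Corp, Inc, LLC) is an assumption chosen for the example.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of cncf/gitdm.

# Lowercase a company name and strip one trailing legal suffix
# (the suffix list here is an assumption for illustration).
normalize_company() {
  printf '%s\n' "$1" |
    tr '[:upper:]' '[:lower:]' |
    sed -E 's/[,[:space:]]+(ltd|corp|inc|llc)\.?[[:space:]]*$//'
}

# Classic dynamic-programming Levenshtein (edit) distance between two strings.
levenshtein() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    la = length(a); lb = length(b)
    # prev[j] holds the distance between the empty prefix of a and b[1..j].
    for (j = 0; j <= lb; j++) prev[j] = j
    for (i = 1; i <= la; i++) {
      cur[0] = i
      for (j = 1; j <= lb; j++) {
        cost = (substr(a, i, 1) == substr(b, j, 1)) ? 0 : 1
        best = prev[j] + 1                                     # deletion
        if (cur[j - 1] + 1 < best) best = cur[j - 1] + 1       # insertion
        if (prev[j - 1] + cost < best) best = prev[j - 1] + cost  # substitution
        cur[j] = best
      }
      for (j = 0; j <= lb; j++) prev[j] = cur[j]
    }
    print prev[lb]
  }'
}
```

Two company names whose normalized forms are equal, or within a small edit distance of each other, are candidates for a new entry in company-names-mapping. Any threshold is a judgment call and still needs manual review.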

Example command generated by cncf/devstats/get_repos:

  • ./all_repos_log.sh /root/devstats_repos/jenkins-x/* /root/devstats_repos/jenkinsci/* /root/devstats_repos/spinnaker/* /root/devstats_repos/tektoncd/* /root/devstats_repos/Azure/* /root/devstats_repos/BuoyantIO/* /root/devstats_repos/GoogleCloudPlatform/* /root/devstats_repos/OpenObservability/* /root/devstats_repos/RichiH/* /root/devstats_repos/Virtual-Kubelet/* /root/devstats_repos/alibaba/* /root/devstats_repos/apcera/* /root/devstats_repos/appc/* /root/devstats_repos/brigadecore/* /root/devstats_repos/buildpack/* /root/devstats_repos/cdfoundation/* /root/devstats_repos/cloudevents/* /root/devstats_repos/cncf/* /root/devstats_repos/containerd/* /root/devstats_repos/containernetworking/* /root/devstats_repos/coredns/* /root/devstats_repos/coreos/* /root/devstats_repos/cortexproject/* /root/devstats_repos/crosscloudci/* /root/devstats_repos/datawire/* /root/devstats_repos/docker/* /root/devstats_repos/dragonflyoss/* /root/devstats_repos/draios/* /root/devstats_repos/envoyproxy/* /root/devstats_repos/etcd-io/* /root/devstats_repos/falcosecurity/* /root/devstats_repos/fluent/* /root/devstats_repos/goharbor/* /root/devstats_repos/grpc/* /root/devstats_repos/helm/* /root/devstats_repos/istio/* /root/devstats_repos/jaegertracing/* /root/devstats_repos/knative/* /root/devstats_repos/kubeedge/* /root/devstats_repos/kubernetes/* /root/devstats_repos/kubernetes-client/* /root/devstats_repos/kubernetes-csi/* /root/devstats_repos/kubernetes-graveyard/* /root/devstats_repos/kubernetes-helm/* /root/devstats_repos/kubernetes-incubator/* /root/devstats_repos/kubernetes-incubator-retired/* /root/devstats_repos/kubernetes-retired/* /root/devstats_repos/kubernetes-security/* /root/devstats_repos/kubernetes-sig-testing/* /root/devstats_repos/kubernetes-sigs/* /root/devstats_repos/linkerd/* /root/devstats_repos/lyft/* /root/devstats_repos/miekg/* /root/devstats_repos/nats-io/* /root/devstats_repos/open-policy-agent/* /root/devstats_repos/opencontainers/* 
/root/devstats_repos/openeventing/* /root/devstats_repos/opentracing/* /root/devstats_repos/pingcap/* /root/devstats_repos/prometheus/* /root/devstats_repos/rkt/* /root/devstats_repos/rktproject/* /root/devstats_repos/rook/* /root/devstats_repos/spiffe/* /root/devstats_repos/telepresenceio/* /root/devstats_repos/theupdateframework/* /root/devstats_repos/tikv/* /root/devstats_repos/torvalds/* /root/devstats_repos/uber/* /root/devstats_repos/virtual-kubelet/* /root/devstats_repos/vitessio/* /root/devstats_repos/vmware/* /root/devstats_repos/weaveworks/* /root/devstats_repos/youtube/* /root/devstats_repos/zephyrproject-rtos/* /root/devstats_repos/iovisor/* /root/devstats_repos/mininet/* /root/devstats_repos/open-switch/* /root/devstats_repos/opencord/* /root/devstats_repos/opennetworkinglab/* /root/devstats_repos/opensecuritycontroller/* /root/devstats_repos/p4lang/* /root/devstats_repos/tungstenfabric/*.
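
Step 2 asks for the generated command line to be made unique. One way to drop repeated arguments while keeping their order is sketched below. The function name unique_args is hypothetical, and the sketch assumes that arguments contain no spaces or newlines, which holds for the directory patterns shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of cncf/gitdm or cncf/devstats.
# Prints its arguments on one line, separated by spaces, keeping only the
# first occurrence of each. Comparison is exact, so names that differ only
# by letter case are kept as distinct entries.
unique_args() {
  printf '%s\n' "$@" | awk '!seen[$0]++' | tr '\n' ' ' | sed 's/ $//'
}
```

For example, passing the quoted patterns '/root/devstats_repos/cncf/*' '/root/devstats_repos/helm/*' '/root/devstats_repos/cncf/*' yields the first two only. The result can then be used unquoted as the argument list of ./all_repos_log.sh so that the shell expands the globs.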

To sync maintainers:

  1. Open CNCF projects maintainers list
  2. Save "Name", "Company", "GitHub name" columns to a new sheet and download it as "maintainers.csv".
  3. Add "name,company,login" CSV header.
  4. Example file
  5. Run [ONLYNEW=1] ./maintainers.sh script. Follow its instructions.
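
Step 3 can be done by hand in an editor. As a sketch, the helper below prepends the "name,company,login" header only when the first line is not already that header. The function name ensure_header is hypothetical, and it assumes the downloaded maintainers.csv contains data rows only (if the sheet export carries its own column titles, remove that row first).

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of cncf/gitdm.
# Prepends the CSV header expected by the steps above unless it is already there.
ensure_header() {
  file="$1"
  header="name,company,login"
  if [ "$(head -n 1 "$file")" != "$header" ]; then
    tmp="$(mktemp)"
    # Write header plus original content to a temp file, then replace the original.
    { printf '%s\n' "$header"; cat "$file"; } > "$tmp" && mv "$tmp" "$file"
  fi
}
```

Calling ensure_header maintainers.csv twice is safe, since the second call leaves the file unchanged.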

Add a new project (CNCF or non-CNCF) to get affiliations for it

Please follow the instructions from ADD_PROJECT.md.

Geodata and gender

To add geo data (country_id, tz) and gender data (sex, sex_prob), do the following:

  • Download allCountries.zip file from geonames server.
  • Create geonames database via: sudo -u postgres createdb geonames, sudo -u postgres psql -f geonames.sql. Table details in geonames.info
  • Unzip allCountries.zip and run PG_PASS=... ./geodata.sh allCountries.tsv - this will populate the DB.
  • Create indices on columns to speed up localization: sudo -u postgres psql -f geonames_idx.sql.
  • If this is the first geousers run, create geousers_cache.json via cp empty.json geousers_cache.json.
  • To use cache it is best to have stripped.json from the previous run. See step 22.
  • Enhance github_users.json via PG_PASS=... ./geousers.sh github_users.json stripped.json geousers_cache.json 2000. It will add country_id and tz fields.
  • Go to store.genderize.io and get your API_KEY; the basic subscription ($9) allows 100,000 gender lookups per month.
  • If this is the first genderize run, create genderize_cache.json via cp empty.json genderize_cache.json.
  • Enhance github_users.json via API_KEY=... ./genderize.sh github_users.json stripped.json genderize_cache.json 2000. It will add sex and sex_prob fields.
  • You can skip API_KEY=... but only 1000 gender lookups/day are allowed then.
  • Copy enhanced json to devstats: ./strip_json.sh github_users.json stripped.json; cp stripped.json ~/dev/go/src/devstats/github_users.json
  • Import new json on devstats using ./import_affs tool.
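
The two "first run" bullets above both seed a cache file from empty.json. A small guard, sketched below, does this only when the cache does not exist yet, so an existing cache is never overwritten. The function name init_cache is hypothetical; the file names come from the steps above.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of cncf/gitdm.
# Creates the cache file from a seed file (default: empty.json) only if the
# cache is missing, so a cache from a previous run is left untouched.
init_cache() {
  cache="$1"
  seed="${2:-empty.json}"
  [ -f "$cache" ] || cp "$seed" "$cache"
}
```

With this, init_cache geousers_cache.json and init_cache genderize_cache.json can be run before every geousers.sh or genderize.sh invocation, not just the first one.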

Manual affiliations

  • To import manual affiliations from a Google Sheet, save the sheet as affiliations.csv and then run the ./affiliations.sh script.
  • Prepend with UPDATE=1 to only import those marked as changed: column changes='x'.
  • Prepend with DBG=1 to enable verbose output.
  • After finishing import add a status line to affiliations_import.txt file and update the online spreadsheet.
  • After importing new data run ./src/burndown.sh (from the src's parent directory). Do this after processing all data mentioned here, not after just importing new CSV.
  • Import generated csv/burndown.csv data into https://docs.google.com/spreadsheets/d/1RxEbZNefBKkgo3sJ2UQz0OCA91LDOopacQjfFBRRqhQ/edit?usp=sharing.
  • To calculate the CNCF/LF ratio, take the number of CNCF found as of the last commit minus the number of CNCF found as of some previous commit, and divide it by the same difference for all actors.
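
The ratio in the last bullet can be worked through as simple arithmetic. The sketch below assumes one reading of that bullet: (CNCF found now minus CNCF found earlier) divided by (all actors found now minus all actors found earlier). The function name found_ratio and the numbers in the usage note are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of cncf/gitdm.
# Arguments: cncf_now cncf_before all_now all_before
# Prints (cncf_now - cncf_before) / (all_now - all_before) with 3 decimals.
found_ratio() {
  awk -v cn="$1" -v co="$2" -v an="$3" -v ao="$4" 'BEGIN {
    d = an - ao
    # Guard against division by zero when nothing changed for all actors.
    if (d == 0) { print "undefined"; exit 1 }
    printf "%.3f\n", (cn - co) / d
  }'
}
```

With made-up counts, found_ratio 1200 1000 5000 4000 computes 200 / 1000 and prints 0.200.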