Sam's Meeting Notes

#USI 2 Meeting Notes

3/31

What are the main streets? What are the similarities between main streets in different boroughs?

We have information to zip code level (census and acs)

PLUTO has buildings information

Reference USA has info on 500,000 businesses in nyc from the past 5 years

Data on CUSP server

First steps:

Look at last year's work: report and code.
Compile list of datasets we have (size and formats)
Quick viz of incoming datasets

Need to get access to shell

compute.cusp.nyu.edu

For next time:

look at business atlas
look at last year's work (https://github.com/gdobler/nycep/commits?author=ak4706)

##4/2 Discussed accessing data on shell

##4/7 For zbp -> just select small business for analysis

Maybe big businesses are doing fine but small businesses suffer, or vice versa.

This is a precursor to reference usa

Are there areas of the city that trend similarly? need clustering

can also try to subdivide retail vs non retail

how did Hurricane Sandy affect small and large businesses in flood zones? were small businesses forced out?

for each time series: plot (z_it - zbar_i)/sde_i
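A minimal sketch of the standardization above, reading z_it as the value for zip code i in year t, zbar_i as that zip code's mean over time, and sde_i as its standard deviation. The counts and the `standardize` helper are made up for illustration, and using the sample standard deviation is an assumption, since the notes do not say which estimator is meant.

```python
from statistics import mean, stdev

def standardize(series):
    """Return (z_it - zbar_i) / sde_i for one zip code's time series."""
    zbar = mean(series)
    sde = stdev(series)  # sample standard deviation (assumption)
    if sde == 0:
        # A flat series has no variation to scale by; report zeros.
        return [0.0 for _ in series]
    return [(z - zbar) / sde for z in series]

# Hypothetical establishment counts per year, keyed by zip code.
counts_by_zip = {
    "10001": [120, 125, 118, 130, 140],
    "11201": [80, 82, 85, 90, 96],
}

standardized = {zc: standardize(s) for zc, s in counts_by_zip.items()}
```

After this step every series has mean 0 and standard deviation 1, so zip codes of very different sizes can be plotted on one axis and compared by trend alone.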

clustering package in scikit-learn: http://scikit-learn.org/stable/modules/clustering.html

do research to see why small businesses succeed or fail

what is a small business? how do small business owners define it

##4/9 Got reference usa data

security issues:

no copying data period
when using github don’t hardcode paths

will have to make symbolic links from my directory to the secure data. structure like ln -s path data
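One possible way to keep hardcoded paths out of the repo, sketched in Python. The `NYCEM_DATA` environment variable and the `data_dir` helper are hypothetical names; the fallback assumes a relative `data` entry such as the symlink described above.

```python
import os
from pathlib import Path

def data_dir(env_var="NYCEM_DATA", default="data"):
    """Locate the data directory without hardcoding an absolute path.

    Checks an environment variable first (hypothetical name), then
    falls back to a relative `data` entry, e.g. a symlink made with ln -s.
    """
    return Path(os.environ.get(env_var, default))

# Build a file path relative to the data directory; nothing is opened here.
hist_path = data_dir() / "Hist_Bus_2014.csv"
```

With this pattern the committed code only ever mentions `data`, and the real location of the secure files stays on the server.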

begin time series analysis

##4/14 Sick

##4/16 use pandas to easily view rows of data

import pandas as pd

df = pd.read_csv("Hist_Bus_2014.csv")

df.head() #gives top 5 rows

df.head(10) # gives top 10 rows

list(df.columns) #names the columns

df.dtypes #types of data for each column

df.describe() #like summary in R

df1 = df[df["INDFRM"] == 1] # creates new data frame where INDFRM equals 1

df2 = df[df["INDFRM"] == 2] # df1 is the data for small businesses (we think)

df["INDFRM"].value_counts() # tells you how many of each value there are

df["EMPSDT"].describe(percentiles = [.05,.25,.5,.75,.95]) #gives the value for each percentile input

pandas resource http://pandas.pydata.org/pandas-docs/version/0.15.2/10min.html

Kevin Sheppard http://www.kevinsheppard.com/images/0/09/Python_introduction.pdf

for github (to make your changes shared):

navigate to nycem folder

git add .

git commit -m "message"

git push

to get other users’ changes:

git pull

##4/23 Long term goal is to understand small business (employees < 50) in low income areas

Looking at REFUSA we see that offices of physicians and lawyers are the top two. We are unsure that we want to include these as small businesses.

Next steps:

Columns - Counts of business type, aggregate revenues, employee counts
Compare zbp data and REFUSA data. Are they consistent with each other? If we group refusa data by zip code we can try to compare its values with zbp. This could potentially tell us how representative of the true population of businesses refusa data is. Ex: compare number of businesses in each zip code in refusa with number of businesses in each zip code of zbp
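The zip-code comparison described above could look roughly like this. All records, totals, and field names are invented for illustration; the real REFUSA and zbp column names are not assumed here.

```python
from collections import Counter

# Hypothetical REFUSA records: one row per business, with its zip code.
refusa_rows = [
    {"name": "A", "zip": "10001"},
    {"name": "B", "zip": "10001"},
    {"name": "C", "zip": "11201"},
]

# Hypothetical zbp establishment totals per zip code.
zbp_counts = {"10001": 3, "11201": 1, "10451": 2}

refusa_counts = Counter(row["zip"] for row in refusa_rows)

# Compare over the union of zip codes so that areas missing from one
# source show up with a count of zero instead of being dropped.
comparison = {}
for zc in sorted(set(refusa_counts) | set(zbp_counts)):
    r = refusa_counts.get(zc, 0)
    z = zbp_counts.get(zc, 0)
    comparison[zc] = {"refusa": r, "zbp": z, "diff": r - z}
```

Zip codes where `diff` is far from zero are the ones to inspect when judging how representative REFUSA is.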

for sales per employee: what if we eliminate companies with over 50 employees first and then do the analysis for each business type
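A sketch of that order of operations: drop businesses with more than 50 employees first, then compute sales per employee for each business type. The records and field names are hypothetical, and the ratio shown is total sales over total employees per type, which is one of several reasonable definitions.

```python
from collections import defaultdict

MAX_EMPLOYEES = 50  # cutoff from the notes

# Hypothetical records; field names are illustrative, not REFUSA's.
businesses = [
    {"type": "restaurant", "employees": 10, "sales": 500_000},
    {"type": "restaurant", "employees": 30, "sales": 1_200_000},
    {"type": "restaurant", "employees": 400, "sales": 90_000_000},
    {"type": "law office", "employees": 5, "sales": 1_000_000},
    {"type": "law office", "employees": 0, "sales": 0},
]

totals = defaultdict(lambda: {"sales": 0, "employees": 0})
for b in businesses:
    # Drop large firms first, and skip rows with no employees
    # so the division below is always defined.
    if 0 < b["employees"] <= MAX_EMPLOYEES:
        totals[b["type"]]["sales"] += b["sales"]
        totals[b["type"]]["employees"] += b["employees"]

sales_per_employee = {
    t: v["sales"] / v["employees"] for t, v in totals.items()
}
```

Filtering before aggregating keeps one very large firm from dominating the figure for its whole business type.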

where are these businesses? manhattan seems to skew the data

how do we account for franchises?

to use plotting in shell (functionality doesn’t seem to work right now. need to email Foued):

ssh -X [shell login]
ssh -X compute

to create a python file in home directory in shell have to use vim editor

vi filename (opens file in vim editor)

escape :wq enter (leaves vim editor and saves)

list of vim commands

for presentation:

talk about what makes a small business. is it employee counts? revenue? what indicators are important?
Where is the project heading?
What have we done so far?

##4/28 Try:

pluto data has column “ComArea”, “UnitsRes”, etc. See if there are spaces where refusa does not have data that pluto says should have data.
also compare with zbp to see if they match as well. Can also check the distribution of naics codes

Greg and Tim meeting last week:

people had different interests
they are interested in viz tools and clustering (I will start looking into creating maps in d3)
we should know by the end of the week what they want to know from these maps

Going forward:

Focus on mapping.
Tie together all our datasets (zbp, refusa, census, maybe pluto?)
develop a clustering function that takes in two datasets: (1) a spatial dataset (zip codes or census tracts) to cluster on, and (2) refusa data that gives information about these spatial areas
For arcmap clusters plot timeseries for each cluster
Need to email Awais (for market value) and Rosemary (for refusa info: why so many physicians)

##4/30 Next steps:

Compare trends from zbp to trends in refusa.
Each of us needs to look at 5 industries to analyze. Look at sales per employee and other measures. Look at how they change by location and other ideas you may have
Bring in census data to look at businesses in areas with low income households (median income < 55k)
Need to redo clusters so normalization is done on each zip code instead of each year. Once this is done we can plot these clusters in a time series.

Notes from presentation:

Many students seemed to think employee count was all you needed to classify a business as small. I disagree with this
Some students believe that businesses like dentists and physicians can still be considered small. It is possible that we should include some of these companies as small business. This means we need to determine what qualifies as a small business for each industry. Some combination of employees and sales volume should be considered.

##5/5 In zip code business patterns totals there is payroll info. need to compare to refusa

Next steps:

See which kinds of businesses have most 0’s for sales volume.
Compare total employees and sales volume counts in refusa and zbp aggregated at zip code level
at neighborhood level: number of businesses in buckets of employee count, number of businesses older than 5 years (businesses that appear in every year of refusa), number of businesses by naics code, number of businesses by sales revenue, business by owner info sex/race (may be impossible), number of businesses that own or lease the space, number of home based businesses (also hard), number of lmi (low median income) businesses (low income < 55k for household)
in master dataset need to aggregate values to different spatial granularity (block, block group, tract, zip code)
After creating master dataset compare small businesses in low income areas vs non low income areas
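For the spatial-granularity step, census geographies nest by GEOID prefix: a 15-digit block GEOID is state (2) + county (3) + tract (6) + block (4), and the block group is the first digit of the block number. A minimal sketch of rolling block-level values up, with made-up counts and a hypothetical `aggregate` helper:

```python
from collections import defaultdict

# Prefix lengths of a 15-digit census block GEOID:
# state (2) + county (3) + tract (6) + block (4).
GEOID_PREFIX = {"tract": 11, "block_group": 12, "block": 15}

def aggregate(block_values, level):
    """Sum block-level values up to the requested census geography."""
    n = GEOID_PREFIX[level]
    out = defaultdict(int)
    for geoid, value in block_values.items():
        out[geoid[:n]] += value
    return dict(out)

# Hypothetical business counts per census block.
blocks = {
    "360610001001000": 4,
    "360610001001001": 2,
    "360610001002000": 1,
    "360610002001000": 7,
}

by_block_group = aggregate(blocks, "block_group")
by_tract = aggregate(blocks, "tract")
```

Zip codes do not nest inside this hierarchy, so aggregating to zip code would need a separate crosswalk rather than a prefix.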

Could try to find women/minority owned businesses (Julia knows how to get this data)

paper due thursday:

use blog posts
compare employee counts in refusa and zbp by bins. also plot difference vs business size.
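One way the binned comparison might be set up. The bin edges are illustrative (the notes do not fix them), the data are invented, and the zbp side is assumed to arrive already counted per size class.

```python
from bisect import bisect_right
from collections import Counter

# Illustrative size classes; the notes do not specify the bins.
EDGES = [5, 10, 20, 50, 100]
LABELS = ["1-4", "5-9", "10-19", "20-49", "50-99", "100+"]

def size_bin(employees):
    """Map an employee count to its size-class label."""
    return LABELS[bisect_right(EDGES, employees)]

def count_by_bin(employee_counts):
    """Count establishments per size class, in LABELS order."""
    c = Counter(size_bin(e) for e in employee_counts)
    return [c.get(label, 0) for label in LABELS]

# Hypothetical per-business employee counts from refusa.
refusa_employees = [2, 3, 7, 12, 45, 60, 250]
# Hypothetical zbp establishment counts, already binned (assumption).
zbp_bins = [3, 1, 1, 1, 0, 1]

refusa_bins = count_by_bin(refusa_employees)
diff_by_bin = [r - z for r, z in zip(refusa_bins, zbp_bins)]
```

Plotting `diff_by_bin` against `LABELS` gives the difference-versus-business-size view mentioned above.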

##5/12 Discussed our newest map comparing Refusa and zbp employee counts:

zbp is a survey that is then projected. zbp could get underreporting from low income areas
refusa could be more accurate.
there are weird pockets in the map where refusa has much higher counts. these appear to be low income areas.
it is possible that their employee counts are considering different values. maybe one of them counts part time employees and the other doesn’t.
we should look at establishment count as well. there shouldn’t be any confusion about what an establishment is.

Thursday would be a good day for d3 demo

department of finance has all building transactions. we could try to tie in this data or try to learn about trends near small businesses

data: http://www1.nyc.gov/site/finance/taxes/property-rolling-sales-data.page

##5/27 Going forward we will be meeting on Wednesdays at 9AM

We had issues with the server today. ipython notebook didn’t work.

We need to determine which metrics to include in our database. The list suggested by Citi can be a starting point: https://github.com/gdobler/nycem/wiki/05-05-15-Citi-Suggested-Metrics

##5/28 Ben’s trash viz

import cartodb into html. need to research how to use it as well as leaflet within js files. could do everything in d3 if we choose to. also look at mapbox, it could be better if our data is too large for cartodb. I would like to look around before we decide how to do it.

Ben is sharing his source code with us

Chris Whong has done some good work we can try to learn from: http://chriswhong.com/ https://github.com/chriswhong

##6/10

Data Tasks: (TJ & JMS)

First map out difference between file types (rh vs. sa vs. f vs. fa) then pull NY QWI files for Q4 2014 from here.

Using the following columns, create a unified NYC data set (isolating NYC census tracts) which includes the following columns and status flags:

periodicity C 1 Periodicity of report
seasonadj C 1 Seasonal Adjustment Indicator
geo_level C 1 Group: Geographic level of aggregation
geography C 8 Group: Geography code
ind_level C 1 Group: Industry level of aggregation
industry C 5 Group: Industry code
ownercode C 3 Group: Ownership group code
sex C 1 Group: Gender code
agegrp C 3 Group: Age group code (WIA)
race C 2 Group: race
ethnicity C 2 Group: ethnicity
firmage C 1 Group: Firm Age group
firmsize C 1 Group: Firm Size group
year N 3 Time: Year
quarter N 3 Time: Quarter
HirA N 8 Hires All: Counts
HirN N 8 Hires New: Counts
FrmJbGnS N 8 Firm Gain stable jobs: Counts
FrmJbLsS N 8 Firm Loss stable jobs: Counts
FrmJbCS N 8 Firm stable jobs change: Net Change
EarnS N 8 Employees stable jobs: Average monthly earnings
EarnHirAS N 8 Hires All stable jobs: Average monthly earnings
EarnHirNS N 8 Hires New stable jobs: Average monthly earnings
EarnSepS N 8 Separations stable jobs: Average monthly earnings
Payroll N 8 Total quarterly payroll: Sum
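A rough sketch of isolating NYC rows from a QWI file, assuming the `geography` code begins with the state and county FIPS code. The five borough FIPS codes are standard; the sample rows, the `KEEP` subset of columns, and the `nyc_rows` helper are made up for illustration.

```python
import csv
import io

# FIPS codes (state 36 + county) for the five NYC boroughs.
NYC_COUNTIES = {
    "36005": "Bronx",
    "36047": "Kings (Brooklyn)",
    "36061": "New York (Manhattan)",
    "36081": "Queens",
    "36085": "Richmond (Staten Island)",
}

# A few of the columns listed above, kept short for the example.
KEEP = ["geography", "industry", "firmsize", "year", "quarter", "EarnS", "Payroll"]

# Tiny made-up sample standing in for a downloaded QWI CSV.
SAMPLE = """geography,industry,firmsize,year,quarter,EarnS,Payroll
36061,44-45,1,2014,4,3500,1000000
36001,44-45,1,2014,4,3100,400000
36047,72,2,2014,4,2100,750000
"""

def nyc_rows(handle):
    """Yield only rows whose geography code falls in an NYC county."""
    for row in csv.DictReader(handle):
        if row["geography"][:5] in NYC_COUNTIES:
            yield {k: row[k] for k in KEEP}

nyc = list(nyc_rows(io.StringIO(SAMPLE)))
```

Reading the geography code as text rather than as a number keeps any leading zeros intact.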

Front-End Tasks: (KRL & SP)

Meet over the week to discuss MapBox documentation and any work samples of comparable d3.js work.

Metric Tasks: (JMS)

Review academic research from TS and pull useful success/failure metrics for consideration next week.

JMS to devise a project schedule for now through July 24th.

##6/17 Just use census tract polygons. It is what Citi requested and is a much smaller dataset.

What is our moneyplot/punchline? How can we measure success, especially in LMI areas? For a given region and time, what is the probability that a specific type of business survives for three more years?
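As a first pass at the survival question, an empirical rate can be computed from which years each business appears in the data. This is only a sketch: the business IDs and years are invented, `survival_rate` is a hypothetical helper, and "appears in the data" is being used as a stand-in for "still operating".

```python
def survival_rate(years_by_business, start_year, horizon=3):
    """Share of businesses listed in start_year that are still listed
    in start_year + horizon. Returns None if none were listed."""
    alive = [ys for ys in years_by_business.values() if start_year in ys]
    if not alive:
        return None
    survived = sum(1 for ys in alive if start_year + horizon in ys)
    return survived / len(alive)

# Hypothetical business IDs mapped to the years they appear in the data.
presence = {
    "b1": {2010, 2011, 2012, 2013},
    "b2": {2010, 2011},
    "b3": {2010, 2011, 2012, 2013, 2014},
    "b4": {2012, 2013},
}

rate_2010 = survival_rate(presence, 2010)
```

Restricting `presence` to one region and one business type before calling the function gives the per-region, per-type estimate the question asks for.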

From QWI we can get tenure information. From there we can see the distribution in each census tract. From there we can compare LMI census tracts to other census tracts to see if there is any difference. Possibly do clustering analysis as well.

For visualization we can have the user select the attribute they want to visualize. Then the visualization shows the years available for that attribute and the user can select the years they want to look at. A hover tool can show the time series for an individual census tract.

Once we have our final dataset we may do some statistical modeling to determine which attributes contribute most to successful business in an area. We can also narrow this analysis to LMI businesses.

##6/24 Need to incorporate LODES data into our database. May need to integrate for census tract. Need to see what columns are available. Can incorporate it with kenny’s data to see if it matches refusa and census data.

Kenny needs to create a data dictionary.

We should have a map of which census tracts are low median income.
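The data behind such a map could be a simple per-tract flag, using the household median income cutoff of 55k mentioned in earlier meetings. The tract GEOIDs and incomes below are invented, and treating tracts with no reported income as not LMI is an assumption.

```python
LMI_THRESHOLD = 55_000  # household median income cutoff from the notes

def is_lmi(median_income):
    """True when a tract's median household income is below the cutoff.

    Tracts with no reported income are treated as not LMI (assumption).
    """
    return median_income is not None and median_income < LMI_THRESHOLD

# Hypothetical tract GEOIDs and median household incomes.
tract_income = {
    "36005000100": 31_000,
    "36061000700": 120_000,
    "36047000200": 54_999,
    "36081000100": None,  # income not reported
}

lmi_flags = {geoid: is_lmi(inc) for geoid, inc in tract_income.items()}
```

The resulting flags can be joined to the census tract polygons by GEOID to color the map.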

Include summary of previous work until now including who did what. Include schedule for going forward.

##7/1 Do we want the option to select several years for our map? we should also have ALL options so you can see stats for an entire industry without needing to be as specific

To copy a directory from shell to local desktop: scp -r user@host:/path/to/foo /home/user/Desktop/ (user@host stands for your shell login and server). The -r flag copies recursively, so all files in that directory end up in Desktop

what other attributes do we want to add from acs? race -> percent minority. age -> percent between 25-35. household value. population.

aggregated files. one with just a total aggregation. Others can be added as well that have filtered out certain businesses (ex: only businesses with <100 employees).

Lodes data: has data for each census block and naics code.

revisit comparing refusa and zbp. need to see how similar they are within naics codes.

##7/15 mapping ideas:

add ‘no industry selected’ to sidebar when no census tracts
add styling to show when a button is clicked. right now you cannot see this.
add styling to clean up dropdown menus
add attributes
change polygon color to semi transparent red (or other brighter color)
add clustering to popup. make an option to click to see other similar census tracts
show LMI button. it could add color to all census tracts that qualify as LMI.
need new data set that says which cluster each census tract is in and whether it is LMI
