To attach a new volume in VirtualBox at run time without restarting, we use the virtIO controller in the storage settings
On AWS, we can attach new volumes without rebooting or stopping the instance
$ partprobe -> re-reads the partition tables so the kernel picks up newly attached HDDs
mount -a -> re-reads the fstab file and mounts any HDD listed there that is not already mounted
to permit root login on AWS, open the /etc/ssh/sshd_config file, set PermitRootLogin to yes, and restart sshd
lsblk --output=UUID device-name -> gives the UUID of a specific HDD
SSH
ssh ip-address # logs in to the other system as the same user you are running this command as
ssh stands for Secure Shell
ssh -X root@ip -> enables X11 forwarding, so graphical programs on the remote VirtualBox VM display locally
LVM
LVM stands for Logical Volume Manager
-l extents -> used to give the size in terms of extents instead of bytes
lvcreate --name tech -l 30 vg-name
to display the logical volumes in short form, use lvs
to give a physical extent size to a VG group, use -s
vgcreate -s 16M vg-name hdd-name1 hdd-name2
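Since the LV size is just extent count times extent size, the two commands above can be sanity-checked with quick arithmetic (16M and 30 are the values from the commands; the variable names are mine):

```python
# LV size = number of extents * physical-extent size.
# 16 MiB and 30 extents match the vgcreate -s 16M / lvcreate -l 30 lines above.
extent_size_mib = 16
extent_count = 30
lv_size_mib = extent_count * extent_size_mib
print(lv_size_mib)  # 480 -> the "tech" LV would be 480 MiB
```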
Data Engineering
A classifier requires accurate data for its processing algorithm
pre-processing of data before applying ML is known as Data Engineering
tasks in Data Engineering ->
Clean
Recycle
Auto Fill
pandas is similar to SQL in spirit and creates its own tabular structure called a DataFrame
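A minimal sketch of the SQL-like feel of a DataFrame (the column names and rows here are made up for illustration):

```python
import pandas as pd

# A DataFrame is a table of named columns, much like a SQL result set.
df = pd.DataFrame({'country': ['India', 'France'], 'age': [27, 38]})

# SQL-style filtering: roughly SELECT * FROM df WHERE age > 30
older = df[df['age'] > 30]
print(older)
```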
Imputer -> replacing missing numerical values with relevant data (e.g., the column mean) is done with the help of an imputer
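The idea can be sketched without sklearn: replace each missing entry with the mean of its column (the toy numbers are assumed for illustration):

```python
import numpy as np

# A column with one missing value (NaN).
col = np.array([20.0, np.nan, 40.0])

# Mean of the non-missing entries: (20 + 40) / 2 = 30.
mean = np.nanmean(col)

# Impute: fill every NaN with the column mean.
col[np.isnan(col)] = mean
print(col)  # [20. 30. 40.]
```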
Data processing has a technique called dummy variables
A dummy variable encodes a string category into an array like [1,0,0], where the position matching the category is flagged as 1 and the others are 0, and the length of the array equals the number of distinct values
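A hand-rolled sketch of this encoding (the category names are assumed for illustration; OneHotEncoder does the same thing in the code further below):

```python
# Distinct values define the array length; the matching position is flagged 1.
values = ['France', 'Spain', 'Germany']   # 3 distinct categories
categories = sorted(set(values))          # ['France', 'Germany', 'Spain']

def dummy(value):
    # One slot per distinct category: 1 where it matches, 0 elsewhere.
    return [1 if c == value else 0 for c in categories]

print(dummy('France'))   # [1, 0, 0]
print(dummy('Germany'))  # [0, 1, 0]
```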
To calculate distance, KNN uses the Euclidean distance formula, that is
root((x1-x2)^2 + (y1-y2)^2 + (z1-z2)^2)
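The same formula in code, for two 3-D points (the coordinates are assumed for illustration):

```python
import math

# Euclidean distance between (x1, y1, z1) and (x2, y2, z2).
p1 = (1.0, 2.0, 3.0)
p2 = (4.0, 6.0, 3.0)

# sqrt((1-4)^2 + (2-6)^2 + (3-3)^2) = sqrt(9 + 16 + 0)
dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(p1, p2)))
print(dist)  # 5.0
```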
Feature Scaling -> a data-preparation method where we convert features into ranges comparable with each other
e.g., one feature has values (27, 38, 59) and another has values (10000, 239999, 38888); to bring both features into a similar range, this method is applied
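A minimal min-max scaling sketch (one common form of feature scaling, using the numbers from the example above):

```python
def min_max_scale(values):
    # Rescale values into the [0, 1] range: (v - min) / (max - min).
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

small = min_max_scale([27, 38, 59])
big = min_max_scale([10000, 239999, 38888])

# Both features now live in the same [0, 1] range.
print(small)  # [0.0, 0.34375, 1.0]
print(big[0], big[1])  # 0.0 1.0
```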
Imputing is a part of Data Mining & Engineering
Today's Python Jupyter code
import pandas as pd
# Reading csv file from URL
df = pd.read_csv('http://13.234.66.67/summer19/datasets/info.csv')
df.info()
df
# separating out data or columns; .values gives only the values, not the headers
x = df.iloc[:, 0:].values
# To remove missing values, or replace them with some relevant data
df.describe()  # describes the numerical columns
from sklearn.preprocessing import Imputer
# axis=0 -> check for missing data column-wise
# strategy -> how to replace the missing values
imp = Imputer(missing_values='NaN', axis=0, strategy='mean')
# fitting the columns that we want to process
impute = imp.fit(x[:, 1:3])  # needs a 2D array; fit builds the schema
# now transforming the fitted columns
x[:, 1:3] = impute.transform(x[:, 1:3])
x  # printing the value of x; missing values are replaced per the strategy
# to label any string with some int or float value
from sklearn.preprocessing import LabelEncoder
cont = LabelEncoder()  # object made for country labelling
# Now applying the label in column 0: it replaces strings with integers
x[:, 0] = cont.fit_transform(x[:, 0])
# Now replacing the label in the last column: yes/no becomes 1/0
x[:, -1] = cont.fit_transform(x[:, -1])
# Now encoding the first column, i.e., making sub-columns of column 0
from sklearn.preprocessing import OneHotEncoder
firstcl = OneHotEncoder(categorical_features=[0])  # exact column index where we want to make categories
# fit_transform converts x into a sparse matrix; toarray() converts it into an ndarray
x = firstcl.fit_transform(x).toarray()
x.astype(int)  # converts the array to proper integer type
# Diabetic Data
# from sklearn.datasets import load_diabetes
import pandas as pd
data = pd.read_csv('http://13.234.66.67/summer19/datasets/diabetest.csv')
# now printing the schema of the data
data.info()
# Describing the data
data.describe()
# printing the top 5 rows of the original data
data.head(5)
# plot a particular column's value counts using seaborn
import seaborn as sb
sb.countplot(data['Pregnancies'])
sb.countplot(data['Glucose'])
data.hist(figsize=(15,20))
sb.pairplot(data)
# Extract attributes (features) from the DataFrame
features = data.iloc[:, 0:8].values
# Extract the label from the DataFrame
label = data.iloc[:, 8].values
label
label.shape
# Separating training and testing data
from sklearn.model_selection import train_test_split
trainFeature, testFeature, trainLabel, testLabel = train_test_split(features, label, test_size=0.2)
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
trained=clf.fit(trainFeature,trainLabel)
predict=trained.predict(testFeature)
from sklearn.metrics import accuracy_score
accuracy_score(testLabel, predict)
# Doing the same with KNN
from sklearn.neighbors import KNeighborsClassifier
kclf = KNeighborsClassifier(n_neighbors=5)  # 5 is the default value of K
# now training on the data
ktrained = kclf.fit(trainFeature, trainLabel)
predict=ktrained.predict(testFeature)
accuracy_score(testLabel,predict)