Ops for data scientists
https://www.gnu.org/home.en.html
https://dagshub.com/blog/effective-linux-bash-data-scientists/
https://dagshub.com/blog/setting-up-data-science-workspace-with-docker/
https://linuxjourney.com/lesson/touch-command
https://www.redhat.com/en/topics/linux
- What is Linux?
- Why do people use it?
- What is Bash?
- How to use the terminal?
- How to exit vim?!
- Why would you use Linux, Bash, and other system tools?
- What's the smart way to do it, based on our subjective experience?
- What common problems will you come across, and how to solve them?
- What's the mental framework for working with these tools, to gain understanding and learn more by playing?
The curriculum and some of the tips are aimed at data scientists who want an introduction to the topics of Linux & Bash. However, the data science orientation mainly comes into play in a few domain specific tips, and in the stated motivations to learn these things - if you're an aspiring web developer, there's no reason not to benefit from this guide as well!
GNU is an operating system that is free software; that is, it respects users' freedom. The GNU operating system consists of GNU packages (programs specifically released by the GNU Project) as well as free software released by third parties. The development of GNU made it possible to use a computer without software that would trample your freedom.
We recommend installable versions of GNU (more precisely, GNU/Linux distributions) which are entirely free software. More about GNU below.
You can find a simple answer here.
- A family of open source operating systems.
- Developed by Linus Torvalds, who also invented Git to manage the source code for Linux.
- An operating system is a program that takes over a bit after your computer turns on.
- For the first few seconds after your computer switches on, the motherboard runs a small, hard-coded program called the BIOS, but it quickly hands control over to an operating system kernel, which is installed on one of the hard drives, a USB stick, or a CD.
- From that point on, the kernel decides which programs to run when, and how to control physical devices (via drivers).
- An operating system is a bundle of programs that come packaged together. The kernel is the most important part, but it comes with more programs which help the users communicate with the kernel.
- e.g. File explorers are part of the OS, but not the kernel - they're just graphical interfaces which sit between the user and the kernel.
- Operating systems normally also handle file systems, user permissions, memory management, and many other things.
- The thing that unites all the different operating systems in the Linux family is they all use the same Linux kernel - other parts differ. More on that later in the section about distributions.
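One way to see the kernel/distribution split on a running machine is sketched below. It assumes a Linux system; /etc/os-release is present on most modern distributions but is not guaranteed, so the script checks for it first. The variable names are illustrative only.

```shell
#!/usr/bin/env bash
# Kernel name and release: the part shared by every Linux distribution.
kernel_name=$(uname -s)
kernel_release=$(uname -r)
echo "Kernel: $kernel_name $kernel_release"

# Distribution name: this is what differs between Ubuntu, Fedora, Alpine, etc.
# Assumption: /etc/os-release exists (true on most modern distros).
if [ -r /etc/os-release ]; then
    distro=$(. /etc/os-release && echo "${PRETTY_NAME:-unknown}")
else
    distro="unknown"
fi
echo "Distribution: $distro"
```

Two machines running different distributions can report different distribution names while both report Linux as the kernel name.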
An operating system is, surprisingly, just a type of system. Systems are designed by humans, and better designs lead to better performance, stability, and flexibility. Linux is simply a better designed operating system. It's super flexible and stable - "blue screens of death" are exceedingly rare on production Linux servers, and its performance is very reliable. That is why the vast majority of production systems run on Linux, and also why it's good for anyone working in tech to be Linux literate. That includes you, dear reader.
Being open source leads to high quality, as bugs have fewer dark places to hide in. Developers can peer under the covers to make sure their Linux applications will work well, rather than guessing and relying on questionable documentation from closed source operating system developers.
But with great power and flexibility comes a great ability to shoot yourself in the foot. Linux makes that easy as well.
- macOS and Unix are very similar to Linux, but technically they are not Linux. You will have a hard time telling the difference unless you dive deep.
- Unix is older than Linux and extremely similar - In fact, Linux is an open source re-implementation of Unix (which was closed source, but very good). This is pretty much historic trivia, as Unix is rarely seen nowadays, but know that some people use the words Unix and Linux interchangeably.
- In general, there’s a name for operating systems that look and feel like Unix – POSIX compliant, or *nix. When you see these words, translate them as “follows the conventions of Linux, such as basic commands for file manipulation (ls, cd, mkdir) and "/" as the root of the file system etc.”
- GNU is a large set of free software which is the foundation for much of Linux – compilers, C libraries, programs to zip files, and many others. It's also the name of an independent POSIX operating system, with more hardcore ideology around free software than Linux.
- All of the above systems, as well as Linux itself, are examples of POSIX compliant or *nix systems.
- There are (too?) many flavours of “real Linux”, called distros or distributions. It can be a headache to differentiate them.
- A distribution is like a "company", which invents a new operating system. They wrap the Linux Kernel with a new bundle of peripheral programs - i.e. they may use a different mix of GUI programs, support different hardware by default, etc. They release new versions occasionally.
- Ubuntu. It’s the most user friendly, widely supported, and easy to install.
- Red Hat Enterprise Linux, or RHEL, is a different distro which is used sometimes in heavy duty production servers.
- Fedora is the desktop equivalent of RHEL - usually, developers aiming to run their applications on RHEL servers will use Fedora for their development computers, to avoid compatibility issues.
- Alpine is a super minimal distro which is used for many Docker images.
- When people think of Linux, they usually associate it with a scary terminal (plus attached Anonymous hacker with a hoodie).
- Don't Panic – it’s not so scary! Today, it’s really easy to install Linux on a computer, with a regular GUI wizard, if you pick a distro that cares about that sort of thing (for example, Ubuntu).
- We'll focus on terminals / shells in this lecture, since that is always available, and generally where "real work" is done. Production servers will rarely have GUIs. Don't let that discourage you - after you get used to it, using the shell can become much more convenient than GUIs!
The Linux command line offers a stable of powerful tools that can boost productivity and help you understand the current state of your machine (e.g. disk space, running processes, RAM, CPU usage).
Working on a remote Linux instance is often a great way to become familiar with the command line, because you are forced to use it and cannot fall back on a graphical file manager such as macOS Finder to navigate the file system.
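As a rough illustration of checking the state of a machine from the command line, the sketch below collects disk, CPU, and memory information. It assumes a Linux system with /proc mounted; ps may be missing in minimal containers, so that step is guarded. The variable names are illustrative only.

```shell
#!/usr/bin/env bash
# Disk space on the root filesystem, in human-readable units.
disk_report=$(df -h /)
echo "$disk_report"

# Number of CPU cores available to this process.
cpu_count=$(nproc)
echo "CPU cores: $cpu_count"

# Total RAM, read from the kernel's /proc interface (Linux-specific).
mem_total=$(grep '^MemTotal' /proc/meminfo)
echo "$mem_total"

# Last five entries of the process list, if ps is installed.
if command -v ps >/dev/null 2>&1; then
    ps -e | tail -n 5
fi
```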
sudo apt update
sudo apt install hollywood byobu
hollywood
hollywood -s 4
hollywood -q
For more information, see hollywood, Genact, and Blessed-contrib.
- A standalone PC or laptop with a Linux OS installed
- A virtual machine (VM) installed on your PC or laptop
- A dedicated server in a data center
- A VM in a data center
- A VM from an IaaS provider such as AWS, GCP, or Azure
- A container, such as Docker
You can download it here, based on your OS.
In this course we will work with Ubuntu Desktop and Ubuntu Server.
sudo apt update
sudo apt upgrade
sudo apt install <package>
sudo apt remove <package>
sudo apt install openssh-server
From your OS terminal (on Windows, Git Bash must be installed):
ssh <username>@<ip address>
sudo nano /etc/default/ufw
IPV6=yes
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow OpenSSH
sudo ufw allow 22
sudo ufw show added
sudo ufw enable
sudo passwd root
Allow the root user to log in over SSH
sudo nano /etc/ssh/sshd_config
Find PermitRootLogin in the file (Ctrl+W searches in nano) and change it to:
PermitRootLogin yes
Press Ctrl+X to exit, then Y to save.
sudo systemctl restart sshd
sudo service sshd restart
sudo reboot
cd ~/.ssh
ls -la
ssh-keygen
cd ~/.ssh
ls -la
cat id_rsa.pub
Copy the content and add it to whatever service you are using (in this case, GitHub).
If you don't want to enter your password every time, you can copy your local machine's public SSH key to your Linux machine:
ssh-copy-id <user name>@<ip or host name>
First, you need Visual Studio Code installed.
Then add the Remote - SSH extension to VS Code.
Once connected, VS Code is open on your Linux machine and you can do whatever you want.
cd < target folder >
code .
sudo apt install git
~/.gitconfig
[user]
email = <your email in github>
name = <your name>
[core]
excludesFile = ~/.gitignore
~/.gitignore
node_modules
cd
mkdir gitHub
cd gitHub
git clone git@github.com:jafarijason/ops_for_data_scientists.git
cd ops_for_data_scientists
export USE_HOSTNAME=<your host name>
# sudo does not apply to the > redirection, so write the file with tee instead
echo "$USE_HOSTNAME" | sudo tee /etc/hostname
sudo hostname -F /etc/hostname
sudo apt-get update
sudo apt-get upgrade -y
Zsh and oh-my-zsh
sudo apt install zsh
sh -c "$(curl -fsSL https://raw.github.com/ohmyzsh/ohmyzsh/master/tools/install.sh)"
~/.zshrc
export ZSH="$HOME/.oh-my-zsh"
ZSH_THEME="fino-time"
plugins=(
git
docker
docker-compose
rsync
aws
cp
dash
pep8
pip
pipenv
postgres
python
sudo
tmux
ubuntu
ufw
)
source $ZSH/oh-my-zsh.sh
Add zsh as the last line of ~/.bashrc so that it starts whenever Bash does.
sudo visudo
YOUR_USERNAME_HERE ALL=(ALL) NOPASSWD: ALL
sudo apt -y install xrdp tigervnc-standalone-server
sudo systemctl enable xrdp
sudo systemctl start xrdp
sudo ufw allow 3389
sudo ufw reload
man echo
echo "Hello World! "
VAR="Ops for data scientists"
echo $VAR
VAR="Test for re define variable"
echo $VAR
A=2
B=3
C=$A+$B
echo $C
C=`expr $A + $B`
echo $C
C=$(expr $A + $B)
echo $C
C=$(($A + $B))
echo $C
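The assignments above behave differently, which the following sketch makes explicit. The variable names literal, sum, and sum_expr are illustrative only.

```shell
#!/usr/bin/env bash
A=2
B=3

# Plain assignment only concatenates strings; nothing is calculated.
literal=$A+$B
echo "$literal"    # prints 2+3

# Arithmetic expansion evaluates the expression inside the shell.
sum=$((A + B))
echo "$sum"        # prints 5

# expr is an external program; it needs spaces around the operator.
sum_expr=$(expr "$A" + "$B")
echo "$sum_expr"   # prints 5
```

The $(( )) form is usually the most convenient choice in scripts, since it avoids starting a separate process for each calculation.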
Example of a Bash script file
cat ./bash/sum.sh
bash ./bash/sum.sh 10 11
sh ./bash/sum.sh 10 11
chmod +x ./bash/sum.sh
./bash/sum.sh 10 11
Second example of a Bash script file
cat ./bash/sum2.sh
bash ./bash/sum2.sh
sh ./bash/sum2.sh
chmod +x ./bash/sum2.sh
./bash/sum2.sh
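The contents of ./bash/sum.sh are not shown here, so the following is only a hypothetical stand-in for what such a script might contain. The function name sum_args is an assumption made so the logic can be tried directly in a terminal; in a real script file, $1 and $2 would be the command-line arguments.

```shell
#!/bin/sh
# Hypothetical stand-in for ./bash/sum.sh; the real file may differ.
# Adds two integers. Inside a script file, "$1" and "$2" would be the
# first and second command-line arguments.
sum_args() {
    echo $(( $1 + $2 ))
}

# Equivalent to running: ./bash/sum.sh 10 11
sum_args 10 11    # prints 21
```

Running a script as bash file.sh or sh file.sh only needs read permission, because the interpreter is named explicitly. Running it as ./file.sh needs execute permission, which is what chmod +x grants.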
man pwd
pwd
cd .
pwd
cd ..
pwd
cd ...
pwd
cd ~
pwd
cd -
man whoami
whoami
whatis ls
whatis cat
whatis bash
man ls
sudo adduser test1
sudo usermod -aG sudo test1
su - test1
exit
sudo deluser test1 --remove-home
touch <file name>
cat <file name>
cat ./data/geolocation.csv | more
more ./data/geolocation.csv
cat ./data/geolocation.csv | less
less ./data/geolocation.csv
cat ./data/geolocation.csv | head
head ./data/geolocation.csv
cat ./data/geolocation.csv | tail
tail ./data/geolocation.csv
history
clear
mkdir books paintings
mkdir -p books/hemmingway/favorites
cp ./imgs/distro-family-tree.png /tmp
mkdir -p /tmp/test
cp ./imgs/*.png /tmp/test
cp -r imgs /tmp/test/imgs-copy
cp -R imgs /tmp/test/imgs-copy
mv /tmp/distro-family-tree.png /tmp/distro-family-tree2.png
rm file1
rm -f file1
rm -i file
rm -r directory
rm -rf /tmp/test
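Because rm -rf is unforgiving, it can help to practice the file commands above inside a throwaway directory first. The sketch below does that with mktemp; all file and directory names in it are hypothetical.

```shell
#!/usr/bin/env bash
# Work inside a throwaway directory so nothing important can be deleted.
sandbox=$(mktemp -d)
cd "$sandbox" || exit 1

mkdir -p books/favorites                  # -p creates missing parent directories
touch books/notes.txt                     # create an empty file
cp books/notes.txt books/favorites/       # copy a single file
cp -r books books-copy                    # -r copies a directory and its contents
mv books-copy/notes.txt books-copy/renamed.txt   # mv also renames

ls -R "$sandbox"

rm -r books-copy/favorites                # -r is required to remove a directory
```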
ls
ls -la
ls -hlS
ls -lha
uname
uname --help
uname -a
lsb_release
lsb_release --help
watch ls -la
watch -n 3 ls -la
ps
ps -a
top
htop
# sudo snap install btop
btop
mc
kill -9 PID
df
df -h
du
du -d 2 -h
du -d 2 -h /
du -d 2 -h /root
# scp stands for secure copy; it sends files to and from a remote instance.
# Send to remote:
scp -i ~/.ssh/path/to_pem_file /path/to/local/file ubuntu@IPv4:./desired/file/location
# Recursively copy a directory from remote:
scp -i ~/.ssh/path/to_pem_file -r ubuntu@IPv4:./path/to/desired/folder/ ~/my_projects/
# Download a single file from remote:
scp -i ~/.ssh/path/to_pem_file ubuntu@IPv4:./path/to/desired/file ~/my_projects/
nc -zv 10.0.250.2 22-500
nc -zv 127.0.0.1 20-100
crontab
crontab -e
cat /etc/crontab
#* * * * * /bin/bash /root/gitHub/ops_for_data_scientists/bash/crontab.sh
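The contents of bash/crontab.sh are not shown here. As a hypothetical stand-in, a script like the following appends one timestamped line per run, which makes it easy to see how often cron fired. The log location is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for bash/crontab.sh; the real script may differ.
# Appends one timestamped line per run, so the log shows each cron invocation.
log_file="${TMPDIR:-/tmp}/crontab_demo.log"   # assumed log location
echo "cron ran at $(date '+%Y-%m-%d %H:%M:%S')" >> "$log_file"

# Crontab field order: minute hour day-of-month month day-of-week command
#   * * * * *     runs every minute
#   30 2 * * 1    runs at 02:30 every Monday
tail -n 1 "$log_file"
```

Cron runs jobs with a minimal environment, so using absolute paths for both the interpreter and the script, as in the commented entry above, tends to avoid surprises.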
sudo apt install at
at 1pm + 2 days
atq
atrm <id>
cat /etc/passwd | less
cat /etc/shadow | less
sudo apt install net-tools
ifconfig
ip add show
ip a s
ss
ss -l4
sudo ss -tulpn
sudo lsof -i -P -n
sudo lsof -i -P -n | grep LISTEN
less /etc/services
cat /etc/services
grep '22/tcp' /etc/services
# https://www.cyberciti.biz/faq/how-to-check-open-ports-in-linux-using-the-cli/
sudo netstat
sudo netstat -tulpn
sudo netstat -tulpn | grep LISTEN
wall
find $(pwd) -name geolocation.csv
find $(pwd) -type d -name data
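To experiment with find without searching the whole disk, the sketch below builds a small directory tree and searches it. The directory and file names are hypothetical.

```shell
#!/usr/bin/env bash
# Build a small directory tree to search (hypothetical names).
root=$(mktemp -d)
mkdir -p "$root/project/data" "$root/project/src"
touch "$root/project/data/geolocation.csv" "$root/project/src/main.py"

# -name matches the file name; -type f restricts results to regular files.
found_file=$(find "$root" -type f -name 'geolocation.csv')
echo "$found_file"

# -type d restricts results to directories.
found_dir=$(find "$root" -type d -name 'data')
echo "$found_dir"

# Quote wildcard patterns so the shell passes them to find unexpanded.
csv_count=$(find "$root" -type f -name '*.csv' | wc -l)
echo "CSV files: $csv_count"
```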
tree
tree /
tree ~
tree $(pwd)
tmux
sudo apt install lynx
lynx
cd /tmp
wget https://github.com/jafarijason/ops_for_data_scientists/raw/master/imgs/vscode-ssh3.png
curl ifconfig.me
cd /tmp
fallocate -l 1G test.img
pv test.img > test3.img
# In the past, DNS was configured in /etc/resolv.conf
cat /etc/resolv.conf
sudo nano /etc/systemd/resolved.conf
DNS=<dns server, e.g. 4.2.2.4>
sudo systemctl restart systemd-resolved.service
sudo systemctl status systemd-resolved.service
systemd-resolve --status
grep is a command-line tool that searches for patterns within a file. It prints each line that matches the pattern to standard output (the terminal screen). This is especially useful when we want to model or perform EDA on a subset of our data that matches a given pattern:
grep -n 'California' ./data/geolocation.csv > ./test/new_example_data1.csv
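A minimal sketch of the same idea on a tiny, made-up CSV is shown below. The sample rows and file names are hypothetical; the real ./data/geolocation.csv may have different columns.

```shell
#!/usr/bin/env bash
# Hypothetical sample rows; the real ./data/geolocation.csv may differ.
workdir=$(mktemp -d)
cat > "$workdir/sample.csv" <<'EOF'
id,state,city
1,California,Roseville
2,Texas,Austin
3,California,Fresno
EOF

# Keep the header line, then append only rows that mention California.
head -n 1 "$workdir/sample.csv" > "$workdir/subset.csv"
grep 'California' "$workdir/sample.csv" >> "$workdir/subset.csv"

cat "$workdir/subset.csv"
subset_rows=$(grep -c 'California' "$workdir/subset.csv")
echo "Matching rows: $subset_rows"
```

Note that -n prefixes every match with its line number, which adds an extra field at the start of each row. That is handy for inspection, but leaving it out keeps the subset a valid CSV for later loading.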
cat ./pythons/test1.py
python3 ./pythons/test1.py
man grep
cat /etc/passwd
cat /etc/passwd | grep 'root'
grep 'root' /etc/passwd
cat ./data/geolocation.csv | grep 'Cal' | more
grep 'Cal' < ./data/geolocation.csv
cat ./data/geolocation.csv | grep 'Cal' | grep 'Roseville'
cat ./data/geolocation.csv | grep 'Cal' | grep 'Roseville' | grep '29'
ls -la | grep 'data'
tree /
tree / | grep 'hollywood.png'
cat /var/log/syslog
cat /var/log/syslog | grep 'root'
cat ./data/mm
cat ./data/mm | grep -v -e '^$'
tail -f /var/log/auth.log | grep 'su'
tail -f /var/log/auth.log | grep 'su' &
grep -e j ./data/grepTest
grep -f ./data/grepTest /etc/passwd
grep -i -f ./data/grepTest /etc/passwd
grep -i -v -f ./data/grepTest /etc/passwd
cat /etc/ssh/ssh_config
cat /etc/ssh/ssh_config | grep -v '#'
cat /etc/ssh/ssh_config | grep -v '^#'
cat /etc/ssh/ssh_config | grep -v '^#' | grep -v '^$'
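The three variants above do not filter the same lines. The sketch below shows the difference on a small, hypothetical config file that merely stands in for /etc/ssh/ssh_config.

```shell
#!/usr/bin/env bash
# Hypothetical config file standing in for /etc/ssh/ssh_config.
conf=$(mktemp)
cat > "$conf" <<'EOF'
# Global settings
Host *

    SendEnv LANG  # trailing note
# PasswordAuthentication yes
EOF

# grep -v '#' drops every line containing "#", including the SendEnv line.
loose=$(grep -v '#' "$conf")

# '^#' only drops lines that START with "#"; '^$' then drops empty lines.
strict=$(grep -v '^#' "$conf" | grep -v '^$')
echo "$strict"
```

Quoting the patterns keeps the shell from interpreting characters such as ^, #, and $ before grep sees them.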
grep 'root' /etc/passwd
grep `whoami` /etc/passwd
grep -c -w `whoami` /etc/*.*
grep -s -c -w `whoami` /etc/*
grep -l -s -w `whoami` /etc/*
grep -L -s -w `whoami` /etc/*
grep -H -w root /etc/passwd
grep -H -w root /etc/passwd | cut -f 1 -d :
grep -H -w root /etc/passwd | cut -f 2 -d :
grep -H -w root /etc/passwd | cut -f 8 -d :
grep -T -H -w root /etc/passwd
grep -T -n -H -w root /etc/passwd
grep -T -A 3 -B 3 -n -H -w root /etc/passwd
wc ./data/geolocation.csv
uniq -u ./data/geolocation.csv
cut -d"," -f2 ./data/geolocation.csv | uniq -u
head -n 5 ./data/geolocation.csv
head -n -5 ./data/geolocation.csv
tail -n 15 ./data/geolocation.csv
tail -n -15 ./data/geolocation.csv
column -s"," -t ./data/geolocation.csv
column -s"," -t ./data/geolocation.csv | head
column -s"," -t ./data/geolocation.csv | tail
head -n 5 ./data/geolocation.csv | column -s"," -t
cut -d"," -f2,5 ./data/geolocation.csv | head
tail -n +1 ./data/geolocation.csv | sort -t"," -k1,1g -k2,2gr -k2,2
cut -d"," -f2,5 ./data/geolocation.csv > ./test/new_example_data1.csv
grep -n 'Cal' ./data/geolocation.csv
grep -n 'Cal' ./data/geolocation.csv > ./test/new_example_data1.csv
shuf -n 4 ./data/geolocation.csv
tail -n +1 ./test/new_example_data1.csv | shuf -n 4
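One detail that is easy to miss with the commands above: uniq only compares adjacent lines, so it is normally combined with sort. The sketch below demonstrates this on made-up rows; the sample data is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sample rows; the real ./data/geolocation.csv may differ.
csv=$(mktemp)
cat > "$csv" <<'EOF'
1,California,Roseville
2,Texas,Austin
3,California,Fresno
4,Texas,Dallas
5,Nevada,Reno
EOF

# uniq only collapses adjacent duplicates, so sort the column first.
# -c prefixes each distinct value with its number of occurrences.
cut -d',' -f2 "$csv" | sort | uniq -c

# Without sort, no two equal values are adjacent here, so nothing collapses.
unsorted_count=$(cut -d',' -f2 "$csv" | uniq | wc -l)
sorted_count=$(cut -d',' -f2 "$csv" | sort | uniq | wc -l)
echo "distinct without sort: $unsorted_count, with sort: $sorted_count"
```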
sudo apt install athena-jot
jot 10
jot -r 5 1 100
jot 10 555
# Printing Column or Field
awk '{print $3 "\t" $4}' ./data/marks.txt
# Printing All Lines
awk '{print $0}' ./data/marks.txt
# Printing Lines That Match a Pattern
awk '/a/' ./data/marks.txt
# Printing Columns by Pattern
awk '/a/ {print $3 "\t" $4}' ./data/marks.txt
# Printing Column in Any Order
awk '/a/ {print $4 "\t" $3}' ./data/marks.txt
# Counting and Printing Matched Pattern
awk '/a/{++cnt} END {print "Count = ", cnt}' ./data/marks.txt
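To try the awk patterns without the course repository, the sketch below uses a small made-up file. The sample rows are hypothetical and only stand in for ./data/marks.txt.

```shell
#!/usr/bin/env bash
# Hypothetical sample in the spirit of ./data/marks.txt; the real file may differ.
marks=$(mktemp)
cat > "$marks" <<'EOF'
1) Amit Physics 80
2) Rahul Maths 90
3) Shyam Biology 87
4) Kedar English 85
5) Hari History 89
EOF

# Pattern and action live inside ONE pair of single quotes.
# For lines containing a lowercase "a", print the 3rd and 4th fields.
awk '/a/ {print $3 "\t" $4}' "$marks"

# Count matching lines; the END block runs once after all input is read.
match_count=$(awk '/a/ {++cnt} END {print cnt}' "$marks")
echo "Count = $match_count"    # prints Count = 4
```

The first row does not match because awk patterns are case-sensitive and "Amit Physics" contains no lowercase "a".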
sed 's/unix/linux/' ./data/sed_test.txt
echo "Welcome To The Geek Stuff" | sed 's/\(\b[A-Z]\)/\(\1\)/g'
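A common surprise with the sed substitute command is that it replaces only the first match on each line unless the g flag is given. The sketch below illustrates this with a made-up input line.

```shell
#!/usr/bin/env bash
line="unix is great. unix is open source."

# Without a flag, s/// replaces only the first match on each line.
first_only=$(echo "$line" | sed 's/unix/linux/')
echo "$first_only"    # prints: linux is great. unix is open source.

# The g flag replaces every match on the line.
all_matches=$(echo "$line" | sed 's/unix/linux/g')
echo "$all_matches"   # prints: linux is great. linux is open source.
```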
sudo apt-get install \
ca-certificates \
curl \
gnupg \
lsb-release
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu \
$(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io
sudo curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
sudo ln -s /usr/local/bin/docker-compose /usr/bin/docker-compose
docker-compose --version