Python Programming and Data Analysis
- Table of contents
- 0. Common Syntax
- 1. Basics
- 2. Functions
- 3. Lambda Expressions
- 4. Module
- 5. Class
- 6. Regular Expression
- 7. Numpy
- 8. Pandas
- 9. Matplotlib
- 4 Basic Data Types: String, Integer, Float and Boolean
- Logical Operators: `not`, `and`, `or`
Operator | Name | Description |
---|---|---|
a / b | True division | Quotient of a and b |
a // b | Floor division | Quotient of a and b, removing fractional parts |
a % b | Modulus | Integer remainder after division of a by b |
a ** b | Exponentiation | a raised to the power of b |
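A quick sketch of the four operators in the table. The operands 7 and 2 are chosen only for illustration:

```python
a, b = 7, 2

true_div = a / b     # 3.5: true division always returns a float
floor_div = a // b   # 3: the fractional part is removed
remainder = a % b    # 1
power = a ** b       # 49

# Floor division rounds toward negative infinity, not toward zero
neg_floor = -7 // 2  # -4
neg_mod = -7 % 2     # 1, so that (-7 // 2) * 2 + (-7 % 2) == -7

print(true_div, floor_div, remainder, power, neg_floor, neg_mod)
```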
- Membership Operators: `in` and `not in`
- Identity Operators: `is` and `is not` check whether two names refer to the same object (for example, whether the type of a variable is a given class)
x = 5
type(x) is int #True
- Iterable: an object that can be looped over. Types of iterables:
  - list/tuple/str/dict
  - zip/enumerate/range/reversed
- Iterator: an iterable can be passed to the built-in function `iter()`, which returns an object called an iterator; each call to `next()` returns the next item
it = iter([4, 3, 2, 1])
print(next(it))#4
print(next(it))#3
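When an iterator runs out of items, `next()` raises `StopIteration` unless a default value is supplied. A minimal sketch of that behavior, using a two-element list chosen only for illustration:

```python
it = iter([4, 3])

first = next(it)           # 4
second = next(it)          # 3
fallback = next(it, None)  # None: the default is returned instead of raising

try:
    next(it)               # no default given, so StopIteration is raised
    exhausted = False
except StopIteration:
    exhausted = True

# A for loop (or comprehension) calls iter() and next() behind the scenes
collected = [x for x in iter([4, 3, 2, 1])]
print(first, second, fallback, exhausted, collected)
```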
- `zip()`: combines two or more iterables element by element into tuples
- `enumerate()`: returns both the index and the item for each element of an iterable
>>> integers = [1, 2, 3]
>>> letters = ['a', 'b', 'c']
>>> floats = [4.0, 5.0, 6.0]
>>> zipped = zip(integers, letters, floats) # Three input iterables
>>> list(zipped)
[(1, 'a', 4.0), (2, 'b', 5.0), (3, 'c', 6.0)]
l1 = ['h', 'e', 'l', 'l', 'o']
for idx, item in enumerate(l1):
    print(idx, item)
0 h
1 e
2 l
3 l
4 o
- 2.1.1. Positional Arguments
- 2.1.2. Keyword Arguments
- 2.1.3. Default Arguments
- 2.1.4. Variable-Length Arguments
  - `*args` (Non-Keyword Arguments): extra positional arguments (including zero extra arguments) can be added after the formal parameters; they are collected into a tuple
  - `**kwargs` (Keyword Arguments): a dictionary that maps each extra keyword to the value passed alongside it
Example of `*args`
def info(name, *args):
    hobby = []
    for a in args:
        hobby.append(a)
    print(name + "'s hobbies: " + ', '.join(hobby))

info('Mike') #Mike's hobbies:
info('Mike', 'hiking', 'reading') #Mike's hobbies: hiking, reading
Example of `**kwargs`
def info(name, **kwargs):
    hobby = []
    for k, v in kwargs.items():
        hobby.append(k + '-' + v)
    print(name + "'s hobbies: " + ', '.join(hobby))

info('Mike', first='hiking', second='reading') #Mike's hobbies: first-hiking, second-reading
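The first three argument kinds (positional, keyword, default) can be sketched with one small function. The function `describe` and its values are hypothetical, made up for this illustration:

```python
def describe(name, age, city='Singapore'):
    # name and age are required; city falls back to a default value
    return f'{name}, {age}, {city}'

positional = describe('Mike', 30)                 # matched by position
keyword = describe(age=30, name='Mike')           # matched by name, order is free
overridden = describe('Mike', 30, city='London')  # default value replaced

print(positional, keyword, overridden, sep='\n')
```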
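- There are 2 types of variable scope: Local and Global. The `global` keyword declares that a name used inside a function refers to the global variable
y = 'global'

def test():
    global y #Declares that y is in the global scope (strictly required only when assigning to y)
    print(y)

test() #will print 'global'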
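The difference between the two scopes shows up when a function assigns to a name. A minimal sketch, assuming it runs at module level (the names `counter`, `shadow` and `increment` are hypothetical):

```python
counter = 0  # global variable

def shadow():
    counter = 10        # creates a new local variable; the global is untouched
    return counter

def increment():
    global counter      # required because we assign to the global name
    counter += 1
    return counter

local_value = shadow()
after_shadow = counter         # still 0
after_increment = increment()  # 1
print(local_value, after_shadow, after_increment)
```

Without the `global` statement, `counter += 1` inside `increment` would raise `UnboundLocalError`, because assignment makes the name local to the function.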
- Syntax: `lambda argument_list: expression`
  - argument_list (same as the argument list in functions): `x, y, *args, **kwargs`
  - expression (output) must be a single expression
Example of lambda
lambda x, y: x*y #input: x, y; output: x*y
lambda *args: sum(args) #input: any number of parameters; output: their summation
lambda x: 1 #input: x; output: 1
- Syntax: `sorted(iterable, key=None, reverse=False)` returns a new list with the elements of the given iterable sorted by key
sorted([1, 2, 3, 4, 5], key = lambda x: abs(3 - x)) #[3, 2, 4, 1, 5]
- Filter syntax: `filter(function, iterable)` keeps the items of the given iterable for which the given function returns True
- Map syntax: `map(function, iterable)` applies the given function to each item of the given iterable
- Note: both `filter` and `map` return a lazy iterator, so use the `list()` function to convert the result to a list
Example of `filter` and `map`
list(filter(lambda n: n % 2 == 1, [1, 2, 3, 4, 5])) #[1, 3, 5]
list(map(lambda x: x + 1, [1, 2, 3])) #[2, 3, 4]
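Because `filter` and `map` return lazy iterators, their results can be consumed only once. The sketch below shows that, together with the equivalent list comprehensions; the variable names are illustrative:

```python
numbers = [1, 2, 3, 4, 5]

odds = filter(lambda n: n % 2 == 1, numbers)  # lazy: nothing is computed yet
first_pass = list(odds)    # [1, 3, 5]
second_pass = list(odds)   # []: the iterator is already exhausted

# Equivalent list comprehensions
odds_lc = [n for n in numbers if n % 2 == 1]
plus_one_lc = [n + 1 for n in numbers]

# filter and map can be chained
odd_squares = list(map(lambda n: n * n, filter(lambda n: n % 2 == 1, numbers)))
print(first_pass, second_pass, odds_lc, plus_one_lc, odd_squares)
```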
import random
random.seed(42) #make results reproducible
#The values shown after >>> are sample outputs; actual values depend on the seed and the order of calls
random.random() #return a random float in the range [0.0, 1.0)
>>> 0.35553263284394376
random.uniform(0, 10) #return a random float N from a uniform distribution such that a <= N <= b, where a=lower_end, b=higher_end
>>> 3.58
random.randint(0, 10) #return a random integer N such that 0 <= N <= 10 (both endpoints included)
>>> 7
mu, sigma = 0, 1 #assumed example values: mean and standard deviation
random.gauss(mu, sigma) #return a random float drawn from a Gaussian distribution
items = ['one', 'two', 'three', 'four', 'five']
random.choice(items) #choose a single element from a sequence
>>> 'four'
random.choices(items, k=2) #choose multiple elements from a sequence with replacement (duplicates are possible)
>>> ['three', 'three']
random.choices(items, k=3)
>>> ['three', 'five', 'four']
random.shuffle(items) #shuffle a sequence in place (returns None)
items
>>> ['four', 'three', 'two', 'one', 'five']
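- Aliasing: several variables (a, b) refer to the same object, e.g. after `b = a` both names point to the same list `[1,2]`
- `==`: compares the values of the objects
- `is`: compares the identities of the objects; two separately created lists with equal values are still different objects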
a = [1,2]
b = [1,2]
print(id(a)) #2661200625736
print(id(b)) #2661202091528
a == b #True
a is b #False
- The CPython interpreter typically caches small integers in the range of -5 to 256 (an implementation detail).
- When the interpreter is launched, these integer objects are created and kept in memory for later reuse, so `is` can return True for two equal small integers.
import copy
a = [[0, 1], 2, 3]
b = copy.copy(a)
c = copy.deepcopy(a)
- Shallow Copy will only create a new object for the parent layer.
- It will NOT create a new object for any of the child layer.
- Deep Copy will create new objects for the parent & child layers.
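The difference between the two copies becomes visible once the nested (child) list is mutated. A minimal sketch using the same `a`, `b` and `c` as above:

```python
import copy

a = [[0, 1], 2, 3]
b = copy.copy(a)       # shallow: new outer list, shared inner list
c = copy.deepcopy(a)   # deep: new outer list and new inner list

a[0].append(99)        # mutate the child list
a.append(4)            # mutate the parent list

print(a)  # [[0, 1, 99], 2, 3, 4]
print(b)  # [[0, 1, 99], 2, 3]  (child change is visible, parent change is not)
print(c)  # [[0, 1], 2, 3]      (fully independent)
print(a[0] is b[0], a[0] is c[0])  # True False
```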
- Immutable (when values are changed, a new object will be created): integers, strings, and tuples
- Mutable (values can be changed after creation): lists, dictionaries, and sets
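A short sketch of the distinction; the variable names are illustrative. A list keeps its identity when changed in place, while "changing" a string produces a new object and a tuple rejects item assignment:

```python
nums = [1, 2]
id_before = id(nums)
nums.append(3)                      # mutated in place
same_list = id(nums) == id_before   # True: still the same object

text = 'abc'
upper = text.upper()                # returns a new string; text is unchanged

point = (1, 2)
try:
    point[0] = 10                   # tuples do not support item assignment
    tuple_changed = True
except TypeError:
    tuple_changed = False

print(same_list, text, upper, tuple_changed)
```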
- Class is a "blueprint" for creating objects
  - For example: cars may not be exactly the same, but their structures are the same.
- Class attribute: `Student.num_of_stu` is an attribute of the whole class; update it through the class name, because assigning to `self.num_of_stu` would create a separate instance attribute
- Init method: `__init__`, using `self` as the first argument
- Method: takes at least one argument, `self`, and can include other arguments such as `birth_year`
class Student:
    #Class attribute
    num_of_stu = 0

    #Special init method
    def __init__(self, first, last): #use self as the first argument
        self.first = first
        self.last = last
        self.email = first + '.' + last + '@smu.edu.sg'
        Student.num_of_stu += 1 #attribute for the whole class, updated through the class name

    def full_name(self, birth_year): #Method: takes self plus one more argument, birth_year
        return self.first + ' ' + self.last + ' was born in ' + birth_year

print(Student.num_of_stu) #0
stu_1 = Student('Ryan','Tan')
stu_1.full_name('1995') # "Ryan Tan was born in 1995"
print(Student.num_of_stu) #1
- For example, create a Representative class (`Rep`) based on the `Student` class
  - `super()`: gives access to the parent class, so the child can inherit the parent's attributes and then initialize more information than the parent class
  - Override: redefine a method of the parent class in the child class
class Rep(Student):
    def __init__(self, first, last, cat):
        #parent class Student handles existing arguments
        super().__init__(first, last)
        #new information
        self.cat = cat

    def full_name(self, birth_year): #override the full_name method of parent
        #the parent method requires birth_year, so pass it along
        return self.cat + ' representative: ' + super().full_name(birth_year)
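A self-contained sketch of the same idea that can be run as-is. It uses a trimmed-down `Student`, assumes the overriding `full_name` accepts `birth_year` and forwards it to the parent, and uses a made-up category `'Sports'`:

```python
class Student:
    num_of_stu = 0  # class attribute shared by all instances

    def __init__(self, first, last):
        self.first = first
        self.last = last
        Student.num_of_stu += 1

    def full_name(self, birth_year):
        return f'{self.first} {self.last} was born in {birth_year}'


class Rep(Student):
    def __init__(self, first, last, cat):
        super().__init__(first, last)  # parent sets first and last
        self.cat = cat                 # extra information for the child

    def full_name(self, birth_year):   # override, then reuse the parent version
        return f'{self.cat} representative: {super().full_name(birth_year)}'


rep_1 = Rep('Ryan', 'Tan', 'Sports')
print(rep_1.full_name('1995'))
print(isinstance(rep_1, Student), Student.num_of_stu)
```

Creating a `Rep` also increments `Student.num_of_stu`, because the parent `__init__` runs through `super()`.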
- Magic methods in Python are the special methods that start and end with double underscores `__`
- Built-in classes in Python define many magic methods. Use the `dir()` function to list the magic methods of a class.
>>> dir(int)
['__abs__', '__add__', '__and__', '__bool__', '__ceil__', '__class__', '__delattr__', ...]
- Magic methods are most frequently used to define the behavior of predefined operators in Python
- For example: the `__str__()` method is executed when we `print` an object. We can override the `__str__()` method, for instance:
class Human:
    def __init__(self, id, name, addresses=None, maps=None):
        self.id = id
        self.name = name
        #None defaults avoid sharing one mutable list/dict across all instances
        self.addresses = addresses if addresses is not None else []
        self.maps = maps if maps is not None else {}

    def __str__(self):
        return f'Id {self.id}: {self.name}'

human = Human(1, 'Quan Nguyen', ['Address1', 'Address1'], {'London':2, 'UK':3})
print(human) #Id 1: Quan Nguyen
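To illustrate how magic methods define the behavior of operators, here is a minimal sketch with a hypothetical `Vector` class (not part of the original notes) that supports `+` and `==`:

```python
class Vector:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __add__(self, other):   # called for v1 + v2
        return Vector(self.x + other.x, self.y + other.y)

    def __eq__(self, other):    # called for v1 == v2
        return (self.x, self.y) == (other.x, other.y)

    def __repr__(self):         # used when the object is displayed
        return f'Vector({self.x}, {self.y})'


total = Vector(1, 2) + Vector(3, 4)
print(total)                  # Vector(4, 6)
print(total == Vector(4, 6))  # True
```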
- Character class: `[]` specifies a set of characters to match
- Metacharacters:
  - `\w` `[a-zA-Z0-9_]`, `\W` `[^a-zA-Z0-9_]`, `\d`, `\D`, `\s` (white-space), `\S` (non white-space), `.` matches anything except \n
  - `\` removes the special meaning of a metacharacter. For example: `\.` (or `[.]`) matches a literal "." dot in the text rather than any character
- Anchors: `^`, `$`, `\b` match a position in the text rather than a character
  - `^` beginning of text, `$` end of text; use `re.M` to match the beginning ^ / end $ of each line in multiline text
  - `\b` word boundary: the position between a word character `[a-zA-Z0-9_]` and a non-word character
- Quantifiers: repeat the preceding literal/metacharacter/group/backreference
  - `*` zero or more, `?` zero or one, `+` one or more, `{m}` m repetitions, `{m,n}` any number of repetitions from m to n, inclusive
- Group: to capture a certain part of the entire match, or to match a repeat with a backreference
- Backreference: numbered groups `\1`, `\2`, `\3`; numbering goes from outer to inner, from left to right
- Look ahead & Look behind
- Regex: a tiny programming language used for data manipulation
- re module: a Python module containing the regex engine and providing the regular expression functionality
- `re.compile()` compiles a pattern so that the regex engine can perform the search.
import re
pat = re.compile(r'abc')
print(pat) #re.compile('abc')
print(type(pat)) #<class 're.Pattern'>
- `match()`: matches the pattern only at the beginning of the text.
mat_abc1 = pat.match('ABC,ABc,AbC,abc')
mat_abc2 = pat.match('abc,ABc,AbC,abc')
print(mat_abc1) #None because the pattern 'abc' does not appear at the beginning
print(mat_abc2) #<re.Match object; span=(0, 3), match='abc'>
- `search()`: matches the pattern at any position in the text and returns the match as a `re.Match` object.
  - BUT it only returns the first match
sear_abc1 = pat.search('ABC,ABc,AbC,abc')
sear_abc2 = pat.search('abc,ABc,AbC,abc')
print(sear_abc1) #<re.Match object; span=(12, 15), match='abc'>
print(sear_abc2) #<re.Match object; span=(0, 3), match='abc'>
print(type(sear_abc1))#<class 're.Match'>
- `findall()`: finds all the matched strings and returns them in a list.
find_abc1 = pat.findall('ABC,ABc,AbC,abc')
find_abc2 = pat.findall('abc,ABc,AbC,abc')
print(find_abc1) #['abc']
print(find_abc2) #['abc', 'abc']
- `finditer()`: returns an iterator that lazily yields the matches one at a time.
finditer_abc = pat.finditer('abc,ABc,AbC,abc')
print(finditer_abc) #<callable_iterator object at 0x7ff650853040>
for m in finditer_abc:
    #<re.Match object; span=(0, 3), match='abc'>
    #<re.Match object; span=(12, 15), match='abc'>
    print(m)
The metacharacters can be categorized into several types as below:
- `. ^ $ * + ? { } [ ] \ | ( )`
- "[" and "]"
- Type 1 `. [] - ^ \d \D \w \W \s \S`: Metacharacters that match a single character:
  - `.` Dot: match any single character except the newline \n character
p = re.compile(r'.at')
m = p.findall('cat bat\n sat cap') #['cat', 'bat', 'sat']
  - `[]` character class: specify a set of characters to match
    - Most metacharacters lose their special meaning inside a character class.
p = re.compile(r'[abcABC]')
m = p.findall('abcABC') #['a', 'b', 'c', 'A', 'B', 'C']
  - `-` hyphen: specify a range of characters to match
    - If you want to match a literal hyphen, put it at the beginning or the end inside [], for example `[-a-e]` or `[a-e-]`
p = re.compile(r'[a-z0-9]')
m = p.findall('d0A3z6P') #['d', '0', '3', 'z', '6']
p = re.compile(r'[-a-e]') #or [a-e-] if you want to match a hyphen -
m = p.findall('e-a-s-y, easy') #['e', '-', 'a', '-', '-', 'e', 'a']
  - `^` caret: match any character NOT in the character class
    - If a caret ^ is not at the beginning of a character class, it works as a normal character
    - A caret outside a character class has a different meaning.
p = re.compile(r'[^0-9a-z]') #Pattern excludes 0-9 and lowercase a to z
m = p.findall('1 2 3 Go') #[' ', ' ', ' ', 'G'] → only the spaces and G match
p = re.compile(r'[0-9^a-z]') #^ is not at the beginning of the character class, so it works as a normal character
m = p.findall('1 2 3 ^Go') #['1', '2', '3', '^', 'o']
  - `\d` vs `\D` digits: \d (numeric digits), \D (non-digits, including \n)
p = re.compile(r'\d')
m = p.findall('a1\nA#') #['1']
p = re.compile(r'\D')
m = p.findall('a1\nA#') #['a', '\n', 'A', '#']
  - `\w` vs `\W` word characters: \w (`[a-zA-Z0-9_]`), \W (`[^a-zA-Z0-9_]`)
p = re.compile(r'\w')
m = p.findall('_#a!E$4-') #['_', 'a', 'E', '4']
p = re.compile(r'\W')
m = p.findall('_#a!E$4-') #['#', '!', '$', '-']
  - `\s` vs `\S` white space: \s (white-space), \S (non white-space) match based on whether a character is a whitespace
text = 'Name\tISSS610\tISSS666\nJoe Jones\tA\tA\n'
p = re.compile(r'\s')
m = p.findall(text) #['\t', '\t', '\n', ' ', '\t', '\t', '\n']
- Type 2: Escaping metacharacters: `\` removes the special meaning of a metacharacter
p1 = re.compile(r'.')
p2 = re.compile(r'\.')
m1 = p1.findall('smu.edu.sg') #['s', 'm', 'u', '.', 'e', 'd', 'u', '.', 's', 'g']
m2 = p2.findall('smu.edu.sg') #['.', '.']
p = re.compile(r'\d\\d') #The first \d matches any digit, then \\d matches the literal text "\d"
m = p.findall(r'135\d') #['5\\d'] i.e: 5\d (raw string keeps the backslash in the text)
- Type 3: Anchors: `^` beginning of text, `$` end of text, `\b` word boundary
  - `^` beginning of text: We have seen a caret used in a character class. Here the caret is used outside a character class.
    - It matches the starting position in the text.
    - In the case of multiline text, we can add the flag `re.MULTILINE` or `re.M` in `re.compile` to match at the start of each line
p = re.compile(r'^a[ab]c')
m = p.findall('''aac\nabc''') #['aac']
p = re.compile(r'^a[ab]c', re.M) #Add flag re.M to match at the start of each line
m = p.findall('''aac\nabc''') #['aac', 'abc']
  - `$` end of text:
    - It matches the ending position in the text
    - Similar to the caret, the dollar sign matches the end of the whole text, not the end of each line in multiline text; this behavior can be changed with `re.MULTILINE` or `re.M`
p = re.compile(r'ab.$')
m = p.findall('abc abd abe abf') #['abf']
p = re.compile(r'[ab]c$', re.M) #Add flag re.M to match at the end of each line
m = p.findall('ac\nbc') #['ac', 'bc']
  - `\b` word boundary: match based on whether a position is a word boundary
p = re.compile(r'\b\d\d\b')
m = p.findall('1 2 3 11 12 13 111 112 113') #['11', '12', '13']
p = re.compile(r'\b\w\w\b')
m = p.findall('aa,ab;ac(AA)AB AC') #['aa', 'ab', 'ac', 'AA', 'AB', 'AC']
- Type 4: Quantifiers:
  - `*`: zero or more
  - `?`: zero or one
  - `+`: one or more
  - `{m}`: m repetitions
  - `{m,n}`: any number of repetitions from m to n, inclusive.
p = re.compile(r'a[ab]*c')
m = p.findall('a ab ac abc aac aabc aaac ababc') #['ac', 'abc', 'aac', 'aabc', 'aaac', 'ababc']
p = re.compile(r'a[ab]+c')
m = p.findall('a ab ac abc aac aabc aaac ababc') #['abc', 'aac', 'aabc', 'aaac', 'ababc']
p = re.compile(r'a[ab]?c')
m = p.findall('a ab ac abc aac aabc aaac ababc') #['ac', 'abc', 'aac', 'abc', 'aac', 'abc']
p = re.compile(r'\d{3}')
m = p.findall('1 2 3 11 12 13 111 112 113') #['111', '112', '113']
p = re.compile(r'\d{2,3}')
m = p.findall('1 2 3 11 12 13 111 112 113') #['11', '12', '13', '111', '112', '113']
- We can group a pattern into sub-patterns using `()`
p = re.compile(r'(\w+): (\d+)') #The two sub-patterns form 2 groups
m = p.findall('Course: Grade\nMath: 89\nPhysics: 92\n English: 78') #[('Math', '89'), ('Physics', '92'), ('English', '78')]
chapters = 'Chapter 12: Numpy\n\
Chapter 13: Pandas\n\
Chapter 14: Data Visualization'
p = re.compile(r'^Chapter (\d+: .+)', re.M)
m = p.findall(chapters) #['12: Numpy', '13: Pandas', '14: Data Visualization']
- Alternation `|`: match either the sub-pattern before it or the one after it
p = re.compile(r'(\w+)\.(bat|zip|exe)')
m = p.findall('game.exe auto.bat text.zip') #[('game', 'exe'), ('auto', 'bat'), ('text', 'zip')]
- `.groups()`: returns all matched groups
- `.group()`: allows users to choose different groups by giving the indices of the groups.
  - group(0) returns the whole match.
  - group(1) returns the 1st captured group.
  - group(2, 3, 4) returns the 2nd, 3rd and 4th groups.
#Ex 1: re.Match.groups() vs re.Match.group()
p = re.compile(r'(\w+\.\w+)\s(\w+\.\w+)')
m = p.search('game.exe auto.bat text.zip')
print(m.groups()) #('game.exe', 'auto.bat')
print(m.group(1)) #game.exe
#Ex 2: re.Match.group()
pattern = r'(\w+)\W+(\w+)\W+(\w+)\W+(\w)+'
p = re.compile(pattern)
m = p.search('one,,,two:three++++++4')
print(m.group(0)) #one,,,two:three++++++4 (i.e: the whole match)
print(m.group(1)) #one (i.e: match only group 1)
print(m.group(2, 3, 4)) #('two', 'three', '4')
- Backreference: `'(\w+)-\1'` is different from `'(\w+)-\w+'`
  - `'(\w+)-\1'`: when the first group is matched, `\1` matches the same literal string as group 1
  - For example: both patterns match ‘one-one’, but the one with the backreference, `'(\w+)-\1'`, won’t match ‘one-two’.
#The pattern matches a number that starts with a few digits, followed by one digit,
#and then repeats the first few digits.
p = re.compile(r'((\d+)\d\2)')
m = p.finditer('1234123, 11311, 123, 54345')
for string in m:
    print(string.group(1, 2))
#('1234123', '123') (i.e: 123 - 4 - 123, where the repeated part is group 2)
#('11311', '11')
#('434', '4')
Three common flags that are very useful are:
- `re.MULTILINE` or `re.M`: make “^”/“$” match the starting/ending position of each line.
- `re.IGNORECASE` or `re.I`: match letters in a case-insensitive way.
- `re.DOTALL` or `re.S`: make “.” match any character, including newlines \n.
p1 = re.compile(r'abc')
m1 = p1.findall('abc ABC aBC Abc') #['abc']
p2 = re.compile(r'abc', re.I)
m2 = p2.findall('abc ABC aBC Abc') #['abc', 'ABC', 'aBC', 'Abc'] because re.I means Ignore Case
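The other two flags can be sketched the same way. The sample text below is made up for illustration; flags are combined with `|`:

```python
import re

text = 'start\nmiddle\nend'

dot_default = re.findall(r'start.*', text)             # "." stops at \n
dot_all = re.findall(r'start.*', text, re.S)           # "." also matches \n
line_starts = re.findall(r'^\w+', text, re.M)          # ^ matches at each line start
combined = re.findall(r'^MIDDLE$', text, re.M | re.I)  # two flags at once

print(dot_default)  # ['start']
print(dot_all)      # ['start\nmiddle\nend']
print(line_starts)  # ['start', 'middle', 'end']
print(combined)     # ['middle']
```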
- Using the module-level functions skips the step of compiling the pattern.
match = re.match(r'abc', 'abc')
search = re.search(r'abc', 'a abc')
findall = re.findall(r'abc', 'abc abc ab bc a b c')
finditer = re.finditer(r'abc', 'abc abc ab bc a b c')
print(f'match: {match}') #<re.Match object; span=(0, 3), match='abc'>
print(f'search: {search}') #<re.Match object; span=(2, 5), match='abc'>
print(f'findall: {findall}') #findall: ['abc', 'abc']
print(f'finditer: {finditer}') #<callable_iterator object at 0x7fb2e9796a30>
- By default, the `split()` method returns a list of the strings between the matches, excluding the matched strings.
- It is also possible to make split() return the matched strings, simply by using a group to capture the whole pattern.
p = re.compile(r'\W+')
split = p.split('The~split*method-is%powerful') #['The', 'split', 'method', 'is', 'powerful'], by default
p = re.compile(r'(\W+)') #The group captures the whole pattern, so the separators are returned too
split = p.split('The~split*method-is%powerful') #['The', '~', 'split', '*', 'method', '-', 'is', '%', 'powerful']
- `sub()` returns a new string after replacement.
- `subn()` returns a tuple containing the new string and the number of replacements.
p = re.compile(r'Toko')
sub = p.sub('Tokyo', 'Toko is a large city.') #Tokyo is a large city.
subn = p.subn('Tokyo', 'Toko is Toko') #('Tokyo is Tokyo', 2)
- Find expression A where expression B follows: `A(?=B)`
p = re.compile(r"\s(\w+(-\w+){1,3}(?=[\s.]))") #(?=[\s.]) matches A only if it is followed by B=[\s.], i.e. either a whitespace or a dot
m = p.findall('''
The man is good-looking and rich.
The eleven-year-old twenty-five-storey building was developed by a famous developer in town.
This art piece is one-of-a-kind.
There is a five-and-one-half-foot-long sign at the outskirt of the town.''')
#[('good-looking', '-looking'), ('eleven-year-old', '-old'), ('twenty-five-storey', '-storey'), ('one-of-a-kind', '-kind')]
- Find expression A where expression B does not follow: `A(?!B)`
- Find expression A where expression B precedes: `(?<=B)A`
- Find expression A where expression B does not precede: `(?<!B)A`
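A minimal sketch of all four lookaround forms. The sample strings (file names and prices) are made up for illustration:

```python
import re

files = 'report.pdf notes.txt slides.pdf'
prices = 'USD 100, EUR 200, USD 300'

# A(?=B): a name that is followed by ".pdf"
pdf_names = re.findall(r'\w+(?=\.pdf)', files)
# A(?!B): a file whose extension does not start with "pdf"
not_pdf = re.findall(r'\b\w+\.(?!pdf)\w+', files)
# (?<=B)A: digits that are preceded by "USD "
usd = re.findall(r'(?<=USD )\d+', prices)
# (?<!B)A: whole numbers that are not preceded by "USD "
non_usd = re.findall(r'(?<!USD )\b\d+', prices)

print(pdf_names)  # ['report', 'slides']
print(not_pdf)    # ['notes.txt']
print(usd)        # ['100', '300']
print(non_usd)    # ['200']
```

Lookarounds only check the surrounding text and do not consume it, so the text matched by B is never part of the result. The `\b` in the last two negative patterns prevents the regex engine from backtracking into a partial word or number.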