Regex is a sequence of characters that form a search pattern. It can be used to search, match, or replace strings in text. For example, finding a specific word or validating formats like emails or phone numbers.
These are the characters you want to match exactly in the text.
a
matches the lettera
b
matches the letterb
Example:
Regex: cat
Text: "The cat is cute."
It will match the word "cat" in the text.
Special characters that have specific meanings in regex.
.
(dot) matches any single character except newlines.- Example:
a.b
matchesacb
,a1b
,a_b
, etc.
- Example:
^
matches the beginning of the string.- Example:
^abc
matches "abc" only at the beginning of a string.
- Example:
$
matches the end of the string.- Example:
abc$
matches "abc" only at the end of the string.
- Example:
Character classes define a set of characters to match.
-
[ ]
: Matches any one character inside the brackets.- Example:
[aeiou]
matches any vowel. - Example:
[0-9]
matches any digit.
- Example:
-
[^ ]
: Matches any character except the ones inside the brackets.- Example:
[^a-z]
matches any character that’s not a lowercase letter.
- Example:
Quantifiers define how many times a pattern should match.
*
: Matches 0 or more occurrences of the previous element.- Example:
a*
matches""
(empty),a
,aa
,aaa
, etc.
- Example:
+
: Matches 1 or more occurrences.- Example:
a+
matchesa
,aa
,aaa
, etc. (but not an empty string).
- Example:
?
: Matches 0 or 1 occurrence.- Example:
a?
matches""
(empty) ora
.
- Example:
{n}
: Matches exactly n occurrences of the preceding element.- Example:
a{3}
matchesaaa
.
- Example:
{n,}
: Matches n or more occurrences.- Example:
a{2,}
matchesaa
,aaa
,aaaa
, etc.
- Example:
{n,m}
: Matches between n and m occurrences.- Example:
a{2,4}
matchesaa
,aaa
, oraaaa
.
- Example:
-
\d
: Matches any digit (equivalent to[0-9]
).- Example:
\d
matches1
,2
,3
, etc.
- Example:
-
\w
: Matches any alphanumeric character (letters and digits) plus the underscore (_
).- Example:
\w
matchesa
,A
,3
,_
, etc.
- Example:
-
\s
: Matches any whitespace character (space, tab, newline).- Example:
\s
matches" "
,"\t"
, etc.
- Example:
-
\b
: Matches a word boundary (the position between a word and a non-word character).- Example:
\bcat\b
matches the word "cat" but not "scatter".
- Example:
-
( )
: Groups parts of the regex together to apply quantifiers or operations to them.- Example:
(abc)+
matchesabc
,abcabc
, etc.
- Example:
-
|
: Represents alternation (OR).- Example:
cat|dog
matches eithercat
ordog
.
- Example:
-
^
: Anchors the match to the start of the string.- Example:
^cat
matches "cat" at the beginning of the string.
- Example:
-
$
: Anchors the match to the end of the string.- Example:
cat$
matches "cat" at the end of the string.
- Example:
Certain characters have special meanings in regex. To use them as literal characters, you need to escape them with a backslash \
.
\.
: Matches a literal dot (.
).\*
: Matches a literal asterisk (*
).\+
: Matches a literal plus sign (+
).\?
: Matches a literal question mark (?
).
Example:
Regex: a\+b
Text: "a+b"
This matches "a+b" and not "ab".
-
Matching a simple email:
- Regex:
^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$
- Explanation: Matches an email starting with one or more valid characters, followed by
@
, the domain, and then a dot and at least two letters for the TLD.
- Regex:
-
Matching a phone number:
- Regex:
^\(\d{3}\) \d{3}-\d{4}$
- Explanation: Matches a phone number in the format
(123) 456-7890
.
- Regex:
-
Matching a date (YYYY-MM-DD):
- Regex:
^\d{4}-\d{2}-\d{2}$
- Explanation: Matches a date like "2024-12-05".
- Regex:
- Testing Regex: Tools like regex101 let you test and understand your regex.
- Greedy vs Lazy Matching: By default, regex quantifiers are greedy, meaning they match as much text as possible. To make them lazy (match the smallest amount possible), use
?
after the quantifier. For example,.*?
is lazy.
- Regexr (Interactive learning tool): https://regexr.com/
- Regular-Expressions.info (Detailed guide): https://www.regular-expressions.info/
- Regex101 (Testing and explanation): https://regex101.com/
The string r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
is a regular expression (regex) pattern used to match email addresses. Let's break it down:
-
r''
:- This is a raw string literal in Python. The
r
before the string tells Python not to escape special characters (like backslashes). This is helpful when writing regex patterns because backslashes are used frequently and could cause errors or confusion.
- This is a raw string literal in Python. The
-
[a-zA-Z0-9._%+-]+
:[a-zA-Z0-9._%+-]
: This defines a character class, which matches one of the characters inside the square brackets. Specifically, it matches any lowercase letter (a-z
), uppercase letter (A-Z
), digit (0-9
), or any of the following special characters:.
(dot)_
(underscore)%
+
-
- The
+
after the character class means one or more occurrences of these characters. This part matches the local part of an email (the part before the@
).
-
@
:- This simply matches the at symbol (
@
) that separates the local part from the domain part of the email address.
- This simply matches the at symbol (
-
[a-zA-Z0-9.-]+
:- Similar to the previous part, this matches the domain part of the email address. It allows lowercase and uppercase letters, digits, dots (
.
), and hyphens (-
). - The
+
means one or more occurrences, ensuring there is at least one character for the domain name.
- Similar to the previous part, this matches the domain part of the email address. It allows lowercase and uppercase letters, digits, dots (
-
\.
:- The backslash (
\
) escapes the dot (.
), which normally matches any character in regex. Here, it ensures that the dot is treated literally (i.e., it matches an actual period/dot, like ingmail.com
).
- The backslash (
-
[a-zA-Z]{2,}
:[a-zA-Z]
: Matches any letter, either lowercase or uppercase.{2,}
: This specifies that the preceding character class must appear at least twice. This part ensures that the email domain has a valid top-level domain (TLD) like.com
,.org
,.net
, etc.
This regex pattern matches email addresses in the following format:
- One or more alphanumeric characters or special characters (
._%+-
) before the@
. - The
@
symbol separates the local part and domain. - The domain part consists of alphanumeric characters, dots, and hyphens.
- A dot (
.
) is followed by two or more alphabetic characters representing the top-level domain (like.com
,.org
, etc.).
This pattern does not match every possible email format defined by the full email specification (RFC 5322), but it works for common email formats used in practice.
A data sanitization strategy refers to the processes and techniques used to cleanse and protect data, ensuring that it is free from errors, inconsistencies, or unauthorized access. The goal is to ensure that data is accurate, valid, and secure while also meeting privacy, compliance, and security standards. Data sanitization can be performed across various stages of the data lifecycle, from collection to deletion.
- Identify sensitive data: Categorize data into different types (e.g., sensitive, personal, public) to understand the level of protection required. For example, personally identifiable information (PII), financial data, and medical records require strict sanitization.
- Tagging and labeling: Mark sensitive data for protection and ensure it is managed according to regulatory requirements (e.g., GDPR, HIPAA).
- Input Validation: Ensure that data entered into systems is sanitized to prevent invalid, malicious, or erroneous data from being collected.
- Use techniques like input validation to prevent SQL injection, cross-site scripting (XSS), and other forms of malicious data injection.
- Enforce data type checks, range checks, and format checks to ensure data integrity.
- Data Formatting: Enforce standard formats (e.g., dates, phone numbers, addresses) during collection to avoid inconsistencies and errors.
- Remove duplicates: Identify and remove duplicate records to avoid redundancy and potential data corruption.
- Standardize data: Normalize data to ensure consistency across systems (e.g., converting all text to uppercase/lowercase, standardizing units of measure).
- Correct errors: Use automated tools or manual procedures to identify and correct data inaccuracies.
- Data Masking: For sensitive information, apply techniques like data masking or tokenization to replace sensitive data with obfuscated or random values.
- This is especially important in development or testing environments where real data is not necessary.
- For example, replace actual social security numbers or credit card details with masked values like "XXX-XX-1234."
- Anonymize or Pseudonymize Data: Remove personally identifiable information (PII) from datasets or replace it with pseudonyms to preserve privacy. This is especially critical in analytics or data-sharing scenarios where personal identification should be avoided.
- Example: Anonymizing names, addresses, or phone numbers while retaining general trends.
- Encrypt sensitive data both at rest and in transit to protect against unauthorized access. Use industry-standard encryption algorithms (e.g., AES-256) and key management techniques.
- Ensure that encryption keys are stored securely and rotated periodically.
- Role-based Access Control (RBAC): Define clear policies on who can access, modify, or delete data based on their roles within the organization.
- Implement least privilege principles to ensure that only authorized individuals have access to sensitive or critical data.
- Monitor and log all access to sensitive data for auditing and security purposes.
- Secure Deletion: When data is no longer needed, ensure it is securely deleted to prevent unauthorized recovery.
- Use tools that comply with data destruction standards (e.g., DoD 5220.22-M, NIST SP 800-88).
- Overwrite data multiple times, or use physical destruction methods like shredding drives for highly sensitive data.
- Ensure that deletion practices are documented, auditable, and compliant with regulations.
- Continuous Monitoring: Implement real-time monitoring systems to detect unusual data access, modification, or anomalies.
- Auditing and Logging: Maintain an audit trail of all data sanitization activities, including access, modification, and deletion actions, to comply with regulatory and internal policies.
- Follow Regulations: Adhere to relevant data protection regulations, such as the General Data Protection Regulation (GDPR), Health Insurance Portability and Accountability Act (HIPAA), or California Consumer Privacy Act (CCPA).
- Regularly review compliance standards and ensure that the sanitization methods meet or exceed legal requirements for data protection.
- Ongoing Review: Regularly review and update the data sanitization strategy to adapt to evolving security threats, new compliance requirements, and data growth.
- Train staff and stakeholders on data privacy best practices and the importance of data sanitization.
- Leverage automated tools and AI to detect and clean sensitive or incorrect data at scale, especially in large datasets.
- Automate the regular sanitization tasks, like removing duplicates or standardizing formats, to improve efficiency and consistency.
By following a comprehensive data sanitization strategy, organizations can mitigate risks related to data breaches, maintain the integrity of their data, and ensure compliance with applicable laws and regulations.
- Importing the necessary libraries
- Loading the data
- Checking for missing values
- Checking for duplicates
- Checking for the data types
- Checking for the unique values in the columns
- Checking for the distribution of the target variable
- Checking for the correlation between the features
- Checking for the outliers
- Performing the feature engineering
- Splitting the data into train and test
- Performing the feature scaling
- Performing the feature selection
- Performing the model building
- Performing the model evaluation
- Performing the model tuning
- Performing the model deployment
An imputer is a class that is used to replace missing values in the data. It is used to handle missing data in a dataset by filling the missing values with a specific strategy. The imputer class provides different strategies to fill the missing values, such as mean, median, mode, or a constant value. It helps in preprocessing the data before building a machine learning model by handling missing values effectively.
from sklearn.impute import SimpleImputer
This code imports the SimpleImputer
class from the sklearn.impute
module. The SimpleImputer
class is used to replace missing values in the data with a specified strategy.
imputer = SimpleImputer(strategy='mean')
imputed_data = imputer.fit_transform(data)
In this code snippet:
- An instance of the
SimpleImputer
class is created with the strategy set to'mean'
.
An outlier is a data point that is very different from the rest of the points. It's like a student who has a much higher or lower score than everyone else in a class. Outliers can mess up the learning process of a machine learning model, so we often try to remove them to make the model work better.
This function helps remove outliers from the data using a method based on standard deviation (how spread out the data is).
-
Calculate the mean (average) of the data:
mean = np.mean(data, axis=0)
This calculates the average value of each feature (column) in your data.
-
Calculate the standard deviation of the data:
std = np.std(data, axis=0)
This measures how spread out each feature is. If the standard deviation is large, it means the data is spread out a lot.
-
Check if a data point is an outlier:
outliers = np.abs((data - mean) / std) > threshold
- This line checks for outliers. It calculates how far each data point is from the mean in terms of standard deviations.
- If a data point is more than
threshold
standard deviations away from the mean, it's considered an outlier. - The
np.abs()
part ensures you check the distance regardless of whether the value is above or below the mean.
-
Remove the outliers:
return data[~outliers.any(axis=1)]
- This line removes the outliers.
~outliers.any(axis=1)
means "select the rows (data points) where there is no outlier in any column". In other words, if any column in a row has an outlier, the whole row gets removed.
-
X_cleaned = remove_outlier(X)
:- This applies the
remove_outlier()
function to your dataX
(which represents the features) and removes the outliers fromX
. The cleaned data is stored inX_cleaned
.
- This applies the
-
y_cleaned = y[~np.any(np.abs((X - np.mean(X, axis=0)) / np.std(X, axis=0)) > 2, axis=1)]
:- This is cleaning the target variable
y
(which contains the labels or outcomes). - It removes the corresponding labels from
y
where the feature dataX
has outliers. - This ensures that if you remove a data point from
X
because it's an outlier, you also remove the corresponding target value iny
.
- This is cleaning the target variable
- Outlier removal: We check if a data point is "too different" from the others (using standard deviations). If it is, we remove it.
- Cleaning
X
: We remove outliers from the features (the input data). - Cleaning
y
: We also remove the labels iny
that match those outliers, so that the cleaned data stays aligned.
In short, this code helps to remove the "weird" data points that might mess up the analysis or model training.
y_cleaned = y[~np.any(np.abs((X - np.mean(X, axis=0)) / np.std(X, axis=0)) > 2, axis=1)]
This line is cleaning the target variable y
by removing the labels corresponding to data points that are outliers in the feature set X
. Here's how it works:
np.abs((X - np.mean(X, axis=0)) / np.std(X, axis=0)) > 2
This part of the code checks for outliers in the feature matrix X
(the data used to predict the target y
).
-
np.mean(X, axis=0)
:- This calculates the mean (average) of each column (feature) in the matrix
X
. - If
X
has multiple features (e.g., height, weight, age), it calculates the mean for each of these features separately.
- This calculates the mean (average) of each column (feature) in the matrix
-
np.std(X, axis=0)
:- This calculates the standard deviation of each column in
X
. The standard deviation tells us how spread out the values of each feature are. - A high standard deviation means the data is spread out a lot, and a low standard deviation means the data is concentrated around the mean.
- This calculates the standard deviation of each column in
-
(X - np.mean(X, axis=0)) / np.std(X, axis=0)
:- This standardizes the data. It means subtracting the mean and dividing by the standard deviation for each feature, so that each feature has a mean of 0 and a standard deviation of 1.
- This is essentially converting each feature into "how many standard deviations away" the data points are from the mean.
-
np.abs(... > 2)
:- This checks if the absolute value of the standardized data points is greater than 2.
- If a data point is more than 2 standard deviations away from the mean, it is considered an outlier (this is a common threshold).
np.abs()
ensures we're checking the distance regardless of whether the data point is above or below the mean.
-
np.any(..., axis=1)
:- This checks if any of the features for a given data point (row in
X
) are outliers (i.e., if any value in that row is greater than 2 standard deviations away from the mean). axis=1
means the check is done across columns (features) for each row (data point).- It returns
True
for any row that has at least one feature that is an outlier.
- This checks if any of the features for a given data point (row in
y[~np.any(..., axis=1)]
-
~np.any(..., axis=1)
:- The tilde
~
is a negation operator. So, it flips the result. - It converts
True
(outliers) intoFalse
, andFalse
(non-outliers) intoTrue
. - This means we are keeping the rows (data points) that are NOT outliers in
X
.
- The tilde
-
y[...]
:- This uses the Boolean mask (from the previous step) to select only the labels in
y
that correspond to the rows (data points) inX
that are not outliers. - The
y_cleaned
will now only contain the target values that correspond to non-outliers in the feature dataX
.
- This uses the Boolean mask (from the previous step) to select only the labels in
- Step 1: Find out which rows in
X
have outliers (where any feature is more than 2 standard deviations away from the mean). - Step 2: Remove the rows (data points) in
y
that correspond to these outliers, so thaty_cleaned
only contains the target labels for the "good" data points inX
.
If X
represents features like age, height, and weight, and y
is the target variable (like "score" or "label"), this code ensures that if a data point has any feature value that is an outlier (e.g., a very unusual age or height), the corresponding target value in y
is also removed.
In summary: The line ensures that both the feature data X
and the target data y
are "aligned" by removing rows where any feature is an outlier, keeping only the data points with valid (non-outlier) values.