Welcome to our in-depth case study where we tackle both data quality and storytelling. Dive deep into an e-commerce dataset, derive actionable insights, and overcome the hurdle of messy data!
- 📌 Objective
- 📂 Dataset Overview
- 🛠 Getting Started
- 🤔 Questions to Address
- 📝 Notes
- ⚖️ License
- Data Cleaning
- Data Categorization
- Data Exploration & Analysis
- Explanatory Plots
- Tools & Libraries
This analysis aims to address the following:
- Classifying products into different categories for simplified reporting.
- Assessing product price distribution across categories.
- Evaluating the extent of product discounts.
- Understanding sales patterns during special occasions like Christmas and Black Friday.
- Suggesting improvements to data collection methodologies.
❗ Data inconsistency alert! Before jumping into the analysis, note that the data appears to have inconsistencies. But, as they say in the data science world, data cleaning is 80% of the job. Let's get our hands dirty!
- order_id: Unique identifier for each order.
- created_date: Timestamp for order creation.
- total_paid: Total amount paid by the customer, in euros.
- state: Order status (e.g., Shopping basket, Place Order, Pending, Completed, Cancelled).
- id: Unique identifier for each row.
- id_order: Corresponds to `orders.order_id`.
- sku: Stock keeping unit; unique identifier for each product.
- ... (continue with other columns)
- short: 3-character code for brand identification based on `products.sku`.
- long: Full brand name.
Start by taking a glimpse of the data:
- Use `DataFrame.head()` and `DataFrame.tail()` to view the first and last rows.
- Use functions like `DataFrame.describe()`, `DataFrame.isna().any()`, and others to get a sense of data quality and structure.
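A minimal sketch of this first look, using a tiny made-up `orders` frame in place of the real file. The values are illustrative only, and how the real tables are loaded is not covered here.

```python
import pandas as pd

# Hypothetical stand-in for the orders table; the real data comes from file.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "created_date": ["2017-01-01 10:00:00", "2017-01-02 11:30:00",
                     None, "2017-01-04 09:15:00"],
    "total_paid": [59.99, None, 120.00, 15.50],
    "state": ["Completed", "Pending", "Completed", "Cancelled"],
})

print(orders.head(2))      # first rows
print(orders.tail(2))      # last rows
print(orders.describe())   # summary statistics for numeric columns

# True for every column that contains at least one missing value
has_missing = orders.isna().any()
print(has_missing)
```

Columns flagged `True` here are the first candidates for the cleaning steps described later.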
As you dig deeper, consider the following questions:
- How many orders are there?
- How should revenue be computed?
- What are the top 5 brands?
- How many products are there?
- Define the product categories.
- What period of time do these orders comprise?
- How many orders are Completed?
- Which “key” columns connect the tables to each other?
- How should the discount be computed?
- What is the average % discount given to all products in completed orders?
- Define massive discount and low discount.
- Define high-, medium-, and low-value products based on the average price.
- When has it been beneficial, in terms of revenue, to offer discounts?
- What is the average difference between total_paid and unit_price_total?
- What is the distribution of these differences?
- Can all the differences be explained by shipping costs? If not, what are other plausible explanations?
- If there are differences that you can’t explain: what should you do with these orders?
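One possible way to approach the questions about `total_paid` versus `unit_price_total` is sketched below. It rests on assumptions: the `orderlines` table is taken to have `unit_price` and `product_quantity` columns (the latter name is a guess, since the column list above is elided), and `unit_price_total` is taken to mean unit price times quantity. The rows are made up.

```python
import pandas as pd

# Made-up rows; column names beyond those documented above are assumptions.
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "total_paid": [104.99, 50.00, 30.00],
    "state": ["Completed", "Completed", "Cancelled"],
})
orderlines = pd.DataFrame({
    "id": [10, 11, 12, 13],
    "id_order": [1, 1, 2, 3],
    "product_quantity": [1, 2, 1, 1],   # assumed column name
    "unit_price": [50.00, 25.00, 50.00, 30.00],
})

# Assumed definition: value of one order line = unit price x quantity
orderlines["unit_price_total"] = (
    orderlines["unit_price"] * orderlines["product_quantity"]
)

# Sum the line values per order
lines_per_order = orderlines.groupby("id_order", as_index=False)[
    "unit_price_total"
].sum()

# Keep completed orders and attach the summed line values
completed = orders[orders["state"] == "Completed"]
merged = completed.merge(
    lines_per_order, left_on="order_id", right_on="id_order", how="inner"
)
merged["difference"] = (merged["total_paid"] - merged["unit_price_total"]).round(2)

print(merged[["order_id", "total_paid", "unit_price_total", "difference"]])
print(merged["difference"].describe())   # distribution of the differences
```

Small positive differences may be shipping costs; large or negative ones need a separate explanation before those orders are kept in the analysis.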
- Check for null values using `data.isnull().sum()`.
- Decide on imputation or removal.
- Use scatter plots or box plots to visualize outliers.
- Consider techniques like the IQR method to handle outliers.
- Convert categorical variables to numerical format.
- Normalize or standardize values when needed.
- Source columns: `name`, `desc`.
- Goal: Identify patterns or keywords linking a product to a category.
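A minimal sketch of keyword-based categorization over `name` and `desc`. The products, category names, and keyword lists are all hypothetical; the real mapping has to come from inspecting the data.

```python
import pandas as pd

# Made-up products for illustration
products = pd.DataFrame({
    "sku": ["AAA0001", "BBB0002", "CCC0003", "DDD0004"],
    "name": ["Silicone Case for Phone", "USB-C Charging Cable 1m",
             "Wireless Keyboard", "Gift Card"],
    "desc": ["Protective cover", "Fast charge cable",
             "Bluetooth keyboard with numeric pad", "Voucher"],
})

# Hypothetical category -> keywords mapping; the first matching category wins
category_keywords = {
    "Cases": ["case", "cover"],
    "Cables": ["cable", "adapter"],
    "Keyboards": ["keyboard"],
}

def categorize(text: str) -> str:
    """Return the first category whose keyword occurs in the text."""
    text = text.lower()
    for category, keywords in category_keywords.items():
        if any(keyword in text for keyword in keywords):
            return category
    return "Other"

# Search name and desc together; missing values become empty strings
search_text = products["name"].fillna("") + " " + products["desc"].fillna("")
products["category"] = search_text.apply(categorize)

print(products[["sku", "name", "category"]])
```

Plain substring matching is crude (a keyword can match inside a longer word), so the size of the "Other" bucket and a sample of each category are worth checking by hand.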
```python
import pandas as pd

# Show more rows and wider text columns when inspecting name and desc
pd.set_option("display.max_rows", 1000)
pd.set_option("display.max_colwidth", 100)
```
MIT
- Identify missing data using `data.isnull().sum()`.
- Determine whether to impute or remove the missing values based on the context.
- For categorical data: Consider filling with mode or a placeholder like "Unknown".
- For numerical data: Consider filling with mean, median, or using methods like forward-fill or backward-fill.
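The missing-value options above could look roughly like this. The frame is made up, and the choice of median and of the "Unknown" placeholder is only one reasonable option, not a recommendation for this dataset.

```python
import pandas as pd

# Made-up data with one missing categorical and one missing numeric value
data = pd.DataFrame({
    "state": ["Completed", None, "Completed", "Pending"],
    "total_paid": [10.0, 20.0, None, 60.0],
})

# Count missing values per column
print(data.isnull().sum())

# Option 1: impute
cleaned = data.copy()
cleaned["state"] = cleaned["state"].fillna("Unknown")          # placeholder
cleaned["total_paid"] = cleaned["total_paid"].fillna(
    cleaned["total_paid"].median()                             # numeric median
)

# Option 2: remove every row that has any missing value
dropped = data.dropna()

print(cleaned)
print(dropped)
```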
- Visualize potential outliers using scatter plots or box plots.
- Use methods like the IQR (Interquartile Range) to detect and manage outliers.
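A short sketch of the IQR rule on made-up prices. The 1.5 multiplier is the common convention rather than something this dataset prescribes, and whether flagged rows are dropped, capped, or kept is a judgment call.

```python
import pandas as pd

# Made-up unit prices with one obviously extreme value
prices = pd.Series([10.0, 12.0, 11.0, 13.0, 12.5, 11.5, 500.0], name="unit_price")

q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Values outside [q1 - 1.5*IQR, q3 + 1.5*IQR] are flagged as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
is_outlier = (prices < lower) | (prices > upper)

outliers = prices[is_outlier]
filtered = prices[~is_outlier]

print(f"bounds: [{lower}, {upper}]")
print(outliers)
```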
- Ensure consistency in categorical data. For instance, "Male" and "male" should be consistent.
- Convert categorical variables into a format suitable for analysis, e.g., one-hot encoding or label encoding.
- Normalize or standardize numerical features when necessary, especially if using algorithms sensitive to different scales.
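The consistency, encoding, and scaling steps can be sketched as follows on made-up data. One-hot encoding via `pd.get_dummies` is used here; label encoding would be an alternative.

```python
import pandas as pd

# Made-up orders with inconsistent category spelling
orders = pd.DataFrame({
    "state": ["Completed", "completed ", "Pending", "Cancelled"],
    "total_paid": [10.0, 20.0, 30.0, 40.0],
})

# Consistency: strip whitespace and unify capitalization
orders["state"] = orders["state"].str.strip().str.title()

# One-hot encoding: one indicator column per category
encoded = pd.get_dummies(orders, columns=["state"], prefix="state")

# Min-max normalization to the range [0, 1]
paid = orders["total_paid"]
encoded["total_paid_minmax"] = (paid - paid.min()) / (paid.max() - paid.min())

# Standardization (z-score); Series.std() uses the sample standard deviation
encoded["total_paid_zscore"] = (paid - paid.mean()) / paid.std()

print(encoded)
```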
- Compare `products.price`, `orderlines.unit_price`, and `orders.total_paid` to understand pricing dynamics.
- Analyze the most profitable products or categories.
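As one possible reading of the price comparison, the sketch below treats the discount as the gap between the catalogue price (`products.price`) and the price actually charged (`orderlines.unit_price`). This definition is an assumption, not something the dataset documents, and the rows are made up.

```python
import pandas as pd

# Made-up catalogue and order lines
products = pd.DataFrame({"sku": ["A1", "B2"], "price": [100.0, 40.0]})
orderlines = pd.DataFrame({
    "id": [1, 2, 3],
    "id_order": [10, 10, 11],
    "sku": ["A1", "B2", "A1"],
    "unit_price": [80.0, 40.0, 100.0],
})

# Attach the catalogue price to every order line via the shared sku key
lines = orderlines.merge(products, on="sku", how="left")

# Assumed definition: discount = catalogue price - price charged
lines["discount"] = lines["price"] - lines["unit_price"]
lines["discount_pct"] = lines["discount"] / lines["price"] * 100

avg_discount_pct = lines["discount_pct"].mean()

print(lines[["sku", "price", "unit_price", "discount", "discount_pct"]])
print(f"average discount: {avg_discount_pct:.2f}%")
```

Negative discounts (a unit price above the catalogue price) would point to data problems or price changes over time and deserve a closer look.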
- Analyze sales trends over time.
- Identify any seasonal patterns or anomalies.
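A small sketch of a time-based view, aggregating made-up completed orders by month and by day. The dates and amounts are illustrative; on the real data the same grouping can be used to check periods such as Black Friday or Christmas.

```python
import pandas as pd

# Made-up orders for illustration
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "created_date": ["2017-11-24 09:00:00", "2017-11-24 18:30:00",
                     "2017-12-24 12:00:00", "2018-01-05 08:00:00"],
    "total_paid": [100.0, 50.0, 80.0, 20.0],
    "state": ["Completed", "Completed", "Completed", "Completed"],
})

# Parse timestamps so that date components can be used for grouping
orders["created_date"] = pd.to_datetime(orders["created_date"])
completed = orders[orders["state"] == "Completed"]

# Revenue per calendar month
monthly = completed.groupby(
    completed["created_date"].dt.to_period("M")
)["total_paid"].sum()

# Revenue per day, and the single strongest day
daily = completed.groupby(completed["created_date"].dt.date)["total_paid"].sum()
peak_day = daily.idxmax()

print(monthly)
print(f"peak day: {peak_day}")
```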
- Segment customers based on purchase behaviors.
- Identify high-value customers or those at risk of churn.
- Use bar charts for categorical data comparison.
- Line plots are ideal for time series data.
- Scatter plots can help identify relationships or trends between two numeric variables.
- Use color and size wisely to highlight trends and focus areas.
- Annotate plots to guide the viewer to key insights.
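The plotting tips above could be applied as in this Matplotlib sketch: a bar chart for a categorical comparison, with color and an annotation drawing attention to one bar. The categories and revenue figures are made up.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Made-up revenue per category
categories = ["Cases", "Cables", "Keyboards"]
revenue = [120.0, 80.0, 200.0]
positions = list(range(len(categories)))

# Highlight only the best category; keep the rest neutral
top = revenue.index(max(revenue))
colors = ["tab:blue" if i == top else "lightgray" for i in positions]

fig, ax = plt.subplots(figsize=(6, 4))
bars = ax.bar(positions, revenue, color=colors)
ax.set_xticks(positions)
ax.set_xticklabels(categories)
ax.set_ylabel("Revenue (EUR)")
ax.set_title("Revenue by product category (made-up numbers)")

# Annotate the highlighted bar to guide the viewer
ax.annotate("Highest revenue", xy=(positions[top], revenue[top]),
            xytext=(0, 5), textcoords="offset points", ha="center")
ax.set_ylim(0, max(revenue) * 1.15)  # leave room for the annotation
fig.tight_layout()
```

Calling `fig.savefig(...)` with a file name of your choice, or `plt.show()` with an interactive backend, renders the result.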
- Use visualization libraries like Matplotlib, Seaborn, or Plotly for Python-based visualization.
- Consider creating interactive dashboards using tools like Tableau or PowerBI.
- Alternatively, use Python libraries like Dash by Plotly for web-based dashboards.
While delving into the data analysis and visualization, always ensure that the findings are communicated in a comprehensible and impactful manner. Data insights are most valuable when they're easily understandable. Constantly refer back to this guide as you work through the dataset, ensuring all steps are meticulously followed and each analysis is underpinned by a clear rationale.
Mastering the art of data analysis and visualization is crucial for extracting valuable insights from your product data. By following the guidelines outlined above, you can transform raw data into actionable strategies and visually compelling stories. Remember, it's not just about what the data says, but how effectively you communicate it!