Use the assignment_1 folder in Piazza discussions. Check to see if your question has already been answered before starting a new topic.
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_excel('a.xlsx')
df.to_excel('output.xlsx', index=False, engine='openpyxl')
df['a'] = df['Letters'].where(df['Letters']=='a','notA')
grouped = df.groupby('a')['Numbers'].apply(list).to_dict()
fig = plt.figure(figsize=(22,14))
plt.boxplot([grouped['a'], grouped['notA']])
plt.xticks([1,2],["a", "Not A"])
plt.title("Boxplot of a and not_a")
plt.ylabel("Number of days stayed in hospital")
plt.xlabel("If they use something")
plt.show()
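A minimal sketch of what the where() and groupby() lines above produce, assuming a toy table in place of a.xlsx (the values are made up; only the column names Letters and Numbers come from the notes):

```python
import pandas as pd

# Hypothetical toy data standing in for a.xlsx.
df = pd.DataFrame({
    "Letters": ["a", "b", "a", "c", "b"],
    "Numbers": [1, 2, 3, 4, 5],
})

# where() keeps the values where the condition is True and replaces the rest.
df["a"] = df["Letters"].where(df["Letters"] == "a", "notA")

# Collect the Numbers of each group into a plain list, keyed by group label.
grouped = df.groupby("a")["Numbers"].apply(list).to_dict()
print(grouped)
```

Each dictionary value is a list, which is the shape plt.boxplot() expects when given one list per box.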
from scipy.stats import ttest_1samp
t,p = stats.ttest_1samp(SeaIce["Sum_Arctic"], 10)
print("One-sample t-test t-value:", t)
print("One-sample t-test p-value:", p)
t1,p1 = stats.ttest_ind(SeaIce["Sum_Antarctic"],SeaIce["Sum_Arctic"])
print("Two-sample t-test t-value:", t1)
print("Two-sample t-test p-value:", p1)
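To see what ttest_1samp computes, here is a rough sketch that rebuilds the one-sample t statistic and its two-sided p-value by hand and compares them with SciPy. The sample values and the names sample, mu0, t_manual and p_manual are made up for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical sample; the values are invented for illustration.
sample = np.array([9.8, 10.4, 10.1, 9.6, 10.9, 10.3, 9.9, 10.6])
mu0 = 10  # hypothesised population mean

# Manual t statistic: (sample mean - mu0) / (sample std / sqrt(n)), using ddof=1.
n = sample.size
t_manual = (sample.mean() - mu0) / (sample.std(ddof=1) / np.sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom.
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# SciPy's result for the same test (two-sided by default).
t_scipy, p_scipy = stats.ttest_1samp(sample, mu0)

print("manual:", t_manual, p_manual)
print("scipy: ", t_scipy, p_scipy)
```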
#correlation
from scipy.stats import pearsonr
corr, p_value = pearsonr(data[""],data[""])
print("The correlation is", corr)
print("The p_value is", p_value)
print("R squared:", corr * corr)
p_value: This is the p-value associated with that correlation. It is the probability of observing a correlation at least as extreme as the one in the data when there is no actual correlation present. Typically:
If the p-value is less than a predetermined significance level (e.g., 0.05), we reject the null hypothesis of "no correlation between the two variables" and treat the relationship as statistically significant. If the p-value is larger, we do not reject the null hypothesis: we don't have sufficient evidence of a statistically significant relationship between the variables.
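The decision rule above can be sketched on synthetic data, assuming a significance level of 0.05. The arrays x, y and noise are generated here purely for illustration and are not part of the original data:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.8 * x + rng.normal(scale=0.5, size=200)  # built to be related to x
noise = rng.normal(size=200)                   # generated independently of x

alpha = 0.05  # assumed significance level
for name, other in [("y", y), ("noise", noise)]:
    corr, p_value = pearsonr(x, other)
    decision = "reject H0" if p_value < alpha else "fail to reject H0"
    print(name, "r =", corr, "p =", p_value, "r^2 =", corr ** 2, "->", decision)
```

Note that an independent variable can still fall below 0.05 by chance about 5% of the time, which is exactly what the significance level means.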
#T-test
'''
If the populations are normally distributed or nearly so, and you want to compare the mean of one population with the mean of another population, then a t-test can be used (cf. the nonparametric Wilcoxon test).
Null Hypothesis: The means of both populations are equal.
Alternate Hypothesis: The means of both populations are not equal.
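A large absolute t-score means the difference between the group means is large relative to the variability in the data, which is evidence that the groups differ.
A t-score close to zero means the group means are similar relative to that variability.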
'''
Independent t-test: performs a hypothesis test comparing the means of two independent groups of scores, e.g. claiming that the average marks in two similar courses are the same
t, p = stats.ttest_ind(p24_300mg, p24_600mg)
Paired t-test: performs a hypothesis test comparing the means of two related groups of scores, e.g. claiming that a particular student's average marks in two different courses are the same
t, p = stats.ttest_rel(p24_300mg, p24_600mg)
stats.ttest_1samp(data[''],76)
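A small sketch of why the choice between ttest_ind and ttest_rel matters, assuming hypothetical paired measurements (the names baseline and followup and all values are invented). When the same subjects are measured twice, the paired test works on the per-subject differences and is usually far more sensitive:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical paired data: the same 30 subjects measured twice.
baseline = rng.normal(loc=50, scale=10, size=30)
followup = baseline + rng.normal(loc=2, scale=1, size=30)

# Treating the groups as independent ignores the pairing.
t_ind, p_ind = stats.ttest_ind(baseline, followup)

# The paired test uses the per-subject differences.
t_rel, p_rel = stats.ttest_rel(baseline, followup)

# A paired t-test is a one-sample t-test on the differences against 0.
t_diff, p_diff = stats.ttest_1samp(baseline - followup, 0)

print("independent:", t_ind, p_ind)
print("paired:     ", t_rel, p_rel)
print("1-sample on differences:", t_diff, p_diff)
```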
p value
The p-value is below 0.05, so reject the null hypothesis: the means of both ... are not equal
The p-value is larger than 0.05, so it falls outside the rejection region;
thus we cannot reject the null hypothesis, and there is not enough evidence that the ... differs between ... and ...
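The two template sentences above can be wrapped in a tiny helper. This is only a sketch: the function name interpret_p_value and the default alpha of 0.05 are assumptions, not part of the original notes:

```python
def interpret_p_value(p, alpha=0.05):
    """Return the decision for a p-value at significance level alpha."""
    if p < alpha:
        return "reject the null hypothesis"
    # Failing to reject is not proof that the null hypothesis is true.
    return "fail to reject the null hypothesis"


print(interpret_p_value(0.01))
print(interpret_p_value(0.30))
```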
Line Plot: trends and relationships of continuous variables, such as time series data or variables changing with a parameter.
Scatter Plot: relationship between two continuous variables, helping to observe correlations or distributions between variables.
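Bar Plot: plt.bar() is suited to comparing discrete data across different categories or groups. plt.hist() is suited to showing the distribution of a continuous variable.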
Histogram: display the distribution of numerical data, helping to understand the central tendency and dispersion of the data.
Box Plot: display the distribution of numerical data, including median, quartiles, and outliers, allowing the observation of outliers and distribution shapes.
Heatmap: Used to show the relationship between two categorical variables, often using colors to represent the degree of association or frequency.
Violin Plot: Combines the features of a box plot and a kernel density plot, used to display the distribution and density of numerical data.
Categorical Plot: Includes bar plots, count plots, box plots, etc., used to display data distribution and relationships between different categories.
'''
#hue for different lines
sns.lineplot(x = '', y='', hue = '' , data=q3q4_df)
sns.lineplot(x=x, y=y)
plt.title('')
plt.show()
sns.scatterplot(x=x, y=y)
sns.barplot(x=x, y=y)
sns.histplot(data)
sns.boxplot(data=data)
sns.heatmap(data, cmap='YlGnBu', annot=True, fmt='.2f')
sns.violinplot(x=x, y=y,hue='')
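As a rough illustration of the bar plot versus histogram distinction in the list above, the sketch below draws both side by side with plain matplotlib. The categories, counts and random values are invented, and the figure is rendered to an in-memory PNG so the snippet runs without a display; in an interactive session plt.show() would be used instead:

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(2)
categories = ["A", "B", "C"]                    # hypothetical discrete groups
counts = [12, 7, 15]                            # one value per category
values = rng.normal(loc=0, scale=1, size=500)   # hypothetical continuous variable

fig, (ax_bar, ax_hist) = plt.subplots(1, 2, figsize=(10, 4))

# Bar plot: one bar per category.
ax_bar.bar(categories, counts)
ax_bar.set_title("Bar plot: discrete categories")

# Histogram: the continuous values are binned first, then counted.
n, bin_edges, _ = ax_hist.hist(values, bins=20)
ax_hist.set_title("Histogram: continuous distribution")

fig.tight_layout()
buffer = io.BytesIO()
fig.savefig(buffer, format="png")  # swap for plt.show() when working interactively
```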
#start of week2
start='2023-02-27'
#end of week2
end='2023-03-03'
#create a new column in dataframe to record the date of each row
SeaIce["Date"] = pd.to_datetime(SeaIce[["Year","Month","Day"]])
is_date = (SeaIce['Date'] >= start) & (SeaIce["Date"] <= end)
x = SeaIce["Date"]
y = SeaIce["Extent(Antarctic)"]
z = SeaIce["Extent(Arctic)"]
plt.plot(x,y,label="Antarctic")
plt.plot(x,z,label="Arctic")
plt.xlabel("Date")
plt.ylabel("Sea Ice extents(10^6 sq km)")
plt.title("Daily trend of the Antarctic and Arctic sea ice extents")
plt.legend()
plt.show()
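The boolean mask is_date is built above but the plot uses the full columns. If the intent is to restrict the data to the chosen week, one possible sketch is to index with the mask first. The small SeaIce table below is a made-up stand-in, and the name week2 is an assumption:

```python
import pandas as pd

# Hypothetical stand-in for the SeaIce table; the values are invented.
SeaIce = pd.DataFrame({
    "Year": [2023] * 6,
    "Month": [2, 2, 3, 3, 3, 3],
    "Day": [26, 27, 1, 3, 4, 5],
    "Extent(Arctic)": [14.1, 14.2, 14.3, 14.4, 14.5, 14.6],
})
SeaIce["Date"] = pd.to_datetime(SeaIce[["Year", "Month", "Day"]])

start = "2023-02-27"
end = "2023-03-03"
is_date = (SeaIce["Date"] >= start) & (SeaIce["Date"] <= end)

# Keep only the rows whose date falls inside [start, end].
week2 = SeaIce.loc[is_date]
print(week2[["Date", "Extent(Arctic)"]])
```

Plotting week2["Date"] against the extent columns would then show only that week.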