import pandas as pd
import numpy as np
numbers = pd.Series([10, 20, 30, 40, 50])
numbers = pd.Series(
[10, 20, 30, 40, 50],
index = [1, 2, 3, 4, 5])
df = pd.DataFrame(columns=["first_column", "second_column"])
odds = pd.Series([1, 3, 5, 7, 9])
evens = pd.Series([2, 4, 6, 8, 10])
df = pd.DataFrame({"Odd numbers": odds, "Even numbers": evens})
df = pd.DataFrame({
'Letter': ['c', 'e', 'z', 'o'],
'Animal': ['cat', 'elephant', 'zebra', 'orangutan']})
df = pd.DataFrame({
"A" : [1, 2, 3],
"B" : [4, 5, 6],
"C" : [7, 8, 9]},
index = ['u', 's', 't'])
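For illustration, printing the last DataFrame above shows the custom row index labels alongside the data:
print(df)
#    A  B  C
# u  1  4  7
# s  2  5  8
# t  3  6  9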
csv_data = pd.read_csv("input.csv")
csv_data = pd.read_csv("input.csv", usecols=['col1', 'col4', 'col5'])
csv_data = pd.read_csv("input.csv", parse_dates=['Start Date', 'End Date'])
Below, we're reading a CSV file that has no header row and two columns, which we'll name colA and colB.
csv_data = pd.read_csv("input.csv", header=None, names=['colA', 'colB'])
pipe_delimited_data = pd.read_csv("input.csv", sep='|')
excel_data = pd.read_excel("input.xlsx", sheet_name="Sheet3")
fwf_data = pd.read_fwf("input.txt")
json_data = pd.read_json("input.json")
parquet_data = pd.read_parquet("input.parquet.gzip", engine="pyarrow")
You will need the sqlalchemy and psycopg2 Python packages:
pip install sqlalchemy
pip install psycopg2
And then:
from sqlalchemy import create_engine
conn = create_engine('postgresql://user:password@host:5432/postgres')
sql_data = pd.read_sql_query("SELECT * FROM my_table", conn)
WARNING: do not store credentials in code!
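One safer pattern (a sketch; the environment variable names below are placeholders you would choose yourself) is to read the credentials from environment variables:
import os
from sqlalchemy import create_engine

# DB_USER and DB_PASSWORD are hypothetical names, set outside the code
user = os.environ["DB_USER"]
password = os.environ["DB_PASSWORD"]
conn = create_engine(f"postgresql://{user}:{password}@host:5432/postgres")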
data.to_csv('output.csv')
data.to_csv('output.csv', index=False)
data.to_csv('output.csv', sep='|')
You may need openpyxl and xlsxwriter:
pip install openpyxl
pip install xlsxwriter
And then:
data.to_excel('output.xlsx', engine='xlsxwriter')
data.to_excel('output.xlsx', engine='xlsxwriter', sheet_name='Sheet2')
data.to_parquet("output.parquet.gzip", compression="gzip")
Suggested method for adding a new column with a default value:
df["new_column_name"] = "default_value"
df.drop(columns=["column_1", "column_2"])
df.drop(columns=["column_1", "column_2"], inplace=True)
DataFrame.append was removed in pandas 2.0; use pd.concat instead. To append a single row:
pd.concat([df1, pd.DataFrame([{'colA': 'some_value', 'colB': 'another_value'}])], ignore_index=True)
To append the rows of another DataFrame:
pd.concat([df1, df2], ignore_index=True)
pd.concat([df1, df2])
pd.concat([df1, df2], axis=1)
df.drop(index=['a', 'b', 'z'], inplace=True)
The code below removes (in place) all the rows where the Customer Id value is 0:
indices_to_drop = df[ df['Customer Id'] == 0 ].index
df.drop(indices_to_drop, inplace=True)
pd.merge(df1, df2,
left_on="df1_column", right_on="df2_column",
how="left")
pd.merge(df1, df2,
left_on="df1_column", right_on="df2_column",
how="right")
pd.merge(df1, df2,
left_on="df1_column", right_on="df2_column",
how="inner")
pd.merge(df1, df2,
left_on="df1_column", right_on="df2_column",
how="outer")
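To make the join types concrete, here is a small sketch with hypothetical DataFrames; note how how="left" keeps every row of df1, while how="inner" keeps only the keys present in both:
df1 = pd.DataFrame({"df1_column": ["a", "b", "c"], "x": [1, 2, 3]})
df2 = pd.DataFrame({"df2_column": ["b", "c", "d"], "y": [10, 20, 30]})

# Keeps 'a', 'b', 'c'; 'a' has no match in df2, so its y value is NaN
pd.merge(df1, df2, left_on="df1_column", right_on="df2_column", how="left")

# Keeps only 'b' and 'c', the keys present in both DataFrames
pd.merge(df1, df2, left_on="df1_column", right_on="df2_column", how="inner")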
The integers below refer to the row and column position (first, second, etc.). THEY DO NOT REPRESENT COLUMN NAMES OR ROW INDEX LABELS. Also note that iloc uses square brackets [ ], not parentheses ( ).
df.iloc[0]
df.iloc[8]
df.iloc[1:4]
df.iloc[-2:]
df.iloc[[0, 3, 4]]
df.iloc[:, 0]
df.iloc[:, [0]]
df.iloc[:, -1]
df.iloc[:, 1:-1]
df.iloc[:, [0, 2, 3]]
df.iloc[[1, 2], -2:]
df.iloc[[0, -1], [0, -1]]
df.iloc[[0, -1], ::2]
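For example, given a hypothetical 3x3 DataFrame like the one created earlier (row labels 'u', 's', 't'), iloc ignores the labels entirely:
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]},
                  index=['u', 's', 't'])

df.iloc[0]                 # first row by position (the one labelled 'u')
df.iloc[:, -1]             # last column, as a Series: 7, 8, 9
df.iloc[[0, -1], [0, -1]]  # the four corner values (columns A and C)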
Any integers below refer to actual row index labels, in a hypothetical scenario where those row index labels are integers. THEY DO NOT REFER TO ACTUAL ROW OR COLUMN POSITIONS.
df["column_name"]
df[["column_name"]]
df[["column_1", "column_2"]]
Will raise an error if no rows have the specified index label.
df.loc["a"]
This is not the same as selecting the first row. There may be one or more rows with the row index label 0. Will raise an error if no rows have the specified label.
df.loc[0]
import datetime
df.loc[datetime.datetime(2030, 12, 31)]
Will include all the rows from the one labelled 'a' through the one labelled 'u'; unlike positional slices, label slices include both endpoints. May include no rows, one row or more rows.
df.loc['a':'u']
Will return a DataFrame with more than 3 rows if several rows have labels 'a', 'u' or 'd'.
df.loc[ ['a', 'u', 'd'] ]
Will return a DataFrame with more than 3 rows if several rows have labels 1, 5 or 6.
df.loc[ [1, 5, 6] ]
df.loc[:, 'Country']
df.loc[:, ['Country']]
df.loc[:, ['Country', 'Surface']]
Select the rows and columns with the specified labels:
df.loc[ ["row_label_1", "row_label_22"], ["column_name_1", "column_name_2"] ]
The first list must have as many boolean values as there are rows. The second list must have as many booleans as there are columns. Only rows and columns corresponding to True values will be returned:
df.iloc[ [True, True, False, False, False], [True, True, False, False] ]
or
df.loc[ [True, True, False, False, False], [True, True, False, False] ]
The code below returns the first two columns of the second, third, fourth, fifth, sixth and seventh rows in the DataFrame.
df.iloc[1:7, [True, True, False, False]]
The code below returns the first two columns of all the rows in the DataFrame located between the row with label 1 and the row with label 7, both inclusive.
df.loc[1:7, [True, True, False, False]]
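To make the label-versus-position distinction concrete, here is a hypothetical DataFrame whose integer row labels do not match the row positions:
df = pd.DataFrame({"Country": ["France", "Japan", "Peru"]}, index=[7, 0, 3])

df.loc[0]    # the row labelled 0: Japan (second row by position)
df.iloc[0]   # the first row by position: France (labelled 7)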
df.sample(n=50)
df.sample(frac=0.5)
Below, we build a pivot table of the mean and standard deviation of each score by rep and location, broken down by year and month. The second version adds margins=True, which appends 'All' subtotal rows and columns:
data.pivot_table(
index=['RepName', 'Location'],
columns=['Year', 'Month'],
values=['Helpfulness', 'Courtesy', 'Empathy'],
aggfunc=['mean', 'std']
)
data.pivot_table(
index=['RepName', 'Location'],
columns=['Year', 'Month'],
values=['Helpfulness', 'Courtesy', 'Empathy'],
aggfunc=['mean', 'std'],
margins=True
)
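The pivots above and the groupby examples below assume survey-style data shaped roughly like this (a hypothetical sketch; the column names match the calls in this section):
data = pd.DataFrame({
    "RepName":     ["Alice", "Alice", "Bob", "Bob"],
    "Location":    ["Paris", "Paris", "Lyon", "Lyon"],
    "Year":        [2030, 2030, 2030, 2030],
    "Month":       [1, 2, 1, 2],
    "Date":        ["2030-01-15", "2030-02-15", "2030-01-20", "2030-02-20"],
    "Helpfulness": [4, 5, 3, 4],
    "Courtesy":    [5, 4, 4, 5],
    "Empathy":     [3, 4, 2, 3]})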
data[ ['Helpfulness',
'Courtesy',
'Empathy',
'RepName',
'Date'] ].groupby( ['RepName', 'Date'] ).mean()
data[ ['Helpfulness',
'Courtesy',
'Empathy',
'RepName',
'Date'] ].groupby( ['RepName', 'Date'] ).agg( ['mean', 'std'] )
df['new_column'] = (df['col_A'] * 3 + df['col_B']) / df['col_C']
df['uppercase_names'] = df['names'].str.upper()
df['lowercase_names'] = df['names'].str.lower()
df['name_initial'] = df['names'].str[0]
def low_score(row):
    # Flag interactions where the rep scored low on all three dimensions
    if (row['Helpfulness'] < 2 and
            row['Empathy'] < 2 and
            row['Courtesy'] < 2):
        return 1
    return 0
data['Low Score'] = data.apply(low_score, axis=1)
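On large DataFrames, the same flag can be computed without apply, using vectorized boolean operations (a sketch equivalent to low_score above):
data['Low Score'] = ((data['Helpfulness'] < 2) &
                     (data['Empathy'] < 2) &
                     (data['Courtesy'] < 2)).astype(int)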
The code below updates all the missing values in the column Location to be N/A:
data.loc[ data['Location'].isnull(), ['Location'] ] = 'N/A'
The code below updates all the values in the column Amount that are less than 100 to be 150:
data.loc[ data['Amount'] < 100, ['Amount'] ] = 150
df[ df['customer_id'] == 1121 ]
Use & for logical and, | for logical or, and ~ for negations.
df[ (df['customer_id'] > 1121) & (df['customer_id'] < 2232) ]
The code below will return the values in columns Letter and Animal of all the rows where the first character in the Animal column is greater than 'd' (i.e. 'e', 'f', 'g', ...):
df.loc[ df['Animal'].str[0] > 'd', ['Letter', 'Animal'] ]
The code below will extract the data in the columns whose names have more than 6 characters:
df.loc[:, df.columns.str.len() > 6]
df.sort_values(by=['col1', 'col2'])
df.sort_values(by=['col1', 'col2'], inplace=True)
df.sort_values(by=["col_A"], ascending=False)
df.sort_index()
df.sort_index(inplace=True)
df.reset_index()
df.rename(columns={"old_column_1": "new_column_1", "old_column_2": "new_column_2"})
df.rename(columns={"old_column_1": "new_column_1", "old_column_2": "new_column_2"},
inplace=True)
In the code below, row index label 0 is replaced with 1, and row index label 1 is replaced with 2.
df.rename(index={0: 1, 1: 2})
df.head()
df.tail()
The code below returns an Index object containing the column names; it can be iterated over.
df.columns
df.index
df.shape
df.info()
df.describe()
df.mean()
df.median()
df.std()
df.max()
df.min()
df.corr(method='pearson')
df.isnull().any()
df.count()
df.dropna()
df.dropna(axis=1)
data['Amount'] = data['Amount'].fillna(0)
df.dtypes
data['Country'] = data['Country'].astype('category')
df['column_name'] = df['column_name'].astype(data_type)
*Replace data_type with a data type such as str, int, float, etc.