A machine learning and deep learning approach to predicting pharmaceutical sales using Python

  1. Introduction
  2. Objectives
  3. Exploratory Data Analysis
  4. Feature Engineering
  5. Modeling
  6. Conclusion

Every organization can benefit from a sales forecast: it supports better decisions, and accurate forecasting is an important and inexpensive way to increase profits. A business that cannot anticipate client demand and future product or service sales will struggle to improve its financial performance. Business owners have traditionally forecast revenue from their years of experience, which tends to be less accurate than data-driven methods. Today they prefer not to rely on judgment alone and want to use historical data to forecast sales. Machine learning and deep learning are well suited to this task. In this article, I will show how to apply machine learning to forecast future sales.

The finance team wants to forecast sales in all of its stores across several cities six weeks ahead of time. Factors such as promotions, competition, school and state holidays, seasonality, and locality have been identified as necessary for predicting sales across the various stores. The main objective is to build and serve an end-to-end product that delivers this prediction to analysts in the finance team.

The data used for this project comes from a Kaggle competition and can be downloaded here.

Data fields:

Id — an Id that represents a (Store, Date) duple within the test set

Store — a unique Id for each store

Sales — the turnover for any given day (this is what you are predicting)

Customers — the number of customers on a given day

Open — an indicator for whether the store was open: 0 = closed, 1 = open

StateHoliday — indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None

SchoolHoliday — indicates if the (Store, Date) was affected by the closure of public schools

StoreType — differentiates between 4 different store models: a, b, c, d

Assortment — describes an assortment level: a = basic, b = extra, c = extended. Read more about assortment here.

CompetitionDistance — the distance in meters to the nearest competitor store.

CompetitionOpenSince[Month/Year] — gives the approximate year and month of the time the nearest competitor was opened.

Promo — indicates whether a store is running a promo on that day.

Promo2 — Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating.

Promo2Since[Year/Week] — describes the year and calendar week when the store started participating in Promo2.

PromoInterval — describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g. “Feb, May, Aug, Nov” means each round starts in February, May, August, and November of any given year for that store.
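The PromoInterval description above can be turned into a usable feature. The following is a minimal sketch, assuming the field is a comma-separated string of month abbreviations as in the example; the function names `promo2_start_months` and `is_promo2_start_month` are hypothetical and not part of the original project.

```python
# Month abbreviations in calendar order; position + 1 is the month number.
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
MONTH_NUMBER = {abbr: i + 1 for i, abbr in enumerate(MONTHS)}


def promo2_start_months(promo_interval):
    """Return the month numbers named in a PromoInterval string."""
    # A missing value (None or NaN) is treated here as "no Promo2 rounds".
    if not isinstance(promo_interval, str) or not promo_interval.strip():
        return []
    # Only the first three letters are compared, so a longer
    # spelling such as "Sept" would still map to month 9.
    return [MONTH_NUMBER[name.strip()[:3].title()]
            for name in promo_interval.split(",")]


def is_promo2_start_month(promo_interval, month):
    """True if a new Promo2 round starts in the given month (1-12)."""
    return month in promo2_start_months(promo_interval)


print(promo2_start_months("Feb, May, Aug, Nov"))
```

With pandas, the same check could be applied row by row to build a binary column from PromoInterval and the month of each date.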

Two separate CSV files are used to train the model: train.csv and store.csv.

We will join these two tables on the store ID for training.

import pandas as pd

train_df = pd.read_csv('train.csv')
store_df = pd.read_csv('store.csv')

Data Overview


We can now merge the two tables on the Store ID as follows:

data = pd.merge(train_df, store_df, on='Store')
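It can be worth checking that a merge like this neither duplicates nor silently drops training rows. The sketch below uses made-up toy frames in place of train.csv and store.csv, and it swaps in `how="left"` (the article's call uses the default inner join) together with the `validate` and `indicator` options of `pd.merge`.

```python
import pandas as pd

# Toy stand-ins for train.csv and store.csv (values are made up).
train_toy = pd.DataFrame({"Store": [1, 1, 2, 3],
                          "Sales": [5263, 6064, 8314, 0]})
store_toy = pd.DataFrame({"Store": [1, 2],
                          "StoreType": ["c", "a"]})

merged = pd.merge(
    train_toy,
    store_toy,
    on="Store",
    how="left",              # keep every training row
    validate="many_to_one",  # raise if a Store ID repeats in store_toy
    indicator=True,          # adds a _merge column: "both" or "left_only"
)

# Training rows whose store has no matching metadata row.
unmatched = merged[merged["_merge"] == "left_only"]
print(len(merged), len(unmatched))
```

With the default inner join, the row for store 3 would simply disappear from the result, which is easy to miss on a large table.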

Total Sales of Stores:

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12, 7))
sns.scatterplot(data=train_df, x='Store', y='Sales')

Most of the sales are within the range of 0–2,200 and there are multiple outliers that need to be handled.

Sales Correlation with Customers for each Store Type:

plt.figure(figsize=(12, 7))
# Use the merged table so that StoreType is available for coloring the points
sns.scatterplot(data=data, x='Customers', y='Sales', hue='StoreType')

Sales are positively correlated with the number of customers for each store type.
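The visual impression from the scatter plot can be backed by a number. As a rough sketch on made-up toy data (the column names follow the data fields above, the values do not come from the dataset), the Pearson correlation between Customers and Sales can be computed separately for each store type:

```python
import pandas as pd

# Made-up values, only to illustrate the calculation.
toy = pd.DataFrame({
    "StoreType": ["a", "a", "a", "b", "b", "b"],
    "Customers": [500, 600, 700, 1500, 1800, 2100],
    "Sales": [4000, 5000, 5800, 9000, 10500, 12500],
})

# Pearson correlation of Customers and Sales within each store type.
corr_by_type = pd.Series({
    store_type: group["Customers"].corr(group["Sales"])
    for store_type, group in toy.groupby("StoreType")
})
print(corr_by_type)
```

A value close to 1 for every store type would support the reading of the plot; a markedly lower value for one type would suggest that the relationship is weaker there.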

Sales correlation with competition Distance

plt.figure(figsize=(12, 7))
# CompetitionDistance and StoreType come from store.csv, so plot from the merged table
sns.scatterplot(data=data, x='CompetitionDistance', y='Sales', hue='StoreType')

Interestingly, sales decrease as the competition distance increases. Store type a still records some sales where the competition distance is very high.

Sales on holidays

plt.figure(figsize=(12, 7))
# holiday_df is not defined in this article; it is assumed to hold the rows being compared
sns.barplot(data=holiday_df, x='StateHoliday', y='Sales')

Sales are lower on holidays.
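Because most stores are closed on state holidays, part of this gap comes from days with zero sales. A minimal sketch of the comparison on made-up toy data follows; it assumes that StateHoliday may be read from CSV with a mix of the integer 0 and the string "0", so the column is cast to string first.

```python
import pandas as pd

# Made-up rows; StateHoliday deliberately mixes 0 and "0".
toy = pd.DataFrame({
    "StateHoliday": [0, "0", "a", "b", "c", 0],
    "Open": [1, 1, 1, 0, 0, 1],
    "Sales": [6000, 5500, 7000, 0, 0, 6500],
})

# Normalize the holiday codes so 0 and "0" fall into one group.
toy["StateHoliday"] = toy["StateHoliday"].astype(str)

# Average sales per holiday code over all days, closed days included.
summary = toy.groupby("StateHoliday")["Sales"].agg(["mean", "count"])

# Same average restricted to days on which the store was open.
open_only = toy[toy["Open"] == 1].groupby("StateHoliday")["Sales"].mean()

print(summary)
print(open_only)
```

Comparing the two tables shows how much of the holiday effect is explained by closures rather than by lower demand on open days.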

Sales on each Month per store

# Assumes a Month column derived from the date, and the merged table for StoreType
sns.relplot(x='Month', y='Sales', hue='StoreType', data=data);

June has the highest sales and January the lowest.
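The Month column used in this plot is not among the listed data fields, so it has to be derived. One possible way, assuming the training table has a Date column that pandas can parse (the toy values below are made up), is:

```python
import pandas as pd

# Made-up rows standing in for the training table.
toy = pd.DataFrame({
    "Date": ["2015-01-05", "2015-06-15", "2015-06-16"],
    "Sales": [4000, 7000, 7400],
})

# Parse the date strings, then extract calendar features.
toy["Date"] = pd.to_datetime(toy["Date"])
toy["Year"] = toy["Date"].dt.year
toy["Month"] = toy["Date"].dt.month
toy["WeekOfYear"] = toy["Date"].dt.isocalendar().week.astype(int)

# Average sales per month, the quantity the plot compares.
monthly = toy.groupby("Month")["Sales"].mean()
print(monthly)
```

The same calendar features are also natural inputs for the seasonality factor mentioned in the objectives.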

Sales on each Day

sns.relplot(x='DayOfWeek', y='Sales', hue='StoreType', data=data);

Store type b is mostly open on Saturday and records higher sales.

  • Filling missing values
  • Fill numerical features with their median
miss_1 = ['Promo2SinceYear', 'Promo2SinceWeek', 'CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear', 'CompetitionDistance']
for col in miss_1:
    df[col] = df[col].fillna(df[col].median())
  • Fill categorical features with their mode
Categ_var = ['PromoInterval']
for col in Categ_var:
    df[col] = df[col].fillna(df[col].mode()[0])

# Treat a missing Open flag as closed
df['Open'] = df['Open'].fillna(0)

This should fill all the missing values.
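Whether any gaps remain is easy to confirm rather than assume. The sketch below repeats the median and mode fills on a small made-up frame and then counts the remaining missing values per column with `isna().sum()`.

```python
import numpy as np
import pandas as pd

# Made-up frame with one gap in a numerical and one in a categorical column.
toy = pd.DataFrame({
    "CompetitionDistance": [1270.0, np.nan, 570.0],
    "PromoInterval": ["Jan,Apr,Jul,Oct", None, "Jan,Apr,Jul,Oct"],
})

missing_before = toy.isna().sum()

# Numerical column: fill with the median of the observed values.
toy["CompetitionDistance"] = toy["CompetitionDistance"].fillna(
    toy["CompetitionDistance"].median())
# Categorical column: fill with the most frequent value.
toy["PromoInterval"] = toy["PromoInterval"].fillna(
    toy["PromoInterval"].mode()[0])

missing_after = toy.isna().sum()
print(missing_before)
print(missing_after)
```

On the real table, any column that still shows a non-zero count after the fills needs its own strategy.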

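Most of the outliers are in the Sales and Customers columns. I capped them at bounds derived from the interquartile range (IQR): values above Q3 + 1.5 × IQR are set to that upper bound, and values below Q1 - 1.5 × IQR are set to that lower bound.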

import numpy as np

columns = ['Sales', 'Customers']
for col in columns:
    # First and third quartiles of the column
    Q1, Q3 = df[col].quantile(0.25), df[col].quantile(0.75)
    IQR = Q3 - Q1
    cut_off = IQR * 1.5
    lower, upper = Q1 - cut_off, Q3 + cut_off
    # Cap values above the upper bound, then values below the lower bound
    df[col] = np.where(df[col] > upper, upper, df[col])
    df[col] = np.where(df[col] < lower, lower, df[col])
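The same capping can be written more compactly with `Series.clip`. This is an equivalent sketch rather than the code used in the project; the helper name `cap_outliers_iqr` and the toy values are hypothetical.

```python
import pandas as pd


def cap_outliers_iqr(series, k=1.5):
    """Cap values outside [Q1 - k*IQR, Q3 + k*IQR] at those bounds."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# Made-up values with one obvious outlier.
sales = pd.Series([10, 12, 11, 13, 12, 100])
capped = cap_outliers_iqr(sales)
print(capped.tolist())
```

Unlike `np.where`, `clip` returns a pandas Series, so the index of the original column is preserved.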

from sklearn.preprocessing import LabelEncoder

categorical_columns = ['PromoInterval', 'Assortment', 'StoreType']

# Label encoding (the original call was preprocess.label_encode(train_df, categorical_columns);
# the steps are written out inline below)
label_encoded_columns = []
# Loop over each categorical column
for col in categorical_columns:
    # Define a new label encoder for each column
    le = LabelEncoder()
    # Encode the data and wrap it in a new DataFrame;
    # note that the column name is passed in the "columns" argument
    column_dataframe = pd.DataFrame(le.fit_transform(train_df[col]), columns=[col])
    # Add the new DataFrame to the "label_encoded_columns" list
    label_encoded_columns.append(column_dataframe)

# Merge all encoded columns into one DataFrame
label_encoded_columns = pd.concat(label_encoded_columns, axis=1)
# Drop the original categorical columns
train_df.drop(categorical_columns, axis=1, inplace=True)
# Merge DataFrames
train_df = pd.concat([train_df, label_encoded_columns], axis=1)
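A variation on this step is to keep each fitted encoder, so that the same mapping can later be applied to unseen data or reversed. The sketch below is an assumption about how that might look, shown on made-up toy data, and is not taken from the project's preprocess module.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up categorical values.
toy = pd.DataFrame({
    "StoreType": ["a", "c", "b", "a"],
    "Assortment": ["a", "c", "a", "b"],
})

encoders = {}
for col in ["StoreType", "Assortment"]:
    le = LabelEncoder()
    # Assigning in place avoids the index alignment that pd.concat performs.
    toy[col] = le.fit_transform(toy[col])
    # Keep the fitted encoder so the mapping can be reused or inverted.
    encoders[col] = le

# Map the integer codes back to the original labels.
decoded = encoders["StoreType"].inverse_transform(toy["StoreType"])
print(toy["StoreType"].tolist(), list(decoded))
```

LabelEncoder assigns integers in sorted label order, which imposes an ordering on the categories that tree-based models tolerate better than linear ones.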

We can use machine learning models such as a Random Forest Regressor to predict sales, and deep learning models to try to improve on that accuracy.

Machine learning model (Random Forest Regressor)
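As a minimal sketch of what this step might look like, the example below fits scikit-learn's RandomForestRegressor on synthetic data. The feature names echo the data fields above, but the values, the target formula, the hyperparameters and the 80/20 split are all assumptions made for illustration, not the article's actual setup or results.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Synthetic features and target (made up for illustration only).
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "Promo": rng.integers(0, 2, n),
    "DayOfWeek": rng.integers(1, 8, n),
    "CompetitionDistance": rng.uniform(100, 5000, n),
})
y = 5000 + 2000 * X["Promo"] - 300 * X["DayOfWeek"] + rng.normal(0, 200, n)

# Hold out the last 20% of rows; with real data the rows would be
# sorted by date so that validation mimics forecasting ahead in time.
split = int(n * 0.8)
X_train, X_valid = X.iloc[:split], X.iloc[split:]
y_train, y_valid = y.iloc[:split], y.iloc[split:]

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

pred = model.predict(X_valid)
mae = mean_absolute_error(y_valid, pred)
importances = pd.Series(model.feature_importances_, index=X.columns)
print(round(mae, 1))
print(importances.sort_values(ascending=False))
```

Since the goal is a forecast six weeks ahead, a holdout made of the most recent weeks is usually a more honest check than a random split.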
