Create and Explore the Landscape of Roles and Salaries in Data Science | by Erdogan Taskesen | Jun, 2023



The data science salary data set is derived from ai-jobs.net [1] and is also open as a Kaggle competition [2]. The data set contains 11 features for 4134 samples. The samples are collected worldwide and updated weekly from 2020 to the present time (somewhere at the beginning of 2023). The data set is published in the public domain and is free to use. Let's load the data and have a look at the variables.

# Import library
import datazets as dz
# Get the data science salary data set
df = dz.get('')

# The features are as follows

# 'work_year' > The year the salary was paid.
# 'experience_level' > The experience level in the job during the year.
# 'employment_type' > Type of employment: part-time, full-time, contract or freelance.
# 'job_title' > Title of the role.
# 'salary' > Total gross salary amount paid.
# 'salary_currency' > Currency of the salary paid (ISO 4217 code).
# 'salary_in_usd' > Converted salary in USD.
# 'employee_residence' > Primary country of residence.
# 'remote_ratio' > Remote work: less than 20%, partially, more than 80%
# 'company_location' > Country of the employer's main office.
# 'company_size' > Average number of people that worked for the company during the year.

# Selection of only European countries
# countries_europe = ['SM', 'DE', 'GB', 'ES', 'FR', 'RU', 'IT', 'NL', 'CH', 'CF', 'FI', 'UA', 'IE', 'GR', 'MK', 'RO', 'AL', 'LT', 'BA', 'LV', 'EE', 'AM', 'HR', 'SI', 'PT', 'HU', 'AT', 'SK', 'CZ', 'DK', 'BE', 'MD', 'MT']
# df['europe'] = np.isin(df['company_location'], countries_europe)
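
The commented-out selection above can be tried on a small example. The sketch below is only an illustration: the toy data frame and the shortened country list are assumptions, not the real data set or the full list from the article.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the salary data; only the column used here is included.
df_toy = pd.DataFrame({'company_location': ['US', 'DE', 'NL', 'IN', 'GB']})

# Shortened, illustrative subset of the country codes listed above.
countries_europe = ['DE', 'GB', 'ES', 'FR', 'NL']

# Boolean flag: True when the employer's main office is in a listed country.
df_toy['europe'] = np.isin(df_toy['company_location'], countries_europe)

# Keep only the flagged rows, e.g. for a Europe-only summary.
df_europe = df_toy.loc[df_toy['europe']]
print(df_europe)
```

Storing the flag as a column keeps the worldwide and the Europe-only views available from the same data frame.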

A summary of the top job titles together with the distribution of the salaries is shown in Figure 1. The two top panels are worldwide, whereas the bottom two panels are only for Europe. Although such graphs are informative, they show averages, and it is unknown how location, experience level, remote work, country, etc. are related in a particular context. For example: is the salary of an entry-level data engineer who works remotely for a small company more or less similar to that of an experienced data engineer with other properties? Such questions can be better answered with the analysis shown in the next sections.

Figure 1. The top-ranked job titles. The two top panels are worldwide statistics, whereas the bottom two panels are for Europe. (image by author)


The data science salary data set is a mixed data set containing continuous and categorical variables. We will perform an unsupervised analysis and create the data science landscape. But before doing any preprocessing, we need to remove redundant features such as salary_currency and salary to prevent multicollinearity issues. In addition, we will exclude the variable salary_in_usd from the data set and store it as a target variable y because we do not want the grouping to be driven by the salary itself. Based on the clustering, we can investigate whether any of the detected groupings are related to salary. The cleaned data set results in 8 features with the same 4134 samples.

# Store salary in a separate target variable.
y = df['salary_in_usd']

# Remove redundant variables
df.drop(labels=['salary_currency', 'salary', 'salary_in_usd'], inplace=True, axis=1)

# Make the categorical variables easier to understand.
df['experience_level'] = df['experience_level'].replace({'EN':'Entry-level', 'MI':'Junior Mid-level', 'SE':'Intermediate Senior-level', 'EX':'Expert Executive-level / Director'}, regex=True)
df['employment_type'] = df['employment_type'].replace({'PT':'Part-time', 'FT':'Full-time', 'CT':'Contract', 'FL':'Freelance'}, regex=True)
df['company_size'] = df['company_size'].replace({'S':'Small (less than 50)', 'M':'Medium (50 to 250)', 'L':'Large (>250)'}, regex=True)
df['remote_ratio'] = df['remote_ratio'].replace({0:'No remote', 50:'Partially remote', 100:'>80% remote'}, regex=True)
df['work_year'] = df['work_year'].astype(str)

# Check the dimensions of the cleaned data set
print(df.shape)
# (4134, 8)

The next step is to get all measurements into the same unit of measurement. To do this, we will carefully perform one-hot encoding and take care of the multicollinearity that we can unknowingly introduce. In other words, when we transform any categorical variable into multiple one-hot variables, we introduce a bias that allows us to perfectly predict a feature based on two or more features from the same categorical column (that is, the sum of the one-hot encoded features is always one). This is called the dummy trap, and we can prevent it by breaking the chain of linearity by simply dropping one column. The df2onehot package contains the dummy trap protection feature. This feature is slightly more advanced than simply dropping a one-hot column per category because it only removes a one-hot column if the chain of linearity is not yet broken by other cleaning actions, such as the minimum number of samples per one-hot feature or the removal of the False state in boolean features.
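
To make the dummy trap concrete, here is a minimal sketch using plain pandas and NumPy on a toy column. It is not how df2onehot is implemented; the toy values and variable names are assumptions chosen for illustration only.

```python
import numpy as np
import pandas as pd

# Toy categorical column (hypothetical values, not the real data).
toy = pd.DataFrame({'company_size': ['Small', 'Medium', 'Large', 'Medium', 'Small']})

# Full one-hot encoding: one column per category.
full = pd.get_dummies(toy['company_size'], dtype=int)

# Every row sums to one, so any column equals 1 minus the sum of the others.
row_sums = full.sum(axis=1)
reconstructed = 1 - full.drop(columns='Small').sum(axis=1)

# Together with an intercept column the matrix is rank deficient:
# four columns, but they are linearly dependent.
with_intercept = np.column_stack([np.ones(len(full)), full.to_numpy()])
rank_full = np.linalg.matrix_rank(with_intercept)

# Dropping one one-hot column breaks the chain of linearity.
reduced = pd.get_dummies(toy['company_size'], drop_first=True, dtype=int)
with_intercept_reduced = np.column_stack([np.ones(len(reduced)), reduced.to_numpy()])
rank_reduced = np.linalg.matrix_rank(with_intercept_reduced)

print('columns / rank with all one-hot columns:', with_intercept.shape[1], rank_full)
print('columns / rank after dropping one:', with_intercept_reduced.shape[1], rank_reduced)
```

In the full encoding the rank is lower than the number of columns, which is the linear dependence described above. After dropping one column the rank equals the number of columns again.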

# Import library
from df2onehot import df2onehot

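# One-hot encoding and removing any multicollinearity to prevent the dummy trap.
# The call is completed with the settings described in the text (y_min=5 and
# remove_multicollinearity=True); the 'onehot' key is assumed to hold the encoded data frame.
dfhot = df2onehot(df,
                  remove_multicollinearity=True,
                  y_min=5)['onehot']

print(dfhot)
# work_year_2021 ... company_size_Small (less than 50)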
# 0 False ... False
# 1 False ... False
# 2 False ... False
# 3 False ... False
# 4 False ... False
# ... ... ...
# 4129 False ... False
# 4130 True ... False
# 4131 False ... True
# 4132 False ... False
# 4133 True ... False

# [4134 rows x 115 columns]

In our case, we will remove one-hot encoded features that contain fewer than 5 samples (y_min=5), and remove multicollinearity to prevent the dummy trap (remove_multicollinearity=True). This results in 115 one-hot encoded features for the same 4134 samples.
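
The effect of a minimum sample count can be sketched with plain pandas. This is an assumption-laden illustration of the idea behind y_min, not the df2onehot implementation; the toy job titles and counts are made up.

```python
import pandas as pd

# Toy column with one rare category (hypothetical values).
toy = pd.DataFrame({
    'job_title': ['Data Engineer'] * 6 + ['Data Scientist'] * 5 + ['Head of Data'] * 2
})

# One-hot encode the column; each new column is prefixed with the original name.
onehot = pd.get_dummies(toy['job_title'], prefix='job_title', prefix_sep='_')

# Count the samples per one-hot column and keep columns with at least y_min samples.
y_min = 5
counts = onehot.sum(axis=0)
onehot_filtered = onehot.loc[:, counts >= y_min]

print(counts)
print(onehot_filtered.columns.tolist())
```

Rows that belonged to the removed rare category are now False in every remaining column, so the remaining columns no longer sum to one for every row. This is an example of the chain of linearity already being broken by another cleaning action, in which case no further column needs to be dropped.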