  • Sundar Anand

High Performance Language vs High Productivity Language

Updated: May 13



This blog is for engineers who are starting their careers in the software field.

"Figure out your data structure and the code will follow"

Besides logical thinking, coding depends heavily on the programming language we use. There is a stereotype that software engineers work mostly in C, C++ and Java, while data scientists spend most of their time in Python, MATLAB and R. Have you ever wondered why this is the case?


Does the answer to this question tell us more about the difference between those programming languages?


What will be the impact if the software engineers use python and data scientists use C++?


In this blog, you will be able to find answers to these very fundamental questions.


“Do the hard jobs first. The easy jobs will take care of themselves.”

The hard job here is to choose, with proper reasoning, the language you want to code in.


First, let's classify programming languages into two groups:

  1. High Performance Language (Eg: C++)

  2. High Productivity Language (Eg: Python)


High Performance Language


"I feel very comfortable going at full speed."


The programming languages in this category sit close to the hardware. They are low level in the sense that the instructions are very straightforward: the compiler can translate them into machine code ahead of time, and little extra processing is needed at run time to carry out the algorithm. This is why high performance languages are a lot faster. At the same time, coding in such a language can become a complicated, tedious process. Because these languages are designed to be executed swiftly, the coder has to take responsibility for memory allocation, the data types of the variables, stringent syntax, and so on. Now we can intuitively understand why software engineers prefer high performance languages over others. Their job is basically to write code that will be triggered very often, so they want it to be highly optimised.
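
To make the point about data types concrete, here is a small sketch in Python that uses the standard ctypes module to mimic a value stored in a fixed-width 32-bit integer, the kind of type a C or C++ coder has to pick by hand. The helper name add_int32 is made up for this illustration, and the wrap-around it shows is simply how ctypes stores an out-of-range value, not a statement about what every C compiler will do.

```python
import ctypes

def add_int32(a, b):
    """Add two numbers and store the result in a 32-bit signed integer."""
    # ctypes does no overflow checking, so an out-of-range value wraps around
    return ctypes.c_int32(a + b).value

INT32_MAX = 2**31 - 1  # largest value a 32-bit signed integer can hold

print(add_int32(1, 2))          # small values behave as expected
print(add_int32(INT32_MAX, 1))  # does not fit in 32 bits, so it wraps
print(INT32_MAX + 1)            # a plain Python int simply grows
```

In Python the last line just works because the language manages the size of the integer for us. In a high performance language, choosing a type that is wide enough is part of the coder's job.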


High Productivity Language


“Productivity is being able to do things that you were never able to do before.”

Just as this quote suggests, although high performance languages execute quickly and with little overhead, they have some limitations. High productivity languages are designed to address those limitations at the cost of a tradeoff. One of the most crucial limitations of performance languages is that they become tough to code in when we incorporate highly complicated logic for a specific use. For example, in the data science domain, we may have to use advanced statistics and probability concepts together with linear algebra. It becomes tedious and challenging to code these concepts within the constrained syntax and memory management that a high performance language demands. Moreover, languages like Python are open source, which means anyone can build their own classes and publish them as a library for the Python community. The tradeoff here is the speed of execution. Even though productivity languages give us the freedom to script complicated logic with ease, the processor cannot run these instructions directly. So a high productivity language first converts the code to a lower level form and then executes it (Python, for instance, compiles source code to bytecode that its interpreter then runs). This costs some extra time on the execution side.
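
The extra translation step can be seen from inside Python itself. The sketch below assumes CPython, the standard Python implementation, and uses the built-in dis module to list the bytecode behind a tiny function. The function add is only an illustration, and the exact instruction names differ between Python versions.

```python
import dis

def add(a, b):
    return a + b

# The function body was compiled to bytecode when the function was defined
print(type(add.__code__.co_code))  # the raw bytecode is stored as bytes

# Human-readable listing of the instructions the interpreter will step through
dis.dis(add)
```

Even a single addition becomes several interpreter instructions, each of which is dispatched at run time. A compiled C or C++ program pays that translation cost once, before it runs.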


The Comparison...


Let's first consider the cases where Python is better than C++. For example, let's look at the "Hello, World!" code in both:


Python:

print("Hello, World!")

C:

#include <stdio.h>
int main()
{
 printf("Hello, World!");
 return 0;
}

This by itself shows clearly why Python may look like the simpler language to code in. Now let's look at some more complicated code: the decision tree, a supervised learning algorithm from the data science domain.


Python:

# Required libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Reading the file
dataset = pd.read_csv("")

# Separating input and output
X = dataset.drop('Class', axis=1)
y = dataset['Class']

# Splitting train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

# Model initialisation
classifier = DecisionTreeClassifier()

# Model Fitting
classifier.fit(X_train, y_train)

# Model Prediction
y_pred = classifier.predict(X_test)

C++:


#include <iostream>
#include <vector>
#include <string>
#include <fstream>
#include <cmath>
#include <cstdlib>
#include <map>
using namespace std;

class Table {
	public:
		vector<string> attrName;
		vector<vector<string> > data;

		vector<vector<string> > attrValueList;
		void extractAttrValue() {
			attrValueList.resize(attrName.size());
			for(int j=0; j<attrName.size(); j++) {
				map<string, int> value;
				for(int i=0; i<data.size(); i++) {
					value[data[i][j]]=1;
				}

				for(auto iter=value.begin(); iter != value.end(); iter++) {
					attrValueList[j].push_back(iter->first);
				}
			}
		}
};

class Node {
	public:
		int criteriaAttrIndex;
		string attrValue;

		int treeIndex;
		bool isLeaf;
		string label;

		vector<int > children;

		Node() {
			isLeaf = false;
		}
};

class DecisionTree {
	public:
		Table initialTable;
		vector<Node> tree;

		DecisionTree(Table table) {
			initialTable = table;
			initialTable.extractAttrValue();

			Node root;
			root.treeIndex=0;
			tree.push_back(root);
			run(initialTable, 0);
			printTree(0, "");

			cout<< "<-- finish generating decision tree -->" << endl << endl;
		}

		string guess(vector<string> row) {
			string label = "";
			int leafNode = dfs(row, 0);
			if(leafNode == -1) {
				return "dfs failed";
			}
			label = tree[leafNode].label;
			return label;
		}

		int dfs(vector<string>& row, int here) {
			if(tree[here].isLeaf) {
				return here;
			}

			int criteriaAttrIndex = tree[here].criteriaAttrIndex;

			for(int i=0;i<tree[here].children.size(); i++) {
				int next = tree[here].children[i];

				if (row[criteriaAttrIndex] == tree[next].attrValue) {
					return dfs(row, next);
				}
			}
			return -1;
		}

		void run(Table table, int nodeIndex) {
			if(isLeafNode(table) == true) {
				tree[nodeIndex].isLeaf = true;
				tree[nodeIndex].label = table.data.back().back();
				return;
			}

			int selectedAttrIndex = getSelectedAttribute(table);

			map<string, vector<int> > attrValueMap;
			for(int i=0;i<table.data.size();i++) {
				attrValueMap[table.data[i][selectedAttrIndex]].push_back(i);
			}

			tree[nodeIndex].criteriaAttrIndex = selectedAttrIndex;

			pair<string, int> majority = getMajorityLabel(table);
			if((double)majority.second/table.data.size() > 0.8) {
				tree[nodeIndex].isLeaf = true;
				tree[nodeIndex].label = majority.first;
				return;
			}

			for(int i=0;i< initialTable.attrValueList[selectedAttrIndex].size(); i++) {
				string attrValue = initialTable.attrValueList[selectedAttrIndex][i];

				Table nextTable;
				vector<int> candi = attrValueMap[attrValue];
				for(int i=0;i<candi.size(); i++) {
					nextTable.data.push_back(table.data[candi[i]]);
				}

				Node nextNode;
				nextNode.attrValue = attrValue;
				nextNode.treeIndex = (int)tree.size();
				tree[nodeIndex].children.push_back(nextNode.treeIndex);
				tree.push_back(nextNode);

				// for empty table
				if(nextTable.data.size()==0) {
					nextNode.isLeaf = true;
					nextNode.label = getMajorityLabel(table).first;
					tree[nextNode.treeIndex] = nextNode;
				} else {
					run(nextTable, nextNode.treeIndex);
				}
			}
		}

		double getEstimatedError(double f, int N) {
			double z = 0.69;
			if(N==0) {
				cout << ":: getEstimatedError :: N is zero" << endl;
				exit(0);
			}
			return (f+z*z/(2*N)+z*sqrt(f/N-f*f/N+z*z/(4*N*N)))/(1+z*z/N);
		}

		pair<string, int> getMajorityLabel(Table table) {
			string majorLabel = "";
			int majorCount = 0;

			map<string, int> labelCount;
			for(int i=0;i< table.data.size(); i++) {
				labelCount[table.data[i].back()]++;

				if(labelCount[table.data[i].back()] > majorCount) {
					majorCount = labelCount[table.data[i].back()];
					majorLabel = table.data[i].back();
				}
			}

			return {majorLabel, majorCount};
		}


		bool isLeafNode(Table table) {
			for(int i=1;i < table.data.size();i++) {
				if(table.data[0].back() != table.data[i].back()) {
					return false;
				}
			}
			return true;
		}

		int getSelectedAttribute(Table table) {
			int maxAttrIndex = -1;
			double maxAttrValue = 0.0;

			// except label
			for(int i=0; i< initialTable.attrName.size()-1; i++) {
				if(maxAttrValue < getGainRatio(table, i)) {
					maxAttrValue = getGainRatio(table, i);
					maxAttrIndex = i;
				}
			}

			return maxAttrIndex;
		}

		double getGainRatio(Table table, int attrIndex) {
			return getGain(table, attrIndex)/getSplitInfoAttrD(table, attrIndex);
		}

		double getInfoD(Table table) {
			double ret = 0.0;

			int itemCount = (int)table.data.size();
			map<string, int> labelCount;

			for(int i=0;i<table.data.size();i++) {
				labelCount[table.data[i].back()]++;
			}

			for(auto iter=labelCount.begin(); iter != labelCount.end(); iter++) {
				double p = (double)iter->second/itemCount;

				ret += -1.0 * p * log(p)/log(2);
			}

			return ret;
		}

		double getInfoAttrD(Table table, int attrIndex) {
			double ret = 0.0;
			int itemCount = (int)table.data.size();

			map<string, vector<int> > attrValueMap;
			for(int i=0;i<table.data.size();i++) {
				attrValueMap[table.data[i][attrIndex]].push_back(i);
			}

			for(auto iter=attrValueMap.begin(); iter != attrValueMap.end(); iter++) {
				Table nextTable;
				for(int i=0;i<iter->second.size(); i++) {
					nextTable.data.push_back(table.data[iter->second[i]]);
				}
				int nextItemCount = (int)nextTable.data.size();

				ret += (double)nextItemCount/itemCount * getInfoD(nextTable);
			}

			return ret;
		}

		double getGain(Table table, int attrIndex) {
			return getInfoD(table)-getInfoAttrD(table, attrIndex);
		}

		double getSplitInfoAttrD(Table table, int attrIndex) {
			double ret = 0.0;

			int itemCount = (int)table.data.size();

			map<string, vector<int> > attrValueMap;
			for(int i=0;i<table.data.size();i++) {
				attrValueMap[table.data[i][attrIndex]].push_back(i);
			}

			for(auto iter=attrValueMap.begin(); iter != attrValueMap.end(); iter++) {
				Table nextTable;
				for(int i=0;i<iter->second.size(); i++) {
					nextTable.data.push_back(table.data[iter->second[i]]);
				}
				int nextItemCount = (int)nextTable.data.size();

				double d = (double)nextItemCount/itemCount;
				ret += -1.0 * d * log(d) / log(2);
			}

			return ret;
		}

		/*
		 * Enumerates through all the nodes of the tree and prints all the branches 
		 */
		void printTree(int nodeIndex, string branch) {
			if (tree[nodeIndex].isLeaf == true)
				cout << branch << "Label: " << tree[nodeIndex].label << "\n";

			for(int i = 0; i < tree[nodeIndex].children.size(); i++) {
				int childIndex = tree[nodeIndex].children[i];

				string attributeName = initialTable.attrName[tree[nodeIndex].criteriaAttrIndex];
				string attributeValue = tree[childIndex].attrValue;

				printTree(childIndex, branch + attributeName + " = " + attributeValue + ", ");
			}
		}
};


class InputReader {
	private:
		ifstream fin;
		Table table;
	public:
		InputReader(string filename) {
			fin.open(filename);
			if(!fin) {
				cout << filename << " file could not be opened\n";
				exit(0);
			}
			parse();
		}
		void parse() {
			string str;
			bool isAttrName = true;
			// read until getline fails, so a last line without a trailing newline is not skipped
			while(getline(fin, str)){
				vector<string> row;
				int pre = 0;
				for(int i=0;i<str.size();i++){
					if(str[i] == '\t') {
						string col = str.substr(pre, i-pre);

						row.push_back(col);
						pre = i+1;
					}
				}
				string col = str.substr(pre, str.size()-pre-1);
				row.push_back(col);

				if(isAttrName) {
					table.attrName = row;
					isAttrName = false;
				} else {
					table.data.push_back(row);
				}
			}
		}
		Table getTable() {
			return table;
		}
};

class OutputPrinter {
	private:
		ofstream fout;
	public:
		OutputPrinter(string filename) {
			fout.open(filename);
			if(!fout) {
				cout << filename << " file could not be opened\n";
				exit(0);
			}
		}

		string joinByTab(vector<string> row) {
			string ret = "";
			for(int i=0; i< row.size(); i++) {
				ret += row[i];
				if(i != row.size() -1) {
					ret += '\t';
				}
			}
			return ret;
		}

		void addLine(string str) {
			fout << str << endl;
		}
};

int main(int argc, const char * argv[]) {
	if(argc!=4) {
		cout << "Please follow this format. dt.exe [train.txt] [test.txt] [result.txt]";
		return 0;
	}

	string trainFileName = argv[1];
	InputReader trainInputReader(trainFileName);
	DecisionTree decisionTree(trainInputReader.getTable());

	string testFileName = argv[2];
	InputReader testInputReader(testFileName);
	Table test = testInputReader.getTable();

	string resultFileName = argv[3];
	OutputPrinter outputPrinter(resultFileName);
	outputPrinter.addLine(outputPrinter.joinByTab(test.attrName));
	for(int i=0;i < test.data.size(); i++) {
		vector<string> result = test.data[i];
		result.push_back(decisionTree.guess(test.data[i]));
		outputPrinter.addLine(outputPrinter.joinByTab(result));
	}

	/* for answer check */
	/*
	   InputReader answerInputReader("dt_answer1.txt");
	   Table answer = answerInputReader.getTable();
	   int totalCount = (int)answer.data.size();
	   int hitCount = 0;
	   for(int i=0;i < test.data.size(); i++) {
	   if(answer.data[i].back() == decisionTree.guess(test.data[i])) {
	   hitCount++;
	   }
	   }
	   cout << "Accuracy: " << (double)hitCount/totalCount*100 << "%";
	   cout << "(" << hitCount << "/" << totalCount << ")" << endl;
	   */
	return 0;
}
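
For a sense of scale, the core calculation inside the C++ listing (the entropy and gain ratio that getInfoD, getInfoAttrD, getSplitInfoAttrD and getGainRatio compute) can be sketched in a few lines of plain Python. This is an illustrative rewrite, not the code sklearn runs. The function names info and gain_ratio and the tiny rows table are made up for the example, and returning 0.0 when the split information is zero is my own assumption (the C++ version would divide by zero there).

```python
from collections import Counter
from math import log2

def info(labels):
    """Entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def gain_ratio(rows, attr_index):
    """Gain ratio of splitting rows on one attribute.

    Each row is a list whose last element is the class label.
    """
    labels = [row[-1] for row in rows]
    # Group the class labels by the value of the chosen attribute
    groups = {}
    for row in rows:
        groups.setdefault(row[attr_index], []).append(row[-1])
    weights = [len(group) / len(rows) for group in groups.values()]
    info_after = sum(w * info(g) for w, g in zip(weights, groups.values()))
    split_info = -sum(w * log2(w) for w in weights)
    # Assumption for this sketch: treat a zero split_info as "no useful split"
    return (info(labels) - info_after) / split_info if split_info else 0.0

# Hypothetical toy table: outlook, temperature, class label
rows = [
    ["sunny", "hot", "no"],
    ["sunny", "mild", "no"],
    ["rainy", "hot", "yes"],
    ["rainy", "mild", "yes"],
]
print("gain ratio for outlook:", gain_ratio(rows, 0))
print("gain ratio for temperature:", gain_ratio(rows, 1))
```

Here the outlook column separates the two classes perfectly while the temperature column tells us nothing, so a tree builder would pick outlook for the first split.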

From this example, we can understand why most people's first choice for coding is Python and not C/C++. It is not just the simplicity of the language but also the fact that Python is open source, which allows anyone to build their own library and publish it. One such library, used here for reference, is sklearn (which contains the decision tree algorithm).


Then why should people even go for C/C++?


Let's see a graph to understand why C/C++ is still preferred over Python in a few places...

The graph above shows the performance (i.e. operations per second) that each programming language can achieve. The comparison was carried out between two programming languages, namely C (high performance) and Python (high productivity). The task was to multiply two huge integer values. The Y-axis represents the performance, whereas the X-axis shows the length of the integers considered (IMUL + num_of_digits). In order to keep the experiment unbiased, the same logic was used in both languages and was run on two systems: the CMU ECE cluster (ECE) and an AWS EC2 instance (EC2). Here we can clearly see that C outperforms Python on both machines. At the same time, perfecting the C code with all its memory management took more time than implementing the same logic in Python. Python also gave access to the NumPy library, which made the task even easier and less complicated. This is the basic difference between high performance languages and high productivity languages.
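
The Python side of such a measurement can be approximated with the standard timeit module. The sketch below is not the code used in the experiment above. The function name ops_per_second, the digit counts and the repeat count are all assumptions chosen for illustration, and the numbers it prints depend entirely on the machine it runs on.

```python
import random
import timeit

def ops_per_second(num_digits, repeats=1000):
    """Rough count of multiplications per second for num_digits-digit integers."""
    a = random.randrange(10 ** (num_digits - 1), 10 ** num_digits)
    b = random.randrange(10 ** (num_digits - 1), 10 ** num_digits)
    # The timing includes the overhead of calling the lambda on every repeat
    seconds = timeit.timeit(lambda: a * b, number=repeats)
    return repeats / seconds

for digits in (10, 100, 1000):
    print(digits, "digits:", round(ops_per_second(digits)), "ops/sec")
```

Running the same loop in C with a hand-written big-integer multiply would give the second curve of the comparison, at the price of writing and debugging that multiply yourself.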


Another reason why IoT chips and other lightweight hardware boards prefer C to Python is that C is optimised not just for time but also for memory. When we talk about time and memory, the algorithms that come to mind are search, sort, and insertion algorithms. The graphs below show the memory and time consumption of these algorithms in both Python and C.


Memory comparison


Time Comparison

From the graphs above we can observe that C (a high performance language) is far better optimised in terms of both memory and time than Python (a high productivity language). Now we can appreciate the importance of a high performance language like C/C++. These are the primary reasons why high performance languages still dominate the software half of the world.
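
Part of the memory gap can be observed without leaving Python. The sketch below assumes CPython and compares a normal list of int objects with the standard array module, which packs raw C integers back to back the way a C array would. The variable names are illustrative and the exact byte counts vary by platform and Python version.

```python
import sys
from array import array

values = list(range(1000, 2000))  # 1000 separate Python int objects
packed = array("i", values)       # the same numbers stored as raw C ints

# A list holds pointers; every int is a full object with its own header
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)
# An array holds the raw values, itemsize bytes each
array_bytes = sys.getsizeof(packed)

print("bytes per element, list of ints:", list_bytes / len(values))
print("bytes per element, packed array:", array_bytes / len(packed))
```

The per-element cost of the list is several times that of the packed array, which is the kind of overhead a small IoT board cannot afford.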


Having looked at high performance and high productivity languages separately, and having compared their coding style, time and memory optimisation, we can clearly understand the difference between the two and the ways in which each stands out.


Now let's try to answer the questions...

  • Why do software engineers use high performance languages while data scientists use high productivity languages?

Answer: Software engineers develop code that is fundamentally required for a system to run, so their code should execute in the most optimised way possible. Data scientists, on the other hand, develop logic that integrates advanced mathematical libraries, so they choose a high productivity language and accept the tradeoff in execution speed.

  • Does the answer to the previous question tell us more about the difference between those programming languages?

Answer: Yes. By understanding the roles and responsibilities of a software engineer and a data scientist, we were able to see how each chooses a programming language. Understanding the reasons behind that choice lets us see the difference between a high performance language and a high productivity language, and appreciate the advantages of each.

  • What will be the impact if the software engineers use python and data scientists use C++?

Answer: If software engineers use a high productivity language like Python, they will have to face execution time delays, which are not good for any system. On the other hand, if data scientists use a high performance language like C++, they will have to spend a lot of time coding all the complicated mathematical logic for their experiments. This leads to a situation where the engineer spends most of the time drafting code rather than doing the research needed to reach the best solution.


If you are intrigued by thoughts like these and are interested in learning more, please consider going through the 18647 - Computation Problem Solving for Engineers course content offered by Carnegie Mellon University's Electrical and Computer Engineering department.


Some super-useful reference links to get a deeper understanding of high performance and high productivity languages:

  1. https://lemire.me/blog/2017/01/16/best-programming-language-for-high-performance-january-2017/

  2. https://www.bbc.co.uk/bitesize/guides/z4cck2p/revision/1

  3. https://www.bitdegree.org/tutorials/python-vs-c-plus-plus/

  4. https://www.geeksforgeeks.org/difference-between-python-and-c/



If you have any feedback feel free to write to sundaranand1998@gmail.com


