Software Engineering Practices in Data Science

Chetna Shahi
4 min read · Jun 27, 2021

Data scientists need software engineering skills both when moving solutions into production and when collaborating with software engineers. One of the most important of these skills is writing efficient, readable code. In this article, I cover a few coding best practices that data scientists should follow before pushing code to production. Let's dive in.

Modular Code: Modular code is separated into logical functions or modules, which improves reusability and readability. Keep the following points in mind while writing modular code (a short sketch follows the list):

  • Do not Repeat Yourself (DRY): Write reusable code using functions or modules
  • Minimize the number of entities (functions, classes, or modules)
  • Each function should do one thing and stay very focused
  • Try to use fewer than three arguments per function
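For instance, here is a minimal sketch of DRY, assuming two hypothetical price columns that both need the same scaling:

import numpy as np

def normalize(values):
    """Scale a numeric array to the range [0, 1]."""
    values = np.asarray(values, dtype=float)
    return (values - values.min()) / (values.max() - values.min())

tv_prices_scaled = normalize([300, 450, 900])
cd_prices_scaled = normalize([10, 15, 25])

One small, focused function with a single argument replaces two copies of the same scaling logic.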

Refactoring Code: Refactoring means restructuring code without changing its behaviour. In the example below, a single list comprehension with replace puts an underscore between the words of every column name, instead of calling rename with an explicit mapping for each column. This saves time and reduces the chance of error.

# Verbose: rename each column explicitly
df = df.rename(columns={"TV Price": "TV_Price", "CD Price": "CD_Price"})

# Refactored: replace spaces with underscores in every column name at once
df.columns = [label.replace(' ', '_') for label in df.columns]
  • Use meaningful, descriptive names for variables & functions.
  • Name variables in a way that implies their data type. For example, age_list clearly suggests a list of ages, while a bare name like ages leaves the type ambiguous (it could even be a single integer). Prefix Boolean variables with is or has to make clear they hold a condition, e.g. is_senior.
  • Use verbs for functions & nouns for variables.
  • Don't use abbreviations or single-letter names, except for counters, short function arguments, etc.
  • Be descriptive, but don't use more characters than necessary (see the naming sketch below).
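A small naming sketch, using hypothetical variables and a hypothetical function, that follows the rules above:

ages = [34, 29, 57]                    # plural noun for a collection
age_limit = 60                         # descriptive scalar
is_senior = max(ages) >= age_limit     # Boolean reads as a condition

def compute_average_age(age_list):
    """Return the mean of a list of ages."""
    return sum(age_list) / len(age_list)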

Use whitespace judiciously:

  • Organize code with proper indentation.
  • Separate sections with blank lines.
  • Limit lines to 79 characters, as recommended by PEP 8 (a short formatting sketch follows).
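A short formatting sketch, assuming a hypothetical load_prices helper: 4-space indentation, blank lines between logical sections, and lines kept under 79 characters.

import pandas as pd


def load_prices(path):
    """Read a CSV of prices and return a cleaned DataFrame."""
    df = pd.read_csv(path)

    # A blank line separates reading the file from cleaning the columns.
    df.columns = [label.replace(' ', '_') for label in df.columns]
    return df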

While refactoring, make sure the code stays efficient, i.e. it keeps run time low and uses little memory. How much optimization matters depends on context: a batch job that runs for a few minutes weekly or monthly may not need optimizing, while code that serves posts on social media needs to be relatively fast.

Optimizing Code:

  • Use vector operations instead of loops wherever possible; numpy & pandas let you replace loops with vectorized operations. As shown in the example below, numpy's intersect1d method gets the intersection of the recent_books and coding_books arrays without a nested for loop.
import time
import numpy as np

# recent_books and coding_books are assumed to be 1-D arrays of book IDs
start = time.time()
recent_coding_books = np.intersect1d(recent_books, coding_books)
print(len(recent_coding_books))
print('Duration: {} seconds'.format(time.time() - start))
  • Know whether another data structure can accomplish the task more efficiently, e.g. sets in the example below.
start = time.time()
recent_coding_books = set(recent_books).intersection(coding_books)
print(len(recent_coding_books))
print('Duration: {} seconds'.format(time.time() - start))

In the above example, the intersection took about 0.03 seconds with intersect1d and about 0.007 seconds with sets, so the set-based approach is the most efficient of the ones tested.

Another refactoring example: use NumPy boolean indexing to select elements from an array and then operate on the selection. You could do the same with a for loop and an if-else check on each element, but that would not be the most optimized solution.

# gift_costs is assumed to be a NumPy array of gift prices in dollars
start = time.time()
total_price = (gift_costs[gift_costs < 25]).sum() * 1.08  # sum gifts under $25, add 8% tax
print(total_price)
print('Duration: {} seconds'.format(time.time() - start))

Documentation: Additional text or illustrations that clarify complex parts of a program and make the code easy to navigate. Documentation exists at different levels: line level with inline comments, function/module level with docstrings, and project level with a README file.
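For example, a function-level docstring plus an inline comment might look like this (the function and the 'date' column are hypothetical):

def split_by_date(df, cutoff):
    """Split a DataFrame into train and test sets by a date cutoff.

    Args:
        df: DataFrame with a 'date' column.
        cutoff: rows on or before this date go to the training set.

    Returns:
        A (train_df, test_df) tuple.
    """
    mask = df['date'] <= cutoff  # boolean mask marking the training rows
    return df[mask], df[~mask]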

Version Control using Git: It's a great way to share and sync work between team members. Proper use of Git commits & branches lets you work on multiple features at the same time, commit messages help you retrace your steps, and merging combines changes made by multiple contributors.

Testing: Testing is one of the most essential practices software developers follow before deploying to production, but in the data science world the concept is still quite fuzzy. Many data scientists don't test their code or models, which can lead to faulty business decisions as well as execution errors caused by software issues.

Unit testing: A type of testing that covers a single unit of code, usually one function, independently from the rest of the program. The pytest module makes this easy (a small sketch follows).
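A minimal pytest sketch, using a hypothetical nearest_square function; save the test in a file whose name starts with test_ and run pytest from the project root:

def nearest_square(num):
    """Return the largest perfect square less than or equal to num."""
    root = 0
    while (root + 1) ** 2 <= num:
        root += 1
    return root ** 2

def test_nearest_square():
    assert nearest_square(5) == 4
    assert nearest_square(9) == 9
    assert nearest_square(-12) == 0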

Integration Testing: Tests two or more parts of the application together and verifies the interaction between them (see the sketch below).
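A hedged sketch of an integration test, assuming two hypothetical helpers, preprocess and total_with_tax, that are developed separately but used together:

def preprocess(raw_prices):
    """Strip currency symbols and convert to floats."""
    return [float(p.strip('$')) for p in raw_prices]

def total_with_tax(prices, tax_rate=0.08):
    """Sum the prices and add tax."""
    return sum(prices) * (1 + tax_rate)

def test_preprocess_then_total():
    # Exercises both steps end to end, not each one in isolation.
    raw = ['$10.00', '$5.50']
    assert abs(total_with_tax(preprocess(raw)) - 16.74) < 1e-9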

Test Driven Development (TDD): Write the test before writing the code to be tested, so the test fails at first and you know the implementation is finished once it passes. TDD also gives you confidence that your code doesn't break when you refactor it.
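A TDD sketch with a hypothetical email_validator: the test below is written first and fails, and the regex implementation is then added until the test passes:

import re

def email_validator(address):
    """Return True if address looks like a simple name@domain.tld email."""
    return re.fullmatch(r'[^@\s]+@[^@\s]+\.[^@\s]+', address) is not None

def test_email_validator():
    assert email_validator('name@domain.com') is True
    assert email_validator('name@domain@com') is False
    assert email_validator('name.domain.com') is False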

Logging: Logging is the process of recording messages that help you understand the events that occur while your program is running. Always use the appropriate logging level: DEBUG for fine-grained detail about anything that happens in the program, INFO for recording user-driven or system-specific actions, and ERROR for recording any errors that occur (a minimal sketch follows).
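A minimal logging sketch using Python's built-in logging module (the messages themselves are hypothetical):

import logging

logging.basicConfig(level=logging.DEBUG,
                    format='%(asctime)s %(levelname)s %(message)s')
logger = logging.getLogger(__name__)

logger.debug('Loaded 1,000 rows from the training set')  # low-level detail
logger.info('Model training started on user request')    # user/system action
logger.error('Could not connect to the feature store')    # something went wrong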

Code Reviews: Code reviews help catch errors, ensure readability, enforce standards & spread knowledge across the team. A code-review checklist keeps reviews consistent, and Python's linter Pylint can automatically check your code against coding standards & the PEP 8 guidelines.

Conclusion:

The tips above will help data scientists organize their code in a more efficient and usable way. I hope you enjoyed this quick read; feel free to share your reviews or tips in the comments.

References:

https://classroom.udacity.com/
