
Data Classes in Python

Huey Yeng

There are many articles on Python’s dataclass so I’ll focus on my usage at work and personal projects.

TLDR, embrace dataclass if you’re on Python 3.7 or newer! Most major animation suites that comply with VFX Platform 2020 should be using Python 3.7.

What Went Wrong

Previously, I used dict whenever I needed to store structured data in a variable. Sure, dict is flexible, but as the data schema grows it becomes… more difficult to keep track of in a predictable way in an IDE like PyCharm or VS Code.

Also, with dict I have no idea what type each value is.

employee_id = response_data["employee_id"]
employee = Employee.objects.get(pk=employee_id)
# Cues traceback error of invalid type

Is employee_id str or int? Spoiler: it is

Fun Fact: I’ve dealt with a Django codebase that had an employee_id CharField for… a model named Employee. It led to many funky code smells during debugging, which got me to refactor and reformat the entire codebase.

...
crews = response_data["projects"][0]["crews"]
# Wait what the heck is `crews` supposed to be (other than reading
# the next several dozens/hundreds of lines that possibly mutate it)
...

What is crews supposed to be? Is it a list/array of something? Or, gasp, a string that uses commas to separate the values?

Or maybe it is None. No one knows until you inspect the data!

My main concern when working on an existing codebase from a previous developer is figuring out the structure of each dict, as we need to guess the keys and values.

Yes, guessing is the name of the game.

I don’t even understand how a developer is supposed to remember what the dict looks like when they come back to refactor or fix bugs, other than by adding meaningful comments… and even that only helps if the naming conventions and labels make sense (I’ll cover this in a future post).

I admit I’m guilty of using dict and relying on unit tests at my old job to ensure nothing broke when a key or value changed. At least I had an excuse because it was running on Python 3.6!

What prompted me to embrace dataclass? Working on projects that use TypeScript. The moment VS Code highlighted an error and suggested the possible keys/values for a JavaScript object, I thought: wouldn’t it be nice to have that when writing Python code!

// Example from my personal React site code
export interface ITheme {
  autoPrefetch: "in-view";
  menu: Array<string>;
  isMobileMenuOpen: boolean;
  skeletonDebug: boolean;
  skeletonDebugDuration: number;
  featured: {
    showOnList: boolean;
    showOnPost: boolean;
  };
  relatedPosts: {
    showThumbnail: boolean;
  };
}

Enter dataclass

To keep this section short and simple, I’ll explain how I usually use dataclass in the next section.

# The sample dict
response_data = {
    "id": 1,
    "name": "Winamp",
    "description": "It Really Whips the Llama's Ass",
    "install_path": "C:\\Program Files\\Winamp",
    "installed": True,
}

Here we have a dict structure that (fingers crossed) will not mutate. Assuming the happy flow, the above dict can be described using dataclass like this:

from dataclasses import dataclass

@dataclass
class Software:
    id: int
    name: str
    description: str
    install_path: str
    installed: bool = False

So how do we use it? Refer to the following screenshot taken from PyCharm:

As installed has a default value of False, Python will use that if one doesn’t explicitly set it, which is already a nice quality-of-life improvement over dict.
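Since a screenshot can’t be copied and pasted, here is a minimal sketch of the same idea in code. It reuses the Software class and the sample dict from above; unpacking the dict with ** is my own shortcut and only works when the keys match the field names exactly.

```python
from dataclasses import dataclass


@dataclass
class Software:
    id: int
    name: str
    description: str
    install_path: str
    installed: bool = False


# Keyword arguments; `installed` is omitted, so it falls back to False.
winamp = Software(
    id=1,
    name="Winamp",
    description="It Really Whips the Llama's Ass",
    install_path="C:\\Program Files\\Winamp",
)
print(winamp.installed)  # False

# A dict whose keys match the field names can be unpacked directly.
response_data = {
    "id": 1,
    "name": "Winamp",
    "description": "It Really Whips the Llama's Ass",
    "install_path": "C:\\Program Files\\Winamp",
    "installed": True,
}
from_response = Software(**response_data)
print(from_response.installed)  # True

# dataclass generates __eq__, which compares field by field.
print(winamp == from_response)  # False, because `installed` differs
```

An unexpected key in the dict makes Software(**response_data) raise a TypeError, which is one reason for the from_dict helper later in this post.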

To wrap up this section, dataclass allows for more predictable coding input while minimising human and AI errors (when GitHub Copilot and ChatGPT replace junior developers, gasp).

Refer to the following code block for comparison of using dict versus dataclass.

from pathlib import Path

# pretend there is 'softwares' key in response.json() that is a list of dict
softwares = response.json()["softwares"]
for software in softwares:
    print(f"Checking {software['name']} is already installed...")
    install_path = Path(software["install_path"])
    if not install_path.exists():
        print(f"{software['name']} install path is not found!")

    software["installd"] = install_path.exists()

# dataclass approach
softwares = [foobar, ...]
for software in softwares:
    print(f"Checking {software.name} is already installed...")
    install_path = Path(software.install_path)
    if not install_path.exists():
        print(f"{software.name} install path is not found!")

    software.installed = install_path.exists()

If you managed to spot the typo, congratulations on reading the code thoroughly. As far as I know, there is no reliable key completion or type hinting for a plain dict in either PyCharm or VS Code, so a typo in a key can lead to many funny moments.
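To make the difference concrete, here is a small sketch of what such a typo does at runtime. The trimmed-down Software class is only for illustration. Note that a plain dataclass does not block a misspelled assignment at runtime; it is the IDE or a static checker such as mypy that flags it.

```python
from dataclasses import asdict, dataclass


@dataclass
class Software:
    name: str
    installed: bool = False


# dict: the misspelled key is silently added next to the real one.
software_dict = {"name": "Winamp", "installed": False}
software_dict["installd"] = True
print(software_dict)
# {'name': 'Winamp', 'installed': False, 'installd': True}

# dataclass: reading a misspelled attribute fails loudly.
software = Software(name="Winamp")
try:
    software.installd
except AttributeError as error:
    print(f"AttributeError: {error}")

# Assigning to a misspelled attribute still works at runtime, but a type
# checker reports it, and the stray attribute never becomes a declared field.
software.installd = True  # typo left in on purpose
print(asdict(software))
# {'name': 'Winamp', 'installed': False}
```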

My Typical Usage of dataclass

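The dataclasses module ships with helpers such as asdict, but nothing that builds an instance from a dict. I wrap both directions in my own to_dict and from_dict methods, which offer flexibility when we need to convert to and from dict data for third-party libraries and external APIs.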

Before we start, it is best to create a base class that has the following:

# the sample code for this section will be familiar for those using Django
import inspect
from dataclasses import asdict, dataclass
from typing import Any, Dict

@dataclass
class BaseModel:
    def to_dict(self) -> Dict[str, Any]:
        # NOTE: `if v` drops every falsy value (None, 0, False, "",
        # empty containers), not just None
        dict_ = {
            k: v for k, v in asdict(self).items() if v
        }
        return dict_

    @classmethod
    def from_dict(cls, dict_: Dict[str, Any]):
        # keep only the keys that match the generated __init__ parameters
        params = inspect.signature(cls).parameters

        return cls(
            **{
                k: v for k, v in dict_.items()
                if k in params
            }
        )

What you want to do is inherit from this base class so that you can use the to_dict and from_dict methods.

...
# continuing from above
import requests

@dataclass
class Achievement(BaseModel):
    id: int
    name: str
    description: str = ""
    is_hidden: bool = True

...
# pretend we are dealing with 3rd party API and want to store/deserialize the data
def scrap_achievement_data(blah, bruh):
    response = requests.get(blah, bruh)
    if response.ok:
        response_data = response.json()
        achievement = Achievement.from_dict(response_data)
        return achievement
    # falls through and returns None when the request fails

...
# now we want to serialize the dataclass to a dict object (not necessarily JSON compatible!)
player = Player.objects.get(nick_name="Spam", team="Monty Python")
player.metadata = achievement.to_dict()  # metadata is JSONField for this example
player.save()
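For anyone who wants to try the round trip without requests or Django, here is a self-contained sketch. The payload and its icon_url key are made up for illustration; the two classes repeat the definitions above so the snippet runs on its own.

```python
import inspect
from dataclasses import asdict, dataclass
from typing import Any, Dict


@dataclass
class BaseModel:
    def to_dict(self) -> Dict[str, Any]:
        # Same filter as above: falsy values are left out.
        return {k: v for k, v in asdict(self).items() if v}

    @classmethod
    def from_dict(cls, dict_: Dict[str, Any]):
        params = inspect.signature(cls).parameters
        return cls(**{k: v for k, v in dict_.items() if k in params})


@dataclass
class Achievement(BaseModel):
    id: int
    name: str
    description: str = ""
    is_hidden: bool = True


# Hypothetical payload; "icon_url" is not a field on Achievement.
payload = {"id": 7, "name": "Llama Whipper", "is_hidden": False, "icon_url": "n/a"}

achievement = Achievement.from_dict(payload)
print(achievement)
# Achievement(id=7, name='Llama Whipper', description='', is_hidden=False)

serialized = achievement.to_dict()
print(serialized)
# {'id': 7, 'name': 'Llama Whipper'}

# Round trip: is_hidden was dropped, so it comes back as the default True.
restored = Achievement.from_dict(serialized)
print(restored.is_hidden)  # True
```

The unknown key is ignored on the way in, which is handy for noisy third-party APIs. On the way out, the truthiness filter also removes False, 0 and empty strings, so a field whose default is truthy does not survive the round trip. If that matters for your data, filtering on `v is not None` instead is one possible adjustment.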

Is dataclass worth using?

I don’t really see any cons to using dataclass unless your code needs to stay portable to Python 2.7 or to Python 3.6 and older.

Obviously you need to have a decent IDE with IntelliSense. I’m not sure if the latest Notepad++ or Sublime Text have similar features.

As I can’t dictate your development environment, try to use dataclass for code readability, IDE autocompletion and type checking (either through the IDE or mypy) whenever possible.
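One caveat worth spelling out: the type checking mentioned here is static only. dataclass does not validate annotations at runtime, so the str-or-int question from earlier can still bite if nobody runs the checker. The sketch below shows this, plus an optional guard in __post_init__; the Employee and CheckedEmployee classes are hypothetical examples, not code from any project mentioned in this post.

```python
from dataclasses import dataclass


@dataclass
class Employee:
    employee_id: int
    name: str


# No error at runtime: the annotation is not enforced, although mypy or the
# IDE would flag the str argument.
unchecked = Employee(employee_id="42", name="Spam")
print(type(unchecked.employee_id).__name__)  # str


@dataclass
class CheckedEmployee:
    employee_id: int
    name: str

    def __post_init__(self) -> None:
        # Optional runtime guard; runs right after the generated __init__.
        if not isinstance(self.employee_id, int):
            raise TypeError(
                f"employee_id must be int, got {type(self.employee_id).__name__}"
            )


try:
    CheckedEmployee(employee_id="42", name="Spam")
except TypeError as error:
    print(error)  # employee_id must be int, got str
```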

References

https://realpython.com/python-data-classes/

https://docs.python.org/3/library/dataclasses.html

© 2024 Huey Yeng