Improving database performance in Django, part 2: implementing caching
This is the second part of the series Django beyond CRUD, and also the second article on the topic of improving database performance in Django. It can be read independently, so no problem if you haven’t read the first one. In this article you will learn the fundamental ideas behind caching and how to implement it in a Django project.
Caching fundamentals
There are two broad techniques for improving database performance. The first is working on the database itself: writing optimized queries, following good practices while designing the schema, and so on. The second is reducing the overhead on the database. If we have something that can take over part of the work our database does, the load on the database goes down. That middleman is an in-memory store. Relational databases (and many non-relational ones) are ultimately space on a hard disk: the CPU communicates with the disk, the database, whenever data needs to be read or written. If we fetch some data from the database and keep it in memory, interacting with that data no longer requires touching the main database. This is called caching. We cache data in memory, so the application does not need to hit the main database, which means less overhead on the main database. It also comes with the extra benefit of speed: as obvious as it is, interacting with memory is far faster than interacting with a hard disk, so operations become blazingly fast.
Memory, however, is a limited resource. Even on machines with 64 GB of RAM or more, it is not ideal for storing heavy data, for several reasons: RAM is also needed for computation, for the data of running processes, and for many other system tasks, and it does not store data permanently. So the advantage of using RAM as a storage option is limited, and we need to choose carefully what we cache to make the most of that limited advantage. A cache entry is also not permanent; it has a time span for which it stays in memory. Another thing to keep in mind is that the main database still has to be updated when cached data changes. This is a very important point. Why do we cache? So we can reduce interaction with the main database → less overhead on the main database and faster performance. What if the cached data is updated frequently? Then we have effectively added extra operations → caching plus the interaction with the main database. So what’s the conclusion? Caching is efficient only when the data we cache is not updated frequently, meaning the data is read-heavy. (When I say the data gets updated, I don’t mean only the update operation but delete as well; any change to the data counts.) The second case where caching pays off is data that is updated frequently but does not need to be stored permanently in the main database.
Take an application like Netflix, where there is a lot of content to watch, which in technical terms means read operations. In an app like Netflix the user is not going to upload movies or series; that is an admin task. The users only consume the content. This is a read-heavy application, where caching can improve performance drastically. The user’s watchlist, favourites and bookmarks, on the other hand, are under the user’s control: they can be added and removed as frequently as the user wants, so caching them won’t serve the purpose and would only add extra operations. So caching serves its purpose where the application is read-heavy and the data does not change frequently, or where the data does not need to be stored permanently, for example a user’s session data, which exists only for that particular session and is deleted once the session expires.
The concepts discussed here matter more than the code. Once we understand why we need caching, where to apply it, and in which scenarios it is effective and in which it is not, implementing caching is easy.
Caching in Django
Let us start implementing caching in Django. Django provides a cache framework (that is what the official documentation calls it) which we can leverage. I will give one basic example, then a scenario where caching is applied strategically in a learning management system application. For caching we need a cache layer, an in-memory layer; Redis is a common choice, but since this article focuses on caching itself, setting up Redis and using it with Django is beyond its scope. Even without Redis we can use the default local-memory cache. In settings.py, configure the cache settings. LOCATION can be any name and is used to separate cache layers if we have multiple of them. If we were using Redis or Memcached we would define their address and port here instead.
CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.locmem.LocMemCache',
        'LOCATION': 'unique-snowflake',
    }
}
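For comparison, if we were pointing at a Redis server instead, the configuration would look roughly like this. This is just a sketch: it assumes Django 4.0+ (which ships a built-in Redis backend) and a local Redis instance on the default port.

CACHES = {
    'default': {
        # Built-in Redis backend, available since Django 4.0;
        # older projects typically use the django-redis package instead.
        'BACKEND': 'django.core.cache.backends.redis.RedisCache',
        'LOCATION': 'redis://127.0.0.1:6379',  # assumed local Redis instance
    }
}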
Django core ships with this cache framework. The cache stores data as key-value pairs: each piece of data is associated with a key, which we set when we cache the data and use with get when we retrieve it. So the strategy is: look in the cache, and if the data exists, return it; if not, fetch it and cache it with a key and a timeout. The timeout is the time, in seconds, for which the data stays cached in memory. During that window the data is served from the cache and no database operations are performed. Key assignment for a particular piece of data should be done very carefully; this will become clearer in the learning management system example.
from django.core.cache import cache
from django.http import JsonResponse

# A view to demonstrate caching
def my_view(request):
    # Check if the data exists in the cache
    data = cache.get('my_key')
    if data is None:
        # Data not in cache: fetch or compute it, then cache it for 360 seconds
        data = {'message': 'Hello, world!'}
        cache.set('my_key', data, timeout=360)
    else:
        print("Serving from cache")
    return JsonResponse(data)
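Django’s cache API also has get_or_set(), which collapses this check-then-set pattern into a single call. The view above could be written roughly as:

from django.core.cache import cache
from django.http import JsonResponse

def my_view(request):
    # get_or_set returns the cached value if present; otherwise it calls the
    # default (here a lambda), stores the result for 360 seconds and returns it.
    data = cache.get_or_set('my_key', lambda: {'message': 'Hello, world!'}, timeout=360)
    return JsonResponse(data)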
There are other kinds of caching as well: caching an entire view, and caching parts of a template. As we understood earlier, if the data is changing, caching just adds extra operations and overhead instead of improved performance. So cache an entire view only if that view does one job and none of its data changes. The user profile view is probably one example: the user can change their profile data, but most of the time we can bet that they won’t, and even when they do, the profile data is not heavy (username, profile picture, address and so on, which are ordinary strings and an image, not a large amount of data). This is where careful consideration of implementing caching becomes important. Template caching works the same way. For example, the footer and header sections of templates are static parts we can cache; but if the header has a notification icon which updates frequently, we should cache only the header fragment excluding the notifications.
Here’s the basic syntax for both:
from django.views.decorators.cache import cache_page

@cache_page(60 * 15)  # cache the entire view for 15 minutes
def my_view(request): ...
{% load cache %}
{% cache 500 sidebar %}
.. sidebar ..
{% endcache %}
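cache_page can also be applied in the URLconf instead of on the view itself, which keeps the view code free of caching concerns. The view and URL names below are just placeholders:

from django.urls import path
from django.views.decorators.cache import cache_page

from . import views

urlpatterns = [
    # Wrap the view with cache_page when wiring up the URL
    path('profile/', cache_page(60 * 15)(views.my_view), name='profile'),
]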
Implementing a caching strategy in a particular scenario
Now we will apply all of the theory and careful consideration mentioned above. I will present an application, what it does, its models and views, and then we will apply a caching strategy to it.
There are two types of users: students and tutors. The User model and userProfile model are linked with a one-to-one relationship. A Classroom model has a many-to-many relationship with userProfile as students, meaning one classroom can have multiple students and a student can be part of multiple classrooms. One classroom has exactly one tutor; the tutor is also a userProfile, so Classroom has a ForeignKey to userProfile as tutor. The rest of the models are fairly self-explanatory, but briefly: a Classroom can have multiple Assignments, Sections, Lectures and Announcements, some of which I didn’t include because they are irrelevant here. Please look carefully into this design, and whenever you get confused further on in the article, come back to the models. To demonstrate a caching strategy I can’t do it with one or two simple models and the bare caching syntax; we covered that above, and now we need a more detailed example. If there is questionable design in the models, please ignore it, because the sole purpose here is to demonstrate caching. These models also have methods which I didn’t include because they are irrelevant here, so if you are wondering whether something is handled properly, it is, but showing it is out of the scope of this article; the only focus is caching.
from django.db import models
from django.contrib.auth.models import AbstractBaseUser
from django.core.validators import MinLengthValidator
from tinymce.models import HTMLField  # assumption: HTMLField comes from a rich-text package such as django-tinymce

# userManager (the custom user manager) and the Assignment model are defined
# elsewhere in the project and omitted here.


class User(AbstractBaseUser):
    student = 1
    tutor = 2
    role_choices = (
        (student, "student"),
        (tutor, "tutor"),
    )
    firstname = models.CharField(max_length=50)
    lastname = models.CharField(max_length=50)
    username = models.CharField(max_length=50)
    email = models.EmailField(max_length=254, unique=True)
    role = models.PositiveSmallIntegerField(choices=role_choices, default=1)
    is_active = models.BooleanField(default=False)
    # Other fields which are not required to consider here...

    USERNAME_FIELD = 'email'
    REQUIRED_FIELDS = ['username', 'firstname', 'lastname']
    objects = userManager()

    def __str__(self):
        return self.email


class userProfile(models.Model):
    user = models.OneToOneField(User, on_delete=models.CASCADE)
    profile_pic = models.ImageField(upload_to='students/profile_pics', blank=True, null=True,
                                    default='students/profile_pics/male.png')
    created_at = models.DateTimeField(auto_now_add=True)
    modified_at = models.DateTimeField(auto_now=True)

    def __str__(self):
        return self.user.username


class Classroom(models.Model):
    name = models.CharField(max_length=50)
    description = HTMLField(blank=True, null=True)
    tutor = models.ForeignKey(userProfile, on_delete=models.CASCADE, related_name='classroom_tutor')
    students = models.ManyToManyField(userProfile, related_name='classroom_students', blank=True)
    requests = models.ManyToManyField(userProfile, related_name='classroom_requests', blank=True)
    cover_pic = models.ImageField(upload_to='class_coverpics', blank=False, null=True)
    code = models.CharField(validators=[MinLengthValidator(6)], max_length=16, unique=True)
    password = models.CharField(validators=[MinLengthValidator(8)], max_length=16, unique=True)
    created_at = models.DateTimeField(auto_now_add=True)

    def __str__(self):
        return self.name


class Announcement(models.Model):
    title = models.CharField(max_length=50)
    file = models.FileField(upload_to='class/announcements', null=True, blank=True)
    content = models.TextField()
    classroom = models.ForeignKey(Classroom, related_name='announcements', on_delete=models.CASCADE)
    upload_date = models.DateTimeField(auto_now_add=True)
    link = models.URLField(max_length=1000, null=True, blank=True)
    tutor_link = models.URLField(max_length=1000, null=True, blank=True)

    def __str__(self):
        return self.title


class Notification(models.Model):
    title = models.CharField(max_length=50)
    content = models.TextField()
    user = models.ForeignKey(userProfile, related_name="notifications", on_delete=models.CASCADE)
    link = models.URLField(max_length=500, null=True, blank=True)
    timestamp = models.DateTimeField(auto_now_add=True)
    assignment = models.ForeignKey(Assignment, on_delete=models.CASCADE, null=True, blank=True)
    read = models.BooleanField(default=False)

    def __str__(self):
        return self.title
Now to the views. In almost every view of my Django app, I have to verify whether the user is a member of the classroom. On every classroom page/view, like assignments, lectures, the home page and so on, there is one operation that has to be performed for all of them: verify that the requesting user belongs to that classroom. For that there is a function called check_classroom_participant(), which takes two arguments, the user and the id of the classroom. For example:
@login_required(login_url='login')
def announcements(request, id):
    user = request.user
    if check_classroom_participant(user, id):
        ...
Another thing: we have to read user data every single time. User data gets updated too, but how frequently? In our app, whether it is checking login, checking permission for a particular resource (like the classroom here), or going to the profile section to comment or bookmark, every operation requires reading user-related data, and that happens far more often than the data changes. Is user data updated as often as it is accessed? Not at all. The number of times user data is read is probably higher than anything else and very high compared to how often it is updated. This is the key question: what is the ratio of reads to updates for this data? Reminding you again, an update is not just the update operation but delete as well, any change to the data. Based on that ratio you decide whether you want to cache the data, and once you cache it, how you are going to handle updates.
Let us first cache the user data. Once again I ask you to look at the models. We will cache the user fields that actually get accessed, like username, email, firstname, lastname, role and so on; fields like password and last_login are not required at all. It is up to you which fields to select while caching the user model. From userProfile we will cache the profile_pic; created_at and modified_at are not used in the application. Now look at the Notification model: can we cache that too? Let’s measure the ratio of accesses to updates. Notifications appear in the header, which means every page we load, every request that renders a page, requires the notification count (see the read field in the Notification model, which is used to show the count of unread notifications). Suppose notifications arrive at most one per minute. This is a hypothetical worst case which won’t hold in real life, but let’s assume it. Even then, how many times within that minute does the user move from one page to another, or reload the page? Clearly notification reads far outnumber notification updates. So we decide to cache the complete notification data too, betting that it won’t be updated something like ten times per minute, and even if it were, it wouldn’t hurt much, since the Notification model contains only text, date and boolean fields. Making this bet gives your application a chance to be faster in realistic scenarios, given that we are already assuming hypothetical worst cases while considering caching.
Now it is decided what to cache. What should the cache key be? Something like user_{user_id}, so the key is unique per user. For how long do we want to store it? 180 seconds (3 minutes) is enough, since this data does have a chance of being updated. We will cache the user at the backend level, by overriding the default ModelBackend class, which is responsible for authenticating and returning the user; it has a get_user() method which we will override. If you haven’t read the first article or aren’t familiar with select_related(), only(), Q objects and Count, the code below will be harder to follow. What I did is, instead of separately caching the user, the userProfile and the notifications, retrieve everything in the single query that fetches the user and cache that user object.
from django.contrib.auth.backends import ModelBackend
from django.core.cache import cache
from django.db.models import Q, Count

from accounts.models import User


class CachedUserBackend(ModelBackend):
    def get_user(self, user_id):
        cache_key = f"user_{user_id}"
        user = cache.get(key=cache_key, default=None)
        if not user:
            try:
                user = User.objects.select_related('userprofile').prefetch_related(
                    'userprofile__notifications').annotate(
                    unread_notifications_count=Count(
                        'userprofile__notifications', filter=Q(userprofile__notifications__read=False)
                    )).only(
                    'firstname', 'lastname', 'username', 'is_active', 'email', 'role', 'password',
                    'userprofile__profile_pic',
                    "userprofile__notifications__title", "userprofile__notifications__content", "userprofile__notifications__link",
                    "userprofile__notifications__timestamp", "userprofile__notifications__assignment_id", "userprofile__notifications__read"
                ).get(pk=user_id)
                cache.set(cache_key, user, timeout=180)
            except User.DoesNotExist:
                return None
        return user if self.user_can_authenticate(user) else None
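For Django to actually use this backend, it has to be listed in settings. The module path below assumes the class lives in accounts/backends.py, which is just an assumption about the project layout:

# settings.py
AUTHENTICATION_BACKENDS = [
    # CachedUserBackend subclasses ModelBackend, so it also handles
    # normal username/password authentication.
    'accounts.backends.CachedUserBackend',  # assumed module path for the class above
]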
Briefly: select_related() pulls one-to-one and many-to-one (ForeignKey) related data into the same query; prefetch_related() serves the same purpose for many-to-many and reverse foreign-key relationships (here one user profile can have many notifications). Q objects let you filter with AND/OR conditions and in more complex scenarios. Count is used with annotate() to attach unread_notifications_count to the user (meaning we can access it as user.unread_notifications_count), and filter=Q(...) restricts which related rows are counted, here userprofile__notifications__read=False. If the double underscores are confusing, look back at the models, their foreign keys and related_name values. How amazing is this: in a single query you can do this much.
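Since request.user is resolved through CachedUserBackend.get_user() for session-authenticated requests, the annotated count comes along for free. As a quick illustration, a hypothetical context processor (the function name is an assumption, not code from the project) could expose it for the header badge:

def header_context(request):
    # request.user was returned by CachedUserBackend.get_user(), so the
    # annotated attribute is available without an extra query; fall back to 0
    # for anonymous users.
    unread = getattr(request.user, 'unread_notifications_count', 0)
    return {'unread_notifications_count': unread}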
Now, how do we handle updates? For example, when the user changes their profile picture, or a notification is created, how do we update the cache? I will walk you through the strategy step by step, and we will see which problems have to be considered.
The strategy:
Set the cached data together with a timestamp.
If the update concerns only one user (or very few), update that user’s cache directly, or simply delete its key (see the sketch after this list).
If the update concerns many users, set an extra cache key that marks the cache as needing a reload; while retrieving the data, if a reload is required, set fresh cache data.
If no reload is required, use the existing cached data.
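For the simplest single-user case, the profile picture update mentioned above, it is often enough to just delete that user’s key so the backend rebuilds it on the next request. A minimal sketch, assuming a profile-update view roughly like this (the view name, URL name and form handling are assumptions):

from django.contrib.auth.decorators import login_required
from django.core.cache import cache
from django.shortcuts import redirect


@login_required(login_url='login')
def update_profile_pic(request):
    profile = request.user.userprofile
    profile.profile_pic = request.FILES['profile_pic']  # assumed field name in the upload form
    profile.save()
    # Drop the cached user so CachedUserBackend re-queries and re-caches it
    # with the new picture on the next request.
    cache.delete(f"user_{request.user.id}")
    return redirect('profile')  # assumed URL name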
That covers the simple cases. In our application, when a user’s request to join a classroom is accepted or rejected, a notification is created for that user. Since this concerns a single user, we can update that user’s cache directly:
# here student_request is a userProfile object;
# StudentClassroom is a bookkeeping model of the project not shown here
StudentClassroom.objects.create(classroom=Class, student=student_request)
Notification.objects.create(
    user=student_request,
    title="Classroom Join request accepted!",
    content=f"""Hello {student_request.user.username} \n Your request to join classroom:
    {Class.name} was accepted by the tutor""",
    link=f'/classroom/{Class.id}'
)
cache_key = f"user_{student_request.user.id}"
_student = cache.get(key=cache_key, default=None)
if _student:
    _student.unread_notifications_count += 1
    cache.set(cache_key, _student, timeout=180)
But when a classroom is deleted and we have to notify all of its users, rather than updating each user’s cache at the view level, we set a cache key that marks the user data as needing a reload, and we add a timestamp to the cached data.
from django.utils import timezone  # now needed for the timestamps


class CachedUserBackend(ModelBackend):
    def get_user(self, user_id):
        cache_key = f"user_{user_id}"
        user, last_reload_at = cache.get(key=cache_key, default=(None, None))
        user_data_reload, reload_data_at = cache.get(key='reload_user_data', default=(None, None))
        if not user or (user_data_reload and reload_data_at > last_reload_at):
            try:
                user = User.objects.select_related('userprofile').prefetch_related(
                    'userprofile__notifications').annotate(
                    unread_notifications_count=Count(
                        'userprofile__notifications', filter=Q(userprofile__notifications__read=False)
                    )).only(
                    'firstname', 'lastname', 'username', 'is_active', 'email', 'role', 'password',
                    'userprofile__profile_pic',
                    "userprofile__notifications__title", "userprofile__notifications__content", "userprofile__notifications__link",
                    "userprofile__notifications__timestamp", "userprofile__notifications__assignment_id", "userprofile__notifications__read"
                ).get(pk=user_id)
                cache.set(cache_key, (user, timezone.now()), timeout=180)
            except User.DoesNotExist:
                return None
        return user if self.user_can_authenticate(user) else None
The change is that we now cache a tuple, (user, last_reload_at), and we also check whether the key reload_user_data exists. We set that key whenever the user data needs to be reloaded; its value is also a tuple, (True/False, timestamp at which the reload was requested). If it exists and is True, we compare its timestamp to the time the user’s cache entry was last set. The reload_user_data key stays in the cache for 3 (or n) minutes, but we don’t want to re-query on every request during that window: we reload once, store the fresh data together with the current timestamp, and from then on the comparison against reload_data_at tells us the entry is already newer.
Here’s how we set the key that tells all users’ caches to reload:
cache.set('reload_user_data', (True, timezone.now()), timeout=180)
Now, here’s one problem. When a single classroom is deleted, only the users who are in that classroom should reload their cache. So the key has to carry enough information that only those students’ data gets reloaded: besides True and the timestamp, we also store the classroom in the reload_user_data key, and in get_user() we check whether the user belongs to that classroom.
# In the view we set the cache key; classroom here is a Classroom object
cache.set('reload_user_data', (True, timezone.now(), classroom), timeout=180)

# In the cache backend, while retrieving the user data
class CachedUserBackend(ModelBackend):
    def get_user(self, user_id):
        cache_key = f"user_{user_id}"
        user, last_reload_at = cache.get(key=cache_key, default=(None, None))
        user_data_reload, reload_data_at, Class = cache.get(key='reload_user_data', default=(None, None, None))
        # Guard against Class being None when no reload has been requested
        user_in_class = bool(user and Class and user.userprofile in Class.students.all())
        if not user or (user_in_class and user_data_reload and reload_data_at > last_reload_at):
            # same code as before.......
Now take a look and think about this code. What have we achieved, and at what cost? Retrieving the user now involves fetching the classroom and its students and checking whether the user is in that list. We have created extra overhead here just to retrieve the user. That, my friend, is why we need strategic consideration of what to cache and how. From the above we can conclude that it is not feasible to cache the notifications. First I convinced you (maybe, or maybe not) that caching notifications was a good idea; then, by implementing it, we found out it isn’t. You will hit such scenarios throughout the application once you start implementing caching. In my experience the only way to know what to cache and how is to first understand the models and view logic of the application very well, then create a draft caching plan. While implementing it, make sure every piece of view logic is considered and that your cached data is consistently updated when required. If your cached data is updated very frequently, or keeping it fresh requires extra operations that cancel out the efficiency we were after in the first place, then do not cache that data. You only discover these problems and scenarios by implementing; the only way is to get your hands dirty on a project. So in the example above, the conclusion is to remove the notification data from the cache, and also to remove the code that directly updates the user’s cache when a notification is created.
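With notifications dropped from the cached user, get_user() would shrink to roughly the following. This is a sketch following the decision above, not code from the original project:

from django.contrib.auth.backends import ModelBackend
from django.core.cache import cache

from accounts.models import User


class CachedUserBackend(ModelBackend):
    def get_user(self, user_id):
        cache_key = f"user_{user_id}"
        user = cache.get(key=cache_key, default=None)
        if not user:
            try:
                # Only the user and its one-to-one profile are cached now;
                # notifications are left out and queried on demand instead.
                user = User.objects.select_related('userprofile').only(
                    'firstname', 'lastname', 'username', 'is_active', 'email', 'role', 'password',
                    'userprofile__profile_pic',
                ).get(pk=user_id)
                cache.set(cache_key, user, timeout=180)
            except User.DoesNotExist:
                return None
        return user if self.user_can_authenticate(user) else None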
So whenever you add a cache key or remove one, do not forget to add or remove the code that is directly tied to that cache. Now, as I mentioned at the start, checking whether a user is a member of a classroom can be cached, so let’s cache it. Here is the function that checks whether the user is a participant of the classroom; the Q object applies an OR condition, checking whether the user’s profile belongs to the classroom either as tutor or as a student.
def check_classroom_participant(user, id):
    is_participant = Classroom.objects.only('id').filter(
        Q(tutor=user.userprofile) | Q(students=user.userprofile), id=id
    ).exists()
    return is_participant
Instead of performing the database operation we will retrieve is_participant directly from the cache. What do we cache? Whether the user is a participant. What should the key be? It must be unique per user per classroom, so the two unique factors are the user and the classroom, and the key can be something like f"classroom_participant_{user_id}_{classroom_id}". If any other scenario comes up, it will only become clear once we implement caching from the initial draft, as discussed above.
def check_classroom_participant(user, id):
    cache_key = f"classroom_participant_{user.id}_{id}"
    is_participant = cache.get(cache_key)
    if is_participant is None:
        is_participant = Classroom.objects.only('id').filter(
            Q(tutor=user.userprofile) | Q(students=user.userprofile), id=id
        ).exists()
        cache.set(cache_key, is_participant, timeout=180)
    return is_participant
That was pretty straightforward. Now we have to consider all the scenarios where this data changes. Classroom participation changes when the user leaves the classroom or is removed from it, so wherever that happens in the view logic, we should update this cache entry. What about when the classroom itself is deleted? Here is the same problem as before: just as with notifications we had to touch the cache of every student in the classroom, here we would have to invalidate the is_participant key for all of them. Is it the same situation? This, my friend, is where understanding the business logic of your application comes in. Sending a notification happens frequently, say every minute, every few minutes, every ten minutes, or even every hour. Deleting a classroom is a rare operation. In a hypothetical world a tutor could create 100 classrooms and delete them all, but let’s be honest about the requirements of the application: classroom deletion happens on a scale of days, not minutes or hours. So when a classroom is deleted, we can accept the extra overhead of deleting the cache of every classroom participant. In the notification case it was not a good idea, because every ten seconds, minute or even hour we would have had to update the cache of every student; here, the extra overhead happens perhaps once a week or month, in exchange for faster performance and less database load the rest of the time.
student_cache_keys = [
    f'classroom_participant_{student.user.id}_{id}' for student in Class.students.all()
]
for key in student_cache_keys:
    cache.delete(key)
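The same invalidation applies to the single-student cases mentioned above (a student leaving or being removed), where only one key has to go, and Django’s cache API also offers delete_many() for the bulk case. Both sketches below assume the same key scheme as above:

# When one student leaves or is removed from the classroom
cache.delete(f'classroom_participant_{student.user.id}_{classroom.id}')

# Bulk variant of the classroom-delete invalidation above
cache.delete_many(student_cache_keys)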
What I wanted to teach you is what caching is fundamentally, how to implement it in Django, which scenarios to consider, how to build a caching strategy and which factors affect it. I hope you have understood it. Finally, I want you to start implementing caching in your own Django project; that is the only way to understand practically what I described in this article. Pick a complex project with a more involved database design and implement caching there. Implementing caching in a small project with simple scenarios, like caching a user’s posts or articles, won’t make you understand the challenges behind real caching. So start now and learn practically. I hope I could help you; otherwise the official documentation and ChatGPT are your friends.
Thank you for reading