Project #1 - Buffer Pool Notes

Notes of the 1st project of the course 15-445/645 Database Systems, 2020 Fall version.

Introduction

This project is to implement the buffer pool using some of its associated components as described in Lecture 5 of this course. The LRU replacer also needs to be implemented.

Components include:

Interpretation of specification

We have a set of operations, all of which could potentially change the state of the components.

Implementation choices

In the provided code

To represent the buffer pool, one might be tempted to make an array of frames, and each array element stores a pointer to the pages. This might be logically straightforward, but the resulting buffer pool might be scattered in memory.

In the provided code, the buffer pool was represented as an array of pages:

pages_ = new Page[pool_size_];

which ensures a consecutive memory space for the buffer pool.

This is also emphasized in the project description:

The BufferPoolManager will reuse the same Page object to store data as it moves back and forth to disk. This means that the same Page object may contain a different physical page throughout the life of the system. The Page object’s identifier (page_id) keeps track of what physical page it contains.

Choices left for us

For some parts of the project, it was left to us to decide on which data structures to use. It turned out that a poor choice could lead to an extremely poor performance and thus not passing the Gradescope tests.

The following test case was designed to test the behaviour of the buffer pool manager with a large buffer size. In this test, the buffer pool size is set to 100k. 50k new pages are created and unpinned immediately, followed by creating another 100k new pages. Since there is not enough space to store the second half of the 100k pages, the unpinned 50k new pages would be replaced.

TEST(BufferPoolManagerTest, HugeBufferTest) {
 const std::string db_name = "test.db";
 int buffer_pool_size = 100000;
 
 auto *disk_manager = new DiskManager(db_name);
 auto *bpm = new BufferPoolManager(buffer_pool_size, disk_manager);
 
 page_id_t page_id_temp;
 
 for (int i = 0; i < buffer_pool_size / 2; ++i) {
   EXPECT_NE(nullptr, bpm->NewPage(&page_id_temp));
   EXPECT_EQ(i, page_id_temp);
   EXPECT_TRUE(bpm->UnpinPage(i, false));
 }
 
 for (int i = 0; i < buffer_pool_size; ++i) {
   EXPECT_NE(nullptr, bpm->NewPage(&page_id_temp));
 }
 
 // Shutdown the disk manager and remove the temporary file we created.
 disk_manager->ShutDown();
 remove("test.db");
 
 delete bpm;
 delete disk_manager;
}

During the whole process, the Pin, Unpin, and Victim methods of the replacer would be called multiple times.

The following code shows the initial implementation of these methods. In the header file, we use a std::list to represent the unpinned frames:

std::list<frame_id_t> unpinned_frames_;

In the cpp file:

bool LRUReplacer::Victim(frame_id_t *frame_id) {
 if (unpinned_frames_.empty()) {
   return false;
 }
 *frame_id = unpinned_frames_.front();
 unpinned_frames_.pop_front();
 return true;
}
 
void LRUReplacer::Pin(frame_id_t frame_id) {
 unpinned_frames_.remove(frame_id);
}
 
void LRUReplacer::Unpin(frame_id_t frame_id) {
 // If the frame isn't on the unpinned frames list yet, we add it to the back.
 auto result = std::find(std::begin(unpinned_frames_), std::end(unpinned_frames_), frame_id);
 if (result == std::end(unpinned_frames_)) {
   unpinned_frames_.emplace_back(frame_id);
 }
}

For Unpin, we need to check if the unpinned frame is already on the list to avoid having duplicated frames. We use std::find to iterate through the list. For Pin, we could have used the same check, but the way that the remove method of std::list works ensures that it won’t be a problem if the element doesn’t exist.

The execution time was 93 seconds.

[ RUN      ] BufferPoolManagerTest.HugeBufferTest
[       OK ] BufferPoolManagerTest.HugeBufferTest (93522 ms)

The improved implementation is as follows:

In the header file, add:

std::set<frame_id_t> frames_set_;

In the cpp file:

bool LRUReplacer::Victim(frame_id_t *frame_id) {
 if (unpinned_frames_.empty()) {
   return false;
 }
 *frame_id = unpinned_frames_.front();
 unpinned_frames_.pop_front();
 frames_set_.erase(*frame_id);
 assert(unpinned_frames_.size() == frames_set_.size());
 return true;
}
 
void LRUReplacer::Pin(frame_id_t frame_id) {
 if (frames_set_.find(frame_id) != frames_set_.end()) {
   unpinned_frames_.remove(frame_id);
   frames_set_.erase(frame_id);
 }
 assert(unpinned_frames_.size() == frames_set_.size());
}
 
void LRUReplacer::Unpin(frame_id_t frame_id) {
 // If the frame isn't on the unpinned frames list yet, we add it to the back.
 if (frames_set_.find(frame_id) == frames_set_.end()) {
   unpinned_frames_.emplace_back(frame_id);
   frames_set_.insert(frame_id);
 }
 assert(unpinned_frames_.size() == frames_set_.size());
}

Alongside the list of the frames, we use an auxiliary set to contain the frames as well. We add and remove elements to the list and set simultaneously, so their sizes are always the same. The key difference between the two implementations is that in the second implementation, we check if a frame is in the free list by using the find method of the frame_set_. Also, we check if an element is in the list before removing it. The execution time of the second implementation is:

[ RUN      ] BufferPoolManagerTest.HugeBufferTest
[       OK ] BufferPoolManagerTest.HugeBufferTest (760 ms)

We achieved a 100x speedup!

How Unpin gets faster?

A std::list is usually implemented as a doubly-linked list. Because in our test code we don’t have duplicated frames, every time the iteration would need to check the entire existing unpinned frames list to come to the conclusion that result == std::end(unpinned_frames_. Therefore, the time complexity is O(N^2). Unlike a std::list, a std::set is usually implemented as a red-black tree, which provides O(log(N)) search time. Therefore, the time complexity of Unpin becomes O(Nlog(N)).

How Pin gets faster?

In the initial implementation, remove actually does a comparison between the to-be-removed value and each element in the list, and we rely on this fact to throw away the “is-in-list” check. The time complexity is O(N^2). In the improved implementation, we do the check using the std::set (O(log(N)) complexity). Because in the test case there is actually no duplicate frames in the list of the unpinned frames, the if (frames_set_.find(frame_id) != frames_set_.end()) statement doesn’t hold, thus the expensive operation unpinned_frames_.remove(frame_id) is not executed. Overall, the time complexity of Pin is reduced to O(Nlog(N)).

Tips

Understanding the specs is important. In my case, the time spent on understanding the specs was more than the time spent on writing the code. Sources of specs include lecture notes/videos, project description, comments in the code, and sample test cases.