Project #1 - Buffer Pool Notes

February 7, 2021

Notes of the 1st project of the course 15-445/645 Database Systems, 2020 Fall version.

Introduction

This project is to implement the buffer pool using some of its associated components as described in Lecture 5 of this course. The LRU replacer also needs to be implemented.

Components include:

A Buffer Pool Manager, which manages/maintains the following components:
- A Buffer Pool: the actual buffer pool where data is cached in memory.
- A Page Table: a mapping between pages and the frames they are put in.
- A Replacer: responsible for maintaining unpinned frames which are candidates for being replaced, for example, when a page needs to be created or a page needs to be fetched from disk.
- A Free List: list of free frames which can be used as the first choice for a new page or page fetched from disk.
A Disk Manager: handles disk I/O. Operations include: allocate/deallocate and read/write.
A Log Manager: is not used in this project.

Interpretation of specification

We have a set of operations, all of which could potentially change the state of the components.

NewPage: creates a new page in the buffer pool.
- Buffer Pool: NewPage is only possible when there is space in the buffer pool. (Either there is a free frame or there is a frame which can be safely replaced.)
- Page Table: If a free frame is used to host the new page, this (frame, page) pair needs to be added to the page table; if the new page replaces a victim page in the buffer pool, the page table needs to be updated accordingly.
- Replacer: The replacer should mark the frame where the new page is on as not replaceable (i.e. not on the list of replaceable frames).
- Free List: If the new page is put on a free frame, this frame needs to be removed from the free list.
- Disk: If the new page replaces another one in the buffer pool, the replaced page needs to be persisted on disk if it is dirty.
DeletePage: deletes a page from the buffer pool.
- Buffer Pool: DeletePage has an effect only when:
  1. The requested page exists in the buffer pool;
  2. The requested page isn’t used by someone else (which can be known by checking the pin count).
- Page Table: If the requested page can be safely deleted, its entry needs to be removed from the page table.
- Replacer: If the requested page can be safely deleted, the frame where it’s put in needs to be removed from the replacer (if it’s there). The frame goes to the free list. A frame can’t be in the replacer list and the free list at the same time.
- Free List: If the requested page can be safely deleted, the frame where it’s put in needs to be added to the free list.
- Disk: If the requested page can be safely deleted, it has to be deallocated on the disk as well.
FetchPage: fetches the requested page from the buffer pool.
- Buffer Pool: If the requested page is already in the Buffer Pool, it can be fetched immediately. Otherwise, the page needs to be read from disk to the buffer pool, which is only possible when there is space in the buffer pool. (Either there is a free frame or there is a frame which can be safely replaced.)
- Page Table: The page table needs to be updated if the page is fetched from disk. The rules for updating the page table are the same as in the case of making a new page.
- Replacer: The replacer needs to be updated if the page is fetched from disk. The rules for updating the replacer are the same as in the case of making a new page.
- Free List: If the requested page is fetched from disk and put on a free frame, this frame needs to be removed from the free list.
- Disk: Same as in the case of making a new page.
UnpinPage: unpins the target page from the buffer pool.
- Buffer Pool: The function unpin does two things: decrementing the pin count of the requested page by 1, and potentially setting the dirty flag. Unpin can be done only when the page is in the buffer pool, and has a positive pin count. The dirty flag is set to true as long as one call to unpin has set it true.
- Page Table: No change.
- Replacer: If the pin count of the requested page has been decremented to 0, the replacer needs to mark the frame containing the unpinned page as replaceable.
- Free List: No change.
- Disk: No change.
FlushPage: flushes the target page to disk.
- Buffer Pool: No need to reset the page in the buffer pool! It doesn’t hurt to keep it in the buffer pool as long as there is space for a new/fetched page. In addition, if the flushed page needs to be read again, then it’s better to have it in memory instead of fetching it from the disk. Only the dirty flag needs to be reset if the page is indeed flushed to disk.
- Page Table: No change.
- Replacer: No change.
- Free List: No change.
- Disk: If the requested page exists and has a valid page ID, it needs to be flushed to disk if dirty. Doing the dirty flag check will prevent unnecessary writes to disk, but we must make sure that the dirty flag is indeed set when we do other operations. For instance, if we call NewPage and change the new page’s data, we need to explicitly call UnpinPage(page_id, true) to set the dirty flag. On the other hand, we can give away this dirty flag check, i.e., every time we call FlushPage the page will be flushed to disk, even if the data on disk is the same as in memory.

Implementation choices

In the provided code

To represent the buffer pool, one might be tempted to make an array of frames, and each array element stores a pointer to the pages. This might be logically straightforward, but the resulting buffer pool might be scattered in memory.

In the provided code, the buffer pool was represented as an array of pages:

pages_ = new Page[pool_size_];

which ensures a consecutive memory space for the buffer pool.

This is also emphasized in the project description:

The BufferPoolManager will reuse the same Page object to store data as it moves back and forth to disk. This means that the same Page object may contain a different physical page throughout the life of the system. The Page object’s identifier (page_id) keeps track of what physical page it contains.

Choices left for us

For some parts of the project, it was left to us to decide on which data structures to use. It turned out that a poor choice could lead to an extremely poor performance and thus not passing the Gradescope tests.

The following test case was designed to test the behaviour of the buffer pool manager with a large buffer size. In this test, the buffer pool size is set to 100k. 50k new pages are created and unpinned immediately, followed by creating another 100k new pages. Since there is not enough space to store the second half of the 100k pages, the unpinned 50k new pages would be replaced.

TEST(BufferPoolManagerTest, HugeBufferTest) {
 const std::string db_name = "test.db";
 int buffer_pool_size = 100000;
 
 auto *disk_manager = new DiskManager(db_name);
 auto *bpm = new BufferPoolManager(buffer_pool_size, disk_manager);
 
 page_id_t page_id_temp;
 
 for (int i = 0; i < buffer_pool_size / 2; ++i) {
   EXPECT_NE(nullptr, bpm->NewPage(&page_id_temp));
   EXPECT_EQ(i, page_id_temp);
   EXPECT_TRUE(bpm->UnpinPage(i, false));
 }
 
 for (int i = 0; i < buffer_pool_size; ++i) {
   EXPECT_NE(nullptr, bpm->NewPage(&page_id_temp));
 }
 
 // Shutdown the disk manager and remove the temporary file we created.
 disk_manager->ShutDown();
 remove("test.db");
 
 delete bpm;
 delete disk_manager;
}

During the whole process, the Pin, Unpin, and Victim methods of the replacer would be called multiple times.

The following code shows the initial implementation of these methods. In the header file, we use a std::list to represent the unpinned frames:

std::list<frame_id_t> unpinned_frames_;

In the cpp file:

bool LRUReplacer::Victim(frame_id_t *frame_id) {
 if (unpinned_frames_.empty()) {
   return false;
 }
 *frame_id = unpinned_frames_.front();
 unpinned_frames_.pop_front();
 return true;
}
 
void LRUReplacer::Pin(frame_id_t frame_id) {
 unpinned_frames_.remove(frame_id);
}
 
void LRUReplacer::Unpin(frame_id_t frame_id) {
 // If the frame isn't on the unpinned frames list yet, we add it to the back.
 auto result = std::find(std::begin(unpinned_frames_), std::end(unpinned_frames_), frame_id);
 if (result == std::end(unpinned_frames_)) {
   unpinned_frames_.emplace_back(frame_id);
 }
}

For Unpin, we need to check if the unpinned frame is already on the list to avoid having duplicated frames. We use std::find to iterate through the list. For Pin, we could have used the same check, but the way that the remove method of std::list works ensures that it won’t be a problem if the element doesn’t exist.

The execution time was 93 seconds.

[ RUN      ] BufferPoolManagerTest.HugeBufferTest
[       OK ] BufferPoolManagerTest.HugeBufferTest (93522 ms)

The improved implementation is as follows:

In the header file, add:

std::set<frame_id_t> frames_set_;

In the cpp file:

bool LRUReplacer::Victim(frame_id_t *frame_id) {
 if (unpinned_frames_.empty()) {
   return false;
 }
 *frame_id = unpinned_frames_.front();
 unpinned_frames_.pop_front();
 frames_set_.erase(*frame_id);
 assert(unpinned_frames_.size() == frames_set_.size());
 return true;
}
 
void LRUReplacer::Pin(frame_id_t frame_id) {
 if (frames_set_.find(frame_id) != frames_set_.end()) {
   unpinned_frames_.remove(frame_id);
   frames_set_.erase(frame_id);
 }
 assert(unpinned_frames_.size() == frames_set_.size());
}
 
void LRUReplacer::Unpin(frame_id_t frame_id) {
 // If the frame isn't on the unpinned frames list yet, we add it to the back.
 if (frames_set_.find(frame_id) == frames_set_.end()) {
   unpinned_frames_.emplace_back(frame_id);
   frames_set_.insert(frame_id);
 }
 assert(unpinned_frames_.size() == frames_set_.size());
}

Alongside the list of the frames, we use an auxiliary set to contain the frames as well. We add and remove elements to the list and set simultaneously, so their sizes are always the same. The key difference between the two implementations is that in the second implementation, we check if a frame is in the free list by using the find method of the frame_set_. Also, we check if an element is in the list before removing it. The execution time of the second implementation is:

[ RUN      ] BufferPoolManagerTest.HugeBufferTest
[       OK ] BufferPoolManagerTest.HugeBufferTest (760 ms)

We achieved a 100x speedup!

How `Unpin` gets faster?

A std::list is usually implemented as a doubly-linked list. Because in our test code we don’t have duplicated frames, every time the iteration would need to check the entire existing unpinned frames list to come to the conclusion that result == std::end(unpinned_frames_. Therefore, the time complexity is O(N^2). Unlike a std::list, a std::set is usually implemented as a red-black tree, which provides O(log(N)) search time. Therefore, the time complexity of Unpin becomes O(Nlog(N)).

How `Pin` gets faster?

In the initial implementation, remove actually does a comparison between the to-be-removed value and each element in the list, and we rely on this fact to throw away the “is-in-list” check. The time complexity is O(N^2). In the improved implementation, we do the check using the std::set (O(log(N)) complexity). Because in the test case there is actually no duplicate frames in the list of the unpinned frames, the if (frames_set_.find(frame_id) != frames_set_.end()) statement doesn’t hold, thus the expensive operation unpinned_frames_.remove(frame_id) is not executed. Overall, the time complexity of Pin is reduced to O(Nlog(N)).

Tips

Understanding the specs is important. In my case, the time spent on understanding the specs was more than the time spent on writing the code. Sources of specs include lecture notes/videos, project description, comments in the code, and sample test cases.