This page contains a collection of questions and answers received through email. Questions are in reverse chronological order.

VAX VM (10/23/2010)

> I am thinking about the question you raised in the class that what is the maximum main
> memory size in VAX, and got some implications. My observation is that the PFN field in
> the system PTE is 21-bit long, which is exactly the same as the VPN field in the system
> virtual address. If all 21 bits in system PTE are used, that means all the system pages
> (1G Bytes) could be active in physical memory at the same time, which shouldn’t be the
> case (virtual memory size == physical memory size). Then I took a look at the notes I
> took in class, and interestingly found that the size of PFN field I wrote is actually
> 20-bit long. I think that should be what you wrote on the blackboard in the lecture.
> If this is not a mistake, I somehow can explain the above weird situation. For system
> PTE, only 20 bits out of 21 bits are used to indicate the physical page number. After
> locating a physical address, and if that contains PTE of the P0/P1 page, the bit-20 of
> PFN field would be used. In this sense, the maximum physical memory size is 1GB, in
> which 512MB are for system pages and remaining 512MB are for P0/P1 pages. Am I right?

I made a mistake — 21 bits are used to represent each VPN.

Honestly, I don’t really remember the question I asked — it’s hard to both lecture and remember what happened. I believe that I asked about the specific example on the board that only had ~16 pages and not the actual full VAX machine. What I was getting at is that the page table will take up significant space. By the way, the most expensive VAX only had 512MB of memory — it was a technology and cost issue.
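
Just to put a number on “significant space”, here is a quick back-of-the-envelope sketch (assuming the usual VAX parameters of 512-byte pages, 4-byte PTEs, and 1GB per region, which is where the 21-bit VPN comes from):

    # Rough VAX page-table sizing (sketch; standard parameters assumed).
    PAGE_SIZE   = 512        # bytes per page -> 9-bit page offset
    PTE_SIZE    = 4          # bytes per PTE
    REGION_SIZE = 2**30      # 1GB per region (P0, P1, system)

    pages_per_region    = REGION_SIZE // PAGE_SIZE     # 2**21 pages -> 21-bit VPN
    pt_bytes_per_region = pages_per_region * PTE_SIZE  # PTE storage per region

    print(pages_per_region)      # 2097152
    print(pt_bytes_per_region)   # 8388608 bytes = 8MB per fully-mapped region

So a single fully mapped 1GB region needs 8MB of PTEs, which is part of why VAX keeps the per-process page tables in (pageable) system virtual memory rather than pinning them in physical memory.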

> Another question about VAX TLB. If a VA is in P0 address, is it true that only the
> process-specific TLB would be involved? That is to say, VA in P0 doesn’t need to be
> first translated to system VA, and then goes through the system TLB to get its PTE, and then
> get its PA. One look-up in process TLB is enough? Besides, is it possible to perform
> TLB look-up and page-table look-up at the same time? If that is possible, and the cache
> is virtually addressed, can we even perform cache, TLB, and page-table look-ups all at the
> same time? The flow-chart in the lecture said sequential, but I think parallel should
> be feasible.

Yes, the TLB will give you the actual physical address, not an address in system space. Yes, you can do everything in parallel, but typically memory operations are not initiated until we know we need them.

Continued …

> If 21 bits are used for VPN, does this mean that all virtual system pages can be present
> in the physical memory at the same time because 21 bits are used to represent VPN in
> virtual address? Assuming this is true, we need at least 2^21 * 512 = 1GB. If the
> maximum physical memory is 512MB, does this mean that not all bits in the 21-bit PFN
> field in system PTE are used?

Indeed, there are more bits available than needed for many configurations of VAX. This was true of x86 at one time in the past as well. When designing an ISA you do want to think a little bit into the future, but still not waste too many resources unnecessarily.

Question about DIMMs (10/13/2010)

> I have not understood the difference between external banks/channels and ranks. Thus I
> have tried to answer the following. Please let me know if it is correct, if not where am
> I going wrong?
> In the document DRAM Reference uploaded today on slide 10:

> How many chips per rank? 8
> How many bytes per rank? 512MB
> How many ranks per module? 2
> I have considered the following arrangement:

> 128M x 64 => 64Mx8 * 16 pieces => 4 internal banks, 2 external banks

> Each chip = 4 banks= 64Mx8 * 4 banks
> Hence for 64b interface, 1 rank= 8 chips = 64Mx8 * 8 pieces
> 2 external banks = 2 ranks = 2* (64Mx8 * 8)= 128M x 64

Your analysis above is correct. The particular table on that slide says nothing about internal banks (other than the fact that it’s DDR). Details on internal banks are not actually necessary to figure out the rank information.
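
For reference, here is the same arithmetic written out as a small sketch (using the 64Mx8 chip and 128Mx64 module numbers from the question):

    # Rank arithmetic for a 128M x 64 DDR module built from 64M x 8 chips (sketch).
    chip_depth, chip_width  = 64 * 2**20, 8     # each chip: 64M locations x 8 bits
    module_depth, bus_width = 128 * 2**20, 64   # module: 128M locations x 64 bits

    chips_per_rank   = bus_width // chip_width                        # 8 chips side by side
    bytes_per_rank   = chips_per_rank * chip_depth * chip_width // 8  # 512MB
    ranks_per_module = (module_depth * bus_width) // (chips_per_rank * chip_depth * chip_width)

    print(chips_per_rank)     # 8
    print(bytes_per_rank)     # 536870912 bytes = 512MB
    print(ranks_per_module)   # 2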

Question about LRU (10/11/2010)

> I don’t quite understand what the LRU way means. Is there supposed to be a
> dedicated way in each set containing the LRU cache line? Or is it only a logical
> concept?

LRU is a logical concept. In fact, to track it you need to maintain the relative usage order of all lines in the set (like a stack).
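
For one set, a minimal sketch of that stack-like bookkeeping (illustrative only, not how hardware literally stores it):

    # Per-set LRU bookkeeping as a recency stack (conceptual sketch).
    class LRUSet:
        def __init__(self, num_ways):
            self.stack = list(range(num_ways))   # index 0 = MRU, last = LRU

        def touch(self, way):
            self.stack.remove(way)      # pull the accessed way out...
            self.stack.insert(0, way)   # ...and push it back on top (MRU)

        def lru_way(self):
            return self.stack[-1]       # the victim to replace on a miss

    s = LRUSet(4)
    for way in [0, 1, 2, 0]:
        s.touch(way)
    print(s.lru_way())   # 3 -- the way that has gone unused the longest

Real caches usually approximate this with a few status bits per set (e.g., pseudo-LRU), since keeping an exact ordering gets expensive at high associativity.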

Question regarding Conflict miss and Capacity miss (10/11/2010)

> If there are some empty lines in the cache at the time one of the filled up lines was
> replaced and that results in a miss, then it is clearly a conflict miss.
> But if the cache is filled completely and a line is replaced, it could be because of
> capacity or limited associativity. How would the miss of this line be classified?

You have to “run” two caches at the same time. One would be fully associative and the other would have the parameters you are testing. If the miss occurs in both, it’s capacity or compulsory. If the miss only occurs in the non-fully-associative cache, then it’s a conflict miss.
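
A sketch of what that two-cache experiment looks like (hypothetical code; the cache under test is direct-mapped here, and both caches hold the same number of blocks):

    # Classify misses by simulating the cache under test next to a fully
    # associative LRU cache of the same capacity (sketch).
    from collections import OrderedDict

    def classify(trace, num_lines):
        dm   = {}             # cache under test: direct-mapped, one block per set
        fa   = OrderedDict()  # fully associative cache with LRU replacement
        seen = set()
        stats = {"hit": 0, "compulsory": 0, "capacity": 0, "conflict": 0}
        for block in trace:
            hit_dm = dm.get(block % num_lines) == block
            hit_fa = block in fa
            if hit_fa:
                fa.move_to_end(block)            # refresh LRU order
            else:
                fa[block] = True
                if len(fa) > num_lines:
                    fa.popitem(last=False)       # evict the LRU block
            if hit_dm:
                stats["hit"] += 1
            elif block not in seen:
                stats["compulsory"] += 1         # first reference ever
            elif hit_fa:
                stats["conflict"] += 1           # only limited associativity missed
            else:
                stats["capacity"] += 1           # full associativity missed too
            dm[block % num_lines] = block
            seen.add(block)
        return stats

    print(classify([0, 4, 0, 4, 1, 2, 3, 5, 0], num_lines=4))
    # {'hit': 0, 'compulsory': 6, 'capacity': 1, 'conflict': 2}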

> Also, I was wondering how these types of misses are used to analyse the cache
> performance, as the numbers would spread out in different categories (compulsory,
> capacity, conflict) depending on the running application program. Still there would be
> some general expectation on these numbers. That got me thinking if this kind of analysis
> would make more sense if cache is used with ASIC processors and SOCs where the
> application programs have limited functionality. Am I thinking on the right track?

Even though CPUs are considered “general purpose”, they are still designed to run well on a certain set of applications — it’s just that this set is very very large. When designing CPUs, we run lots and lots of tests and applications through to figure out what parameters to choose. Associativity is expensive, so it’s nice to be able to evaluate it and figure out what’s the best way of achieving a high hit rate. Understanding what the sources of inefficiencies are is very important.

Another VM question (10/11/2010)

> 1. In x86, the page table for each process should have all the PTEs for the whole
> virtual address space, right?

x86 is hierarchical. The “page table” has a Page Directory level, which is always resident for a running process. Each entry in the Page Directory maps to the start of a Page Table, which is really just a portion of the full “page table”. Each entry in a Page Table is a PTE. Typically only a subset of the Page Tables are resident. Also, Page Tables are created on the fly as needed, so you can have lots of holes in the address space.
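
For concreteness, a sketch of how a 32-bit linear address is carved up in the classic 2-level x86 scheme (4KB pages; 10 bits of Page Directory index, 10 bits of Page Table index, 12 bits of offset):

    # Splitting a 32-bit x86 linear address (4KB pages, 2-level scheme).
    def split(va):
        pd_index = (va >> 22) & 0x3FF   # top 10 bits: which Page Directory entry
        pt_index = (va >> 12) & 0x3FF   # next 10 bits: which PTE in that Page Table
        offset   = va & 0xFFF           # low 12 bits: byte within the 4KB page
        return pd_index, pt_index, offset

    print(split(0x08048123))   # (32, 72, 291) -- i.e. PDE 32, PTE 72, offset 0x123

Each Page Directory entry covers a 4MB chunk of the address space, so a process only needs real Page Tables for the chunks it actually touches; the remaining entries are simply marked not-present, which is where the holes come from.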

> 2. In VAX, since the number of PTEs (i.e. PxLR) can dynamically vary, when the PxLR
> increases, will the OS create a new page table again at a different location? (since
> there might not be space to add the new entries in the page table contiguous to the
> original page table)

Yes. All PTEs for a segment in VAX have to be contiguous. If there isn’t enough space the page table needs to be relocated.
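
In other words, growing the table can amount to a realloc-style copy; a tiny sketch (illustrative only):

    # Growing a VAX-style contiguous page table (conceptual sketch).
    def grow_page_table(old_ptes, new_length):
        # No room to extend in place, so relocate: allocate a larger
        # contiguous region, copy the existing PTEs over, and then the OS
        # updates the base/length registers (P0BR/P0LR) to point at it.
        new_ptes = [0] * new_length
        new_ptes[:len(old_ptes)] = old_ptes
        return new_ptes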

Couple of questions about VM (10/6/2010)

> Hi Mattan,

> I got some questions with relate to the virtual memory.

> (1) Although pure segmentation is obsolete, I still want to know how it works. Basically, I
> am wondering if the base reg and bound reg are manually managed by the programmers? If
> the address requested is out of bounds, what happens? The program simply halts, or some
> fragments of that segment are swapped out to disk?

Segments are not entirely obsolete. PowerPC has a combination of segments and virtual memory. An out-of-bounds address results in a segmentation fault, typically killing the running process.
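
For what it’s worth, pure base-and-bounds translation is small enough to sketch in a few lines (simplified to a single segment; the OS, not the programmer, loads the base and bound):

    # Pure base-and-bounds segmentation, one segment (sketch).
    class SegmentationFault(Exception):
        pass

    def translate(offset, base, bound):
        # base/bound are loaded by the OS when the segment is set up or the
        # process is scheduled; the programmer never manages them directly.
        if offset >= bound:
            raise SegmentationFault(hex(offset))   # OS typically kills the process
        return base + offset                       # physical address

    print(hex(translate(0x100, base=0x40000, bound=0x1000)))   # 0x40100
    try:
        translate(0x2000, base=0x40000, bound=0x1000)
    except SegmentationFault as fault:
        print("segfault at offset", fault)         # out of bounds -> trap to the OS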

> (2) Just to confirm, the page directory is always resident in physical memory, and only
> the active page tables are swapped into the physical memory. Upon context switch
> (process/thread switch), new page directory and related page tables are swapped in in
> place of the old ones.

That’s correct. Very often, the page directory is not swapped out when another process is running, so the only thing required on a process switch is to update CR3.

> (3) If the valid bit of the PTE is 0, this indicates that that page is in disk, so the
> content of the PTE should contain the disk address. I just don’t know how the disk
> address is represented? Can it be represented in the same length of the physical memory
> address?

Generally, it’s not exactly an address on disk, but rather an offset within a file. The offset can be represented with the same number of bits. The OS keeps track of where the swap file is on disk. How files are maintained is a question for an OS class so I won’t go into details here.

> (4) Why do we need an untranslated address mode? It seems like it is used to avoid page
> faults on the page tables?

In normal operation, every address issued by a load instruction is a virtual address. Addresses stored in the page table (page directory, …) are all physical addresses. Let’s say software reads an entry in the page directory or a page table. This entry points to the location of a page table or a PTE, respectively. But when software tries to dereference this address, VM address translation would take over and try to translate what is already a physical address, ending up at the wrong location. The value read from the page directory/table needs to be used directly as the physical address of the next load, so it cannot be translated. Hence the special untranslated mode. This mode can only be used by trusted software (the OS), because it violates protection. Does this make sense?
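
To make that concrete, here is a sketch of the software walk (with a hypothetical helper phys_read standing in for an untranslated load; 32-bit, 4KB pages, 2 levels). Every value pulled out of the directory or table is a physical address and must be dereferenced without translation:

    # OS-level page-table walk using untranslated (physical) loads (sketch).
    # phys_read(addr) treats addr directly as a physical address; a normal
    # load would push it back through VM translation and fetch the wrong data.
    def walk(va, pdbr, phys_read):
        pde = phys_read(pdbr + ((va >> 22) & 0x3FF) * 4)     # Page Directory entry
        if not (pde & 1):                                    # present bit clear
            return None                                      # no Page Table mapped here
        pt_base = pde & ~0xFFF                               # physical base of the Page Table
        pte = phys_read(pt_base + ((va >> 12) & 0x3FF) * 4)  # the PTE itself
        if not (pte & 1):
            return None                                      # page fault
        return (pte & ~0xFFF) | (va & 0xFFF)                 # final physical address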

> (5) In your slide, you mentioned how the free list is organized (FIFO or LIFO). Does
> it really matter?

Probably not, although they aren’t exactly the same. When a frame is put on the free list the data stored in it is still valid. The data is only replaced when the frame is allocated to a new page. If a page that has recently been put on the free list is touched, the OS can perhaps re-allocate the same frame to it without having to go to disk. This isn’t commonly done, but it would suggest that FIFO is better. With LIFO, frames get reused very quickly, so for applications that need a lot of memory it tends to be quick and lead to better locality (less fragmentation), I believe.

Continued in another email:

>>> (4) Why do we need an untranslated address mode? It seems like it is used to avoid page
>>> faults on the page tables?

>> In normal operation, every address issued by a load instruction is a virtual
>> address. Addresses stored in the page table (page directory, …) are all physical
>> addresses. Let’s say software reads an entry in the page directory or a page table. This
>> entry points to the location of a page table or a PTE, respectively. But when software
>> tries to dereference this address, VM address translation would take over and try to
>> translate what is already a physical address, ending up at the wrong location. The value
>> read from the page directory/table needs to be used directly as the physical address of
>> the next load, so it cannot be translated. Hence the special untranslated mode. This
>> mode can only be used by trusted software (the OS), because it violates protection.
>> Does this make sense?

> Please check if my following rephrase is correct. All the addresses are sent out from
> CPU, and they need to go through MMU, which if in the normal mode would regard the
> address as VA. Therefore, for those cases where the CPU does send out a physical
> address, we have to switch the MMU to an untranslated mode. Those untranslated scenarios
> are not limited to the page hit case you mentioned above. For example, upon a page
> fault, the CPU would issue commands that copy disk page into memory, and this memory
> address should be physical. Likewise, in a two-level page-table system, I think the MMU
> should have three modes: level-2 translation mode, level-1 translation mode, and
> untranslated mode.

Yes, but I have no idea what 2-level translation is.

>>> (5) In your slide, you mentioned how the free list is organized (FIFO or LIFO). Does
>>> it really matter?

>> Probably not, although they aren’t exactly the same. When a frame is put on the free
>> list the data stored in it is still valid. The data is only replaced when the frame is
>> allocated to a new page. If a page that has recently been put on the free list is
>> touched, the OS can perhaps re-allocate the same frame to it without having to go to
>> disk. This isn’t commonly done, but it would suggest that FIFO is better. With LIFO,
>> frames get reused very quickly, so for applications that need a lot of memory it tends
>> to be quick and lead to better locality (less fragmentation), I believe.

> Did you mean the case where the page swapped out is requested again very soon? In FIFO,
> that page’s frame is probably not reused yet, so we can easily reclaim it as valid. With
> LIFO, that frame may have already been reused, which degrades performance. But I don’t
> understand your last sentence. Why would LIFO be quick and lead to better locality for
> applications requiring many pages?

I couldn’t figure it out when I was writing to you yesterday, and I’m traveling so I can’t do it now. That’s why my last sentence had the “I believe” in it — I was recounting something I remembered, but I don’t have a solid explanation right now. In the end, FIFO and LIFO are fairly equivalent for free-frame management.

EDITED: FIFO maintains LRU order and was chosen for VAX. The idea is that pages on the free list might still be re-requested by the application and then re-allocated more quickly. Some studies later showed that LIFO ends up with better locality (i.e., less fragmentation). It doesn’t matter too much when memory is large enough.

Continued several days later (after reading the VAX paper and a brief discussion in office hours)

> Just to confirm what we talked today. In a one free-list system, we, by some heuristic,
> speculatively free pages, and add their frame numbers to the free-list. PTEs pointing
> to them are set invalid, and if that page is dirty, we have to write it back. The
> reason we do this speculation is that typically more than one page would be swapped in
> so that we need to keep sufficient room for them.

The reason is to keep software overheads low. The pager needs to do its job quickly, so once it is invoked it scans for multiple potential pages to swap out in case more space is needed. If it was wrong and a page on the FL is accessed again, it can supply it relatively quickly. The VAX FL is a FIFO to try and approximate LRU behavior — the first frame actually overwritten is the one that has been on the FL the longest, i.e., the least recently touched.
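
A sketch of that reclaim path (illustrative structures only): frames go on the FL in FIFO order with their data intact, and a fault on a page that is still sitting there just pulls its frame back without any disk I/O.

    # FIFO free list with cheap reclaim of recently-freed pages (sketch).
    free_list = []   # (page, frame) pairs; oldest entries at the front
    resident  = {}   # page -> frame for pages that are currently mapped

    def pager_evict(page):
        frame = resident.pop(page)
        free_list.append((page, frame))    # frame keeps its data for now

    def page_fault(page):
        for i, (p, frame) in enumerate(free_list):
            if p == page:                  # still on the FL: a "soft" fault,
                del free_list[i]           # reclaim the same frame, no disk read
                resident[page] = frame
                return frame
        _, frame = free_list.pop(0)        # oldest frame gets overwritten first
        resident[page] = frame             # (the page would be read from disk here)
        return frame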

> However, when it comes to VAX which has a free-list and modifier-list, I am a little bit
> confused. That paper says if the page to be removed is dirty, it is added to free-list,
> otherwise to the modified-list. Since free-list is the only source indicating the
> available pages, does this mean that pages in the modified-list are not available? It
> shouldn’t be the case because we want all the pages swapped out to be available. Why
> not add the dirty page number to both the free-list and the modified-list? I can understand
> why VAX uses split lists for clean and dirty pages---because it wants to perform
> clustering of writebacks. If we write back as soon as the page is swapped out, one list
> is enough I believe.

It’s actually the opposite — if the page is clean then it is put on the FL. Dirty pages are put on the modified list. Pages are written back from the modified list to disk and then put on the FL — this takes time and the pager doesn’t want to wait before returning. It can’t put the dirty pages on the FL directly because if it allocates them before they are written to disk then that data is lost or a long stall occurs. You basically have to maintain two lists to deal with efficient clustered writebacks, so what would be the point of adding a page to both lists? If a page is in both lists it may slow allocation down.
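
A sketch of the two-list flow (illustrative structures; in the real system the writeback happens asynchronously, but the ordering is the same): clean frames go straight to the FL, dirty frames wait on the modified list until a clustered writeback completes, and only then become allocatable.

    # VAX-style free list + modified list with clustered writeback (sketch).
    free_list, modified_list = [], []
    CLUSTER = 8   # write dirty pages back to the swap file in batches

    def pager_evict(page, frame, dirty):
        if dirty:
            modified_list.append((page, frame))   # not reusable yet: data not on disk
        else:
            free_list.append((page, frame))       # clean: immediately reusable
        if len(modified_list) >= CLUSTER:
            flush_modified()

    def flush_modified():
        batch = list(modified_list)
        modified_list.clear()
        write_back_to_disk(batch)                 # one clustered I/O instead of many
        free_list.extend(batch)                   # now these frames can be allocated

    def write_back_to_disk(batch):
        pass                                      # placeholder for the swap I/O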