I have just measured the peak GFlops performance of the Cortex-A9, using this snippet of code:

@@ int ni = 65536*16;

   /* zero five independent accumulators, q0-q4 */
   __asm__ volatile("mov  r5, #0\n"
                    "vdup.f32 q0, r5\n"
                    "vdup.f32 q1, r5\n"
                    "vdup.f32 q2, r5\n"
                    "vdup.f32 q3, r5\n"
                    "vdup.f32 q4, r5\n"
                    /* loop ni times over five independent multiply-accumulates */
                    "mov r5, %0\n"
                    "1:\n"
                    "vmla.f32 q0, q0, q0\n"
                    "vmla.f32 q1, q1, q1\n"
                    "vmla.f32 q2, q2, q2\n"
                    "vmla.f32 q3, q3, q3\n"
                    "vmla.f32 q4, q4, q4\n"
                    "subs r5, r5, #1\n"
                    "bne 1b\n" : : "r"(ni) : "q0", "q1", "q2", "q3", "q4", "r5");@@

Each loop iteration performs (4 floats per vector) * (5 VMLA instructions) * (2 flops per VMLA) = 40 flops, and the loop runs ni times. The result on the 1 GHz Cortex-A9 of my PandaBoard: 3.99919 GFlops, which was quite expected, but it is always good to see a confirmation. Reducing the number of VMLA (multiply-accumulate) instructions in the loop lowers performance: the VMLA latency is 5 cycles if I recall correctly, so with fewer independent accumulator chains the loop can no longer hide that latency, which is also expected. Using separate VMUL and VADD instructions produces 2 GFlops instead of 4, since the same 40 flops are now spread over twice as many instructions.
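For reference, a minimal sketch of the timing harness around the loop (assuming clock_gettime(CLOCK_MONOTONIC) is available; compile with something like gcc -O2 -mfpu=neon -mfloat-abi=softfp, plus -lrt on older glibc):

@@ #include <stdio.h>
#include <time.h>

int main(void)
{
    int ni = 65536*16;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    /* the benchmark loop from above */
    __asm__ volatile("mov  r5, #0\n"
                     "vdup.f32 q0, r5\n"
                     "vdup.f32 q1, r5\n"
                     "vdup.f32 q2, r5\n"
                     "vdup.f32 q3, r5\n"
                     "vdup.f32 q4, r5\n"
                     "mov r5, %0\n"
                     "1:\n"
                     "vmla.f32 q0, q0, q0\n"
                     "vmla.f32 q1, q1, q1\n"
                     "vmla.f32 q2, q2, q2\n"
                     "vmla.f32 q3, q3, q3\n"
                     "vmla.f32 q4, q4, q4\n"
                     "subs r5, r5, #1\n"
                     "bne 1b\n" : : "r"(ni) : "q0", "q1", "q2", "q3", "q4", "r5");
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("%.5f GFlops\n", 40.0 * ni / s / 1e9);   /* 40 flops per iteration */
    return 0;
}@@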

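A minimal sketch of the VMUL/VADD variant: each VMLA is split into a multiply into a scratch register (q5-q9) followed by an add, so each iteration still performs 40 flops but issues twice as many NEON instructions:

@@ __asm__ volatile("mov  r5, #0\n"
                    "vdup.f32 q0, r5\n"
                    "vdup.f32 q1, r5\n"
                    "vdup.f32 q2, r5\n"
                    "vdup.f32 q3, r5\n"
                    "vdup.f32 q4, r5\n"
                    "mov r5, %0\n"
                    "1:\n"
                    /* products go to scratch registers q5-q9 ... */
                    "vmul.f32 q5, q0, q0\n"
                    "vmul.f32 q6, q1, q1\n"
                    "vmul.f32 q7, q2, q2\n"
                    "vmul.f32 q8, q3, q3\n"
                    "vmul.f32 q9, q4, q4\n"
                    /* ... then get accumulated separately */
                    "vadd.f32 q0, q0, q5\n"
                    "vadd.f32 q1, q1, q6\n"
                    "vadd.f32 q2, q2, q7\n"
                    "vadd.f32 q3, q3, q8\n"
                    "vadd.f32 q4, q4, q9\n"
                    "subs r5, r5, #1\n"
                    "bne 1b\n" : : "r"(ni)
                    : "q0", "q1", "q2", "q3", "q4",
                      "q5", "q6", "q7", "q8", "q9", "r5");@@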
Interestingly, replacing the five VMLA on size-4 vectors (q registers) with ten VMLA on size-2 vectors (d registers) still comes close to the 4 GFlops peak: 3.64 GFlops.
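A minimal sketch of that d-register variant: ten independent VMLA.F32 chains on 64-bit d registers, still (2 floats per d register) * (10 VMLA) * (2 flops per VMLA) = 40 flops per iteration:

@@ __asm__ volatile("mov  r5, #0\n"
                    "vdup.f32 d0, r5\n"
                    "vdup.f32 d1, r5\n"
                    "vdup.f32 d2, r5\n"
                    "vdup.f32 d3, r5\n"
                    "vdup.f32 d4, r5\n"
                    "vdup.f32 d5, r5\n"
                    "vdup.f32 d6, r5\n"
                    "vdup.f32 d7, r5\n"
                    "vdup.f32 d8, r5\n"
                    "vdup.f32 d9, r5\n"
                    "mov r5, %0\n"
                    "1:\n"
                    "vmla.f32 d0, d0, d0\n"
                    "vmla.f32 d1, d1, d1\n"
                    "vmla.f32 d2, d2, d2\n"
                    "vmla.f32 d3, d3, d3\n"
                    "vmla.f32 d4, d4, d4\n"
                    "vmla.f32 d5, d5, d5\n"
                    "vmla.f32 d6, d6, d6\n"
                    "vmla.f32 d7, d7, d7\n"
                    "vmla.f32 d8, d8, d8\n"
                    "vmla.f32 d9, d9, d9\n"
                    "subs r5, r5, #1\n"
                    /* d8-d9 are callee-saved in the AAPCS; listing them as
                       clobbers lets gcc save and restore them */
                    "bne 1b\n" : : "r"(ni)
                    : "d0", "d1", "d2", "d3", "d4",
                      "d5", "d6", "d7", "d8", "d9", "r5");@@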