I am working on a Raspberry Pi 3, whose ARM CPU prefers 32-bit aligned memory reads and writes. I have two processes, zmc and zma. Zmc processes video frames from an RTSP camera feed and runs a minimal hardware H264 encode purely to produce side information consisting of motion vectors. Each frame is subdivided into 16x16-pixel macroblocks ("tiles") that span left to right and then top to bottom. I need to check whether each macroblock's motion vector shows significant displacement and, if it does, communicate that to zma. Zma checks whether the macroblock's coordinates fall inside a polygon region of interest (ROI), counts the total number of such macroblocks that fit in the polygon, and derives a motion score from that count.
Zma and zmc communicate via mmap: zmc writes to a buffer and zma loads that buffer from memory, once per frame, with the buffer being part of a ring buffer. If zma is too slow processing a buffer, zmc catches up with it on the next round and a buffer overrun happens.
I suspected that the polygon inclusion test was too slow for this, and I wanted to exploit the fact that the macroblocks are tiled in a fixed order while also saving some memory. So I decided to have zmc use a buffer 1024 bytes long, which is enough bits to store one flag per macroblock for a 1920x1080 frame: (1920x1080)/256/8 = 1012.5 bytes, leaving a few spare bytes at the end (or the front) for some header/statistics info. In zmc, each bit of this data buffer represents one macroblock in tile order and is set to 1 if the motion vector shows significant displacement, 0 otherwise.
Zma, as part of its initialization, creates its own 1024-byte buffer: for each tile in the frame it checks whether the tile's coordinates are inside the polygon ROI and marks the corresponding bit as 1 if so, 0 otherwise. This moves the polygon test into initialization and out of the realtime path, where I hope simple bit operations will be a faster implementation. So when zmc sends its buffer, zma just needs to AND the two buffers bit for bit, count the bits that remain set, and derive a score from that count.
On top of all this, the memory reads and writes must be 32-bit aligned for performance. I'm open to any suggestions, as my knowledge of this topic is limited.
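To make the intended realtime step concrete, here is a minimal sketch of the AND-and-count loop I have in mind (the names are made up, and __builtin_popcount just stands in for whatever bit-counting method turns out to be fastest):

#include <stdint.h>
#include <string.h>

enum { kMaskBytes = 1024 };   // (1920*1080)/256 macroblocks / 8 bits per byte, rounded up

// Sketch only: AND each 32-bit word of the motion bitmap with the precomputed ROI mask
// and count the bits that survive.
static unsigned ScoreMotion(const uint8_t *motion_bits, const uint8_t *roi_mask) {
  unsigned score = 0;
  for (unsigned off = 0; off < kMaskBytes; off += 4) {   // 32-bit aligned chunks
    uint32_t m, b;
    memcpy(&m, roi_mask + off, 4);
    memcpy(&b, motion_bits + off, 4);
    score += __builtin_popcount(m & b);   // blocks that moved AND are inside the ROI
  }
  return score;
}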
My current solution is to accumulate the bits in a 4-byte word and then memcpy it into mvect_buffer (allocated with malloc) every 4 bytes.
Q1. Would it be possible/better to allocate the entire 1024 bytes as a local function variable so that it lives on the stack (presumably avoiding multiple reads and writes to the heap) and then just do a single 1024-byte memcpy at the end?
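Something like this untested sketch is what I mean for Q1 (mmal_motion_vector is the same struct used in the ZMC code further down, and the threshold of 5 is the one from that code):

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

// Sketch for Q1: accumulate the flags in a stack-local array, then copy once at the end.
void WriteMotionFlags(const struct mmal_motion_vector *mvarray, int numblocks,
                      uint8_t *mvect_buffer /* mmap'd ring-buffer slot */) {
  uint8_t local[1024] = {0};                        // on the stack for the duration of the frame
  for (int i = 0; i < numblocks; i++) {
    if ((abs(mvarray[i].x_vector) + abs(mvarray[i].y_vector)) > 5)
      local[i >> 3] |= (uint8_t)(1u << (i & 7));    // set bit i (same layout as the little-endian 32-bit writes)
  }
  memcpy(mvect_buffer, local, sizeof(local));       // single 1024-byte copy at the end
}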
Q2. Is there a fast way to access individual bits in the heap buffer so I can operate on it one bit at a time? That would give me a more flexible solution for the case where the macroblocks do not arrive in any particular order (which is the case for the second type of macroblocks, derived from software H264 decode).
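For Q2, the straightforward byte-indexed version I can come up with looks like this (same LSB-first bit order within each byte as the little-endian 32-bit writes in the code below):

#include <stdint.h>

// Sketch for Q2: touch one bit at a time, in any order.
static inline void set_block_bit(uint8_t *buf, unsigned block) {
  buf[block >> 3] |= (uint8_t)(1u << (block & 7));
}

static inline int test_block_bit(const uint8_t *buf, unsigned block) {
  return (buf[block >> 3] >> (block & 7)) & 1;
}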
Q3. Assuming my handling of the memory reads and writes is already the most performant approach, is there a faster technique for manipulating the bits than what I have so far?
Q4. Is there a faster abs function implementation?
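For reference, the abs() in question is the one in the ZMC hot loop below; pulled out on its own, the test is just:

#include <stdlib.h>

// The per-macroblock displacement test from the ZMC loop (threshold 5 as in the code below).
static inline int block_moved(int x_vector, int y_vector) {
  return (abs(x_vector) + abs(y_vector)) > 5;
}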
Right now the RTSP camera is sending 20 fps, but zmc and zma only process at 15 fps. I think the extra computation is reducing the capture rate, and I want to bring the capture rate as close to the camera's fps as possible.
Here is the implementation for ZMC
uint8_t *mvect_buffer;   // 1024-byte bit buffer shared with zma via the mmap'd ring buffer
/* ... */
if (buffer->flags & MMAL_BUFFER_HEADER_FLAG_CODECSIDEINFO) {
  uint16_t t_offset = 0;
  mmal_buffer_header_mem_lock(buffer);
  uint16_t size = buffer->length / sizeof(mmal_motion_vector);
  struct mmal_motion_vector mvarray[size];
  uint32_t registers;
  // Copy buffer->data to a temporary so the MMAL buffer can be released quickly.
  memcpy(mvarray, buffer->data, buffer->length);
  mmal_buffer_header_mem_unlock(buffer);
  // START CODE
  registers = 0;
  uint16_t count = 0;
  uint16_t wcount = 0;
  for (int i = 0; i < numblocks; i++) {   // one motion vector per 16x16 macroblock
    mmal_motion_vector mvs;
    memcpy(&mvs, mvarray + i, sizeof(mmal_motion_vector));
    if ((abs(mvs.x_vector) + abs(mvs.y_vector)) > 5)   // ignore blocks that did not move enough pixels
      registers = registers | (1 << count);
    count++;
    if ((count == 32) || (i == numblocks - 1)) {   // flush one 32-bit word of flags
      memcpy(mvect_buffer + t_offset, &registers, 4);
      count = 0;
      wcount += 1;
      t_offset += 4;
      registers = 0;
    }
  }
  // END CODE
}
Here is the code for ZMA initialization.
void Zone::SetVectorMask() {
  uint16_t numblocks = 0;
  uint16_t count = 0;
  uint16_t wcount = 0;
  uint32_t registers = 0;
  uint16_t offset = 0;
  numblocks = (monitor->Width() * monitor->Height()) / 256;
  Info("Setting up the motion vector mask with numblocks %d", numblocks);
  for (uint16_t i = 0; i < numblocks; i++) {
    // The blocks are 16x16 tiles covering the entire frame, left to right then top to bottom.
    uint16_t xcoord = (i * 16) % (monitor->Width() + 16);
    uint16_t ycoord = ((i * 16) / (monitor->Width() + 16)) * 16;
    if (polygon.isInside(Coord(xcoord, ycoord))) {   // tile coordinates fall inside the polygon ROI
      registers = registers | (1 << count);
    }
    count++;
    if ((count == 32) || (i == numblocks - 1)) {     // flush one 32-bit word of the mask
      memcpy(zone_vector_mask + offset, &registers, 4);
      count = 0;
      wcount += 1;
      offset += 4;
      registers = 0;
    }
  }
  Info("Done setting up zone vector mask");
}
Here is the code for ZMA as it handles the buffer in realtime from ZMC.
uint8_t *mvect_buffer;   // the zmc bit buffer, mapped from the shared ring buffer
uint16_t width;
uint16_t height;
uint16_t numblocks = (width * height) / 256;
/* ... */
if (mvect_buffer) {
  //Info("Analysing mvect buffer with numblocks %d", numblocks);
  uint16_t offset = 4;
  uint16_t wcount = 0;
  uint32_t mask = 0;
  uint32_t buff = 0;
  uint32_t res = 0;
  uint16_t c = 0;
  for (int i = 0; i < (numblocks + 31) / 32; i++) {   // one iteration per 32-bit word of flags
    memcpy(&mask, zone_vector_mask + offset, sizeof(mask));
    memcpy(&buff, mvect_buffer + offset, sizeof(buff));
    res = mask & buff;                                // blocks that moved AND are inside the ROI
    offset = offset + 4;
    // Bit-count trick from the net: counts the set bits 12 bits at a time.
    c = ((res & 0xfff) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;
    c += (((res & 0xfff000) >> 12) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;
    c += ((res >> 24) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;
    vec_count += c;
  }
}
Thanks,
Chris