I need to point this out about lip sync and facial anims:
Many people wonder why facial animation can be so good in an animated film, but so rough in video games (even when we throw a hell of a lot of technology at it, à la L.A. Noire). The reason is focus.
3D animation is done in layers called "passes". The first is a camera pass, where the camera is animated for the shot even though there are no objects in the scene yet. Then a rough block-out is done with T-posed characters, showing how they move about the scene. Depending on the studio, they'll either then do the limb animation passes, where each limb gets progressively more fleshed out with smoother movements, or they'll do the lip sync/facial rig pass, where animators (whose job it is to SPECIFICALLY focus on this part) carefully align the face bones with the audio in the background (final voices and temp music/SFX are laid out first, before the cameras are done, to give the timings for everything else). These guys have the benefit of working on a platform that's constant, where the audio does exactly what they expect it to, and there's far less overall work, so they can dedicate more time to getting the specific timings and the smoothness of the animation curves correct.
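To make "animation curves" a bit more concrete: a facial pass is essentially hand-placed keyframes on channels like jaw-open or lip-corner, evaluated against a locked audio timeline. Here's a minimal C++ sketch of sampling one such channel (all names here are mine, not any particular tool's API, and real DCC tools use spline rather than linear interpolation):

```cpp
#include <vector>
#include <algorithm>

struct Keyframe {
    float time;   // seconds on the shot's (fixed) audio timeline
    float value;  // e.g. jaw-open amount, 0..1
};

// Linearly sample a time-sorted keyframe list at time t. Because the audio
// timeline never moves in film work, curves like this can be tuned key by key.
float SampleCurve(const std::vector<Keyframe>& keys, float t) {
    if (keys.empty()) return 0.0f;
    if (t <= keys.front().time) return keys.front().value;
    if (t >= keys.back().time)  return keys.back().value;

    // Find the first key after t, then blend from the previous key.
    auto next = std::upper_bound(keys.begin(), keys.end(), t,
        [](float time, const Keyframe& k) { return time < k.time; });
    auto prev = next - 1;
    float alpha = (t - prev->time) / (next->time - prev->time);
    return prev->value + alpha * (next->value - prev->value);
}

// Usage: jawOpen = SampleCurve(jawKeys, audioTimeSeconds);
```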
When working on the facial rig for a game, it's completely infeasible to have an animator do EVERY motion to 100% Pixar quality by hand, unless you have a very big budget (with a big focus on storytelling) and can afford for people to just power their way through the work from day one to ship. The system most games use (for non-cinematic faces, generally) is to have a series of pre-defined face shapes, and for each dialogue file a kind of "script" that tells the animation system to switch to face 'Y' after 'X' seconds, giving the impression of speech with minimal resources spent. Then, for the most important animations (like cinematic faces), traditional animation is used, but only a rough pass is done, and there's generally nothing to stop the animation and the audio falling out of sync with each other.
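As a rough sketch of that "script" system (every name below is hypothetical, not a real engine's API): the data is just a sorted list of timestamped face cues, and each frame the system snaps to whichever shape the current dialogue time has reached:

```cpp
#include <string>
#include <vector>
#include <utility>

struct FaceCue {
    float       time;   // seconds into the dialogue clip
    std::string shape;  // name of a pre-built face pose, e.g. "AA", "OO", "MM"
};

class LipSyncTrack {
public:
    explicit LipSyncTrack(std::vector<FaceCue> cues) : cues_(std::move(cues)) {}

    // Returns the face shape active at time t. Cues are assumed sorted by time.
    const std::string& ShapeAt(float t) const {
        static const std::string kNeutral = "Neutral";
        const std::string* current = &kNeutral;
        for (const FaceCue& cue : cues_) {
            if (cue.time > t) break;  // haven't reached this cue yet
            current = &cue.shape;
        }
        return *current;
    }

private:
    std::vector<FaceCue> cues_;
};

// Usage: a line of dialogue authored as a handful of cues.
// LipSyncTrack track({{0.0f, "MM"}, {0.12f, "AA"}, {0.3f, "OO"}, {0.55f, "Neutral"}});
// character.SetFaceShape(track.ShapeAt(dialogueTimeSeconds));
```

Cheap to author, cheap to run, and good enough for a background NPC, but there's nothing smooth or precise about it.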
Why can't they sync up? Most games use fire-and-forget audio (an event triggers a sound, which plays until it ends), which is cheaper to deal with than tracked, looping sounds (generally reserved for background audio systems like music and ambience). The big problem, when combined with micro-stutter, is that the audio can end up a couple of frames out of sync with the animation it's supposed to be tied to. This isn't a problem if, say, you're playing a sound over two rotating gears, as there's no real reference point a player can connect to the gear rotation. But we humans easily recognise the subtle changes in mouth and eye shapes as somebody speaks, which is why simplified animations seem so weird to us.
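Here's a toy illustration of that drift, under the assumption of a clamped frame delta (a common guard against huge time steps; names are hypothetical). The audio hardware keeps counting real time, the animation clock loses a slice on every stuttered frame, and because the sound was fire-and-forget there's no handle to query and resync against:

```cpp
#include <algorithm>
#include <cstdio>

int main() {
    const float kMaxDelta = 1.0f / 30.0f;  // clamp to avoid huge animation steps

    float audioTime = 0.0f;  // what the audio hardware actually played
    float animTime  = 0.0f;  // what the animation system accumulated

    // Simulate 60 frames at ~16.6 ms, with a few 50 ms micro-stutters mixed in.
    for (int frame = 0; frame < 60; ++frame) {
        float dt = (frame % 20 == 10) ? 0.050f : 0.0166f;
        audioTime += dt;                       // audio clock keeps real time
        animTime  += std::min(dt, kMaxDelta);  // animation clock clamps the spike
    }

    // After a second of play, the mouth shapes lag the voice by this much:
    std::printf("drift: %.1f ms\n", (audioTime - animTime) * 1000.0f);
    return 0;
}
```

A few dozen milliseconds is already enough for a mouth to visibly trail the voice, and over a long unsynced cutscene it only accumulates.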
Remember, the Metal Gear Solid franchise relied heavily, heavily on its cinematics (MGS4 sits at about 9 hours of them, and MGS5:TPP had about 5 hours), and hence they hired people specifically to make sure those were done to the highest quality possible. Other studios focus far more on gameplay, so they probably don't have the talent, time or money to hit the same tier of quality.