Study identifies weaknesses in how AI systems are evaluated