Can bigcode-evaluation-harness eval results match, or at least come close to, the published results for popular models like Llama 3, Qwen2, etc.?
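
For context, here is a minimal sketch of the kind of run I have in mind, adapted from the harness README's example usage. The model name and generation settings below are my assumptions, not the settings the model authors actually used; published numbers can depend heavily on prompt format, temperature, n_samples, and answer post-processing.

```bash
# HumanEval evaluation, following the README's example invocation.
# Model name, temperature, and n_samples are placeholders; to reproduce
# a published score they should match the model card's reported setup.
accelerate launch main.py \
  --model meta-llama/Meta-Llama-3-8B \
  --tasks humaneval \
  --max_length_generation 512 \
  --temperature 0.2 \
  --n_samples 20 \
  --batch_size 10 \
  --precision bf16 \
  --allow_code_execution \
  --save_generations
```

With a setup like this, is it realistic to land within a point or two of the numbers reported in the model cards, or are there known sources of divergence (prompting, stop tokens, sampling config) that make exact reproduction unlikely?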