How SAT solvers, and more generally different or new algorithms, should be evaluated and compared, both in competitions and, more importantly, in papers, has been debated for decades. Evaluations are usually performed on existing benchmarks, while cross-validation and other means of avoiding over-fitting are rarely used. In this paper we revisit the old idea of scrambling benchmarks, which was also used in early competitions. The goal of scrambling is to make the results of such evaluations more robust. We present a new method for scrambling CNFs which allows the effect of scrambling to be increased gradually, from keeping the scrambled CNF close to the original CNF to a completely random permutation of variables and clauses and random flipping of literal phases. We used this method to scramble benchmarks from the last two SAT competitions and solved them with the best solvers from the main track of the last SAT competition. As expected, our experimental results suggest that scrambling has a substantial effect on the performance of individual solvers, but, surprisingly, it has little effect on the ranking of solvers. As a consequence we argue that our method of scrambling alone is not enough to increase the robustness of competitions and of evaluations in general.
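To make the idea concrete, the following is a minimal sketch of gradual CNF scrambling under our own simplifying assumptions; it is not the paper's actual algorithm or parameterization. It assumes a DIMACS-style representation (clauses as lists of non-zero integers) and uses a single illustrative knob `p` in [0, 1], where `p = 0` leaves the formula unchanged and `p = 1` approaches a fully random permutation of variables and clauses with random phase flips.

```python
import random

def scramble_cnf(clauses, num_vars, p, seed=0):
    """Gradually scramble a CNF given as a list of clauses (DIMACS-style
    integer literals).  The knob p in [0, 1] controls how strongly the
    formula is perturbed; this parameterization is illustrative only."""
    rng = random.Random(seed)

    # Variable renaming: with probability p, swap each variable with a
    # randomly chosen one (p = 1 approaches a uniform random permutation).
    perm = list(range(1, num_vars + 1))
    for i in range(num_vars):
        if rng.random() < p:
            j = rng.randrange(num_vars)
            perm[i], perm[j] = perm[j], perm[i]

    # Phase flips: with probability p/2, consistently replace a variable x
    # by its negation in every clause.
    flip = [rng.random() < p / 2 for _ in range(num_vars + 1)]

    scrambled = []
    for clause in clauses:
        new_clause = []
        for lit in clause:
            var = abs(lit)
            sign = -1 if (lit < 0) != flip[var] else 1
            new_clause.append(sign * perm[var - 1])
        # Optionally shuffle literal order within the clause.
        if rng.random() < p:
            rng.shuffle(new_clause)
        scrambled.append(new_clause)

    # Clause order: partial shuffle controlled by the same knob.
    for i in range(len(scrambled)):
        if rng.random() < p:
            j = rng.randrange(len(scrambled))
            scrambled[i], scrambled[j] = scrambled[j], scrambled[i]

    return scrambled

# Example: a small formula scrambled lightly (p = 0.2) and fully (p = 1.0).
cnf = [[1, -2, 3], [-1, 2], [2, -3]]
print(scramble_cnf(cnf, num_vars=3, p=0.2, seed=42))
print(scramble_cnf(cnf, num_vars=3, p=1.0, seed=42))
```

The key design point reflected here is that the same random choices (variable renaming, phase flipping, clause and literal reordering) are all gated by one strength parameter, so a single knob moves the instance from nearly identical to the original to a complete random scramble.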