Preface
Today I had the second-round interview with my dream company. The interviewer moved very fast, and it was a phone interview on top of that: he barely let you talk and only listened for the conclusion he wanted. That made me nervous, and since I cared a lot about this interview, I got even more nervous 😭. I don't feel I performed particularly well. Maybe that's just how big companies run interviews, but honestly it comes down to me not being good enough. I've been an interviewer myself and talked with plenty of candidates, and from my perspective our goals are aligned: I want to know the candidate's real level, and the candidate wants to show it. But today's second round felt like it was with a technical leader who was probably busy and interviewing a lot of people.
Still, I did get something out of it: I heard some design questions different from the usual ones. Thinking them over after the interview, I found them pretty interesting, so I'm summarizing and reviewing them here. I'll skip the questions everyone asks all the time, like designing a leaderboard, which I've also implemented in my own projects.
The related code for this post has been uploaded to GitHub.
How do you deduplicate 4 billion QQ numbers?
This is actually a pretty common type of question. Variants include: from massive log data, extract the IP that visited Baidu the most on a given day; find hot search queries; count the top 10 hottest queries among 3 million query strings; or, given files a and b each storing 5 billion URLs (64 bytes each) and a 4 GB memory limit, find the URLs common to both. These are all classic massive-data design problems, and most "standard answers" use a bitmap to balance time and space. But since I've done big-data work before, for pure deduplication I lean toward MapReduce. Of course, if the interviewer asked for the median or top-k, I'd definitely use a bitmap and finish with a single sorting pass.
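Since the bitmap answer comes up every time, here is what it looks like in miniature. This is my own sketch, not the interview answer: it assumes the numbers fit below `Integer.MAX_VALUE` because `BitSet` is int-indexed; covering the full 32-bit range needs a long-indexed bitmap, about 2^32 bits = 512 MB.

```java
import java.util.Arrays;
import java.util.BitSet;

public class BitmapDedup {
    // Bitmap dedup sketch: one bit per possible number. get() checks whether
    // we've seen the number before; set() records it. Keeps first occurrences.
    static long[] dedup(long[] qqNumbers) {
        BitSet seen = new BitSet();
        return Arrays.stream(qqNumbers)
                .filter(qq -> {
                    if (seen.get((int) qq)) {
                        return false; // already seen, drop it
                    }
                    seen.set((int) qq);
                    return true;
                })
                .toArray();
    }

    public static void main(String[] args) {
        // → [10001, 10002, 10003]
        System.out.println(Arrays.toString(dedup(new long[]{10001, 10002, 10001, 10003})));
    }
}
```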
If the question is purely about deduplication, the first thing most people think of is hashing. But a plain hash set would blow past the memory limit, and MapReduce exists to solve exactly this kind of problem. MapReduce is a computation model: it decomposes the work or the data for large-scale processing and then merges the results. In spirit it is divide and conquer, much like merge sort. Hadoop is an open-source distributed parallel-processing framework that implements the MapReduce model.
So for QQ-number deduplication, we just map/split the data and hand the pieces to different machines (the map phase), then repeatedly reduce/aggregate the results (the reduce phase). Concretely:
- First use a mapping function, e.g. %10000, to split the file of QQ numbers. This produces 10,000 small files, which are distributed to different nodes, each node responsible for its own share. Because equal numbers always map to the same small file, deduplicating each small file separately is globally correct.
- Once the big file has become small files, each node can load its files into memory and deduplicate locally; an ordinary hash-based dedup is fine here.
- Each small file becomes an even smaller deduplicated file, and then we keep merging the files from pairs of nodes until only one file is left.
With this kind of data partitioning and final reduction, you end up with a single deduplicated file. The same idea answers hash-counting and hash-dedup problems too. Bitmap is a great answer, but MapReduce won't lose you points either. In fact, these scenario design questions mainly test your accumulated experience and your ideas. At least I wouldn't stop candidates from brainstorming (and my interviewer didn't either; he actually accepted this answer, hhhh).
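To make the steps concrete, here is a single-machine toy version of the partition-then-dedup idea (my own illustration; a real MapReduce job would write the partitions to files on different nodes rather than grouping in memory):

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class QqDedup {
    // "Map" phase: partition by qq % buckets, so equal numbers always land in
    // the same bucket. Each bucket is then deduplicated independently with a
    // small hash set, and the results are concatenated (the "reduce" phase).
    static List<Long> dedup(List<Long> qqNumbers, int buckets) {
        Map<Long, List<Long>> partitions = qqNumbers.stream()
                .collect(Collectors.groupingBy(qq -> qq % buckets));
        List<Long> result = new ArrayList<>();
        for (List<Long> partition : partitions.values()) {
            result.addAll(new LinkedHashSet<>(partition)); // local dedup
        }
        return result;
    }
}
```

Per-bucket dedup being globally correct is the whole trick: no number can appear in two buckets, so no cross-bucket comparison is ever needed.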
How is the WeChat red packet amount split?
Rough idea
I feel like I saw this question on some WeChat public account before, but forgot it after reading. Today it came out of nowhere, and my first reaction wasn't to answer: I genuinely wanted to know the answer myself. That damn curiosity. We rarely look at the things we take for granted every day from an engineer's perspective. Honestly, that's on me.
Once I heard the question, I realized it was a blind spot, and I was super nervous; I don't even remember how many seconds I thought before blurting something out. My idea: since it's "grabbing" a red packet, it must be random. So how do we randomize? We need upper and lower bounds for each person's draw. The lower bound is naturally 1 cent, and the upper bound is the remaining amount minus 1 cent for each person still waiting (we must guarantee everyone gets at least 1 cent). So the range is [1 cent, remaining amount − remaining people × 1 cent]. But this distribution is unbalanced, i.e., unfair: early grabbers have a big advantage because they draw from a much larger range, and the later you are, the smaller your expected share. I actually wrote some code to test it, and under the default random seed the big chunk does get taken early after a few rounds.
Suppose there are 10 people, total red packet amount is 100 cents.
The first person draws from [1, 100 − 9] = [1, 91], expectation 46 cents; suppose they get 46, leaving 54.
The second person draws from [1, 54 − 8] = [1, 46], expectation 23.5 cents; suppose they get 23.5, leaving 30.5.
The third person draws from [1, 30.5 − 7] = [1, 23.5], expectation about 12.25 cents.
……
Although the example uses expected values, the trend is inevitable: the later you draw, the smaller your share.
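The shrinkage is easy to see in code. A minimal sketch of this naive scheme, in integer cents (my own illustration, not the code from the interview):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class NaiveRedPacket {
    // Naive scheme: each person draws uniformly from
    // [1, remaining - (peopleLeft - 1)], reserving 1 cent per person still
    // waiting. Early grabbers draw from a huge range, later ones from a tiny one.
    static List<Integer> split(int totalCents, int count, Random random) {
        List<Integer> result = new ArrayList<>();
        int remaining = totalCents;
        for (int left = count; left > 1; left--) {
            int max = remaining - (left - 1);            // keep 1 cent per waiter
            int amount = max <= 1 ? 1 : 1 + random.nextInt(max); // uniform in [1, max]
            result.add(amount);
            remaining -= amount;
        }
        result.add(remaining); // last person takes whatever is left
        return result;
    }
}
```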
Thinking more carefully
Thinking it over afterwards, I realized the approach is actually workable: determining the random bounds is the right first step; the problem is that the current upper bound is unfair. If we dynamically adjust the upper bound to a reasonable range based on the remaining people, it should be better. But what is that value? Searching around, I found the double-mean method: set each person's upper bound to twice the current mean, i.e., the remaining amount divided by the remaining number of people, times 2. The range becomes [1, remaining amount / remaining people × 2]. My understanding is that because the bound is recomputed from the remaining amount and people each time, the distribution stays fair overall: each draw is uniform on its range, so its expectation is exactly the fair share. For example:
Suppose there are 10 people, total red packet amount is 100 cents.
The first person draws from (1, 100/10 × 2) = (1, 20), expectation about 10 cents; suppose they get 10.
The second person draws from (1, 90/9 × 2) = (1, 20), again about 10 cents; suppose they get 10.
….
The tenth person simply takes the remaining 10 cents.
If it follows this distribution, everyone ends up with roughly the same amount. But since it’s random, you’ll still see fluctuations around the mean, like:
Suppose there are 10 people, total red packet amount is 100 cents.
The first person’s random range is (1, 100/10*2), suppose this person got 15
The second person’s random range is (1, 85/9*2 = 18.888), suppose this person got 15
The third person’s random range is (1, 70/8*2 = 17.5), suppose this person got 15
…
You can see that in this case it's still not perfectly fair for the later people: just because the first person drew from (1, 20) and happened to get a bit more, the later ranges shrink. But dynamically adjusting the upper bound at least keeps the amounts from diverging too much. I implemented it with code and the results looked okay.
The full code is a bit long, so I uploaded it to GitHub. Sample output:
30 people grabbing 100 cents
No.1 grabbed 4 cents, 96 left, range was [1, 6.666667]
No.2 grabbed 5 cents, 91 left, range was [1, 6.620690]
No.3 grabbed 2 cents, 89 left, range was [1, 6.500000]
No.4 grabbed 1 cent, 88 left, range was [1, 6.592593]
No.5 grabbed 3 cents, 85 left, range was [1, 6.769231]
No.6 grabbed 4 cents, 81 left, range was [1, 6.800000]
No.7 grabbed 4 cents, 77 left, range was [1, 6.750000]
No.8 grabbed 1 cent, 76 left, range was [1, 6.695652]
No.9 grabbed 5 cents, 71 left, range was [1, 6.909091]
No.10 grabbed 4 cents, 67 left, range was [1, 6.761905]
No.11 grabbed 5 cents, 62 left, range was [1, 6.700000]
No.12 grabbed 5 cents, 57 left, range was [1, 6.526316]
No.13 grabbed 6 cents, 51 left, range was [1, 6.333333]
No.14 grabbed 4 cents, 47 left, range was [1, 6.000000]
No.15 grabbed 1 cent, 46 left, range was [1, 5.875000]
No.16 grabbed 3 cents, 43 left, range was [1, 6.133333]
No.17 grabbed 1 cent, 42 left, range was [1, 6.142857]
No.18 grabbed 4 cents, 38 left, range was [1, 6.461538]
No.19 grabbed 2 cents, 36 left, range was [1, 6.333333]
No.20 grabbed 5 cents, 31 left, range was [1, 6.545455]
No.21 grabbed 1 cent, 30 left, range was [1, 6.200000]
No.22 grabbed 6 cents, 24 left, range was [1, 6.666667]
No.23 grabbed 1 cent, 23 left, range was [1, 6.000000]
No.24 grabbed 4 cents, 19 left, range was [1, 6.571429]
No.25 grabbed 3 cents, 16 left, range was [1, 6.333333]
No.26 grabbed 4 cents, 12 left, range was [1, 6.400000]
No.27 grabbed 5 cents, 7 left, range was [1, 6.000000]
No.28 grabbed 1 cent, 6 left, range was [1, 4.666667]
No.29 grabbed 3 cents, 3 left, range was [1, 6.000000]
The last person takes what's left: 3 cents
All grabbed; total amount is 100 cents
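For reference, the core loop that produces output like the above fits in a few lines. This is my own minimal sketch in integer cents, not the repo code; the extra cap on the upper bound ensures everyone still gets at least 1 cent:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class RedPacket {
    // Double-mean split: each person's upper bound is twice the current mean
    // (remaining / peopleLeft * 2), capped so the rest can still get 1 cent each.
    static List<Integer> split(int totalCents, int count) {
        List<Integer> result = new ArrayList<>();
        Random random = new Random();
        int remaining = totalCents;
        for (int left = count; left > 1; left--) {
            int max = Math.min(remaining / left * 2, remaining - (left - 1));
            int amount = max <= 1 ? 1 : 1 + random.nextInt(max); // uniform in [1, max]
            result.add(amount);
            remaining -= amount;
        }
        result.add(remaining); // last person takes whatever is left
        return result;
    }

    public static void main(String[] args) {
        List<Integer> shares = split(100, 10);
        System.out.println(shares);
        System.out.println(shares.stream().mapToInt(Integer::intValue).sum()); // always 100
    }
}
```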
Besides that, I also found a drawback: the last person is treated unfairly in an absolute sense, because they can only take whatever is left after everyone else is done. Thinking it through, there is a genuine tension here: on one hand you want randomness, on the other you want everyone's amount to follow as close to the same distribution as possible. Most blogs I read aren't satisfied with this either. They propose a "linear cutting" algorithm with higher time complexity: abstract the total amount as a rope and make (people − 1) cuts, so each person's amount is the length of rope they receive. I wanted to implement it, but there are many edge cases to handle and it's fairly hard, so I gave up. Reference 3 splits a number with a similar algorithm; check it out if you're interested.
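That said, a simplified version of the rope-cutting idea can be sketched if you assume integer cents and force the (people − 1) cut points to be distinct, which sidesteps most of the edge cases (my own rough sketch, not the full algorithm from the reference):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.TreeSet;

public class LinearCut {
    // Cut the [0, totalCents] "rope" at count-1 distinct random points;
    // each share is the gap between consecutive cuts. Distinct integer cuts
    // guarantee every share is at least 1 cent. Requires totalCents >= count.
    static List<Integer> split(int totalCents, int count) {
        TreeSet<Integer> cuts = new TreeSet<>(); // sorted, duplicates rejected
        Random random = new Random();
        while (cuts.size() < count - 1) {
            cuts.add(1 + random.nextInt(totalCents - 1)); // cut point in [1, totalCents-1]
        }
        List<Integer> result = new ArrayList<>();
        int prev = 0;
        for (int cut : cuts) {
            result.add(cut - prev);
            prev = cut;
        }
        result.add(totalCents - prev); // last segment of the rope
        return result;
    }
}
```

Unlike the double-mean loop, no position is special here: every share is a rope segment drawn the same way, so the last person is no longer systematically shortchanged.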
Shuffling a deck of cards
To be honest, I did this problem back when I first learned C in school; I don't remember whether it was from a self-study video or a textbook I bought. But because I was nervous, I couldn't come up with a good method. I was so nervous when I heard the question that I even asked the interviewer to repeat the scenario. Actually I was just buying time to think, but I could feel the interviewer rolling his eyes. This is also a downside of phone interviews: long silences while you think are awkward, whereas in a video or onsite interview you can at least make eye contact or let the other person see that you're thinking.
In the end I cautiously asked if I could use the JDK API. The interviewer was friendly and said yes, so I gave Collections.shuffle(). He asked if I knew how it was implemented. I really didn't; I could only say I hadn't read that part of the source code. The interviewer probably wanted me to implement shuffle() myself first. After the interview I checked: it walks the list from the end and swaps each position with a randomly chosen earlier index drawn from a Random instance, i.e., the Fisher–Yates shuffle. Sigh, if I had been calmer, my impression would have been better. During the interview I also mentioned doing it with multithreading. I think that could work too, but I didn't have a solid idea at the time so I didn't continue. Here I'll share the code I thought through afterwards.
/**
 * List of suits
 */
List<String> colors = Lists.newArrayList("黑桃", "红心", "梅花", "方块");
/**
 * List of face values. Built by merging two streams: too lazy to type out 2-10, so IntStream generates them
 */
List<String> numbers = Stream.concat(IntStream.rangeClosed(2, 10).mapToObj(String::valueOf),
        Stream.of("A", "J", "K", "Q")).collect(Collectors.toList());
First define the structure: two lists, one for suits and one for face values. Reminder: the code is uploaded to GitHub.
Collections.shuffle()
@Test
public void collectionShuffleTest() {
    // Combine suits and face values into the 52 cards
    List<String> result = colors.stream()
            .flatMap(color -> numbers.stream().map(number -> color + number))
            .collect(Collectors.toList());
    // Shuffle
    Collections.shuffle(result);
    // Print to console (converted to pretty JSON for readability)
    System.out.println(JSON.toJSONString(result, SerializerFeature.PrettyFormat));
}
Random swap
@Test
public void randomSwapTest() {
    // Combine suits and face values into the 52 cards
    List<String> result = colors.stream()
            .flatMap(color -> numbers.stream().map(number -> color + number))
            .collect(Collectors.toList());
    // Do 52 random swaps
    for (int count = result.size(); count > 0; count--) {
        // Pick two random indices (hutool's randomInt is exclusive of the upper bound,
        // so pass result.size() to make the last index reachable)
        int site1 = RandomUtil.randomInt(0, result.size());
        int site2 = RandomUtil.randomInt(0, result.size());
        // Swap them
        String temp = result.get(site1);
        result.set(site1, result.get(site2));
        result.set(site2, temp);
    }
    // Print to console (converted to pretty JSON for readability)
    System.out.println(JSON.toJSONString(result, SerializerFeature.PrettyFormat));
}
The code is very simple, basically like writing business code; how did I not think of it on the spot? Looking at it again, each round needs two random draws, and each swap needs a temporary variable, though that is only constant extra space. Since the elements are strings rather than numbers, there is no XOR trick to avoid the temporary. I also looked into how Collections.shuffle() is implemented; its swap is essentially the same as mine, except it draws only one random index per step.
/**
* Swaps the two specified elements in the specified array.
*/
private static void swap(Object[] arr, int i, int j) {
Object tmp = arr[i];
arr[i] = arr[j];
arr[j] = tmp;
}
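That swap is applied by Collections.shuffle() inside a Fisher–Yates loop: walk the list from the end and swap each position with a uniformly random index at or before it. A standalone sketch of the same idea, generic over element type:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class FisherYates {
    // In-place Fisher-Yates shuffle: one random draw per step, O(1) extra
    // space, and every permutation of the list is equally likely.
    static <T> void shuffle(List<T> list, Random random) {
        for (int i = list.size() - 1; i > 0; i--) {
            int j = random.nextInt(i + 1); // j uniform in [0, i]
            T tmp = list.get(i);
            list.set(i, list.get(j));
            list.set(j, tmp);
        }
    }

    public static void main(String[] args) {
        List<Integer> deck = new ArrayList<>();
        for (int i = 0; i < 52; i++) {
            deck.add(i);
        }
        shuffle(deck, new Random());
        System.out.println(deck);
    }
}
```

Note the contrast with the random-swap version above: swapping two fully random indices 52 times does not produce a uniform distribution over permutations, while Fisher–Yates does, with half the random draws.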
If I come up with a better approach next time, I’ll add it here.
Trying multithreading
Because shuffling is random in nature, I thought about a multithreaded approach based on unfair lock contention. If you simply put the items to be shuffled into one queue, then however you execute it, the output stays ordered; you would have to read or insert at random positions, at which point the randomness no longer has anything to do with the threads.
The simplest version uses the JDK's ForkJoinPool.commonPool() via parallel streams to scatter the elements and add them concurrently to the result container. In practice, perhaps because the parallelism is low, it shuffles somewhat, but you still see many consecutive cards of the same suit. Swapping in your own pool for the stream and tuning the parallelism should work much better.
@Test
@SuppressWarnings("all")
public void parallelStreamTest() {
    // Build the deck first
    List<String> rawList = colors.stream()
            .flatMap(color -> numbers.stream().map(number -> color + number))
            .collect(Collectors.toList());
    // Create the result container
    List<String> result = new ArrayList<>(sumCount);
    // Deliberately use side effects to add elements in parallel (not thread-safe)
    rawList.parallelStream().forEach(result::add);
    // Print to console (converted to pretty JSON for readability)
    System.out.println(JSON.toJSONString(result, SerializerFeature.PrettyFormat));
    System.out.println(result.size());
}
Then I tried implementing it myself and found it much harder than expected. My idea was to group the cards by suit, have four groups (threads) compete for an unfair ReentrantLock, and use a Condition to control the random order. Each group would be a Queue holding that suit's cards, and whichever thread won would dequeue one element into the result container. I also considered splitting into suit groups and face-value groups, with each suit group scanning face values in parallel and the face values also held in a queue. But the boundary checks and the queue operations could not be made atomic together, and thread-safety problems kept popping up. After thinking for an entire afternoon, I settled on this:
@Test
@SneakyThrows
@SuppressWarnings("all")
public void multithreadingTest() {
    // Group the cards by suit first, one queue per suit
    // (colorCount/numberCount/sumCount are fields from the repo: 4 suits, 13 values, 52 cards)
    List<Queue<String>> rawList = colors.stream()
            .map(color -> numbers.stream().map(number -> color + number)
                    .collect(Collectors.toCollection(LinkedList::new)))
            .collect(Collectors.toList());
    // The result list is shared across suit threads, so it must be synchronized
    List<String> result = Collections.synchronizedList(new ArrayList<>(sumCount));
    // Lazily use the JDK's built-in thread pool
    ExecutorService executor = Executors.newCachedThreadPool();
    // One latch count per card, counted down by the workers themselves
    CountDownLatch countDownLatch = new CountDownLatch(colorCount * numberCount);
    for (int i = 0; i < colorCount; i++) {
        int current = i;
        ReentrantLock lock = new ReentrantLock();
        for (int count = 1; count <= numberCount; count++) {
            executor.execute(() -> {
                lock.lock();
                try {
                    if (CollectionUtil.isNotEmpty(rawList.get(current))) {
                        result.add(rawList.get(current).poll());
                    }
                } finally {
                    lock.unlock();
                    countDownLatch.countDown();
                }
            });
        }
    }
    // Keep the main thread from exiting before the workers finish
    countDownLatch.await(10, TimeUnit.SECONDS);
    executor.shutdown();
    // Print to console (converted to pretty JSON for readability)
    System.out.println(JSON.toJSONString(result, SerializerFeature.PrettyFormat));
    System.out.println(result.size());
}
Just to be clear: the ordering here comes from thread-scheduling nondeterminism rather than genuine randomness, so it cannot be relied on to behave like a fair shuffle every time. But after grinding for an afternoon I needed something to show for it. I'll keep optimizing it when I have time, and I'd welcome ideas or corrections from anyone reading.
Afterword
These are all fairly simple problems, and none of them has a single standard answer. When writing programs, you can always brute-force something naive and still get AC, but the thinking, the approaches, and the complexity analysis are the most interesting parts. I want to keep digging into them when I have time.
References
- A brief introduction to WeChat red packet architecture design
- Implementing the double-mean algorithm and red packet grabbing in golang
- Linear optimal splitting algorithm
All articles in this blog, unless otherwise stated, are licensed under @Oreoft . Please indicate the source when reprinting!