You mean that in this dump each entry may have many revisions, each of which has a unique identifier?
So there is no way to tell if two lines in the ndjson file are indeed revisions of the same entry?
Yes, I never pay attention to the identifier field. Those identifiers are meaningless for the purpose of detecting identical visible (human-readable) content, which is what matters.
In other words, most of the metadata means nothing to the consumer; it is only meaningful to the producer of the content.
Logically, different revisions of the same entry should have the same url, right? Of course, url is not metadata…
It depends on the URL; many URLs are disambiguation pages (leading to totally different, separate pages).
I think for consistency we should treat disambiguation pages as ordinary pages. Then we don't lose any data when we group by url and pick the entry with the latest date_modified.
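A minimal in-memory sketch of that grouping in Python, assuming each NDJSON line carries url and date_modified fields, that date_modified is an ISO 8601 string (so plain string comparison orders it correctly), and a placeholder file name:

```python
import json

def latest_per_url(ndjson_path):
    """Keep only the newest revision seen for each url, judged by date_modified."""
    latest = {}
    with open(ndjson_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            entry = json.loads(line)
            url = entry.get("url")
            if url is None:
                continue
            stamp = entry.get("date_modified", "")
            # ISO 8601 timestamps sort correctly as plain strings
            if url not in latest or stamp > latest[url].get("date_modified", ""):
                latest[url] = entry
    return latest

# latest = latest_per_url("enwiktionary.ndjson")  # placeholder file name
```

Note that it keeps one parsed entry per url in memory, so the whole dump has to fit in RAM.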
To group by URL, you need a database or the like. Back to the same place.
I already have the logic for generic tab dedupe in place, so I just ran it through. Deduping by URL would be something custom-made for this particular dump. Ideally they would fix it upstream.
Data cleansing can be very time consuming and tedious.
Keep in mind that URLs are not unique; in other words, grouping by URL is not enough. It is a subtle but important issue: several distinct URLs can point to the exact same article, and it happens so often that it is practically ubiquitous. Essentially, it is a many(names)-to-one(article/entity)-to-many(URLs) relationship. Then previous revisions get thrown into the raw JSON data on top of that. It is a mess.
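One way to see that many-to-one relationship in practice is to hash the visible article text and collect the distinct urls that share each hash. A rough sketch, where article_body is a hypothetical name for the field holding the visible text (the real field name in the dump may differ):

```python
import hashlib
import json
from collections import defaultdict

def urls_per_identical_article(ndjson_path, text_field="article_body"):
    """Map a hash of the visible text to the set of distinct urls carrying it."""
    groups = defaultdict(set)
    with open(ndjson_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            entry = json.loads(line)
            url = entry.get("url")
            text = entry.get(text_field, "")
            if url and text:
                digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
                groups[digest].add(url)
    # Keep only article content shared by more than one distinct url
    return {h: urls for h, urls in groups.items() if len(urls) > 1}
```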
Thank you so much for all of your elaboration!
Watch the duplicated data in this entry alone (notice the timestamps and the redirects and categories shifting). Redirect = different URLs/names for the SAME article content.
It seems from your video that the url does not change at all :v
I have found 9651 duplicate URLs in the English Wiktionary. How many duplicates have you found in the English Wiktionary?
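For what it's worth, a duplicate-url count like that needs nothing beyond the standard library. A rough sketch, again assuming a url field on each NDJSON line and a placeholder file name:

```python
import json
from collections import Counter

def duplicate_urls(ndjson_path):
    """Return the urls that appear on more than one line of the dump."""
    counts = Counter()
    with open(ndjson_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            url = json.loads(line).get("url")
            if url:
                counts[url] += 1
    return {url: n for url, n in counts.items() if n > 1}

# print(len(duplicate_urls("enwiktionary.ndjson")))  # number of duplicated urls
```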
Like I said, I don't dedupe based on metadata. I only dedupe based on actual data (in units of tabs rather than entire articles). That is the only safe and sure way to dedupe. Any other method would either be incomplete (i.e. leave identical copies everywhere) or unsafe (i.e. delete content that is not identical). In any case, half a million duplicates for a medium-sized wikipedia is normal for this particular dump set.
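A minimal sketch of that kind of content-level dedupe, assuming the article has already been split into (tab_name, visible_text) pairs (the splitting itself is not shown); only byte-identical tab text is dropped, so tabs whose content differs even slightly are all kept:

```python
import hashlib

def dedupe_tabs(tabs):
    """Drop tabs whose visible text exactly matches a tab already kept.

    `tabs` is an assumed list of (tab_name, visible_text) pairs from one article.
    """
    seen = set()
    kept = []
    for name, text in tabs:
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of an earlier tab
        seen.add(digest)
        kept.append((name, text))
    return kept
```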
Yes, because name = url for each mdx entry we make; this is an mdx requirement. But duplicate content is everywhere inside this mdx entry, as you saw in the duplicated tab names (they are kept because the tab contents differ slightly from each other).
Thank you so much for your elaboration!
No problem! I went through this mess myself.
I guess the “single threaded database” you mentioned is sqlite3. I wonder why you don't use cx_Oracle (or its new python-oracledb).
Because it comes built in with Python and is ready to run without any other dependencies. It's actually far more than enough for everything but wiki mdx.
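For completeness, here is the same latest-revision-per-url grouping done through the built-in sqlite3 module instead of an in-memory dict, so the whole dump never has to fit in RAM. Still only a sketch under the same assumed url/date_modified fields; the UPSERT syntax needs SQLite 3.24 or newer, which recent Python builds bundle:

```python
import json
import sqlite3

def load_latest_into_sqlite(ndjson_path, db_path="entries.db"):
    """Stream the dump into sqlite3, keeping only the newest revision per url."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS entries ("
        " url TEXT PRIMARY KEY, date_modified TEXT, raw TEXT)"
    )
    with open(ndjson_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            entry = json.loads(line)
            url = entry.get("url")
            if not url:
                continue
            # UPSERT: replace the stored row only when this revision is newer
            con.execute(
                "INSERT INTO entries (url, date_modified, raw) VALUES (?, ?, ?) "
                "ON CONFLICT(url) DO UPDATE SET "
                " date_modified = excluded.date_modified, raw = excluded.raw "
                "WHERE excluded.date_modified > entries.date_modified",
                (url, entry.get("date_modified", ""), line),
            )
    con.commit()
    con.close()
```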
Hello, teacher m!
I have been waiting for your Wiki series. I am not sure which of the English and Chinese Wikipedia editions is the complete version and available for download. Is the English Wiktionary the complete version, and can it be downloaded?
The Wikipedia editions are all complete (the metadata has all been processed).
I don't recommend the en dictionary (the data is too poor).
I recently made the 20231220 build, but I can't find where I put it; it was probably uploaded to Baidu (see the first post).
Is this the complete Chinese version (if it is Chinese, that's OK)? Then “EN20231020” still doesn't have 1.mdd, 2.mdd… I really want the English version. Following your advice, I will skip the English dictionary.
Thank you!