Updated? Which one is the new version? I checked the cloud drive, and there doesn't seem to be an update for the Wiktionary.
My mistake. The October 20 Wikipedia has been uploaded. The Wiktionary hasn't yet, because the October 20 data is less than 1/20 complete.
@meandmyhomies I tried using the ijson library to stream the file enwiktionary_namespace_0_1.ndjson as follows:
from bs4 import BeautifulSoup
import ijson

folder_dir = "C:\\Users\\Akira\\Documents\\enwiktionary_namespace_0_1.ndjson"

# Open the JSON file
with open(folder_dir, 'r') as file:
    # Parse the JSON objects one by one
    parser = ijson.items(file, 'item', multiple_values=True)
    n = 0
    # Iterate over the JSON objects
    for item in parser:
        # Process each JSON object as needed
        print('Hello Python!')
        n = n + 1
        if n > 10:
            break
However, Python prints nothing, i.e., there is no 'Hello Python!' output. Could you help me with this issue?
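A likely cause, assuming the file really is newline-delimited JSON: ijson's 'item' prefix only matches elements of a top-level JSON array, so it yields nothing for NDJSON. With an empty prefix plus multiple_values=True, ijson streams each top-level object; a minimal sketch:

import ijson

folder_dir = "C:\\Users\\Akira\\Documents\\enwiktionary_namespace_0_1.ndjson"

# NDJSON is a sequence of top-level JSON objects, not a JSON array,
# so use the empty prefix '' together with multiple_values=True
with open(folder_dir, 'rb') as file:  # ijson is happiest with a binary stream
    for n, item in enumerate(ijson.items(file, '', multiple_values=True), start=1):
        print(item["name"])  # each item is the dict parsed from one line
        if n >= 10:
            break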
try this:
import json

with open(ndjsonfilename, 'r', encoding='utf-8') as f, open(mdxfilename, 'w', encoding='utf-8') as g:
    for line in f:
        line = line.strip()
        oJson = json.loads(line)
        # Headword line: name|identifier|url
        hwd = '|'.join([oJson["name"], str(oJson["identifier"]), oJson["url"]])
        g.write(hwd + '\n')
        # Flatten the article HTML onto a single line
        data = ''.join([x.strip() for x in oJson["article_body"]['html'].split('\n')])
        g.write(data)
        del oJson["article_body"]
        # Entry separator in MDX source format
        g.write('\n</>\n')
My laptop takes only 21.3s to run your code. I think the code is sequential (i.e., it processes line by line) and thus I’m surprised by this speed. May I ask how long it takes your laptop to run?
You have a very speedy laptop. It takes mine at least 30 seconds to process. I would expect the CPU not to be the bottleneck (it's single-threaded here), and your timing is within expectation.
My laptop has only 32GB RAM. The good speed I obtain may be due to the high-speed RAM.
How long does it take your PC to process the whole English Wiktionary? The above conversion (from ndjson to txt) is very fast, so I guess the HTML processing (by bs4) may be the real bottleneck… I'm very interested in optimizing the conversion.
Relatively speaking, this step takes almost no time; the total doesn't exceed half an hour for 700GB of JSON data. It's the subsequent processing that takes extremely long, often 20-30 hours for every 300GB of converted data.
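As a quick sanity check on whether the bs4 parsing itself is the slow part, one could time a small sample of the dump with different parsers. This is only a sketch: it assumes every line carries the article_body/html field used above and that the lxml package is installed.

import json
import time
from bs4 import BeautifulSoup

def time_parser(htmls, parser):
    # Parse every sampled document with the given parser and return elapsed seconds
    start = time.perf_counter()
    for h in htmls:
        BeautifulSoup(h, parser).get_text()
    return time.perf_counter() - start

# Take a small sample of article HTML from the ndjson dump
htmls = []
with open("enwiktionary_namespace_0_1.ndjson", "r", encoding="utf-8") as f:
    for i, line in enumerate(f):
        if i >= 1000:
            break
        htmls.append(json.loads(line)["article_body"]["html"])

print("html.parser:", time_parser(htmls, "html.parser"))
print("lxml:       ", time_parser(htmls, "lxml"))  # lxml is usually considerably faster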
If you don't mind, please share with me the function you created to process the HTML. I'd very much like to optimize it.
It is splashed across several files and cannot even be run without a lot of tweaking, documentation, and refactoring. The thing is clearly limited by the single-threaded database queries. Fortunately the queries do not have a memory limit, so unless I change my DB to an industrial-strength one such as Oracle, DB2, or SQL Server, there is no way to make it go faster (my SSD is probably the next bottleneck).
For example, adding an index after the insert took 3317.02 seconds on 1/3 of the English Wikipedia data.
During that step I see my SSD running at a sustained 500MB-1GB per second for much of the time. This cannot be sped up much, IMO.
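The thread does not say which database engine is used; as an illustration only, here is a minimal SQLite sketch of the same pattern (bulk insert first, build the index afterwards), with a hypothetical wiki.db file and articles table:

import json
import sqlite3

ndjsonfilename = "enwiktionary_namespace_0_1.ndjson"

con = sqlite3.connect("wiki.db")  # hypothetical database file
con.execute("CREATE TABLE IF NOT EXISTS articles (identifier INTEGER, name TEXT, html TEXT)")

# Bulk insert with no index in place, which keeps the inserts fast
with open(ndjsonfilename, 'r', encoding='utf-8') as f:
    rows = (
        (o["identifier"], o["name"], o["article_body"]["html"])
        for o in map(json.loads, f)
    )
    con.executemany("INSERT INTO articles VALUES (?, ?, ?)", rows)
con.commit()

# Create the index only after all rows are in; this is the step that
# took ~3317 s on 1/3 of the English Wikipedia data in the post above
con.execute("CREATE INDEX IF NOT EXISTS idx_identifier ON articles(identifier)")
con.commit()
con.close()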
Basically I am condensing 700GB of JSON data into 45GB of mdx. I found the original JSON dump contains 500K or so (correction: 2 million) duplicate records, which take up a lot of space too.
Could you explain more about "single-threaded database queries"? I thought we could do multiprocessing.
On the other hand, I have recently found the package pandas_streaming, which streams data. My code is:
import pandas as pd
from bs4 import BeautifulSoup
from pandas_streaming.df import StreamingDataFrame
from multiprocessing import dummy

thread = 30
P = dummy.Pool(processes=thread)  # thread pool (multiprocessing.dummy uses threads)

file_dir = "C:\\Users\\Akira\\Documents\\enwiktionary_1.ndjson"
sdf = StreamingDataFrame.read_json(file_dir, lines=True, chunksize=10000, encoding='utf-8')

for df in sdf:
    # process_html is my own bs4-based HTML-processing function;
    # map it over the article_body column of the current chunk
    res = P.map(process_html, df['article_body'])
In the above code, we only keep one chunk in memory at a time, so we will not run out of memory.
Processing HTML sequentially or in parallel does not consume too much CPU or memory if done in chunks. It's the global dedupe, and anything else involving all the data at the same time, that takes forever. If I did nothing global, the 700GB of JSON could be turned into mdx in 3-4 hours.
Voilà! Your explanation makes a lot of sense. I have never done any operation involving all the data. Could you mention some such global operations so that I can think about them?
This is for a smaller mdx with around 100GB of data: I found 489K duplicate records in the original data (German Wikipedia). Try running that single-threaded in a database.
Do you mean it is not possible to find duplicate entries with multiprocessing?
Not in ways which are faster than the database. A database is built for such tasks without blowing up the memory or disk.
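Continuing the SQLite sketch from above (same assumed wiki.db file and articles table), a database-side dedupe could look like the following; whether duplicates are defined by identifier or by full content depends on the actual pipeline:

import sqlite3

con = sqlite3.connect("wiki.db")

# Count how many identifiers occur more than once
dupes = con.execute(
    "SELECT identifier, COUNT(*) AS n FROM articles GROUP BY identifier HAVING n > 1"
).fetchall()
print(len(dupes), "identifiers occur more than once")

# Keep one row per identifier and drop the rest
con.execute(
    "DELETE FROM articles WHERE rowid NOT IN "
    "(SELECT MIN(rowid) FROM articles GROUP BY identifier)"
)
con.commit()
con.close()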
May I ask what a "duplicate record" actually means? Is it two similar entries/lines/HTML documents in an ndjson file such as enwiktionary_namespace_0_1.ndjson?
I was shocked at how many records are identical to each other. Half a million is a lot, almost more entries than the OED, the largest dictionary on earth. I dedupe across all the JSON files; sometimes there are 350 of them in one mdx.