Vem är Vem? is a Swedish biographical encyclopedia that was published in two editions of five volumes each, in 1945-1950 and 1962-1968, by Bokförlaget Vem är Vem.
According to the publishers, the intention was to draw attention to people at the height of their activities, including younger people, who held influential or otherwise notable positions in different fields.
Biographies and career trajectories of ~ 75,000 individuals!
8 of the 10 volumes have been digitized by librarians in Uppsala. Thank you <3
Vem är Vem? Volumes
Vem är Vem? Example page
Vem är Karl Lund?
Lund, Karl Gustaf, chief engineer, born on July 22, 1893, in Hille, Gävleborg County, Sweden, son of clerk Ferdinand L. and Maria Andersson. Married in 1936 to Sigrid Johansson. Children: Ingvar (born 1938), Lennart (born 1942). – Graduated from Bergsskolan in Filipstad in 1917, specialized studies at the Royal Institute of Technology (KTH) from 1920 to 1922, studied at the Institute of Metallurgy and Stockholm University in 1921-1922. Chemist at Strömsnäs Järnverks A-B in Degerfors from 1918 to 1920, metallurgist and chemist at Westinghouse Electric & Manufacturing Co. in East Pittsburgh, PA, USA, from 1923 to 1926 and 1928 to 1929, chief metallurgist at Laclede Steel Co. in Alton, Illinois, USA, in 1927, furnace and steelworks engineer at A-B Iggesunds Bruk from 1929 to 1931, site manager at Gunnebo Bruks Nya A-B, Varbergsverket, since 1931. Member of the municipal executive committee, deputy chairman of the economic department, deputy member of the board of the power plant, chairman of Varbergs Sparbank, employer representative in the district council of the county labor board, member of the board of Varbergs Luftskyddsförening (Varberg Air Protection Association), secretary of Varbergs Högerförening (Varberg Conservative Association), chairman of the railway sick fund, and Plant Society for Small Bird Friends.
Vem är Karl Lund?
Traveled to Germany in 1921, 1923, 1930, and 1936, Denmark, Czechoslovakia in 1921, 1922, and 1923, Austria in 1921, and the USA from 1923 to 1929. Writings: “Some fundamental factors for obtaining sharp thermal curves” (Transactions of the American Society for Steel Treating, co-authored with C. Benedicks and W. H. Dearden, 1925), “Contemporary production of saw blades, circular saws, and machine knives” (Timber Industry, 1931). Hobbies: hunting and fishing.
How can we structure the data?
NLP challenges
Many abbreviations and contractions
DOB: “f\.\s*(\d{1,2})/(\d{1,2})/(\d{2})”
Gävleborg County: “Gävleb. l.”
Similar structure for each entry but not exactly the same information in the same order
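The two challenges above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's actual code: the function names, the abbreviation table and the century rule (a two-digit year is read as the most recent year before an assumed publication year of 1945) are all assumptions made for the example.

```python
import re
from datetime import date

# Illustrative abbreviation table; both entries appear in these slides.
ABBREVIATIONS = {
    "Gävleb. l.": "Gävleborgs län",
    "Skarab. l.": "Skaraborgs län",
}

# "f." (born) followed by day/month/two-digit year, e.g. "f. 22/7/93".
DOB_PATTERN = re.compile(r"\bf\.\s*(\d{1,2})/(\d{1,2})/(\d{2})")


def parse_birth_date(text, publication_year=1945):
    """Return the first birth date found in text, or None if there is none."""
    match = DOB_PATTERN.search(text)
    if match is None:
        return None
    day, month, short_year = (int(group) for group in match.groups())
    # Assumption: nobody in the book was born in or after the publication year.
    year = 1900 + short_year
    if year >= publication_year:
        year -= 100
    return date(year, month, day)


def expand_abbreviations(text):
    """Replace each known abbreviation with its full form."""
    for short, full in ABBREVIATIONS.items():
        text = text.replace(short, full)
    return text
```

A fixed pattern like this only covers entries that follow the expected order, which is why the varying structure of the records is the harder problem.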
Data pipeline
| Step | Process |
|------|---------|
| 1 | Scrape book data from website |
| 2 | Split records on each page of a book |
| 3 | Structure biographies using schema |
| 4 | Classify record into group |
| 5 | Store data for analysis |
20,000 biographies in 9 volumes of Vem är Vem? and 2,400 firms in Svensk Industrikalender in a machine-readable format for ~ 80 USD.
Output
{
  "original": "Lund, Karl Gustaf, överingenjör, Varberg, f. 22/7/93 i Hille, Gävleb. 1., av brukstj :m. Ferdinand L. o. Maria Andersson. G. 36 m. Sigrid Johansson. Barn: Ingvar f. 38, Lennart 42. – Ex. v. bergssk. i Filipstad 17, spec:-stud. v. KTH (B) 20-22, stud. v. me-tallogr. inst. o. Sthlms högsk. 21-22. Kemist v. Strömsnäs Järnverks A-B, Degerfors, 18-20, metallurg o. kemist v. Westinghouse Electric & Manuf. Co., East Pittsburgh, Pa, USA, 23-26 o. 28-29, chefsmetallurg v. Laclede Steel^Co., Ahon, 111., USA, 27, hytt-o. stålv.ing. v. A-B Iggesunds Bruk 29-31, platschef v. Gunnebo Bruks Nya A-B, Varbergsverket, sed. 31. Led. av drätselkamm., v. ordf. v. ekonomi-avd., suppl. i styr. f. elverket, huv:-man i Varbergs Sparbank, arb :giv. repr. i länsarb:ndns kretsråd, led. av styr. f. Varbergs luftsk ifören., sekr. i Varbergs högerfören., ordfs i järnv. sjukkassa o. Plant :sällsk. Småfågl. Vänner. Res. t. Tyskl. 21, 2^ 23, 30 o. 36, Danm., Tjeckoslov. 21, 22, 23, Österr. 21, USA 23-29. Skr.: Some fundamental factors for obtaining sharp thermal curves (Träns. Am. Soc. for Steel Treating, tills. m. C. Benedicks o. W. H. Dearden 25), Nutida fabrikation av sågblad, sågklingor o. maskinknivar (Trävaruind. 31). Hob-bies: jakt o. fiske.",
  "translated": "Lund, Karl Gustaf, chief engineer, Varberg, born on July 22, 1893, in Hille, Gävle County, Sweden, son of factory worker Ferdinand L. and Maria Andersson. Married in 1936 to Sigrid Johansson. Children: Ingvar (born 1938), Lennart (born 1942). – Graduated from Bergsskolan in Filipstad in 1917, specialized studies at the Royal Institute of Technology (KTH) from 1920 to 1922, studied at the Institute of Metallurgy and Stockholm University in 1921-1922. Chemist at Strömsnäs Järnverks A-B in Degerfors from 1918 to 1920.",
  "structured": {
    "@context": "https://schema.org",
    "@type": "Person",
    "name": "Karl Gustaf Lund",
    "birthDate": "1893-07-22",
    "birthPlace": {"name": "Hille, Gävle County, Sweden", "latitude": 59.916667, "longitude": 15.0},
    "jobTitle": "Chief Engineer",
    "memberOf": {"@type": "Organization", "name": "Swedish Technical Association"},
    "children": [
      {"@type": "Person", "name": "Anita", "birthDate": "1937"},
      {"@type": "Person", "name": "Peter", "birthDate": "1942"}
    ],
    "spouse": {"@type": "Person", "name": "Marianne Hammarberg", "marriageDate": "1936"},
    "education": [
      {"@type": "EducationalOccupationalCredential", "credentialCategory": "Vocational", "issuer": {"@type": "Organization", "name": "Technical Gymnasium in Örebro"}, "endDate": "1928"},
      {"@type": "EducationalOccupationalCredential", "credentialCategory": "Degree", "issuer": {"@type": "Organization", "name": "KTH"}, "endDate": "1934"}
    ],
    "worksFor": {"@type": "Organization", "name": "Telegraph station in Norrköping", "startDate": "1946"},
    "hasOccupation": [
      {"@type": "Occupation", "name": "Engineering assistant", "employer": {"@type": "Organization", "name": "Telegraph station in Nässjö"}, "startDate": "1935", "endDate": "1935"},
      {"@type": "Occupation", "name": "Line engineer", "employer": {"@type": "Organization", "name": "Telegraph station in Norrköping"}, "startDate": "1946"}
    ]
  }
}
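Since the structured record is generated by a language model, one possible sanity check is to confirm that every extracted name actually occurs in the original entry. The sketch below is an assumption about how such a check could look, not part of the pipeline described here; `ungrounded_names` is a hypothetical helper.

```python
def ungrounded_names(original_text, names):
    """Return the names with at least one part missing from the original text."""
    lowered = original_text.lower()
    missing = []
    for name in names:
        parts = name.lower().split()
        # A name counts as grounded only if every part occurs in the source.
        if not all(part in lowered for part in parts):
            missing.append(name)
    return missing


entry = "G. 36 m. Sigrid Johansson. Barn: Ingvar f. 38, Lennart 42."
print(ungrounded_names(entry, ["Sigrid Johansson", "Ingvar", "Lennart"]))
```

Records with a non-empty result could be flagged for manual review before the data is used for analysis.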
How can we classify occupations according to a schema?
graph LR
A[Queried Job Title: Civilingenjör, teknisk fysik Code: 2142] --> B[Text Embeddings]
B --> C[Classifier Algorithm]
C --> D[Similarity Ranking]
D --> E1[1. 2142 - Civilingenjörsyrken inom bygg och anläggning]
D --> E2[2. 8212 - Montörer, elektrisk och elektronisk utrustning]
D --> E3[3. 7215 - Stålkonstruktionsmontörer och grovplåtslagare]
D --> E4[4. 7319 - Musikinstrumentmakare och övriga konsthantverkare]
D --> E5[5. 1212 - Ekonomi- och finanschefer nivå 2]
style E1 fill:#ffff
style A fill:#ffff
Two-part classification process for engineers
Makes use of occupational title, e.g. “Civilingenjör, teknisk fysik”, and workplace information, e.g. “Strömsnäs Järnverks A-B”, to classify individuals into groups.
Assign HISCOs: codes projected from 1536 dimensions to 2 with UMAP
Results: What are the most common occupations?
Where were the individuals from?
Where do they come from? Norway edition
Are engineers more likely than others to be immigrants or to come from very rural areas?
Just 3.6% of engineers were born outside of Sweden, not significantly different from other occupations.
Among the Swedish-born, priests come from places with low population density.
No statistically significant differences in population density between the birthplaces of doctors, general managers, dentists, and engineers.
Engineers are born in parishes with early access to electricity
Where did engineers study?
Did engineers get advanced degrees?
Engineers in electrical appliances and machinery move the furthest from birthplace to study
Engineers in electrical appliances and machinery live further from their birthplace than other occupations
Engineers live further from their birthplace than other occupations
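The slides do not say how these distances were computed. One common choice, shown here purely as an assumption, is the great-circle (haversine) distance between the coordinates of the birthplace and those of the later place of study or residence.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
```

Applied per person, this gives one distance from birthplace to place of study and one from birthplace to residence, which can then be compared across occupational groups.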
Conclusion
Bifurcated labour market in Sweden; middle-skill workers stay put, high-skill workers move to opportunity both for education and work.
What to do next
Intergenerational mobility: did engineers' parents have similar jobs to doctors, dentists, businessmen?
Appendix
How to get a text embedding?
import openai  # legacy interface (openai<1.0), where openai.Embedding.create is available

# Define a function to get the text embedding
def get_embedding(text, model="text-embedding-ada-002"):
    text = text.replace("\n", " ")
    return openai.Embedding.create(input=[text], model=model)['data'][0]['embedding']

# Apply the get_embedding function to your 'hisco_text' column (df is a pandas DataFrame)
df['ada_embedding'] = df['hisco_text'].apply(lambda x: get_embedding(x, model='text-embedding-ada-002'))
Assign HISCOs: example
026 Metallurgists (HISCO description)
Workers in this unit group advise on metallurgical problems and develop and control processes for the extraction of metals from their ores, study the properties and engineering characteristics of metals and develop new alloys, and develop and supervise metal manufacturing processes for making finished metal products.
Karl Lund in 1923 (Input)
Metallurgist and Chemist at Westinghouse Electric & Manufacturing Co.
Assign HISCOs: codes projected from 1536 dimensions to 2 with UMAP
Assign HISCOs: Watchmaker Apprentice
Closest HISCO codes to Watchmaker Apprentice
| Rank | HISCO code | Occupational description |
|------|------------|--------------------------|
| 1 | 842 | Watch, Clock and Precision Instrument Makers |
| 2 | 830 | Blacksmiths, Toolmakers and Machine-Tool Operators |
| 3 | 810 | Woodworkers |
| 4 | 811 | Cabinetmakers |
| 5 | 839 | Blacksmiths, Toolmakers and Machine-Tool Operators |
Assign HISCOs: Shipyard Assistant Engineer
Closest HISCO codes to “Shipyard Assistant Engineer”

| Rank | HISCO code | Occupational description |
|------|------------|--------------------------|
| 1 | 043 | Ships' Engineers |
| 2 | 024 | Mechanical Engineer |
| 3 | 982 | Ships' Engine-Room Workers |
| 4 | 042 | Ships' Deck Officers and Pilots |
| 5 | 959 | Construction Workers |
Keep titles in Swedish?
Instead of using the HISCO schema in English, I am using the SSYK classification from 2012.
Instead of using the multilingual OpenAI ADA model, I now use the National Library of Sweden's Swedish BERT model. It is trained on 200M sentences (3,000M tokens) from books, news, government publications, Swedish Wikipedia and internet forums.
The input comes from Job Tech Data, so we have ground-truth labels that they have assigned.
SSYK and Swedish BERT examples:
Closest SSYK codes to “Ambulanssjukvårdare” (Code: 5326)

| Rank | Closest match (code and title) |
|------|--------------------------------|
| 1 | 5326 - Ambulanssjukvårdare |
| 2 | 2226 - Ambulanssjuksköterskor m.fl. |
| 3 | 2227 - Geriatriksjuksköterskor |
| 4 | 5112 - Tågvärdar och ombordansvariga m.fl. |
| 5 | 2231 - Operationssjuksköterskor |
SSYK and Swedish BERT examples:
Closest SSYK codes to “Processingenjör, kemiteknik” (Code: 2512)

| Rank | Closest match (code and title) |
|------|--------------------------------|
| 1 | 3115 - Ingenjörer och tekniker inom kemi och kemiteknik |
| 2 | 2145 - Civilingenjörsyrken inom kemi och kemiteknik |
| 3 | 3113 - Ingenjörer och tekniker inom elektroteknik |
| 4 | 7321 - Prepresstekniker |
| 5 | 2163 - Planeringsarkitekter m.fl. |
Can we score how successful this is?
We can use a crosswalk from one schema to another as a test set, e.g. O*NET to ISCO.
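One way to turn a labelled test set into a score is top-k accuracy: the share of titles whose known code appears among the first k ranked suggestions. The sketch below is an assumption about how the scoring could be done. The two SSYK examples from the slides stand in for a real crosswalk, and `top_k_accuracy` is a hypothetical helper.

```python
def top_k_accuracy(ranked_predictions, gold_codes, k=5):
    """Share of titles whose gold code is among the first k ranked codes."""
    if not gold_codes:
        return 0.0
    hits = sum(
        1
        for title, gold in gold_codes.items()
        if gold in ranked_predictions.get(title, [])[:k]
    )
    return hits / len(gold_codes)


# Ranked codes and labelled codes copied from the two SSYK example tables.
ranked = {
    "Ambulanssjukvårdare": ["5326", "2226", "2227", "5112", "2231"],
    "Processingenjör, kemiteknik": ["3115", "2145", "3113", "7321", "2163"],
}
gold = {
    "Ambulanssjukvårdare": "5326",
    "Processingenjör, kemiteknik": "2512",
}
print(top_k_accuracy(ranked, gold, k=1), top_k_accuracy(ranked, gold, k=5))
```

Reporting the score for several values of k shows whether the correct code is usually ranked first or merely somewhere near the top.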
Vem är Karl Lund?
Lund, Karl Gustaf, chief engineer, born on July 22, 1893, in Hille, Gävleborgs län, Sweden, son of clerk Ferdinand L. and Maria Andersson. Married in 1936 to Sigrid Johansson. Children: Ingvar (born 1938), Lennart (born 1942). – Graduated from Bergsskolan in Filipstad in 1917, specialized studies at the Royal Institute of Technology (KTH) from 1920 to 1922, studied at the Institute of Metallurgy and Stockholm University in 1921-1922. Chemist at Strömsnäs Järnverks A-B in Degerfors from 1918 to 1920, metallurgist and chemist at Westinghouse Electric & Manufacturing Co. in East Pittsburgh, PA, USA, from 1923 to 1926 and 1928 to 1929, chief metallurgist at Laclede Steel Co. in Alton, Illinois, USA, in 1927, furnace and steelworks engineer at A-B Iggesunds Bruk from 1929 to 1931, site manager at Gunnebo Bruks Nya A-B, Varbergsverket, since 1931. Member of the municipal executive committee, deputy chairman of the economic department, deputy member of the board of the power plant, chairman of Varbergs Sparbank, employer representative in the district council of the county labor board, member of the board of Varbergs Luftskyddsförening (Varberg Air Protection Association), secretary of Varbergs Högerförening (Varberg Conservative Association), chairman of the railway sick fund, and Plant Society for Small Bird Friends.
Structure using OpenAI API
Using OpenAI's GPT-3.5 chat API:
We provide the system a schema to structure the record in, e.g. keys.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def structure_biography_info(page_text):
    # `schema` is assumed to be defined at module level before this is called
    try:
        # Create a prompt for the system
        structure_prompt = f"Task: read the schema and return RFC compliant JSON information about the Swedish individuals from the 1950 biographical dictionary 'Vem är Vem' that is provided below. Use a numeric index for each biography in your JSON output and return information about all of them, including all career information available. Keep the biographic descriptions in Swedish and remove any abbreviations based on your knowledge, e.g. 'fil. kand.' is 'filosofie kandidat', and 'Skarab. l.' is 'Skaraborgs Län'. Put years in full based on context. Put dates in dd/mm/yyyy format where possible. If there is no information for a key, leave it out. If there is no information for a required key, put NULL as the value.\nHere is the schema: {schema}.\nHere is the text: {page_text}. Go!"
        structure_response = client.chat.completions.create(
            model="gpt-3.5-turbo-1106",
            response_format={"type": "json_object"},
            messages=[
                {"role": "system", "content": "You are an expert on Swedish biographies."},
                {"role": "user", "content": f"{structure_prompt}"},
            ],
        )
        structured_biography_info = json.loads(
            structure_response.choices[0].message.content
        )
        return structured_biography_info
    except Exception as e:
        print(f"Error in structure_biography_info: {e}")
        return None
Add coordinates to firms and individuals
Birthplaces of individuals who studied at the Royal Institute of Technology (KTH) in Who is Who in Industry and Business
Fuzzy string matching doesn't capture semantic similarity between words
“chauffeur” is close to “car driver” semantically, but not close in text.
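The gap between textual and semantic closeness can be seen with a character-level similarity score. This sketch uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for fuzzy matching. It is an illustration only, and `string_similarity` is a hypothetical helper, not a function from the project.

```python
from difflib import SequenceMatcher


def string_similarity(a, b):
    """Character-level similarity between 0 and 1 (difflib's ratio)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Same meaning, few shared characters: scores low.
print(string_similarity("chauffeur", "car driver"))
# Nearly identical spelling: scores high.
print(string_similarity("chauffeur", "chauffeurs"))
```

A matcher of this kind handles spelling variants well, but gives a low score to two titles that mean the same thing and share few characters.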
Many possible groups make classification challenging for traditional ML models
HISCO schema has 1,600 unit groups.
Building one labelled classification scheme manually does not generalize easily
If you code all of your occupations to HISCO and then want to move to SSYK, you have to redo all of the codings or try and find a crosswalk.
I spent months doing this last year to classify occupations to HISCLASS groups, hand labelling data and optimizing a support vector machine model.
It did not work
Why is classification easier now?
We make use of pre-trained large language models
Benefit from semantic similarity
Text embeddings mean we can have an arbitrary number of classes
Switch out classification target schema easily!
What is an embedding?
Use embeddings to cluster skills
Figure 1: Scatter plot showing the relative similarity of occupational titles in the HISCO schema
Use clusters to classify individuals into the groups
Figure 2: Most common HISCO occupations in Vem är Vem
Use clusters to classify new skills into the groups
graph TB
A[Collect HISCO Codes<br>and Descriptions] -->|Use OpenAI API| B[Convert HISCO Titles<br>and Descriptions to Vectors]
C[Receive Occupational<br>Strings] -->|Use OpenAI API| D[Convert Occupational<br>Strings to Vectors]
B --> E[Compare Vectors in<br>Vector Space]
D --> E
E --> F[Assign HISCO Code to<br>Occupational String<br>using Cosine Distance]
style A fill:#2B8CBE
style B fill:#be5d2b
style C fill:#2B8CBE
style D fill:#be5d2b
style E fill:#6f6fbf
style F fill:#F4FA58
Here the HISCO descriptions are in English, as are the occupational titles from my Swedish biographies, which were translated first.
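The comparison step in the flow above can be sketched with plain Python. The three-dimensional vectors are toy values standing in for real 1536-dimensional embeddings, and `cosine_similarity` and `rank_codes` are hypothetical names. Only the HISCO codes themselves (026, 043, 842) come from the examples in these slides.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_codes(query_vector, code_vectors, top_n=5):
    """Return (code, similarity) pairs, most similar first."""
    scored = [
        (code, cosine_similarity(query_vector, vector))
        for code, vector in code_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Toy vectors (made up for illustration) keyed by HISCO code.
toy_codes = {
    "026": [0.9, 0.1, 0.0],  # Metallurgists
    "842": [0.0, 0.2, 0.9],  # Watch, Clock and Precision Instrument Makers
    "043": [0.5, 0.5, 0.1],  # Ships' Engineers
}
query = [0.8, 0.2, 0.1]  # stand-in for the embedding of an occupational string
print(rank_codes(query, toy_codes, top_n=3))
```

Because the target schema is just a dictionary of code vectors, moving from HISCO to SSYK only means embedding a different set of descriptions, with no change to the ranking code.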
Cluster firms by location
Figure 3: Map of the geographic clusters of businesses by most common business type